Normally I wouldn't have written this, but I was compelled by a post I saw on someone else's blog about the importance of writing down what we learn, even if it's not very good. I'm just that gullible I suppose.
I've been using FFmpeg as a component of a real-time video processing program for a while now. I am NOT an expert in FFmpeg, or even very well versed in the project as a whole, but I do have a working piece of code that uses it. So there's that. The situation is better now, but when I was first writing my media player I pieced it together from old outdated example code, half-finished gists, and fucking doxygen. So here's a gift from one internet stranger to another: just more of the same.
Video files have a number of different types of data that need to be presented to the user at the same time: the video frames, the audio, and sometimes embedded subtitles. In fact, video files often have multiple different audio streams in them (one for each language), and multiple subtitle streams (normal subtitles, colored subtitles, hearing impaired). The user is in charge of choosing the streams they want to come out of their screens and speakers.
The obvious way to do this is to just have all of the video frames listed sequentially in the video file, followed by all of the audio samples listed sequentially, on and on for all of the relevant data streams. Then, you would just maintain three or so pointer heads into the file, incrementing them as needed, and do whatever it is you need.
But no, apparently that's just a little too easy for some people. What actually happens is multiplexing, where the different streams are broken into small packets and interleaved throughout the file. The idea is that you would be able to maintain just a single pointer head into the file and just read each packet one by one and do your thing. We'll come back to that later.
In all honesty, this decision isn't all that bad. It makes some coding in the future more difficult and error-prone, but it also means that streaming video, or playing video files that aren't finished being written yet and whose total length isn't yet known, starts to look a lot like playing normal video files. This viewpoint is emphasized in the latest video file formats. But I digress.
Another important point is that the video and audio data streams are, in 100% of situations, going to be compressed. There are many different compression and decompression algorithms, called "codecs", and they should all be considered a black box. Modern video compression is quite complex. The spec for the vp8 codec, while short, was famously difficult to write a decompressor from. The new fancier version, vp9, doesn't even have an official spec; 9 years later it's still just a draft! The other major video codecs like HEVC or AV1 aren't much better, they keep shoving more stuff into the bitstream. Audio codecs are usually much simpler but are still, in my opinion, not worth implementing if your final goal is just to read and write video files. Just don't roll your own.
So a video file is a bunch of different data streams, broken into packets and interleaved. Okay, so we know how to deal with the contents of the packets: we feed them into the codec and out comes the real data we're trying to read and write. So what about the outer file structure? What does the file format for the stream and packet metadata look like?
Well, just like there are many different codecs, there are many different so-called "container" formats. The job of these container formats is to define all of the data streams, stream metadata, and the packet metadata that comes before the actual packet bytes (which then get immediately passed to a codec, remember). These, I think, are instructive to try to code yourself. Go ahead, write a matroska parser. It's instructive to see the ways that a file format used by billions of people (.mkv and .webm) isn't all that different from something a junior programmer would write. And then you should go back to using FFmpeg.
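If you want a taste of what that looks like, here's a minimal sketch (mine, not anything from FFmpeg or the matroska project) of the EBML variable-length integer that the whole .mkv/.webm format is built out of. The count of leading zero bits in the first byte tells you how many bytes the number occupies; for sizes the marker bit gets stripped, while element IDs conventionally keep it.

// A rough sketch of reading one EBML vint. Not production code, and not how
// FFmpeg's own matroska demuxer does it; just enough to show the shape.
#include <stdint.h>
#include <stddef.h>

// Returns the number of bytes consumed (0 on error) and writes the decoded
// value to *out.
static int read_ebml_vint(const uint8_t *buf, size_t len, uint64_t *out) {
    if (len == 0 || buf[0] == 0)
        return 0; // a first byte of 0 would mean a width > 8, punt on that
    int width = 1;
    uint8_t marker = 0x80;
    while (!(buf[0] & marker)) { // each leading zero bit means one extra byte
        ++width;
        marker >>= 1;
    }
    if ((size_t)width > len)
        return 0;
    uint64_t value = buf[0] & (uint8_t)(marker - 1); // strip the marker bit
    for (int i = 1; i < width; ++i)
        value = (value << 8) | buf[i];
    *out = value;
    return width;
}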
So the situation so far:
--------------------------------------------------------
| Container                                            |
|   -----------------   --------------   --------      |
|  | Stream metadata | | Seeking info | | etc... |     |
|   -----------------   --------------   --------      |
|                                                      |
|   --------------   --------------   --------------   |
|  | Video packet | | Audio packet | | Audio packet |  |
|   --------------   --------------   --------------   |
|                                                      |
|   -----------------   --------------   --------      |
|  | Subtitle packet | | Video packet | | etc... |     |
|   -----------------   --------------   --------      |
--------------------------------------------------------
Using FFmpeg to read a video file (opening up the container, extracting the packets, and decompressing them) takes a little bit of work. Very soon I'm going to be showing you some very heavily annotated code, but at the risk of sounding repetitive I'm going to go over the major steps first.
First, we need to open the container. In FFmpeg, containers are called "formats", and the relevant struct here is AVFormatContext. You create one by calling avformat_alloc_context(). From here you call avformat_open_input(), passing in the filename of the file you want to use.
Then you go through a few hoops to load the stream data and the codecs for the relevant streams. An AVStream is a struct containing some very basic information about the stream contents. It's mostly used just to look up an AVCodec, a struct that contains a bunch of function pointers to specialized codec functionality. This is, in turn, mostly just used to generate an AVCodecContext, a struct that contains all of the actually useful information about a particular data stream, and is the object you'll be passing around to most functions to do any reading.
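As an aside, if you just want "the" video stream and "the" audio stream instead of looping over everything yourself, libavformat has av_find_best_stream(), which does the choosing for you. A small sketch, given the AVFormatContext from before (fmt_ctx here), with the caveat that the constness of the decoder argument has shifted between FFmpeg versions, so it may need a tweak for yours:

const AVCodec *dec = NULL;
// Ask for the "best" video stream and the decoder that goes with it.
int video_idx = av_find_best_stream(fmt_ctx, AVMEDIA_TYPE_VIDEO, -1, -1, &dec, 0);
if (video_idx >= 0) {
    AVStream *stream = fmt_ctx->streams[video_idx];
    AVCodecContext *codec_ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(codec_ctx, stream->codecpar);
    avcodec_open2(codec_ctx, dec, NULL);
    // ...and video_idx is the stream_index to match packets against later.
}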
Then it's time to read and decompress the packets. Reading a packet is pretty simple: av_read_frame().
FFmpeg has a synchronous message pump model to do decompression: you send packets to the decompressor, and whenever it has a complete segment of data (like a complete video frame or an ADPCM frame) it will give it back to you. The sending happens with avcodec_send_packet(), and you can get a complete segment of data, called a "frame" in FFmpeg parlance, by calling avcodec_receive_frame().
One last thing to note: certain data structures in FFmpeg are reference counted, so if you reuse them they must be unreferenced manually between uses.
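The big listing below omits error handling, so for reference, here's roughly what one properly checked send/receive cycle looks like. This is my sketch of the usual pattern, not code lifted from my player (codec_ctx, packet, and frame being whatever you have in hand at that point): AVERROR(EAGAIN) means the decoder wants more packets before it can give you anything, and AVERROR_EOF means it has been fully flushed.

int ret = avcodec_send_packet(codec_ctx, packet);
while (ret >= 0) {
    ret = avcodec_receive_frame(codec_ctx, frame);
    if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
        break; // not an error, just nothing more to hand out right now
    if (ret < 0)
        break; // an actual decode error
    // ... do something with frame ...
    av_frame_unref(frame);
}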
Here's some code, following the age-old tradition of omitting error handling. Implying that my code does do the appropriate error handling. Yup.
// Depending on what version of FFmpeg you're using, either the compiler will
// yell at you and try to make you feel bad for calling deprecated functions
// like this, or it's a required function for the later API calls to not crash.
// Unfortunately the amount of time between these two versions is only a couple
// years. With FFmpeg, you'll find this happens a lot. If you care about
// warnings and being able to compile your code with different library versions,
// #if guard this line.
av_register_all();
AVFormatContext *fmt_ctx = avformat_alloc_context();
avformat_open_input(&fmt_ctx, filename, NULL, NULL);
// Now even though you've called avformat_open_input(), the function whose
// sole job it is to open up the file and fill out the fields of the file
// format's context struct, some of the fields of the file format's context
// struct are not filled out. So we fill them out with
// avformat_find_stream_info(). And now you're set, all of the streams are
// available to you!
avformat_find_stream_info(fmt_ctx, NULL);
// Er, well, some of the stream metadata, that is. To find all of the metadata
// for a given data stream, you'll need to get the stream's `codec_id` and pass
// it to `avcodec_find_decoder()`, then get a codec context by passing the
// result to `avcodec_alloc_context3()`, and then, since that only filled out
// the codec context with defaults, you have to pass the result of *that* to
// avcodec_parameters_to_context()
AVCodecContext *dec_ctx[256];
for (int i = 0; i < fmt_ctx->nb_streams; ++i) {
    AVStream *stream = fmt_ctx->streams[i];
    AVCodec *dec = avcodec_find_decoder(stream->codecpar->codec_id);
    AVCodecContext *codec_ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(codec_ctx, stream->codecpar);
    // ??
    // I remember writing this line to fix a bug at some point, but I honestly
    // couldn't tell you what that bug was, or if it's even relevant anymore.
    codec_ctx->framerate = av_guess_frame_rate(fmt_ctx, stream, NULL);
    avcodec_open2(codec_ctx, dec, NULL);
    dec_ctx[i] = codec_ctx;
    // Different stream types are stored at stream->codecpar->codec_type:
    // AVMEDIA_TYPE_VIDEO, AVMEDIA_TYPE_AUDIO, or AVMEDIA_TYPE_SUBTITLE
    // stream->time_base is important, you multiply the packet timestamp by this
    // value to get the real packet timestamp
    // Video streams have
    // codec_ctx->width, codec_ctx->height, stream->avg_frame_rate
    // Audio streams have
    // codec_ctx->sample_rate, codec_ctx->channels
}
{
    AVPacket *packet = av_packet_alloc();
    while (1) {
        int err = av_read_frame(fmt_ctx, packet);
        if (err < 0) {
            // This happens on read error, and on EOF
            av_packet_unref(packet); // Unnecessary but whatever.
            break;
        }
        avcodec_send_packet(dec_ctx[packet->stream_index], packet);
        AVFrame *frame = av_frame_alloc();
        int response = 0;
        while (response >= 0) {
            response = avcodec_receive_frame(dec_ctx[packet->stream_index], frame);
            // Common video frame formats, accessed with frame->format, are
            // AV_PIX_FMT_YUV420P, AV_PIX_FMT_YUV420P10LE
            // Most audio frame formats are AV_SAMPLE_FMT_FLTP
            // More here: https://ffmpeg.org/doxygen/5.1/pixfmt_8h_source.html
            // Other good fields for video are
            // frame->data, frame->width, frame->height, frame->linesize
            // For audio, we have
            // frame->nb_samples, av_get_channel_layout_nb_channels(frame->channel_layout)
            av_frame_unref(frame);
        }
        av_frame_free(&frame);
        av_packet_unref(packet);
    }
    av_packet_free(&packet);
}
// Flushing the decompressor. A silly quirk of these kinds of message-pump
// models is that they tend to have epilogue loops like this. Technically
// though you don't have to write an epilogue loop: av_read_frame() and
// avcodec_send_packet() have intentionally staggered behavior that makes it
// possible to write it all as a single loop if you write your tests in the
// correct order. But personally I prefer this.
for (int i = 0; i < fmt_ctx->nb_streams; ++i) {
    // Sending NULL for the packet is the signal to the decoder to flush its
    // remaining unsent frames.
    avcodec_send_packet(dec_ctx[i], NULL);
    AVFrame *frame = av_frame_alloc();
    int response = 0;
    while (response >= 0) {
        response = avcodec_receive_frame(dec_ctx[i], frame);
        av_frame_unref(frame);
    }
    av_frame_free(&frame);
}
Instead of passing in a filename, you may be tempted by the docs to fopen and fread the file yourself and pass in the bytes by using a feature called AVIO. Don't do it, it's not worth it. At least on the versions I've used, it has all kinds of problems, like not being able to reliably open up certain files.
The docs say that a lot of parameters in the context structs are optional, which is true. But some are more optional than others. For video and audio files, fmt_ctx->duration seems to always be present, but stream->nb_frames is not for certain major containers like webm. I'm sorry, I wish I could list all of the pitfalls like these but I only know what I know.
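If you do need a frame count for a stream that doesn't report one, about the best you can do is estimate it from the duration and the average frame rate. A sketch of that (my own helper, not an FFmpeg function), assuming the usual libavformat/libavutil headers:

int64_t estimate_frame_count(AVFormatContext *fmt_ctx, AVStream *stream) {
    if (stream->nb_frames > 0)
        return stream->nb_frames; // the container actually told us
    double seconds = 0.0;
    if (stream->duration != AV_NOPTS_VALUE)
        seconds = stream->duration * av_q2d(stream->time_base);
    else if (fmt_ctx->duration != AV_NOPTS_VALUE)
        seconds = (double)fmt_ctx->duration / AV_TIME_BASE;
    // avg_frame_rate can itself be 0/1 for some files, in which case this
    // returns 0 and you're out of luck.
    return (int64_t)(seconds * av_q2d(stream->avg_frame_rate));
}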
When I said that the packets are interleaved, notice how I didn't say anything about the ordering. Within a single stream the data is ordered correctly, but there's no guarantee that packets from different streams are ordered with respect to each other, despite potentially appearing sequentially in the file. In fact, it's rarely the case. Depending on the file and the encoder and the phase of the moon, audio packets may be consistently ahead of the video packets, consistently behind, or somewhere in between. Video is high throughput + high latency, whereas audio is low throughput + low latency. So if you're making a media player or something you're going to have to buffer packets far in advance anyway, at least a couple seconds' worth of data, so this isn't that big of a deal except right after seeking.
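So ordering has to be done by timestamp, not by file position. Converting a packet's timestamp into seconds is just the time_base multiplication mentioned in the code comments above; as a tiny sketch (the helper name is mine):

double packet_time_seconds(AVFormatContext *fmt_ctx, const AVPacket *packet) {
    AVStream *stream = fmt_ctx->streams[packet->stream_index];
    if (packet->pts == AV_NOPTS_VALUE)
        return 0.0; // some packets legitimately carry no timestamp
    return packet->pts * av_q2d(stream->time_base);
}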
This should go without saying for old libraries like this, but FFmpeg is generally not thread-safe. Have it do its own little thing in its own thread and pipe the results wherever else with an SPMC FIFO.
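For what it's worth, the FIFO doesn't need to be anything clever. A bare-bones blocking queue of AVFrame pointers like the sketch below (pthreads assumed, initialization of the mutex and condition variables left out) is enough to get the ownership handoff right: the decode thread pushes frames it will never touch again, and consumers pop them and av_frame_free() them when done. In the decode loop you'd push av_frame_clone(frame), or hand the frame over and allocate a fresh one, instead of unreferencing it.

#include <pthread.h>

#define FRAME_QUEUE_CAP 64

typedef struct FrameQueue {
    AVFrame *items[FRAME_QUEUE_CAP];
    int head, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} FrameQueue;

void frame_queue_push(FrameQueue *q, AVFrame *frame) {
    pthread_mutex_lock(&q->lock);
    while (q->count == FRAME_QUEUE_CAP) // block until there's room
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % FRAME_QUEUE_CAP] = frame;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

AVFrame *frame_queue_pop(FrameQueue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) // block until there's something to take
        pthread_cond_wait(&q->not_empty, &q->lock);
    AVFrame *frame = q->items[q->head];
    q->head = (q->head + 1) % FRAME_QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return frame;
}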
For writing files, most of the things are the same, except how most of the things are different. Refer to this example source file.
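I'll leave a very condensed sketch of the container side of writing here anyway, under the same "error handling omitted" disclaimer. Everything below is the general shape rather than code from my program, and it assumes you already have an opened encoder context (enc_ctx here), an output filename (out_filename), and compressed packets coming out of avcodec_send_frame()/avcodec_receive_packet(), which is the mirror image of the decode loop:

AVFormatContext *out_ctx = NULL;
avformat_alloc_output_context2(&out_ctx, NULL, NULL, out_filename);

// One output stream per data stream you plan to write.
AVStream *out_stream = avformat_new_stream(out_ctx, NULL);
avcodec_parameters_from_context(out_stream->codecpar, enc_ctx);

avio_open(&out_ctx->pb, out_filename, AVIO_FLAG_WRITE);
avformat_write_header(out_ctx, NULL);

// For every packet the encoder hands you:
av_packet_rescale_ts(packet, enc_ctx->time_base, out_stream->time_base);
packet->stream_index = out_stream->index;
av_interleaved_write_frame(out_ctx, packet);

// And once everything has been sent and the encoder has been flushed:
av_write_trailer(out_ctx);
avio_closep(&out_ctx->pb);
avformat_free_context(out_ctx);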
As a last note, more and different kinds of sources are useful, so here: