File:  [DVB] / libsoftmpeg / IDEAS
Revision 1.2
Fri Feb 6 12:04:54 2004 UTC (20 years, 4 months ago) by hunold
Branches: MAIN
CVS tags: HEAD
- reformat docs
- add constant offset and a note that we need to fix it
- add fusionsound realtimepriority patch and a note to the docs

1) The problem with audio/video synchronization
-----------------------------------------------

DVB consists of encoding/broadcasting and reception/decoding.
Fortunately, encoding, broadcasting and reception do not bother us.

One problem, however, is that the decoding must be synchronized with
the encoding process. Because the data comes into the system at a
defined rate, you should not consume the data too fast; in that case,
you will get buffer underruns in your system. You shouldn't consume
the data too slowly either, or your buffers will overflow. So you
basically need to decode and display at exactly the speed at which
the broadcaster has encoded the data. To achieve this, the
broadcaster adds a program clock reference (PCR) to the stream. What
basically happens is that the system time clock (STC) of the decoder
is compared to the PCR on a regular basis. If it's running too fast
or too slow, the main clock of the system is adjusted slightly.
Additionally, the decoder has full control over the audio/video
decoding and audio/video display, and all components use the same
clock.
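
A minimal sketch of such a PCR/STC feedback loop (the function name,
units and gain value here are illustrative assumptions, not
libsoftmpeg code):

```python
def adjust_stc_rate(stc_hz, stc_ticks, pcr_ticks, gain=0.01):
    """Nudge the decoder's system time clock towards the broadcaster's
    clock. Both clocks count 90kHz ticks; a positive error means the
    decoder is running behind the encoder."""
    error = pcr_ticks - stc_ticks
    # Apply only a small fraction of the error so the clock is only
    # adjusted slightly; the loop converges over many PCR samples.
    return stc_hz + gain * error

# Decoder running slightly slow: the received PCR is ahead of the STC,
# so the clock rate is increased a little.
rate = adjust_stc_rate(90000.0, stc_ticks=900000, pcr_ticks=900090)
```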

On an x86 system, however, things look really bad at first glance.
You have three different clocks: the system clock, the "clock" inside
your sound card and the "clock" inside your gfx card that drives your
tv out.

You cannot rely on any of these clocks. Let's assume that you have
video and audio prebuffered. Even if you tell your sound card that it
should play back the data at a rate of 48kHz, it might play it out at
48010 or 47990Hz. This depends on the quartz on the sound card, and
there is no way you can find that out, not to speak of adjusting it.
The same goes for video: let's assume that you want to display frames
at a defined rate of 25 full-frames per second, ie. one frame every
40ms. If the video encoder operates at 25.01 frames per second,
sooner or later you'll have to skip a frame. You cannot rely on the
system clock either; you might have noticed that you need to adjust
it by multiple seconds after a few days. Additionally, you cannot
adjust the quartz that drives the clock.
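
The drift figures above can be checked with a bit of arithmetic (the
rates are the illustrative numbers from the text, not measurements):

```python
def seconds_until_slip(nominal_rate, actual_rate, slip_units):
    """Seconds until the rate error accumulates to slip_units
    (frames or samples)."""
    return slip_units / abs(actual_rate - nominal_rate)

# Display at 25.01fps instead of 25fps: one whole frame of drift
# accumulates in about 100 seconds.
video_slip = seconds_until_slip(25.0, 25.01, 1)

# Sound card at 48010Hz instead of 48000Hz: one field's worth of
# audio (20ms = 960 samples at 48kHz) drifts away in 96 seconds.
audio_slip = seconds_until_slip(48000.0, 48010.0, 960)
```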

All this does not really matter for "mplayer" and "xine", for
example. They "only" need to play out the audio and sync the video
frames accordingly. They can skip and double frames as they like and
thus overcome the problem of imprecise audio and video output. They
rely heavily on the fact that they can seek inside the stream, ie. if
the stream is not properly interleaved, they simply use one file
pointer to access the audio data and one file pointer to access the
video data.

For live DVB playback, however, this approach is not feasible. Of
course you can simply prebuffer a few megabytes of data and then use
"mplayer" to play that back, but prebuffering always results in high
channel switching latencies far above one second. For live tv, this is
unbearable.

But even if you don't care, there is still the problem of buffer
underruns and overflows. When the soundcard plays back the audio data
too fast, the buffer will underflow sooner or later. Because the
application cannot simply seek in the stream but needs to take what's
coming off the air, the application will stop playback and prebuffer a
few megabytes of data. Now imagine you're watching the showdown of some
action movie and your tv freezes because data needs to be prebuffered.
Annoying!

"libsoftmpeg" currently takes the following approach for short-term
audio/video synchronization:

Audio is taken as the master source and is simply played back at the
original sampling rate (mostly 48kHz). Fluent audio is most
important; you'll notice glitches and skips in audio more readily
than a few skipped frames in the video playback.

Playback is started when a specified amount of audio data (currently
500ms) is available. In the meantime, the coded mpeg video frames are
cached. For DVB, video is mostly transmitted a few hundred
milliseconds in advance (usually 100-300ms). If we prebuffer 500ms of
audio, we should be able to cache 800ms of video, ie. at least 20
compressed mpeg frames. Audio and video frames carry presentation
time stamps (PTS), so if we know which byte of audio data is
currently being played back by the sound card, we can calculate which
video frame matches best. This is done once for the initial
audio/video sync. This is what I called "sync-to-audio-pts" earlier.
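
The initial "sync-to-audio-pts" step can be sketched like this; the
function names, the 90kHz PTS unit and the stereo 16-bit sample
format are assumptions for illustration:

```python
def current_audio_pts(first_pts, bytes_played, rate=48000, channels=2,
                      bytes_per_sample=2):
    """PTS (in 90kHz ticks) of the audio sample currently leaving the
    sound card, derived from the bytes played back so far."""
    samples = bytes_played // (channels * bytes_per_sample)
    return first_pts + samples * 90000 // rate

def best_video_frame(audio_pts, cached_frames):
    """cached_frames is a list of (pts, frame) pairs; pick the frame
    whose PTS is closest to the current audio position."""
    return min(cached_frames, key=lambda entry: abs(entry[0] - audio_pts))

# After 0.5s of stereo 16-bit audio (96000 bytes), the audio position
# is 45000 ticks past the first PTS.
pts = current_audio_pts(first_pts=0, bytes_played=96000)
```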

After that, video frames are decoded "just in time" and displayed at
the fixed rate of the video encoder, ie. 25fps for PAL video ("free
flowing video display"). Because of the different playback speeds,
sooner or later video and audio will most likely drift apart.

We can always calculate the PTS of the audio frame that's currently
being played back and look at the PTS of the video frame that's going
to be displayed. If we notice that video is playing too fast, we can
double one field (20ms); if it's too slow, we can skip one field and
gain 20ms. It's important not to do this too often, otherwise you
will notice jerky video (look at the news banner on CNN, for
example).
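
A sketch of that decision, assuming PTS values in 90kHz ticks (so one
20ms field is 1800 ticks); the threshold is an assumption:

```python
FIELD_TICKS = 1800  # 20ms at the 90kHz PTS clock

def field_correction(video_pts, audio_pts, threshold=FIELD_TICKS):
    """Return +1 to double a field (video running ahead), -1 to skip
    a field (video running behind), 0 to leave playback alone."""
    drift = video_pts - audio_pts
    if drift > threshold:
        return +1   # video ahead of audio: repeat a field to wait
    if drift < -threshold:
        return -1   # video behind audio: drop a field to catch up
    return 0        # within tolerance: don't correct, avoid jerkiness
```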

This idea works very well and gives a very good short term audio/video
sync.

2) How to achieve long-term playback stability
----------------------------------------------

As I've already explained above, if we consume the audio data too
fast or too slow, our audio buffer will underflow or overflow sooner
or later.

The first idea might be to simply do what a set-top-box does: have a
look at the PCRs in the stream and use that data. One big problem,
however, is that you don't have a chance to really compare the PCR to
anything.

By the time the userspace application sees a transport stream packet
with a PCR in it, the transport packet has already passed through
three buffers: the buffer used for dma transfer to kernel memory, the
ringbuffer used to provide the transport packets to user space and
the buffer of the user application. If you now extract the PCR
information and compare it to your local system time
("gettimeofday()"), you get a huge jitter in that comparison. Even
worse, these measurements are very bursty. Let's assume your
application uses a buffer of 1024 transport stream packets. If
packets 1 and 1023 contain a PCR and your application processes this
buffer in one chunk, then the gettimeofday()s will be close together,
although the PCRs might be several 10ms apart. You would need to
low-pass filter the results heavily, and even then it is questionable
whether this will ever tell you that your system clock is 0.01% too
fast. Worse still, this tells you nothing about the quartz on your
sound card or on your video encoder. 8-(
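
To illustrate the filtering problem: even a heavily damped
exponential low-pass over the (PCR minus gettimeofday) offsets moves
only slowly towards the true value, so a small clock error stays
buried in the measurement noise for a long time (the filter constant
and the numbers are made up):

```python
def lowpass(samples, alpha=0.01):
    """Exponentially weighted moving average of offset samples; with
    alpha=0.01 each new measurement shifts the estimate by only 1% of
    the remaining difference."""
    estimate = samples[0]
    for s in samples[1:]:
        estimate += alpha * (s - estimate)
    return estimate

# A single 100-tick jitter outlier barely moves the estimate ...
jumpy = lowpass([0.0, 100.0])
# ... but even 100 identical measurements of a real offset of 100
# still leave the estimate well short of it.
slow = lowpass([0.0] + [100.0] * 100)
```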

But this is not really a problem: in contrast to net streaming
applications we don't need a PCR. The data rate with DVB is fixed. For
PAL we *know* that audio is coming with 48kHz and video is coming with
25 frames per second.

Because "libsoftmpeg" uses audio as the master sync source and video
is already synced to the audio on a short-term basis, the basic idea
is to use a "buffer fullness strategy" for the audio buffer. If we
can keep the audio buffer roughly half full, then we'll never have
problems with buffer under- or overflow.

The (currently unimplemented) idea is to monitor the fullness of the
audio buffer. If we're in normal operation and the fullness drops
below - say - 25% in x seconds, then we can calculate the rate at
which the sound card is consuming data too fast. To overcome this
problem, we can *slightly* adjust the pitch of the sound to come back
to a fill grade of 50%, perhaps in 5*x seconds. Of course this
adjustment should be only a few Hertz, otherwise watching an opera
can be annoying, too. ;-)
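
A sketch of that controller arithmetic; the 25%/50% fill grades and
the 5*x recovery window follow the text above, while the buffer size,
time span and function name are assumptions:

```python
def corrected_rate(nominal_hz, buffer_samples, fill_before, fill_now,
                   x_seconds, recover_factor=5):
    """New playback rate in Hz: compensate the measured surplus
    consumption, plus a little extra so the buffer returns to its old
    fill grade within recover_factor * x_seconds."""
    drained = (fill_before - fill_now) * buffer_samples  # samples lost
    surplus_hz = drained / x_seconds            # over-consumption rate
    refill_hz = drained / (recover_factor * x_seconds)
    return nominal_hz - surplus_hz - refill_hz

# A 500ms buffer at 48kHz (24000 samples) drops from 50% to 25% fill
# over 10 minutes: the card eats 10 samples/s too many, so we lower
# the pitch by 12Hz in total -- only a few Hertz, as required.
new_rate = corrected_rate(48000.0, 24000, 0.50, 0.25, 600.0)
```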

LinuxTV legacy CVS <linuxtv.org/cvs>