This version adds a few minor fixes + cutting speed improvement.
well, vdr w/ the recent cUnbufferedFile changes was flushing the data buffers in huge bursts; this was even worse than slowly filling up the caches -- the large (IIRC ~10M) bursts caused latency problems (apps visibly freezing etc).
This patch makes vdr use a much more aggressive disk access strategy. Writes are flushed out almost immediately and the IO is more evenly distributed. While recording and/or replaying, the caches do not grow, and when vdr is done accessing a video file, all cached data from that file is dropped.
I've tested this w/ both local disks and NFS mounted ones, and it seems to do the right thing. Writes get flushed every 1..2s at a rate of .5..1M/s instead of the >10M bursts. For async mounted NFS servers the writes get collected by the NFS server and normally written out. Local disks get an extra feature -- you can use the HD activity LED as a "recording" indicator :^)
As posix_fadvise requires kernel v2.5.60 and glibc v2.2, you'll need at least those versions to see any difference. (w/o posix_fadvise you will not get the fdatasyncs every 10M -- if somebody really wants them, they should be controlled by a config option)
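For what it's worth, a minimal sketch (not part of the patch; the helper name is made up) of how the call could be guarded at build time, falling back to the old fdatasync behaviour on systems without a usable posix_fadvise:

  #include <fcntl.h>
  #include <unistd.h>

  // hypothetical helper: start writeback of [begin,end) without blocking
  // where posix_fadvise exists, otherwise flush synchronously as before
  static void FlushWrittenRange(int fd, off_t begin, off_t end)
  {
  #ifdef POSIX_FADV_DONTNEED
    posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
  #else
    fdatasync(fd);
  #endif
  }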
Possible further improvements could be:
switch from POSIX_FADV_SEQUENTIAL to POSIX_FADV_RANDOM, since we're doing manual readahead anyway (or just leave this as is, and drop the readahead) (not using POSIX_FADV_RANDOM is probably one of the causes of the "leaks", so some of the workarounds could then go too)
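As an illustration of the two alternatives (a sketch only, not a patch hunk; the flag name is made up):

  #include <fcntl.h>

  // advise the kernel at open time, depending on who does the readahead
  void AdviseAccessPattern(int fd, bool manualReadahead)
  {
    if (manualReadahead)
       // we issue our own POSIX_FADV_WILLNEED requests, so disable the
       // kernel's readahead to avoid doing the work twice
       posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    else
       // drop the manual readahead and let the kernel read ahead for
       // the (mostly) sequential access pattern
       posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
  }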
artur
--- vdr-1.3.36.org/cutter.c	2005-10-31 13:26:44.000000000 +0100
+++ vdr-1.3.36/cutter.c	2005-11-18 03:20:50.000000000 +0100
@@ -66,6 +66,7 @@ void cCuttingThread::Action(void)
      toFile = toFileName->Open();
      if (!fromFile || !toFile)
         return;
+     fromFile->setreadahead(MEGABYTE(10));
      int Index = Mark->position;
      Mark = fromMarks.Next(Mark);
      int FileSize = 0;
@@ -90,6 +91,7 @@ void cCuttingThread::Action(void)
            if (fromIndex->Get(Index++, &FileNumber, &FileOffset, &PictureType, &Length)) {
               if (FileNumber != CurrentFileNumber) {
                  fromFile = fromFileName->SetOffset(FileNumber, FileOffset);
+                 fromFile->setreadahead(MEGABYTE(10));
                  CurrentFileNumber = FileNumber;
                  }
               if (fromFile) {
--- vdr-1.3.36.org/tools.c	2005-11-04 17:33:18.000000000 +0100
+++ vdr-1.3.36/tools.c	2005-11-18 20:33:46.000000000 +0100
@@ -851,8 +851,7 @@ bool cSafeFile::Close(void)
 
 // --- cUnbufferedFile -------------------------------------------------------
 
-#define READ_AHEAD MEGABYTE(2)
-#define WRITE_BUFFER MEGABYTE(10)
+#define WRITE_BUFFER KILOBYTE(800)
 
 cUnbufferedFile::cUnbufferedFile(void)
 {
@@ -869,7 +868,15 @@ int cUnbufferedFile::Open(const char *Fi
   Close();
   fd = open(FileName, Flags, Mode);
   begin = end = ahead = -1;
+  readahead = 16*1024;
+  pendingreadahead = 0;
   written = 0;
+  totwritten = 0;
+  if (fd >= 0) {
+     // we really mean POSIX_FADV_SEQUENTIAL, but we do our own readahead
+     // so turn off the kernel one.
+     posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
+     }
   return fd;
 }
@@ -880,10 +887,10 @@ int cUnbufferedFile::Close(void)
         end = ahead;
      if (begin >= 0 && end > begin) {
         //dsyslog("close buffer: %d (flush: %d bytes, %ld-%ld)", fd, written, begin, end);
-        if (written)
+        if (0 && written)
           fdatasync(fd);
-        posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
        }
+     posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
      begin = end = ahead = -1;
      written = 0;
      }
@@ -899,35 +906,89 @@ off_t cUnbufferedFile::Seek(off_t Offset
   return -1;
 }
 
+// when replaying and going eg FF->PLAY the position jumps back 2..8M
+// hence we might not want to drop that data at once.
+// Ignoring for now to avoid making this even more complex, but we could
+// at least try to handle the common cases
+// (PLAY->FF->PLAY, small jumps, moving editing marks etc)
+
 ssize_t cUnbufferedFile::Read(void *Data, size_t Size)
 {
   if (fd >= 0) {
      off_t pos = lseek(fd, 0, SEEK_CUR);
-     // jump forward - adjust end position
-     if (pos > end)
-        end = pos;
-     // after adjusting end - don't clear more than previously requested
-     if (end > ahead)
-        end = ahead;
-     // jump backward - drop read ahead of previous run
-     if (pos < begin)
-        end = ahead;
+     off_t jumped = pos-end; // nonzero means we're not at the last offset - some kind of jump happened.
+     if (jumped) {
+        pendingreadahead += ahead-end+KILOBYTE(64);
+        // jumped forward? - treat as if we did read all the way to current pos.
+        if (pos > end) {
+           end = pos;
+           // but clamp at ahead so we don't clear more than previously requested.
+           // (would be mostly harmless anyway, unless we got more than one reader of this file)
+           // add a little extra readahead, JIC the kernel prefethed more than we requested.
+           if (end > (ahead+KILOBYTE(128)))
+              end = ahead+KILOBYTE(128);
+           }
+        // jumped backward? - drop both last read _and_ read-ahead
+        if (pos < begin)
+           end = ahead+KILOBYTE(128);
+        // jumped backward, but still inside prev read window? - pretend we read less.
+        if ((pos >= begin) && (pos < end))
+           end = pos;
+        }
+
+     ssize_t bytesRead = safe_read(fd, Data, Size);
+
+     // now drop all data accesed during _previous_ Read().
      if (begin >= 0 && end > begin)
-        posix_fadvise(fd, begin - KILOBYTE(200), end - begin + KILOBYTE(200), POSIX_FADV_DONTNEED);//XXX macros/parameters???
+        posix_fadvise(fd, begin, end-begin, POSIX_FADV_DONTNEED);
+
      begin = pos;
-     ssize_t bytesRead = safe_read(fd, Data, Size);
      if (bytesRead > 0) {
        pos += bytesRead;
-        end = pos;
        // this seems to trigger a non blocking read - this
        // may or may not have been finished when we will be called next time.
        // If it is not finished we can't release the not yet filled buffers.
        // So this is commented out till we find a better solution.
-        //posix_fadvise(fd, pos, READ_AHEAD, POSIX_FADV_WILLNEED);
-        ahead = pos + READ_AHEAD;
+
+        // Hmm, it's obviously harmless if we're actually going to read the data
+        // -- the whole point of read-ahead is to start the IO early...
+        // The comment above applies only when we jump somewhere else _before_ the
+        // IO started here finishes. How common would that be? Could be handled eg
+        // by posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) called some time after
+        // we detect a jump. Ignoring this for now. /AS
+
+        // Ugh, it seems to cause some "leaks" at every jump... Either the
+        // brute force approach mentioned above should work (it's not like this is
+        // much different than O_DIRECT) or keeping notes about the ahead reads and
+        // flushing them after some time. the latter seems overkill though, trying
+        // the former...
+
+        //syslog(LOG_DEBUG,"jump: %06ld ra: %06ld size: %ld", jumped, (long)readahead, (long)Size);
+
+        // no jump? also permit small jump still inside readahead window (FF).
+        if (jumped>=0 && jumped<=(off_t)readahead) {
+           if ( readahead <= Size*4 )  // automagically tune readahead size.
+              readahead = Size*4;
+           posix_fadvise(fd, pos, readahead, POSIX_FADV_WILLNEED);
+           ahead = pos + readahead;
+           }
+        else {
+           // jumped - we really don't want any readahead now. otherwise
+           // eg fast-rewind gets in trouble.
+           ahead = pos;
+
+           // flush it all; mostly to get rid of nonflushed readahead coming
+           // from _previous_ jumps. ratelimited.
+           // the accounting is _very_ unaccurate, i've seen ~50M get flushed
+           // when the limit was set to 4M. As long as this triggers after
+           // _some_ jumps we should be ok though.
+           if (pendingreadahead > MEGABYTE(2)) {
+              posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+              pendingreadahead = 0;
+              }
+           }
        }
-     else
-        end = pos;
+     end = pos;
      return bytesRead;
      }
   return -1;
@@ -950,11 +1011,19 @@ ssize_t cUnbufferedFile::Write(const voi
           end = pos + bytesWritten;
        if (written > WRITE_BUFFER) {
           //dsyslog("flush buffer: %d (%d bytes, %ld-%ld)", fd, written, begin, end);
-           fdatasync(fd);
-           if (begin >= 0 && end > begin)
-              posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
+           totwritten += written;
+           if (begin >= 0 && end > begin) {
+              off_t headdrop = max((long)begin&~4095,(long)WRITE_BUFFER*2);
+              posix_fadvise(fd, (begin&~4095)-headdrop, end - begin + headdrop, POSIX_FADV_DONTNEED);
+              }
           begin = end = -1;
           written = 0;
+           // the above fadvise() works when recording, but seems to leave cached
+           // data around when writing at a high rate (eg cutting). Hence...
+           if (totwritten > MEGABYTE(20)) {
+              posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+              totwritten = 0;
+              }
           }
        }
     return bytesWritten;
--- vdr-1.3.36.org/tools.h	2005-11-05 11:54:39.000000000 +0100
+++ vdr-1.3.36/tools.h	2005-11-18 03:13:31.000000000 +0100
@@ -209,6 +209,9 @@ private:
   off_t end;
   off_t ahead;
   ssize_t written;
+  ssize_t totwritten;
+  size_t readahead;
+  size_t pendingreadahead;
 public:
   cUnbufferedFile(void);
   ~cUnbufferedFile();
@@ -218,6 +221,7 @@ public:
   ssize_t Read(void *Data, size_t Size);
   ssize_t Write(const void *Data, size_t Size);
   static cUnbufferedFile *Create(const char *FileName, int Flags, mode_t Mode = DEFFILEMODE);
+  void setreadahead(size_t ra) { readahead = ra; };
  };
 
 class cLockFile {
Artur Skawina wrote:
well, vdr w/ the recent cUnbufferedFile changes was flushing the data buffers in huge bursts; this was even worse than slowly filling up the caches -- the large (IIRC ~10M) bursts caused latency problems (apps visibly freezing etc).
Does this freezing apply to local disk access or only to network filesystems? My personal VDR is a system dedicated to VDR usage which uses a local hard disk for storage. So I don't have applications running in parallel to vdr which could freeze, nor can I actually test the behaviour on network devices. It seems you have both of these extra features, so it would be nice to know more about this.
For local usage I found that IO interruptions of less than a second (10 MB burst writes on disks which deliver a hell of a lot more than 10MB/sec) have no negative side effects. But I can imagine that on 10Mbit ethernet it could be hard to have these bursts ... I did not think about this when writing the initial patch ...
This patch makes vdr use a much more aggressive disk access strategy. Writes are flushed out almost immediately and the IO is more evenly distributed. While recording and/or replaying the caches do not grow and when vdr is done accessing a video file all cached data from that file is dropped.
Actually, with the patch you attached my cache _does_ grow. It does not only grow - it displaces the inode cache, which is exactly what the initial patch was created to avoid. To make it worse - when cutting a recording and replaying the newly cut recording at the same time, I get major hangs in replay.
I had a look at your patch - it looked very good. But for whatever reason it doesn't do what it is supposed to do on my VDR. I currently don't know why it doesn't work here for replay - the code there looked good.
I like the heuristics you used to deal with read ahead - but maybe these lead to the leaks I experience here. I will have a look at it. Maybe I can find out something about it ...
I've tested this w/ both local disks and NFS mounted ones, and it seems to do the right thing. Writes get flushed every 1..2s at a rate of .5..1M/s instead of the >10M bursts.
To be honest - I did not find the place where writes get flushed in your patch. posix_fadvise() doesn't seem to influence flushing at all. It only applies to already written buffers. So the normal write strategy is used with your patch - collect data until the kernel decides to write it to disk. This leads to "collect about 300MB" here, followed by a burst of up to 300MB. This is a bit heavier than the 10MB bursts before ;)
Regards Ralf
Ralf Müller wrote:
Artur Skawina wrote:
well, vdr w/ the recent cUnbufferedFile changes was flushing the data buffers in huge bursts; this was even worse than slowly filling up the caches -- the large (IIRC ~10M) bursts caused latency problems (apps visibly freezing etc).
Does this freezing apply to local disk access or only to network filesystems? My personal VDR is a system dedicated to VDR usage which uses a local hard disk for storage. So I don't have applications running in parallel to vdr which could freeze, nor can I actually test the behaviour on network devices. It seems you have both of these extra features, so it would be nice to know more about this.
the freezing certainly applies to NFS -- it shows clearly if you have some kind of monitor app graphing network traffic. It may just be the huge amount of data shifted and the associated cpu load, but the delays are noticeable for non-rt apps running on the same machine. It's rather obvious when eg watching tv using xawtv while recording. As to the local disk case -- i'm not sure of the impact -- most of my vdr data goes over NFS, and this was what made me look at the code. There could be less of a problem w/ local disks, or I simply didn't realize the correlation w/ vdr activity as, unlike for network traffic, i do not have a local IO graph on screen :)
(i _think_ i verified w/ vmstat that local disks were not immune to this, but right now i no longer remember the details, so can't really be sure)
For local usage I found that IO interruptions of less than a second (10 MB burst writes on disks which deliver a hell of a lot more than 10MB/sec) have no negative side effects. But I can imagine that on 10Mbit ethernet it could be hard to have these bursts ... I did not think about this when writing the initial patch ...
it's a problem even on 100mbit -- while the fileserver certainly can accept sustained 10M/s data for several seconds (at least), it's the client, ie vdr-box, that does not behave well -- it sits almost completely idle for minutes (zero network traffic, no writeback at all), and then goes busy for a second or so. I first tried various priority changes, but didn't see any visible improvement. Having vdr running at low prio isn't really an option anyway.
Another issue could be the fsync calls -- at least on ext3 these apparently behave very similarly to sync(2)...
This patch makes vdr use a much more aggressive disk access strategy. Writes are flushed out almost immediately and the IO is more evenly distributed. While recording and/or replaying the caches do not grow and when vdr is done accessing a video file all cached data from that file is dropped.
Actually, with the patch you attached my cache _does_ grow. It does not only grow - it displaces the inode cache, which is exactly what the initial patch was created to avoid. To make it worse - when cutting a recording and replaying the newly cut recording at the same time, I get major hangs in replay.
oh, the cutting-trashes-cache-a-bit isn't really such a big surprise -- i was seeing something like that while testing the code -- I had hoped the extra fadvise every 10M would fix that, but i wanted to get the recording and replay cases right first. (the issue when cutting is simply that we need to: a) start the writeback, and b) drop the cached data after it has hit the disk. The problem is that we don't really know when to do b... For low write rates the heuristic seems to work, for high rates it might fail. Yes, fdatasync obviously will work, but this is the sledgehammer approach :) The fadvise(0,0) solution was a first try at using a slightly smaller hammer. Keeping a dirty-list and flushing it after some time would be the next step if fadvise isn't enough.)
How does the cache behave when _not_ cutting? Over here it looks ok, i've done several recordings while playing back others, and the cache was basically staying the same. (as this is not a dedicated vdr box it is however sometimes hard to be sure)
I had a look at your patch - it looked very good. But for whatever reason it doesn't do what it is supposed to do on my VDR. I currently don't know why it doesn't work here for replay - the code there looked good.
in v1 i was using a relatively small readahead window -- maybe for a slow disk it was _too_ small. In v2 it's a little bigger, maybe that will help (i increased it to make sure the readahead worked for fast-forward, but so far i haven't been able to see much difference). But I don't usually replay anything while cutting, so this hasn't really been tested...
(BTW, with the added readahead in the v2 patch, vdr seems to come close to saturating a 100M connection when cutting. Even when _both_ the source and destination are on the same NFSv3 mounted disk, which kind of surprised me. The LocalDisk->NFS rate and v/v seems to be limited by the network. I didn't check localdisk->localdisk (lack of sufficient disk space). Didn't do any real benchmarking, these are estimates based on observing the rate at which free disk space decreases and on network traffic)
I like the heuristics you used to deal with read ahead - but maybe these lead to the leaks I experience here. I will have a look at it. Maybe I can find out something about it ...
Please do, I did and posted this to get others to look at that code and hopefully come up w/ a strategy which works for everyone. For cutting I was going to switch to O_DIRECT, until i realized we then would still need a fallback strategy, for old kernels and NFS...
The current vdr behavior isn't really acceptable -- at the very least the fsyncs have to be configurable -- even a few hundred megabytes needlessly dirtied by vdr is still much better than the bursts of traffic, disk and cpu usage. I personally don't mind the cache trashing so much; it would be enough to keep vdr happily running in the background without disturbing other tasks. (one of the reasons is that while keeping the recording list in cache seems to help local disks, it doesn't really help for NFS -- you still get lots of NFS traffic every time vdr decides to reread the directory structure. As both the client and server could fit the dir tree in ram the limiting factor becomes the network latency)
I've tested this w/ both local disks and NFS mounted ones, and it seems to do the right thing. Writes get flushed every 1..2s at a rate of .5..1M/s instead of the >10M bursts.
To be honest - I did not find the place where writes get flushed in your patch. posix_fadvise() doesn't seem to influence flushing at all.
Hmm, what glibc/kernel? It works here w/ glibc-2.3.90 and linux-2.6.14.
Here's "vmstat 1" output; vdr (patched 1.3.36) is currently doing a recording to local disk:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0   9168 202120    592  22540    0    0     0     0 3584  1350  0  1 99  0
 0  0   9168 202492    592  22052    0    0     0   800 3596  1330  1  0 99  0
 0  0   9168 202368    592  22356    0    0     0     0 3576  1342  1  0 99  0
 0  0   9168 202492    592  21836    0    0     0   804 3628  1350  0  0 100 0
 0  0   9168 202492    592  22144    0    0     0     0 3573  1346  1  1 98  0
 0  0   9168 202244    592  22452    0    0     0     0 3629  1345  1  0 99  0
 1  0   9168 202492    592  21956    0    0     0   800 3562  1350  0  0 100 0
 0  0   9168 202368    592  22260    0    0     0     0 3619  1353  1  0 99  0
 0  0   9168 202120    592  22568    0    0     0     0 3616  1357  1  1 98  0
 0  0   9168 202492    592  22044    0    0     0   952 3617  1336  0  0 100 0
 0  0   9168 202368    596  22352    0    0     0     0 3573  1356  1  0 99  0
 1  0   9168 202616    596  21724    0    0     0   660 3609  1345  0  0 100 0
 0  0   9168 202616    596  22000    0    0     0     0 3569  1338  1  1 98  0
 0  0   9168 202368    596  22304    0    0     0     0 3573  1335  1  0 99  0
 1  0   9168 202492    596  21956    0    0     0   896 3644  1360  0  1 99  0
 0  0   9168 202492    596  22232    0    0     0     0 3592  1327  1  0 99  0
 0  0   9168 202120    596  22536    0    0     0     0 3571  1333  0  0 100 0
 0  0   9168 202616    596  21968    0    0     0   800 3575  1329 11  3 86  0
 0  0   9168 202368    596  22244    0    0     0     0 3604  1350  1  0 99  0
 0  0   9168 202492    596  21756    0    0     0   820 3585  1326  0  1 99  0
 0  0   9168 202492    612  22060    0    0     8   140 3632  1369  1  1 89  9
 0  0   9168 202244    612  22336    0    0     0     0 3578  1328  1  0 99  0
 0  0   9168 202492    612  21796    0    0     0   784 3619  1360  0  0 100 0
 0  0   9168 202492    628  22072    0    0     8   104 3559  1317  2  0 96  2
 0  0   9168 202244    632  22376    0    0     0     0 3604  1348  1  0 99  0
 0  0   9168 202492    632  21904    0    0     0   800 3695  1402  0  0 100 0
 0  0   9168 202368    632  22180    0    0     0     0 3775  1456  1  1 98  0
 0  0   9168 202120    632  22484    0    0     0     0 3699  1416  0  1 99  0
 0  0   9168 202492    632  21992    0    0     0   804 3774  1465  1  0 99  0
 1  0   9168 202236    632  22268   32    0    32     0 3810  1570  3  1 93  3
 0  0   9168 202360    632  21776    0    0     0   820 3896  1690  1  1 98  0
the 'bo' column shows the writeout caused by vdr. Also note the 'free' and 'cache' fields fluctuate a bit, but do not grow. Hmm, now i notice the slowly growing 'buff' -- is this causing you problems? I didn't mind it here, as there's clearly plenty of free RAM around. Will have to investigate what happens under some memory pressure.
Are you saying you don't get any writeback activity w/ my patch?
With no posix_fadvise and no fdatasync calls in the write path i get almost no writeout, just multi-megabyte bursts every minute (probably triggered by the ext3 journal commit (interval set to 60s) and/or memory pressure).
It only applies to already written buffers. So the normal write
/usr/src/linux/mm/fadvise.c should contain the implementation of the various fadvise modes in a linux 2.6 kernel. It certainly does trigger writeback here. Both in the local disk case, and on NFS, where it causes a similar traffic pattern.
strategy is used with your patch - collect data until the kernel decides to write it to disk. This leads to "collect about 300MB" here, followed by a burst of up to 300MB. This is a bit heavier than the 10MB bursts before ;)
See vmstat output above. Are you sure you have a working posix_fadvise? If not, that would also explain the hang during playback as no readahead was actually taking place... (to be honest, i don't think that you need any manual readahead at all in a normal-playback situation; especially as the kernel will by default do some. It's only when the disk is getting busier that the benefits of readahead show up. At least this is what i saw here) What happens when you start a replay and then end it? is the memory freed immediately?
Thanks for testing and the feedback.
Regards,
artur
On Monday 21 November 2005 02:15, Artur Skawina wrote:
the freezing certainly applies to NFS -- it shows clearly if you have
Ok - I see.
it's a problem even on 100mbit -- while the fileserver certainly can accept sustained 10M/s data for several seconds (at least), it's the client, ie vdr-box, that does not behave well -- it sits almost completely idle for minutes (zero network traffic, no writeback at all), and then goes busy for a second or so.
But this very much sounds like an NFS problem - and much less like a VDR problem ...
[...] I had hoped the extra fadvise every 10M would fix that, but i wanted to get the recording and replay cases right first. (the issue when cutting is simply that we need to: a) start the writeback, and b) drop the cached data after it has hit the disk. The problem is that we don't really know when to do b...
That's exactly the problem here ... without special force my kernel seems to prefer to use memory instead of disk ...
For low write rates the heuristic seems to work, for high rates it might fail. Yes, fdatasync obviously will work, but this is the sledgehammer approach :)
I know. I also don't like this approach. But at least it worked (here).
The fadvise(0,0) solution was a first try at using a slightly smaller hammer. Keeping a dirty-list and flushing it after some time would be the next step if fadvise isn't enough.)
How do you know what is still dirty in case of writes?
How does the cache behave when _not_ cutting? Over here it looks ok, i've done several recordings while playing back others, and the cache was basically staying the same. (as this is not a dedicated vdr box it is however sometimes hard to be sure)
With the active read ahead I even have leaks when only reading - the non-blocking reads initiated by the WILLNEED advice seem to keep pages in the buffer cache.
in v1 i was using a relatively small readahead window -- maybe for a slow disk it was _too_ small. In v2 it's a little bigger, maybe that will help (i increased it to make sure the readahead worked for fast-forward, but so far i haven't been able to see much difference). But I don't usually replay anything while cutting, so this hasn't really been tested...
My initial intention when trying to use an active read ahead was to have no hangs even when another disk needs to spin up. On my system I sometimes have this problem and it is annoying. So a read ahead of several megabytes would be needed here - but even without such a huge read ahead I get these annoying leaks here. For normal operation (replay) they could be avoided by increasing the region which has to be cleared to at least the size of the read ahead.
(BTW, with the added readahead in the v2 patch, vdr seems to come close to saturating a 100M connection when cutting. Even when _both_ the source and destination are on the same NFSv3 mounted disk, which kind of surprised me. The LocalDisk->NFS rate and v/v seems to be limited by the network. I didn't check localdisk->localdisk (lack of sufficient disk space). Didn't do any real benchmarking, these are estimates based on observing the rate at which free disk space decreases and on network traffic)
Cool!
The current vdr behavior isn't really acceptable -- at the very least the fsyncs have to be configurable -- even a few hundred megabytes needlessly dirtied by vdr is still much better than the bursts of traffic, disk and cpu usage. I personally don't mind the cache trashing so much; it would be enough to keep vdr happily running in the background without disturbing other tasks.
Depends on the use case. You are absolutely right in the NFS case. In the "dedicated to VDR standalone" case this is different. By throwing away the inode cache it makes the use of big recording archives uncomfortable - it takes up to 20 seconds to scan my local recordings directory. That's a long time when you just want to select a recording ...
To be honest - I did not find the place where writes get flushed in your patch. posix_fadvise() doesn't seem to influence flushing at all.
Hmm, what glibc/kernel? It works here w/ glibc-2.3.90 and linux-2.6.14.
SuSE 9.1:
GNU C Library stable release version 2.3.3 (20040405)
Kernel 2.6.14
Here's "vmstat 1" output; vdr (patched 1.3.36) is currently doing a recording to local disk:
procs -----------memory---------- ---swap-- -----io---- ... [ ... ]
the 'bo' column shows the writeout caused by vdr. Also note the 'free' and 'cache' fields fluctuate a bit, but do not grow. Hmm, now i notice the slowly growing 'buff' -- is this causing you problems?
I don't think so - this would not fill my RAM in the next weeks ;) I usually have 300MB left on the box (yes - it has quite a lot of memory for just a VDR ... )
I didn't mind this here, as there's clearly plenty of free RAM around. Will have to investigate what happens under some memory pressure.
As I said - at least here there is no pressure.
Are you saying you don't get any writeback activity w/ my patch?
Correct. It starts writing back when memory is filled. Not a single second earlier.
With no posix_fadvise and no fdatasync calls in the write path i get almost no writeout, just multi-megabyte bursts every minute (probably triggered by the ext3 journal commit (interval set to 60s) and/or memory pressure).
Using reiserfs here. I remember having configured it for lazy disk operations ... maybe this is the source of the above results. The idea was to collect system writes - to not spin up the disks if not absolutely necessary. But this obviously also results in collecting VDR writes ... anyway I think this is a valid case too. At least for dedicated "multimedia" stations ... A bit more control over VDR IO would be a great thing to have.
It only applies to already written buffers. So the normal write
/usr/src/linux/mm/fadvise.c should contain the implementation of the various fadvise modes in a linux 2.6 kernel. It certainly does trigger writeback here. Both in the local disk case, and on NFS, where it causes a similar traffic pattern.
Will have a look at the code.
See vmstat output above. Are you sure you have a working posix_fadvise?
Quite sure - the current VDR version is performing perfectly well - within its limit.
If not, that would also explain the hang during playback as no readahead was actually taking place... (to be honest, i don't think that you need any manual readahead at all in a normal-playback situation; especially as the kernel will by default do some. It's only when the disk is getting busier that the benefits of readahead show up. At least this is what i saw here)
Remember - you switched off read ahead: POSIX_FADV_RANDOM ;)
Anyway - it seems the small read ahead in your patch didn't have the slightest chance against the multi-megabyte write back triggered when the buffer cache was at its limits.
What happens when you start a replay and then end it? is the memory freed immediately?
I will have a look at it again.
Thanks a lot for working on the problem.

Regards
Ralf
Ralf Müller wrote:
On Monday 21 November 2005 02:15, Artur Skawina wrote:
client, ie vdr-box, that does not behave well -- it sits almost completely idle for minutes (zero network traffic, no writeback at all), and then goes busy for a second or so.
But this very much sounds like an NFS problem - and much less like a VDR problem ...
this is perfectly normal behavior; it's the same as for the local disk case. The problem is that since the vdr box isn't under any memory pressure it collects all the writes. If not for the fdatasyncs it would start writing the data out asynchronously after some time, when it needed some RAM or had too many dirty pages. The problem is that vdr does not let it do that -- after 10M it asks the system to commit all the data to disk and return status. So the box does just that -- flushes the data as fast as possible in order to complete the synchronous request. This is where fadvise(DONTNEED) helps -- it tells the system that we're not going to access the written data any time soon, so it starts committing that buffered data back to disk immediately. Just as it would if it was under memory pressure, except now there is none; and once the data gets to disk it no longer needs to be treated as dirty and can be easily freed.
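To make the difference concrete, a tiny sketch (the function names are mine, not vdr's): both variants get the data onto the disk, but only the first one blocks the caller until it is there:

  #include <fcntl.h>
  #include <unistd.h>

  // burst/blocking: commit everything written so far and wait for it
  void FlushBlocking(int fd)
  {
    fdatasync(fd);   // returns only when the data is on the disk
  }

  // streaming/non-blocking: tell the kernel the range won't be accessed again;
  // it starts writing the dirty pages back right away and can free them once
  // they are clean -- the caller never waits
  void StartWriteback(int fd, off_t begin, off_t end)
  {
    posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
  }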
[...] I had hoped the extra fadvise every 10M would fix that, but i wanted to get the recording and replay cases right first. (the issue when cutting is simply that we need to: a) start the writeback, and b) drop the cached data after it has hit the disk. The problem is that we don't really know when to do b...
That's exactly the problem here ... without special force my kernel seems to prefer to use memory instead of disk ...
if you have told it to do exactly that, using that reiserfs setting mentioned below, well, i guess it tries to do its best to obey :)
For low write rates the heuristic seems to work, for high rates it might fail. Yes, fdatasync obviously will work, but this is the sledgehammer approach :)
I know. I also don't like this approach. But at least it worked (here).
The fadvise(0,0) solution was a first try at using a slightly smaller hammer. Keeping a dirty-list and flushing it after some time would be the next step if fadvise isn't enough.)
How do you know what is still dirty in case of writes?
The strategy currently is this: after writing some data to the file (~1M) we use fadvise to make the kernel start writing it to disk; after some time we call fadvise on the same data _again_, and by then it has hopefully already hit the disk, is clean and will be dropped. (I actually call fadvise three, not two, times just to be sure). This seems to work fine for slow sequential writes, such as when recording; for cutting we create the dirty data faster than it can be written back to disk - this is where the global fadvise(DONTNEED) was supposed to help, and in the few cutting tests i did it seemed to be enough.
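Roughly like this, as a simplified sketch (lastBegin/lastEnd are hypothetical variables remembering the previously advised range; the real bookkeeping in the patch lives in Write() and Close()):

  #include <fcntl.h>

  // called after every ~1M of writes covering [begin,end)
  void AdviseWrittenChunk(int fd, off_t begin, off_t end,
                          off_t &lastBegin, off_t &lastEnd)
  {
    // first advice for this chunk: the kernel starts writing it back
    posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);

    // advise the previous chunk again: by now it has hopefully hit the disk,
    // is clean, and this time the pages actually get dropped
    if (lastBegin >= 0 && lastEnd > lastBegin)
       posix_fadvise(fd, lastBegin, lastEnd - lastBegin, POSIX_FADV_DONTNEED);

    lastBegin = begin;
    lastEnd   = end;
  }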
How does the cache behave when _not_ cutting? Over here it looks ok, i've done several recordings while playing back others, and the cache was basically staying the same. (as this is not a dedicated vdr box it is however sometimes hard to be sure)
With the active read ahead I even have leaks when only reading - the non-blocking reads initiated by the WILLNEED advice seem to keep pages in the buffer cache.
maybe another reiserfs issue? does it occur when sequentially reading, ie on normal playback? Or only when also seeking around in the file? In the latter case i was seeing some small leaks too, that was the reason for the fadvise calls every X jumps.
My initial intention when trying to use an active read ahead was to have no hangs even when another disk needs to spin up. On my system I sometimes have this problem and it is annoying. So a read ahead of several megabytes would be needed here - but even without such a huge read ahead I get these annoying leaks here. For normal operation
hmm, the readahead is only per-file -- do you have filesystems spanning several disks, _some_ of which are spun down?
(replay) they could be avoided by increasing the region which has to be cleared to at least the size of the read ahead.
Isn't this exactly what is currently happening (both w/o and with my patch)?
The current vdr behavior isn't really acceptable -- at the very least the fsyncs have to be configurable -- even a few hundred megabytes needlessly dirtied by vdr is still much better than the bursts of traffic, disk and cpu usage. I personally don't mind the cache trashing so much; it would be enough to keep vdr happily running in the background without disturbing other tasks.
Depends on the use case. You are absolutely right in the NFS case. In the "dedicated to VDR standalone" case this is different. By throwing
A config option "Write strategy: NORMAL|STREAMING|BURST" would be enough for everyone :) (where STREAMING is what my patch does, at least here, BURST is with the fdatasyncs followed by fadvice(WONTNEED), and normal is w/o both)
away the inode cache it makes usage of big recording archives uncomfortable - it takes up to 20 seconds to scan my local recordings directory. That's a long time when you just want to select a recording ...
It seemed much longer than 20s here :) Now that vdr caches the list, it's not a big problem anymore.
Are you saying you don't get any writeback activity w/ my patch?
Correct. It starts writing back when memory is filled. Not a single second earlier.
With no posix_fadvice and no fdatasync calls in the write path i get almost no writeout with multi-megabyte bursts every minute (triggered probably by ext3 journal commit (interval set to 60s) and/or memory pressure).
Using reiserfs here. I remember having configured it for lazy disk operations ... maybe this is the source of the above results. The idea was to collect system writes - to not spin up the disks if not absolutely necessary. But this obviously also results in collecting VDR writes ... anyway I think this is a valid case too. At least for dedicated "multimedia" stations ... A bit more control over VDR IO would be a great thing to have.
reiserfs collecting all writes would explain the behavior; whether it's a good thing or not in this scenario i'm not sure. Apparently this does not give you any way to force disk writes, other than a synchronous flush (ie fdatasync)?...
i don't think that you need any manual readahead at all in a normal-playback situation; especially as the kernel will by default do some. It's only when the disk is getting busier that the benefits of readahead show up. At least this is what i saw here)
Remember - you switched off read ahead: POSIX_FADV_RANDOM ;)
Just before posting v2 :) Most tests were w/ POSIX_FADV_SEQUENTIAL, but as we do the readahead manually i decided to see if the kernel wasn't interfering too much. So far i haven't seen much difference. What did not work was having a large unconditional readahead -- this fails spectacularly w/ fast-rewind.
Anyway - it seems the small read ahead in your patch didn't have the slightest chance against the multi-megabyte write back triggered when the buffer cache was at its limits.
well, yes, the readahead is adjusted to the write rate :)
However, one thing that could make a large difference is hardware. I have two local ATA disks in the vdr machine, both seagates, one older 80G and a newer 40G (came w/ the machine, i was too lazy to pull it out so it stayed there). Both are alone on an IDE channel, both have 2M cache, both are AFAICT identically configured, both have an ext3 fs. However the 40G disk is significantly slower, and the difference is huge -- you can easily tell when vdr starts using that disk, because the increase in latency for unrelated read requests is so large. OTOH the 80G disk seems not only way faster, but also much more fair to random read requests while writes are going on. Weird.
Regards,
artur
Ralf Müller wrote:
On Monday 21 November 2005 02:15, Artur Skawina wrote:
...
Are you saying you don't get any writeback activity w/ my patch?
Correct. It starts writing back when memory is filled. Not a single second earlier.
Under normal default circumstances, all dirty data should be synced to disk within 30 seconds.
This reminds me of a recent LKML thread. It noted that under some circumstances the kernel seems to forget to flush the dirty data. It didn't come to any particular conclusion but maybe there is a problem in the kernel. It might be an interesting read for you...
http://www.uwsg.iu.edu/hypermail/linux/kernel/0511.1/2043.html
Jon
Artur Skawina wrote:
This version adds a few minor fixes + cutting speed improvement. ...
When I switched to kernel 2.6 for the first time I noticed these issues, too. Selecting the CFQ I/O scheduler in the kernel solved all problems for me. Did you try that?
Oliver
Oliver Endriss wrote:
When I switched to kernel 2.6 for the first time I noticed these issues, too. Selecting the CFQ I/O scheduler in the kernel solved all problems for me. Did you try that?
yes, i use cfq too.
$ cat /sys/block/hd?/queue/scheduler
noop anticipatory deadline [cfq]
noop anticipatory deadline [cfq]
$ cat /sys/block/hd?/queue/read_ahead_kb
4096
4096
$
Hello,
On Tue, 22 Nov 2005 01:33:22 +0100 Oliver Endriss o.endriss@gmx.de wrote:
| Selecting the CFQ I/O scheduler in the kernel solved all problems
| for me. Did you try that?
Since my VDR is in the living room, i recently switched to a 100% diskless solution and until now i was having regular freezes when recording a channel and playing back a divx at the same time. (system is linux 2.6.13.2/Diskless based Debian Sid/PIII 1Ghz/VDR 1.3.36)
I, indeed, forgot to add "elevator=cfq" to the boot parameters (which i use on about all my other workstations/servers :), and up to now, it has definitely improved things: no more regular freezes (about every 10/12 sec), at least with the few channels/divx combinations i know there used to be problems with.
So all in all, thx :) You made my day :)
Truly yours,
Philippe
Philippe Gramoullé wrote:
Oliver Endriss o.endriss@gmx.de wrote:
| Selecting the CFQ I/O scheduler in the kernel solved all problems
| for me. Did you try that?
Since my VDR is in the living room, i recently switched to a 100% diskless solution and until now i was having regular freezes when recording a channel and playing back a divx at the same time. (system is linux 2.6.13.2/Diskless based Debian Sid/PIII 1Ghz/VDR 1.3.36)
I, indeed, forgot to add "elevator=cfq" to the boot parameters (which i use on about all my other workstations/servers :), and up to now, it has definitely improved things: no more regular freezes (about every 10/12 sec), at least with the few channels/divx combinations i know there used to be problems with.
if your VDR really is 100% diskless how can the IO scheduler (which controls access to block devices) make any difference?
artur