UnbufferedFile improvements v2

Message ID 437F02F4.4030602@o2.pl
State New

Commit Message

Artur Skawina Nov. 19, 2005, 10:48 a.m. UTC
  This version adds a few minor fixes + a cutting speed improvement.



well, vdr w/ the recent cUnbufferedFile changes was flushing the data
buffers in huge bursts; this was even worse than slowly filling up the
caches -- the large (IIRC ~10M) bursts caused latency problems (apps
visibly freezing etc.).

This patch makes vdr use a much more aggressive disk access strategy.
Writes are flushed out almost immediately and the IO is more evenly
distributed. While recording and/or replaying the caches do not grow and
when vdr is done accessing a video file all cached data from that file
is dropped.

I've tested this w/ both local disks and NFS mounted ones, and it seems
to do the right thing. Writes get flushed every 1..2s at a rate of
.5..1M/s instead of the >10M bursts. For async mounted NFS servers the
writes get collected by the NFS server and normally written out. Local
disks get an extra feature -- you can use the HD activity LED as a
"recording" indicator :^)

As posix_fadvise requires kernel v2.5.60 and glibc v2.2, you'll need at
least those versions to see any difference. (w/o posix_fadvise you will
not get the fdatasyncs every 10M -- if somebody really wants them they
should be controlled by a config option)
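
For illustration, a guarded call could look roughly like this -- a hypothetical
helper, not part of the patch; since the advice is only a hint, silently
skipping it on systems without posix_fadvise is safe:

  #include <fcntl.h>

  // hypothetical wrapper -- fadvise is only a hint, so ignoring it is harmless
  static inline int FadviseDrop(int fd, off_t offset, off_t len)
  {
  #ifdef POSIX_FADV_DONTNEED
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED); // 0 on success, errno value on failure
  #else
    (void)fd; (void)offset; (void)len;
    return 0;
  #endif
  }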

Possible further improvements could be:

switch from POSIX_FADV_SEQUENTIAL to POSIX_FADV_RANDOM, since we're
doing manual readahead anyway (or just leave this as is and drop the
manual readahead). Not using POSIX_FADV_RANDOM is probably one of the
causes of the "leaks", so some of the workarounds could then go too.

artur
  

Comments

Ralf Müller Nov. 20, 2005, 2:53 p.m. UTC | #1
Artur Skawina wrote:

> well, vdr w/ the recent cUnbufferedFile changes was flushing the data
> buffers in huge bursts; this was even worse than slowly filling up the
> caches -- the large (IIRC ~10M) bursts caused latency problems (apps
> visibly freezing etc).

Does this freezing apply to local disk access or only to network
filesystems? My personal VDR is a system dedicated to VDR usage which
uses a local hard disk for storage. So I have neither applications
running in parallel to vdr which could freeze, nor can I actually test
the behaviour on network devices. It seems you have both of these extra
features, so it would be nice to know more about this.

For local usage I found that IO interruptions of less than a second (10
MB burst writes on disks which deliver a hell of a lot more than 10MB/s)
have no negative side effects. But I can imagine that on 10Mbit ethernet
it could be hard to handle these bursts ... I did not think about this
when writing the initial patch ...

> This patch makes vdr use a much more aggressive disk access strategy.
> Writes are flushed out almost immediately and the IO is more evenly
> distributed. While recording and/or replaying the caches do not grow and
> when vdr is done accessing a video file all cached data from that file
> is dropped.

Actually with the patch you attached my cache _does_ grow. It does not
only grow - it displaces the inode cache, which is exactly what the
initial patch was created to avoid. To make it worse, when I cut a
recording and replay the newly cut recording at the same time, I get
major hangs in replay.

I had a look at your patch - it looked very good. But for whatever
reason it doesn't do what it is supposed to do on my VDR. I currently
don't know why it doesn't work here for replay - the code there looked good.

I like the heuristics you used to deal with read ahead - but maybe these
lead to the leaks I experience here. I will have a look at it. Maybe I
can find out something about it ...

> I've tested this w/ both local disks and NFS mounted ones, and it seems
> to do the right thing. Writes get flushed every 1..2s at a rate of
> .5..1M/s instead of the >10M bursts. 

To be honest - I did not find the place where writes get flushed in
your patch. posix_fadvise() doesn't seem to influence flushing at all.
It only applies to already written buffers. So the normal write
strategy is used with your patch - collect data until the kernel
decides to write it to disk. This leads to "collect about 300MB" here
followed by an up to 300MB burst. That is a bit heavier than the
10MB bursts before ;)

Regards
Ralf
  
Artur Skawina Nov. 21, 2005, 1:15 a.m. UTC | #2
Ralf Müller wrote:
> Artur Skawina schrieb:
> 
>> well, vdr w/ the recent cUnbufferedFile changes was flushing the data
>> buffers in huge bursts; this was even worse than slowly filling up the
>> caches -- the large (IIRC ~10M) bursts caused latency problems (apps
>> visibly freezing etc).
> 
> Does this freezing apply to local disk access or only to network
> filesystems? My personal VDR is a system dedicated to VDR usage which
> uses a local hard disk for storage. So I have neither applications
> running in parallel to vdr which could freeze, nor can I actually test
> the behaviour on network devices. It seems you have both of these extra
> features, so it would be nice to know more about this.

the freezing certainly applies to NFS -- it shows clearly if you have
some kind of monitor app graphing network traffic. It may just be the
huge amount of data shifted and associated cpu load, but the delays are
noticeable for non-rt apps running on the same machine. It's rather
obvious when eg watching tv using xawtv while recording.
As to the local disk case -- i'm not sure of the impact -- most of my
vdr data goes over NFS, and this was what made me look at the code.
There could be less of a problem w/ local disks, or I simply didn't
notice the correlation w/ vdr activity, as, unlike for network traffic,
i do not have a local IO graph on screen :)

(i _think_ i verified w/ vmstat that local disks were not immune to this, but
right now i no longer remember the details, so can't really be sure)

> For local usage I found that IO interruptions of less than a second (10
> MB burst writes on disks which deliver a hell of a lot more than 10MB/s)
> have no negative side effects. But I can imagine that on 10Mbit ethernet
> it could be hard to handle these bursts ... I did not think about this
> when writing the initial patch ...

it's a problem even on 100mbit -- while the fileserver certainly can
accept sustained 10M/s data for several seconds (at least), it's the
client, ie vdr-box, that does not behave well -- it sits almost
completely idle for minutes (zero network traffic, no writeback at all),
and then goes busy for a second or so.
I first tried various priority changes, but didn't see any visible
improvement. Having vdr running at low prio isn't really an option
anyway.

Another issue could be the fsync calls -- at least on ext3 these
apparently behave very similarly to sync(2)...

>> This patch makes vdr use a much more aggressive disk access strategy.
>> Writes are flushed out almost immediately and the IO is more evenly
>> distributed. While recording and/or replaying the caches do not grow and
>> when vdr is done accessing a video file all cached data from that file
>> is dropped.
> 
> Actually with the patch you attached my cache _does_ grow. It does not
> only grow - it displaces the inode cache, which is exactly what the
> initial patch was created to avoid. To make it worse, when I cut a
> recording and replay the newly cut recording at the same time, I get
> major hangs in replay.

oh, the cutting-trashes-cache-a-bit isn't really such a big surprise --
i was seeing something like that while testing the code -- I had hoped
the extra fadvise every 10M would fix that, but i wanted to get the
recording and replay cases right first. (the issue when cutting is
simply that we need to: a) start the writeback, and b) drop the cached data
after it has hit the disk. The problem is that we don't really know when
to do b... For low write rates the heuristic seems to work, for high
rates it might fail. Yes, fdatasync obviously will work, but this is the
sledgehammer approach :) The fadvise(0,0) solution was a first try at
using a slightly smaller hammer. Keeping a dirty-list and flushing it
after some time would be the next step if fadvise isn't enough.)
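
Roughly, such a dirty-list could look like the sketch below -- purely
hypothetical, nothing like it is in the patch, and all the names are made up:

  #include <sys/types.h>
  #include <deque>
  #include <time.h>
  #include <fcntl.h>

  struct sDirtyRange { off_t begin; off_t end; time_t flushed; };

  class cDirtyList {
  private:
    std::deque<sDirtyRange> ranges;
  public:
    void StartWriteback(int fd, off_t begin, off_t end) {
      // first advice: starts asynchronous writeback of the still-dirty pages
      posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
      sDirtyRange r; r.begin = begin; r.end = end; r.flushed = time(NULL);
      ranges.push_back(r);
      }
    void DropOld(int fd, int ageSeconds) {
      // second advice, a few seconds later: by now the pages should be clean
      // and will actually get dropped from the page cache
      while (!ranges.empty() && time(NULL) - ranges.front().flushed >= ageSeconds) {
            sDirtyRange &r = ranges.front();
            posix_fadvise(fd, r.begin, r.end - r.begin, POSIX_FADV_DONTNEED);
            ranges.pop_front();
            }
      }
    };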

How does the cache behave when _not_ cutting? Over here it looks ok,
i've done several recordings while playing back others, and the cache
was basically staying the same. (as this is not a dedicated vdr box it
is however sometimes hard to be sure)

> I had a look at your patch - it looked very good. But for whatever
> reason it doesn't do what it is supposed to do on my VDR. I currently
> don't know why it doesn't work here for replay - the code there looked
> good.

in v1 i was using a relatively small readahead window -- maybe for a
slow disk it was _too_ small. In v2 it's a little bigger, maybe that
will help (i increased it to make sure the readahead worked for
fast-forward, but so far i haven't been able to see much difference).
But I don't usually replay anything while cutting, so this hasn't really
been tested...

(BTW, with the added readahead in the v2 patch, vdr seems to come close
to saturating a 100M connection when cutting. Even when _both_ the
source and destination are on the same NFSv3 mounted disk, which kind
of surprised me. The LocalDisk->NFS rate and v/v seems to be limited by
the network. I didn't check localdisk->localdisk (lack of sufficient
disk space). I didn't do any real benchmarking; these are estimates based
on observing the rate at which free disk space decreases and the network
traffic.)

> I like the heuristics you used to deal with read ahead - but maybe these
> lead to the leaks I experience here. I will have a look at it. Maybe I
> can find out something about it ...

Please do. I did, and posted this to get others to look at that code and
hopefully come up w/ a strategy which works for everyone.
For cutting I was going to switch to O_DIRECT, until i realized we would
then still need a fallback strategy for old kernels and NFS...

The current vdr behavior isn't really acceptable -- at the very least
the fsyncs have to be configurable -- even a few hundred megabytes
needlessly dirtied by vdr is still much better than the bursts of
traffic, disk and cpu usage.
I personally don't mind the cache trashing so much; it would be enough
to keep vdr happily running in the background without disturbing other
tasks. (one of the reasons is that while keeping the recording list in
cache seems to help local disks, it doesn't really help for NFS -- you
still get lots of NFS traffic every time vdr decides to reread the
directory structure. As both the client and server could fit the dir
tree in ram the limiting factor becomes the network latency)

>> I've tested this w/ both local disks and NFS mounted ones, and it seems
>> to do the right thing. Writes get flushed every 1..2s at a rate of
>> .5..1M/s instead of the >10M bursts. 
> 
> > To be honest - I did not find the place where writes get flushed
> your patch. posix_fadvise() doesn't seem to influence flushing at all.

Hmm, what glibc/kernel?
It works here w/ glibc-2.3.90 and linux-2.6.14.

Here's "vmstat 1" output; vdr (patched 1.3.36) is currently doing a
recording to local disk:

procs -----------memory---------- ---swap-- -----io---- --system------cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
  1  0   9168 202120    592  22540    0    0     0     0 3584  1350  0  1 99  0
  0  0   9168 202492    592  22052    0    0     0   800 3596  1330  1  0 99  0
  0  0   9168 202368    592  22356    0    0     0     0 3576  1342  1  0 99  0
  0  0   9168 202492    592  21836    0    0     0   804 3628  1350  0  0 100  0
  0  0   9168 202492    592  22144    0    0     0     0 3573  1346  1  1 98  0
  0  0   9168 202244    592  22452    0    0     0     0 3629  1345  1  0 99  0
  1  0   9168 202492    592  21956    0    0     0   800 3562  1350  0  0 100  0
  0  0   9168 202368    592  22260    0    0     0     0 3619  1353  1  0 99  0
  0  0   9168 202120    592  22568    0    0     0     0 3616  1357  1  1 98  0
  0  0   9168 202492    592  22044    0    0     0   952 3617  1336  0  0 100  0
  0  0   9168 202368    596  22352    0    0     0     0 3573  1356  1  0 99  0
  1  0   9168 202616    596  21724    0    0     0   660 3609  1345  0  0 100  0
  0  0   9168 202616    596  22000    0    0     0     0 3569  1338  1  1 98  0
  0  0   9168 202368    596  22304    0    0     0     0 3573  1335  1  0 99  0
  1  0   9168 202492    596  21956    0    0     0   896 3644  1360  0  1 99  0
  0  0   9168 202492    596  22232    0    0     0     0 3592  1327  1  0 99  0
  0  0   9168 202120    596  22536    0    0     0     0 3571  1333  0  0 100  0
  0  0   9168 202616    596  21968    0    0     0   800 3575  1329 11  3 86  0
  0  0   9168 202368    596  22244    0    0     0     0 3604  1350  1  0 99  0
  0  0   9168 202492    596  21756    0    0     0   820 3585  1326  0  1 99  0
  0  0   9168 202492    612  22060    0    0     8   140 3632  1369  1  1 89  9
  0  0   9168 202244    612  22336    0    0     0     0 3578  1328  1  0 99  0
  0  0   9168 202492    612  21796    0    0     0   784 3619  1360  0  0 100  0
  0  0   9168 202492    628  22072    0    0     8   104 3559  1317  2  0 96  2
  0  0   9168 202244    632  22376    0    0     0     0 3604  1348  1  0 99  0
  0  0   9168 202492    632  21904    0    0     0   800 3695  1402  0  0 100  0
  0  0   9168 202368    632  22180    0    0     0     0 3775  1456  1  1 98  0
  0  0   9168 202120    632  22484    0    0     0     0 3699  1416  0  1 99  0
  0  0   9168 202492    632  21992    0    0     0   804 3774  1465  1  0 99  0
  1  0   9168 202236    632  22268   32    0    32     0 3810  1570  3  1 93  3
  0  0   9168 202360    632  21776    0    0     0   820 3896  1690  1  1 98  0

the 'bo' column shows the writeout caused by vdr. Also note the 'free'
and 'cache' fields fluctuate a bit, but do not grow. Hmm, now i noticed
the slowly growing 'buff' -- is this causing you problems?
I didn't mind this here, as there's clearly plenty of free RAM around.
Will have to investigate what happens under some memory pressure.

Are you saying you don't get any writeback activity w/ my patch?

With no posix_fadvise and no fdatasync calls in the write path i get
almost no writeout, just multi-megabyte bursts every minute (triggered
probably by ext3 journal commit (interval set to 60s) and/or memory
pressure).

> It only applies to already written buffers. So the normal write

/usr/src/linux/mm/fadvise.c should contain the implementation of the various
fadvise modes in a linux 2.6 kernel. It certainly does trigger writeback here.
Both in the local disk case, and on NFS, where it causes a similar traffic pattern.

> strategy is used with your patch - collect data until the kernel
> decides to write it to disk. This leads to "collect about 300MB" here
> followed by an up to 300MB burst. That is a bit heavier than the
> 10MB bursts before ;)

See vmstat output above. Are you sure you have a working posix_fadvise?
If not, that would also explain the hang during playback as no readahead
was actually taking place... (to be honest, i don't think that you need
any manual readahead at all in a normal-playback situation; especially
as the kernel will by default do some. It's only when the disk is
getting busier that the benefits of readahead show up. At least this is
what i saw here)
What happens when you start a replay and then end it? Is the memory
freed immediately?


Thanks for testing and the feedback.

Regards,

artur
  
Ralf Müller Nov. 21, 2005, 10:37 a.m. UTC | #3
On Monday, 21 November 2005 02:15, Artur Skawina wrote:

> the freezing certainly applies to NFS -- it shows clearly if you have

Ok - I see.

> it's a problem even on 100mbit -- while the fileserver certainly can
> accept sustained 10M/s data for several seconds (at least), it's the
> client, ie vdr-box, that does not behave well -- it sits almost
> completely idle for minutes (zero network traffic, no writeback at
> all), and then goes busy for a second or so.

But this very much sounds like an NFS problem - and much less like a VDR 
problem ...

> [...] I had
> hoped the extra fadvise every 10M would fix that, but i wanted to get
> the recording and replay cases right first. (the issue when cutting
> is simply that we need to: a) start the writeback, and b) drop the
> cached data after it has hit the disk. The problem is that we don't
> really know when to do b...

That's exactly the problem here ... without special force my kernel seems 
to prefer using memory instead of disk ...

> For low write rates the heuristic seems 
> to work, for high rates it might fail. Yes, fdatasync obviously will
> work, but this is the sledgehammer approach :)

I know. I also don't like this approach. But at least it worked (here). 

> The fadvise(0,0) 
> solution was a first try at using a slightly smaller hammer. Keeping
> a dirty-list and flushing it after some time would be the next step
> if fadvise isn't enough.)

How do you know what is still dirty in case of writes?

> How does the cache behave when _not_ cutting? Over here it looks ok,
> i've done several recordings while playing back others, and the cache
> was basically staying the same. (as this is not a dedicated vdr box
> it is however sometimes hard to be sure)

With the active read ahead I even have leaks when only reading - the 
non-blocking reads initiated by the WILLNEED advice seem to keep pages
in the buffer cache.

> in v1 i was using a relatively small readahead window -- maybe for a
> slow disk it was _too_ small. In v2 it's a little bigger, maybe that
> will help (i increased it to make sure the readahead worked for
> fast-forward, but so far i haven't been able to see much difference).
> But I don't usually replay anything while cutting, so this hasn't
> really been tested...

My initial intention when trying to use an active read ahead was to 
have no hangs even when another disk needs to spin up. On my system I 
sometimes have this problem and it is annoying. So a read ahead of 
several megabytes would be needed here - but even without such a huge 
read ahead I get these annoying leaks here. For normal operation 
(replay) they could be avoided by increasing the region which has to be 
cleared to at least the size of the read ahead.
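
For illustration, the setreadahead() knob the patch adds for the cutter could
in principle be used the same way on the replay side (a hypothetical call
site, not something either version of the patch does):

  cUnbufferedFile *replayFile = cUnbufferedFile::Create(FileName, O_RDONLY);
  if (replayFile)
     replayFile->setreadahead(MEGABYTE(4)); // enough to bridge a disk spin-up,
                                            // at the cost of more cached pages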
   
> (BTW, with the added readahead in the v2 patch, vdr seems to come
> close to saturating a 100M connection when cutting. Even when _both_
> the source  and destination are on the same NFSv3 mounted disk, which
> kind of surprised me. LocalDisk->NFS rate  and v/v seems to be
> limited by the network. I didn't check localdisk->localdisk (lack of
> sufficient diskpace). Didn't do any real benchmarking, these are
> estimations based on observing the free diskspace decrease rate and
> network traffic)

Cool!

> The current vdr behavior isn't really acceptable -- at the very least
> the fsyncs have to be configurable -- even a few hundred megabytes
> needlessly dirtied by vdr is still much better than the bursts of
> traffic, disk and cpu usage. I personally don't mind the cache
> trashing so much; it would be enough to keep vdr happily running
> in the background without disturbing other tasks.

Depends on the use case. You are absolutely right in the NFS case. In 
the "dedicated to VDR standalone" case this is different. By throwing 
away the inode cache it makes usage of big recording archives 
uncomfortable - it takes up to 20 seconds to scan my local recordings 
directory. That's a long time when you just want to select a 
recording ...

> > To be honest - I did not find the place where writes get flushed
> > in your patch. posix_fadvise() doesn't seem to influence flushing
> > at all.
>
> Hmm, what glibc/kernel?
> It works here w/ glibc-2.3.90 and linux-2.6.14.

SuSE 9.1:
GNU C Library stable release version 2.3.3 (20040405)
Kernel 2.6.14

> Here's "vmstat 1" output; vdr (patched 1.3.36) is currently doing a
> recording to local disk:
>
> procs -----------memory---------- ---swap-- -----io---- ...
> [ ... ]
>
> the 'bo' column shows the writeout caused by vdr. Also note the
> 'free' and 'cache' field fluctuate a bit, but do not grow. Hmm, now i
> noticed the slowly growing 'buff' -- is this causing you problems?

I don't think so - this would not fill my RAM in the next few weeks ;) I 
usually have 300MB left on the box (yes - it has quite a lot of memory for 
just a VDR ...)

> I didn't mind this here, as there's clearly plenty of free RAM
> around. Will have to investigate what happens under some memory
> pressure.

As I said - at least here there is no pressure.

> Are you saying you don't get any writeback activity w/ my patch?

Correct. It starts writing back when memory is filled. Not a single 
second earlier.

> With no posix_fadvise and no fdatasync calls in the write path i get
> almost no writeout, just multi-megabyte bursts every minute (triggered
> probably by ext3 journal commit (interval set to 60s) and/or memory
> pressure).

Using reiserfs here. I remember having configured it for lazy disk 
operations ... maybe this is the source of the above results. The idea 
was to collect system writes - to not spin up the disks if not 
absolutely necessary. But this obviously also results in collecting VDR 
writes ... anyway I think this is a valid case too. At least for 
dedicated "multimedia" stations ... A bit more control over VDR IO 
would be a great thing to have.

> > It only applies to already written buffers. So the normal write
>
> /usr/src/linux/mm/fadvise.c should contain the implementation of the
> various fadvise modes in a linux 2.6 kernel. It certainly does
> trigger writeback here. Both in the local disk case, and on NFS,
> where it causes a similar traffic pattern.

Will have a look at the code.

> See vmstat output above. Are you sure you have a working
> posix_fadvise?

Quite sure - the current VDR version is performing perfectly well - 
within its limits.

> If not, that would also explain the hang during 
> playback as no readahead was actually taking place... (to be honest,
> i don't think that you need any manual readahead at all in a
> normal-playback situation; especially as the kernel will by default
> do some. It's only when the disk is getting busier that the benefits
> of readahead show up. At least this is what i saw here)

Remember - you switched off read ahead: POSIX_FADV_RANDOM
;) 

Anyway - it seems the small read ahead in your patch didn't have the 
slightest chance against the multi-megabyte writeback triggered when 
the buffer cache was at its limit.

> What happens when you start a replay and then end it? Is the memory
> freed immediately?

I will have a look at it again.

Thanks a lot for working on the problem
Regards
Ralf
  
Artur Skawina Nov. 21, 2005, 6:05 p.m. UTC | #4
Ralf Müller wrote:
> On Monday, 21 November 2005 02:15, Artur Skawina wrote:
>> client, ie vdr-box, that does not behave well -- it sits almost
>> completely idle for minutes (zero network traffic, no writeback at
>> all), and then goes busy for a second or so.
> 
> But this very much sounds like a NFS-problem - and much less like a VDR 
> problem ...

this is perfectly normal behavior; it's the same as for the local disk case. The 
problem is that since the vdr box isn't under any memory pressure it collects 
all the writes. If not for the fdatasyncs it would start writing the data 
asynchronously after some time, when it would need some RAM or had too many 
dirty pages. The problem is that vdr does not let it do that -- after 10M it 
asks the system to commit all the data to disk and return status. So the box 
does just that -- flushes the data as fast as possible in order to complete the 
synchronous request.
This is where fadvise(DONTNEED) helps -- it tells the system that we're not going 
to access the written data any time soon, so it starts committing that buffered 
data back to disk immediately. Just as it would if it was under memory pressure, 
except now there is none; and once the data gets to disk it no longer needs to 
be treated as dirty and can be easily freed.
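
To make the difference concrete, here is a minimal sketch of the streaming
idea (an illustrative helper only, assuming linux 2.6 fadvise semantics --
this is not the patch code):

  #include <unistd.h>
  #include <fcntl.h>

  // streaming write: hint the kernel after every chunk instead of forcing a
  // synchronous flush every 10M (the fdatasync() approach described above)
  ssize_t WriteChunk(int fd, const void *data, size_t size, off_t &flushedUpTo)
  {
    ssize_t n = write(fd, data, size);
    if (n > 0) {
       off_t pos = lseek(fd, 0, SEEK_CUR);
       // start asynchronous writeback of the range we won't touch again;
       // a later DONTNEED over the same (by then clean) range drops the pages
       posix_fadvise(fd, flushedUpTo, pos - flushedUpTo, POSIX_FADV_DONTNEED);
       flushedUpTo = pos;
       }
    return n;
  }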

>> [...] I had
>> hoped the extra fadvise every 10M would fix that, but i wanted to get
>> the recording and replay cases right first. (the issue when cutting
>> is simply that we need to: a) start the writeback, and b) drop the
>> cached data after it has hit the disk. The problem is that we don't
>> really know when to do b...
> 
> That's exactly the problem here ... without special force my kernel seems 
> to prefer using memory instead of disk ...

if you have told it to do exactly that, using that reiserfs setting mentioned 
below, well, i guess it tries to do its best to obey :)

>> For low write rates the heuristic seems 
>> to work, for high rates it might fail. Yes, fdatasync obviously will
>> work, but this is the sledgehammer approach :)
> 
> I know. I also don't like this approach. But at least it worked (here). 
> 
>> The fadvise(0,0) 
>> solution was a first try at using a slightly smaller hammer. Keeping
>> a dirty-list and flushing it after some time would be the next step
>> if fadvise isn't enough.)
> 
> How do you know what is still dirty in case of writes?

The strategy currently is this: after writing some data to the file (~1M) we use 
fadvise to make the kernel start writing it to disk; after some time we call 
fadvise on the same data _again_, and by now it has hopefully already hit the disk, 
is clean and will be dropped. (I actually call fadvise three times, not two, just 
to be sure.) This seems to work fine for slow sequential writes, such as when 
recording; for cutting we create the dirty data faster than it can be written 
back to disk - this is where the global fadvise(DONTNEED) was supposed to help, 
and in the few cutting tests i did it seemed to be enough.
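
In code the scheme is roughly the following -- a simplified sketch, not the
actual cUnbufferedFile::Write() bookkeeping; the helper and its names are
made up:

  #include <fcntl.h>

  // called after each ~1M written; advising the *previous* window again is what
  // finally drops its pages -- by now the writeback started by the earlier call
  // has usually completed, so the pages are clean
  void AdviseWritten(int fd, off_t &prevBegin, off_t &prevEnd, off_t begin, off_t end)
  {
    if (prevBegin >= 0 && prevEnd > prevBegin)
       posix_fadvise(fd, prevBegin, prevEnd - prevBegin, POSIX_FADV_DONTNEED); // 2nd pass: drop
    posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);                // 1st pass: start writeback
    prevBegin = begin;
    prevEnd = end;
  }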

>> How does the cache behave when _not_ cutting? Over here it looks ok,
>> i've done several recordings while playing back others, and the cache
>> was basically staying the same. (as this is not a dedicated vdr box
>> it is however sometimes hard to be sure)
> 
> With the active read ahead I even have leaks when only reading - the 
> non-blocking reads initiated by the WILLNEED advice seem to keep pages
> in the buffer cache.

maybe another reiserfs issue? does it occur when sequentially reading, ie on 
normal playback? Or only when also seeking around in the file? In the latter 
case i was seeing some small leaks too, that was the reason for the fadvise 
calls every X jumps.

> My initial intention when trying to use an active read ahead was to 
> have no hangs even when another disk needs to spin up. On my system I 
> sometimes have this problem and it is annoying. So a read ahead of 
> several megabytes would be needed here - but even without such a huge 
> read ahead I get these annoying leaks here. For normal operation 

hmm, the readahead is only per-file -- do you have filesystems spanning several 
disks, _some_ of which are spun down?

> (replay) they could be avoided by increasing the region which has to be 
> cleared to at least the size of the read ahead.

Isn't this exactly what is currently happening (both w/o and with my patch)?

>> The current vdr behavior isn't really acceptable -- at the very least
>> the fsyncs have to be configurable -- even a few hundred megabytes
>> needlessly dirtied by vdr is still much better than the bursts of
>> traffic, disk and cpu usage. I personally don't mind the cache
>> trashing so much; it would be enough to keep vdr happily running
>> in the background without disturbing other tasks.
> 
> Depends on the use case. You are absolutely right in the NFS case. In 
> the "dedicated to VDR standalone" case this is different. By throwing 

A config option "Write strategy: NORMAL|STREAMING|BURST" would be enough for 
everyone :) (where STREAMING is what my patch does, at least here, BURST is with 
the fdatasyncs followed by fadvise(DONTNEED), and NORMAL is w/o both)
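
Hypothetically the write path could then branch on it roughly like this (the
option and all the names are made up for illustration -- nothing like this
exists in vdr):

  #include <unistd.h>
  #include <fcntl.h>

  enum eWriteStrategy { wsNormal, wsStreaming, wsBurst };

  // hypothetical hook, called once WRITE_BUFFER bytes have been collected
  void FlushCollected(int fd, off_t begin, off_t end, eWriteStrategy strategy)
  {
    switch (strategy) {
      case wsBurst:
           fdatasync(fd);      // force everything to disk synchronously first...
           // fall through     // ...then drop the now-clean pages
      case wsStreaming:
           if (begin >= 0 && end > begin)
              posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
           break;
      case wsNormal:
           break;              // leave writeback entirely to the kernel
      }
  }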

> away the inode cache it makes usage of big recording archives 
> uncomfortable - it takes up to 20 seconds to scan my local recordings 
> directory. That's a long time when you just want to select a 
> recording ...

It seemed much longer than 20s here :)
Now that vdr caches the list, it's not a big problem anymore.

>>Are you saying you don't get any writeback activity w/ my patch?
> 
> Correct. It starts writing back when memory is filled. Not a single 
> second earlier.
> 
>> With no posix_fadvise and no fdatasync calls in the write path i get
>> almost no writeout, just multi-megabyte bursts every minute (triggered
>> probably by ext3 journal commit (interval set to 60s) and/or memory
>> pressure).
> 
> Using reiserfs here. I remember having configured it for lazy disk 
> operations ... maybe this is the source for the above results. The idea 
> has been to collect system writes - to not spin up the disks if not 
> absolutely necessary. But this obviously also results in collecting VDR 
> writes ... anyway I think this is a valid case too. At least for 
> dedicated "multimedia" stations ... A bit more control about VDR IO 
> would be a great thing to have.

reiserfs collecting all writes would explain the behavior; whether it's a good 
thing or not in this scenario i'm not sure. Apparently this does not give you 
any way to force disk writes, other than a synchronous flush (ie fdatasync)?...

>> i don't think that you need any manual readahead at all in a
>> normal-playback situation; especially as the kernel will by default
>> do some. It's only when the disk is getting busier that the benefits
>> of readahead show up. At least this is what i saw here)
> 
> Remember - you switched off read ahead: POSIX_FADV_RANDOM
> ;) 

Just before posting v2 :)
Most tests were w/ POSIX_FADV_SEQUENTIAL, but as we do the readahead manually i 
decided to see if the kernel wasn't interfering too much. So far i haven't seen 
much difference. What did not work was having a large unconditional readahead -- 
this fails spectacularly w/ fast-rewind.

> Anyway - it seems the small read ahead in your patch didn't have the 
> slightest chance against the multi-megabyte writeback triggered when 
> the buffer cache was at its limit.

well, yes, the readahead is adjusted to the write rate :)

However, one thing that could make a large difference is hardware.
I have two local ATA disks in the vdr machine, both Seagates, one older 80G and 
a newer 40G (came w/ the machine, i was too lazy to pull it out so it stayed 
there). Both are alone on an IDE channel, both have 2M cache, both are AFAICT 
identically configured, both have ext3 fs. However the 40G disk is significantly 
slower, and the difference is huge -- you can easily tell when vdr starts using 
that disk, because the increase in latency for unrelated read requests is so 
large. OTOH the 80G disk seems not only way faster, but also much more fair to 
random read requests while writes are going on. Weird.

Regards,

artur
  
Jon Burgess Nov. 21, 2005, 7:45 p.m. UTC | #5
Ralf Müller wrote:
> On Monday, 21 November 2005 02:15, Artur Skawina wrote:
...
> 
>>Are you saying you don't get any writeback activity w/ my patch?
> 
> 
> Correct. It starts writing back when memory is filled. Not a single 
> second earlier.
> 

Under normal default circumstances, all dirty data should be synced to 
disk within 30 seconds.

This reminds me of a recent LKML thread. It noted that under some 
circumstances the kernel seems to forget to flush the dirty data. It 
didn't come to any particular conclusion but maybe there is a problem in 
the kernel. It might be an interesting read for you...

http://www.uwsg.iu.edu/hypermail/linux/kernel/0511.1/2043.html


	Jon
  
Oliver Endriss Nov. 22, 2005, 12:33 a.m. UTC | #6
Artur Skawina wrote:
> This version adds a few minor fixes + cutting speed improvement.
> ...

When I switched to kernel 2.6 for the first time I noticed these issues,
too. Selecting the CFQ I/O scheduler in the kernel solved all problems
for me. Did you try that?

Oliver
  
God Nov. 22, 2005, 1:02 p.m. UTC | #7
Oliver Endriss wrote:
> When I switched to kernel 2.6 for the first time I noticed these issues,
> too. Selecting the CFQ I/O scheduler in the kernel solved all problems
> for me. Did you try that?

yes, i use cfq too.

$ cat /sys/block/hd?/queue/scheduler
noop anticipatory deadline [cfq]
noop anticipatory deadline [cfq]
$ cat /sys/block/hd?/queue/read_ahead_kb
4096
4096
$
  
Philippe Gramoullé Nov. 22, 2005, 1:06 p.m. UTC | #8
Hello,

On Tue, 22 Nov 2005 01:33:22 +0100
Oliver Endriss <o.endriss@gmx.de> wrote:

  | Selecting the CFQ I/O scheduler in the kernel solved all problems
  | for me. Did you try that?

Since my VDR is in the living room, i recently switched to a 100% diskless solution
and until now i was having regular freezes when recording a channel and playing back
a divx at the same time.
(system is linux 2.6.13.2/Diskless based Debian Sid/PIII 1Ghz/VDR 1.3.36)

I, indeed, forgot to add "elevator=cfq" in the boot parameters (which i use on about all my
other workstations/server :), and up to now, it definitely improved things: No more
regular freezes (about every 10/12 sec), at least with the few channels/divx combinations
i know there used to be problems with.

So all in all, thx :) You made my day :)

Truly yours,

Philippe
  
Artur Skawina Nov. 22, 2005, 2:30 p.m. UTC | #9
Philippe Gramoullé wrote:
> Oliver Endriss <o.endriss@gmx.de> wrote:
> 
>   | Selecting the CFQ I/O scheduler in the kernel solved all problems
>   | for me. Did you try that?
> 
> Since my VDR is in the living room, i recently switched to a 100% diskless solution
> and until now i was having regular freezes when recording a channel and playing back
> a divx at the same time.
> (system is linux 2.6.13.2/Diskless based Debian Sid/PIII 1Ghz/VDR 1.3.36)
> 
> I, indeed, forgot to add "elevator=cfq" in the boot parameters (which i use on about all my
> other workstations/server :), and up to now, it definitely improved things: No more
> regular freezes (about every 10/12 sec), at least with the few channels/divx combinations
> i know there used to be problems with.

if your VDR really is 100% diskless, how can the IO scheduler (which controls 
access to block devices) make any difference?

artur
  

Patch

--- vdr-1.3.36.org/cutter.c	2005-10-31 13:26:44.000000000 +0100
+++ vdr-1.3.36/cutter.c	2005-11-18 03:20:50.000000000 +0100
@@ -66,6 +66,7 @@  void cCuttingThread::Action(void)
      toFile = toFileName->Open();
      if (!fromFile || !toFile)
         return;
+     fromFile->setreadahead(MEGABYTE(10));
      int Index = Mark->position;
      Mark = fromMarks.Next(Mark);
      int FileSize = 0;
@@ -90,6 +91,7 @@  void cCuttingThread::Action(void)
            if (fromIndex->Get(Index++, &FileNumber, &FileOffset, &PictureType, &Length)) {
               if (FileNumber != CurrentFileNumber) {
                  fromFile = fromFileName->SetOffset(FileNumber, FileOffset);
+                 fromFile->setreadahead(MEGABYTE(10));
                  CurrentFileNumber = FileNumber;
                  }
               if (fromFile) {
--- vdr-1.3.36.org/tools.c	2005-11-04 17:33:18.000000000 +0100
+++ vdr-1.3.36/tools.c	2005-11-18 20:33:46.000000000 +0100
@@ -851,8 +851,7 @@  bool cSafeFile::Close(void)
 
 // --- cUnbufferedFile -------------------------------------------------------
 
-#define READ_AHEAD MEGABYTE(2)
-#define WRITE_BUFFER MEGABYTE(10)
+#define WRITE_BUFFER KILOBYTE(800)
 
 cUnbufferedFile::cUnbufferedFile(void)
 {
@@ -869,7 +868,15 @@  int cUnbufferedFile::Open(const char *Fi
   Close();
   fd = open(FileName, Flags, Mode);
   begin = end = ahead = -1;
+  readahead = 16*1024;
+  pendingreadahead = 0;
   written = 0;
+  totwritten = 0;
+  if (fd >= 0) {
+     // we really mean POSIX_FADV_SEQUENTIAL, but we do our own readahead
+     // so turn off the kernel one.
+     posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
+     }
   return fd;
 }
 
@@ -880,10 +887,10 @@  int cUnbufferedFile::Close(void)
         end = ahead;
      if (begin >= 0 && end > begin) {
         //dsyslog("close buffer: %d (flush: %d bytes, %ld-%ld)", fd, written, begin, end);
-        if (written)
+        if (0 && written)
            fdatasync(fd);
-        posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
         }
+     posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
      begin = end = ahead = -1;
      written = 0;
      }
@@ -899,35 +906,89 @@  off_t cUnbufferedFile::Seek(off_t Offset
   return -1;
 }
 
+// when replaying and going eg FF->PLAY the position jumps back 2..8M
+// hence we might not want to drop that data at once. 
+// Ignoring for now to avoid making this even more complex, but we could
+// at least try to handle the common cases
+// (PLAY->FF->PLAY, small jumps, moving editing marks etc)
+
 ssize_t cUnbufferedFile::Read(void *Data, size_t Size)
 {
   if (fd >= 0) {
      off_t pos = lseek(fd, 0, SEEK_CUR);
-     // jump forward - adjust end position
-     if (pos > end)
-        end = pos;
-     // after adjusting end - don't clear more than previously requested
-     if (end > ahead)
-        end = ahead;
-     // jump backward - drop read ahead of previous run
-     if (pos < begin)
-        end = ahead;
+     off_t jumped = pos-end; // nonzero means we're not at the last offset - some kind of jump happened.
+     if (jumped) {
+        pendingreadahead += ahead-end+KILOBYTE(64);
+        // jumped forward? - treat as if we did read all the way to current pos.
+        if (pos > end) {
+           end = pos;
+           // but clamp at ahead so we don't clear more than previously requested.
+           // (would be mostly harmless anyway, unless we got more than one reader of this file)
+           // add a little extra readahead, JIC the kernel prefetched more than we requested.
+           if (end > (ahead+KILOBYTE(128)))
+              end = ahead+KILOBYTE(128);
+        }
+        // jumped backward? - drop both last read _and_ read-ahead
+        if (pos < begin)
+           end = ahead+KILOBYTE(128);
+        // jumped backward, but still inside prev read window? - pretend we read less.
+        if ((pos >= begin) && (pos < end))
+           end = pos;
+        }
+        
+     ssize_t bytesRead = safe_read(fd, Data, Size);
+     
+     // now drop all data accessed during _previous_ Read().
      if (begin >= 0 && end > begin)
-        posix_fadvise(fd, begin - KILOBYTE(200), end - begin + KILOBYTE(200), POSIX_FADV_DONTNEED);//XXX macros/parameters???
+        posix_fadvise(fd, begin, end-begin, POSIX_FADV_DONTNEED);
+        
      begin = pos;
-     ssize_t bytesRead = safe_read(fd, Data, Size);
      if (bytesRead > 0) {
         pos += bytesRead;
-        end = pos;
         // this seems to trigger a non blocking read - this
         // may or may not have been finished when we will be called next time.
         // If it is not finished we can't release the not yet filled buffers.
         // So this is commented out till we find a better solution.
-        //posix_fadvise(fd, pos, READ_AHEAD, POSIX_FADV_WILLNEED);
-        ahead = pos + READ_AHEAD;
+        
+        // Hmm, it's obviously harmless if we're actually going to read the data
+        // -- the whole point of read-ahead is to start the IO early...
+        // The comment above applies only when we jump somewhere else _before_ the
+        // IO started here finishes. How common would that be? Could be handled eg
+        // by posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) called some time after
+        // we detect a jump. Ignoring this for now. /AS
+
+        // Ugh, it seems to cause some "leaks" at every jump... Either the
+        // brute force approach mentioned above should work (it's not like this is
+        // much different than O_DIRECT) or keeping notes about the ahead reads and
+        // flushing them after some time. the latter seems overkill though, trying
+        // the former...
+
+        //syslog(LOG_DEBUG,"jump: %06ld ra: %06ld size: %ld", jumped, (long)readahead, (long)Size);
+
+        // no jump? also permit small jump still inside readahead window (FF).
+        if (jumped>=0 && jumped<=(off_t)readahead) {
+           if ( readahead <= Size*4 ) // automagically tune readahead size.
+              readahead = Size*4;
+           posix_fadvise(fd, pos, readahead, POSIX_FADV_WILLNEED);
+           ahead = pos + readahead;
+           }
+        else {
+           // jumped - we really don't want any readahead now. otherwise
+           // eg fast-rewind gets in trouble.
+           ahead = pos;
+
+           // flush it all; mostly to get rid of nonflushed readahead coming
+           // from _previous_ jumps. ratelimited.
+           // the accounting is _very_ inaccurate, i've seen ~50M get flushed
+           // when the limit was set to 4M. As long as this triggers after
+           // _some_ jumps we should be ok though.
+           if (pendingreadahead > MEGABYTE(2)) {
+              posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+              pendingreadahead = 0;
+              }
+           }
         }
-     else
-        end = pos;
+     end = pos;
      return bytesRead;
      }
   return -1;
@@ -950,11 +1011,19 @@  ssize_t cUnbufferedFile::Write(const voi
            end = pos + bytesWritten;
         if (written > WRITE_BUFFER) {
            //dsyslog("flush buffer: %d (%d bytes, %ld-%ld)", fd, written, begin, end);
-           fdatasync(fd);
-           if (begin >= 0 && end > begin)
-              posix_fadvise(fd, begin, end - begin, POSIX_FADV_DONTNEED);
+           totwritten += written;
+           if (begin >= 0 && end > begin) {
+              off_t headdrop = max((long)begin&~4095,(long)WRITE_BUFFER*2);
+              posix_fadvise(fd, (begin&~4095)-headdrop, end - begin + headdrop, POSIX_FADV_DONTNEED);
+              }
            begin = end = -1;
            written = 0;
+           // the above fadvise() works when recording, but seems to leave cached
+           // data around when writing at a high rate (eg cutting). Hence...
+           if (totwritten > MEGABYTE(20)) {
+              posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
+              totwritten = 0;
+              }
            }
         }
      return bytesWritten;
--- vdr-1.3.36.org/tools.h	2005-11-05 11:54:39.000000000 +0100
+++ vdr-1.3.36/tools.h	2005-11-18 03:13:31.000000000 +0100
@@ -209,6 +209,9 @@  private:
   off_t end;
   off_t ahead;
   ssize_t written;
+  ssize_t totwritten;
+  size_t readahead;
+  size_t pendingreadahead;
 public:
   cUnbufferedFile(void);
   ~cUnbufferedFile();
@@ -218,6 +221,7 @@  public:
   ssize_t Read(void *Data, size_t Size);
   ssize_t Write(const void *Data, size_t Size);
   static cUnbufferedFile *Create(const char *FileName, int Flags, mode_t Mode = DEFFILEMODE);
+  void setreadahead(size_t ra) { readahead = ra; };
   };
 
 class cLockFile {