[MPlayer-dev-eng] [PATCH] Use posix_fadvise in stream_file if available

Tobias Diedrich ranma at tdiedrich.de
Mon Nov 9 18:26:37 CET 2009


Reimar Döffinger wrote:
> > So, the "don't call fadvise every time with the whole 1MB area" I
> > had in it at first was not premature.
> 
> Yes it was, the fadvise at most doubles the number of syscalls, so
> that alone could at most double system time.
> Anything beyond that obviously _must_ be inherently due to fadvise
> itself.

No, I did it because I wanted to avoid too much repeated work in the
syscall for the overlapping area.
After all, the kernel has to look up the page entries for the whole
area to see whether they are already in the pagecache and, if not,
start the async read for each missing page.
And page lookup is expensive, as far as I know from reading about
kernel issues on lwn.net.

> > Let's try a slightly modified patch that looks ahead PREFETCH_LEN
> > and uses max_len for the length...
> 
> Is there a _really_ good reasoning for that and a complete explanation
> for the numbers or are you just unknowingly optimizing for the interleaving
> pattern of that specific file you tested?

Yes.
The reasoning is that after the initial fadvise for the whole
prefetch area we tell the kernel _once_ about each new max_len chunk
of the sliding window, instead of telling it 512 times (a 1MB window
sliding in 2048-byte steps would otherwise report each chunk 512 times).

This optimizes the normal, well-interleaved file best, since no max_len
chunk is reported to the kernel twice if we read the file linearly.

In the badly interleaved case a seek still always tells the kernel
about the following 1MB, even if we already told it two seeks before
(when switching to the alternate stream and back).
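
To make the sliding-window idea concrete, here is a rough sketch of
what the read-path logic looks like (illustration only, not the
attached fadvise.patch; the struct, the readahead_pos bookkeeping
field and the helper name are made up):

#define _XOPEN_SOURCE 600   /* for posix_fadvise */
#include <fcntl.h>
#include <sys/types.h>

#define READAHEAD_DISTANCE (1024 * 1024)   /* the 1MB window discussed above */

typedef struct {
    int   fd;
    off_t pos;            /* current read position               */
    off_t readahead_pos;  /* end of the area we already advised   */
} stream_priv_t;          /* hypothetical, for illustration only  */

static void advise_readahead(stream_priv_t *s, int max_len)
{
    off_t window_end = s->pos + READAHEAD_DISTANCE;

    if (s->readahead_pos < s->pos || s->readahead_pos > window_end) {
        /* First read or just seeked: advise the whole 1MB window once. */
        posix_fadvise(s->fd, s->pos, READAHEAD_DISTANCE, POSIX_FADV_WILLNEED);
        s->readahead_pos = window_end;
    } else if (window_end - s->readahead_pos >= max_len) {
        /* Linear reading: only the new chunk at the leading edge of the
           window is advised, so no chunk is reported to the kernel twice. */
        posix_fadvise(s->fd, s->readahead_pos, window_end - s->readahead_pos,
                      POSIX_FADV_WILLNEED);
        s->readahead_pos = window_end;
    }
}

With linear reads only the second branch fires, once per max_len chunk;
a seek takes the first branch and re-advises the full window, which is
exactly the badly-interleaved worst case described above.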

> Does the Linux kernel possibly handle overlapping fadives by re-reading
> the data over and over again or something equally stupid?

No, but it has to look up the pages of the whole area; AFAICS there is
no other way to do it.

> Or did you not have added the fadvise after seek when you got those numbers?
> Because I can't really see why the sequence
> fadvise(start, PREFETCH_LEN)
> fadvise(start + 2048, PREFETCH_LEN)
> fadvise(start + 4096, PREFETCH_LEN)
> ....
> should behave much worse than
> fadvise(start, PREFETCH_LEN)
> fadvise(start + PREFETCH_LEN, 2048)
> fadvise(start + PREFETCH_LEN + 2048, 2048)
> ...

vmstat shows that it doesn't cause an increased read I/O rate, so it
only reads the pages that are not already in the cache.

Also see above for "page cache lookup is expensive".

From mm/fadvise.c:
|case POSIX_FADV_WILLNEED:
|[...]
|ret = force_page_cache_readahead(mapping, file, start_index, nrpages);
(POSIX_FADV_NOREUSE is indeed a no-op in 2.6.32-rc6, BTW)

From mm/readahead.c:force_page_cache_readahead:
|[...]
|nr_to_read = max_sane_readahead(nr_to_read);
Clamps nr_to_read to roughly half of the NUMA node's free pages.
|[...]
|while (nr_to_read) {
|[...]
|	err = __do_page_cache_readahead(mapping, filp, offset, this_chunk, 0);
|[...]
|}

From mm/readahead.c:__do_page_cache_readahead:
[Gets called for at most 2MB at a time]
|[...]
|for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
|[...]
|	rcu_read_lock();
|	page = radix_tree_lookup(&mapping->page_tree, page_offset);
|	rcu_read_unlock();
|	if (page)
|		continue;
Continue with next page if the requested page is already in the pagecache
|	page = page_cache_alloc_cold(mapping);
|[...]
|	list_add(&page->lru, &page_pool);
|	if (page_idx == nr_to_read - lookahead_size)
|		SetPageReadahead(page);
|	ret++;
Note: lookahead_size is 0 here.
|}
|[5-line comment: "Now start the IO"]
|if (ret)
|	read_pages(mapping, filp, &page_pool, ret);

So it has to walk all the indicated pages, taking the RCU read lock
and doing a radix tree lookup for each one, adding every page that is
not already present to a list, and finally handing that list to
read_pages to start the background read.

> I guess increasing STREAM_BUFFER_SIZE to 4kB and using that instead of
> max_len in fadvise would be ok and possibly more readable, too.

I actually tried a patch that increases STREAM_BUFFER_SIZE to 64kB
for plain files, but it barely makes any difference in system time
or speed...
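
(For reference, that change is conceptually just something like the
following; this is purely illustrative and not the attached
buffer_size.patch, and the helper is made up:)

#define STREAM_BUFFER_SIZE      2048          /* the existing default (2kB)  */
#define STREAM_BUFFER_SIZE_FILE (64 * 1024)   /* hypothetical 64kB value     */

#ifndef STREAMTYPE_FILE
#define STREAMTYPE_FILE 0  /* placeholder so the sketch compiles; real value in stream.h */
#endif

/* Pick the larger fill buffer only for plain files (sketch, not the patch). */
static int stream_buffer_size(int stream_type)
{
    return stream_type == STREAMTYPE_FILE ? STREAM_BUFFER_SIZE_FILE
                                          : STREAM_BUFFER_SIZE;
}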

> The name PREFETCH_LEN is bad though, it should probably be READAHEAD_DISTANCE.

prefetch: "(computing) To load data or instructions in anticipation
of their need"
"readahead is the file prefetching technology used in the Linux
operating system"
(Well, from wiktionary and wikipedia respectively, to be taken with
a grain of salt :))

I'd say that prefetch and read-ahead are more or less synonyms...
With the change, "distance" is clearly better than "len", though.
Anyway, I don't really care whether it's PREFETCH_DISTANCE or
READAHEAD_DISTANCE... ^^;

> > Unpatched:
> > Run1: 0m52.213s real, 0m4.361s system, 0m2.019s user
> > Run2: 0m51.481s real, 0m4.258s system, 0m2.093s user
> > 
> > Patched:
> > Run1: 0m48.986s real, 0m4.970s system, 0m2.056s user
> > Run2: 0m48.376s real, 0m5.005s system, 0m1.975s user
> > 
> > Yep, that looks more reasonable.
> 
> It makes the patch look unneeded.
> And how does this compare to -cache? If it isn't better in
> all cases I suspect there's something wrong...

Why?
It improves speed by about 6% even in the case it's not supposed to
optimize.
And it actually works with the badly muxed file, where -cache breaks
down almost completely.
Even if it uses a bit more system time in that case, -cache will
instead happily saturate the disk with unneeded I/O and use up CPU
time I'd rather see spent on decoding the video...

> Playing something from a slow disk/device while also decoding
> might give something useful (sshfs over wireless maybe?).

I doubt it, since it does not cause more I/O to take place than
before (vmstat does not show an increased read I/O rate; the kernel
skips pages that are already cached), and in the decoding case only a
minuscule amount of time will be spent on the I/O.

For the above example I'd expect about 864 seconds of decoding time
in user space (24 minutes of running time is 1440 seconds, and at the
roughly 60% CPU usage I normally see on this system while decoding
Half-HD video that comes to about 864 seconds).

Compared to that, the worst case for the badly interleaved file above
is 1.3% of kernel time spent on I/O, and for the well-interleaved
file I'd expect about 0.5%.

Revised patch (PREFETCH_LEN renamed to READAHEAD_DISTANCE) and the
mentioned '64kB buffer size for plain files' patch are attached.
(The latter is only very lightly tested, in case you also want to
have a look at it; as I said, somewhat to my surprise, I don't think
it's really worth much.)

-- 
Tobias						PGP: http://8ef7ddba.uguu.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fadvise.patch
Type: text/x-diff
Size: 1690 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20091109/5334fcb5/attachment.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: buffer_size.patch
Type: text/x-diff
Size: 3926 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/attachments/20091109/5334fcb5/attachment-0001.patch>

