LFS writes and network receive (too much splhigh?)
Thor Lancelot Simon
2006-10-22 19:07:02 UTC
If I restore a backup full of large files (such that the smooth syncer,
which schedules writes when _a file's_ oldest data is 30 seconds old,
queues up huge amounts of dirty data per file) over the network onto LFS,
the following happens:

1) Writes queue up for 30 seconds (this is a design flaw in the smooth syncer)

2) Every 30 seconds, LFS writes flat-out for a few seconds

3) While this is going on, the network interface interrupt rate falls off
dramatically, almost to zero, while the disk interface interrupt rate,
of course, rises.

I assume the network interrupts are being masked during the segment writes,
either by LFS itself or by the disk driver. This has the exasperating
effect of causing dropped packets, which causes the TCP window to slam open
and shut.
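
(For concreteness, the pattern I'm assuming is the usual spl(9) one,
sketched below with made-up names -- this is not actual LFS or amr code.
Whatever runs between raising the priority and the matching splx() has
interrupts at or below that level deferred, and at splhigh() that
includes the network controller.)

/*
 * Hypothetical sketch only.  While the code between splbio() (or,
 * worse, splhigh()) and splx() runs, interrupts at or below that
 * priority stay masked.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>

/* made-up stand-in for the driver's start-I/O routine */
extern void example_start_io(struct buf *);

static void
example_flush_queue(struct buf **bufs, int nbufs)
{
        int s, i;

        s = splbio();           /* or splhigh(), in the worst case */
        for (i = 0; i < nbufs; i++)
                example_start_io(bufs[i]);
        splx(s);                /* drop back to the previous IPL */
}

If either LFS or the amr driver sits in a section like that for the
several seconds a segment write takes, the network interrupt simply
never gets in.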

How can we fix this? The smooth syncer issue is really separate: it just
makes it easier to demonstrate the problem.

Could LFS use more locks so that it spent less time at high IPL? Or is
this really a problem in my disk device driver (amr)? I have heard
similar reports from other LFS users with different disk hardware.
--
Thor Lancelot Simon ***@rek.tjls.com

"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart

Thor Lancelot Simon
2006-10-22 20:29:48 UTC
On Sun, 22 Oct 2006 15:07:02 -0400
Post by Thor Lancelot Simon
If I restore a backup full of large files (such that the smooth
syncer, which schedules writes when _a file's_ oldest data is 30
seconds old, queues up huge amounts of dirty data per file) over the
network onto LFS, the following happens:
1) Writes queue up for 30 seconds (this is a design flaw in the smooth syncer)
No, it can't. The design flaw is that the syncer writes *all* the
dirty data for any file when the oldest dirty data for that file is
syncdelay (or filedelay) old. That is a wrong implementation of the
smooth sync algorithm, which should write each page of data (and perhaps
any immediately adjacent dirty data, for the sake of efficiency) when
that page is syncdelay (or filedelay) old.

The implementation we have degenerates to being equivalent to non-smooth
sync in the case in which all the data dirtied during the sync interval
belongs to the same few large files: it does huge writes every 30 seconds,
instead of doing smaller writes at a lower, constant rate.
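
Spelled out as a toy userland model (hypothetical code, nothing from
the tree), the per-page rule would look something like this:

#include <stdio.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical model, not kernel source.  Current behaviour: once the
 * OLDEST dirty page of a file reaches syncdelay, every dirty page of
 * that file is written at once.  Per-page behaviour: each page goes
 * out only when that page itself reaches syncdelay.
 */
struct dirty_page {
        time_t  dp_dirtied;             /* when this page was dirtied */
};

/* stand-in for starting one page write */
static void
flush_page(struct dirty_page *dp)
{
        printf("flush page dirtied at %ld\n", (long)dp->dp_dirtied);
}

static void
sync_per_page(struct dirty_page *pages, size_t npages,
    time_t now, time_t syncdelay)
{
        size_t i;

        for (i = 0; i < npages; i++)
                if (now - pages[i].dp_dirtied >= syncdelay)
                        flush_page(&pages[i]);  /* only the aged pages */
}

int
main(void)
{
        struct dirty_page pages[4] = { { 100 }, { 110 }, { 120 }, { 130 } };

        /*
         * At now == 131 with syncdelay 30, only the page dirtied at 100
         * is old enough; a whole-file flush would push all four.
         */
        sync_per_page(pages, 4, 131, 30);
        return 0;
}

That whole-file flush is the burst we see every 30 seconds.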

For LFS, this could perhaps be helped by maintaining an estimate of how
much dirty data there is, and syncing whenever that estimate adds up to
more than a segment. But there would then need to be a way to stop the
sync so as not to do partial-segment writes at the end.
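
A rough sketch of what that estimate might look like (made-up names;
the real thing would have to live next to the LFS segment accounting
and deal with the partial-segment problem just mentioned):

#include <sys/types.h>

/*
 * Hypothetical sketch, not LFS source: keep a running count of dirty
 * bytes and tell the caller whenever a full segment's worth has
 * accumulated, so data goes out in segment-sized pieces instead of
 * one 30-second burst.
 */
struct seg_estimate {
        u_long  se_dirty;       /* estimated dirty bytes outstanding */
        u_long  se_segsize;     /* segment size in bytes */
};

/* returns nonzero when the caller should start a segment write */
int
note_dirty(struct seg_estimate *se, u_long nbytes)
{
        se->se_dirty += nbytes;
        if (se->se_dirty >= se->se_segsize) {
                se->se_dirty -= se->se_segsize;
                return 1;
        }
        return 0;
}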

Thor

Hubert Feyrer
2006-10-22 20:26:18 UTC
vfs.sync.delay = 20
vfs.sync.filedelay = 20
vfs.sync.dirdelay = 15
vfs.sync.metadelay = 10
None of these are in sysctl.3 - can someone please add documentation for
them?


- Hubert

Jason Thorpe
2006-10-22 21:56:29 UTC
Post by Thor Lancelot Simon
What do you think is going on?
I'm not sure yet. Are you absolutely sure it's a problem with
servicing the interrupt on time?

-- thorpej


Jason Thorpe
2006-10-22 22:51:16 UTC
Post by Thor Lancelot Simon
So the reasonable inference, to me, really seems to be that LFS, when it
writes flat-out for 5 or 10 seconds at a time, is causing the network
interrupts to not be serviced, which is what's causing TCP to back off.
So, the next step is to determine a few pieces of info from the
network driver:

1- How many packets are being serviced per interrupt? When the rate
drops, you should see this number go UP. If it does not go up, then
it means packets are not actually arriving, right?

2- Does the Ethernet chip ever report that a packet was dropped
because the ring buffer filled up?
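
(One hypothetical way to get at #1 without scattering printfs through
the driver is a pair of evcnt(9) counters, visible with vmstat -e; the
sketch below uses made-up names and is not actual bge code.)

#include <sys/param.h>
#include <sys/device.h>

/* count interrupts and the packets drained per interrupt */
static struct evcnt ex_intr_ev;
static struct evcnt ex_rxpkt_ev;

void
ex_counters_attach(void)
{
        evcnt_attach_dynamic(&ex_intr_ev, EVCNT_TYPE_INTR, NULL,
            "exdriver", "interrupts");
        evcnt_attach_dynamic(&ex_rxpkt_ev, EVCNT_TYPE_MISC, NULL,
            "exdriver", "rx packets");
}

/* call from the interrupt handler with the number of packets handled */
void
ex_intr_account(int npackets)
{
        ex_intr_ev.ev_count++;
        ex_rxpkt_ev.ev_count += npackets;
}

Dividing the packet counter by the interrupt counter over an interval
gives packets per interrupt.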

-- thorpej


Thor Lancelot Simon
2006-10-22 22:53:36 UTC
Post by Jason Thorpe
So, the next step is to determine a few pieces of info from the network driver:
1- How many packets are being serviced per interrupt? When the rate
drops, you should see this number go UP. If it does not go up, then
it means packets are not actually arriving, right?
2- Does the Ethernet chip ever report that a packet was dropped
because the ring buffer filled up?
Ugh, the bge driver. I'll see what I can squeeze out of it.
--
Thor Lancelot Simon ***@rek.tjls.com

"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart

Bill Studenmund
2006-10-23 01:18:48 UTC
Post by Jason Thorpe
What do you think is going on?
I'm not sure yet. Are you absolutely sure it's a problem with
servicing the interrupt on time?
Post by Thor Lancelot Simon
Well, if I run systat vmstat (with interval 1) while this is going on,
when LFS starts to whack the disk, I see the usual 2000-3000 network
interrupts per second fall off to near zero until the writes (and
disk controller) drop back to zero.
So, while LFS is writing, I will see 10, sometimes 30, very occasionally
as many as 300 interrupts per second on the network controller's interrupt
line, right as TCP backs off and throughput goes to hell. When LFS isn't
writing, I see, as I said, 2000-3000 network interrupts per second -- or
as many as 7,000, if I turn interrupt moderation down.
So the reasonable inference, to me, really seems to be that LFS, when it
writes flat-out for 5 or 10 seconds at a time, is causing the network
interrupts to not be serviced, which is what's causing TCP to back off.
How much buffering is in your application? I ask because another thing that
could be happening is that the file is locked while it's being flushed, so
that the program that's reading from the network stalls during this flush.
That would mean no more packet reception.

This scenario would be noticeable either by monitoring netstat to see what
the connection queue lengths look like or by monitoring the tcp stream to
see if the LFS box is explicitly shrinking the window (i.e. the stack
noticed that the app's not reading for the moment).
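
(As a toy model of the second check, with made-up numbers and not
stack code: the advertised window is roughly the free space in the
socket's receive buffer, so if the application stops reading while
data keeps arriving, the window the LFS box advertises shrinks toward
zero.)

#include <stdio.h>

/* free receive-buffer space is what gets advertised as the window */
static long
advertised_window(long so_rcvbuf, long bytes_queued)
{
        long win = so_rcvbuf - bytes_queued;

        return win > 0 ? win : 0;
}

int
main(void)
{
        long rcvbuf = 32768, queued;

        for (queued = 0; queued <= 40960; queued += 8192)
                printf("queued %6ld -> advertised window %6ld\n",
                    queued, advertised_window(rcvbuf, queued));
        return 0;
}

Seeing that shrink in a packet trace would point at the application
side rather than at interrupt masking.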

Take care,

Bill
