Regarding Summer of Code 2008 (writing device drivers)

pankaj gupta

2008-03-28 08:15:24 UTC

Sir,
I am Pankaj Gupta, a student of computer engineering in India.

I have an idea for a project which I named "WAN optimization through
application independent caching".

The idea is to reduce bandwidth consumption by avoiding
re-transmission of packets between two gateways/nodes.

In a network that is connected to a WAN through a gateway (or even when
a node is connected directly to the Internet), many packet requests are
repetitive.
I have devised a way to avoid re-transmitting such packets by
developing a networking device driver (at the transport layer) that
keeps a cache structure inside the driver.

This would not be like a normal browser cache. It would be a system-wide
cache. Any application accessing data over the network would use the
cache without knowing that it is doing so. Once one application has
accessed some data, that data is available locally to any other
application requesting the same data over the network. This is another
advantage of implementing the cache at the driver level.

The driver will be installed on both communicating machines, and a
cache will be maintained on each of them.
The cache consists of the most recently used data packets (incoming or
outgoing) that passed through the network stack.
When a packet is stored in the cache, its MD5 checksum is calculated
and saved along with it. Packets are stored in a hash table keyed by
the packet's MD5 checksum.
Whenever a packet is sent from the server side, it is captured by our
driver (module), its MD5 checksum is calculated, and the checksum is
looked up in the cache hash table.
If a match occurs, the packet has been sent earlier, and sending it
again can be avoided if the other side (which has the same coherent
cache) can be told that this particular packet was received earlier
and is currently in its cache.
So, after finding a matching MD5 checksum in its cache, the server,
instead of sending the whole 1480-byte packet, sends only the 16-byte
MD5 checksum to the receiving machine. The receiving machine then looks
up that packet in its own cache using the MD5 checksum.
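
A minimal user-space sketch of the exchange described above, written in
Python rather than as a driver. The one-byte tags FULL_TAG and REF_TAG
and the PacketCache class are hypothetical names chosen for
illustration; the post does not say how a reference is distinguished
from a full packet on the wire.

```python
import hashlib

FULL_TAG = b"\x00"  # hypothetical marker: the full payload follows
REF_TAG = b"\x01"   # hypothetical marker: a 16-byte MD5 digest follows


class PacketCache:
    """Maps an MD5 digest to the packet payload it was computed from."""

    def __init__(self):
        self._by_digest = {}

    def encode(self, payload: bytes) -> bytes:
        """Sender side: emit a 16-byte reference if the payload was seen before."""
        digest = hashlib.md5(payload).digest()
        if digest in self._by_digest:
            return REF_TAG + digest
        self._by_digest[digest] = payload
        return FULL_TAG + payload

    def decode(self, wire: bytes) -> bytes:
        """Receiver side: expand a reference, or store and return a full payload."""
        tag, body = wire[:1], wire[1:]
        if tag == REF_TAG:
            return self._by_digest[body]  # KeyError means the caches diverged
        self._by_digest[hashlib.md5(body).digest()] = body
        return body


sender, receiver = PacketCache(), PacketCache()
payload = b"x" * 1480
first = sender.encode(payload)    # full packet plus tag
second = sender.encode(payload)   # digest reference plus tag
assert receiver.decode(first) == payload
assert receiver.decode(second) == payload
print(len(first), len(second))    # 1481 17
```

The sketch assumes every full packet reaches the receiver; a lost first
transmission would leave the two caches out of step, which the KeyError
comment marks but does not handle.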

I have completed a major portion of this project as my final-year
project, on the Windows platform. I am very eager to implement it on a
UNIX platform as well.

I have not yet sent a proposal for this project; I am waiting for your
valuable reply.

I am looking forward to your reply.

--
Regards
Pankaj Gupta
AIT Pune

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

David Young

2008-03-28 20:53:54 UTC

Post by pankaj gupta
Sir,
I am Pankaj gupta, a student of computer engineering in India.
I have an idea for a project which I named "WAN optimization through
application independent caching".
The idea is to reduce the bandwidth consumption by avoiding
re-transmission of packets between two gateways/nodes.

Pankaj,

This is an interesting project that you propose.

What you propose is a kind of packet compression, don't you think?
Perhaps your proposal is just the first of several packet compression
facilities that you can add to NetBSD.

Delivery of IP packets is not guaranteed. It seems that you will need
some feedback from the receiver, in order to know that the sender and
receiver have the same contents in their cache. Is that so?

Is your technique capable of opportunistically accelerating peer-to-peer
filesharing apps such as BitTorrent? Two instances of the same
application may segment the same stream of data differently; does your
technique compensate for differences in segmentation, or does it miss
opportunities for compression if the segmentation is different?
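
The segmentation concern can be illustrated with a small experiment. The
sketch below assumes whole-packet MD5 matching, as in the original
post, and a hypothetical helper segment_digests; the sample data and
the 1460-byte segment size are illustrative choices only.

```python
import hashlib


def segment_digests(stream: bytes, first: int, size: int) -> set:
    """MD5 digests of one `first`-byte segment followed by `size`-byte segments."""
    pieces = [stream[:first]]
    pieces += [stream[i:i + size] for i in range(first, len(stream), size)]
    return {hashlib.md5(p).digest() for p in pieces if p}


# Deterministic, non-repeating sample data (16 KiB); purely illustrative.
stream = b"".join(hashlib.md5(i.to_bytes(4, "big")).digest() for i in range(1024))

same = segment_digests(stream, 1460, 1460)
shifted = segment_digests(stream, 100, 1460)  # same bytes, boundaries moved by 100

print("segments:", len(same), "shared after shift:", len(same & shifted))
```

With whole-packet digests, moving every boundary by a fixed amount
leaves no segment in common, so a cache keyed this way would miss the
duplicate data entirely.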

You mention the transport layer. I am not precisely sure where your
solution will reside in the kernel. Will you say some more about that?

Do you use only the checksum to detect duplicate packets? It seems
that there is a risk of a stream being corrupted by chance. Also,
the technique seems susceptible to data injection. What do you think?

Do endpoints who are using the packet-cache technique automatically
detect each other?

Have you thought about getting routers involved?

Dave

--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933 ext 24

Steven M. Bellovin

2008-03-28 21:37:40 UTC

On Fri, 28 Mar 2008 15:53:54 -0500

Post by David Young
Delivery of IP packets is not guaranteed. It seems that you will need
some feedback from the receiver, in order to know that the sender and
receiver have the same contents in their cache. Is that so?

I'd think that ordinary TCP retransmissions would take care of that.

Post by David Young
Do you use only the checksum to detect duplicate packets? It seems
that there is a risk of a stream being corrupted by chance.

If the cached packet is fed to TCP, the ordinary TCP checksum would be
no worse than we have today. Also, we're working with MD5, which is a
very strong checksum; if the sender and receiver have the same MD5 hash
stored, the only possible area for corruption is the stored packet
corresponding to the MD5 hash on the receiver. But that's why I want
to send it to TCP's input routine for normal processing, to preserve
the end-to-end -- well, transport layer to transport layer -- checksum
semantics.
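
As a rough gauge of the accidental-collision risk, the birthday bound
can be computed directly. This sketch assumes the 128-bit digest
behaves like a uniformly random function of the packet contents; it
says nothing about deliberately constructed collisions. The function
name collision_bound is hypothetical.

```python
from fractions import Fraction


def collision_bound(n_packets: int, digest_bits: int = 128) -> Fraction:
    """Birthday upper bound: chance that any two of n distinct packets share a digest."""
    pairs_times_two = n_packets * (n_packets - 1)
    return Fraction(pairs_times_two, 2 ** (digest_bits + 1))


# Example: a cache holding 2**20 packets (about 1.5 GB of 1480-byte packets).
print(float(collision_bound(2 ** 20)))
```

For a cache of that size the bound is below 2**-88, far smaller than
the miss rate of the 16-bit TCP checksum that the cached packet is
still subject to.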

Post by David Young
Also,
the technique seems susceptible to data injection. What do you think?

I don't see why there's any more chance of it with this scheme than
with normal TCP.

Post by David Young
Do endpoints who are using the packet-cache technique automatically
detect each other?
Have you thought about getting routers involved?

Beware packets taking different paths. Also, when do routers discard
their caches? What do they cache?

All this said, I'm not convinced this will work particularly well, for
several reasons. First, how much data can be cached in RAM on the
receiving machine? What are the odds that some other application will
want the same data within the lifetime of the cached copy? Web
graphics are the most likely case, but the browser's cache generally
takes care of that. Second, how expensive is the cache consistency
protocol? Will there be more traffic maintaining the MD5 state than is
saved? Besides, a typical web server can't maintain data very long
(especially in the kernel) for any one web client. Finally, on many
links the cost is per-packet, rather than per-bit.
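
The traffic trade-off can be put in numbers with a simple model. The
1480-byte and 16-byte figures come from the original post; the hit rate
and the per-packet consistency overhead are unknowns, so both are
parameters here, and the function names are hypothetical.

```python
PACKET_BYTES = 1480  # full packet size quoted in the original post
DIGEST_BYTES = 16    # MD5 digest sent in place of a cached packet


def net_saving_per_packet(hit_rate: float, overhead_bytes: float) -> float:
    """Expected bytes saved per packet after paying the consistency overhead."""
    return hit_rate * (PACKET_BYTES - DIGEST_BYTES) - overhead_bytes


def break_even_hit_rate(overhead_bytes: float) -> float:
    """Hit rate at which the savings exactly cancel the overhead."""
    return overhead_bytes / (PACKET_BYTES - DIGEST_BYTES)


# Example with an assumed 16 bytes of cache-maintenance traffic per packet.
print(break_even_hit_rate(16.0))
print(net_saving_per_packet(hit_rate=0.05, overhead_bytes=16.0))
```

The model counts bytes only. A cache hit still sends one packet, so on
links where the cost is per-packet it predicts a saving that would not
materialise.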

Here's a suggestion, though: only save the first ~10 packets of any
connection. First, they're slow to arrive, because of slow start.
Second, on big files you can't save that much because (a) you can't
afford to keep much in RAM, per the above; (b) once TCP gets past the
slow start phase, it will work at line speed minus the effects of
upstream congestion; (c) most connections are pretty short anyway.
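
One way to sketch the "first ~10 packets" policy is a per-connection
counter in front of the digest table. The class name HeadOnlyCache and
the conn_id key (for example the address/port 4-tuple) are assumptions
made for illustration.

```python
import hashlib
from collections import defaultdict

MAX_CACHED_PER_CONN = 10  # the "~10 packets" from the suggestion; tunable


class HeadOnlyCache:
    """Caches only the first few packets observed on each connection."""

    def __init__(self, limit: int = MAX_CACHED_PER_CONN):
        self.limit = limit
        self.seen = defaultdict(int)  # connection id -> packets observed so far
        self.store = {}               # MD5 digest -> payload

    def observe(self, conn_id, payload: bytes) -> bool:
        """Record a packet; return True if it was admitted to the cache."""
        index = self.seen[conn_id]
        self.seen[conn_id] += 1
        if index >= self.limit:
            return False  # past the slow-start head of the connection
        self.store[hashlib.md5(payload).digest()] = payload
        return True


cache = HeadOnlyCache()
conn = ("192.0.2.1", 80, "192.0.2.2", 40000)  # example 4-tuple
admitted = [cache.observe(conn, bytes([i]) * 1480) for i in range(15)]
print(admitted.count(True))
```

Memory use is then bounded by the number of connections rather than by
the volume of data each one carries.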

--Steve Bellovin, http://www.cs.columbia.edu/~smb

der Mouse

2008-03-28 22:47:29 UTC

Post by Steven M. Bellovin

Also, the technique seems susceptible to data injection. What do
you think?

I don't see why there's any more chance of it with this scheme than
with normal TCP.

Sleeper injections, perhaps? With normal TCP, if you inject a packet,
it has to be in-window, or it's dropped. With this, you can inject a
packet and have it sit in a cache for a more or less unlimited time and
then have it crawl out and damage the data stream. (Nontrivial, but
I'd be very reluctant to declare it impossible. A lot of traffic is a
lot more predictable in practice than it's promised to be by theory.)

Post by Steven M. Bellovin
First, how much data can be cached in RAM on the receiving machine?

Quite a lot, if it wants to. I'm regularly seeing machines these days
with more RAM than some of my machines have _disk_.

/~\ The ASCII                             der Mouse
\ / Ribbon Campaign
 X  Against HTML               ***@rodents.montreal.qc.ca
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Steven M. Bellovin

2008-03-28 23:10:06 UTC

On Fri, 28 Mar 2008 18:47:29 -0400 (EDT)

Post by der Mouse

Post by Steven M. Bellovin

Also, the technique seems susceptible to data injection. What do
you think?

I don't see why there's any more chance of it with this scheme than
with normal TCP.

Sleeper injections, perhaps? With normal TCP, if you inject a packet,
it has to be in-window, or it's dropped. With this, you can inject a
packet and have it sit in a cache for a more or less unlimited time
and then have it crawl out and damage the data stream. (Nontrivial,
but I'd be very reluctant to declare it impossible. A lot of traffic
is a lot more predictable in practice than it's promised to be by
theory.)

I still don't see the attack. The packet can only get into the cache
if it's in-window for some stream, plus it passes the TCP checksum.
(Aside: we never want to cache UDP packets without checksum. In fact,
we never want to send them, but that's a separate rant.) That means
that a once-valid packet has to be corrupted on the local machine while
sitting in kernel RAM. Doing that requires root access, but someone
with that access has many easier paths to corrupting an application or
its data.

I'm more worried about accidental contamination, hence my suggestion
about the TCP checksum.
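
The admission rule argued for here (in-window and checksum-valid before
anything enters the cache) might look like the following. This is a
simplified sketch: it tests only the first sequence number of the
segment against the receive window, not the full segment-acceptability
rules of the TCP specification, and the function names are
hypothetical.

```python
MOD = 1 << 32  # TCP sequence numbers wrap modulo 2**32


def in_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd) modulo 2**32."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


def may_cache(seq: int, rcv_nxt: int, rcv_wnd: int, checksum_ok: bool) -> bool:
    """Admit a segment to the cache only if TCP itself would have accepted it."""
    return checksum_ok and in_window(seq, rcv_nxt, rcv_wnd)


# A segment just past a sequence-number wrap is still in-window.
print(may_cache(seq=5, rcv_nxt=0xFFFFFFF0, rcv_wnd=256, checksum_ok=True))
# A segment one byte behind the window is rejected.
print(may_cache(seq=0xFFFFFFEF, rcv_nxt=0xFFFFFFF0, rcv_wnd=256, checksum_ok=True))
```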

Post by der Mouse

Post by Steven M. Bellovin
First, how much data can be cached in RAM on the receiving machine?

Quite a lot, if it wants to. I'm regularly seeing machines these days
with more RAM than some of my machines have _disk_.

Sure, but how much network traffic do they have? Is that the best
performance improvement per dollar/euro/yen/zorkmid spent, compared
with using that RAM for file system buffers or executables, especially
when you take the probability of reuse into account? (Hmm -- for NFS,
it might be a very promising idea...)

--Steve Bellovin, http://www.cs.columbia.edu/~smb

der Mouse

2008-03-29 22:45:46 UTC

Post by Steven M. Bellovin

Post by der Mouse

I don't see why there's any more chance of [data injection] with
this scheme than with normal TCP.

Sleeper injections, perhaps?

I still don't see the attack. The packet can only get into the cache
if it's in-window for some stream, plus it passes the TCP checksum.

Oh, I misunderstood. I thought this was done at the IP layer, not the
TCP layer.

Post by Steven M. Bellovin
(Aside: we never want to cache UDP packets without checksum. [...])

If you're doing caching at the TCP layer, you don't have to worry about
non-TCP packets.

Post by Steven M. Bellovin
[...], especially when you take the probability of reuse into
account. (Hmm -- for NFS, it might be a very promising idea...)

Doesn't most NFS use UDP, and thus not get cached? I certainly know
that I've seen NFS-over-TCP used seldom-to-never.

/~\ The ASCII                             der Mouse
\ / Ribbon Campaign
 X  Against HTML               ***@rodents.montreal.qc.ca
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

Joerg Sonnenberger

2008-03-30 01:13:58 UTC

Post by der Mouse
Doesn't most NFS use UDP, and thus not get cached? I certainly know
that I've seen NFS-over-TCP used seldom-to-never.

Depends. For high-performance networks, UDP is often far from the best
choice.

Joerg

Matthias Scheler

2008-04-01 21:32:08 UTC

Post by der Mouse

Post by Steven M. Bellovin
[...], especially when you take the probability of reuse into
account. (Hmm -- for NFS, it might be a very promising idea...)

Doesn't most NFS use UDP, and thus not get cached? I certainly know
that I've seen NFS-over-TCP used seldom-to-never.

Huh? It is actually NFS over UDP that is dying out. NFSv4 over UDP was
removed from the spec on purpose.

I had lots of problems with NFS over UDP with a fast file server
(NetBSD-i386 with Gigabit Ethernet) and a slow client (NetBSD-sparc,
10Mb/s Ethernet). Switching to NFS over TCP fixed those problems.

Kind regards

--
Matthias Scheler http://zhadum.org.uk/

Hubert Feyrer

2008-03-28 23:57:15 UTC

Post by pankaj gupta
I am looking forward to your reply.

Can you describe what usage scenario this would come into effect with?
Speeding up download of the same file twice?

What impact will this have on the virtual memory system, if you cache all
traffic going through a machine? I'm thinking of a file server that serves
many megabytes per second, will it cache the whole traffic? Or am I
misunderstanding something entirely here? What's the use case you have in
mind?
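
On the memory question, the original post speaks of keeping the "most
recently used" packets, which suggests a cache with a fixed byte budget
and least-recently-used eviction rather than one that holds all
traffic. A sketch under that assumption follows; the class name
BoundedPacketCache and the budget parameter are hypothetical.

```python
import hashlib
from collections import OrderedDict


class BoundedPacketCache:
    """Digest-keyed packet cache that never holds more than max_bytes of payload."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # MD5 digest -> payload, least recent first

    def put(self, payload: bytes) -> bytes:
        """Store a payload, evicting the least recently used entries if needed."""
        digest = hashlib.md5(payload).digest()
        if digest in self._items:
            self._items.move_to_end(digest)  # refresh recency on a repeat
            return digest
        self._items[digest] = payload
        self.used += len(payload)
        while self.used > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
        return digest

    def get(self, digest: bytes):
        """Return the cached payload, or None if it is absent or was evicted."""
        payload = self._items.get(digest)
        if payload is not None:
            self._items.move_to_end(digest)
        return payload


cache = BoundedPacketCache(max_bytes=3000)
digests = [cache.put(bytes([i]) * 1480) for i in range(3)]
print(cache.used, cache.get(digests[0]) is None)
```

A busy file server would then churn through the budget quickly, which
is exactly the case where the hit rate, and so the benefit, is in
doubt.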

- Hubert
