Spinn3r Indexing 52T Per Month
I looked at our bandwidth numbers and Spinn3r has been indexing 52T of raw content per month.
That's 52 TERABYTES, people. That works out to roughly 160 Mbit/s of continuous IO, processed 24/7.
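For anyone who wants to check the arithmetic, here is a rough sketch of the conversion. It assumes decimal terabytes (10^12 bytes) and a 30-day or 31-day month; the function name is just for illustration and the exact figure shifts a little with those assumptions.

```python
def sustained_mbit_per_s(terabytes_per_month, days_per_month=30):
    """Average line rate needed to move a monthly volume, in Mbit/s.

    Assumes decimal units: 1 TB = 10**12 bytes, 1 Mbit = 10**6 bits.
    """
    bits = terabytes_per_month * 10**12 * 8
    seconds = days_per_month * 24 * 60 * 60
    return bits / seconds / 10**6


if __name__ == "__main__":
    for days in (30, 31):
        rate = sustained_mbit_per_s(52, days)
        print(f"{days}-day month: {rate:.1f} Mbit/s")
```

Either way the result lands in the neighborhood of 160 Mbit/s, which is where the figure above comes from.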
A good portion of this is redundant RSS and polled HTML.
I'd really love to have the web upgraded to support delta encoding. This would save a ton of money in bandwidth costs.
"Rsync encoding" (http://rproxy.samba.org/doc/protocol/protocol.html) is an earlier example of a delta-encoding transport (and much easier to compute than vcdiff as suggested in your linked paper).
But all delta-type encodings require you to maintain a cache of the "previous" version. Do you have enough storage to hold 52TB / month? :-)
Of course you don't need that much as RSS feeds are refreshed far more frequently than once a month.
Do you have an idea how much space you'd need to store a "current" snapshot of all the feeds you spider?
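To make the cache requirement concrete, here is a minimal sketch of a block-matching delta in the spirit of rsync. It is not the rproxy wire format: the function names, the op tuples and the tiny block size are all assumptions for illustration, and it recomputes a SHA-256 at every offset instead of using rsync's cheap rolling checksum. The point is only that the client can rebuild the new feed from its cached copy plus a small delta, and cannot do so without that cached copy.

```python
import hashlib

BLOCK = 16  # hypothetical block size, kept tiny so the demo is easy to follow


def signatures(old: bytes, block: int = BLOCK) -> dict:
    """Hash every full, aligned block of the cached version (client side)."""
    sigs = {}
    for off in range(0, len(old) - block + 1, block):
        sigs.setdefault(hashlib.sha256(old[off:off + block]).digest(), off)
    return sigs


def make_delta(sigs: dict, new: bytes, block: int = BLOCK) -> list:
    """Describe `new` as copies of cached blocks plus literal bytes (server side)."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        window = new[i:i + block]
        off = None
        if len(window) == block:
            off = sigs.get(hashlib.sha256(window).digest())
        if off is None:
            # No cached block matches here: ship this byte as-is.
            literal.append(new[i])
            i += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", off))
            i += block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def apply_delta(old: bytes, ops: list, block: int = BLOCK) -> bytes:
    """Rebuild the new version from the cached copy and the delta (client side)."""
    out = bytearray()
    for kind, value in ops:
        out += old[value:value + block] if kind == "copy" else value
    return bytes(out)
```

A crawler would call `signatures()` on its stored snapshot, send those along with the request, and run `apply_delta()` on the response. For a feed that gained one new item, only the new item and a few unaligned bytes travel as literals, which is why the size of the "current" snapshot, rather than the monthly transfer volume, is the storage number that matters.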
Posted by: Ian Rogers | December 05, 2007 at 06:13 PM