This is Spinn3r's official weblog, where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Spinn3r Indexing 52T Per Month

I looked at our bandwidth numbers, and Spinn3r is indexing 52 TB of raw content per month.

That's 52 TERABYTES, people. Roughly 160 Mbit/s of continuous IO, processed 24/7.
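For anyone who wants to check the conversion, here is a quick sketch of the arithmetic. It assumes decimal terabytes (10^12 bytes) and a 30-day month, neither of which is stated above, and the function name is just illustrative.

```python
def sustained_mbit_per_s(terabytes_per_month, days_per_month=30):
    """Convert a monthly transfer volume into an average line rate.

    Assumes decimal units (1 TB = 10**12 bytes) and a 30-day month by default.
    """
    bits = terabytes_per_month * 10**12 * 8
    seconds = days_per_month * 24 * 60 * 60
    return bits / seconds / 10**6  # megabits per second


rate = sustained_mbit_per_s(52)
print(f"{rate:.1f} Mbit/s sustained")
```

With a 31-day month the figure comes out a few Mbit/s lower, so "roughly 160" is about as precise as the number gets.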

A good portion of this is redundant RSS and polled HTML.

I'd really love to see the web upgraded to support delta encoding, so that a repeat fetch transfers only what changed since the last one. This would save a ton of money in bandwidth costs.
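To make the idea concrete, here is a minimal sketch of line-level delta encoding using Python's standard `difflib`. It is not how Spinn3r works and not any particular wire format; the `make_delta` and `apply_delta` helpers and the sample feed items are hypothetical, made up for illustration.

```python
import difflib


def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal lines to add."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # The receiver already has these lines: send only the range.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New material has to be sent in full.
            ops.append(("add", new[j1:j2]))
        # "delete" needs no op: those old lines are simply never copied.
    return ops


def apply_delta(old, ops):
    """Rebuild the new version from the cached old version and the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Hypothetical feed: one new item appears at the top between two polls.
old_feed = ["<item>post 1</item>", "<item>post 2</item>", "<item>post 3</item>"]
new_feed = ["<item>post 4</item>"] + old_feed

delta = make_delta(old_feed, new_feed)
rebuilt = apply_delta(old_feed, delta)

literal_bytes = sum(len(line) for op in delta if op[0] == "add" for line in op[1])
full_bytes = sum(len(line) for line in new_feed)
print(f"literal payload: {literal_bytes} of {full_bytes} bytes")
```

Both sides need a copy of the previous version for this to work, which is the storage trade-off that comes with any delta scheme.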

Comments

Ian Rogers

"Rsync encoding" http://rproxy.samba.org/doc/protocol/protocol.html is an earlier example of a delta-encoding transport (and much easier to compute than vcdiff as suggested in your linked paper).

But all delta type encodings require you to maintain a cache of the "previous" version. Do you have enough storage to hold 52TB / month? :-)

Of course you don't need that much as RSS feeds are refreshed far more frequently than once a month.

Do you have an idea how much space you'd need to store a "current" snapshot of all the feeds you spider?
