
This is Spinn3r's official weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data so you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r, you can read our Founder's blog or check out Tailrank, our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Storing the Full Internet

The other day I blogged about Blekko and what it would take, in terms of hardware, to index the full Internet.

High Scalability responded with some interesting thoughts.

"Kevin Burton calculates that Blekko, one of the barbarian hoard storming Google's search fortress, would need to spend $5 million just to buy enough weapons, er storage."

"Kevin estimates storing a deep crawl of the internet would take about 5 petabytes. At a projected $1 million per petabyte that's a paltry $5 million. Less than expected. Imagine in days of old an ambitious noble itching to raise an army to conquer a land and become its new prince. For a fine land, and the search market is one of the richest, that would be a smart investment for a VC to make."

The comments are interesting.

Borislav Agapiev, in particular, raised several points worth responding to.

"(far) less than 5 PB - a world class index would be 20B pages times 10KB per page = 200TB. This is for page storage, there would be more for storing the index i.e. posting lists. It would depend on size of individual postings and lengths of posting lists but few PB would cover it."

It's more than just the raw page HTML.

You need metadata, previously archived pages, diffs, and n-gram distributions for text clustering algorithms and duplicate detection.

Also, the average page size in Google's full 'net crawl is 15 KB. We see similar numbers in Spinn3r.
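To make that concrete, here's a hedged back-of-envelope sketch in Python. The 20B page count and 15 KB average come from the discussion above; the overhead multipliers for metadata, archived versions, and derived data are assumptions chosen purely for illustration, not Spinn3r's actual numbers.

# Rough back-of-envelope sketch. The page count and average page size come
# from the discussion above; every multiplier below is an illustrative
# assumption, not a measured Spinn3r or Google figure.

pages = 20_000_000_000           # "world class index" size from the comment
avg_page_kb = 15                 # average page size seen in a full 'net crawl

raw_html_tb = pages * avg_page_kb / 1_000_000_000    # KB -> TB (decimal)

metadata_factor = 0.2            # assumed: headers, link data, per-page metadata
archived_versions = 3            # assumed: prior versions / diffs kept per page
derived_factor = 0.5             # assumed: n-grams, duplicate signatures, postings

total_pb = raw_html_tb * (1 + metadata_factor + archived_versions + derived_factor) / 1000

print(f"Raw HTML only: {raw_html_tb:,.0f} TB")   # ~300 TB
print(f"With overhead: {total_pb:,.1f} PB")      # a few PB, before replication

Even with modest multipliers you move from hundreds of terabytes into petabytes before you've stored a single redundant copy.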

"The bottom line is the storage required is very cheap. BTW, $1M/PB = $1/GB seems too high, nowadays cheap SATA 500GB disks can be had for $100."

Yes.... That's a standard SATA disk, or about $0.20/GB. However, that is JUST the disk; it doesn't include CPU, a redundant copy of that disk, or additional disks for high IO transfer rates.

You need a LOT more spindles to be able to access these 5 PB. Storing the data is one thing. Making sure it's highly available, fault tolerant, and high performance is a totally separate issue and ends up seriously increasing your costs.
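Here's a rough sketch of why the raw $/GB figure understates the bill. The replication factor, IOPS budget, and per-spindle throughput below are all assumptions picked to show the shape of the problem, not our real provisioning numbers.

# Hedged sketch: why raw $/GB understates the cost once you need redundancy
# and spindles for random IO. All figures are illustrative assumptions.

dataset_gb = 5 * 1_000_000       # 5 PB expressed in GB
disk_size_gb = 500
disk_cost_usd = 100              # ~$0.20/GB consumer SATA

replication = 3                  # assumed copies for fault tolerance + read throughput
iops_needed = 2_000_000          # assumed aggregate random-read load
iops_per_spindle = 100           # ballpark for a 7200 RPM SATA drive

disks_for_capacity = dataset_gb * replication / disk_size_gb
disks_for_iops = iops_needed / iops_per_spindle

disks = max(disks_for_capacity, disks_for_iops)
print(f"{disks:,.0f} disks, ${disks * disk_cost_usd:,.0f} in disk alone")
# ...before servers, RAM, power, networking, and operations.

That's triple the single-copy figure before you've bought a single server to put the spindles in.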

"For instance, one can crawl with a good crawler 1M pages/day on 1Mbps bandwidth, i.e. 1B pages/day with 1Gbps. So with 20Gbps one can crawl the entire Internet daily. 20Gbps of crawling bandwidth goes for $100K/mo in the Valley; you can saturate it with, say, a thousand cheap crawlers ($1-$1.5K each). I would think that Google spends way more in their cafeteria than that :)"
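Taken purely as arithmetic, those numbers roughly check out. Here's the same back-of-envelope calculation, assuming ~10 KB per page and perfectly saturated links:

# Sanity check of the quoted crawl-bandwidth math. Assumes ~10 KB per page
# and 100% link utilization, which no real crawler achieves.

SECONDS_PER_DAY = 86_400
avg_page_bytes = 10 * 1024

def pages_per_day(bandwidth_mbps):
    bytes_per_day = bandwidth_mbps * 1_000_000 / 8 * SECONDS_PER_DAY
    return bytes_per_day / avg_page_bytes

print(f"{pages_per_day(1):,.0f} pages/day at 1 Mbps")        # ~1 million
print(f"{pages_per_day(20_000):,.0f} pages/day at 20 Gbps")  # ~21 billion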

This doesn't include the political aspects. You can't build a full web crawl in one day. You'd be blocked instantly. Crawlers have to be polite.

We spend a TON of time working with blog hosting companies to make sure our crawlers yield to their policies. For example, LiveJournal won't allow us to crawl with more than 10 threads.
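As a minimal sketch of what per-host politeness looks like, here's some illustrative Python. The 10-connection LiveJournal cap comes from the paragraph above; the default policy, the fixed delays, and the semaphore approach are assumptions for illustration, not Spinn3r's actual crawler code.

# Minimal per-host politeness sketch. The LiveJournal cap comes from the post;
# the default policy and fixed delays are illustrative assumptions only.

import threading
import time
from urllib.parse import urlparse

HOST_POLICY = {
    "livejournal.com": {"max_threads": 10, "delay_s": 1.0},
}
DEFAULT_POLICY = {"max_threads": 2, "delay_s": 5.0}

_semaphores = {}

def _policy_for(url):
    host = urlparse(url).netloc.lower()
    for suffix, policy in HOST_POLICY.items():
        if host == suffix or host.endswith("." + suffix):
            return host, policy
    return host, DEFAULT_POLICY

def polite_fetch(url, fetch):
    """Run fetch(url), never exceeding the host's concurrency limit or rate."""
    host, policy = _policy_for(url)
    sem = _semaphores.setdefault(host, threading.Semaphore(policy["max_threads"]))
    with sem:
        result = fetch(url)
        time.sleep(policy["delay_s"])   # crude pause before releasing the slot
    return result

A real crawler layers robots.txt handling and adaptive backoff on top of this, but the shape is the same: the publisher's policy, not your bandwidth, sets the pace.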

Borislav rephrased this as "crawling does not scale": you don't HAVE to scale it as much as you'd think, because you can only build the crawl at a slow pace anyway.

There are other factors here though. Building a web graph, recomputing your ranking algorithm so you can re-prioritize your crawl, serving the content to clients, finding duplicates: all of this requires more resources than you'd initially think.
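To give a flavor of the "recompute your ranking" step, here's a toy sketch of a PageRank-style pass used to reorder a crawl frontier. Real systems do this over billions of edges with distributed batch jobs; the graph and damping factor here are purely illustrative.

# Toy sketch: recompute a PageRank-style score, then reorder the crawl
# frontier so higher-ranked pages get fetched first. Purely illustrative.

def pagerank(links, damping=0.85, iterations=20):
    """links: dict of page -> list of pages it links to."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, dsts in links.items():
            if not dsts:
                continue
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    return rank

links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
scores = pagerank(links)
frontier = sorted(links, key=scores.get, reverse=True)   # crawl best pages first
print(frontier)

Even this toy version hints at the issue: every recompute touches the whole graph.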

Of course maybe these aren't actual problems. Building this stuff is fun!

Comments

Borislav Agapiev

Kevin,

I agree with your clarifications; my goal was to give a higher-level picture. There are also other issues w.r.t. the details, e.g. whether to archive copies of pages at all. Google is getting away with this, but there are quite a few people who interpret full caching of pages as a copyright violation, even just linking (http://www.iht.com/articles/ap/2007/02/13/business/EU-FIN-Belgium-Google-vs-Newspapers.php).

It is true that you need redundant disks to serve the results to a significant number of users, basically multiple copies to trade off space vs. speed. However, it can still be done with surprisingly small resources; the best example IMHO is the valiant effort put up by GigaBlast - http://gigablast.com.

Regarding crawling, sorry, I do not get that - why would you think one can get a pass with a really slow crawl simply because there are roadblocks put up by others despite obvious feasibility? Clearly, if there was a way to solve that and make everyone happy and have users with fresh and up-to-date content, that should be what we are looking for, right? Hint - there is a way, that is what I am working on right now :), unfortunately we are in stealth so can't say more at this point ...

In addition to storing (indexing) and crawling, there is also query serving and ranking. For query serving, Google's average query flow is 1500+ queries/sec, with peaks probably over 10K. That is a lot (that is why Nutch people claim query serving is a bigger problem than crawling for open source search) but also not something outrageous; e.g. one should be able to cover it with a few hundred machines with lots of RAM. Again, see GigaBlast for an excellent example of how to do it efficiently; early on they were using the same machines for everything, don't know if they changed ... (BTW, disclaimer: I am not affiliated with them in any way, simply always liked their stuff :) )

That leaves us with ranking. Google's resources there do appear insurmountable; however, IMHO there is a serious problem in their approach in that they rely solely on automation and are basically being spammed to a mind-boggling extent by hordes of black-, white-, and gray-hat SEO people. I think people are (very) slowly waking up to this, giving rise to efforts such as Mahalo and Search Wikia. However, the problem with these approaches IMHO is that they do not scale at all, nowhere close to billions of documents, relying almost solely on the human element.

I believe what is needed is to combine the wisdom of crowds and user input with powerful automation that can scale to (tens, hundreds, ...) of billions of documents, a la PageRank. It is possible; that is also what we are working on :) ...

Anyway, great discussion. It is definitely the case that the scale of Web-wide search is not nearly as intimidating as some make it out to be. Such an effort could be mounted by a number of players, even in a head-on centralized fashion. IMHO the problem is that very few are even trying, a situation Google I am sure loves :) and continues to cultivate.

Kevin Burton

"Kevin, I agree with your clarifications, my goal was to give a higher-level
picture. There are also other issue w.r.t. the details, e.g. whether to archive
copies of pages at all. Google is getting away with this, but there are quite a
few people who interpret full caching of pages as a copyright violation, even
just linking."

Yes. I agree this becomes complicated. One of the ways to solve it is to just let people control their content in whatever way they see fit.

It's really just the Darwin awards of the Internet.

If you don't want to be linked to or don't want your content indexed, that's fine.

The NY Times was a holdout for a LONG time but finally came around to Google's way of thinking.

The Belgium case is a bit disturbing.

"It is true that you need redundant di:sks to serve the results to significant
number of users, basically multiple copies to trade off space vs. speed.
However it can be still done with surprisingly small resources, the best example
IMHO is the valiant effort put up by GigaBlast - http://gigablast.com."

Agreed. I think you could do it on a smaller set of machines. I originally noted 1-5 PB and we were looking at smaller steps to get there. 1 PB is somewhat reasonable.

"Regarding crawling, sorry, I do not get that - why would you think one can get
a pass with a really slow crawl simply because there are roadblocks put up by
others despite obvious feasibility? Clearly, if there was a way to solve that
and make everyone happy and have users with fresh and up-to-date content, that
should be what we are looking for, right? Hint - there is a way, that is what I
am working on right now :), unfortunately we are in stealth so can't say more at
this point ... "

Yes.... I realize there's a way. I'm sure we're thinking along the same lines.

"In addition to storing (indexing) and crawling, there is also query serving and
ranking. For query serving, Google' average query flow is 1500+queries/sec, with
peaks probably over 10K. That is a lot (that is why Nutch people claim query
serving is a bigger problem than crawling for open source search)"

Oh... I totally agree. Search is 90% read.
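To put the quoted query numbers in perspective, here's an illustrative sizing sketch. The peak QPS comes from the comment above; the shard count and per-node capacity are assumptions.

# Illustrative sizing for query serving. Peak QPS is from the comment above;
# shard count and per-node capacity are assumptions, not measured figures.

import math

peak_qps = 10_000        # quoted peak query rate
shards = 4               # assumed index partitions; every query hits each shard
qps_per_node = 100       # assumed sustained QPS for one in-RAM shard replica

replicas_per_shard = math.ceil(peak_qps / qps_per_node)
total_nodes = shards * replicas_per_shard
print(f"~{total_nodes} query-serving nodes")   # ~400, i.e. "a few hundred machines"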

"However the problem with these approaches IMHO is that they do not scale at
all, nowhere close to billions of documents, relying almost solely on human
element. I believe what is needed is to combine wisdom of crowds and user input
with powerful automation, that can scale to (tens, hundreds, ...) of billions of
documents, ala PageRank It is possible, that is also what we are working on :)
... Anyway, great discussion, it is definitely the case that the scale of Web
wide search is not nearly as intimidating as some make it out to be. such an
effort could be mounted by a number of players, even in a head-on centralized
fashion. IMHO the problem very few are even trying, a situation Google I am sure
loves :) and continues to cultivate."

The rules are certainly changing - which is good.

Even running a decent crawler AND search appliance with the current state-of-the-art hardware and software is certainly within reach.

Should be an interesting couple of years.

Burn Alex

Google isn't indexing my blog about pain, pain relief, pain treatment, and pain medications. What is this? My blog is hosted on Blogspot. Please send me an answer about this.

noname

I just tried Mahalo. Limited and somewhat SEO cheesy.
