Storing the Full Internet
The other day I blogged about Blekko and what it would take, in terms of hardware, to index the full Internet.
High Scalability responded with some interesting thoughts.
Kevin Burton calculates that Blekko, one of the barbarian hordes storming Google's search fortress, would need to spend $5 million just to buy enough weapons, er, storage.
Kevin estimates storing a deep crawl of the internet would take about 5 petabytes. At a projected $1 million per petabyte that's a paltry $5 million. Less than expected. Imagine in days of old an ambitious noble itching to raise an army to conquer a land and become its new prince. For a fine land, and the search market is one of the richest, that would be a smart investment for a VC to make.
The comments are interesting.
Borislav Agapiev, in particular, made some good points.
"(far) less than 5 PB - a world class index would be 20B pages times 10KB per page = 200TB. This is for page storage, there would be more for storing the index i.e. posting lists. It would depend on size of individual postings and lengths of posting lists but few PB would cover it."
It's more than just the raw page HTML.
You need metadata, previously archived versions of pages, diffs, ngram distributions for text clustering algorithms and duplicate detection, and so on.
Also, the average page size in Google's full 'net crawl is about 15KB. We see similar numbers in Spinn3r.
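Just to put rough numbers on the raw storage side, here's a quick back-of-envelope sketch in Python. The 20B page count is Borislav's figure; the 10KB vs 15KB page sizes are his estimate vs what we see in Spinn3r:

```python
# Back-of-envelope: raw page storage for a full-web crawl.
# Assumed inputs: 20 billion pages (Borislav's figure) and average page
# sizes of 10KB (his estimate) vs 15KB (what we see in Spinn3r).
PAGES = 20e9

for avg_kb in (10, 15):
    total_tb = PAGES * avg_kb * 1_000 / 1e12   # decimal KB and TB
    print(f"{avg_kb}KB/page -> {total_tb:,.0f} TB of raw HTML")

# 10KB/page -> 200 TB
# 15KB/page -> 300 TB
```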
"The bottom line is the storage required is very cheap. BTW, $1M/PB = $1/GB seems too high, nowadays cheap SATA 500GB disks can be had for $100."
Yes... that's the price of a standard SATA disk, or about $0.20/GB. However, that's JUST the disk: it doesn't include CPU, a redundant copy of that disk, or the additional disks you need for high IO transfer rates.
You need a LOT more spindles to be able to access this 5PB. Storing the data is one thing. Making sure it's highly available, fault tolerant, and high performance is a totally separate issue and ends up seriously increasing your costs.
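Here's a rough, purely illustrative cost model to show how fast those factors multiply. The $0.20/GB disk price comes from the numbers above; the replication factor and the spindle over-provisioning are assumptions I'm making up for the sake of the sketch:

```python
# Rough cost model for storing a 5PB crawl with redundancy and IO headroom.
# The $0.20/GB SATA price comes from the discussion above; the replication
# factor and spindle over-provisioning are illustrative assumptions only.
RAW_PB         = 5        # deep-crawl estimate
COST_PER_GB    = 0.20     # $100 per 500GB SATA disk
REPLICATION    = 3        # copies kept for fault tolerance (assumed)
SPINDLE_FACTOR = 2        # extra disks just to get enough IOPS (assumed)

raw_gb  = RAW_PB * 1_000_000                  # decimal GB per PB
disk_gb = raw_gb * REPLICATION * SPINDLE_FACTOR
cost    = disk_gb * COST_PER_GB

print(f"Raw data:       {raw_gb:,.0f} GB")
print(f"Disk purchased: {disk_gb:,.0f} GB")
print(f"Disk cost:      ${cost:,.0f}")
# Disk cost: $6,000,000 -- and that's still just the disks,
# not CPUs, chassis, power, or networking.
```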
"For instance, one can crawl with a good crawler 1M pages/day on 1Mbps bandwidth i.e. 1B pages/day with 1Gbps. So with 20Gbps one can crawl the entire Internet daily. 20Gbps of crawling bandwidth goes for $100K/mo in the Valley, you can saturate it with, say, a thousand cheap crawlers ($1-$1.5K each). I would think that Google spends way more in their cafeteria than that :)"
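His bandwidth arithmetic is easy to sanity-check. A quick sketch, assuming a 15KB average page and fully saturated links (optimistic, but fine for a back-of-envelope):

```python
# Sanity check on crawl throughput vs. bandwidth.
# Assumes a 15KB average page and 100% link utilization (optimistic).
AVG_PAGE_BYTES = 15 * 1_000
SECONDS_PER_DAY = 86_400

def pages_per_day(mbps: float) -> float:
    bytes_per_day = mbps * 1e6 / 8 * SECONDS_PER_DAY
    return bytes_per_day / AVG_PAGE_BYTES

for mbps in (1, 1_000, 20_000):   # 1Mbps, 1Gbps, 20Gbps
    print(f"{mbps:>6} Mbps -> {pages_per_day(mbps):,.0f} pages/day")

#      1 Mbps -> 720,000 pages/day
#   1000 Mbps -> 720,000,000 pages/day
#  20000 Mbps -> 14,400,000,000 pages/day
```

That's in the same ballpark as his 1M pages/day per 1Mbps figure, so the raw bandwidth math holds up.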
This doesn't include the political aspects. You can't build a full web crawl in one day. You'd be blocked instantly. Crawlers have to be polite.
We spend a TON of time working with blog hosting companies to make sure our crawlers yield to their policies. For example, LiveJournal won't allow us to crawl with more than 10 threads.
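To give a sense of what "polite" looks like in code, here's a minimal sketch of a per-host concurrency cap. The limit of 10 mirrors the LiveJournal example above; everything else (the function names, the structure) is just illustrative, not how our crawler actually works:

```python
import threading
from collections import defaultdict
from urllib.parse import urlparse

# Minimal sketch of a per-host politeness cap: never run more than N
# simultaneous fetches against any one host. The limit of 10 mirrors the
# LiveJournal thread limit mentioned above; everything else is illustrative.
MAX_CONCURRENT_PER_HOST = 10

_host_limits = defaultdict(lambda: threading.Semaphore(MAX_CONCURRENT_PER_HOST))

def polite_fetch(url, fetch):
    """Run fetch(url), but respect the per-host concurrency limit."""
    host = urlparse(url).netloc
    with _host_limits[host]:
        return fetch(url)
```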
Borislav rephrased this as "crawling does not scale": you don't HAVE to scale it as much as you'd think, because politeness forces you to build the crawl at a slow pace.
There are other factors here though. Building a web graph, recomputing your ranking algorithm so you can re-prioritize your crawl, serving the content to clients, finding duplicates: all of this requires more resources than you initially think.
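Duplicate detection is a good example of the hidden work. Here's a minimal sketch of shingle-based near-duplicate detection; the shingle size and similarity threshold are arbitrary illustrative choices, not what any production system actually uses:

```python
# Minimal sketch of near-duplicate detection using word n-gram "shingles".
# Shingle size and similarity threshold are arbitrary illustrative choices.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(page_a), shingles(page_b)) >= threshold
```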
Of course maybe these aren't actual problems. Building this stuff is fun!