.

This is Spinn3r's offficial weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Archives

September 2009
July 2009
June 2009
May 2009
April 2009
February 2009
January 2009
December 2008
October 2008
September 2008

Thoughts on Efficient Crawling through URL Ordering

I'm re-reading "Efficient Crawling through URL Ordering" and a few other papers I've read a few years ago.

Now that I have Skim I can take notes in the PDF directly which is turning out to be amazingly productive.

It dawned on me that I should also blog these notes as well.

First, some background:

A crawler is a program that retrieves Web pages, commonly for use by a search engine [Pinkerton 1994] or a Web cache. Roughly, a crawler starts off with the URL for an initial page P0. It retrieves P0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. Then the crawler gets URLs from the queue (in some order), and repeats the process. Every page that is scanned is given to a client that saves the pages, creates an index for the pages, or summarizes or analyzes the content of the pages.

The authors discuss a number of priority metrics including query driven crawling, pagerank and backlink based crawling.

This paper is a bit dated with the authors noting that the web is about 1.5T in size. The web has grown a bit since then.

Crawl & Stop. Under this model, the crawler C starts at its initial page P0 and stops after visiting K pages. At this point a perfect crawler would have visited pages R1, ..., RK, where R1 is the page with the highest importance value, R2 is the next highest, and so on. We call pages R1 through RK the hot pages. The K pages visited by our real crawler will contain only M pages with rank higher than or equal to I(RK). We define the performance of the crawler C to be PCS(C) = (M•100)/K. The performance of the ideal crawler is of course 100%. A crawler that somehow manages to visit pages entirely at random, and may revisit pages, would have a performance of (K•100)/T, where T is the total number of pages in the Web. (Each page visited is a hot page with probability K/T. Thus, the expected number of desired pages when the crawler stops is K2/T.)

This a useful model for estimating the accuracy of a crawler. Our discovery engine approaches 100% of the connected graph. We then promote these URLs into Spinn3r which are then indexed by our crawler.

Spinn3r by definition is designed to approach 100% accuracy of the crawl with 100% realtime indexing. When a blog is posted we have to index it within 5 minutes for our clients.

For larger crawls, estimations of the efficiency become more important.

...

Query driven crawling also offers additional benefits. Back in the day, URLs weren't generated from mainstream content management systems so it wasn't really possible to extract metadata from them directly:

As we will see, for similarity, we may be able to use the text that anchors the URL u as a predictor of the text that P might contain. Thus, one possible ordering metric O(u) is IS(A, Q), where A is the anchor text of the URL u, and Q is the driving query.

Now, URLs have additional metadata that we can extract. For example, from:

http://feedblog.org/2007/12/25/what-is-wrong-with-icerocket/

We know that the article as published on 12-25-2007. We also know that the title might be "what is wrong with icerocket".

If we were performing a targeted crawl for articles created in 2007 about icerocket this would be an additional way to hint your queue to prioritize this URL for indexing.

...

Their trust metric based crawling (pagerank) is obviously superior to anyone who's researched trust metrics. Their observation that backlink counting metrics behave like a depth-first crawl is accurate and the exact same behavior I've seen with our crawlers.

Comments

Post a comment

Comments are moderated, and will not appear on this weblog until the author has approved them.

If you have a TypeKey or TypePad account, please Sign In.