
This is Spinn3r's official weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Spinn3r 2.2.1 Released

Spinn3r 2.2.1 is out the door.

This is an evolution of Spinn3r 2.2 with a number of features and fixes suggested by our user base.

New API Methods:

As a result of our recent infrastructure changes, we're now able to provide a more robust feature set to our customers.

Ninety percent of our users are served by our raw crawler API but occasionally there are questions regarding support for a specific weblog, access to archive posts, etc.

These new methods should help improve this situation by making it easier to interact with Spinn3r.

At the moment this functionality is only supported with our permalink interface. We're working on back porting this functionality to our feed API as well.

source.list

Our new source.list API is designed for customers with existing crawlers that want to tie into our spam prevention and ping infrastructure.

The source.list API was designed to help 3rd party crawlers tie into Spinn3r's ping stream and realtime polling and prioritization backend.

Returns an RSS feed with lists of weblogs that have either been found or discovered by Spinn3r or published after a given timestamp.
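As a rough illustration of consuming such a response, here is a minimal Python sketch that pulls weblog URLs out of an RSS document. The sample feed below is made up for the example and assumes plain RSS 2.0 items with a link element; the real source.list response may carry different or additional fields, so check the documentation before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body: the item layout is an assumption, not the
# documented source.list format.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>source.list</title>
<item><title>Example Blog</title><link>http://blog.example.com/</link></item>
<item><title>Another Blog</title><link>http://another.example.org/</link></item>
</channel></rss>"""


def weblog_links(rss_text):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item") if item.findtext("link")]


print(weblog_links(SAMPLE_RSS))
```

A third-party crawler could feed the returned URLs into its own polling queue.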

You can see the source.list documentation for further information.

permalink.history

A number of Spinn3r customers have requested the ability to fetch historical content for specific blogs. This is now possible with our new permalink.history method.

Given a weblog URL, return recently published articles. This can be used to find the most recent results from techcrunch.com, gigaom.com, etc.

Results are recent posts sorted in reverse chronological order.

This is made possible due to the backend database improvements we've been steadily working on over the last year. We're going to port these changes to the feed API shortly. We're waiting to bring more hardware online for this which should take 2-3 weeks.

permalink.status

This provides the ability to obtain the status for a specific post (permalink) within Spinn3r.

General Crawler Improvements

This release also includes the following crawler improvements:

Faster Polling Interval

We've migrated to 45 minute (vs 60 minute) polling intervals for all cyclical feeds and sources. Everything else in Spinn3r is updated in real time when we receive a ping.

We're going to be reducing this to a 30 minute polling interval in the next week or so. We're going to pause at 45 minutes to see if any sites complain and make sure there aren't any performance issues which we have to deal with.

This should be fine as Bloglines has been using 30 minute polling intervals for a few years now and it hasn't caused any problems.
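To put the intervals in perspective, this small Python sketch works out how many times per day a polled source gets fetched at each setting. It is only arithmetic on the numbers above; the function name is ours.

```python
def polls_per_day(interval_minutes):
    """Number of polls in 24 hours for a fixed polling interval."""
    return 24 * 60 / interval_minutes


for minutes in (60, 45, 30):
    print(f"{minutes} minute interval: {polls_per_day(minutes):.0f} polls per day")
```

Going from 60 to 30 minutes doubles the request load on every polled site, which is why pausing at 45 minutes is a reasonable intermediate step.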

Weekly Indexing of Pinged Weblogs

We've also moved to a mechanism of re-indexing pinged weblogs on a weekly basis. While 99% of blogs in our index send pings correctly, there's the possibility of a dropped ping due to a misconfigured blog host. This could be due either to an error on their part or to a temporary network outage.

To correct this behavior we've migrated to a weekly re-index mechanism where we send out our crawlers if we haven't heard from a blog in at least a week.
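The selection rule can be sketched in a few lines of Python. This is a simplified illustration and not our production scheduler: the in-memory dictionary of last ping times and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta

REINDEX_AFTER = timedelta(days=7)  # "at least a week" without a ping


def blogs_due_for_reindex(last_ping_by_blog, now):
    """Return blog URLs whose most recent ping is at least a week old."""
    return sorted(
        url for url, last_ping in last_ping_by_blog.items()
        if now - last_ping >= REINDEX_AFTER
    )


now = datetime(2008, 5, 15, 12, 0)
last_pings = {
    "http://quiet.example.com/": datetime(2008, 5, 1, 9, 0),    # two weeks silent
    "http://active.example.com/": datetime(2008, 5, 14, 18, 0),  # pinged yesterday
}
print(blogs_due_for_reindex(last_pings, now))
```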

feed.getDelta supports publisher_type

This was an omission from Spinn3r 2.2 that one of our customers pointed out.

The permalink.getDelta method supported a publisher_type but the feed.getDelta method did not.

Advanced Mainstream Media Feed Detection

Mainstream media support for RSS has always been mediocre at best. Our permalink API was designed to help improve this situation by indexing all recent posts on a given website.

The problem is we would still be missing additional metadata such as the original publication date, author, and title.

It's often difficult to discover these feeds because they may be buried deep within the website, and many of these sites don't have RSS autodiscovery set up correctly.

Kiplinger.com is a good example. This website has a number of RSS feeds but the only way to find them is to click on an 'rss' link at the bottom of the page, which is a link to another HTML page which contains a set of RSS feeds.

Some sites are even worse. AOL News has a page which lists the RSS feeds but they don't actually link to them - they link to myAOL. They have an RSS feed link when you view the page in a browser but this is actually generated via javascript which (obviously) crawlers can't see.

The solution has been to release a focused crawler for these sites to recursively index pages and attempt to find links to RSS feeds. These RSS feeds are then indexed and used to fetch additional metadata.
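Here is a minimal Python sketch of the idea: a breadth-first crawl that stays on one host, follows ordinary links to a limited depth, and collects anything that looks like a feed. The file-extension heuristic, the depth limit, and the injected fetch_html callable are all assumptions for illustration; the real crawler uses its own rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def looks_like_feed(url):
    # Assumed heuristic: judge by file extension only.
    return urlparse(url).path.lower().endswith((".rss", ".xml", ".atom"))


def find_feeds(start_url, fetch_html, max_depth=2):
    """Breadth-first crawl of one host, returning candidate feed URLs."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    feeds = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        parser = LinkCollector()
        parser.feed(fetch_html(url))
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link in seen:
                continue
            seen.add(link)
            if looks_like_feed(link):
                feeds.append(link)
            elif depth < max_depth and urlparse(link).netloc == host:
                queue.append((link, depth + 1))
    return feeds


# A tiny in-memory "site" shaped like the Kiplinger case: an 'rss' link on the
# home page leads to an HTML page that lists the actual feeds.
PAGES = {
    "http://example.com/": '<a href="/rss.html">rss</a>',
    "http://example.com/rss.html": '<a href="/feeds/news.xml">News</a>'
                                   '<a href="/feeds/money.rss">Money</a>',
}
print(find_feeds("http://example.com/", lambda url: PAGES.get(url, "")))
```

Passing the fetcher in as a function keeps the sketch testable without network access; a real crawler would plug in an HTTP client with timeouts and politeness delays.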

We've pushed the first pass of this functionality and are going to be releasing another version of our crawler that allows us to discover even more mainstream media feeds.

Documentation Updates

There have been a number of documentation updates available over on our wiki.

Specifically, the changes around the source and permalink APIs.

More to come...

We're also going to be releasing Spinn3r 2.2.2 which will have more updates in our crawler including additional support for forums and mainstream media feeds and enhancements to our core weblog discovery algorithms.

I suspect it will be about two weeks before all the backend infrastructure work is complete.

Thanks to Flickr users josef.stuefer, buntalshoot, and Mr Usaji for the amazing photos of the above spiders.

Slides from Spinn3r Architecture Talk at 2008 MySQL Users Conference

Here's a copy of the slides from the talk I just gave about the architecture of Spinn3r at the 2008 MySQL Users Conference:

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

Spinn3R Architecture Talk - 2008 Mysql Users Conference

New Spinn3r Open Ping Server

As part of Spinn3r 2.2 we've released an open ping server.

What's a ping server you ask?

In blogging, ping is an XML-RPC-based push mechanism by which a weblog notifies a server that its content has been updated. An XML-RPC signal is sent to one or more "ping servers," which can then generate a list of blogs that have new material. Many blog authoring tools automatically ping one or more servers each time the blogger creates a new post or updates an old one.

The goal here is to be somewhat independent from the other ping servers out there. This way we can avoid any downtime or problems that would occur if they vanish entirely.

We already receive pings from a number of major blog hosting providers. If you're a blog host and would like to send us your ping stream please let us know. We'd prefer that you not use the open ping server as we can audit your ping stream a bit better when it's using a custom URL.

Why would you want to send us pings? Because we crawl for a number of major search startups and analytics companies (as well as PhDs and Universities) and your users will get a solid impact from their blog post when it hits Spinn3r.

Just use the URL:

http://rpc.spinn3r.com/open/RPC2

... for your RPC router and you're set.
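For anyone wiring this up by hand, here is a hedged Python sketch of what a ping looks like on the wire. It assumes the conventional weblogUpdates.ping method with a blog name and blog URL as arguments, which is what most ping servers accept; we have not listed the exact methods supported here, so treat the method name as an assumption.

```python
import xmlrpc.client

PING_URL = "http://rpc.spinn3r.com/open/RPC2"


def build_ping_payload(blog_name, blog_url):
    """Build the XML-RPC request body for a conventional weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url), methodname="weblogUpdates.ping")


def send_ping(blog_name, blog_url, endpoint=PING_URL):
    """Send the ping over HTTP. Not called here because it needs network access."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


print(build_ping_payload("Example Blog", "http://blog.example.com/"))
```

Most blog authoring tools do the equivalent of send_ping automatically once the URL is added to their list of update services.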

Also, a note to spammers - don't even bother sending spam our way. We can handle the throughput just fine. Further, unless our discovery engine has approved the blog as being ham we're just going to drop the ping and send it to /dev/null.

... however, I pretty much assume you're going to send us spam anyway. So have at it.

More on the Wordpress Blog Spam Cancer

Technorati published more information on the wordpress blog spam cancer that's spreading around the Internet.

If you're running a version of Wordpress less than 2.5 you need to stop what you're doing NOW and upgrade! Don't wait until your blog is compromised.

The blogosphere has had its share of maladies before. Comment spam, trackback spam, splogs and link trading schemes are the colds and flus that we've come to know and groan about. But lately, a cancer has afflicted the ecosystem that has led us at Technorati to take some drastic measures. Thousands of WordPress installations out in the wilds of the web are vulnerable to security compromises, they are being actively exploited and we're not going to index them until they're fixed.

We know about them at Technorati because part of what we do is count links. Compromised blogs have been coming to our attention because they have unusually high outbound links to spam destinations. The blog authors are usually unaware that they've been p0wned because the links are hidden with style attributes to obscure their visibility. Some bloggers only find out when they've been dropped by Google, this WordPress user wrote

I've reached out to Ian Kallen to offer collaboration on fixing this issue.

We're going to push out a point release of Spinn3r to block blogs that exhibit this spam problem.

It's such a rare event to have hundreds of thousands of weblogs compromised in a systematic manner.

Spinn3r 2.2 Released

Spinn3r 2.2 rolled out the door today.

We've been working on a much larger release which is still pending, but we wanted to get new functionality out the door for some of our more recent clients.

So what's new?

We've added the ability to register weblogs directly within Spinn3r. All that's necessary is to call a new source.register method with a link to a weblog or any URL that has an RSS feed and publishes dynamic content. Spinn3r will then do the rest. We'll fetch the HTML, perform RSS autodiscovery, and then add it to our source list and start crawling in real time.
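RSS autodiscovery itself is simple to sketch. The Python below scans a page for link tags with rel="alternate" and an RSS or Atom media type, which is the standard autodiscovery convention. It is an illustration of the technique, not the code Spinn3r runs, and the class and function names are ours.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedAutodiscovery(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        media_type = attr.get("type", "").lower()
        if "alternate" in rels and media_type in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(base_url, html):
    parser = FeedAutodiscovery(base_url)
    parser.feed(html)
    return parser.feeds


PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/index.xml">'
    '<link rel="stylesheet" href="/style.css">'
    '</head><body>Hello</body></html>'
)
print(discover_feeds("http://example.com/blog/", PAGE))
```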

What's interesting is that this allows our clients to collaborate on weblog discovery. Spinn3r does a great job at discovering weblogs but there are some niche sources where we'd love to have a few more signals to help out in our spam detection.

This also fixes a number of bugs including:

  • Our permalink crawler API now adds the ability to filter by API tier.
  • We've added better mainstream media site detection.
  • A new post:resource_guid field is available within Spinn3r results to identify a unique post
  • New publisher types including FORUM, CLASSIFIED, and REVIEW.

It sounds crazy but we've also started a sub-project to allow Spinn3r to license spam content. We've had a few malware and anti-virus companies approach us looking for a solid stream of real time spam posts. Unfortunately, Spinn3r wasn't set up to provide this as 99% of our customers are only interested in ham.

This adds a new spam_probability backend variable which isn't exposed just yet. We'll allow our customers to add &spam_probability=x.x in their API call to control how much spam they want to receive.
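Since the parameter isn't exposed yet, the following Python sketch is purely speculative. It shows how a client might append spam_probability to an API call once it ships. The base URL, the vendor parameter, and the assumed 0.0 to 1.0 range are all hypothetical placeholders.

```python
from urllib.parse import urlencode


def with_spam_probability(base_url, params, spam_probability):
    """Append a spam_probability=x.x query parameter to an API call URL."""
    # Assumption: the value is a probability between 0.0 and 1.0.
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0.0 and 1.0")
    query = dict(params, spam_probability=f"{spam_probability:.1f}")
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint and parameters, for illustration only.
print(with_spam_probability("http://api.example.com/permalink.getDelta",
                            {"vendor": "acme"}, 0.5))
```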

Believe it or not, some customers would like to boost their signal a bit and accept more spam as a tradeoff to get a bit more recall.

By default, this content will only be available to the client who registered the source. This allows clients with niche requirements to index special feeds (search feeds being a good example) without hurting any of our other customers.

Spinn3r 2.5 is also right around the corner. It's taken us a bit longer than we had hoped to bring our new hardware online. You can read about our progress here on my personal blog.

Massive Blog Spam Epidemic Gets More Attention

We've been covering a massive blog spam epidemic thanks to a nasty/evil spammer who's exploiting an XML-RPC bug in Wordpress 2.2.

This issue is FINALLY getting the attention it deserves:

I had a closer look at many of the blogs concerned that had spammy content — pages promoting credit cards, pharmaceuticals and the like, and I realized that if you go to the root domain they are all legitimate blogs. Not scraper blogs that were being auto-generated with adsense / affiliate links, which was extremely curious, and actually reminiscient of something that hit home a few months ago.

A few months ago, this blog got hacked — but in a sneaky way. Not only did the hackers insert “invisible” code into my template, so that I was getting listed in Google for all manner of sneaky (and NSFW terms), so that people could click on those links with the hacker getting the affiliate cash — but *actually*, said hackers also inserted fake tempates into my wordpress theme.

Techaddress is also covering this issue...

Oddly enough Tailrank picks up on this spam because of our clustering algorithm. We cluster common links and terms via our blog index and promote these stories to our front page.

Since we 'trust' sources based on their past behavior, when major A-list blogs like ZDNet get owned we believe their links are legitimate.
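To make the failure mode concrete, here is a toy Python sketch of trust-weighted link clustering. It is not Tailrank's actual algorithm: the scoring rule (sum of the trust of each distinct linking blog), the threshold, and the sample data are all assumptions chosen to show how a compromised high-trust blog can push a spam link over the bar.

```python
from collections import defaultdict


def rank_links(posts, trust, min_score=2.0):
    """Score each link by the summed trust of the distinct blogs linking to it."""
    linkers = defaultdict(set)
    for blog, links in posts:
        for link in links:
            linkers[link].add(blog)
    scores = {
        link: sum(trust.get(blog, 0.0) for blog in blogs)
        for link, blogs in linkers.items()
    }
    promoted = [(link, score) for link, score in scores.items() if score >= min_score]
    return sorted(promoted, key=lambda pair: (-pair[1], pair[0]))


trust = {"alist.example.com": 2.0, "midsize.example.com": 1.0, "new.example.com": 0.5}
posts = [
    ("alist.example.com", ["http://spam.example.net/pills"]),  # hacked A-list blog
    ("midsize.example.com", ["http://spam.example.net/pills", "http://news.example.org/story"]),
    ("new.example.com", ["http://news.example.org/story"]),
]
print(rank_links(posts, trust))
```

In this toy data the spam link is promoted and the legitimate story is not, purely because the hacked blog carries high trust.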

If we had a smaller index this might be a bit easier to handle but we're indexing 12M blogs within Tailrank and on Spinn3r.

Another way around this of course would be to blacklist every blog running Wordpress 2.2 or earlier but we're talking millions of blogs here and we don't want to unfairly harm anyone.

To date our approach has been to wait until Tailrank has identified the spam, and then blacklist any blogs that have been compromised.

Unfortunately this is a war of attrition with the spammer just spending a few more days and hacking another dozen or so sites.

The only positive aspect of this is that it's encouraging people to upgrade to Wordpress 2.5.

We're also working on some secondary algorithms to catch this a bit sooner and we'll probably ship these in Spinn3r 2.5 which is due shortly.

Spinn3r At ICWSM Next Week

Spinn3r will be at the International Conference on Weblogs and Social Media (ICWSM) this week.

The conference looks great:

The rapid creation and consumption of social media content continues to drive the evolution of the Internet and the Web. Social media content now accounts for the majority of content published daily on the web.

As the space evolves, researcher and industrial practitioners find themselves at a key point for collaborating on research, implementation and deployment of a wide range of analyses and applications. The International Conference on Weblogs and Social Media invites researchers in the broad field of social media analysis to submit papers for its second meeting.

If you're not planning on attending and are in the Seattle area you should look at the program and reconsider.

If anyone wants to meet up and grab coffee please let me know.

Features in the Next Release of Spinn3r

A new release of Spinn3r is around the corner and I wanted to publish a link to documentation of a new feature we're shipping.

We're going to be releasing a new source API which allows our users to obtain the status of a weblog or register a new weblog (or feed) as a source within Spinn3r.

By default, these sources won't be made available to our other users. If more than one of our users registers the same URL we'll then make it a public source.

This has been a frequent request over the last few months. It should also give us another signal for our spam detection algorithms.

Spinn3r Client Driver for Perl

The guys over at Slaant were nice enough to write an Open Source Perl driver for Spinn3r.

They did all the work here and we're immensely grateful that they decided to release it as Open Source.

This is 100% native and uses Expat for XML parsing.

As part of this release I also wrote some notes on client design guidelines. It turns out that 80% of the problems are produced by common implementation issues. Things like using read and connect timeouts, correct DNS caching, UTF-8 encoding, etc.

WWW::Spinn3r is an iterative interface to the Spinn3r API. The Spinn3r API is implemented over REST and XML and documented throughly at `http://spinn3r.com/documentation'. This document makes many reference to the online doc and the reader is advised to study Spinn3r documentation before proceeding further. ...

This module gives your a perl hash interface to the API. You'll need just two functions from this module: `new()' and `next()'. `new()' creates a new instance of the API and `next()' returns the next item from the Spinn3r feed.

Yahoo Extends Semantic Web Support?

Looks like Yahoo is releasing more details about web standards, RDF, and microformat support in their search platform:

While there has been remarkable progress made toward understanding the semantics of web content, the benefits of a data web have not reached the mainstream consumer. Without a killer semantic web app for consumers, site owners have been reluctant to support standards like RDF, or even microformats. We believe that app can be web search.

By supporting semantic web standards, Yahoo! Search and site owners can bring a far richer and more useful search experience to consumers. For example, by marking up its profile pages with microformats, LinkedIn can allow Yahoo! Search and others to understand the semantic content and the relationships of the many components of its site. With a richer understanding of LinkedIn's structured data included in our index, we will be able to present users with more compelling and useful search results for their site. The benefit to LinkedIn is, of course, increased traffic quality and quantity from sites like Yahoo! Search that utilize its structured data.

... and of course a rising tide lifts all boats. I expect this will help out Spinn3r as this just means more structured content for us to index.

They're using the right vocabulary though:

In the coming weeks, we'll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.

Techcrunch has more on the subject and generally likes the direction Yahoo is taking.

New Spinn3r Reference Client Release

We just pushed a new release of our Spinn3r reference client (2.1.3.1-beta).

This is a small bug fix only release. Normally it wouldn't be a very big deal but this includes a performance optimization which can increase API throughput by about 2x in most situations.

We now fetch items 100 at a time. We couldn't do this before because of a memory issue with our HTTP implementation. If we encounter a HTTP 500 while performing a request of 100 items we temporarily fall back to 10 items.
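The fallback logic can be sketched roughly as follows in Python. The reference client itself is written in Java, so this is only an illustration of the behavior described above: the ServerError exception, the fetch callable, and its (cursor, limit) signature are hypothetical stand-ins.

```python
class ServerError(Exception):
    """Stands in for an HTTP 500 response from the API."""


def fetch_batch(fetch, cursor, full_size=100, fallback_size=10):
    """Try a full-size request; on a server error, retry once with a small batch.

    The fallback applies to this request only, so the next call starts
    again at the full batch size.
    """
    try:
        return fetch(cursor, full_size)
    except ServerError:
        return fetch(cursor, fallback_size)


def fake_fetch(cursor, limit):
    # Simulated API: large requests starting at cursor 0 fail with a 500.
    if cursor == 0 and limit > 10:
        raise ServerError("HTTP 500")
    return list(range(cursor, cursor + limit))


first = fetch_batch(fake_fetch, 0)            # falls back to 10 items
second = fetch_batch(fake_fetch, len(first))  # back to 100 items
print(len(first), len(second))
```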

This will probably increase the HTTP 500 errors on our servers by 1-2% but the performance gain for our customer base is clearly worth it.

Thirty Percent of Blogs are Spam?

Matt Mullenweg posted a February round up of Wordpress growth recently:

245,329 blogs were created. 432,478 new users joined. 1,920,593 file uploads. 2,814,893 posts and 996 thousand new pages. 4,961,330 comments. 3,813,432 logins. 540,799,534 pageviews on WordPress.com, and another 304,499,648 on self-hosted blogs. (845,299,182 pageviews total across blogs we know about.) 726,789 active blogs in February, where “active” means they got a human visitor.

A few bloggers went through and analyzed the stats to determine the percentage of live spam blogs.

If you're willing to trust Matt Mullenweg, and believe WordPress is fairly representative of blog platforms everywhere, then have we got a statistic for you: it seems that at least 30 percent of all blogs may be spam.

...

Divide the second number by the first, multiply by 100, and you get 31.7 percent. Almost one-third, if you round up, or three out of ten, if you'd prefer to round down. That's high, and that's ignoring McCarthy's "more than" and the possibility that WordPress missed some splogs.

The major question is churn and how long they've been live and in production.

Wordpress is obviously Open Source and anyone can download the code and start spamming on their own servers if they want.

The official hosted service (Wordpress.com) has done a great job at crushing blogs that make it onto the site.

Blogger seriously needs to improve their spam killing accuracy. There's much more spam that both makes it to Blogger and stays hosted there for months at a time.

The Pingosphere is another story though. We see about 90% spam coming through the ping network.

I blogged about this a few weeks ago.

Spinn3r 2.1.3 Now Available

Spinn3r 2.1.3 is live and out the door.

The biggest feature in this release is that we've brought our entire feed archive online. You can now get access to nearly 400GB of content from the last 8 months.

We also updated the response XML and included a new feed:url element for each published item. This references the URL to the source RSS feed.

This was primarily added to aid companies using a 3rd party crawler with their migration to Spinn3r.

Not every post will have a feed URL. Posts from the Six Apart update stream (specifically, LiveJournal) will not have feed:url.

As usual we've pushed a new update to the Spinn3r reference client.

This implements a new getFeedURL method which is needed to use the new result format. Existing clients shouldn't need to upgrade unless they need this specific feature.

Spinn3r 2.1.2

We just pushed Spinn3r 2.1.2. This is mostly a stabilization release without any major new features.

There are a few extensions provided which have been requested by a number of customers including:

  • Support for API tiers in the permalink.getDelta interface. This is available in the API response but not yet available as a query parameter.
  • Support for the original creation date of items found in the original RSS or Atom feed. This is now provided with atom:published. The original crawl date is also preserved and included as post:date_found. Both of these values are ISO 8601 timestamps. This is only available via the feed.getDelta method.
  • Extended support for the title and description of a weblog in API responses. This is now supported by more weblogs and included in the response of both the feed.getDelta and permalink.getDelta methods.
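Having both timestamps makes it easy to measure how long a post took to reach the index. The Python sketch below parses two ISO 8601 values and computes the lag between publication and discovery. The sample values and their exact formatting (UTC with a trailing Z) are assumptions; the function names are ours.

```python
from datetime import datetime


def parse_iso8601(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def crawl_lag_seconds(published, date_found):
    """Seconds between atom:published and post:date_found."""
    return (parse_iso8601(date_found) - parse_iso8601(published)).total_seconds()


# Made-up sample values for the two fields.
print(crawl_lag_seconds("2008-01-15T10:00:00Z", "2008-01-15T10:05:30Z"))
```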

The Spinn3r reference client has been upgraded to 2.1.2 to support additional values including the original post publication time.

We also performed some hardware upgrades during this release to handle additional load including new web servers dedicated to serving API callers.

Blog Ping and Spam Statistics

One of the great features of Spinn3r is that it has native spam prevention which prevents a good deal of worthless content from being indexed by our customers.

What we haven't done a good job of doing (until now) is exposing the sheer volume of ping spam that hits our servers.

Every hour we see nearly 720k pings. Of these pings nearly 93% of them are from spam blogs!
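Worked out in Python, using only the two figures above (both of which are approximate, so the results are too):

```python
PINGS_PER_HOUR = 720_000
SPAM_FRACTION = 0.93

spam_per_hour = round(PINGS_PER_HOUR * SPAM_FRACTION)  # pings from spam blogs
ham_per_hour = PINGS_PER_HOUR - spam_per_hour          # pings worth crawling
pings_per_second = PINGS_PER_HOUR / 3600               # sustained arrival rate

print(spam_per_hour, ham_per_hour, pings_per_second)
```

That works out to roughly 670k spam pings and about 50k legitimate pings per hour, arriving at around 200 pings per second.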

Due to the sheer volume we've had to rework a number of our algorithms to function extremely efficiently.

Here's a copy of our ping traffic for the last 24 hours. Note that the dips in the graphs are due to monitoring issues and not representative of the underlying data.

Spinn3r and Social Network Data Portability

There's been a lot of talk recently about social network data portability with Plaxo, Facebook, and Google now having employees as members of the group.

From Spinn3r's perspective, it's not just about data portability, it's about a fully open social graph.

By open, I mean no restrictions other than copyright and plenty of fair use for public data (private data is another issue altogether which quickly becomes a lot more complicated).

The blogosphere has really paved the way for this with its history of open data thanks to RSS and Atom.

MySpace should be commended for their participation in the blogosphere with their blogging system. They send pings, have RSS feeds, and don't mind that we crawl and build applications on top of their data.


There are certain hosted blogging systems (who shall remain nameless) which, while fully open, have additional restrictions for crawlers. They only allow a finite number of requests to their system. The number is so low that it's mathematically impossible to crawl all their content.


Now, it's their system, they have the right to do what they want and provide access under whatever restrictions they deem fit. However, it's the user's data - not theirs. We don't have any obligation to use their system and customers are going to flock to systems which are more open and have more compelling applications.

Don't believe me? It's not altruism - it's the free market. Users are going to flock to systems with vibrant and compelling applications.

The open content thanks to the blogosphere has brought us companies like Bloglines, Tailrank, Google Reader, Kosmix, Zvents, Powerset - I could go on.

I was reminded of this the other day when I was reading VentureBeat's coverage of Friendfeed and the irony of the fact that Facebook Feeds aren't actually RSS feeds.

This open data is becoming more and more valuable - not just to the company writing the applications that create the open data but to the entire ecosystem. So valuable in fact that NewsGator decided to make all of their applications available for free because they can sell backend appliances that index the data and build compelling applications.

This needs to be solved not from the perspective of user portability but from that of an open content network where all players have equal access to the data.

Announcing Spinn3r 2.1

I'm very excited to announce that Spinn3r 2.1 is now available.

A number of major new features have been implemented in this release which has taken us more than three months of hard work to get out the door.

We've also finished up another stage of our backend and are planning on buying a few more toys in 2008 which should make things interesting moving forward.

Let's dive into the details.

Full Crawler Support


Our new crawler functionality fetches the full HTML of every post we discover, extracts the body of the post (excluding sidebar and chrome), and provides this content under a new API.

Our reference client implements the new API and existing clients should be able to easily port their code to support this new functionality.

This is a major new architecture for us and we plan on expanding support for this moving forward.

Content Extraction

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds but they only have a one line description instead of the full content of the post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search - what's the point of using a search engine that's not indexing 80% of potential content?

We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets the greater the probability that they'll enable partial RSS feeds in order to reduce their bandwidth costs and increase click through rates.

Spinn3r 2.1 adds a new feature which can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.
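We aren't publishing the model itself, but the general idea can be illustrated with a toy heuristic in Python. The scoring below (longer blocks score higher, blocks made mostly of link text score lower) is a common stand-in and is explicitly not Spinn3r's internal content probability model; the 50-word saturation point is an arbitrary assumption.

```python
def content_probability(text_chars, link_chars, words):
    """Toy score in [0, 1]: long blocks with little link text look like content."""
    if text_chars == 0:
        return 0.0
    link_density = link_chars / text_chars  # share of the text inside <a> tags
    length_score = min(words / 50.0, 1.0)   # saturates at 50 words (assumed)
    return length_score * (1.0 - link_density)


# A navigation bar: short and entirely link text.
nav_score = content_probability(text_chars=40, link_chars=40, words=6)
# An article paragraph: long with a single short link.
body_score = content_probability(text_chars=600, link_chars=30, words=100)
print(nav_score, body_score)
```

A crawler would compute a score like this per HTML block and keep the blocks above some threshold as the post body.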

For a visual example, you can see the attached screenshot of a page with content highlighted in yellow. This screenshot was generated by passing our content extraction algorithm over this article on The New York Sun.

This method isn't always 100% accurate. It has a small probability of false negatives for the first few words of a post. As such, it's meant for consumption by algorithms and not by humans.

To increase accuracy we also implement Google ad section targeting and hAtom entry-content which help us focus on the body of the post.

We also support Yahoo's robots-nocontent to help with HTML chrome/sidebar elimination.

About 15% of our content is indexed with these additional methods.

Better Mainstream Media Support

Since Spinn3r 2.0, we've been indexing mainstream news sites such as CNN, the NY Times, etc.

We've improved this support in 2.1 by tuning and performing extensive QA for these mainstream media sites. We've implemented parsers for individual sites such as the NY Times and performed manual audits on more than 600 sites to make sure our crawlers were indexing them at optimal levels.

Expanded Online Archives

In previous versions of Spinn3r we only kept about seven days of content online at any given time. Since most of our customers are only interested in the most recent content this wasn't a problem. As we grow, we're starting to have more and more requests for access to older content.

That's easier said than done though. Spinn3r is currently indexing over 500GB of data. Providing all of our customers with real time access while maintaining adequate performance is a difficult task.

Thanks to the recent architecture updates we've performed we're now able to keep all of our content online. Right now Spinn3r has about 2.5 months of data (around 200GB) online. We're going to be expanding this in future versions to keep all of our legacy content online and available to our customers.

Spinn3r Reference Client

In November, we announced the availability of our Spinn3r reference client for Java.

This has significantly reduced our initial client implementation time. One of our recent clients was able to get up and running under two hours!

Spinn3r 2.1 has an updated reference client which supports all new protocol updates for this release and additional command line options as well.

What's Next?

Spinn3r is growing. November and December were big months for us. We closed 5 new clients and are now used by companies which have raised more than $50M in VC funding.

We're also now being used in production by a number of researchers from top universities including the University of Maryland Baltimore County (my alma mater), University of Washington, and the University of Southern California.

Having so much feedback from multiple parties makes it clear where we need to move Spinn3r in the future.

Having public stats about our crawler is a common request. Spinn3r 2.2 will introduce a lot more statistics about our crawler including the ability to view individual weblogs, total bandwidth usage, total posts per hour, etc.

Comment extraction is also a popular request. We're going to extend Spinn3r to index each individual post and extract each comment as a unique entity with individual title, link, and body for each comment.

Stay tuned. We still have a few more tricks up our sleeves.

Storing the Full Internet

The other day I blogged about Blekko and what it would take, in terms of hardware, to index the full Internet.

High Scalability responded with some interesting thoughts.

Kevin Burton calculates that Blekko, one of the barbarian hoard storming Google's search fortress, would need to spend $5 million just to buy enough weapons, er storage.

Kevin estimates storing a deep crawl of the internet would take about 5 petabytes. At a projected $1 million per petabyte that's a paltry $5 million. Less than expected. Imagine in days of old an ambitious noble itching to raise an army to conquer a land and become its new prince. For a fine land, and the search market is one of the richest, that would be a smart investment for a VC to make.

The comments are interesting.

Borislav Agapiev made some interesting comments.

"(far) less than 5 PB - a world class index would be 20B pages times 10KB per page = 200TB. This is for page storage, there would be more for storing the index i.e. posting lists. It would depend on size of individual postings and lengths of posting lists but few PB would cover it."

It's more than just the raw page HTML.

You need metadata, previous archived pages, diffs, ngram distribution for text clustering algorithms and duplicate detection and

Also, avg page size in Google's full 'net crawl is 15k. We see similar numbers in Spinn3r.
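To put numbers on the page-size question, here is a rough sketch of the raw HTML storage at both 10KB and 15KB per page. The 20B page count comes from Borislav's comment; decimal units (1KB = 1,000 bytes) are my assumption, and this covers page storage only, not the metadata and index discussed above.

```python
# Back-of-the-envelope raw page storage, using the figures quoted above.
PAGES = 20_000_000_000           # 20B pages (Borislav's estimate)
BYTES_PER_KB = 1_000             # decimal kilobyte (assumption)
BYTES_PER_TB = 1_000_000_000_000


def raw_page_storage_tb(pages, avg_page_kb):
    """Return raw HTML storage in decimal terabytes."""
    return pages * avg_page_kb * BYTES_PER_KB / BYTES_PER_TB


at_10kb = raw_page_storage_tb(PAGES, 10)
at_15kb = raw_page_storage_tb(PAGES, 15)
print(f"10KB/page: {at_10kb:.0f} TB")
print(f"15KB/page: {at_15kb:.0f} TB")
```

Moving from 10KB to 15KB per page adds 50% to the raw storage before any of the extra data is counted.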

The bottom line is the storage required is very cheap. BTW, $1M/PB = $1/GB seems too high, nowadays cheap SATA 500GB disks can be had for $100.

Yes. This is a standard SATA disk at $0.20/GB. However, this is JUST the disk. It doesn't include CPU, a redundant copy of that disk, or the additional disks needed for high IO transfer rates.

You need a LOT more spindles to be able to access this 5PB. Storing the data is one thing. Making sure it's highly available, fault tolerant, and high performance is a totally separate issue and ends up seriously increasing your costs.
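The two price points in this exchange can be compared with a quick calculation. This is only a sketch of the arithmetic: the $100 / 500GB disk and the $1M/PB projection come from the discussion above, and decimal units are my assumption.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)


def cost_per_gb(disk_price_usd, disk_size_gb):
    """Dollars per gigabyte for a single bare disk."""
    return disk_price_usd / disk_size_gb


def storage_cost(petabytes, usd_per_gb):
    """Total cost of the given capacity at a flat price per gigabyte."""
    return petabytes * GB_PER_PB * usd_per_gb


sata_per_gb = cost_per_gb(100, 500)           # bare SATA disk
bare_disk = storage_cost(5, sata_per_gb)      # disks only, no redundancy
projected = storage_cost(5, 1.0)              # $1M/PB == $1/GB
print(f"bare disks: ${bare_disk:,.0f}  projected: ${projected:,.0f}")
```

The 5x gap between the two totals is the budget left over for CPUs, redundant copies, and extra spindles.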

For instance, one can crawl with a good crawler 1M pages/day on 1Mbps bandwidth i.e. 1B pages/day with 1GBps. So with 20Gbps one can crawl entire Internet daily. 20Gbps of crawling bandwidth goes for $100K/mo in the Valley, you can saturate it with , say, thousand cheap crawlers ($1-$1.5K each). I would think that Google spends way more in their cafeteria than that :)
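As a sanity check on those bandwidth figures, the sketch below works out what page size "1M pages/day on 1Mbps" implies, and what the same rate gives at 20Gbps. It assumes a fully saturated link and decimal units, which real crawls never quite manage.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(megabits_per_second):
    """Bytes transferred in one day over a fully saturated link."""
    return megabits_per_second * 1_000_000 / 8 * SECONDS_PER_DAY


def implied_page_size_kb(megabits_per_second, pages_per_day):
    """Average page size (decimal KB) implied by a crawl rate."""
    return bytes_per_day(megabits_per_second) / pages_per_day / 1_000


page_kb = implied_page_size_kb(1, 1_000_000)
pages_at_20gbps = 1_000_000 * 20_000   # 20Gbps is 20,000 x 1Mbps
print(f"implied page size: {page_kb:.1f} KB")
print(f"pages/day at 20Gbps: {pages_at_20gbps:,}")
```

The implied page size lands close to the 10KB per page used in the storage estimate, so the numbers are at least internally consistent.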

This doesn't include the political aspects. You can't build a full web crawl in one day. You'd be blocked instantly. Crawlers have to be polite.

We spend a TON of time working with blog hosting companies to make sure our crawlers yield to their policies. For example, LiveJournal won't allow us to crawl with more than 10 threads.
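A per-host concurrency cap like this can be sketched with one semaphore per host. This is not our production code: the `HostLimiter` class, its default limit, and the hostname used below are all hypothetical, and keying on the exact hostname is a simplification.

```python
import threading
from urllib.parse import urlparse


class HostLimiter:
    """Cap the number of concurrent fetches per host (hypothetical helper)."""

    def __init__(self, default_limit=2, per_host=None):
        self._default = default_limit
        self._limits = dict(per_host or {})
        self._semaphores = {}
        self._lock = threading.Lock()

    def slot(self, url):
        """Return the semaphore guarding the host of `url`.

        Use it as a context manager around the fetch.
        """
        host = urlparse(url).hostname or ""
        with self._lock:
            if host not in self._semaphores:
                limit = self._limits.get(host, self._default)
                self._semaphores[host] = threading.BoundedSemaphore(limit)
            return self._semaphores[host]


limiter = HostLimiter(per_host={"www.livejournal.com": 10})
with limiter.slot("http://www.livejournal.com/users/example/"):
    pass  # fetch the page here; at most 10 threads get past this point
```

Worker threads block on the `with` statement once a host's limit is reached, so the cap holds no matter how many crawler threads are running.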

Borislav rephrased this by saying that "crawling does not scale" since you don't HAVE to scale it as much as you'd think because you can only build the crawl at a slow pace.

There are other factors here though. Building a web graph, recomputing your ranking algorithm so you can re-prioritize your crawl, serving the content to clients, finding duplicates, all this stuff requires more resources than you initially think.

Of course maybe these aren't actual problems. Building this stuff is fun!

Thoughts on Efficient Crawling through URL Ordering

I'm re-reading "Efficient Crawling through URL Ordering" and a few other papers I read a few years ago.

Now that I have Skim I can take notes in the PDF directly which is turning out to be amazingly productive.

It dawned on me that I should also blog these notes as well.

First, some background:

A crawler is a program that retrieves Web pages, commonly for use by a search engine [Pinkerton 1994] or a Web cache. Roughly, a crawler starts off with the URL for an initial page P0. It retrieves P0, extracts any URLs in it, and adds them to a queue of URLs to be scanned. Then the crawler gets URLs from the queue (in some order), and repeats the process. Every page that is scanned is given to a client that saves the pages, creates an index for the pages, or summarizes or analyzes the content of the pages.
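The loop described in that passage can be sketched in a few lines. This is a toy version under my own assumptions: the "Web" is an in-memory dict, `fetch_links` and `handle_page` are hypothetical callbacks, and the queue order is plain FIFO, which is exactly the part the paper sets out to improve.

```python
from collections import deque


def crawl(start_url, fetch_links, handle_page, max_pages):
    """Retrieve pages starting at start_url, following extracted URLs.

    fetch_links(url) returns the URLs found on the page.
    handle_page(url) stands in for the client that saves or indexes it.
    """
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()            # "in some order": FIFO here
        links = fetch_links(url)
        handle_page(url)
        visited.append(url)
        for link in links:
            if link not in seen:         # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


# Toy in-memory web: page -> outgoing links.
TOY_WEB = {"P0": ["A", "B"], "A": ["C"], "B": ["A", "C"], "C": []}
order = crawl("P0", lambda u: TOY_WEB.get(u, []), lambda u: None, max_pages=10)
print(order)
```

Swapping the `deque` for a priority queue keyed on an importance estimate is what turns this into the ordered crawl the paper studies.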

The authors discuss a number of priority metrics including query driven crawling, pagerank and backlink based crawling.

This paper is a bit dated with the authors noting that the web is about 1.5T in size. The web has grown a bit since then.

Crawl & Stop. Under this model, the crawler C starts at its initial page P0 and stops after visiting K pages. At this point a perfect crawler would have visited pages R1, ..., RK, where R1 is the page with the highest importance value, R2 is the next highest, and so on. We call pages R1 through RK the hot pages. The K pages visited by our real crawler will contain only M pages with rank higher than or equal to I(RK). We define the performance of the crawler C to be PCS(C) = (M•100)/K. The performance of the ideal crawler is of course 100%. A crawler that somehow manages to visit pages entirely at random, and may revisit pages, would have a performance of (K•100)/T, where T is the total number of pages in the Web. (Each page visited is a hot page with probability K/T. Thus, the expected number of desired pages when the crawler stops is K^2/T.)
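The Crawl & Stop metric quoted above is easy to compute if you already know every page's importance, which in practice you only do in an experiment. A minimal sketch, assuming the visited pages are distinct and the importance scores live in a dict (both function names are mine, not the paper's):

```python
def crawl_and_stop_performance(visited, importance):
    """PCS(C) = (M * 100) / K for a crawl that visited K distinct pages.

    importance maps every known page to its importance value I(P).
    """
    k = len(visited)
    if k == 0 or k > len(importance):
        raise ValueError("K must be between 1 and the number of known pages")
    ranked = sorted(importance.values(), reverse=True)
    threshold = ranked[k - 1]                         # I(R_K)
    m = sum(1 for page in visited if importance[page] >= threshold)
    return m * 100 / k


def random_crawler_performance(k, total_pages):
    """Expected PCS of a purely random crawler: (K * 100) / T."""
    return k * 100 / total_pages


scores = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
real = crawl_and_stop_performance(["a", "d", "b"], scores)
ideal = crawl_and_stop_performance(["a", "b", "c"], scores)
baseline = random_crawler_performance(3, len(scores))
print(f"real={real:.1f}% ideal={ideal:.1f}% random={baseline:.1f}%")
```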

This is a useful model for estimating the accuracy of a crawler. Our discovery engine approaches 100% of the connected graph. We then promote these URLs into Spinn3r, where they are indexed by our crawler.

Spinn3r by definition is designed to approach 100% accuracy of the crawl with 100% realtime indexing. When a blog is posted we have to index it within 5 minutes for our clients.

For larger crawls, estimations of the efficiency become more important.

...

Query driven crawling also offers additional benefits. Back in the day, URLs weren't generated from mainstream content management systems so it wasn't really possible to extract metadata from them directly:

As we will see, for similarity, we may be able to use the text that anchors the URL u as a predictor of the text that P might contain. Thus, one possible ordering metric O(u) is IS(A, Q), where A is the anchor text of the URL u, and Q is the driving query.
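Here is one way the anchor text ordering could look in code. The similarity function is a stand-in of my own (cosine similarity over raw term counts), not necessarily the IS(A, Q) the authors used, and the example URLs are made up.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(anchor_text, query):
    """Cosine similarity over term counts (assumed stand-in for IS(A, Q))."""
    va, vq = Counter(tokens(anchor_text)), Counter(tokens(query))
    dot = sum(count * vq[term] for term, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vq.values())
    )
    return dot / norm if norm else 0.0


def order_by_anchor(links, query):
    """links is a list of (url, anchor_text); most similar URLs come first."""
    ranked = sorted(links, key=lambda link: similarity(link[1], query), reverse=True)
    return [url for url, _ in ranked]


links = [
    ("http://example.com/a", "cooking recipes"),
    ("http://example.com/b", "blog search engines"),
    ("http://example.com/c", "search"),
]
ordered = order_by_anchor(links, "blog search")
print(ordered)
```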

Now, URLs have additional metadata that we can extract. For example, from:

http://feedblog.org/2007/12/25/what-is-wrong-with-icerocket/

We know that the article was published on 12-25-2007. We also know that the title might be "what is wrong with icerocket".

If we were performing a targeted crawl for articles created in 2007 about icerocket this would be an additional way to hint your queue to prioritize this URL for indexing.
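Pulling those hints out of a URL takes only a regular expression. This sketch assumes the /YYYY/MM/DD/slug/ permalink layout seen in the example above; other content management systems use other layouts, and `url_hints` is a name I made up.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Matches paths shaped like /2007/12/25/some-post-slug/
PERMALINK = re.compile(r"^/(\d{4})/(\d{1,2})/(\d{1,2})/([^/]+)/?$")


def url_hints(url):
    """Return (published_date, title_guess), or None if the path doesn't match."""
    match = PERMALINK.match(urlparse(url).path)
    if not match:
        return None
    year, month, day, slug = match.groups()
    try:
        published = date(int(year), int(month), int(day))
    except ValueError:          # e.g. /2007/13/45/... is not a real date
        return None
    return published, slug.replace("-", " ")


hints = url_hints("http://feedblog.org/2007/12/25/what-is-wrong-with-icerocket/")
print(hints)
```

A crawl queue could bump the priority of any URL whose date and title guess match the target year and keywords before the page is ever fetched.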

...

That their trust metric based crawling (pagerank) is superior will be obvious to anyone who's researched trust metrics. Their observation that backlink counting metrics behave like a depth-first crawl is accurate, and it's the exact same behavior I've seen with our crawlers.

Yahoo Extends Meta Crawl Tags via HTTP Headers

Yahoo has extended support for crawler control to HTTP headers (at least in Slurp):

Today we're announcing support for tags that give webmasters even more flexibility over which pages and documents are crawled and indexed by Yahoo! Search. Specifically, we're extending our support of page level exclusion tags -- NOINDEX, NOARCHIVE, NOSNIPPET, NOFOLLOW -- to provide additional control for archiving and summarization of ANY file type. Previously, these page level tags could only be expressed within html pages through the META directive (for e.g. <META NAME="Slurp" CONTENT="NOARCHIVE">), but based on feedback from our webmasters, Yahoo! now enables these tags to be expressed through X-Robots-Tag directive in the http header, giving webmasters the flexibility to achieve exclusions on PDF, Word documents, PowerPoint, video, and other file types,

I think we'll go ahead and implement this in Spinn3r.
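On the crawler side, honoring the header could look something like the sketch below. It only handles the four directives named in the announcement and assumes a comma-separated header value; the function names are hypothetical and this is not Spinn3r's implementation.

```python
KNOWN_DIRECTIVES = {"noindex", "noarchive", "nosnippet", "nofollow"}


def robots_directives(headers):
    """Collect X-Robots-Tag directives from (name, value) response headers."""
    found = set()
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            directive = part.strip().lower()
            if directive in KNOWN_DIRECTIVES:
                found.add(directive)
    return found


def may_index(headers):
    """True unless the response asked not to be indexed."""
    return "noindex" not in robots_directives(headers)


response_headers = [
    ("Content-Type", "application/pdf"),
    ("X-Robots-Tag", "NOINDEX, noarchive"),
]
directives = robots_directives(response_headers)
print(sorted(directives), may_index(response_headers))
```

Because the check only looks at response headers, it works the same for PDFs and other non-HTML files.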

This seems like a reasonable extension. I wish they had implemented their Robots-Nocontent extension in a way that made for easier parsing.

The way it's currently written you have to use a context free parser, which is difficult to write.

Spinn3r Talk Accepted at 2008 MySQL Users Conference

Our talk on the Spinn3r web crawler architecture entitled "Scaling MySQL and Java in High Write Throughput Environments" has been accepted at the 2008 MySQL Conference.

This is really exciting because we're also hoping to Open Source more components before April.

We present the backend architecture behind Spinn3r - our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground-up development of a fault tolerant distributed database and compute infrastructure, all built on top of cheap commodity hardware.

We've built out a number of technologies on top of MySQL that help enable us to easily scale operations.

We've implemented an Open Source load balancing JDBC driver named lbpool. Lbpool allows us to loosely couple our MySQL slaves, which allows us to gracefully handle system failures. It also supports load balancing, reprovisioning, slave lag, and other advanced features not available in the stock MySQL JDBC driver.

We've also built out a sharded database similar to infrastructure built at other companies such as Google (Adwords) and Yahoo (Flickr). Our sharded DB has a number of interesting properties including ultra high throughput requirements (we process 52TB per month), distributed sequence generation, and distributed query execution.
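For readers unfamiliar with the idea, here is one common way to generate sequence values across shards without coordination: give each shard its own offset and step by the shard count. This is a generic illustration and not a description of how our sharded DB actually does it; the class name, the CRC32 routing, and the example key are all assumptions.

```python
import itertools
import zlib


class ShardSequence:
    """Per-shard ID generator: shard i of n yields i, i + n, i + 2n, ..."""

    def __init__(self, shard_id, shard_count):
        if not 0 <= shard_id < shard_count:
            raise ValueError("shard_id must be in [0, shard_count)")
        self._counter = itertools.count(shard_id, shard_count)

    def next_id(self):
        return next(self._counter)


def shard_for(key, shard_count):
    """Route a key to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


SHARDS = 3
sequences = [ShardSequence(i, SHARDS) for i in range(SHARDS)]
ids = [[seq.next_id() for _ in range(3)] for seq in sequences]
print(ids)
print(shard_for("http://example.com/feed", SHARDS))
```

IDs never collide between shards, and `id % shard_count` tells you which shard issued an ID.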

High Scalability Interview about Spinn3r and Tailrank Cluster Architecture

High Scalability just published an interview about the cluster architecture of Spinn3r and Tailrank.

We lift the kimono a bit and talk about plans for the future, our current hardware architecture, and our Open Source plans:

Ever feel like the blogosphere is 500 million channels with nothing on? Tailrank finds the internet's hottest channels by indexing over 24M weblogs and feeds per hour. That's 52TB of raw blog content (no, not sewage) a month and requires continuously processing 160Mbits of IO. How do they do that?

This is an email interview with Kevin Burton, founder and CEO of Tailrank.com. Kevin was kind enough to take the time to explain how they scale to index the entire blogosphere.

Spinn3r Reference Client Moved to Google Code

We've moved the Spinn3r reference client into Google Code which should allow for a lot more collaboration with the Open Source community.

We've already received some great feedback from our client base about features to implement and misc small bug fixes.

I think that moving forward we're going to be using Google Code for all the Open Source projects that we sponsor.

Beta Announcement of Spinn3r Client Libraries

One of the things we've started to notice in the last few weeks is that as the Spinn3r API becomes more sophisticated it's becoming difficult for clients to implement.

Moving forward, we're going to release and support client drivers for multiple languages including Java, Python, Perl, and Ruby.

Today we're announcing a Java reference implementation (and Javadoc) of the Spinn3r API.

All of our drivers will be released under the Apache 2.0 license. The APL is a very liberal license and basically allows customers and researchers using the Spinn3r API to build whatever type of application they want on top of our platform without having to worry about legal and licensing implications.

Another interesting property of this implementation is that it's very small - 1500 lines of code. This means ports to other languages should be very easy.

This API will be included in Spinn3r 2.1 as a final release as there are only a few small features left to implement.

Update: Niall suggests hosting this on Google Code so there can be a public SVN repository. This makes a lot of sense. We started hosting code.tailrank.com before Google Code was released. This would be one less thing for Tailrank to admin.

Thanks also go out to adactio on Flickr for providing a cool photo of a latte under the Creative Commons attribution license.

Spinn3r Indexing 52T Per Month

I looked at our bandwidth numbers and Spinn3r has indexed 52T of raw content per month.

That's 52 TERABYTES people. Nearly 160Mbits continuous IO processed 24/7.
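For anyone checking the math, this is how a monthly volume converts to a sustained rate. The sketch assumes a 30-day month and decimal terabytes.

```python
def sustained_mbits(terabytes_per_month, days_per_month=30):
    """Average megabits per second needed to move a monthly volume."""
    bits = terabytes_per_month * 1_000_000_000_000 * 8
    seconds = days_per_month * 86_400
    return bits / seconds / 1_000_000


rate = sustained_mbits(52)
print(f"{rate:.1f} Mbit/s")
```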

A good portion of this is redundant RSS and polled HTML.

I'd really love to have the web upgraded to support Delta encoding. This would save a ton of money in bandwidth costs.

Announcing Spinn3r 2.0

After nearly a year in development, I'm pleased to announce the release of Spinn3r 2.0.

We've also been heads down working on Tailrank and are announcing Tailrank 2.5 today as well.

All of this has been possible due to the sheer amount of work we've invested into our software and hardware infrastructure. We're pretty ambitious and now that we've completed the majority of our infrastructure work we can ship more applications at a faster rate.

This release includes a number of new features which have been requested by our customers over the last year including:

Now indexing to 12M weblogs

When we first released Spinn3r 1.0 we wanted to focus on the core portion of the blogosphere which received the most attention. While this is still a major focus for us, increasing our index to cover more niche topics and user generated content has been a big request from our customer base.

We've built out a complex discovery engine to find new weblogs which takes in a number of factors including publication rate, link analysis, content analysis and other factors to determine if a blog should be added to our index.

While other services claim to support a lot more than 12M blogs we try to focus only on the portion of the blogosphere that publishes at least once per month.

We also spend a good deal of time fighting spam which is also beneficial to our user base.

Mainstream News Crawler

We've added a new mainstream media crawler which indexes the top 10k high ranking news sites such as CNN, the New York Times, and the Wall Street Journal. We expose the full HTML of each mainstream media article within our API, including the article title and additional metadata.

This will also be the basis of new crawling technology for our main index which will help us backfill partial content feeds, extract comments from blog posts, and numerous other features.

Tags

We now index tags in both HTML rel-tag form and within RSS and Atom categories. Not only are tags included in API output but you can also query for a specific tag with the API.
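For the HTML side, rel-tag extraction can be sketched with the standard library parser. In the rel-tag convention the tag name is the last path segment of the link's href; treating "+" as a space is a common convention that I'm assuming here, and this is an illustration rather than our indexer's code.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class RelTagParser(HTMLParser):
    """Collect tag names from <a rel="tag" href=".../tagname"> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "tag" in rels and href:
            path = urlparse(href).path.rstrip("/")
            # The tag is the last path segment of the href.
            name = unquote(path.rsplit("/", 1)[-1]).replace("+", " ")
            if name:
                self.tags.append(name)


def extract_rel_tags(html):
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags


sample = (
    '<p><a href="http://example.com/tag/crawler" rel="tag">crawler</a> '
    '<a href="http://example.com/tags/web+search/" rel="tag">search</a> '
    '<a href="http://example.com/about">about</a></p>'
)
found_tags = extract_rel_tags(sample)
print(found_tags)
```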

Language Classification

We've implemented an in-house language classification library (based on current research) for analyzing content and determining the underlying language based on the raw text. This allows us to discover the language for a given page or weblog simply by the author's own words.

This becomes very valuable when the language of the blog isn't specified or is incorrect due to a configuration error.

Weblog Influence Ranking

Not all weblogs should be considered equal. Certain weblogs have a lot more influence than others.

Taking this into consideration we've created a new ranking algorithm named "influence rank" which represents how successful a blog is at breaking a meme which eventually gets published on Tailrank.

Top Weblog Filtering

Only need the top 1k weblogs? Not a problem! We now expose our influence ranking (and other factors) to the user and filter results accordingly.

This has been beneficial for startups that are working on beta applications and don't yet have the compute capacity to process the blogosphere.

Author metadata

Author data is now included within the API when available from the source RSS or Atom feed.

This has been requested from numerous customers and is now a default feature in the API.

Ease of Use

We've made the API even easier to implement than before. Implementing the Spinn3r API should be as simple as downloading a URL. In fact, one of our recent customers had one of their interns implement it in their spare time! Not bad!

Free for Research Purposes

Working towards your PhD? Need a copy of the blogosphere for academic use? Not a problem! We're now making Spinn3r 100% free of charge for researchers.

Spinn3r 2.1

There were a few key features that couldn't make it into Spinn3r 2.0 that we're really excited about. We're planning on releasing Spinn3r 2.1 in the next couple weeks which will include some cool functionality (so stay tuned).

Google News As a Walled Garden?

Scoble seems to think Google News is a Walled Garden because they're blocking robots with an aggressive robots.txt entry.

On my personal blog I posted about this the other day. Seems like it's getting some more attention.

Post-mortem of an Advanced Spam Attack

Tailrank (and Spinn3r) suffered a spam attack over the weekend of July 21st. This attack was very advanced both in terms of scope and technical nature. It consisted of a large number of doorway pages that redirected to another site which tried to install malware on the victims' computers.

Source of Inbound Links to Doorway Pages

We're still analyzing the source of inbound links to the doorway pages in this attack. The attacker used vulnerable blog sites to link to .edu domains which hosted the actual content.

This analysis is made more difficult by the fact that the sources of the links often clean up the offending data before we can perform analysis.

Hosting Source for Pages

All of the content was hosted on compromised websites. These pages seem to have resulted from more than one security hole. For instance the content hosted on:

http://cadc.auburn.edu/it/generic-levitra.html

... was probably due to a file upload hole where the attacker found a way to upload files.

On the other hand URLs such as:

http://smallschools.ischool.washington.edu:8000/d_www/generic-levitra.html
http://webtango.ischool.washington.edu:8002/x_www/lesbian-incest.html
http://webtango.ischool.washington.edu:8002/a_www/japanese-girls.html

Seem to have been hosted on HTTP daemons that the attackers were able to install on those hosts.

Content of Doorway Pages

The doorway pages on the hacked servers contain content which was designed to rank well on search engines. The pages were optimized for search by specifically targeting .edu domains with high pagerank.

The attacker is using machine generated text that is weighted heavily towards a certain topic. For instance the URL:

http://cadc.auburn.edu/it/japanese-girls.html

Contains the words "japanese girls" in almost every sentence. The general strategy with this content is to rank high in searches for the target terms on search engines.

Once they have managed to attract a user to the page, it then uses javascript code to redirect the user to the payload page. This redirect is heavily obfuscated such that it is impossible to know that the page contains a redirect unless you execute the javascript.

Example redirect javascript:

if(tqfojokx969 = 'bi653')
  eval(irvpoqb515+ni271+msippqydp980+bi653+bpz978+if660+hvwnvgrhgw311+yf177+onby422+hvwnvgrhgw311);

Each of the vars being evaluated contains only a small portion of the redirect code:


var irvpoqb515='docu';
var ni271 ='ment';
var msippqydp980='.lo';
var bpz978='ti';

There are 30 more variables here, in random order, each of which forms part of the evaluated string.

The effect of this is that one cannot write a primitive scanner that can tell that there is a redirect in the code, making doorway detection difficult.

This is advantageous to the attacker as it means we cannot easily red flag pages which contain a redirect. They have also hidden the target of the redirect, so even if we knew that the payload site was malicious we could not see that in the source of the document without executing the javascript.
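To make the scanner problem concrete, the sketch below runs a naive keyword check over the four variable declarations shown earlier. The scanner itself is a hypothetical strawman, and only the fragments quoted in this post are used.

```python
import re

# The four declarations quoted above, copied verbatim.
OBFUSCATED_SOURCE = """
var irvpoqb515='docu';
var ni271 ='ment';
var msippqydp980='.lo';
var bpz978='ti';
"""


def naive_scanner_flags(source):
    """Primitive check: look for literal redirect keywords in the source."""
    return "document" in source or "location" in source


def string_literals(source):
    """Pull out the single-quoted string literals, in source order."""
    return re.findall(r"'([^']*)'", source)


flagged = naive_scanner_flags(OBFUSCATED_SOURCE)
fragments = string_literals(OBFUSCATED_SOURCE)
rebuilt = "".join(fragments[:2])
print(flagged, fragments, rebuilt)
```

The keyword never appears in the page source, yet joining the first two fragments spells it out. With the remaining variables declared in random order, a scanner has to evaluate the script to recover the full string.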

Content of the Attack Page

The attack page contains a DHTML application that pretends to scan the victim's computer for malware and then offers a Windows .exe that will supposedly clean the computer of the malware that it "found."

In reality, the .exe is almost certainly itself malware. The Ajax was very well executed and looked identical to a Windows dialog box:

[Screenshot: Picture_17]

The code was also written to customize itself depending on what version of windows was running and if the browser was Internet Explorer:


is_XP_SP2 = (navigator.userAgent.indexOf("SV1") != -1) 
            || (navigator.appMinorVersion && 
            (navigator.appMinorVersion.indexOf('SP2') != -1));

is_IE=false;

if (navigator.appName.toLowerCase()=='microsoft internet explorer') {
    if (navigator.userAgent.toLowerCase().indexOf('opera') <= 0) {
        is_IE=true;
    }
}

Analytics

The attacker tracked the referrer to detect which SEO spam campaigns and keywords were successful on specific doorway pages.

They can then use this data to determine which campaigns were most successful and focus their efforts on improving conversion.

Effect on Tailrank and Spinn3r

The majority of the attack affected Tailrank and not Spinn3r. There were only about a dozen blogs which linked through to the doorway pages involved in this attack. These blogs have either been blocked or have had their authors contacted and the spam removed.

Tailrank ended up promoting about 30 stories which were removed a few hours later.

Summary

All of these steps add up to make this a very advanced attack. The attack probably took one to two man-months of work to achieve.

Now it is likely that the work was performed where the cost of labor is much lower, but it is worth looking at the cost in these terms to understand just how motivated the attackers are.

This is an arms race that takes up a large portion of our time at Tailrank. While this attack did manage to get content into our index for a short time, it is worth noting that the amount that got through is small in comparison to the amount we block every hour.

Tailrank on LunchMeet

Jonathan and I had the opportunity to hang out with Eddie last week for LunchMeet:
The blogosphere is a noisy place filled with many interconnected conversations on all sorts of disperate topics. Tailrank, which calls itself a memetracker, is a service that tracks the zeitgeist of conversations in the blogosphere. I sat down with Kevin Burton, Tailrank's CEO and founder, and Jonathan Moore, brilliant engineer and hacker, at their home office to learn a bit about their services. Tailrank is built upon Spinn3r, another service that Burton and Moore built, that spiders and indexes the blogosphere.

Reporting Crawl Stats to Google Analytics

Google Analytics doesn't track bots. Why?

There are of course technical problems here. Google Analytics only tracks clients which can execute javascript.

I can't help thinking that we should add a feature to Spinn3r (and hopefully Google's crawler would follow) where we could report these stats to javascript reporting tools.

A REST API would be simple to implement. When a crawler indexes a page that has Google Analytics it could make a callback to an API on their end which would track the hit.

Of course this would only work with bots that are polite. Web bots that are rude wouldn't actually register.