Announcing Spinn3r 2.1
I'm very excited to announce that Spinn3r 2.1 is now available.
This release implements a number of major new features and took us more than three months of hard work to get out the door.
We've also finished up another stage of our backend and are planning on buying a few more toys in 2008, which should make things interesting moving forward.
Let's dive into the details.
Full Crawler Support
Our new crawler functionality fetches the full HTML of every post we discover, extracts the body of the post (excluding sidebar and chrome), and provides this content under a new API.
Our reference client implements the new API and existing clients should be able to easily port their code to support this new functionality.
This is a major new architecture for us and we plan on expanding support for this moving forward.
The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds but they only have a one line description instead of the full content of the post.
We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets, the greater the probability that they'll enable partial RSS feeds in order to reduce their bandwidth costs and increase click-through rates.
Spinn3r 2.1 adds a new feature which can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.
It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.
For a visual example, you can see the attached screenshot of a page with content highlighted in yellow. This screenshot was generated by passing our content extraction algorithm over this article on The New York Sun.
This method isn't always 100% accurate. It has a small probability of false negatives for the first few words of a post. As such, it's meant for consumption by algorithms and not by humans.
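To give a feel for the idea, here is a minimal sketch in Python. It is not the internal model described above: it swaps in a simple link-density heuristic (long blocks with little link text are probably content; short, link-heavy blocks are probably navigation). The names `BlockCollector`, `content_probability`, and `extract_content`, the list of block tags, and the 20-word and 0.5 thresholds are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not Spinn3r's list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "ol", "table", "tr"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and track how much of each block is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, link_chars) tuples
        self._text = []       # text pieces of the current block
        self._link_chars = 0  # characters seen inside <a> in the current block
        self._in_link = 0     # nesting depth of <a> tags
        self._skip = 0        # nesting depth of <script>/<style>

    def end_block(self):
        """Close the current block and store it if it holds any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.end_block()
        elif tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.end_block()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def content_probability(text, link_chars):
    """Crude score in [0, 1]: long blocks with little link text score high."""
    words = len(text.split())
    link_density = min(1.0, link_chars / len(text))
    length_score = min(1.0, words / 20)  # saturates at 20 words (arbitrary choice)
    return length_score * (1.0 - link_density)


def extract_content(html, threshold=0.5):
    """Return the text blocks whose score reaches the threshold."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.end_block()  # flush any trailing text
    return [text for text, link_chars in collector.blocks
            if content_probability(text, link_chars) >= threshold]


sample = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/rss">RSS</a></div>
<p>Spinn3r fetches the full HTML of every post it discovers and then tries to
separate the body of the post from the sidebar and other navigational chrome around it.</p>
<ul><li><a href="/1">Older post</a></li><li><a href="/2">Newer post</a></li></ul>
"""
content_blocks = extract_content(sample)
```

On the sample page only the paragraph survives; the navigation bar and the list of post links score close to zero. A heuristic this simple misses short content blocks entirely, which is one way to picture the kind of false negatives mentioned above.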
We also support Yahoo's robots-nocontent to help with HTML chrome/sidebar elimination.
About 15% of our content is indexed with these additional methods.
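For readers unfamiliar with robots-nocontent: it is a class value that publishers put on HTML elements to flag them as not part of the page's main content. A rough sketch of honoring it might look like the following. This is an illustration only, not our crawler code; `NoContentStripper` and `strip_nocontent` are hypothetical names, and the sketch assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoContentStripper(HTMLParser):
    """Collect page text, skipping anything inside a robots-nocontent element."""

    def __init__(self):
        super().__init__()
        self.kept = []         # text pieces outside flagged elements
        self._skip_depth = 0   # > 0 while inside a flagged element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth += 1  # nested tag inside a flagged element
            return
        classes = (dict(attrs).get("class") or "").split()
        if "robots-nocontent" in classes:
            self._skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.kept.append(data)


def strip_nocontent(html):
    """Return the page text with robots-nocontent sections removed."""
    parser = NoContentStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.kept).split())


page = ('<div class="sidebar robots-nocontent"><p>Archives</p><img src="x.png"></div>'
        '<div class="post"><p>Full text of the post.</p></div>')
post_text = strip_nocontent(page)
```

Here the sidebar is dropped and only the text of the post remains. Because the publisher marks the chrome explicitly, this path needs no guessing, but it only helps on sites that actually use the attribute.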
Better Mainstream Media Support
Since Spinn3r 2.0, we've been indexing mainstream news sites such as CNN, the NY Times, etc.
We've improved this support in 2.1 by tuning and performing extensive QA for these mainstream media sites. We've implemented parsers for individual sites such as the NY Times and performed manual audits on more than 600 sites to make sure our crawlers were indexing them at optimal levels.
Expanded Online Archives
In previous versions of Spinn3r, we only kept about seven days of content online at any given time. Since most of our customers are only interested in the most recent content, this wasn't a problem. As we grow, we're starting to get more and more requests for access to older content.
That's easier said than done, though. Spinn3r is currently indexing over 500GB of data. Providing all of our customers with real-time access while maintaining adequate performance is a difficult task.
Thanks to the recent architecture updates we've performed, we're now able to keep all of our content online. Right now Spinn3r has about 2.5 months of data (around 200GB) online. We're going to be expanding this in future versions to keep all of our legacy content online and available to our customers.
Spinn3r Reference Client
The reference client has significantly reduced our initial client implementation time. One of our recent clients was able to get up and running in under two hours!
Spinn3r 2.1 has an updated reference client which supports all protocol updates in this release, as well as additional command-line options.
Spinn3r is growing. November and December were big months for us. We closed five new clients and are now used by companies which have raised more than $50M in VC funding.
We're also now being used in production by a number of researchers from top universities, including the University of Maryland, Baltimore County (my alma mater), the University of Washington, and the University of Southern California.
Having so much feedback from multiple parties makes it clear where we need to move Spinn3r in the future.
Having public stats about our crawler is a common request. Spinn3r 2.2 will introduce a lot more statistics about our crawler including the ability to view individual weblogs, total bandwidth usage, total posts per hour, etc.
Comment extraction is also a popular request. We're going to extend Spinn3r to index each individual post and extract each comment as a unique entity with its own title, link, and body.
Stay tuned. We still have a few more tricks up our sleeves.