This is Spinn3r's official weblog, where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r, you can read our founder's blog or check out Tailrank, our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Announcing Spinn3r 2.1

I'm very excited to announce that Spinn3r 2.1 is now available.

This release implements a number of major new features and took us more than three months of hard work to get out the door.

We've also finished up another stage of our backend and are planning on buying a few more toys in 2008 which should make things interesting moving forward.

Let's dive into the details.

Full Crawler Support


Our new crawler functionality fetches the full HTML of every post we discover, extracts the body of the post (excluding sidebar and chrome), and provides this content under a new API.

Our reference client implements the new API and existing clients should be able to easily port their code to support this new functionality.

This is a major new architecture for us and we plan on expanding support for this moving forward.

Content Extraction

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds, but they only carry a one-line description instead of the full content of the post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search - what's the point of using a search engine that's not indexing 80% of potential content?

We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets, the greater the probability that it will switch to partial RSS feeds in order to reduce bandwidth costs and increase click-through rates.

Spinn3r 2.1 adds a new feature which can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.
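As a rough illustration of this kind of scan (not Spinn3r's actual model, which isn't described here), the sketch below scores each block of HTML by its length and link density. Long runs of prose with few links score high, while short, link-heavy blocks such as menus score low. The names `BlockScorer`, `content_probability`, and `extract_content`, the set of block tags, and the 200-character and 0.3 thresholds are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "ul"}


class BlockScorer(HTMLParser):
    """Collects text per block and counts how much of it sits inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (text, link_chars)
        self._text = []        # text pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth of <a> elements

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return
        self._text.append(stripped)
        if self._in_link:
            self._link_chars += len(stripped)


def content_probability(text, link_chars):
    """Hypothetical score in [0, 1]: long, link-free blocks score high."""
    length_score = min(len(text) / 200.0, 1.0)
    link_density = min(link_chars / max(len(text), 1), 1.0)
    return length_score * (1.0 - link_density)


def extract_content(html, threshold=0.3):
    """Return the text of blocks that look like content rather than navigation."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    scorer.flush()  # pick up any trailing text
    return [text for text, links in scorer.blocks
            if content_probability(text, links) >= threshold]
```

A real model would likely weigh many more signals (position in the page, punctuation, neighbouring blocks), but the shape is the same: score each region, then keep what clears a threshold.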

For a visual example, you can see the attached screenshot of a page with content highlighted in yellow. This screenshot was generated by passing our content extraction algorithm over this article on The New York Sun.

This method isn't 100% accurate. It has a small probability of false negatives for the first few words of a post. As such, it's meant for consumption by algorithms and not by humans.

To increase accuracy, we also implement Google ad section targeting and hAtom entry-content, both of which help us focus on the body of the post.

We also support Yahoo's robots-nocontent to help with HTML chrome/sidebar elimination.

About 15% of our content is indexed with these additional methods.
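The three markup hints mentioned above are all things a page author adds explicitly: AdSense section-targeting comments (`google_ad_section_start` / `google_ad_section_end`), the hAtom `entry-content` class, and Yahoo's `robots-nocontent` class. Here is a minimal sketch of how a crawler might honor them. It is an illustration only, not Spinn3r's code: the function and class names are made up for the example, the comment pattern ignores optional arguments such as weights, and the depth tracking assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

# Matches the plain form of the AdSense section-targeting comments.
AD_SECTION = re.compile(
    r"<!--\s*google_ad_section_start\s*-->(.*?)<!--\s*google_ad_section_end\s*-->",
    re.DOTALL,
)

# Elements without a closing tag; skipped so depth counting stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def ad_section_html(html):
    """Return the HTML fragments wrapped in ad section comments."""
    return [fragment.strip() for fragment in AD_SECTION.findall(html)]


class ClassScopedText(HTMLParser):
    """Splits page text into text inside and outside elements of one class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside a matching element
        self.inside = []    # text found inside matching elements
        self.outside = []   # all other text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.inside if self.depth else self.outside).append(text)


def entry_content_text(html):
    """Text inside hAtom entry-content elements."""
    parser = ClassScopedText("entry-content")
    parser.feed(html)
    parser.close()
    return " ".join(parser.inside)


def strip_nocontent(html):
    """Text with everything inside robots-nocontent elements removed."""
    parser = ClassScopedText("robots-nocontent")
    parser.feed(html)
    parser.close()
    return " ".join(parser.outside)
```

Because these hints come from the publisher, they can be trusted more than a statistical guess when they are present, which is presumably why they are useful as a complement to the probability model.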

Better Mainstream Media Support

Since Spinn3r 2.0, we've been indexing mainstream news sites such as CNN, the NY Times, etc.

We've improved this support in 2.1 by tuning and performing extensive QA for these mainstream media sites. We've implemented parsers for individual sites such as the NY Times and performed manual audits on more than 600 sites to make sure our crawlers were indexing them at optimal levels.

Expanded Online Archives

In previous versions of Spinn3r, we only kept about seven days of content online at any given time. Since most of our customers are only interested in the most recent content, this wasn't a problem. As we grow, we're getting more and more requests for access to older content.

That's easier said than done, though. Spinn3r is currently indexing over 500GB of data. Providing all of our customers with real-time access while maintaining adequate performance is a difficult task.

Thanks to our recent architecture updates, we're now able to keep all of our content online. Right now Spinn3r has about 2.5 months of data (around 200GB) online. We're going to expand this in future versions to keep all of our legacy content online and available to our customers.

Spinn3r Reference Client

In November, we announced the availability of our Spinn3r reference client for Java.

This has significantly reduced our initial client implementation time. One of our recent clients was able to get up and running in under two hours!

Spinn3r 2.1 has an updated reference client which supports all new protocol updates for this release and additional command line options as well.

What's Next?

Spinn3r is growing. November and December were big months for us. We closed 5 new clients and are now used by companies which have raised more than $50M in VC funding.

We're also now being used in production by a number of researchers from top universities, including the University of Maryland Baltimore County (my alma mater), the University of Washington, and the University of Southern California.

Having so much feedback from multiple parties makes it clear where we need to move Spinn3r in the future.

Having public stats about our crawler is a common request. Spinn3r 2.2 will introduce a lot more statistics about our crawler including the ability to view individual weblogs, total bandwidth usage, total posts per hour, etc.

Comment extraction is also a popular request. We're going to extend Spinn3r to index each individual post and extract each comment as a unique entity with individual title, link, and body for each comment.
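To make "each comment as a unique entity" concrete, here is one hypothetical shape such a record could take. Nothing below reflects a published Spinn3r format: the `Comment` type, the `to_comment_entities` helper, the "Comment by ..." title, and the `#comment-N` permalink scheme are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Comment:
    """One comment treated as its own item, mirroring a post's fields."""
    title: str
    link: str
    body: str


def to_comment_entities(post_link: str,
                        raw_comments: List[Tuple[str, str]]) -> List[Comment]:
    """Turn (author, body) pairs scraped from a post into Comment records.

    The title and permalink formats are invented for this sketch.
    """
    return [
        Comment(
            title=f"Comment by {author}",
            link=f"{post_link}#comment-{index}",
            body=body,
        )
        for index, (author, body) in enumerate(raw_comments, start=1)
    ]


comments = to_comment_entities(
    "http://example.com/post",
    [("Joe W.", "Any details on how the content extraction is done?")],
)
```

Giving each comment its own link means it can be stored, deduplicated, and searched the same way a post is.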

Stay tuned. We still have a few more tricks up our sleeves.

Comments

Joe W.

Any details on how the content extraction is done?
