This is Spinn3r's official weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Spinn3r 2.2.1 Released

Spinn3r 2.2.1 is out the door.

This is an evolution over Spinn3r 2.2, with a number of features and fixes suggested by our user base.

New API Methods:

As a result of our recent infrastructure changes, we're now able to provide a more robust feature set to our customers.

Ninety percent of our users are served by our raw crawler API but occasionally there are questions regarding support for a specific weblog, access to archive posts, etc.

These new methods should help improve this situation by making it easier to interact with Spinn3r.

At the moment this functionality is only supported with our permalink interface. We're working on backporting it to our feed API as well.

source.list

Our new source.list API is designed for customers with existing crawlers who want to tie into Spinn3r's spam prevention, ping stream, and real-time polling and prioritization backend.

It returns an RSS feed listing weblogs that were either discovered by Spinn3r or published after a given timestamp.

You can see the source.list documentation for further information.
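
As a rough sketch of how a third-party crawler might consume a response like this, the Python below parses an RSS document and pulls out the link of each item. The sample feed, its element layout, and the function name extract_source_links are assumptions made for illustration; the actual response format is whatever the source.list documentation describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample response. The real source.list output may use
# different elements or namespaces.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>sources</title>
    <item><title>Example A</title><link>http://a.example.com/</link></item>
    <item><title>Example B</title><link>http://b.example.com/</link></item>
  </channel>
</rss>"""


def extract_source_links(rss_text):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links
```

A crawler could then feed each returned URL into its own fetch queue.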

permalink.history

A number of Spinn3r customers have requested the ability to fetch historical content for specific blogs. This is now possible with our new permalink.history method.

Given a weblog URL, this method returns recently published articles. This can be used to find the most recent results from techcrunch.com, gigaom.com, etc.

Results are returned in reverse chronological order.
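
To illustrate the ordering only, here is a minimal Python sketch that filters a list of posts by weblog host and sorts them newest first. The post dictionaries, their field names (link, published), and the function recent_posts are hypothetical; this is not how permalink.history is implemented on our backend.

```python
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical in-memory posts; field names are assumptions.
SAMPLE_POSTS = [
    {"link": "http://techcrunch.com/a", "published": datetime(2008, 5, 14, 9, 0)},
    {"link": "http://gigaom.com/x", "published": datetime(2008, 5, 15, 8, 0)},
    {"link": "http://techcrunch.com/b", "published": datetime(2008, 5, 15, 10, 0)},
]


def recent_posts(posts, weblog_url, limit=10):
    """Return up to `limit` posts from the weblog's host, newest first.

    `weblog_url` must include a scheme (e.g. http://) so the host can be parsed.
    """
    host = urlparse(weblog_url).hostname
    matching = [p for p in posts if urlparse(p["link"]).hostname == host]
    matching.sort(key=lambda p: p["published"], reverse=True)
    return matching[:limit]
```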

This is made possible due to the backend database improvements we've been steadily working on over the last year. We're going to port these changes to the feed API shortly. We're waiting to bring more hardware online for this which should take 2-3 weeks.

permalink.status

This provides the ability to obtain the status for a specific post (permalink) within Spinn3r.

General Crawler Improvements

This release also includes the following crawler improvements:

Faster Polling Interval

We've migrated to 45-minute (vs. 60-minute) polling intervals for all cyclical feeds and sources. Everything else in Spinn3r is updated in real time when we receive a ping.

We plan to reduce this to a 30-minute polling interval in the next week or so. We're pausing at 45 minutes first to see whether any sites complain and to make sure there aren't any performance issues we have to deal with.

This should be fine as Bloglines has been using 30 minute polling intervals for a few years now and it hasn't caused any problems.
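
For readers building their own poller, the interval check can be sketched in a few lines of Python. The names POLL_INTERVAL and is_due are made up for this example, and the logic is a simplification rather than our actual scheduler.

```python
from datetime import datetime, timedelta

# 45 minutes today; the post describes a planned move to 30 minutes.
POLL_INTERVAL = timedelta(minutes=45)


def is_due(last_polled: datetime, now: datetime,
           interval: timedelta = POLL_INTERVAL) -> bool:
    """Return True once at least `interval` has passed since the last poll."""
    return now - last_polled >= interval
```

Passing interval=timedelta(minutes=30) models the shorter interval without touching the rest of the code.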

Weekly Indexing of Pinged Weblogs

We've also moved to re-indexing pinged weblogs on a weekly basis. While 99% of blogs in our index send pings correctly, a ping can still be dropped because of a misconfigured blog host. This could be due either to an error on their part or to a temporary network outage.

To correct this behavior we've migrated to a weekly re-index mechanism where we send out our crawlers if we haven't heard from a blog in at least a week.
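
The selection rule can be sketched roughly as follows in Python. The mapping of blog URL to last ping time and the function name blogs_needing_reindex are assumptions for the example; our production scheduler is more involved.

```python
from datetime import datetime, timedelta

# A blog that has been silent for at least this long gets a crawler visit.
REINDEX_AFTER = timedelta(days=7)


def blogs_needing_reindex(last_ping_by_blog, now, max_silence=REINDEX_AFTER):
    """Return the URLs (sorted) of blogs with no ping for at least `max_silence`.

    `last_ping_by_blog` maps a blog URL to the datetime of its last ping.
    """
    return sorted(
        url
        for url, last_ping in last_ping_by_blog.items()
        if now - last_ping >= max_silence
    )
```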

feed.getDelta supports publisher_type

This was an omission from Spinn3r 2.2 that one of our customers pointed out.

The permalink.getDelta method supported a publisher_type but the feed.getDelta method did not.

Advanced Mainstream Media Feed Detection

Mainstream media support for RSS has always been mediocre at best. Our permalink API was designed to help improve this situation by indexing all recent posts on a given website.

The problem is we would still be missing additional metadata such as the original publication date, author, and title.

It's often impossible to discover these feeds automatically because they may be buried deep within the website, and many of these sites don't have RSS autodiscovery set up correctly.

Kiplinger.com is a good example. This website has a number of RSS feeds but the only way to find them is to click on an 'rss' link at the bottom of the page, which is a link to another HTML page which contains a set of RSS feeds.

Some sites are even worse. AOL News has a page which lists the RSS feeds but they don't actually link to them - they link to myAOL. They have an RSS feed link when you view the page in a browser, but this is actually generated via JavaScript, which (obviously) crawlers can't see.

The solution has been to release a focused crawler for these sites to recursively index pages and attempt to find links to RSS feeds. These RSS feeds are then indexed and used to fetch additional metadata.
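
To give a feel for the approach, here is a small Python sketch of a focused crawl that walks same-host pages breadth first and collects anything that looks like a feed link. Everything in it is an assumption for illustration: the feed heuristic (MIME type or file extension), the page limit, the names find_feeds and looks_like_feed, and the fetch callable, which the caller supplies and which returns a page's HTML or None. It is not our production crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class LinkCollector(HTMLParser):
    """Collect (href, type) pairs from <a> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            attr = dict(attrs)
            if attr.get("href"):
                self.links.append((attr["href"], (attr.get("type") or "").lower()))


def looks_like_feed(url, mime_type):
    """Heuristic only: a feed MIME type or a feed-like file extension."""
    path = urlparse(url).path.lower()
    return mime_type in FEED_TYPES or path.endswith((".rss", ".xml", ".atom"))


def find_feeds(start_url, fetch, max_pages=50):
    """Crawl same-host pages from start_url and return sorted feed URLs."""
    host = urlparse(start_url).hostname
    queue = deque([start_url])
    seen = {start_url}
    feeds = set()
    pages_visited = 0
    while queue and pages_visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages_visited += 1
        if html is None:
            continue
        parser = LinkCollector()
        parser.feed(html)
        for href, mime_type in parser.links:
            target = urljoin(url, href)
            if looks_like_feed(target, mime_type):
                feeds.add(target)
            elif urlparse(target).hostname == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(feeds)
```

This mirrors the Kiplinger.com case above: the front page links to an intermediate HTML page, and only that page links to the feeds themselves.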

We've pushed the first pass of this functionality and are going to be releasing another version of our crawler that allows us to discover even more mainstream media feeds.

Documentation Updates

A number of documentation updates are available on our wiki, specifically the changes around the source and permalink APIs.

More to come...

We're also going to be releasing Spinn3r 2.2.2, which will have more crawler updates, including additional support for forums and mainstream media feeds and enhancements to our core weblog discovery algorithms.

I suspect it will be about two weeks before all the backend infrastructure work is complete.

Thanks to Flickr users josef.stuefer, buntalshoot, and Mr Usaji for the amazing photos of the above spiders.
