Spinn3r 2.2.1 Released
Spinn3r 2.2.1 is out the door.
This is an evolution of Spinn3r 2.2 that adds a number of features and fixes suggested by our user base.
New API Methods:
As a result of our recent infrastructure changes, we're now able to provide a more robust feature set to our customers.
Ninety percent of our users are served by our raw crawler API but occasionally there are questions regarding support for a specific weblog, access to archive posts, etc.
These new methods should help improve this situation by making it easier to interact with Spinn3r.
At the moment this functionality is only supported with our permalink interface. We're working on backporting it to our feed API as well.
source.list
Our new source.list API is designed for customers with existing crawlers that want to tie into our spam prevention and ping infrastructure.
The source.list API was designed to help 3rd party crawlers tie into Spinn3r's ping stream and realtime polling and prioritization backend. It returns an RSS feed listing weblogs that have either been found or discovered by Spinn3r or published after a given timestamp.
You can see the source.list documentation for further information.
permalink.history
A number of Spinn3r customers have requested the ability to fetch historical content for specific blogs. This is now possible with our new permalink.history method.
Given a weblog URL, this method returns recently published articles. It can be used to find the most recent posts from techcrunch.com, gigaom.com, etc. Results are sorted in reverse chronological order.
This is made possible by the backend database improvements we've been steadily working on over the last year. We're going to port these changes to the feed API shortly. We're waiting to bring more hardware online for this, which should take 2-3 weeks.
permalink.status
This provides the ability to obtain the status for a specific post (permalink) within Spinn3r.
General Crawler Improvements
This release also includes the following crawler improvements:
Faster Polling Interval
We've migrated to 45 minute (vs 60 minute) polling intervals for all cyclical feeds and sources. Everything else in Spinn3r is updated in real time when we receive a ping.
We plan to reduce this to a 30 minute polling interval in the next week or so. We're pausing at 45 minutes first to see if any sites complain and to make sure there aren't any performance issues we have to deal with.
This should be fine as Bloglines has been using 30 minute polling intervals for a few years now and it hasn't caused any problems.
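As a rough illustration of interval-based polling, here is a minimal sketch of the check a scheduler might run for each cyclical source. The function name `is_due` and its parameters are hypothetical and are not part of Spinn3r; only the 45 minute interval comes from this post.

```python
from datetime import datetime, timedelta
from typing import Optional

# Interval taken from the post; the planned follow-up value is 30 minutes.
POLL_INTERVAL = timedelta(minutes=45)


def is_due(last_polled: Optional[datetime], now: datetime,
           interval: timedelta = POLL_INTERVAL) -> bool:
    """Return True when a cyclical source should be polled again.

    A source that has never been polled (last_polled is None) is always due.
    """
    return last_polled is None or now - last_polled >= interval
```

Passing `interval=timedelta(minutes=30)` would model the shorter interval once it is rolled out.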
Weekly Indexing of Pinged Weblogs
We've also moved to a mechanism of re-indexing pinged weblogs on a weekly basis. While 99% of blogs in our index send pings correctly, there's the possibility of a dropped ping due to a misconfigured blog host. This could be due either to an error on their part or to a temporary network outage.
To correct this behavior we've migrated to a weekly re-index mechanism where we send out our crawlers if we haven't heard from a blog in at least a week.
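The selection step can be pictured with a short sketch. It assumes a mapping from blog URL to the time of its last received ping; the function name `blogs_to_reindex` and this data layout are illustrative assumptions, not a description of our backend.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# "At least a week" of silence, as described in the post.
REINDEX_AFTER = timedelta(weeks=1)


def blogs_to_reindex(last_ping_by_blog: Dict[str, datetime], now: datetime,
                     threshold: timedelta = REINDEX_AFTER) -> List[str]:
    """Return the blogs whose last ping is at least `threshold` old, sorted by URL."""
    return sorted(
        url
        for url, last_ping in last_ping_by_blog.items()
        if now - last_ping >= threshold
    )
```

Blogs returned by this check would be handed to the crawler for a fresh index pass, while blogs that pinged within the week are left alone.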
feed.getDelta supports publisher_type
This was an omission from Spinn3r 2.2 that one of our customers pointed out.
The permalink.getDelta method supported a publisher_type but the feed.getDelta method did not.
Advanced Mainstream Media Feed Detection
Mainstream media support for RSS has always been mediocre at best. Our permalink API was designed to help improve this situation by indexing all recent posts on a given website.
The problem is we would still be missing additional metadata such as the original publication date, author, and title.
It's impossible to discover these feeds because they may be buried deep within the website, and many of these sites don't have RSS autodiscovery set up correctly.
Kiplinger.com is a good example. This website has a number of RSS feeds but the only way to find them is to click on an 'rss' link at the bottom of the page, which is a link to another HTML page which contains a set of RSS feeds.
Some sites are even worse. AOL News has a page which lists the RSS feeds but doesn't actually link to them; the entries link to myAOL. There is an RSS feed link when you view the page in a browser, but it is generated via JavaScript which (obviously) crawlers can't see.
The solution has been to release a focused crawler for these sites to recursively index pages and attempt to find links to RSS feeds. These RSS feeds are then indexed and used to fetch additional metadata.
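To give a feel for the approach, here is a minimal sketch of a focused crawler that walks one site breadth-first and collects links that look like feeds. It is not our production crawler: the names `LinkExtractor` and `find_feeds`, the depth limit, and the rule that anchor URLs ending in .rss or .xml are feeds are all assumptions made for illustration. The `fetch` argument is any callable that takes a URL and returns the page HTML, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class LinkExtractor(HTMLParser):
    """Collect feed URLs and ordinary page links from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = set()
        self.pages = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        if tag == "link" and (attrs.get("type") or "").lower() in FEED_TYPES:
            # Standard autodiscovery markup, when the site provides it.
            self.feeds.add(url)
        elif tag == "a":
            # Heuristic (assumption): feed URLs usually end in .rss or .xml.
            if urlparse(url).path.lower().endswith((".rss", ".xml")):
                self.feeds.add(url)
            else:
                self.pages.add(url)


def find_feeds(start_url, fetch, max_depth=2):
    """Breadth-first crawl of a single host, returning the feed URLs found."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    feeds = set()
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        feeds |= parser.feeds
        if depth < max_depth:
            for link in parser.pages:
                # Stay on the starting host and never revisit a page.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return feeds
```

With a site laid out like the Kiplinger example, where the home page links to an HTML page that in turn lists the feeds, `find_feeds` reaches the feed links on the second hop. Links generated by JavaScript, as in the AOL case, would still be invisible to this sketch.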
We've pushed the first pass of this functionality and are going to be releasing another version of our crawler that allows us to discover even more mainstream media feeds.
Documentation Updates
A number of documentation updates are available on our wiki, specifically changes around the source and permalink APIs.
More to come...
We're also going to be releasing Spinn3r 2.2.2, which will have more crawler updates, including additional support for forums and mainstream media feeds and enhancements to our core weblog discovery algorithms.
I suspect it will be about two weeks before all the backend infrastructure work is complete.
Thanks to Flickr users josef.stuefer, buntalshoot, and Mr Usaji for the amazing photos of the above spiders.
