
This is Spinn3r's official weblog, where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.


Feed Update Protocols and SUP

It looks like FriendFeed is proposing a new update protocol for RSS that avoids the thundering herd problem inherent in RSS polling.

When you add a web site like Flickr or Google Reader to FriendFeed, FriendFeed's servers constantly download your feed from the service to get your updates as quickly as possible. FriendFeed's user base has grown quite a bit since launch, and our servers now download millions of feeds from over 43 services every hour.

VentureBeat has more on the subject (as does Tech Confidential):

It looks like the rapid fire site updates are about to start again for the social content conversation site FriendFeed. Just a few days after the launch of its new “beta” area, FriendFeed is finalizing a new technology that could help pull content into the site at a much faster rate.

The technology, called Simple Update Protocol (SUP) will process updates from the various services that FriendFeed imports faster than it currently does using traditional Really Simple Syndication (RSS) feeds, FriendFeed co-founder Paul Buchheit told Tech Confidential.

Spinn3r has a similar problem, of course, but we have 17.5M sources to consider.

The requirements are straightforward:

* Simple to implement. Most sites can add support with only few lines of code if their database already stores timestamps.
* Works over HTTP, so it's very easy to publish and consume.
* Cacheable. A SUP feed can be generated by a cron job and served from a static text file or from memcached.
* Compact. Updates can be about 21 bytes each. (8 bytes with gzip encoding)
* Does not expose usernames or secret feed urls (such as Google Reader Shared Items feeds)

Sites wishing to produce a SUP feed must do two things:

* Add a special tag to their SUP enabled Atom or RSS feeds. This tag includes the feed's SUP-ID and the URL of the appropriate SUP feed.
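
To make the consumer side concrete, here is a minimal sketch of how a crawler might use a SUP feed once it has learned each feed's SUP-ID. The JSON field names (`updates` as a list of `[sup_id, update_id]` pairs, plus the timestamps) are my reading of the SUP draft and should be treated as assumptions; the sample document, the SUP-IDs, and the example.com URLs are made up for illustration.

```python
import json

# Hypothetical SUP document. Field names are assumed from the SUP draft.
SAMPLE_SUP = """
{
  "updated_time": "2008-08-28T17:25:00Z",
  "since_time": "2008-08-28T17:24:00Z",
  "period": 60,
  "updates": [["abc123", "u1"], ["9f8e7d", "u2"], ["abc123", "u3"]]
}
"""

# SUP-ID -> feed URL, learned earlier from the special tag in each feed.
# Both the IDs and the URLs are illustrative.
SUBSCRIPTIONS = {
    "abc123": "http://example.com/feeds/alice.atom",
    "55aa55": "http://example.com/feeds/bob.atom",
}


def feeds_to_refetch(sup_text, subscriptions):
    """Return subscribed feed URLs whose SUP-ID appears in the update list."""
    doc = json.loads(sup_text)
    seen = set()
    urls = []
    for sup_id, _update_id in doc.get("updates", []):
        # Skip SUP-IDs we do not follow and collapse repeated updates.
        if sup_id in subscriptions and sup_id not in seen:
            seen.add(sup_id)
            urls.append(subscriptions[sup_id])
    return urls


print(feeds_to_refetch(SAMPLE_SUP, SUBSCRIPTIONS))
```

Only the feeds that actually changed get downloaded; everything else waits for the slow fallback poll.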

It's interesting that this is getting attention again, because Dave Winer proposed something similar in RSS 2.0:

<cloud> is an optional sub-element of <channel>.

It specifies a web service that supports the rssCloud interface which can be implemented in HTTP-POST, XML-RPC or SOAP 1.1.

Its purpose is to allow processes to register with a cloud to be notified of updates to the channel, implementing a lightweight publish-subscribe protocol for RSS feeds.

In this example, to request notification on the channel it appears in, you would send an XML-RPC message to radio.xmlstoragesystem.com on port 80, with a path of /RPC2. The procedure to call is xmlStorageSystem.rssPleaseNotify.
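
For comparison, the registration call described above can be sketched with Python's standard `xmlrpc.client`. The procedure name comes from the quoted text; the parameter order (notify procedure, port, path, protocol, list of feed URLs) is my recollection of the rssCloud interface and should be checked against the spec, and `myApp.feedChanged`, port 8080, and the example.com feed are hypothetical values. The sketch only builds the request body and does not contact any server.

```python
import xmlrpc.client


def build_please_notify(notify_procedure, port, path, feed_urls):
    """Serialize an rssPleaseNotify request (assumed parameter order)."""
    params = (notify_procedure, port, path, "xml-rpc", list(feed_urls))
    return xmlrpc.client.dumps(
        params, methodname="xmlStorageSystem.rssPleaseNotify"
    )


# Hypothetical subscriber: call us back at myApp.feedChanged on port 8080.
payload = build_please_notify(
    "myApp.feedChanged", 8080, "/RPC2", ["http://example.com/rss.xml"]
)
print(payload)
```

The resulting XML would be POSTed to the host, port, and path named in the cloud element.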

However, SUP is not XML-RPC (which is probably good, since I'm a REST fan).

By using SUP-IDs instead of feed urls, we avoid having to expose the feed url, avoid URL canonicalization issues, and produce a more compact update feed (because SUP-IDs can be a database id or some other short token assigned by the service).

This can be avoided by just using the unique source URL. The feed is irrelevant. Just map the source URL to the feed URL on your end.

Because it is still possible to miss updates due to server errors or other malfunctions, SUP does not completely eliminate the need for polling. However, when using SUP, feed consumers can reduce polling frequency while simultaneously reducing update latency. For example, if a site such as FriendFeed switched from polling feeds every 30 minutes to polling every 300 minutes (5 hours), and also monitored the appropriate SUP feed every 3 minutes, the total amount of feed polling would be reduced by about 90%, and new updates would typically appear 10 times as fast.
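
The numbers in that example are easy to check. The sketch below assumes that worst-case update latency equals the interval at which you look for changes, and it ignores the cost of fetching the SUP feed itself (one small request every few minutes, regardless of how many feeds you follow). The function name is mine.

```python
def polling_tradeoff(old_poll_min, new_poll_min, sup_poll_min):
    """Return (fraction of feed polls saved, latency speedup factor)."""
    # Polls per feed scale with 1 / interval, so the saving is 1 - old/new.
    feed_poll_reduction = 1 - old_poll_min / new_poll_min
    # Updates are now noticed on the SUP interval instead of the old poll interval.
    latency_speedup = old_poll_min / sup_poll_min
    return feed_poll_reduction, latency_speedup


reduction, speedup = polling_tradeoff(30, 300, 3)
print(f"{reduction:.0%} fewer feed polls, about {speedup:.0f}x faster updates")
```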

Spinn3r takes a hybrid approach. We index pinged sources once per week but also index them as soon as they ping us. Best of both worlds, basically.

The current ping space is all over the map, though.

There's XML-RPC, XML, the Six Apart update stream, and now JSON.
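
As a rough illustration of that hybrid (this is a toy sketch, not Spinn3r's actual scheduler), the idea is a weekly fallback crawl plus an immediate crawl whenever a ping arrives. The class and method names are hypothetical, and timestamps are plain seconds to keep the example self-contained.

```python
from dataclasses import dataclass, field

WEEK_SECONDS = 7 * 24 * 60 * 60


@dataclass
class HybridScheduler:
    last_indexed: dict = field(default_factory=dict)  # source URL -> timestamp
    pending: list = field(default_factory=list)       # sources that pinged us

    def add_source(self, url, now):
        self.last_indexed[url] = now

    def on_ping(self, url):
        # A ping queues the source for immediate indexing.
        if url in self.last_indexed and url not in self.pending:
            self.pending.append(url)

    def due(self, now):
        """Return sources to index now: pinged ones first, then stale ones."""
        pinged = list(self.pending)
        self.pending.clear()
        stale = [
            url
            for url, ts in self.last_indexed.items()
            if now - ts >= WEEK_SECONDS and url not in pinged
        ]
        batch = pinged + stale
        for url in batch:
            self.last_indexed[url] = now
        return batch


sched = HybridScheduler()
sched.add_source("http://example.com/a", now=0)
sched.add_source("http://example.com/b", now=0)
sched.on_ping("http://example.com/b")
print(sched.due(now=60))            # only the pinged source
print(sched.due(now=WEEK_SECONDS))  # the source that never pinged
```

The weekly pass catches sources whose pings were lost, which is the same reason SUP keeps a slow poll around.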

This doesn't seem too different from Changes.xml...

Witness http://blogsearch.google.com/changes.xml vs http://friendfeed.com/api/sup.json
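
To show how close the two formats are in practice, here is a sketch that reduces both to the same thing: a list of source URLs that changed. The `weblog` element with a `url` attribute is my recollection of the changes.xml layout, and the `updates` array of `[sup_id, update_id]` pairs is my reading of the SUP draft, so treat both as assumptions. The sample documents and the SUP-ID mapping are made up.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical samples in each format.
SAMPLE_CHANGES_XML = (
    '<weblogUpdates version="2" count="1">'
    '<weblog name="Example" url="http://example.com/" when="12"/>'
    "</weblogUpdates>"
)
SAMPLE_SUP_JSON = '{"period": 60, "updates": [["abc123", "u1"]]}'


def updates_from_changes_xml(text):
    """Extract updated source URLs from a changes.xml-style document."""
    root = ET.fromstring(text)
    return [w.get("url") for w in root.iter("weblog") if w.get("url")]


def updates_from_sup_json(text, sup_id_to_source):
    """Extract updated source URLs from a SUP document via a local ID map."""
    doc = json.loads(text)
    return [
        sup_id_to_source[sup_id]
        for sup_id, _update_id in doc.get("updates", [])
        if sup_id in sup_id_to_source
    ]


print(updates_from_changes_xml(SAMPLE_CHANGES_XML))
print(updates_from_sup_json(SAMPLE_SUP_JSON, {"abc123": "http://example.com/"}))
```

The only real difference is the extra lookup from SUP-ID back to a source, which is the indirection discussed earlier.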

I'm not sure what the solution is here but it's clear we need some standardization in this area.

One suggestion for SUP is to not use a JSON-only protocol. Having an alternative REST/XML version seems to be advantageous for people who don't want to put a second parser framework in production.
