This is Spinn3r's official weblog, where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Announcing Spinn3r 2.0

After nearly a year in development, I'm pleased to announce the release of Spinn3r 2.0.

We've been heads down working on Tailrank too, and are announcing Tailrank 2.5 today as well.

All of this has been possible due to the sheer amount of work we've invested into our software and hardware infrastructure. We're pretty ambitious and now that we've completed the majority of our infrastructure work we can ship more applications at a faster rate.

This release includes a number of new features which have been requested by our customers over the last year including:

Now indexing 12M weblogs

When we first released Spinn3r 1.0 we wanted to focus on the core portion of the blogosphere which received the most attention. While this is still a major focus for us, increasing our index to cover more niche topics and user generated content has been a big request from our customer base.

We've built out a complex discovery engine to find new weblogs. It weighs a number of factors, including publication rate, link analysis, and content analysis, to determine whether a blog should be added to our index.

While other services claim to support a lot more than 12M blogs, we try to focus only on the portion of the blogosphere that publishes at least once per month.

We also spend a good deal of time fighting spam, which benefits our user base.

Mainstream News Crawler

We've added a new mainstream media crawler which indexes the top 10k high-ranking news sites, such as CNN, the New York Times, and the Wall Street Journal. We expose the full HTML of each mainstream media article within our API, including the article title and additional metadata.

This will also be the basis of new crawling technology for our main index, which will help us backfill partial content feeds, extract comments from blog posts, and add numerous other features.

Tags

We now index tags in both HTML rel-tag form and within RSS and Atom categories. Not only are tags included in API output but you can also query for a specific tag with the API.
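To make the two tag sources concrete, here is a minimal Python sketch of pulling tags out of rel-tag links in HTML and out of RSS/Atom category elements. It is only an illustration using the standard library, not Spinn3r's crawler code; the function names `html_tags` and `feed_tags` are made up for this example, and treating `+` in a tag URL as a space is an assumption about how the tag links are encoded.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


class RelTagParser(HTMLParser):
    """Collects tags from <a rel="tag" href=".../tagname"> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "tag" in rel_values and href:
            # rel-tag takes the tag name from the last path segment of the href.
            last = urlparse(href).path.rstrip("/").rsplit("/", 1)[-1]
            if last:
                # Assumption: "+" encodes a space, then percent-escapes are decoded.
                self.tags.append(unquote(last.replace("+", " ")))


def html_tags(html):
    """Return the rel-tag tags found in an HTML document, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags


def feed_tags(xml_text):
    """Return category tags from an RSS 2.0 or Atom feed."""
    root = ET.fromstring(xml_text)
    tags = []
    # RSS 2.0 carries the tag as element text: <category>python</category>
    for el in root.iter("category"):
        if el.text and el.text.strip():
            tags.append(el.text.strip())
    # Atom carries the tag in the "term" attribute: <category term="python"/>
    for el in root.iter(ATOM_NS + "category"):
        term = el.get("term")
        if term:
            tags.append(term)
    return tags
```

Both helpers return plain lists of strings, so the results from a post's HTML and from its feed entry can be merged and de-duplicated before indexing.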

Language Classification

We've implemented an in-house language classification library (based on current research) that analyzes content and determines the underlying language from the raw text. This allows us to discover the language of a given page or weblog simply from the author's own words.

This becomes very valuable when the language of the blog isn't specified or is incorrect due to a configuration error.
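For readers curious how text-only language detection can work, here is a minimal Python sketch of one common approach: comparing character trigram frequencies against per-language profiles. This post does not say which technique the in-house library uses, so treat this as a generic illustration rather than a description of it. The training sentences and the names `trigram_profile`, `cosine_similarity`, and `classify` are made up for the example, and a real classifier would be trained on far more text.

```python
from collections import Counter
from math import sqrt

# Toy training text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat "
          "on the mat while it was raining",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato "
          "se sentó en la alfombra mientras llovía",
}


def trigram_profile(text):
    """Count character trigrams in lowercased, whitespace-normalized text."""
    normalized = " " + " ".join(text.lower().split()) + " "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling `classify("the dog was on the mat", PROFILES)` picks the profile that shares the most trigrams with the input. Because the decision depends only on the text itself, it works even when a feed declares no language or declares the wrong one.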

Weblog Influence Ranking

Not all weblogs should be considered equal. Certain weblogs have a lot more influence than others.

Taking this into consideration we've created a new ranking algorithm named "influence rank" which represents how successful a blog is at breaking a meme which eventually gets published on Tailrank.

Top Weblog Filtering

Only need the top 1k weblogs? Not a problem! We now expose our influence ranking (and other factors) to the user and filter results accordingly.

This has been beneficial for startups that are working on beta applications and don't yet have the compute capacity to process the blogosphere.
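On the consuming side, restricting work to the top-ranked weblogs amounts to a top-N selection over the ranking value. The Python sketch below shows that idea on local data. The record layout and the `influence` field name are hypothetical and do not reflect the actual API output; it also assumes that a higher value means more influence.

```python
import heapq

# Hypothetical records; the real API's field names and scale may differ.
SAMPLE_WEBLOGS = [
    {"url": "a.example", "influence": 0.2},
    {"url": "b.example", "influence": 0.9},
    {"url": "c.example", "influence": 0.5},
]


def top_weblogs(weblogs, limit):
    """Return the `limit` most influential weblogs, highest influence first."""
    return heapq.nlargest(limit, weblogs, key=lambda w: w["influence"])
```

`heapq.nlargest` avoids sorting the full list, which matters when the candidate set is millions of weblogs and only the top 1k are needed.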

Author metadata

Author data is now included in the API when it is available from the source RSS or Atom feed.

This has been requested by numerous customers and is now a default feature in the API.

Ease of Use

We've made the API even easier to implement than before. Implementing the Spinn3r API should be as simple as downloading a URL. In fact, one of our recent customers had one of their interns implement it in their spare time! Not bad!

Free for Research Purposes

Working towards your PhD? Need a copy of the blogosphere for academic use? Not a problem! We're now making Spinn3r 100% free of charge for researchers.

Spinn3r 2.1

There were a few key features we're really excited about that couldn't make it into Spinn3r 2.0. We're planning to release Spinn3r 2.1 in the next couple of weeks, and it will include some cool functionality (so stay tuned).
