After nearly a year in development, I'm pleased to announce the release of Spinn3r 2.0.
We've also been heads down working on Tailrank and are announcing Tailrank 2.5 today as well.
All of this has been possible due to the sheer amount of work we've invested in our software and hardware infrastructure. We're pretty ambitious, and now that we've completed the majority of our infrastructure work, we can ship more applications at a faster rate.
Spinn3r 2.0 includes a number of new features that our customers have requested over the last year, including:
Now indexing to 12M weblogs
When we first released Spinn3r 1.0 we wanted to focus on the core portion of the blogosphere which received the most attention. While this is still a major focus for us, increasing our index to cover more niche topics and user generated content has been a big request from our customer base.
We've built out a complex discovery engine to find new weblogs. It weighs a number of factors, including publication rate, link analysis, and content analysis, to determine whether a blog should be added to our index.
While other services claim to support a lot more than 12M blogs, we try to focus only on the portion of the blogosphere that publishes at least once per month.
We also spend a good deal of time fighting spam, which benefits our user base.
Mainstream News Crawler
We've added a new mainstream media crawler which indexes the top 10k highest-ranking news sites, such as CNN, the New York Times, and the Wall Street Journal. We expose the full HTML of each mainstream media article within our API, including the article title and additional metadata.
This will also be the basis of new crawling technology for our main index, which will help us backfill partial content feeds, extract comments from blog posts, and support numerous other features.
Tags
We now index tags in both HTML rel-tag form and within RSS and Atom categories. Not only are tags included in API output but you can also query for a specific tag with the API.
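To make the two tag sources concrete, here is a minimal sketch in Python of pulling tags out of HTML rel-tag links and out of RSS and Atom category elements. It is an illustration using only the standard library, not Spinn3r's own code, and the function names `html_rel_tags` and `feed_categories` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


class _RelTagParser(HTMLParser):
    """Collects tags from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rels and href:
            # In rel-tag, the tag is the last path segment of the link target.
            last = urlparse(href).path.rstrip("/").rsplit("/", 1)[-1]
            if last:
                self.tags.append(unquote(last.replace("+", " ")))


def html_rel_tags(html):
    """Hypothetical helper: return rel-tag tags found in an HTML string."""
    parser = _RelTagParser()
    parser.feed(html)
    return parser.tags


def feed_categories(xml_text):
    """Hypothetical helper: return RSS <category> text and Atom category terms."""
    tags = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "category" and el.text and el.text.strip():
            tags.append(el.text.strip())      # RSS 2.0 style
        elif el.tag == ATOM_NS + "category" and el.get("term"):
            tags.append(el.get("term"))       # Atom style
    return tags
```

Both helpers return plain lists of strings, so tags from a post's HTML and from its feed entry can be merged and de-duplicated with a set.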
Language Classification
We've implemented an in-house language classification library (based on current research) for analyzing content and determining the underlying language based on the raw text. This allows us to discover the language for a given page or weblog simply from the author's own words.
This becomes very valuable when the language of the blog isn't specified or is incorrect due to a configuration error.
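For readers curious how language can be guessed from raw text alone, here is a small Python sketch of one well-known approach: comparing ranked character n-gram profiles (the method usually attributed to Cavnar and Trenkle). This is a stand-in for illustration only. The post does not say which technique the in-house library uses, and the function names, the `MAX_PENALTY` constant, and the tiny training sentences are all assumptions.

```python
from collections import Counter

MAX_PENALTY = 300  # assumed cost for an n-gram missing from a language profile


def ngram_profile(text, n=3, top=MAX_PENALTY):
    """Return the most frequent character n-grams, most common first."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else MAX_PENALTY
        for i, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data; a real classifier needs far more text per language.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and "
                        "the cat sat on the mat with the other animals"),
    "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso "
                        "y el gato se sentó en la alfombra con los otros animales"),
}
```

With these toy profiles, `classify("the dog and the cat", profiles)` picks `"en"` and `classify("el perro y el gato", profiles)` picks `"es"`. Short or mixed-language posts are much harder, which is why real systems train on large samples per language.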
Weblog Influence Ranking
Not all weblogs should be considered equal. Certain weblogs have a lot more influence than others.
Taking this into consideration, we've created a new ranking algorithm named "influence rank", which represents how successful a blog is at breaking a meme that eventually gets published on Tailrank.
Top Weblog Filtering
Only need the top 1k weblogs? Not a problem! We now expose our influence ranking (and other factors) to the user and filter results accordingly.
This has been beneficial for startups that are working on beta applications and don't yet have the compute capacity to process the blogosphere.
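As a rough picture of what top-N filtering amounts to, the Python sketch below keeps only the highest-ranked weblogs from a list of records. The record layout, the `influence_score` field name, and the convention that a higher score means more influence are assumptions made for this example, not the actual API format.

```python
import heapq


def top_weblogs(weblogs, n, key="influence_score"):
    """Return the n weblogs with the highest score, best first."""
    return heapq.nlargest(n, weblogs, key=lambda weblog: weblog[key])


# Hypothetical records for illustration only.
weblogs = [
    {"url": "http://a.example.com", "influence_score": 12.5},
    {"url": "http://b.example.com", "influence_score": 98.1},
    {"url": "http://c.example.com", "influence_score": 40.0},
]
top_two = top_weblogs(weblogs, 2)
```

Using `heapq.nlargest` avoids sorting the whole list, which matters when only the top 1k are needed out of millions of weblogs.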
Author metadata
Author data is now included within the API when available from the source RSS or Atom feed.
This has been requested by numerous customers and is now a default feature in the API.
Ease of Use
We've made the API even easier to implement than before. Implementing the Spinn3r API should be as simple as downloading a URL. In fact, one of our recent customers had one of their interns implement it in their spare time! Not bad!
Free for Research Purposes
Working towards your PhD? Need a copy of the blogosphere for academic use? Not a problem! We're now making Spinn3r 100% free of charge for researchers.
Spinn3r 2.1
There were a few key features that couldn't make it into Spinn3r 2.0 that we're really excited about. We're planning on releasing Spinn3r 2.1 in the next couple of weeks, which will include some cool functionality (so stay tuned).