.

This is Spinn3r's offficial weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Archives

June 2009
May 2009
April 2009
February 2009
January 2009
December 2008
October 2008
September 2008
August 2008
July 2008

Spinn3r Sponsors 2009 International Conference for Weblogs and Social Data Challenge

200810211134Spinn3r is sponsoring the International Conference for Weblogs and Social Media this year with a snapshot of our index.

The data set was designed for use by researchers to build cool and interesting applications with the data.

Good research topics might include...
  • link analysis
  • social network extraction
  • tracing the evolution of news
  • blog search and filtering
  • psychological, socialogical, ethnographic, or personality-based studies
  • analysis of influence among bloggers
  • blog summarization and discource analysis

We're already used by a number of researchers in top universities. Textmap (which presented at ICWSM last year) just migrated to using Spinn3r and Blogs Cascades has been using us for a while now.

The data set is pretty large. 142GB compressed (27GB uncompressed) but you need a solid chunk of data to perform interesting research.

The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 20092008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, (27 GB compressed).

This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial
crisis; ...) as well as everything else you might expect to find posted to blogs.

To get access to the Spinn3r dataset, please download and sign the usage agreement , and email it to dataset-request (at) icwsm.org. Once your form has been processed, you will be sent a URL and password where you can download the collection.

Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.

Comments

Post a comment

Comments are moderated, and will not appear on this weblog until the author has approved them.

If you have a TypeKey or TypePad account, please Sign In