This is Spinn3r's official weblog where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.

Spinn3r Hiring Five new Engineers (and growing rapidly)

Spinn3r is growing fast. We've had an exceptional month (an exceptional year actually). Closing new deals. Releasing new features for our customers. Working on new backend architecture changes, and generally having a lot of fun in the process.

We've been posting to Craigslist like mad in the last few weeks but I wanted to take the time to post to our blog.

We're hiring five new Engineers to join the team with us here in San Francisco.

This is in addition to the two new Engineers we've hired in the last couple months.

We're hiring two Crawl Engineers, an Operations Engineer, a Support and QA Engineer, and a Java Engineer.

Spinn3r is a great place to work. Smart people. Huge amounts of data. Great customers. New offices in SOMA (we're in an awesome 103 year old building) and plenty of interesting problems to work on...

Spinn3r Hiring Support Engineer

We're hiring a Support Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders) so we plan on taking our time to find the right candidate.

That said, this is an awesome opportunity to get in and work on a rapidly growing startup.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog and social media data.

We crawl the entire blogosphere in real time, rank and classify blogs, and remove spam. We then provide this information to our customers in a clean format for use within IR applications.

Spinn3r is rare in the startup world in that we're actually profitable. We've proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We've also been smart and haven't raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

For more information please visit our website.

Responsibilities:

  • Interact with customers both in the early sales cycle and in a support role to answer technical questions about our technology (crawling, ranking, etc) (20%).
  • Monitor our crawler stats to understand its operation and detect operational anomalies, implement new features, etc. (20%).
  • Work on Java implementation of various new Spinn3r features as well as fix bugs in our current product. You will also be working on infrastructure in this position and responsible for various backend Java components of our architecture. (60%)
  • General passion and interest in technology (distributed systems, open content, Web 2.0, etc).

I should stress that while you'll be interacting with customers, and providing support, our customers are exceedingly brilliant and amazingly knowledgeable about our space. They're a major asset and staying in sync with them is very important for the company.

Requirements and Experience:

  • Java (though Python, C, C++, etc would work fine).
  • Ability to understand customer needs and prioritize feature requests.
  • Friendly, patient, and excellent people skills when interacting with customers.
  • Understanding HTTP
  • Databases (MySQL, etc).
  • Ability (and appreciation) for working in a Startup environment.
  • Must like cats :)

NYTimes, on Memetracker, and Spinn3r

I didn't have time to blog about this when it was originally posted, but the NYTimes has a great piece on the work Jure Leskovec and Jon Kleinberg have done on Memetracker (which is powered by Spinn3r).

For the most part, the traditional news outlets lead and the blogs follow, typically by 2.5 hours, according to a new computer analysis of news articles and commentary on the Web during the last three months of the 2008 presidential campaign.

The finding was one of several in a study that Internet experts say is the first time the Web has been used to track — and try to measure — the news cycle, the process by which information becomes news, competes for attention and fades.

Researchers at Cornell, using powerful computers and clever algorithms, studied the news cycle by looking for repeated phrases and tracking their appearances on 1.6 million mainstream media sites and blogs. Some 90 million articles and blog posts, which appeared from August through October, were scrutinized with their phrase-finding software.

Spinn3r at Real Time Stream Crunchup, RSS Speedups, and Startup Discounts

Spinn3r will be at the Real Time Stream Crunchup tomorrow (which should be fun). There should be more announcements about realtime RSS which is interesting:
While there is an argument to be made that RSS is dying, being replaced by more instantaneous forms of content delivery such as Twitter and other real time streams, many people aren’t quite yet ready to give up on it. Instead, they want to save it by speeding it up. Tomorrow, at our Real Time Stream CrunchUp, we will see three demos of projects that do just that in slightly different ways.

Google engineers Brad Fitzpatrick and Brett Slatkin will show a demo of a new push protocol called pubsubhubub, Netvibes CEO Freddy Mini will demo his similar RSS Instant Update Hub, and WordPress engineer Andy Skelton will show off a Jabber client which uses the XMPP protocol to push blog headlines into an IM-like environment faster than RSS.

If these receive significant adoption Spinn3r will implement them pretty quickly (we push revisions every week). Gnip also launched their early stage partner program:
Gnip is also launching a early-stage startup partner program that will let startups access to all of Gnip’s service features and data services. The program is aimed towards software development startups that have been in business for less than 3 years and generating less than $200,000 in revenue. Of course, Gnip requires that partners pay a fee of $1000 but says the services that they will receive are valued at $10,000 per month.
Spinn3r has been offering this for more than three years now. If you're an early stage startup and you need access to Spinn3r data, we can provide it at a fraction of the price. I should also note that if you're a research organization we can provide you with data for free. We provide access to Spinn3r data to more than 100 PhDs from universities worldwide.

Spinn3r Hiring Crawl Engineer

We're hiring a Crawl Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders) so we plan on taking our time to find the right candidate.

That said, it's an awesome opportunity to get in and work on a rapidly growing startup.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog and social media data.

We crawl the entire blogosphere in real time, rank and classify blogs, and remove spam. We then provide this information to our customers in a clean format for use within IR applications.

Spinn3r is rare in the startup world in that we're actually profitable. We've proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We've also been smart and haven't raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

For more information please visit our website.

Responsibilities:

  • Maintain our current crawler.
  • Monitor and implement statistics behind the current crawler to detect anomalies.
  • Implement new features for customers
  • Work on backend architecture to improve performance and stability.
  • Implement custom protocol extension for enhanced metadata and site specific social media support.
  • Work on new products and features using large datasets.

Requirements and Experience:

  • Java (though Python, C, C++, etc would work fine).
  • HTML, XML, RSS or Atom.
  • HTTP
  • Distributed systems.
  • Databases (MySQL, etc).
  • Algorithm design (especially in distributed systems).
  • Ability (and appreciation) for working in a Startup environment.
  • Must like cats :)

Ideal:

  • Past experience running and working with large crawlers.
  • Understanding of IR algorithms (K-means, naive bayes, inverted index compression, etc).
  • Experience within the Open Source community.

Questions:

For bonus points, feel free to answer the following questions in your email:

  • You have a live corpus of text and HTML from 25M weblogs. You want to cluster these weblogs into logical communities (tech, politics, entertainment, etc). How (and with what algorithm) would you cluster and rank the content within a reasonable time? What computational resources would this require (memory, CPU, network bandwidth, etc)?
  • You are building a ranking algorithm. This algorithm will execute across a large link graph. How would you store the graph to use the smallest (yet reasonable) amount of computational resources (memory and CPU)?

Spinn3r 3.1 - Now with Twitter Support and Social Media Ranking

Spinn3r 3.1 just went live today and we're announcing two new features.

Twitter Firehose Support

Spinn3r listens to a new Twitter firehose API which is a sample of the full Twitter feed.

All Twitter content is classified with a new MICROBLOG publisher type which we will be using for Twitter, Identi.ca, Jaiku, and other Microblog systems.

Further, it is language classified (using an algorithm we have developed) and includes all metadata from Twitter including publication time, author name, handle, etc.

All content is indexed in real time and available within Spinn3r a few moments after it is published.

There were a lot of requests for this feature (Social Media is HOT) so I expect a lot of innovation from our customers.

Technically this is still in beta but we feel it's ready for use in production applications (once we get some feedback from our users).

Here's the current breakdown of Twitter vs other social media and blog content in Spinn3r. While Twitter is larger it's important to realize this is much less content since Twitter posts are short.

[Chart: breakdown of Twitter vs. other social media and blog content in Spinn3r]

Social Media Rank

We also published some of our new ranking technology which has been in development for a while now (more than a year).

We're indexing social media sites and computing rank on users based on their social graph.

The results are pretty interesting. Scobleizer OWNS Friendfeed. Techcrunch consistently places high: they're #3 on Friendfeed, #274 on Digg, and #24 on Twitter.

It's also interesting how the founders of these social media properties consistently place high, even over celebrities. Ev Williams is still #5 on Twitter. Kevin Rose is still #1 on Digg. Paul is #16 on Friendfeed.

Sources (or nodes) are ranked by authority whereby the more friends or inbound links you have the higher your rank.

Our key differentiator is that we do not consider raw inbound link count to be an accurate representation of authority. This is highly vulnerable to spam and rank errors as users who attract a large number of links (either through black hat methods, link baiting, or viral marketing) can inflate their rankings (and harm other legitimate users).

We consider the quality of inbound links to be far more important. You can observe this in our results, as the authority for a source is not a direct function of raw inbound links. Some users can have high authority but (relatively) very few inbound links.
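To make the difference between raw inbound link count and link quality concrete, here is a minimal sketch using a PageRank-style power iteration. This is only an illustration under stated assumptions: the post does not describe Spinn3r's actual ranking algorithm, and the function names, damping factor, and toy graph below are all hypothetical.

```python
def authority_scores(links, damping=0.85, iterations=50):
    """PageRank-style scores: a link counts for more when it comes from a
    high-authority source. `links` maps each source to the sources it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (no outbound links): spread its rank over everyone.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


def inbound_counts(links):
    """Raw inbound link count per node, for comparison."""
    counts = {}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical toy graph: "spammy" collects five links from throwaway accounts
# that nobody links to; "expert" gets only two links, but from well-linked hubs.
links = {f"throwaway_{i}": ["spammy"] for i in range(5)}
links.update({f"reader_{i}": ["hub_a", "hub_b"] for i in range(6)})
links["hub_a"] = ["expert"]
links["hub_b"] = ["expert"]

scores = authority_scores(links)
counts = inbound_counts(links)
print("inbound links:", counts["spammy"], "vs", counts["expert"])
print("expert outranks spammy:", scores["expert"] > scores["spammy"])
```

In this toy graph "spammy" has more inbound links than "expert", yet "expert" ends up with the higher score because its links come from sources that are themselves well linked.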

We're really eager for feedback here. If you have any comments on our ranking system feel free to contact us with your thoughts.

Blogfa.com (Iranian Blog Network) Down Due to Iranian Protests and Civil Unrest?

If you've been paying attention to the news, there are massive protests and riots all over Tehran following the recent (potentially corrupt) election.

The phones have been cut off and you can't make calls into Tehran.

Apparently, you can't blog if you're hosted on Blogfa.com.

They're down at the moment and not responding.

Here's the current Spinn3r graph for content being published from Blogfa.

We monitor all the posts across the blog networks for anomalies. Needless to say, this would be a significant anomaly.

Update: The HTTP server is up and responding, but with Service Unavailable (HTTP 503).

[Graph: content published from Blogfa, as seen by Spinn3r]

Papers using Spinn3r Data from ICWSM 2009

Additional papers based on the Spinn3r/ICWSM dataset have been published. It seems I have a lot of reading to do!

Flash Floods and Ripples: The Spread of Media Content through the Blogosphere

This paper is based on the Spinn3r data set (ICWSM 2009), which consists of web feeds collected during a two month period in 2008. The data set includes posts from blogs as well as other data sources like news feeds. We discuss our methodology for cleaning up the data and extracting posts of popular blog domains for the study. Because the Spinn3r data set spans multiple blog domains and language groups, this gives us a unique opportunity to study the link structure and the content sharing patterns across multiple blog domains. For a representative type of content that is shared in the blogosphere, we focus on videos of the popular web-based broadcast media site, YouTube.

Our analysis, based on 8.7 million blog posts by 1.1 million blogs across 15 major blog hosting sites, reveals a number of interesting findings. First, the network structure of blogs shows a heavy-tailed degree distribution, low reciprocity, and low density. Although the majority of the blogs connect only to a few others, certain blogs connect to thousands of other blogs. These high-degree blogs are often content aggregators, recommenders, and reputed content producers. In contrast to other online social networks, most links are unidirectional and the network is sparse in the blogosphere. This is because links in social networks represent friendship where reciprocity and mutual friends are expected, while blog links are used to reference information from other data sources.

Identifying Personal Stories in Millions of Weblog Entries

Stories of people's everyday experiences have long been the focus of psychology and sociology research, and are increasingly being used in innovative knowledge-based technologies. However, continued research in this area is hindered by the lack of standard corpora of sufficient size and by the costs of creating one from scratch. In this paper, we describe our efforts to develop a standard corpus for researchers in this area by identifying personal stories in the tens of millions of blog posts in the ICWSM 2009 Spinn3r Dataset. Our approach was to employ statistical text classification technology on the content of blog entries, which required the creation of a sufficiently large set of annotated training examples. We describe the development and evaluation of this classification technology and how it was applied to the dataset in order to identify nearly a million personal stories.


In this paper, we describe our efforts to overcome the limitations of our previous story collection research using new technologies and by capitalizing on the availability of a new weblog dataset. In 2009, the 3rd International AAAI Conference on Weblogs and Social Media sponsored the ICWSM 2009 Data Challenge to spur new research in the area of weblog analysis. A large dataset was released as part of this challenge, the ICWSM 2009 Spinn3r Dataset (ICWSM, 2009), consisting of tens of millions of weblog entries collected and processed by Spinn3r.com, a company that indexes, interprets, filters, and cleanses weblog entries for use in downstream applications. Available to all researchers who agree to a dataset license, this corpus consists of a comprehensive snapshot of weblog activity between August 1, 2008 and October 1, 2008. Although this dataset was described as containing 44 million weblog entries when it was originally released, the final release of this dataset actually consists of 62 million entries in Spinn3r.com's XML format.

SentiSearch: Exploring Mood on the Web

Given an accurate mood classification system, one might imagine it to be simple to configure the classifier as a search filter, thus creating a mood-based retrieval system. However, the challenge lies in the fact that in order to classify the mood for a potential result, the entire content of that page must be downloaded and analyzed. Much like a typical web-based retrieval system, to avoid this cost, pages could be crawled and their mood indexed along with the representation stored for search indexing. Alternatively, the presence of a massive dataset from www.spinn3r.com enabled the ESSE system to be built, performing mood classification and result filtering on the fly (Burton et al. 2009). Because the dataset (including textual content), search system, and mood classification system all exist on the same server, the filtering retrieval system was made possible. The dataset not only allows access to the content of a blog post (beyond the summary and title typically made available through search APIs) but the closed nature of the dataset allows for experimentation while still being vast enough to provide breadth and depth of topical coverage.

Event Intensity Tracking in Weblog Collections

The data provided for ICWSM 2009 came from a weblog indexing service Spinn3r (http://spinn3r.com). This included 60 million postings spanned over August and September 2008. Some meta-data is provided by Spinn3r.

Each post comes with Spinn3r’s pre-determined language tag. Around 24 million posts are in English, 20 million more are labeled as ‘U’, and the remaining 16 million are comprised of 27 other languages (Fig. 3). The languages are encoded in ISO 639 two-letter codes (ISO 639 Codes, 2009). Other popular languages include Japanese (2.9 million), Chinese/Japanese/Korean (2.7 million) and Russian (2.5 million). The second largest label is U unknown. This data could potentially hold posts in languages not yet seen or posts in several languages. Our present work, including additional dataset analysis presented next, is limited to the English posts unless otherwise specified. In future work we plan to also consider other languages represented in the dataset.

Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network

Our research is the first attempt to give an accurate measure for the level of information propagation. This paper presents ‘SugarCube’, a model designed to tackle part of this problem by offering a mathematically precise solution for the quantification of the level of topic propagation. The paper also covers the application of SugarCube in the analysis of the social network structure of the ICWSM/Spinn3r dataset (ICWSM 2009). It presents threshold values for the communities found within the collection, and paves the way for the measurement of topic propagation within those communities. Not only can SugarCube quantify the proliferation level of topics, but it also helps to identify ‘heavily-propagated’ or Global topics. This novel approach is inspired by Percolation Theory and its application in Physics (Efros 1986).

Spinn3r talk from ICWSM 2009

Here are the slides from my talk at ICWSM 2009. The talk went really well I think. Lots of great questions from the audience.

The winning paper, "Flash Floods and Ripples: The Spread of Media Content through the Blogosphere", was very good and I'm excited to read it in full when I get a few moments.

[Embedded slides: Spinn3r ICWSM presentation]

Spinn3r 3.0: New Features, New Architecture, New APIs - More Goodness

I'm proud to announce that we have just released Spinn3r 3.0 after more than a year of development.

This has been quite a lot of work based on feedback from our customer base and ships with some really awesome functionality.

Most of this time has been spent on architecture but a good deal has been spent implementing features for our rapidly growing user base.

When you outsource a major component of your infrastructure, like crawling, you tend to lean on it heavily and push it to the very edge.

Spinn3r has benefited significantly from our user base as they have suggested a number of excellent features. This has dramatically increased our reliability, performance, and feature set.

A good deal of work here has been spent on scalability, performance, and optimizations, including serious improvements to our core backend infrastructure.

There's quite a lot that's new in this release so I'll just dive in.

Industry Standardization

We're now powering startups who have raised in excess of $100M in VC funding.

What is interesting is that a large portion of the industry is standardizing around our infrastructure. Why wouldn't they? We've been in production for over three years now and have been in production applications the entire time.

Research Program

Over the last year we have been providing researchers access to Spinn3r and this is really starting to pay off with currently half a dozen papers published using our architecture.

We haven't had a chance to announce this until now so we're pretty excited that this is finally public.

We have researchers at Harvard, Carnegie Mellon, Stanford, Caltech, University of Maryland Baltimore County, University of Washington, University of Southern California, Nanyang Technological University, University of Edinburgh, the National Institute of Informatics in Japan, the University of Hannover in Germany, and on and on.

Recently, the University of North Texas used Spinn3r to track the Swine Flu outbreak.

Cornell just recently launched a Memetracker powered by Spinn3r which we're really excited about.

Textmap is another search engine using Spinn3r. Their paper, Large-Scale Sentiment Analysis for News and Blogs, from the 2007 International Conference on Weblogs and Social Media (ICWSM), does a good job explaining their system.

We also have a number of our customers performing entity extraction and sentiment analysis and I think that this space is going to be really maturing in the next few years.


International Conference on Weblogs and Social Media

We're sponsoring the International Conference on Weblogs and Social Media (ICWSM) in San Jose this year.

We provided them with four months of data - nearly 400GB of blog data.

It turned out to be a huge success with more than 100 research groups requesting access. We've also provided them with direct access to Spinn3r and will continue to do so for the foreseeable future.

We will almost certainly sponsor ICWSM 2010 with a similar corpus, possibly expanding it with more data, including our comment API, permalink content, and content extract, which would increase the size to around 4TB.

New Admin Console

We now have a new web application to help developers interface with Spinn3r.

The general idea is that while Spinn3r provides a very powerful API, it was sometimes difficult for our new customers to get up and running. Further, once they were up and running, they would report a problem without a way to pinpoint what they were seeing.

Now with our console they can just login, drill down on the specific datapoint they are interested in, and send in a URL documenting their question.

We have statistics on anything you can imagine. We have hosting provider breakdown, language breakdown, posts per hour, comments per hour, links per hour. We even have most of these broken down by blog host.

Here's a screenshot of our language breakdown:

[Screenshot: language breakdown in the admin console]

As you can see, we're heavily biased toward English content as most of our customers are in the US.

We've also instrumented statistics about our customers including their individual API lag (or lack thereof), number of registered sources and their throughput, etc.

Further, we've now implemented web versions of most of our popular APIs to help easily debug Spinn3r.

For example, our customers can give us the URL to an A-list weblog and we can show them the most recent posts within Spinn3r:

[Screenshot: most recent posts for a weblog within Spinn3r]

There are interfaces for most features of Spinn3r. Of course one can always use the API directly and we have a great command line interface as well.

Comment API

User commentary across blogs can be valuable for search engines and users but right now there are no real standards for indexing comments made within the blogosphere.

The wfw:comments and Atom threading standards exist (and we support them) but these are only supported within a minority of blogging systems.
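For feeds that do expose these standards, discovering the comment feed is straightforward. Below is a minimal sketch, not Spinn3r's implementation: it assumes the `wfw:commentRss` element from the Well-Formed Web CommentAPI namespace and the `rel="replies"` link from the Atom threading extension (RFC 4685). The function names and sample feeds are made up for illustration.

```python
import xml.etree.ElementTree as ET

WFW_NS = "http://wellformedweb.org/CommentAPI/"
ATOM_NS = "http://www.w3.org/2005/Atom"


def rss_comment_feeds(rss_xml):
    """Map each RSS item's permalink to its wfw:commentRss URL, when present."""
    root = ET.fromstring(rss_xml)
    feeds = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        comment_feed = item.findtext(f"{{{WFW_NS}}}commentRss")
        if link and comment_feed:
            feeds[link.strip()] = comment_feed.strip()
    return feeds


def atom_reply_feeds(atom_xml):
    """Map each Atom entry id to the href of its rel="replies" link, when present."""
    root = ET.fromstring(atom_xml)
    feeds = {}
    for entry in root.iter(f"{{{ATOM_NS}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM_NS}}}id")
        for link in entry.findall(f"{{{ATOM_NS}}}link"):
            if entry_id and link.get("rel") == "replies" and link.get("href"):
                feeds[entry_id.strip()] = link.get("href")
    return feeds


# Hypothetical sample feeds.
RSS_SAMPLE = """<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <channel>
    <item>
      <link>http://blog.example.com/post-1</link>
      <wfw:commentRss>http://blog.example.com/post-1/feed</wfw:commentRss>
    </item>
    <item>
      <link>http://blog.example.com/post-2</link>
    </item>
  </channel>
</rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:blog.example.com,2009:post-1</id>
    <link rel="alternate" href="http://blog.example.com/post-1"/>
    <link rel="replies" type="application/atom+xml"
          href="http://blog.example.com/post-1/comments.atom"/>
  </entry>
</feed>"""

rss_feeds = rss_comment_feeds(RSS_SAMPLE)
atom_feeds = atom_reply_feeds(ATOM_SAMPLE)
print(rss_feeds)
print(atom_feeds)
```

Items without a comment feed (like the second RSS item above) are simply skipped; those are the cases where a hand-tuned parser would be needed instead.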

We've written hand tuned parsers for fetching the remaining comments and we support the majority of content management systems.

The only restriction is that we don't re-index content right now. This is going to change shortly after we ship 3.0 and probably make it into 3.1.

If you'd like beta access please send us an email and we'll provide you with additional documentation.

Hybrid Real Time Indexing

Spinn3r is directly integrated into the vast ping architecture. If we receive a ping from a weblog we immediately launch our crawlers to fetch the update.

The difficulty here is that not everyone sends pings. We have had a hybrid crawler for about 6-12 months now, which allows us to support both pinged sources and sources on fixed indexing intervals. Currently about 70% of our content is fetched from pinged sources and the other 30% is fetched once every thirty minutes.
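The hybrid approach can be sketched as a single priority queue where pings schedule an immediate fetch and everything else is re-queued on a fixed interval. This is a simplified illustration, not our production scheduler; the class and method names are hypothetical, and only the thirty-minute interval comes from the description above.

```python
import heapq

POLL_INTERVAL = 30 * 60  # seconds; the thirty-minute interval for non-pinging sources


class HybridScheduler:
    """Toy scheduler: pinged sources are fetched at once, the rest are polled."""

    def __init__(self):
        self._queue = []       # min-heap of (due_time, source)
        self._pinging = set()  # sources known to send pings

    def add_polled_source(self, source, now):
        """Register a source that does not ping; its first fetch is due immediately."""
        heapq.heappush(self._queue, (now, source))

    def on_ping(self, source, now):
        """A ping arrived: schedule an immediate fetch and stop polling this source."""
        self._pinging.add(source)
        heapq.heappush(self._queue, (now, source))

    def due(self, now):
        """Return every source due at `now`, re-queuing the polled ones."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            _, source = heapq.heappop(self._queue)
            ready.append(source)
            if source not in self._pinging:
                # Polled sources come back after the fixed interval.
                heapq.heappush(self._queue, (now + POLL_INTERVAL, source))
        return ready


scheduler = HybridScheduler()
scheduler.add_polled_source("quiet.example.com", now=0)
first = scheduler.due(now=0)             # the polled source is fetched right away
scheduler.on_ping("busy.example.com", now=60)
second = scheduler.due(now=60)           # only the pinged source is due
third = scheduler.due(now=POLL_INTERVAL) # the polled source comes around again
print(first, second, third)
```

A real crawler would also need per-host politeness limits and would likely fall back to polling a pinging source that goes silent; both are left out here to keep the sketch short.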

Archives

We've expanded our archive capacity and now host more than 7 months of content comprising some 21TB (twenty one terabytes) of content.

We have online capacity for up to 66TB of content and can expand to about 300TB by purchasing additional hardware. From this point moving on we plan on keeping all archives for all time.

This is made possible due to the database migration we performed as part of Spinn3r 3.0 as well as our new datacenter migration.

Full Source History

We've extended our API so that it's no longer just a raw crawler API. Now it's a full blown database of the entire blogosphere.

You can give us a weblog, feed, or permalink and we can show you the entire history and you can page through the API going back in time.

Improved Performance

We completely rewrote our API result handling in 3.0 and we can now support a much higher throughput than before.

Assuming you're not bottlenecked by bandwidth, you should be able to sustain 10-30x over real time indexing.

This means it only takes on average 1 hour to download 30 hours worth of content (with higher throughput possible assuming you have the necessary bandwidth).

This might sound like overkill but when our customers need access to archive data, or they've been offline for a long period of time, they want to catch up quickly.
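As a rough back-of-the-envelope sketch of what those numbers mean for catching up, the helpers below compute download and catch-up times. The function names are hypothetical, and the catch-up formula assumes new content keeps arriving at real-time rate while you download, so a client running at N times real time clears N - 1 hours of backlog per hour.

```python
def download_hours(content_hours, speedup):
    """Hours needed to download a fixed archive of `content_hours` of content."""
    return content_hours / speedup


def catch_up_hours(backlog_hours, speedup):
    """Hours needed to clear a backlog while new content keeps arriving.

    Downloading at `speedup` times real time clears (speedup - 1) hours
    of backlog per hour, since one hour of new content arrives meanwhile.
    """
    if speedup <= 1:
        raise ValueError("speedup must exceed 1x to ever catch up")
    return backlog_hours / (speedup - 1)


# 30 hours of archived content at 30x real time takes one hour, as stated above.
print(download_hours(30, 30))

# A client that was offline for a week (168 hours), at both ends of the 10-30x range.
for speedup in (10, 30):
    print(speedup, round(catch_up_hours(168, speedup), 1))
```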

We're also working on some API extensions to handle parallel downloading which should, in theory, mean we can index Spinn3r content as fast as we want just by throwing more hardware (and bandwidth) at the problem.

Right now it's not really a problem though, as the blogosphere could increase in size by 30x and we would easily be able to handle the load.

Public Statistics

We're also going to be sharing our internal statistics with the public in the goal of sharing as much as possible about Spinn3r. Our monitoring architecture allows us to have thousands and thousands of metrics so we can literally monitor everything possible about the blogosphere.

Cost Savings

Our biggest competitor to date has been build vs buy. We have a few competitors out there but they can't compete with us on feature set or pricing.

With Spinn3r 3.0 we've made a very compelling case for the significant cost savings available by switching from a proprietary (or Open Source) crawler to Spinn3r.

If you're using Spinn3r you can save upwards of $45k per month in hosting costs. With the current economic situation it's starting to become very compelling to switch.

Tailrank

A number of people (including our customers) have asked me about Tailrank and what we're doing with the product.

We're shutting down the consumer facing website because the architecture was difficult (and expensive) to maintain alongside Spinn3r 3.0. We're going to be selling off the assets and integrating the technology into a future version of Spinn3r.

If you're interested in purchasing the assets to Tailrank feel free to send me a private email or just contact us directly.

At this point Tailrank is not a market in which we want to focus. The space is too fragmented, has too much competition, and very little room for innovation.

The clustering and ranking technology present in Tailrank 3.0 (which never shipped) will be refactored and integrated in the Spinn3r 3.1 or 3.2 timeframe (2-6 months).

If you'd like an alternative I'd recommend Reddit, Digg, Techmeme, Newsvine, Google News, or Wikio. They're all good products with a significant amount of attention and traction.

At some point we might integrate memetracking directly into the Spinn3r corpus but at the moment our customers have a number of pending requests that we plan on servicing first.

What's next?

As usual, we're always focused on customers and constantly improving the product. We're also working on a few things in parallel which use much larger data sets.

We're going to be pushing a few new architecture changes in the next sixty days which will allow us to perform additional computations and extract more metadata. The mainline thinking in information retrieval today is that while algorithms are smart, data is smarter.

We've taken that approach throughout our design. Of course one of the main problems is the sheer size of the infrastructure required to push this much data. With Spinn3r 3.0 we're nearly there and will have some interesting product features in the next quarter.

Using Spinn3r to Track Swine Flu

Want to track Swine Flu outbreaks? Just use Spinn3r! Courtney Corley and Jorge Reyes are two University of North Texas graduate students who have been using Spinn3r under our research program to mine data about the recent Swine Flu outbreak. The Denton Record-Chronicle has the story:
“We’re looking at what people write in blogs, Web [sites] and social media like Facebook, YouTube, etc. But, in particular, we’re just using blogs,” Corley said. “We have a service that allows us access to all blogs written in whatever language.” The service is called Spinn3r, and allows them to pull together all media across the Internet that contains the keywords they search for. “It’s a really rich resource to use for public health to see what people are writing about,” he said. “It’s a massive amount of data. Jorge and I for the past week have been looking at all the blogs that talk about swine flu. There are many words in Spanish for swine flu, so Jorge has been able to navigate that.” Reyes, who is from Mexico, said he was motivated to work on the project because his family was in the country where the virus originated. “All my family was there, I was worried,” Reyes said. “We were like, ‘what could we do with the tools we have?’
[Photo caption: Courtney Corley and Jorge Reyes, who are tracking the spread of swine flu in the United States and Mexico, are shown Wednesday on campus.]

Researchers Using Spinn3r

We've been providing researchers with access to Spinn3r for more than two years now. The results are really starting to land now.

We're sponsoring ICWSM this year with a 400GB snapshot. It is being used by more than 100 research groups, about 500 researchers in total.

There should be a few dozen more papers published in the next few weeks but I wanted to highlight these now as they are already live.

Specifications and Architectures of Federated Event-Driven Systems

Specifying the Personal Information Broker Data Acquisition: Data can be acquired from multiple sources – currently we use Spinn3r, later we will also acquire IEM, Twitter, Technorati, etc. Each of these acquisitions is specified differently. Acquisition of Spinn3r data, referenced in Figure 3 step 1, is achieved through changing URL arguments in a manner defined by Spinn3r. Thus, the specification is unique to Spinn3r. While that particular specification cannot be reused, using the compositional approach, exchanging Spinn3r for Twitter, a news feed, or an instant messaging account while maintaining the integrity of the composition is trivial. The specifications for all of these information interfaces are very different; a notation that allows the description of composite applications must account for this.

Blogs as Predictors of Movie Success

In this work, we attempt to assess if blog data is useful for prediction of movie sales and user/critics ratings. Here are our main contributions:

• We evaluate a comprehensive list of features that deal with movie references in blogs (a total of 120 features) using the full spinn3r.com blog data set for 12 months.

• We find that aggregate counts of movie references in blogs are highly predictive of movie sales but not predictive of user and critics ratings.

• We identify the most useful features for making movie sales predictions using correlation and KL divergence as metrics and use clustering to find similarity between the features.

• We show, using time series analysis as in (Gruhl, D. et. al. 2005), that blog references generally precede movie sales by a week and thus weekly sales can be predicted from blog references in the preceding weeks.

• We confirm low correlation between blog references and first week movie sales reported by (Mishne, G. et. al. 2006) but we find that (a) budget is a better predictor for the first week; (b) subsequent weeks are much more predictive from blogs (with up to 0.86 correlation).

Data and Features

The data set we used for this paper is the spinn3r.com blog data set from Nov. 2007 until Nov. 2008. This data set includes practically all the blog posts published on the web in this period (approximately 1.5 TB of compressed XML).

Blogvox2: A modular domain independent sentiment analysis system

Bloggers make a huge impact on society by representing and influencing the people. Blogging by nature is about expressing and listening to opinion. Good sentiment detection tools, for blogs and other social media, tailored to politics can be a useful tool for today’s society. With the elections around the corner, political blogs are vital to exerting and keeping political influence over society. Currently, no sentiment analysis framework that is tailored to Political Blogs exist. Hence, a modular framework built with replicable modules for the analysis of sentiment in blogs tailored to political blogs is thus justified.

...

Spinn3r (http://tailrank.com ) provided live spam-resistant and high performance spider dataset to us. We tested our framework on this dataset since it was live feeds and we wanted to test our performance of sentiment analysis on these dataset for performance analysis and testing. We periodically pinged the online api for the current dataset of all the rss feeds. Although we had different domains that were provided to us, we chose the political domain for consistency with our other results.

Meme-tracking and the Dynamics of the News Cycle

Tracking new topics, ideas, and “memes” across the Web has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales, as well as abrupt spikes in the appearance of particular named entities. However, these approaches are less well suited to the identification of content that spreads widely and then fades over time scales on the order of days — the time scale at which we perceive news and events.

...

Dataset description. Our dataset covers three months of online mainstream and social media activity from August 1 to October 31 2008 with about 1 million documents per day. In total it consist of 90 million documents (blog posts and news articles) from 1.65 million different sites that we obtained through the Spinn3r API [27]. The total dataset size is 390GB and essentially includes complete online media coverage: we have all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites. From the dataset we extracted the total 112 million quotes and discarded those with L < 4, M < 10, and those that fail our single-domain test with ε = .25. This left us with 47 million phrases out of which 22 million were distinct. Clustering the phrases took 9 hours and produced a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together included 94,700 nodes (phrases).

NYTimes on Crawling the Deep Web

The New York Times has a great piece today on crawling the deep web - the portion of the web that isn't easily accessible to normal web crawlers.
The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them. Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.
Oddly enough, Google published a paper on this topic at VLDB 2008 entitled "Google's Deep-Web Crawl".
This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine's index. We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box ... [and then] extract additional keywords from the resulting documents ... We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition. A typed text box will produce reasonable result pages only with type-appropriate values. We use ... [sampling of] known values for popular types ... e.g. zip codes ... state abbreviations ... city ... date ... [and] price.
The smart guys over at Kosmix were also quoted:
“The crawlable Web is the tip of the iceberg,” says Anand Rajaraman, co-founder of Kosmix (www.kosmix.com), a Deep Web search start-up whose investors include Jeffrey P. Bezos, chief executive of Amazon.com. Kosmix has developed software that matches searches with the databases most likely to yield relevant information, then returns an overview of the topic drawn from multiple sources. “Most search engines try to help you find a needle in a haystack,” Mr. Rajaraman said, “but what we’re trying to do is help you explore the haystack.”
...
In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.
Interesting. I'm going to have to reach out to them to offer our help with Spinn3r (more on our research program shortly).

A Change in Robots.txt

Change is in the air. Including the robots.txt at whitehouse.gov:

User-agent: *
Disallow: /includes/
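
For anyone curious how a crawler reads those two lines, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent string and the two test URLs are only illustrative assumptions; this is not a description of how the Spinn3r crawler handles robots.txt internally.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /includes/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs, used only to exercise the rules.
blog_allowed = parser.can_fetch("Spinn3r", "http://www.whitehouse.gov/blog/")
includes_allowed = parser.can_fetch("Spinn3r", "http://www.whitehouse.gov/includes/site.js")

print("blog page allowed:", blog_allowed)
print("includes path allowed:", includes_allowed)
```

With only `/includes/` disallowed, everything else on the site is open to any well-behaved robot.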

I wonder if Eric Schmidt had something to do with this... he was slated for a cabinet position but turned it down at the last minute.

It's no secret that Google employees have given a LOT of money to Obama.

Spinn3r is going to love this data!

Ignoring Blogroll and Sidebar Content in Search

Google Blog Search shipped with an update a few months back to index the full HTML of each new blog post.

The only problem is that they indexed the full HTML and not the article content:

I wanted to give everyone a brief end-of-the-year update on the blogroll problem. When we switched blogsearch to indexing the full text of posts, we started seeing a lot more results where the only matches for a query were from the blogroll or other parts of the page that frame the actual post. (There's been a lot of discussion of the problem. You can search for [google blogsearch] using Google Blogsearch.)

We're in the midst of deploying a solution for this problem. The basic approach is to analyze each blog to look for text and markup that is common to all of the posts. Usually, these common elements include the blogroll, any navigational elements, and other parts of the page that aren't part of the post. This approach works well for a lot of blogs, but we're continuing to improve the algorithm. The search results should ignore matches that only come from these common elements. The indexing change to implement it is deployed almost everywhere now.
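
A minimal sketch of the common-element idea described in the quote above, assuming each post page has already been reduced to lines of text. The function names, the line-level granularity and the 90% threshold are illustrative assumptions, not Google's implementation (and not Spinn3r's either).

```python
from collections import Counter


def split_blocks(page_text: str) -> list[str]:
    """Split a page into non-empty lines with whitespace normalized."""
    return [" ".join(line.split()) for line in page_text.splitlines() if line.strip()]


def find_common_blocks(pages: list[str], threshold: float = 0.9) -> set[str]:
    """Return blocks that appear on at least `threshold` of the pages."""
    counts = Counter()
    for page in pages:
        # Count each block once per page, however often it repeats there.
        counts.update(set(split_blocks(page)))
    needed = threshold * len(pages)
    return {block for block, seen in counts.items() if seen >= needed}


def strip_common(page_text: str, common: set[str]) -> list[str]:
    """Keep only the blocks that are not shared page furniture."""
    return [block for block in split_blocks(page_text) if block not in common]


# Three made-up posts from the same blog: same blogroll and navigation, different bodies.
pages = [
    "Blogroll: Alice, Bob\nHome | Archive\nMy trip to Denton",
    "Blogroll: Alice, Bob\nHome | Archive\nNotes on crawling",
    "Blogroll: Alice, Bob\nHome | Archive\nAlice visited today",
]

common = find_common_blocks(pages)
post_only = strip_common(pages[2], common)
print(post_only)
```

A search for "Alice" should then match only the third post, where the name appears in the body rather than in the blogroll.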

Spinn3r customers have had a solution for this problem for nearly a year now.

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds but they only have a one line description instead of the full content of the post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search - what's the point of using a search engine that's not indexing 80% of potential content?

We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets the greater the probability that they'll enable partial RSS feeds in order to reduce their bandwidth costs and increase click through rates.

Spinn3r 2.1 adds a new feature which can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.

See the yellow text in the image on the right? That was identified algorithmically and isolated from the body of the post.

To be fair it's a difficult problem but I've had a few years to think about it.

Cornell Researchers Launch Memetracker Powered by Spinn3r

We have a number of other pending announcements of researchers building cool applications with Spinn3r but this one was just too awesome to hold back.

Researchers at Cornell have developed a new memetracker (cleverly named MemeTracker) powered by Spinn3r.

Jure Leskovec, Lars Backstrom and Jon Kleinberg (author of the HITS algorithm, among other things) built MemeTracker by tracking the hottest quotes from throughout the blogosphere and rendering a graph by grouping quotes and then tracking the number of quote references.

MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories per day from 1 million online sources, ranging from mass media to personal blogs.

We track the quotes and phrases that appear most frequently over time across this entire spectrum. This makes it possible to see how different stories compete for news coverage each day, and how certain stories persist while others fade quickly.

The plot above shows the frequency of the top 100 quotes in the news over time, for roughly the past two months.

Here's a screenshot but you should definitely play with MemeTracker to see how it works:

[Screenshot: MemeTracker]

We've been thinking of shipping a new API for tracking quotes across the blogosphere. Our new change tracking algorithm for finding duplicate content also does an excellent job of finding quotes.

Tracking duplicate content turns out to be very important in spam prevention and ranking. It just so happens that there's a number of overlapping features and technologies that these things can provide.
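
The post does not describe the change tracking algorithm itself. As a rough illustration of why duplicate detection and quote finding overlap, here is a sketch of a different, well-known technique: word shingling with Jaccard similarity. The function names and the four-word shingle size are assumptions made for the example.

```python
def shingles(text: str, size: int = 4) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as not similar."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def shared_phrases(doc_a: str, doc_b: str, size: int = 4) -> set[tuple[str, ...]]:
    """Shingles present in both documents, i.e. candidate quoted phrases."""
    return shingles(doc_a, size) & shingles(doc_b, size)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "he said the quick brown fox jumps high"

similarity = jaccard(shingles(doc_a), shingles(doc_b))
quotes = shared_phrases(doc_a, doc_b)
print(f"similarity: {similarity:.2f}")
print(sorted(quotes))
```

A high similarity score suggests a duplicate (useful for spam prevention), while a low score with a few shared shingles suggests one post quoting another.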

We're not ready to ship it just yet because the backend requires about 2TB of random access data. This isn't exactly cheap so we've been experimenting with some new algorithms and hardware to bring down the pricing. I think we'll be able to ship something along these lines once we get our next big release out the door.

Spinn3r Sponsors 2009 International Conference for Weblogs and Social Data Challenge

Spinn3r is sponsoring the International Conference for Weblogs and Social Media this year with a snapshot of our index.

The data set was designed for use by researchers to build cool and interesting applications with the data.

Good research topics might include...
  • link analysis
  • social network extraction
  • tracing the evolution of news
  • blog search and filtering
  • psychological, sociological, ethnographic, or personality-based studies
  • analysis of influence among bloggers
  • blog summarization and discourse analysis

We're already used by a number of researchers in top universities. Textmap (which presented at ICWSM last year) just migrated to using Spinn3r and Blogs Cascades has been using us for a while now.

The data set is pretty large: 142GB uncompressed (27GB compressed). But you need a solid chunk of data to perform interesting research.

The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).

This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.

To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form has been processed, you will be sent a URL and password where you can download the collection.

Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.

Spinn3r 2.3.1

We just pushed Spinn3r 2.3.1. If you depend on changes in this release you should grab the new reference client.

A number of small fit and finish fixes went into this release. More important fixes include:

- New post:title element in the permalink API. When non-null this is the authoritative title element from the RSS feed for crawled content. This gets us a bit further towards a grand unified API for indexing the blogosphere.

- New post:body element which will include authoritative feed content in the next push of our crawler (we're just testing it now).

- The internal hashcodes for sources and feeds are included in the API and reference client for advanced API usage and debugging.

- The source.register mechanism now allows clients to specify publisher_type for new sources. We're going to work on a new API to allow customers to flag sources for existing sources as well.

- A number of extensions are now present in the spinn3r admin console for debugging including:

  - the ability to view raw HTML source for a given permalink or feed

  - the ability to view the cached HTML rendered in your local browser

- The spinn3r admin console now graphs publisher types (mainstream weblogs, news feeds, etc).

- All Spinn3r robots can now be identified by reverse DNS. This is documented in our robot FAQ:

How do I verify that the robot visiting my website is Spinn3r?

First, it will have a User-Agent of:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; aggregator:Tailrank (Spinn3r 2.3); http://spinn3r.com/robot) Gecko/20021130

Second, we support robot DNS verification.

When you have a HTTP log entry which has our user agent, just perform a reverse DNS on the raw IP address.

For example:


%shell% nslookup 64.34.195.138
Non-authoritative answer:
138.195.34.64.in-addr.arpa name = robot32.spinn3r.com.


%shell% nslookup robot32.spinn3r.com
Non-authoritative answer:
Name: robot32.spinn3r.com
Address: 64.34.195.138
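
The two nslookup steps above (reverse lookup, then a forward lookup to confirm) can be scripted. This is a minimal sketch using Python's standard `socket` module; the function name and the injectable lookup parameters are assumptions made so the check can be tried without touching the network, and the hostname and address from the FAQ may no longer resolve today.

```python
import socket
from typing import Callable, List, Optional


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """Forward DNS: hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_spinn3r_robot(
    ip: str,
    reverse_lookup: Optional[Callable[[str], str]] = None,
    forward_lookup: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Forward-confirmed reverse DNS check for an IP seen in an HTTP log."""
    reverse_lookup = reverse_lookup or _reverse
    forward_lookup = forward_lookup or _forward
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    # The reverse name must sit under spinn3r.com, not merely contain it.
    if not hostname.endswith(".spinn3r.com"):
        return False
    try:
        # The forward lookup must map back to the same address.
        return ip in forward_lookup(hostname)
    except OSError:
        return False


# Offline demonstration using the values from the nslookup output above.
def fake_reverse(ip: str) -> str:
    return "robot32.spinn3r.com."


def fake_forward(hostname: str) -> List[str]:
    return ["64.34.195.138"]


print(is_spinn3r_robot("64.34.195.138", fake_reverse, fake_forward))
```

Calling `is_spinn3r_robot(ip)` with no extra arguments performs real DNS lookups. The forward confirmation matters because anyone who controls their own reverse DNS zone can claim an arbitrary hostname.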

.... and did I mention we're hiring?

Spinn3r Hiring Director of Sales

We're hiring a Director of Sales to interface with new clients. This is actually somewhat of an open role in that you'll also be helping us out with bizdev and marketing.

Basically, we're growing fast and need someone to help us out on multiple business fronts.

If you're the right guy feel free to get in contact with us and we'll take it from there.

Position:

As Director of Sales you'll be responsible for early stage economic growth of Spinn3r including leads follow through, closing new customers, handling our marketing efforts, and interfacing with existing customers when new products are released. Generally doing whatever it takes to pull in more revenue and take Spinn3r to the next level.

You'll also help us out with Marketing and Business Development and fall into a more specific role as the company grows.

We're a fast growing startup so you should be familiar with this type of environment.

This is an excellent opportunity for the right candidate as we're growing fast and plan on releasing some new products in the coming months which should make things VERY interesting.

Responsibilities:

  • Follow through with pending customers as they experiment with the Spinn3r platform.
  • Help with marketing efforts to push Spinn3r forward into the marketplace.
  • Qualify new leads.
  • Introduce customers to Spinn3r including frequently asked questions.
  • Follow up and upsell existing customers with new product releases.

Requirements:

  • Proven track record of success.
  • Personable and polished face to face interaction.
  • 3-5 years of experience in enterprise software sales or long sales cycles.
  • 2-4 years of sales experience.
  • Excellent oral and written communications skills and comfortable presenting written proposals a must.
  • Proven ability to thrive in a startup environment is critical.
  • Bachelors degree or higher.

Desired:

  • Significant understanding of the search space.
  • Experience and understanding of weblog technologies.

Location:

Located in the SOMA district of San Francisco. One block away from MUNI (N/KT), 4 blocks from Caltrain, 5 blocks from BART.

Feed Update Protocols and SUP

It looks like FriendFeed is proposing a new update protocol for RSS which avoids the thundering herd problem present with RSS polling.

When you add a web site like Flickr or Google Reader to FriendFeed, FriendFeed's servers constantly download your feed from the service to get your updates as quickly as possible. FriendFeed's user base has grown quite a bit since launch, and our servers now download millions of feeds from over 43 services every hour.

Venture Beat has more on the subject: (as does Tech Confidential)

It looks like the rapid fire site updates are about to start again for the social content conversation site FriendFeed. Just a few days after the launch of its new “beta” area, FriendFeed is finalizing a new technology that could help pull content into the site at a much faster rate.

The technology, called Simple Update Protocol (SUP) will process updates from the various services that FriendFeed imports faster than it currently does using traditional Really Simple Syndication (RSS) feeds, FriendFeed co-founder Paul Buchheit told Tech Confidential.

Spinn3r has a similar problem of course but we have 17.5M sources to consider.

The requirements are straightforward:

* Simple to implement. Most sites can add support with only few lines of code if their database already stores timestamps.
* Works over HTTP, so it's very easy to publish and consume.
* Cacheable. A SUP feed can be generated by a cron job and served from a static text file or from memcached.
* Compact. Updates can be about 21 bytes each. (8 bytes with gzip encoding)
* Does not expose usernames or secret feed urls (such as Google Reader Shared Items feeds)

Sites wishing to produce a SUP feed must do two things:

* Add a special tag to their SUP enabled Atom or RSS feeds. This tag includes the feed's SUP-ID and the URL of the appropriate SUP feed.

Interesting that this is seeing attention again because Dave proposed this in RSS 2.0:

<cloud> is an optional sub-element of <channel>.

It specifies a web service that supports the rssCloud interface which can be implemented in HTTP-POST, XML-RPC or SOAP 1.1.

Its purpose is to allow processes to register with a cloud to be notified of updates to the channel, implementing a lightweight publish-subscribe protocol for RSS feeds.

In this example, to request notification on the channel it appears in, you would send an XML-RPC message to radio.xmlstoragesystem.com on port 80, with a path of /RPC2. The procedure to call is xmlStorageSystem.rssPleaseNotify.
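
The example element that the quoted text refers to did not survive in this post. The sketch below uses a reconstruction consistent with the values named above (the attribute names follow the RSS 2.0 cloud element; the surrounding channel and the helper function are assumptions for illustration) and shows how a consumer might read the registration endpoint out of a feed.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Reconstructed example; only the cloud attribute values come from the quoted text.
CHANNEL_XML = """\
<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <cloud domain="radio.xmlstoragesystem.com" port="80" path="/RPC2"
           registerProcedure="xmlStorageSystem.rssPleaseNotify" protocol="xml-rpc" />
  </channel>
</rss>
"""


def cloud_endpoint(rss_xml: str) -> Optional[dict]:
    """Return the notification endpoint advertised by a feed, if any."""
    cloud = ET.fromstring(rss_xml).find("./channel/cloud")
    if cloud is None:
        return None
    url = f"http://{cloud.get('domain')}:{cloud.get('port')}{cloud.get('path')}"
    return {
        "url": url,
        "procedure": cloud.get("registerProcedure"),
        "protocol": cloud.get("protocol"),
    }


endpoint = cloud_endpoint(CHANNEL_XML)
print(endpoint)
```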

However SUP is not XMLRPC (which is probably good since I'm a REST fan)

By using SUP-IDs instead of feed urls, we avoid having to expose the feed url, avoid URL canonicalization issues, and produce a more compact update feed (because SUP-IDs can be a database id or some other short token assigned by the service).

This can be avoided by just using the unique source URL. The feed is irrelevant. Just map the source to feed URL on your end.

Because it is still possible to miss updates due to server errors or other malfunctions, SUP does not completely eliminate the need for polling. However, when using SUP, feed consumers can reduce polling frequency while simultaneously reducing update latency. For example, if a site such as FriendFeed switched from polling feeds every 30 minutes to polling every 300 minutes (5 hours), and also monitored the appropriate SUP feed every 3 minutes, the total amount of feed polling would be reduced by about 90%, and new updates would typically appear 10 times as fast.
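
The arithmetic behind those two figures is easy to check. This sketch assumes an update lands at a uniformly random moment between checks, so the average wait is half the interval; the function names are made up for the example.

```python
def polls_per_day(interval_minutes: float) -> float:
    """Number of times a single feed is polled per day at a given interval."""
    return 24 * 60 / interval_minutes


def average_latency_minutes(interval_minutes: float) -> float:
    """Average wait before an update is noticed, assuming uniform arrival."""
    return interval_minutes / 2


old_polls = polls_per_day(30)    # polling every 30 minutes
new_polls = polls_per_day(300)   # polling every 300 minutes
reduction = 1 - new_polls / old_polls

# With SUP, new updates are noticed through the SUP feed, checked every 3 minutes.
speedup = average_latency_minutes(30) / average_latency_minutes(3)

print(f"feed polls per day: {old_polls:.1f} -> {new_polls:.1f}")
print(f"reduction in feed polling: {reduction:.0%}")
print(f"updates appear about {speedup:.0f}x faster")
```

Note that the 90% figure counts polls of individual feeds only. The SUP feed itself is fetched every 3 minutes, but that one request covers every feed on the publishing service.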

Spinn3r takes a hybrid approach. We index pinged sources once per week but also index right when they ping us. Best of both worlds basically.

The current ping space is all over the map though.

There's XMLRPC, XML, the Six Apart update stream and now JSON:

This doesn't seem too different from Changes.xml...

Witness http://blogsearch.google.com/changes.xml vs http://friendfeed.com/api/sup.json

I'm not sure what the solution is here but it's clear we need some standardization in this area.

One suggestion for SUP is to not use a JSON-only protocol. Having an alternative REST/XML version seems to be advantageous for people who don't want to put a second parser framework in production.

Spinn3r 2.2.5

We just pushed Spinn3r 2.2.5 which fixes a number of small issues including:

- fixed bug with permalink.history which would potentially select incorrect content during pagination.

- new feed.status API.

- fixed small issue with RSS feeds that were (incorrectly) using session IDs.

- Fixed bug with potentially including HTML content in content_extract results.

- Both api.spinn3r.com and our reference client now support HTTP keep-alives. In certain situations this can improve performance over high-latency Internet connections.

Spinn3r 2.2.3

Spinn3r 2.2.3 just made it out the door. This is a small release that mostly tightens up our source.list and source.status support.

We've also improved the documentation for source.list including adding a full command line client in the example section.

The Spinn3r reference client will now print a custom error message when one is returned by the server. This has been added to ease debugging when calling source.register with a source that Spinn3r might not like.

Spinn3r 2.2.2 Escapes From the Laboratory

Spinn3r 2.2.2 escaped from the hatch and is now free to wreak havoc upon the blogosphere.

This is mostly a point release. There has been a lot of backend work done including new hardware and database changes but most of these updates aren't visible to our user base.

New changes include:

  • New force option when registering new weblogs. We had a number of our users attempt to register weblogs with peculiar URL structure. We now allow them to register the URLs anyway with this new force option.
  • New Spinn3r reference client which implements changes necessary to force a source.register.
  • Fixed a bug with Feedburner URL handling where content would be ignored for feeds that had too many URL redirections.
  • We're now indexing content at 30 minute intervals. Spinn3r was previously indexing cyclical and non-pinged feeds once per hour but we've been able to tighten this up a bit with our new hardware.
  • Reduced some of our content buffering variables to allow ping handling to be a bit more realtime. Pinged content is now indexed within 2 minutes from the time we've received a ping rather than the 5 minutes we were using before.
  • We've pushed some new weblog discovery code which is humming along nicely. We've discovered approximately 1M new weblogs since last week and about 25k new mainstream media news sources. These were only made visible to our indexer after a number of ranking and content classification tweaks.

Spinn3r Hiring Senior Systems Administrator

Spinn3r is hiring an experienced Senior Systems Administrator with solid Linux and MySQL skills and a passion for building scalable and high performance infrastructure.

About Spinn3r:

Spinn3r is a licensed weblog crawler used by search engines, weblog analytic companies, and generally anyone who needs access to high quality weblog data.

We crawl the entire blogosphere in realtime, remove spam, rank and classify blogs, and provide this information to our customers.

Spinn3r is rare in the startup world in that we're actually profitable. We've proven our business model which gives us a significant advantage in future product design and expanding our current customer base and feature set.

We've also been smart and haven't raised a dime of external VC funding which gives us a lot more flexibility in terms of how we want to grow the company moving forward.

Overview:

In this role you'll be responsible for maintaining performance and availability of our cluster as well as future architecture design.

You're going to need to have a high level overview of our architecture but shouldn't be shy about diving into MySQL and/or Linux internals.

This is a great opportunity for the right candidate. You're going to be working in a very challenging environment with a lot of fun toys.

You're also going to be a core member of the team and will be given a great deal of responsibility.

We have a number of unique scalability challenges including high write throughput and massive backend database requirements.

We're also testing some cutting edge technology including SSD storage, distributed database technology and distributed crawler design.

Responsibilities:

  • Maintaining 24 x 7 x 365 operation of our cluster
  • Tuning our MySQL/InnoDB database environment
  • Maintaining our current crawler operations
  • Monitoring application availability and historical performance tracking
  • Maintaining our hardware and linux environment
  • Maintaining backups, testing failure scenarios, suggesting database changes

Requirements:

  • Experience in managing servers in large scale environments
  • Advanced understanding of Linux (preferably Debian). You need to grok the kernel, filesystem layout, memory model, swap, tuning, etc.
  • Advanced understanding of MySQL including replication and the InnoDB storage engine
  • Knowledge of scripting languages (Bash and PHP are desirable)
  • Maintaining software configuration within a large cluster of servers.
  • Network protocols including HTTP, SSH, and DNS
  • BS in Computer Science (or comparable experience)


Spinn3r 2.2.1 Released

Spinn3r 2.2.1 is out the door.

This is an evolution of Spinn3r 2.2 with a number of features and fixes suggested by our user base.

New API Methods:

As a result of our recent infrastructure changes, we're now able to provide a more robust feature set to our customers.

Ninety percent of our users are served by our raw crawler API but occasionally there are questions regarding support for a specific weblog, access to archive posts, etc.

These new methods should help improve this situation by making it easier to interact with Spinn3r.

At the moment this functionality is only supported with our permalink interface. We're working on back porting this functionality to our feed API as well.

source.list

Our new source.list API is designed for customers with existing crawlers that want to tie into our spam prevention and ping infrastructure.

The source.list API was designed to help 3rd party crawlers tie into Spinn3r's ping stream and realtime polling and prioritization backend.

Returns an RSS feed with lists of weblogs that have either been found or discovered by Spinn3r or published after a given timestamp.

You can see the source.list documentation for further information.

permalink.history

A number of Spinn3r customers have requested the ability to fetch historical content for specific blogs. This is now possible with our new permalink.history method.

Given a weblog URL, return recently published articles. This can be used to find the most recent results from techcrunch.com, gigaom.com, etc.

Returns recent posts sorted in reverse chronological order.

This is made possible due to the backend database improvements we've been steadily working on over the last year. We're going to port these changes to the feed API shortly. We're waiting to bring more hardware online for this which should take 2-3 weeks.

permalink.status

This provides the ability to obtain the status for a specific post (permalink) within Spinn3r.

General Crawler Improvements

This release also includes the following crawler improvements:

Faster Polling Interval

We've migrated to 45 minute (vs 60 minute) polling intervals for all cyclical feeds and sources. Everything else in Spinn3r is updated in real time when we receive a ping.

We're going to be reducing this to a 30 minute polling interval in the next week or so. We're going to pause at 45 minutes to see if any sites complain and make sure there aren't any performance issues which we have to deal with.

This should be fine as Bloglines has been using 30 minute polling intervals for a few years now and it hasn't caused any problems.

Weekly Indexing of Pinged Weblogs

We've also moved to a mechanism of re-indexing pinged weblogs on a weekly basis. While 99% of blogs in our index send pings correctly there's the possibility of a dropped ping due to a misconfigured blog host. This could be due either to an error on their part or a temporary network outage.

To correct this behavior we've migrated to a weekly re-index mechanism where we send out our crawlers if we haven't heard from a blog in at least a week.
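
The selection rule itself is simple. Here is a minimal sketch, assuming a mapping from source URL to the time of its last ping or crawl; the data structure and names are hypothetical, not the actual Spinn3r scheduler.

```python
from datetime import datetime, timedelta

# Re-crawl any pinged source that has been silent for at least a week.
REINDEX_AFTER = timedelta(weeks=1)


def sources_to_reindex(last_heard: dict[str, datetime], now: datetime) -> list[str]:
    """Return the sources we haven't heard from in at least a week."""
    return sorted(url for url, seen in last_heard.items() if now - seen >= REINDEX_AFTER)


# Made-up sources and timestamps for illustration.
now = datetime(2008, 5, 15, 12, 0)
last_heard = {
    "http://quiet.example.com/": datetime(2008, 5, 1, 9, 30),    # silent for two weeks
    "http://active.example.com/": datetime(2008, 5, 15, 11, 0),  # pinged an hour ago
    "http://edge.example.com/": datetime(2008, 5, 8, 12, 0),     # exactly one week
}

due = sources_to_reindex(last_heard, now)
print(due)
```

Blogs that ping reliably never show up in this list, so the extra crawl load falls only on sources whose pings may have been dropped.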

feed.getDelta supports publisher_type

This was an omission from Spinn3r 2.2 that one of our customers pointed out.

The permalink.getDelta method supported a publisher_type but the feed.getDelta method did not.

Advanced Mainstream Media Feed Detection

Mainstream media support for RSS has always been mediocre at best. Our permalink API was designed to help improve this situation by indexing all recent posts on a given website.

The problem is we would still be missing additional metadata such as the original publication date, author, and title.

It's often impossible to discover these feeds through autodiscovery alone because they may be buried deep within the website and many of these sites don't have RSS autodiscovery set up correctly.

Kiplinger.com is a good example. This website has a number of RSS feeds but the only way to find them is to click on an 'rss' link at the bottom of the page, which is a link to another HTML page which contains a set of RSS feeds.

Some sites are even worse. AOL News has a page which lists the RSS feeds but they don't actually link to them - they link to myAOL. They have an RSS feed link when you view the page in a browser, but this is actually generated via JavaScript which (obviously) crawlers can't see.

The solution has been to release a focused crawler for these sites to recursively index pages and attempt to find links to RSS feeds. These RSS feeds are then indexed and used to fetch additional metadata.
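To make the idea concrete, here is a rough Python sketch of a focused feed-discovery walk. It assumes a caller-supplied `fetch(url)` function that returns the page body as a string (or `None` on failure). The link heuristics (`rss`, `feed`, `.xml` in the href) and the depth limit are illustrative guesses, not the rules the Spinn3r crawler actually uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects autodiscovery <link> feeds and anchors that look feed-related."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []       # feeds advertised via <link rel="alternate">
        self.candidates = []  # <a> links worth following (hypothetical heuristic)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        if tag == "link":
            rel = (attr.get("rel") or "").lower().split()
            kind = (attr.get("type") or "").lower()
            if "alternate" in rel and kind in FEED_TYPES:
                self.feeds.append(url)
        elif tag == "a":
            low = href.lower()
            if "rss" in low or "feed" in low or low.endswith(".xml"):
                self.candidates.append(url)


def looks_like_feed(body):
    """Cheap check: does the start of the document contain a feed root element?"""
    head = body[:512].lower()
    return "<rss" in head or "<feed" in head or "<rdf:rdf" in head


def discover_feeds(start_url, fetch, max_depth=2):
    """Breadth-first walk from start_url, following feed-looking links."""
    seen, feeds = set(), []
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        if body is None:
            continue
        if looks_like_feed(body):
            if url not in feeds:
                feeds.append(url)
            continue
        parser = FeedLinkParser(url)
        parser.feed(body)
        for found in parser.feeds:
            if found not in feeds:
                feeds.append(found)
        if depth < max_depth:
            frontier.extend((c, depth + 1) for c in parser.candidates)
    return feeds
```

With a layout like the Kiplinger case (home page, then an 'rss' page, then the feeds themselves), the walk needs two hops, which is why the sketch defaults to `max_depth=2`.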

We've pushed the first pass of this functionality and are going to be releasing another version of our crawler that allows us to discover even more mainstream media feeds.

Documentation Updates

A number of documentation updates are available over on our wiki.

Specifically, we've documented the changes around the source and permalink APIs.

More to come...

We're also going to be releasing Spinn3r 2.2.2, which will have more crawler updates, including additional support for forums and mainstream media feeds, and enhancements to our core weblog discovery algorithms.

I suspect it will be about two weeks before all the backend infrastructure work is complete.

Thanks to Flickr users josef.stuefer, buntalshoot, and Mr Usaji for the amazing photos of the above spiders.

Slides from Spinn3r Architecture Talk at 2008 MySQL Users Conference

Here's a copy of the slides from the talk I just gave about the architecture of Spinn3r at the 2008 MySQL Users Conference:

We present the backend architecture behind Spinn3r – our scalable web and blog crawler.

Most existing work in scaling MySQL has been around high read throughput environments similar to web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.

We have achieved this through our ground up development of a fault tolerant distributed database and compute infrastructure all built on top of cheap commodity hardware.

New Spinn3r Open Ping Server

As part of Spinn3r 2.2 we've released an open ping server.

What's a ping server, you ask?

In blogging, ping is an XML-RPC-based push mechanism by which a weblog notifies a server that its content has been updated. An XML-RPC signal is sent to one or more "ping servers," which can then generate a list of blogs that have new material. Many blog authoring tools automatically ping one or more servers each time the blogger creates a new post or updates an old one.

The goal here is to be somewhat independent from the other ping servers out there. This way we can avoid any downtime or problems that would occur if they vanish entirely.

We already receive pings from a number of major blog hosting providers. If you're a blog host and would like to send us your ping stream please let us know. We'd prefer that you not use the open ping server as we can audit your ping stream a bit better when it's using a custom URL.

Why would you want to send us pings? Because we crawl for a number of major search startups and analytics companies (as well as PhDs and Universities) and your users will get a solid impact from their blog post when it hits Spinn3r.

Just use the URL:

http://rpc.spinn3r.com/open/RPC2

... for your RPC router and you're set.
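For blog software that doesn't already have a ping setting, a ping can be sketched with Python's standard `xmlrpc.client` module. This assumes the endpoint accepts the conventional `weblogUpdates.ping(name, url)` call used by other ping servers; the post above doesn't spell out the method name, so treat that as an assumption. The helper names are hypothetical.

```python
import xmlrpc.client

# URL from the post above.
PING_URL = "http://rpc.spinn3r.com/open/RPC2"


def build_ping(blog_name, blog_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call.

    Useful for inspecting what goes over the wire without touching the network.
    """
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(blog_name, blog_url, endpoint=PING_URL):
    """Send the ping and return the server's response (assumed method name)."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


if __name__ == "__main__":
    # Print the request only; call send_ping() to actually notify the server.
    print(build_ping("My Blog", "http://blog.example/"))
```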

Also, a note to spammers - don't even bother sending spam our way. We can handle the throughput just fine. Further, unless our discovery engine has approved the blog as being ham we're just going to drop the ping and send it to /dev/null.

... however, I pretty much assume you're going to send us spam anyway. So have at it.

More on the Wordpress Blog Spam Cancer

Technorati published more information on the Wordpress blog spam cancer that's spreading around the Internet.

If you're running a version of Wordpress less than 2.5 you need to stop what you're doing NOW and upgrade! Don't wait until your blog is compromised.

The blogosphere has had its share of maladies before. Comment spam, trackback spam, splogs and link trading schemes are the colds and flus that we've come to know and groan about. But lately, a cancer has afflicted the ecosystem that has led us at Technorati to take some drastic measures. Thousands of WordPress installations out in the wilds of the web are vulnerable to security compromises, they are being actively exploited and we're not going to index them until they're fixed.

We know about them at Technorati because part of what we do is count links. Compromised blogs have been coming to our attention because they have unusually high outbound links to spam destinations. The blog authors are usually unaware that they've been p0wned because the links are hidden with style attributes to obscure their visibility. Some bloggers only find out when they've been dropped by Google, this WordPress user wrote

I've reached out to Ian Kallen to offer collaboration on fixing this issue.

We're going to push out a point release of Spinn3r to block blogs that exhibit this spam problem.

It's such a rare event to have hundreds of thousands of weblogs compromised in a systematic manner.

Spinn3r 2.2 Released

Spinn3r 2.2 rolled out the door today.

We've been working on a much larger release which is still pending, but we wanted to get new functionality out the door for some of our more recent clients.

So what's new?

We've added the ability to register weblogs directly within Spinn3r. All that's necessary is to call a new source.register method with a link to a weblog or any URL that has an RSS feed and publishes dynamic content. Spinn3r will then do the rest. We'll fetch the HTML page, perform RSS autodiscovery, and then add it to our source list and start crawling in real time.

What's interesting is that this allows our clients to collaborate on weblog discovery. Spinn3r does a great job at discovering weblogs but there are some niche sources where we'd love to have a few more signals to help out in our spam detection.

This release also includes a number of fixes and additions, including:

  • Our permalink crawler API now adds the ability to filter by API tier.
  • We've added better mainstream media site detection.
  • A new post:resource_guid field is available within Spinn3r results to identify a unique post.
  • New publisher types including FORUM, CLASSIFIED, and REVIEW.

It sounds crazy but we've also started a sub-project to allow Spinn3r to also license spam content. We've had a few malware and anti-virus companies approach us looking for a solid stream of real time spam posts. Unfortunately, Spinn3r wasn't set up to provide this as 99% of our customers are only interested in ham.

This adds a new spam_probability backend variable which isn't exposed just yet. We'll allow our customers to add &spam_probability=x.x in their API call to control how much spam they want to receive.
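Since the variable isn't exposed yet, the following is only a sketch of how a client might append it to an existing API URL once it is. The helper name, the example URL, and the assumption that the value is a probability between 0.0 and 1.0 are all hypothetical; only the `spam_probability` parameter name and the `x.x` format come from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_spam_probability(api_url, probability):
    """Return api_url with spam_probability=x.x set (replacing any existing value)."""
    if not 0.0 <= probability <= 1.0:
        # Assumed range; the post does not define the valid values.
        raise ValueError("probability must be between 0.0 and 1.0")
    parts = urlsplit(api_url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "spam_probability"]
    query.append(("spam_probability", f"{probability:.1f}"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # api.example.com is a placeholder, not a real Spinn3r endpoint.
    print(with_spam_probability("http://api.example.com/permalink.getDelta?limit=10", 0.3))
```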

Believe it or not, some customers would like to boost their signal a bit and accept more spam as a tradeoff to get a bit more recall.

By default, content from a registered source will only be available to the client who registered it. This allows clients with niche requirements to index special feeds (search feeds being a good example) without hurting any of our other customers.

Spinn3r 2.5 is also right around the corner. It's taken us a bit longer than we had hoped to bring our new hardware online. You can read about our progress here on my personal blog.

Massive Blog Spam Epidemic Gets More Attention

We've been covering a massive blog spam epidemic thanks to a nasty/evil spammer who's exploiting an XMLRPC bug in Wordpress 2.2.

This issue is FINALLY getting the attention it deserves:

I had a closer look at many of the blogs concerned that had spammy content — pages promoting credit cards, pharmaceuticals and the like, and I realized that if you go to the root domain they are all legitimate blogs. Not scraper blogs that were being auto-generated with adsense / affiliate links, which was extremely curious, and actually reminiscient of something that hit home a few months ago.

A few months ago, this blog got hacked — but in a sneaky way. Not only did the hackers insert “invisible” code into my template, so that I was getting listed in Google for all manner of sneaky (and NSFW terms), so that people could click on those links with the hacker getting the affiliate cash — but *actually*, said hackers also inserted fake tempates into my wordpress theme.

Techaddress is also covering this issue...

Oddly enough Tailrank picks up on this spam because of our clustering algorithm. We cluster common links and terms via our blog index and promote these stories to our front page.

Since we 'trust' stories based on past behavior, when major A-list blogs like ZDNet get owned we still treat their links as legitimate.

If we had a smaller index this might be a bit easier to handle but we're indexing 12M blogs within Tailrank and on Spinn3r.

Another way around this of course would be to blacklist every blog running Wordpress 2.2 or earlier but we're talking millions of blogs here and we don't want to unfairly harm anyone.

To date our approach has been to wait until Tailrank has identified the spam, and then blacklist any blogs that have been compromised.

Unfortunately this is a war of attrition with the spammer just spending a few more days and hacking another dozen or so sites.

The only positive aspect of this is that it's encouraging people to upgrade to Wordpress 2.5.

We're also working on some secondary algorithms to catch this a bit sooner and we'll probably ship these in Spinn3r 2.5 which is due shortly.