Spinn3r 3.0: New Features, New Architecture, New APIs - More Goodness
I'm proud to announce that we have just released Spinn3r 3.0 after more than a year of development.
This has been quite a lot of work based on feedback from our customer base and ships with some really awesome functionality.
Most of this time has been spent on architecture but a good deal has been spent implementing features for our rapidly growing user base.
When you outsource a major component of your infrastructure, like crawling, you tend to lean on it heavily and push it to the very edge.
Spinn3r has benefited significantly from our user base as they have suggested a number of excellent features. This has dramatically increased our reliability, performance, and feature set.
A good deal of work here has been spent on scalability, performance, and optimizations, including serious improvements to our core backend infrastructure.
There's quite a lot that's new in this release so I'll just dive in.
We're now powering startups who have raised in excess of $100M in VC funding.
What is interesting is that a large portion of the industry is standardizing around our infrastructure. Why wouldn't they? We've been running production applications for over three years now.
We haven't had a chance to announce this until now so we're pretty excited that this is finally public.
We have researchers at Harvard, Carnegie Mellon, Stanford, Caltech, University of Maryland Baltimore County, University of Washington, University of Southern California, Nanyang Technological University, University of Edinburgh, the National Institute of Informatics in Japan, the University of Hannover in Germany, and on and on.
Textmap is another search engine using Spinn3r. Their paper, "Large-Scale Sentiment Analysis for News and Blogs," from the 2007 International Conference on Weblogs and Social Media (ICWSM), does a good job of explaining their system.
We also have a number of customers performing entity extraction and sentiment analysis, and I think this space is going to mature considerably in the next few years.
We're sponsoring the International Conference on Weblogs and Social Media in San Jose this year.
We provided them with four months of data - nearly 400GB of blog data.
It turned out to be a huge success with more than 100 research groups requesting access. We've also provided them with direct access to Spinn3r and will continue to do so for the foreseeable future.
We will almost certainly sponsor ICWSM 2010 with a similar corpus, possibly expanding it with more data, including our comment API, permalink content, and content extract, which would increase the size to around 4TB.
New Admin Console
We now have a new web application to help developers interface with Spinn3r.
The general idea is that while Spinn3r provides a very powerful API, it was sometimes difficult for our new customers to get up and running. Further, once they were up and running, they would report a problem without a way to pinpoint what they were seeing.
Now with our console they can just log in, drill down on the specific datapoint they are interested in, and send in a URL documenting their question.
We have statistics on anything you can imagine. We have hosting provider breakdown, language breakdown, posts per hour, comments per hour, links per hour. We even have most of these broken down by blog host.
Here's a screenshot of our language breakdown:
As you can see, we're heavily biased toward English content as most of our customers are in the US.
We've also instrumented statistics about our customers including their individual API lag (or lack thereof), number of registered sources and their throughput, etc.
Further, we've now implemented web versions of most of our popular APIs to help easily debug Spinn3r.
For example, our customers can give us the URL to an A-list weblog and we can show them the most recent posts within Spinn3r:
There are interfaces for most features of Spinn3r. Of course one can always use the API directly and we have a great command line interface as well.
User commentary across blogs can be valuable for search engines and users but right now there are no real standards for indexing comments made within the blogosphere.
The wfw:comments and Atom threading standards exist (and we support them) but these are only supported within a minority of blogging systems.
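As a rough illustration of what the feed-level standard looks like in practice, here is a minimal sketch that pulls per-post comment feed URLs out of an RSS document using the `wfw:commentRss` element from the Well-Formed Web CommentAPI namespace. The sample feed, its URLs, and the `comment_feeds` helper are made up for the example; this is not Spinn3r's parser.

```python
import xml.etree.ElementTree as ET

# Namespace of the Well-Formed Web CommentAPI (the "wfw" prefix).
WFW_NS = "http://wellformedweb.org/CommentAPI/"

# Invented sample feed: only the first item advertises a comment feed.
SAMPLE_RSS = """<rss version="2.0" xmlns:wfw="http://wellformedweb.org/CommentAPI/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <wfw:commentRss>http://example.com/first-post/comments/feed</wfw:commentRss>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
    </item>
  </channel>
</rss>"""


def comment_feeds(rss_text):
    """Map each post link to its advertised comment feed URL, if any."""
    root = ET.fromstring(rss_text)
    feeds = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        feed = item.findtext(f"{{{WFW_NS}}}commentRss")
        if link and feed:
            feeds[link] = feed.strip()
    return feeds


print(comment_feeds(SAMPLE_RSS))
```

Posts like the second item, which advertise nothing, are the case where a feed-level standard gives a crawler no help.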
We've written hand-tuned parsers for fetching the remaining comments, and we support the majority of content management systems.
The only restriction is that we don't re-index content right now. This is going to change shortly after we ship 3.0 and probably make it into 3.1.
If you'd like beta access please send us an email and we'll provide you with additional documentation.
Hybrid Real Time Indexing
Spinn3r is directly integrated into the vast ping architecture. If we receive a ping from a weblog we immediately launch our crawlers to fetch the update.
The difficulty here is that not everyone sends pings. We have had a hybrid crawler for about 6-12 months now, which allows us to support both pings and sources on different indexing intervals. Currently about 70% of our content is fetched from pinged sources and the other 30% is fetched once every thirty minutes.
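To make the hybrid idea concrete, here is a toy sketch of a scheduler in which pinged sources become due immediately while everything else is revisited on a fixed thirty-minute interval. The `HybridScheduler` class and its method names are invented for illustration and say nothing about how Spinn3r's crawler is actually built.

```python
import heapq

POLL_INTERVAL = 30 * 60  # seconds; the thirty-minute interval mentioned above


class HybridScheduler:
    """Toy scheduler: pinged sources jump the queue, the rest are polled."""

    def __init__(self):
        self._queue = []      # min-heap of (due_time, source_url)
        self._polled = set()  # sources that are revisited on a fixed interval

    def add_polled_source(self, url, now):
        # Sources that never ping are fetched every POLL_INTERVAL seconds.
        self._polled.add(url)
        heapq.heappush(self._queue, (now + POLL_INTERVAL, url))

    def on_ping(self, url, now):
        # A ping signals fresh content, so the source is due right away.
        heapq.heappush(self._queue, (now, url))

    def due(self, now):
        """Pop every source whose due time has passed; re-arm polled ones."""
        ready = []
        while self._queue and self._queue[0][0] <= now:
            _, url = heapq.heappop(self._queue)
            ready.append(url)
            if url in self._polled:
                heapq.heappush(self._queue, (now + POLL_INTERVAL, url))
        return ready


scheduler = HybridScheduler()
scheduler.add_polled_source("http://example.com/quiet-blog", now=0)
scheduler.on_ping("http://example.com/pinging-blog", now=10)
print(scheduler.due(now=10))    # only the pinged source is ready
print(scheduler.due(now=1800))  # the polled source comes due after 30 minutes
```

A single priority queue keeps both kinds of source in one place; the only difference between them is when they are due.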
We've expanded our archive capacity and now host more than 7 months of content comprising some 21TB (twenty one terabytes) of content.
We have online capacity for up to 66TB of content and can expand to about 300TB by purchasing additional hardware. From this point on we plan on keeping all archives for all time.
This is made possible by the database migration we performed as part of Spinn3r 3.0 as well as our new datacenter migration.
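For a sense of what those capacity numbers imply, here is a back-of-the-envelope sketch. It assumes the archive keeps growing at the average rate implied by the figures above (21TB over 7 months, treating "more than 7 months" as exactly 7), which is a simplification; the function name is made up for the example.

```python
ARCHIVE_TB = 21     # current archive size, from the text
ARCHIVE_MONTHS = 7  # "more than 7 months", rounded down to 7 for the estimate


def months_of_headroom(capacity_tb, used_tb=ARCHIVE_TB, months=ARCHIVE_MONTHS):
    """Months until capacity_tb fills up, assuming constant average growth."""
    growth_per_month = used_tb / months  # about 3TB per month
    return (capacity_tb - used_tb) / growth_per_month


print(months_of_headroom(66))   # headroom within current online capacity
print(months_of_headroom(300))  # headroom after expanding to about 300TB
```

Under that assumption the current 66TB lasts roughly 15 more months and the 300TB expansion roughly 93, though real growth in the blogosphere is unlikely to stay flat.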
Full Source History
We've extended our API so that it's no longer just a raw crawler API. Now it's a full-blown database of the entire blogosphere.
You can give us a weblog, feed, or permalink and we can show you its entire history, which you can page through via the API going back in time.
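Paging backwards through a source's history can be pictured as cursor-based pagination over a time-sorted list. The sketch below is a generic in-memory illustration; the `page_back` function, the cursor semantics, and the sample data are assumptions made for the example and do not describe the actual Spinn3r API.

```python
from bisect import bisect_left


def page_back(posts, before, limit):
    """Return up to `limit` posts older than `before`, newest first.

    `posts` is a list of (timestamp, title) tuples sorted by timestamp.
    Also returns a cursor for the next (older) page, or None when done.
    Assumes timestamps are unique; ties at a page boundary would be skipped.
    """
    end = bisect_left(posts, (before,))  # index of first post at or after `before`
    start = max(0, end - limit)
    page = posts[start:end][::-1]
    next_cursor = page[-1][0] if start > 0 else None
    return page, next_cursor


history = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
cursor = 6  # start just after the newest post
while cursor is not None:
    page, cursor = page_back(history, cursor, limit=2)
    print(page)
```

Each call hands back the timestamp of the oldest post it returned, so the client can keep walking back until the cursor runs out.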
We completely rewrote our API result handling in 3.0 and we can now support a much higher throughput than before.
Assuming you're not bottlenecked by bandwidth, you should be able to sustain 10-30x over real time indexing.
At the top of that range, it takes about 1 hour to download 30 hours' worth of content (with higher throughput possible assuming you have the necessary bandwidth).
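The arithmetic behind these figures is simple, with one subtlety: while a client is downloading a backlog, new content keeps arriving in real time, so fully catching up takes slightly longer than downloading the backlog alone. A small sketch, with function names invented for the example:

```python
def download_hours(backlog_hours, speedup):
    """Hours needed to download a fixed backlog at `speedup` times real time."""
    return backlog_hours / speedup


def catch_up_hours(backlog_hours, speedup):
    """Hours until the client is fully current, counting content that keeps
    arriving at real-time rate while the backlog is being downloaded."""
    if speedup <= 1:
        raise ValueError("cannot catch up unless speedup exceeds real time")
    return backlog_hours / (speedup - 1)


print(download_hours(30, 30))  # 1.0 hour at the top of the 10-30x range
print(download_hours(30, 10))  # 3.0 hours at the bottom of the range
print(catch_up_hours(30, 10))  # a little over 3.3 hours to be fully current
```

At 30x the difference between the two is negligible; it only starts to matter as the speedup drops toward real time.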
This might sound like overkill but when our customers need access to archive data, or they've been offline for a long period of time, they want to catch up quickly.
We're also working on some API extensions to handle parallel downloading which should, in theory, mean we can index Spinn3r content as fast as we want just by throwing more hardware (and bandwidth) at the problem.
Right now it's not really a problem though, as the blogosphere could increase in size by 30x and we would easily be able to handle the load.
We're also going to be sharing our internal statistics with the public, with the goal of sharing as much as possible about Spinn3r. Our monitoring architecture allows us to track thousands and thousands of metrics, so we can monitor nearly everything about the blogosphere.
Our biggest competitor to date has been build vs buy. We have a few competitors out there but they can't compete with us on feature set or pricing.
With Spinn3r 3.0 we've made a very compelling case for the significant cost savings available by switching from a proprietary (or Open Source) crawler to Spinn3r.
If you're using Spinn3r you can save upwards of $45k per month in hosting costs. With the current economic situation it's starting to become very compelling to switch.
A number of people (including our customers) have asked me about Tailrank and what we're doing with the product.
We're shutting down the consumer-facing website because the architecture was difficult (and expensive) to maintain alongside Spinn3r 3.0. We're going to be selling off the assets and integrating the technology into a future version of Spinn3r.
If you're interested in purchasing the Tailrank assets, feel free to send me a private email or just contact us directly.
At this point Tailrank is not a market on which we want to focus. The space is too fragmented, has too much competition, and leaves very little room for innovation.
The clustering and ranking technology present in Tailrank 3.0 (which never shipped) will be refactored and integrated in the Spinn3r 3.1 or 3.2 timeframe (2-6 months).
At some point we might integrate memetracking directly into the Spinn3r corpus but at the moment our customers have a number of pending requests that we plan on servicing first.
As usual, we're always focused on customers and constantly improving the product. We're also working on a few things in parallel which use much larger data sets.
We're going to be pushing a few new architecture changes in the next sixty days which will allow us to perform additional computations and extract more metadata. The mainline thinking in information retrieval today is that while algorithms are smart, data is smarter.
We've taken that approach throughout our design. Of course one of the main problems is the sheer size of the infrastructure required to push this much data. With Spinn3r 3.0 we're nearly there and will have some interesting product features in the next quarter.