
This is Spinn3r's official weblog, where we discuss new product direction, feature releases, and all our cool news.

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data and you can focus on building your application / mashup.

Spinn3r handles all the difficult tasks of running a spider/crawler including spam prevention, language categorization, ping indexing, and trust ranking.

If you'd like to read more about Spinn3r you could read our Founder's blog or check out Tailrank - our memetracker.

Spinn3r is proudly hosted by ServerBeach.


Ignoring Blogroll and Sidebar Content in Search

2008-12-19 10:35

Google Blog Search shipped an update a few months back to index the full HTML of each new blog post.

The only problem is that they indexed the full HTML and not the article content:

I wanted to give everyone a brief end-of-the-year update on the blogroll problem. When we switched blogsearch to indexing the full text of posts, we started seeing a lot more results where the only matches for a query were from the blogroll or other parts of the page that frame the actual post. (There's been a lot of discussion of the problem. You can search for [google blogsearch] using Google Blogsearch.)

We're in the midst of deploying a solution for this problem. The basic approach is to analyze each blog to look for text and markup that is common to all of the posts. Usually, these common elements include the blogroll, any navigational elements, and other parts of the page that aren't part of the post. This approach works well for a lot of blogs, but we're continuing to improve the algorithm. The search results should ignore matches that only come from these common elements. The indexing change to implement it is deployed almost everywhere now.
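The approach described in the quote (finding text that is common to all of a blog's posts) can be sketched roughly as follows. This is only an illustration, not Google's implementation: it assumes each page has already been split into text blocks, and the names `find_common_blocks`, `strip_common_blocks`, and `min_fraction` are made up for this example.

```python
from collections import Counter

def find_common_blocks(pages, min_fraction=1.0):
    """Return the blocks that appear in at least min_fraction of the pages.

    `pages` is a list of pages, each a list of text blocks. How a page is
    split into blocks is an assumption left outside this sketch.
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_fraction * len(pages)
    return {block for block, n in counts.items() if n >= threshold}

def strip_common_blocks(blocks, common):
    """Keep only the blocks that are not shared across the blog's pages."""
    return [block for block in blocks if block not in common]

# Three posts from the same hypothetical blog, sharing a nav bar and blogroll.
pages = [
    ["Home | About", "Post one about crawlers", "Blogroll: Alice, Bob"],
    ["Home | About", "Post two about feeds", "Blogroll: Alice, Bob"],
    ["Home | About", "Post three about spam", "Blogroll: Alice, Bob"],
]

common = find_common_blocks(pages)
unique = [strip_common_blocks(page, common) for page in pages]
```

Here `common` ends up holding the nav bar and blogroll, and each entry of `unique` holds only the text specific to that post. Lowering `min_fraction` would tolerate chrome that is missing from a few pages, at the risk of dropping text that several posts genuinely share.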

Spinn3r customers have had a solution for this problem for nearly a year now.

The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds, but they only carry a one-line description instead of the full content of the post.

This has always been a problem with RSS search engines such as Feedster or Google Blog Search - what's the point of using a search engine that's not indexing 80% of potential content?

We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth increases significantly every month with each new user. The more traffic a blog gets, the greater the probability that it will switch to partial RSS feeds in order to reduce bandwidth costs and increase click-through rates.

Spinn3r 2.1 adds a new feature which can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.

It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.
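To give a feel for what scoring HTML as content versus navigation can look like, here is a minimal sketch. It is not Spinn3r's internal model, which isn't described here; it swaps in a simple heuristic (long text with few links is probably content) and every name in it, including `BlockCollector`, `content_score`, and the threshold values, is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption).
BLOCK_TAGS = {"div", "p", "ul", "ol", "td", "article", "section"}

class BlockCollector(HTMLParser):
    """Split a page into text blocks and track how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

def content_score(text, link_chars):
    """Heuristic score in [0, 1]: long text with few links scores high."""
    link_density = min(1.0, link_chars / len(text))
    length_factor = min(1.0, len(text) / 200)
    return (1.0 - link_density) * length_factor

def extract_content(html, threshold=0.3):
    """Return the text blocks whose score suggests they are post content."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text
    return [t for t, links in parser.blocks if content_score(t, links) >= threshold]

post_text = "Spinn3r indexes blog posts in real time. " * 5
html = (
    '<div><ul><li><a href="/a">Alice\'s blog</a></li>'
    '<li><a href="/b">Bob\'s blog</a></li></ul></div>\n'
    "<div><p>" + post_text + "</p></div>\n"
)
result = extract_content(html)
```

In this example the blogroll block is entirely link text, so it scores zero and is dropped, while the post paragraph is kept. A real model would need far more signals than link density and length.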

See the yellow text in the image on the right? That was identified algorithmically and isolated from the body of the post.

To be fair it's a difficult problem but I've had a few years to think about it.

