Ignoring Blogroll and Sidebar Content in Search
A few months back, Google Blog Search shipped an update that indexes the full HTML of each new blog post.
The only problem is that they index the full HTML of the page and not just the article content:
I wanted to give everyone a brief end-of-the-year update on the blogroll problem. When we switched blogsearch to indexing the full text of posts, we started seeing a lot more results where the only matches for a query were from the blogroll or other parts of the page that frame the actual post. (There's been a lot of discussion of the problem. You can search for [google blogsearch] using Google Blogsearch.)
We're in the midst of deploying a solution for this problem. The basic approach is to analyze each blog to look for text and markup that is common to all of the posts. Usually, these common elements include the blogroll, any navigational elements, and other parts of the page that aren't part of the post. This approach works well for a lot of blogs, but we're continuing to improve the algorithm. The search results should ignore matches that only come from these common elements. The indexing change to implement it is deployed almost everywhere now.
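To make the quoted approach concrete, here is a minimal sketch of the "common to all of the posts" idea. It is not Google's implementation; the function names, the 0.8 threshold, and the idea of representing each page as a list of text blocks are all assumptions made for illustration.

```python
from collections import Counter

def find_common_blocks(pages, threshold=0.8):
    """Return the text blocks that appear on at least `threshold` of the pages.

    `pages` is a list of pages, each a list of text blocks (hypothetical
    representation; a real crawler would derive these from the HTML).
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block at most once per page
    min_pages = threshold * len(pages)
    return {block for block, n in counts.items() if n >= min_pages}

def strip_common_blocks(blocks, common):
    """Keep only the blocks that are not shared page furniture."""
    return [block for block in blocks if block not in common]

# Three posts from the same (made-up) blog, sharing a nav bar and blogroll.
pages = [
    ["Home | About", "Blogroll: Alice, Bob", "Post one about search."],
    ["Home | About", "Blogroll: Alice, Bob", "Post two about feeds."],
    ["Home | About", "Blogroll: Alice, Bob", "Post three about crawling."],
]
common = find_common_blocks(pages)
post_text = strip_common_blocks(pages[0], common)
```

Only `post_text` would then be indexed, so a query that matches nothing but the blogroll no longer returns the post. Exact string matching is the weak point of this sketch: a sidebar that changes slightly from page to page would slip through.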
Spinn3r customers have had a solution for this problem for nearly a year now.
The quality of mainstream media RSS feeds is notoriously lacking. For example, CNN has RSS feeds, but they only carry a one-line description instead of the full content of the post. This has always been a problem with RSS search engines such as Feedster or Google Blog Search - what's the point of using a search engine that's not indexing 80% of potential content?
We're also seeing the same thing with a number of the A-list blogs. RSS feeds turn into a liability when bandwidth use climbs every month with each new subscriber. The more traffic a blog gets, the greater the probability that it will switch to partial RSS feeds in order to reduce bandwidth costs and increase click-through rates.
Spinn3r 2.1 adds a new feature that can extract the 'content' of a post and eliminate sidebar chrome and other navigational items.
It does this by using an internal content probability model and scanning the HTML to determine what is potentially content and what's potentially a navigation item.
See the yellow text in the image on the right? That was identified algorithmically and isolated from the body of the post.
To be fair, it's a difficult problem, but I've had a few years to think about it.
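Spinn3r's content probability model is internal, so the following is only a rough sketch of the general idea of scanning HTML and deciding which blocks look like content and which look like navigation. It swaps in a simple link-density heuristic (the share of a block's text that sits inside links) as a stand-in for a real model; the class name, the block tag list, and both thresholds are assumptions.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"div", "p", "ul", "ol", "td", "article", "section"}

class BlockScorer(HTMLParser):
    """Split a page into text blocks and track how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.link_depth = 0
        self._new_block()

    def _new_block(self):
        self.current = {"text": [], "chars": 0, "link_chars": 0}

    def flush(self):
        # Close the current block, keeping it only if it holds any text.
        if self.current["chars"]:
            self.blocks.append(self.current)
        self._new_block()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.current["text"].append(text)
        self.current["chars"] += len(text)
        if self.link_depth:
            self.current["link_chars"] += len(text)

def content_blocks(html, max_link_density=0.5, min_chars=40):
    """Return blocks that are long enough and not dominated by link text."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for block in parser.blocks:
        density = block["link_chars"] / block["chars"]
        if density <= max_link_density and block["chars"] >= min_chars:
            kept.append(" ".join(block["text"]))
    return kept

sample_html = """
<div><ul><li><a href="/a">Alice's blog</a></li><li><a href="/b">Bob's blog</a></li></ul></div>
<div><p>Spinn3r 2.1 can extract the content of a post and drop the sidebar chrome around it.</p></div>
"""
extracted = content_blocks(sample_html)
```

On the sample page the blogroll list is all link text, so it is dropped, while the paragraph survives. A real model would weigh many more signals than link density and length, which is part of why the problem is hard.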