As will undoubtedly be making the rounds across the blogosphere today, Google has just launched Google Blog Search. Google is perhaps the single company most identified with search, so its entrance into the blog search space is a big milestone, even though the idea of blog search has been around for years.
For the basics of what the Blog Search team has done, take a look at the Frequently Asked Questions list, which does a good job of covering them. But at a time when everyone will be talking about (and hopefully thinking about) blog search again, it makes sense to review where we've been so far, and what problems still need to be solved in the realm of blog search.
First, the new Blog Search works. All the basic functions you'd expect from Google search results are present, including ranking results by date or by relevance. (Interestingly, the default is by relevance, like other Google searches, instead of by date, which is the default for most blog displays.) But more importantly, the advanced search offers powerful functionality such as searching by date ranges and limiting to individual blog authors, in addition to features like searching for words in a blog post title or by language, which have been deployed in the past on other services.
The new features in Google Blog Search are useful because of a (perhaps subtle) distinction in how it works compared to the traditional searches powered by Google's Googlebot indexer. Google Blog Search works by crawling XML feeds, rather than simply crawling the HTML output of a blog. Because feeds are, at least ideally, better structured than the published HTML of most blogs, it's possible to extract information like the authorship of a post in a fairly consistent way.
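To make the point about structure concrete, here is a minimal sketch in Python of pulling authorship out of a feed. It assumes an Atom 1.0 feed; the sample feed, the author names, and the helper names extract_posts and posts_by_author are made up for illustration, and none of this describes how Google's crawler actually works.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 namespace; feeds in other formats would need different paths.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, invented for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>Hello, feeds</title>
    <author><name>Jane Example</name></author>
    <updated>2005-09-14T08:00:00Z</updated>
  </entry>
  <entry>
    <title>Searching by author</title>
    <author><name>John Sample</name></author>
    <updated>2005-09-14T09:30:00Z</updated>
  </entry>
</feed>"""


def extract_posts(feed_xml):
    """Return one dict per entry with its title, author name, and update time."""
    root = ET.fromstring(feed_xml)
    posts = []
    for entry in root.findall("atom:entry", ATOM_NS):
        posts.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "author": entry.findtext("atom:author/atom:name", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return posts


def posts_by_author(posts, author):
    """Keep only the posts written by the given author (exact name match)."""
    return [post for post in posts if post["author"] == author]


posts = extract_posts(SAMPLE_FEED)
janes_posts = posts_by_author(posts, "Jane Example")
```

Because the author lives in a predictable element, the same few lines work for any well-formed Atom feed. Doing the same thing against arbitrary blog HTML would mean guessing at each template's markup.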
The potential for getting the value of structured data out of feeds is one of the reasons we've created technologies like the Six Apart Update Stream. To get an idea of what the Update Stream does, it helps to know the informal name we've been using for it inside our office: AtomStream.
AtomStream is an endless flow of Atom posts, presenting the updates to LiveJournal and TypePad free for any tool or application that wants to consume them. There are even client libraries available, which our own Tatsuhiko Miyagawa developed to support easy consumption of the stream. As Ben Trott outlined in his earlier post, and Brad Fitzpatrick indicated when launching the service, we think it's important to make all of the public posts from our services available in a consistent way so that valuable services like search can be built on top of them.
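For a rough idea of what consuming an endless flow of Atom posts involves, here is a small Python sketch that parses entries incrementally as chunks of bytes arrive, without waiting for the document to end. It is not the Update Stream's actual wire format or one of the client libraries mentioned above: the never-closed stream wrapper element, the chunk boundaries, and the function name entries_from_chunks are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def entries_from_chunks(chunks):
    """Yield a dict for each Atom entry as soon as its closing tag arrives.

    `chunks` is any iterable of bytes, e.g. reads from a long-lived
    HTTP response. The stream is assumed to be one XML document whose
    root element is never closed.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == ATOM + "entry":
                yield {
                    "title": elem.findtext(ATOM + "title", default=""),
                    "author": elem.findtext(f"{ATOM}author/{ATOM}name", default=""),
                }
                # Drop the entry's children so a long-running stream
                # does not keep every post's content in memory.
                elem.clear()


# Simulated network reads, deliberately split in the middle of tags and text.
simulated_chunks = [
    b'<stream><entry xmlns="http://www.w3.org/2005/Atom"><title>Fir',
    b'st</title><author><name>Ann</name></author></entry>'
    b'<entry xmlns="http://www.w3.org/2005/Atom"><title>Second</ti',
    b'tle><author><name>Bob</name></author></entry>',
]

received = list(entries_from_chunks(simulated_chunks))
```

The design point is that the consumer never asks for anything. It holds one connection open and handles posts as they are published, which is what removes the need for repeated pinging and crawling.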
Right now, people are wasting a lot of resources pinging and crawling for information that's intended to be public, instead of focusing on the quality, reliability, and utility of the data they are trying to present. And many solutions aren't neutral in how they send the data, placing undue burdens on individual aggregation points that can become costly to support. In short: We want blog data to be useful and ubiquitous.
All the technical details aside, content indexing should just work, and that's our goal for people who use our platforms. Publish with TypePad or LiveJournal or Movable Type, and all the blog search systems, including Google Blog Search, should just pick up your posts automatically if they're public, thanks to AtomStream. We're not 100% there yet, but we think we're getting there.
And there's still a lot more potential for helping people discover the blogs, feeds, and content they're interested in. David Galbraith raised some interesting points when Ben first talked about AtomStream:
Why should news or weblog search be architecturally different from ordinary web search?
Reliable ping servers and decent specs would mean they wouldn't have to be, and we would be able to search the whole web for the most recent information.
There are many other excellent analyses of the opportunities around blog search, of course. Mary Hodder did a good job of separating two very distinct uses in her look at specific aspects such as keyword search and her earlier post on link tracking. These are two of the most popular applications, but a range of related goals is also ripe for development, such as basic blog discovery by topic or community; synthetic or search-powered smart feed generation (which Google's new tool also does); analytical, trend-based or summary-oriented search for marketers and researchers; or even more structured filtering and extraction, such as those powered by microformats or structured blogging.
We're glad to see the conversation around blog search get kick-started again, and we're hoping to see even more powerful new platforms launch. We'll be adding more rich data and functionality to our own services, including refining AtomStream to be more valuable over time. Meanwhile, Google's Blog Search will ideally mark a milestone in advancing the conversation around blog search beyond the basics of reliability and data quality, and on to more complex considerations like the ones listed above, as well as the ones nobody's even thought of yet.