Topic Auto-Discovery
The introduction of topic servers like the Internet Topic Exchange and Reversible started me thinking about TrackBack, and how to make it easier for sites like these to gain traction and be used by non-technical users. Currently, the sites are not autodiscoverable--you have to know about them in order to use them. This limits usage to a certain set of technical people and tinkerers, generally; but the idea behind these sites--that of providing a centralized topic/category repository for distributed content--is one that would benefit the entire community.
So, the question is: how can we make these services transparent to users?
We have TrackBack auto-discovery for individual posts (some would argue that it could be improved, and they'd generally be correct, but the point is, it exists). What we need for this, though, is a step beyond that: we need auto-discovery for topics or categories. For example, envision the following:
- I make a new post in my weblogging tool, and I select the category "Perl" for that post.
- My weblogging tool contacts the Internet Topic Exchange and asks it if it has any Perl-related categories. It chooses the closest match (for some definition of "closest"), and sends the TrackBack ping URL back to my weblogging tool.
- My weblogging tool then sends a TrackBack ping to the Topic Exchange category.
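The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: the topic list, the `topics.example.org` ping URLs, and the helper names `closest_topic` and `build_ping_body` are all made up for the example, and "closest" is taken to mean a fuzzy string match, which is just one possible definition. The ping fields (title, url, excerpt, blog_name) are the standard TrackBack ping parameters.

```python
import difflib
from urllib.parse import urlencode

# Hypothetical topic list: topic name -> TrackBack ping URL.
# A real topic server would supply these; the URLs here are placeholders.
TOPICS = {
    "perl": "http://topics.example.org/t/perl/",
    "python": "http://topics.example.org/t/python/",
    "weblogs": "http://topics.example.org/t/weblogs/",
}


def closest_topic(category, topics, cutoff=0.6):
    """Return (topic_name, ping_url) for the best fuzzy match, or None."""
    matches = difflib.get_close_matches(
        category.lower(), list(topics), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], topics[matches[0]]


def build_ping_body(title, url, excerpt, blog_name):
    """Form-encode the standard TrackBack ping parameters."""
    return urlencode(
        {"title": title, "url": url, "excerpt": excerpt, "blog_name": blog_name}
    )


match = closest_topic("Perl", TOPICS)
if match is not None:
    topic, ping_url = match
    body = build_ping_body(
        "My new post", "http://example.com/archives/1.html",
        "A short excerpt.", "Example Weblog",
    )
    # A real tool would now POST `body` to `ping_url` with the content type
    # application/x-www-form-urlencoded; the request itself is omitted here.
```

Whether the matching runs on the topic server or inside the weblogging tool is an open design choice; the sketch does it on the client only because that keeps the example self-contained.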
Some notes:
The topic server--in the above scenario, the Internet Topic Exchange--could either act as the repository of these TrackBack pings itself, or redirect me to other TrackBack-enabled Perl sites around the web by sending back offsite ping URLs.
An alternative to Step 2, above, would be that the topic server could publish a list of all of the topics it knows about, with associated ping URLs. Weblogging tools could pick up this list and cache it periodically, to lessen the load on the topic server.
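As a rough sketch of that caching alternative: a weblogging tool could wrap whatever call downloads the published topic list in a small cache that only goes back to the server once the stored copy is older than some maximum age. The class name `TopicListCache`, the one-day default, and the `fetch` callable are assumptions made for the example; how the list is actually retrieved is deliberately left out.

```python
import time


class TopicListCache:
    """Keep a local copy of a topic list and refresh it only when stale."""

    def __init__(self, fetch, max_age_seconds=86400, clock=time.time):
        self._fetch = fetch            # callable returning {topic: ping_url}
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._topics = None
        self._fetched_at = None

    def topics(self):
        """Return the cached list, contacting the server only if needed."""
        now = self._clock()
        stale = (
            self._topics is None
            or now - self._fetched_at >= self._max_age
        )
        if stale:
            self._topics = self._fetch()
            self._fetched_at = now
        return self._topics


def fetch_from_server():
    # Placeholder for the real download (hypothetical data and URL).
    return {"perl": "http://topics.example.org/t/perl/"}


cache = TopicListCache(fetch_from_server)
first = cache.topics()    # contacts the "server"
second = cache.topics()   # served from the local copy
```

With a one-day maximum age, each weblog would hit the topic server at most once a day no matter how many posts it publishes.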


10 Comments
see also: Net::ITE
If you can figure how it should parse a closest match I'd add it ;-)
Ben, I'd stay away from making it completely transparent for three reasons:
1) If you pick one or two of them, you burn software basically in perpetuity (i.e. people are slow to upgrade) that will send a disproportionate amount of traffic to those sites. What happens later if the trailblazers go belly up, or if some other sites have killer features that will never be used because they aren't burned in and hence never get traffic?
2) You don't want to overwhelm any of those sites with more traffic than they can handle. This could become a scaling problem with a few hundred thousand bloggers posting about the Bloggies and Weblog power distribution.
3) Many things that people write are crap. Often they are poorly (or subjectively) labelled. This could make the topic servers as relevant as search used to be back in the old days before Google.
I think a better way to do it would be to keep a list of the topic servers on the Movable Type servers. Have Movable Type periodically refresh its cache of topic servers so that they can be fluidly added and deleted. Then, give users a choice (i.e. another checkbox) to ping a topic server after publishing the entry.
The next part is difficult, but I'm sure with some thought we could overcome it: give the user a choice of which topic server to ping. I was thinking along the lines of a popup like the secondary category popup, but popups suck in general. Perhaps this could be done beforehand in the blog config, much like we choose a default search engine for in-browser searches. (e.g. Google in Mozilla toolbar).
Anyway, that's all for now.
RE: Categories... The most difficult part of user-created categories is that seldom will two people (or 100,000) come up with the same category for a given parcel of content. My suggestion would be that either MT or the MT-suggested category servers offer a single category schema (DMOZ categories would be a wonderful idea, as they're well thought-out and responsive to changes in the internet! Also, using a third-party categorist would allow the creation of an API of sorts for other uses of the data), perhaps allowing the user to amend the category, e.g. DMOZ's COMPUTERS | INTERNET could be amended to be COMPUTERS | INTERNET | blogging
It's not as articulate as I'd hoped, but hopefully it gets the idea out there.
>What I dislike, however, is that trackbacks aren't sent within my site. -Erik
I agree. I think I saw something like this on Sam Ruby's blog?
Jay: I definitely agree.
>DMOZ categories would be a wonderful idea
The Internet Topic Exchange channels are already using a DMOZ scheme in addition to the topic directory.
I'm not clear why this is a three step process, and not a two (or even one) step process, especially if one continues the line of thinking that was toyed around with in the "Trackback, Moving Forward" post. If a TB ping becomes an RSS POST, that chunk of RSS can contain the category information (and potentially a lot of other useful data). The server on the receiving end can then do as it pleases with whatever data has been included in the chunk of RSS, and simply send back a "thanks." The server can then try to match a category based on the user's category, or analyze the post's content and attempt to make a match that way.
Having the weblog software query the server for a TB URL seems like an extra step. I do look forward to seeing this type of functionality in future releases.
How about just doing this: whenever people cast a post to a shared channel, display its name and link to the channel below the post, so readers can also discover the channel, and the topic service?
Ben:
It's currently possible to get a list of topics out of the Internet Topic Exchange via the XML-RPC API or via the plain topic list. So it's definitely possible to do your alternative suggestion (that weblogging tools find out about all the topics and handle the auto-selection by themselves).
One weblogging tool, Georg Bauer's Python Desktop Server, already lets you select topics from the ITE list. An approach like this would work well with MT, I think.
Jay:
1) If we consider the ITE interface (the URLs, usage of TrackBack, the XML-RPC API, etc) as the standard way to implement a topic server, we could publish a list of topic servers somewhere (movabletype.org or weblogs.com, perhaps) and tools could auto-update from there. That would avoid tying people to one site.
2) I don't think scalability is much of a problem, for the ITE at least ... even 100K bloggers posting 5 times a day won't overload the ITE once I've changed from CGI to mod_python (which will happen when the load gets a bit higher). Right now it should be able to handle about 30K new posts a day, but a slight bit of optimisation should be able to bring it up to the 1.5M mark.
Seb:
Exactly. That's how liveTopics will be doing it for Radio.
Hi Ben,
Just a thought on dealing with topic/category sharing. On the one hand it's unlikely to be feasible to map topics as literals between people (John's 'cats' are feline, Jane's 'cats' are taxonomic). On the other hand, I can't see the end user going to much trouble over selecting a category from a 'standard' like dmoz.
I do think there may be a middle route though, possible if the categories are identified using URIs.
If the category is initially created in the user's own namespace, e.g. I might have
http://dannyayers.com/categories#cats
then this term is uniquely identified. Later, either I or a third party can come along and decide that this is equivalent to
http://dmoz.org/categories#cats
(or whatever their syntax is)
The existence of a mapping, locally or on a 3rd party's server should be enough for many applications, though I think it would be feasible to make a substitution within the blog later if desired. The mapping could be easily expressed in semweb-friendly RDF.
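A minimal sketch of such a mapping, assuming it is held as a plain lookup table rather than as RDF: each entry says "this category URI is equivalent to that one", and resolving a category just follows the entries until it reaches a URI with no further mapping. The function name `resolve_category` and the table itself are illustrative only, and the dmoz.org URI is the placeholder syntax used above, not a real identifier.

```python
# Hypothetical equivalence table: local category URI -> shared category URI.
EQUIVALENTS = {
    "http://dannyayers.com/categories#cats": "http://dmoz.org/categories#cats",
}


def resolve_category(uri, equivalents):
    """Follow equivalence mappings until a URI with no mapping is reached."""
    seen = set()
    # The `seen` set guards against accidental cycles in the table.
    while uri in equivalents and uri not in seen:
        seen.add(uri)
        uri = equivalents[uri]
    return uri


shared = resolve_category("http://dannyayers.com/categories#cats", EQUIVALENTS)
# `shared` is now the dmoz.org placeholder URI from the table above.
```

Because the original post keeps its own URI and only the table changes, a mapping can be added, corrected, or removed later without touching the weblog entries themselves.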
I'm sure this has been talked about already, but it would be nice to get a bot to do the category matching/mapping based on the name and words in the content of each category (not easy, but feasible).
Thought I'd throw my $.02 into the ring here...
I created a TrackBack endpoint for any arbitrary URL a while back:
http://www.popdex.com/blog/archives/000029.html
This is only useful for TrackBack in relation to a specific URL, not a Topic. My idea was that a TrackBack ping could be sent to a TB aggregator (like Popdex) and all the Trackbacks for a particular URL could be displayed in one place.... Nothing does this right now (because the key part of TB is the excerpt + extra info). Just crawling for links doesn't get you all that extra info.
I hacked the MT code in Entry.pm and added three lines to ping the Popdex TB URL endpoint for each linked URL it finds:
(hope it's alright if I post this here...)
sub save {
    my $entry = shift;
    ## If we need to auto-discover TrackBack ping URLs, do that here.
    require MT::Blog;
    my $blog = MT::Blog->load($entry->blog_id);
    if ($blog->autodiscover_links) {
        my $archive_url = $blog->archive_url;
        my %to_ping = map { $_ => 1 } @{ $entry->to_ping_url_list };
        my %pinged = map { $_ => 1 } @{ $entry->pinged_url_list };
        my $body = $entry->text;
        ## Find each link in the entry body: $1 is the optional quote
        ## character around the href value, $2 is the URL itself.
        while ($body =~ m!<a.*?href=(["']?)([^'">]+)\1!gsi) {
            my $url = $2;
            ## Skip links that point back into this weblog's own archives.
            next if $url =~ /^$archive_url/;
            if (my $item = discover_tb($url)) {
                $to_ping{ $item->{ping_url} } = 1
                    unless $pinged{ $item->{ping_url} };
            }
            # --- BEGIN POPDEX MOD ---
            ## Also queue a ping to Popdex for this linked URL.
            my $popdex_ping = "http://popdex.com/api/tb/?link=" . $url;
            $to_ping{ $popdex_ping } = 1
                unless $pinged{$popdex_ping};
            # --- END POPDEX MOD ---
        }
        $entry->to_ping_urls(join "\n", keys %to_ping);
    }
    ## (the rest of the stock save() method follows here, unchanged)
}
The TB pings then show up on a Popdex citation page for a particular URL, e.g.:
http://www.popdex.com/c/1371672
Yes... there is the same problem of getting lots of TB pings where people just post drivel, but I think you'd occasionally get some good nuggets. As one of you guys said earlier, too, it makes a lot more sense for blogs to "Push" their blog post info to indexes, instead of them (popdex/daypop/blogdex/organica/technorati) constantly crawling all these sites.
i agree, that sounds like a sound concept. you've got me thinking.