Making search better: smarter algorithms, or richer metadata?

Ephraim Schwartz’s article on search fatigue starts with a poke at Microsoft (I did the same a couple of months ago), but goes on to look at the more interesting question of how search results can be improved. Schwartz quotes a librarian called Jeffrey Beall who gives a typical librarian’s answer:

The root cause of search fatigue is a lack of rich metadata and a system that can exploit the metadata.

It’s true up to a point, but I’ll back algorithms over metadata any day. A problem with metadata is that it is never complete and never up-to-date. Another problem is that it has a subjective element: someone somewhere (perhaps the author, perhaps someone else) decided what metadata to apply to a particular piece of content. In consequence, if you rely on the metadata you end up missing important results.

In the early days of the internet, web directories were more important than they are today. Yahoo started out as a directory: sites were listed hierarchically and you drilled down to find what you wanted. Yahoo still has a directory; so does Google; another notable example is dmoz. Directories apply metadata to the web; in fact, they are metadata (data about data).

I used to use directories, until I discovered AltaVista, which, as Wikipedia says, was “the first searchable, full-text database of a large part of the World Wide Web.” AltaVista gave me many more results; many of them were irrelevant, but I could narrow the search by adding or excluding words. I found it quicker and more useful than trawling through directories. I would rather make my own decisions about what is relevant.
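To make the "adding or excluding words" idea concrete, here is a minimal sketch of full-text search over an inverted index. It is only an illustration, not how AltaVista worked internally; the function names `build_index` and `search` and the three sample documents are made up for the example.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, include, exclude=()):
    """Return ids of documents containing every include word and no exclude word."""
    if not include:
        return set()
    # Intersect the posting sets of all required words (this builds a new set).
    hits = set.intersection(*(index.get(w.lower(), set()) for w in include))
    # Narrow the results by dropping documents that contain an excluded word.
    for w in exclude:
        hits -= index.get(w.lower(), set())
    return hits


# Hypothetical sample documents, invented for illustration.
docs = {
    1: "How to fix a leaking pipe under the sink",
    2: "Choosing tobacco for your first pipe",
    3: "Using a pipe to chain shell commands",
}
index = build_index(docs)

print(sorted(search(index, ["pipe"])))                               # [1, 2, 3]
print(sorted(search(index, ["pipe"], exclude=["tobacco", "sink"])))  # [3]
```

The broad query returns everything that mentions the word, relevant or not; the searcher, rather than a cataloguer, decides how to narrow it.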

The world agreed with me, though it was Google and not AltaVista which reaped the reward. Google searches everything, more or less, but ranks the results using algorithms based on who knows what – incoming links, the past search habits of the user, and a zillion other factors. This has changed the world.
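As a rough illustration of ranking by incoming links, the sketch below orders pages that already matched a query by how many other pages link to them. This is plain link counting, chosen for simplicity; it is not Google's actual algorithm, which uses many more signals. The function name `rank_by_incoming_links` and the example sites are assumptions made up for the example.

```python
from collections import Counter


def rank_by_incoming_links(matches, links):
    """Order matching pages by how many distinct other pages link to them."""
    incoming = Counter()
    for source, targets in links.items():
        for target in set(targets):      # count each linking page only once
            if target != source:         # ignore links from a page to itself
                incoming[target] += 1
    # Most-linked pages first; ties are broken alphabetically.
    return sorted(matches, key=lambda page: (-incoming[page], page))


# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}
matches = ["a.example", "b.example", "c.example", "d.example"]

print(rank_by_incoming_links(matches, links))
# ['c.example', 'a.example', 'b.example', 'd.example']
```

Nobody had to describe or categorise these pages by hand: the ordering falls out of links that already exist.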

Even so, we can’t shake off the idea that better metadata could further improve search, and therefore improve our whole web experience. Wouldn’t it be nice if we could distinguish homonyms like pipe (plumbing), pipe (smoking) and pipe (programming)? What about microformats, which identify rich data types like contact details? What about tagging (even this post is tagged)? Or all the semantic web stuff which has suddenly excited Robert Scoble:
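A small sketch shows both the appeal of tags and the catch mentioned earlier. Filtering by tag separates the different senses of "pipe", but a page whose author never tagged it drops out of the results. The data and the function names `search_by_tag` and `search_full_text` are hypothetical, invented for the example.

```python
def search_full_text(docs, word):
    """Return ids of documents whose text contains the word."""
    return [doc_id for doc_id, doc in docs.items()
            if word in doc["text"].lower().split()]


def search_by_tag(docs, word, tag):
    """Return ids of documents that contain the word and carry the given tag."""
    return [doc_id for doc_id in search_full_text(docs, word)
            if tag in docs[doc_id]["tags"]]


# Hypothetical documents; the third was never tagged by its author.
docs = {
    1: {"text": "Replace a burst pipe in the kitchen", "tags": {"plumbing"}},
    2: {"text": "Pipe output from one command into another", "tags": {"programming"}},
    3: {"text": "Named pipe basics for shell scripts", "tags": set()},
}

print(search_by_tag(docs, "pipe", "programming"))  # [2]
print(search_full_text(docs, "pipe"))              # [1, 2, 3]
```

The tag query is precise but misses document 3, which is about programming and simply was not labelled; the full-text query finds it, at the cost of also returning the plumbing page.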

Basically Web pages will no longer be just pages, or posts. They’ll all be split up into little objects, stored in a database (a massive, scalable one at that) and then your words can be displayed in different ways. Imagine a really awesome search engine that could bring back much much more granular stuff than Google can today.

Maybe, but I’m a sceptic. I don’t believe we can ever be sufficiently organized, as a global community, to follow the rules that would make it work. Sure, there is and will be partial success. Metadata has its place; it will always be there. But in the end I don’t think the clock will turn back; I think plain old full-text search combined with smart ranking algorithms will always be more important, to the frustration of librarians everywhere.

Technorati tags: search, semantic web, google, scoble, altavista, tagging

Tim Anderson's IT Writing
