Making search better: smarter algorithms, or richer metadata?

Ephraim Schwartz’s article on search fatigue starts with a poke at Microsoft (I did the same a couple of months ago), but goes on to look at the more interesting question of how search results can be improved. Schwartz quotes a librarian called Jeffrey Beall who gives a typical librarian’s answer:

The root cause of search fatigue is a lack of rich metadata and a system that can exploit the metadata.

It’s true up to a point, but I’ll back algorithms over metadata any day. A problem with metadata is that it is never complete and never up to date. Another problem is that it has a subjective element: someone somewhere (perhaps the author, perhaps someone else) decided what metadata to apply to a particular piece of content. In consequence, if you rely on the metadata you end up missing important results.

In the early days of the internet, web directories were more important than they are today. Yahoo started out as a directory: sites were listed hierarchically and you drilled down to find what you wanted. Yahoo still has a directory; so does Google; another notable example is dmoz. Directories apply metadata to the web; in fact, they are metadata (data about data).

I used to use directories, until I discovered AltaVista, which, as Wikipedia says, was “the first searchable, full-text database of a large part of the World Wide Web.” AltaVista gave me many more results; many of them were irrelevant, but I could narrow the search by adding or excluding words. I found it quicker and more useful than trawling through directories. I would rather make my own decisions about what is relevant.
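To make "adding or excluding words" concrete, here is a minimal sketch of full-text search over an inverted index. It is not how AltaVista worked internally; the function names (build_index, search) and the three sample documents are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, all_ids, include=(), exclude=()):
    """Return ids of documents with every include word and no exclude word."""
    results = set(all_ids)
    for word in include:
        # Adding a word narrows the result set (logical AND).
        results &= index.get(word.lower(), set())
    for word in exclude:
        # Excluding a word removes every document that contains it.
        results -= index.get(word.lower(), set())
    return sorted(results)


# Hypothetical sample documents, not taken from any real site.
docs = {
    1: "copper pipe fittings for plumbing",
    2: "briar pipe and tobacco",
    3: "unix pipe connects two processes",
}
index = build_index(docs)
print(search(index, docs, include=["pipe"]))
print(search(index, docs, include=["pipe"], exclude=["tobacco", "plumbing"]))
```

Nobody had to label document 3 as "programming" for the second query to find it; the searcher did the narrowing, using nothing but the words on the page.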

The world agreed with me, though it was Google and not AltaVista which reaped the reward. Google searches everything, more or less, but ranks the results using algorithms based on who knows what – incoming links, the past search habits of the user, and a zillion other factors. This has changed the world.
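As a toy illustration of ranking by incoming links, the sketch below orders pages that already match a query by how many distinct other pages link to them. This is emphatically not Google's algorithm, which weighs far more than a raw link count; the function name and the example.com-style hosts are made up.

```python
def rank_by_inbound_links(matches, links):
    """Order matching pages by the number of distinct pages linking to them."""
    inbound = {page: set() for page in matches}
    for source, targets in links.items():
        for target in targets:
            # Ignore self-links and links to pages that did not match.
            if target in inbound and target != source:
                inbound[target].add(source)
    # Most-linked first; ties broken alphabetically for a stable order.
    return sorted(matches, key=lambda page: (-len(inbound[page]), page))


# Hypothetical link graph: each key links to the pages in its list.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "b.example"],
}
matches = ["a.example", "b.example", "c.example"]
print(rank_by_inbound_links(matches, links))
```

The signal comes from what other authors did (linking), not from metadata anyone attached to the page itself.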

Even so, we can’t shake off the idea that better metadata could further improve search, and therefore improve our whole web experience. Wouldn’t it be nice if we could distinguish homonyms like pipe (plumbing), pipe (smoking) and pipe (programming)? What about microformats, which identify rich data types like contact details? What about tagging (even this post is tagged)? Or all the semantic web stuff which has suddenly excited Robert Scoble:

Basically Web pages will no longer be just pages, or posts. They’ll all be split up into little objects, stored in a database (a massive, scalable one at that) and then your words can be displayed in different ways. Imagine a really awesome search engine that could bring back much much more granular stuff than Google can today.

Maybe, but I’m a sceptic. I don’t believe we can ever be sufficiently organized, as a global community, to follow the rules that would make it work. Sure, there is and will be partial success. Metadata has its place; it will always be there. But in the end I don’t think the clock will turn back; I think plain old full-text search combined with smart ranking algorithms will always be more important, to the frustration of librarians everywhere.

 

Infinitely scalable web services

Amazon’s Jeff Barr links to several posts about building scalable web services on S3 (web storage) and EC2 (on-demand server instances).

I have not had time to look into the detail of these new initiatives, but the concept is compelling. This is where Amazon’s programmatic approach pays off in a big way. Let me summarise:

1. You have some web application or service. Anything you like. Football results; online store; share dealing; news service; video streaming; you name it.

2. Demand of course fluctuates. When your server gets busy, the application automatically fires up new server instances and performance does not suffer. When demand tails off, the application automatically shuts down server instances, saving you money and making those resources available to other EC2 users.

3. Storage is not an issue; S3 has unlimited expandability.
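The scaling decision in step 2 can be sketched in a few lines. This is only the arithmetic, under assumptions of my own: the function names, the per-instance capacity figure and the instance limits are all hypothetical, and no real EC2 calls are made.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      minimum=1, maximum=20):
    """Return how many instances the current load needs, within limits."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(minimum, min(maximum, needed))


def scaling_action(current, requests_per_second, capacity_per_instance):
    """Decide whether to launch, terminate, or hold, and by how many."""
    target = desired_instances(requests_per_second, capacity_per_instance)
    if target > current:
        return ("launch", target - current)
    if target < current:
        return ("terminate", current - target)
    return ("hold", 0)


# Assume, purely for illustration, that one instance handles 100 requests/s.
print(scaling_action(current=2, requests_per_second=450, capacity_per_instance=100))
print(scaling_action(current=5, requests_per_second=80, capacity_per_instance=100))
```

A real service would also want some damping so that it does not start and stop servers on every brief spike, but the principle is the same: the application measures demand and adjusts its own capacity.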

This approach makes huge sense. Smart programming replaces brute force hardware investment. I like it a lot.

 


120 days with Vista

Is there any more to say about Vista? Probably not; yet after reading 30 days with Vista I can’t resist a few comments.

The author, Brian Boyko, says:

On two separate computers I had major stability problems which resulted in loss of data. This is an unforgivable sin …. Additionally, Vista claims backwards compatibility, but I’ve had major and minor problems alike with many of my games, more than a few third-party applications, my peripherals, and, in short, I encountered problems that actively prevented me from getting my work done. Based on my personal experiences with Vista over a 30 day period, I found it to be a dangerously unstable operating system, which has caused me to lose data.

As for me, I installed Vista RTM on four computers shortly after it was released to manufacturing in November last year. Two plain desktops, one media center, one laptop. Just for the record, my experience is dull by comparison with Boyko’s. No lost data; all my important apps run fine; I am not plagued by UAC prompts; the OS is stable.

Have there been hassles? Yes. TortoiseSVN crashes Explorer from time to time; a perfectly good Umax scanner has no driver; Vista on the laptop had severe resume problems which only recently seem to have been fixed by a BIOS update. And Creative’s X-Fi drivers for Vista are terrible. There are also annoyances, like Vista’s habit of thinking your documents are music.

At the same time, I’ve seen nothing to change my opinion that the majority of Vista’s problems are driver-related. Overall I like it better than XP; it doesn’t get in the way of my work and I would hate to go back.

When I do use XP, some of the things I miss are the search box in the Start menu (the Vista Start menu is miles better in other ways as well); the thumbnail previews in the taskbar and in Alt-Tab switching; and copy and paste which doesn’t give up at the first hurdle. I also miss Vista’s more Unix-like home directories, sensibly organized under Users rather than buried in Documents and Settings.

Security-wise, I consider both User Account Control and IE’s protected mode to be important improvements.

Forget the “Wow”. This is just the latest version of Windows; and it’s not as good as it should be, five years on from XP.

Nevertheless, it is a real improvement, and I’ve been happy with it over the last four months.

 
