ButtUgly: Main_blogentry_250105

Why Technorati tags don't

Tagging has become the latest hype word-du-jour, mostly due to services such as del.icio.us, Flickr, and now, Technorati. Clay Shirky and others have written strong statements for this folksonomy phenomenon.

I personally love tags. They are a very cool way of attaching meaning to information - essentially putting the semantics into the web in the "Semantic Web" sense, even if the metadata is dissociated from the pages themselves. But as a non-native English speaker I see a potentially fatal flaw here: most Internet users don't speak English as their first language. Even though I speak decent English and use a lot of English services, I still tag things in both English and my native language.

And that means that tags will become "language polluted." Take a look at the Technorati tag for "Macintosh", for example. Many of the blog entries are in Japanese.

If you look at Orkut, many parts of it suddenly became "owned" by Brazilians, which essentially drove away English speakers (I haven't checked how they have handled this). USENET coped with this by having separate hierarchies for each country (so sfnet is all Finnish) and "accepted" languages on each newsgroup. But tags don't have any way to indicate the language.

The situation is worse than it should be, because entries on RSS feeds and blogs almost never state what their language is. In fact, I would guess that most RSS feeds claim that the language is "en-US" regardless of their actual content. People like me write in two languages on the same blog. Atom has the possibility of setting the language-per-entry, but I sincerely doubt that anyone will bother to set the language, unless they are relatively passionate about the subject.
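To make the per-entry language idea concrete, here is a minimal sketch of how a tag aggregator could read it. It assumes an Atom 1.0 style feed where the language is given with the standard xml:lang attribute, and that an entry without its own xml:lang inherits the feed-level one. The sample feed and the function name entry_languages are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}name" notation.
ATOM = "{http://www.w3.org/2005/Atom}"            # assumption: Atom 1.0 feed
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample: the feed claims English, one entry overrides it.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>Example weblog</title>
  <entry>
    <title>Why tags don't</title>
  </entry>
  <entry xml:lang="fi">
    <title>Kalastus</title>
  </entry>
</feed>
"""


def entry_languages(feed_xml):
    """Return (title, language) pairs, falling back to the feed language."""
    root = ET.fromstring(feed_xml)
    feed_lang = root.get(XML_LANG)  # None if the feed declares no language
    result = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        # An entry-level xml:lang wins over the feed-level one.
        result.append((title, entry.get(XML_LANG, feed_lang)))
    return result


if __name__ == "__main__":
    for title, lang in entry_languages(SAMPLE_FEED):
        print(lang, title)
```

Of course this only helps if the declared language is true, which is exactly the problem when a feed says "en-US" no matter what it contains.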

There are three cases of "language collision" on tags (I'm using English and Finnish as an example only here).

1. The tag is different in English and in Finnish. For example "fishing" and "kalastus". This should pose no problem, as the folksonomies grow on each of the tags independently.
2. The tag is the same in English and in Finnish, but the meaning of the tag is different. In this case, the dominant mass of the users will "hijack" the tag.
3. The tag is the same in both languages, but the web pages will be in different languages. This is the case with things like trademarks (Apple, Macintosh, Nokia), or when people like to tag Finnish pages with English tags (like me: I use the word "blog" to mark any significant articles about blogs, regardless of the language). This reduces the usefulness of tags for people who do not understand Finnish.
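The last case above is the one where knowing the page language would help the most. As a rough sketch, and not how any of the services mentioned actually work, a tag index could store the page language next to each URL and let the reader filter on it. The class TagIndex, its methods and the example.org URLs are all hypothetical.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical tag index that remembers the language of each page."""

    def __init__(self):
        # tag -> list of (url, language code) pairs, in insertion order
        self._pages = defaultdict(list)

    def add(self, tag, url, lang):
        self._pages[tag.lower()].append((url, lang))

    def pages(self, tag, lang=None):
        """Return URLs for a tag, optionally only those in one language."""
        hits = self._pages.get(tag.lower(), [])
        return [url for url, page_lang in hits
                if lang is None or page_lang == lang]


if __name__ == "__main__":
    index = TagIndex()
    index.add("blog", "http://example.org/en/tagging", "en")
    index.add("blog", "http://example.org/fi/merkinta", "fi")
    print(index.pages("blog", lang="en"))
```

The shared tag stays shared, so nobody has to invent "blog-fi", but a reader who does not understand Finnish can still ask for the English pages only.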

There is also an additional tagging problem with languages such as Finnish: the same word can be inflected and written in multiple ways, depending on the context. It is much the same as the problem of using different words for the same concept, but it does increase the number of potential strings three- to fourfold.

There are few solutions to this problem, and probably all of them involve some sort of heuristic to determine the language of the tag and the web page. Tagging is still a relatively new technique to be adopted in mass classification of things, but in order for it to become truly successful, one must still remember localization. Otherwise, it will be the dominance of the masses that drives the use - and it ain't gonna be English.
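One very crude example of such a heuristic is to count common function words of each language in the page text. The sketch below assumes only two candidate languages, and the tiny word lists are hand-picked for illustration; a real detector would need far more data. The names STOPWORDS and guess_language are mine.

```python
import re

# Tiny hand-picked word lists, for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it",
           "was", "for", "with", "are"},
    "fi": {"ja", "on", "ei", "että", "se", "oli", "mutta", "tai",
           "kun", "niin", "myös", "ovat"},
}


def guess_language(text, min_hits=2):
    """Guess "en" or "fi" from stopword counts, or None when unsure."""
    # Runs of letters only; digits, underscores and punctuation split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(1 for word in words if word in stopwords)
              for lang, stopwords in STOPWORDS.items()}
    ranked = sorted(scores.values(), reverse=True)
    # Refuse to guess on too little evidence or on a tie.
    if ranked[0] < min_hits or ranked[0] == ranked[1]:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_language("The tag is the same in both languages"))
    print(guess_language("Tämä on suomea ja se ei ole englantia"))
    print(guess_language("Apple Macintosh Nokia"))
```

Note that a bare list of trademarks gives no answer at all, which is why the heuristic has to look at the page and not just at the tag.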

Comments

I think the community thought is that other languages will adopt their own tagging folksonomy and that's ok.

The real problem with tagging is that it's cool, but will people tag long-term? Often cool means cool today, annoying tomorrow. I can't see the pace of manual tagging continuing forever forward. It's gonna drop off and get left behind.

--Randy Charles Morin, 25-Jan-2005

But it will not work in case #3, because Apple is Apple in every language.

Perhaps tags should also be ordered by time as weblogs are, instead of considering them to be like wikis - static structures which you can revisit.

--JanneJalkanen, 25-Jan-2005


"Main_blogentry_250105_1" last changed on 25-Jan-2005 17:22:48 EET by JanneJalkanen.