Call it Mission: *Impossible* for coders

2003 June 9

Like the elite group of government agents on the 1960s television show, a group of computer scientists and natural language experts was given a «mission» earlier this week: within a month, build a program that translates between English and a randomly chosen language.

Translation for specific purposes

The project, funded by the Defense Advanced Research Projects Agency, challenges researchers to quickly build translation tools when unforeseen needs arise.

The exercise is designed to imitate the need for translation during a national security threat, like a terrorist act, war or humanitarian crisis.

The element of surprise in the project is critical. Since the beginning of June, computational linguistics research groups from around the country have been gathering resources on the pop-quiz language, Hindi.

«During the Cold War, the United States only had to keep up with a handful» of languages, said Doug Oard, associate professor in the College of Information Studies at the University of Maryland, College Park. «Now, it's very hard to predict where things are going to become of key interest.»

Research groups at the University of Maryland, Johns Hopkins University and the Information Sciences Institute at the University of Southern California, among others, will spend this month pooling data from dictionaries, religious texts, news sources and native speakers.

Statistical machine translation

The system will churn through the data and build statistical models that turn words and phrases into their English counterparts. In this particular exercise, the goal is to feed a Hindi document into the system and get an English version back. Researchers also want to build an engine that can automatically summarize documents and classify texts by theme.

During the process, called statistical machine translation, the computer counts how often a particular word in one language is translated as a particular word in the other. It also tracks finer details, such as the order of the words.
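As a rough illustration of that counting, the sketch below runs a few rounds of IBM Model 1 style expectation-maximization over a toy set of sentence pairs. The article does not say which model the teams used, so the choice of algorithm, the function name `train_model1` and the four romanized Hindi phrases are assumptions made purely for illustration. Word order is not modeled here.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[f][e], the chance that foreign word f is rendered as English word e.

    pairs is a list of (foreign_tokens, english_tokens) sentence pairs.
    Simplified sketch: no NULL word, no word-order (distortion) model.
    """
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    english_vocab = {e for _, es in pairs for e in es}

    # Start from a uniform table: every pairing is equally plausible.
    uniform = 1.0 / len(english_vocab)
    t = {f: {e: uniform for e in english_vocab} for f in foreign_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)

        # E-step: split one unit of credit for each English word among
        # the foreign words in the same sentence pair.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[f][e] for f in fs)
                for f in fs:
                    share = t[f][e] / norm
                    count[f][e] += share
                    total[f] += share

        # M-step: turn the fractional counts back into probabilities.
        for f in foreign_vocab:
            for e in english_vocab:
                t[f][e] = count[f][e] / total[f]

    return t


# Toy parallel corpus (hypothetical, romanized for readability).
corpus = [
    ("ek ghar", "a house"),
    ("ek kitab", "a book"),
    ("yah kitab", "this book"),
    ("yah ghar", "this house"),
]
pairs = [(f.split(), e.split()) for f, e in corpus]
table = train_model1(pairs)


def best(word):
    """Return the English word with the highest estimated probability."""
    return max(table[word], key=table[word].get)


for word in ("ek", "yah", "ghar", "kitab"):
    print(word, "->", best(word))
```

Because each Hindi word here co-occurs most often with its true counterpart, the repeated counting sharpens the table toward the right pairings. A real system would need far more data and a model of word order.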

In March 2003, several smaller groups of researchers did a practice run for the project. DARPA gave them two weeks to build a system that could translate Cebuano, a language spoken in the Philippines, into English.

Many of the researchers didn't know where Cebuano was spoken, and locating resources was difficult. Hindi presents a different problem: vast resources exist, but there is no standard method of encoding the characters.

«Right now there is still this chaotic coding system, which makes life very hard for us,» said Franz Josef Och, a researcher at USC's Information Sciences Institute who is working on the project. «In English, everybody encodes in ASCII, basically,» but languages with other scripts do not. «Right now all the groups are addressing the encoding problems.»
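One small piece of the encoding problem can be sketched as follows: given raw bytes of unknown encoding, try a list of candidate decodings and keep the first one that yields mostly Devanagari characters. This is only a sketch under stated assumptions. The candidate list, the 0.8 threshold and the function names are hypothetical, and the font-specific legacy encodings the researchers faced would need their own conversion tables, which are not shown.

```python
def devanagari_ratio(text):
    """Fraction of non-space characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


def to_unicode(raw, candidates=("utf-8", "utf-16-le", "utf-16-be"), threshold=0.8):
    """Return (text, encoding) for the first candidate that looks like Hindi.

    Returns (None, None) when no candidate produces mostly Devanagari text.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if devanagari_ratio(text) >= threshold:
            return text, encoding
    return None, None


# The word "Hindi" in Devanagari, written with escapes to keep this file ASCII.
sample = "\u0939\u093f\u0928\u094d\u0926\u0940"

for source_encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    text, detected = to_unicode(sample.encode(source_encoding))
    print(source_encoding, "->", detected, text == sample)
```

Checking the script of the decoded text matters because a wrong decoding often succeeds without raising an error; it just produces characters from the wrong script.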

Information filtering

Given all of the clutter on the Internet, some resources may not be useful, but the machine should be able to filter out low-quality information.

«The hope is that all these bad translations are only random noise,» Och said. «The systematic pattern that we observe in these correct translations will dominate the system.»
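Och's point can be illustrated with a small simulation. The sketch below assumes that bad sources pick a translation at random while good sources agree on one word; the vocabulary, the 40 percent error rate and the function name `observe` are made up for the example. Errors that are systematic rather than random would not wash out this way.

```python
import random
from collections import Counter


def observe(true_word, vocab, n, error_rate, rng):
    """Simulate n translations of one word gathered from sources of mixed quality."""
    seen = []
    for _ in range(n):
        if rng.random() < error_rate:
            seen.append(rng.choice(vocab))  # bad source: any word, at random
        else:
            seen.append(true_word)  # good source: the consistent translation
    return seen


rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = ["house", "book", "river", "road", "school", "market"]
observations = observe("house", vocab, n=1000, error_rate=0.4, rng=rng)

counts = Counter(observations)
winner, winner_count = counts.most_common(1)[0]
print(winner, winner_count)
print(counts.most_common(3))
```

Even with four in ten observations corrupted, the noise is spread thinly across many wrong words, so the consistent answer still has by far the highest count.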

In theory, this Hindi-and-English system could be useful to the military or the media, for instance, if they want to monitor the ongoing tension between Pakistan and India.

«You'd be able to read what the Indian newspapers are saying and what Hindi organizations are putting up on their websites, whether they are terrorists or high schools, for example,» said Eduard Hovy, director of the natural language group at the Information Sciences Institute.

«Every paper has a slant, and the slant that the local population is reading is important to understand if you may be going there,» Oard said.

An exercise, not a project

Still, the challenge is only an exercise for these researchers, and there are no plans to continue funding the system built this month.

«It is a nice illustration of how we can put together what we already know, but it doesn't really represent new research challenges for us,» Hovy said.

Yet it's possible that commercial vendors or some part of the government might be interested in developing these kinds of systems, he added.

Participants discussed the Cebuano exercise at a recent Human Language Technology Conference, and other researchers from around the world seemed interested in the challenge, Hovy said.

«It was surprising to see the enthusiasm that other people felt,» he said. «It's quite possible something will happen again.»

Building these machine translation systems will likely inspire new research ideas.

«We're clearly in a world where the problem of getting the message to you has been, in large measure, solved,» he said. «Now the (important) part is recognizing the message when it arrives and making use of it.»

Further information: www.wired.com, June 7, 2003