Machine Translation Glossary (Kevin Knight)

Machine Translation - techniques for allowing construction workers and architects from all over the world to communicate better with each other so they can get back to work on that really tall tower.

Artificial Intelligence - plot device used in movies like 2001, Terminator, and Matrix. Nothing to worry about.

N-gram - measurement of the weightiness of words and phrases. For example, "and" weighs 1 gram, while "mass destruction" weighs 2 grams.

Smoothing - the technical process of explaining why the machine's output is not perfect, to an observer unfamiliar with machine translation.

Model - a highly simplified, idealized version of a real process or object, often found in Computational Linguistics journal and Vogue.

Perplexity - the usual result of a drop in Bleu score. See also: It should work.

Translation Table - currently the fastest known computer-science method of filling memory space.

Distortion - what happens when you hold an English sentence up to a carnival funhouse mirror.

Sentence Alignment - a manual pastime often enjoyed by first-year graduate students while eating pizza and watching TV.

Word Alignment - believed to be the next upcoming "hit" or "craze" among first-year graduate students.

Decoding - the technical process of mechanically converting Arabic "code" text into normal English. By contrast, English-to-Arabic machine translation requires the reverse technical process, known as encoding.

Bleu Score - a device for showing that more data helps.

Test Data - data hastily assembled at the end of a project just to make sure everything works fine.

Cognates - related words in different languages, such as "Ohio" (English) and "Ohayoo" (Japanese).

It should work - one of two important statistical machine translation slogans. See also: But it's just one command.

Joint Probability - probability of obtaining a joint. See also: Toke-nization.

Toke-nization - like, dude, this punctuation mark was like floating around, all by itself!

Romanization - the insidious global-cultural process of forcing French people to watch the movie "Gladiator."

Monotone Translation - a translation that carefully preserves the same level of boring-ness present in the source text.

Syntax - a type of branching tree in which the fruit is, like, really high up there.

Semantics - like when she says "that's fine" but she really means "that's not fine." Beyond the scope of current techniques.

Corpus - singular of "corpi." Often used in the old Latin phrase habeus corpus bilinguis disci ("I have the bilingual corpus on CD-ROM").

Translation Memory - situation in which a translation system has already seen the source text translated before. Contrasts with translation deja vu, in which the system only thinks it has seen the source text before.

Comparable Corpora - two text collections that have nothing to do with each other.

But it's just one command - i.e., one UNIX command. Used, for example, to describe how to get complete word-sequence frequencies for a 10-million-word text collection.

Probability - any real number between zero and one, inclusive, satisfying all the following three propositions: it is less than or equal to 1, it is greater than or equal to 0, and it is a real number.

EM Training - a computational technique for converting a pile of feathers into scrambled eggs, and vice-versa.