Luke’s Languages: You will always what?


Good morning to you! Or, if you prefer to be greeted in the French supplied by Google Translate, bonjour à vous!

That may have seemed like a pleasant greeting, but it’s actually the introduction to today’s topic, namely computer translation. It’s a topic dear to many linguists’ hearts, especially syntacticians, since computer science and modern syntax kind of shared a cradle.

But it’s so far from perfect. To give you an idea of what we’re in for, I’ll point out two things that the Francophones will already have noticed: “bonjour” means “good day”, not specifically “good morning”; and “vous” would be inappropriate if you were a close friend and not a reader I pretend to be familiar with in the hope of being charming.

Oh, and when you translate it back to English, you get “Hello there!”

Machine translation has been around for the last half-century, and it’s really been exploding over the last two decades. There are a few reasons it took so long to get in gear. One is that you need some serious financing and expertise to construct the resources needed to do it. Another is that we didn’t have a great idea of how to do it.

Actually, we still don’t. At present, there are two broad ways to translate. The first you might reasonably have expected; it’s called rule-based translation. There’s the morphological dictionary for both the source and the target language, with all kinds of information about how each word is used: its part of speech (noun, verb…), its conjugations and declensions (run, runs, ran…), the domain of its meaning, and so on. Then there’s the bilingual dictionary to give the equivalent in the other language. Then there’s the grammar that deconstructs a sentence and figures out how to put it in the right order in the other language. Getting complex already? Okay, now we need a set of those for each pair of languages we want to translate to and from.
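For the curious, here is a minimal sketch in Python of those three pieces working together. Everything in it is invented for illustration: the four-word dictionaries, the tag names, and the single reordering rule (French usually puts an adjective like “noir” after its noun). A real system needs thousands of entries and rules, plus things like gender agreement that this toy ignores.

```python
# Toy "morphological dictionary": only the part of speech of each English word.
# (Hypothetical entries, chosen just for this example.)
POS = {"the": "DET", "black": "ADJ", "cat": "NOUN", "sleeps": "VERB"}

# Toy bilingual dictionary, English to French.
EN_FR = {"the": "le", "black": "noir", "cat": "chat", "sleeps": "dort"}


def reorder(tagged):
    """Toy grammar with one rule: swap each ADJ + NOUN pair into NOUN + ADJ."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return out


def rule_based_translate(sentence):
    """Tag each word, reorder with the grammar rule, then look up each word."""
    tagged = [(word, POS[word]) for word in sentence.lower().split()]
    return " ".join(EN_FR[word] for word, _ in reorder(tagged))


print(rule_based_translate("The black cat sleeps"))
```

Notice that you can read off exactly why the output came out the way it did: one rule fired, and four dictionary entries were looked up.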

But in the end, it’s effective, right? Well, up to a point. One of its strengths is that to some degree it’s actually understanding what it’s doing. You can trace the logic of the program and figure out why it made the choices it did, and you can then alter the rule or update the dictionary entry. But one of the shortcomings is that it’s really, really hard to capture the grammar of a language. You need a linguist to analyze it. And worse, not all the finer points of a language are easy to express; one of the major areas of research in linguistics is accounting for abnormalities. Language is one of the most complex things our brains do, and until we understand how we ourselves understand it, it will be impossible to simulate the process.

So there’s an alternative: statistical machine translation. It was thought of as early as the rule-based stuff was, but it only really became feasible with the computational power and the digitization of corpora we’ve had since the nineties.

Here’s how it works. The Canadian government publishes its proceedings in English and French. That’s a corpus. A programmer sets up the software to align each English sentence with the French one occurring at the corresponding place in the corresponding document. For example, the sentence “We need to get started as soon as possible” might line up with “Il faut commencer dès que possible”.

Now, imagine some grade nine student taking French against her will who wants to translate the sentence “We need to get started as soon as possible”. She types it in. The machine doesn’t understand the sentence at all, but that doesn’t matter. It can just hand her the professional human translation of her sentence. Nice.
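In code, that whole-sentence trick is little more than a lookup table. Here is a rough Python sketch, assuming a tiny hand-made corpus: the first sentence pair is the one from the example above, the second is a filler I made up, and the function names are my own.

```python
# Toy sentence-aligned corpus of (English, French) pairs.
ALIGNED = [
    ("We need to get started as soon as possible",
     "Il faut commencer dès que possible"),
    ("Thank you", "Merci"),  # invented filler pair
]


def normalize(sentence):
    """Trim whitespace and final punctuation, and lowercase, so near-identical inputs match."""
    return sentence.strip().rstrip(".!?").lower()


def build_memory(pairs):
    """Index the human translations by their normalized English sentence."""
    return {normalize(english): french for english, french in pairs}


def lookup(memory, sentence):
    """Return the stored human translation, or None if the sentence was never seen."""
    return memory.get(normalize(sentence))


memory = build_memory(ALIGNED)
print(lookup(memory, "We need to get started as soon as possible."))
print(lookup(memory, "Where is the library?"))  # never seen, so None
```

The second call shows the catch: a sentence nobody has translated before gets nothing at all, which is why the real thing has to work with smaller pieces.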

Okay, now take the scale of the corpus Google’s got. One of its sources is the 200 billion words of published United Nations proceedings it acquired in 2006, which are in six languages: Arabic, Chinese, English, French, Russian, and Spanish. Just in case you missed it, that’s 200 billion words. And the machine is going to translate your sentence using the biggest pieces it can find matches for, performing a probabilistic analysis of the many contexts each term appears in to find the most likely equivalent.

It’s inhuman, and it’s brilliant.
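To make the “biggest pieces” idea concrete, here is a small Python sketch. It is a simplification under loud assumptions: the phrase table and all of its counts are invented, and the greedy longest-match strategy is my stand-in for the far more elaborate statistics a real system uses.

```python
from collections import Counter

# Hypothetical phrase table: for each English phrase, how often each French
# rendering was seen in an aligned corpus. All counts are made up.
PHRASE_COUNTS = {
    ("as", "soon", "as", "possible"): Counter(
        {"dès que possible": 9, "aussitôt que possible": 3}),
    ("we", "need", "to"): Counter({"il faut": 7, "nous devons": 5}),
    ("get", "started"): Counter({"commencer": 8, "démarrer": 2}),
    ("possible",): Counter({"possible": 10}),
}


def best(phrase):
    """Return the most frequent rendering of a phrase and its relative frequency."""
    counts = PHRASE_COUNTS[phrase]
    target, n = counts.most_common(1)[0]
    return target, n / sum(counts.values())


def phrase_translate(sentence):
    """Greedily cover the sentence with the longest phrases found in the table."""
    words = sentence.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest chunk starting at position i first, then shrink it.
        for j in range(len(words), i, -1):
            chunk = tuple(words[i:j])
            if chunk in PHRASE_COUNTS:
                out.append(best(chunk)[0])
                i = j
                break
        else:
            out.append(words[i])  # unknown word: pass it through untranslated
            i += 1
    return " ".join(out)


print(phrase_translate("We need to get started as soon as possible"))
print(best(("we", "need", "to")))
```

At no point does the program know what any of the words mean. It only counts.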

Of course, it has its shortcomings too. For one thing, such huge corpora don’t exist between all languages. This can be partially resolved the way Google does it: if there’s no corpus of, say, Belarusian to Arabic, first translate to English and then to Arabic. But with every step you lose a little more accuracy and flatten more nuances.
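Here is a tiny Python sketch of how going through English flattens a nuance. I am using Spanish and French as stand-ins because their informal and formal words for “you” make the loss easy to see; the two mini-dictionaries and the function name are invented for illustration.

```python
# Hypothetical one-word dictionaries. Spanish distinguishes informal "tú"
# from formal "usted"; English has only "you" for both.
ES_TO_EN = {"tú": "you", "usted": "you"}
EN_TO_FR = {"you": "vous"}


def pivot(word, first, second):
    """Translate through an intermediate language: source to pivot, then pivot to target."""
    middle = first.get(word)
    if middle is None:
        return None  # the first dictionary has no entry for this word
    return second.get(middle)


print(pivot("tú", ES_TO_EN, EN_TO_FR))
print(pivot("usted", ES_TO_EN, EN_TO_FR))
```

Both words come out as “vous”, because by the time the second step runs, the informal/formal distinction has already been erased by the English in the middle.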

And the corpora you do have will be specialized (the UN mostly talks about politics), and some terms will be better represented than others. Consider the phrase “Japanese prisoner of war camp”. Is it (a) a Japanese camp for prisoners of war or (b) a camp for Japanese prisoners of war? It’s only through seeing it translated in our corpora that we can guess. Google Translate’s attempt to render it in French is “Camp japonais de prisonniers de guerre”, which is meaning (a). But try “Russian prisoner of war camp”, no doubt much less frequent, and it yields “Prisonnier russe du camp de la guerre”, which means neither (a) nor (b) but the unlikely “Russian prisoner of the war camp”.

And sometimes the sheer weight of how frequently a phrase is used overrides what the phrase actually means. Google Translate turns the English “I write for fun, and I believe I always will” into the French “J’écris pour le plaisir, et je crois que je t’aimerai toujours”. That is, “I believe I will always love you”. What?! you say. Where did “love you” come from?! Simple. In French you can’t omit the verb there after “will”, so it has to be supplied by context. All the machine knows is that, in its corpus, the phrase “I believe I always will” mostly shows up in declarations of love. Therefore, the probability that this is the right translation is very high.

The way rule-based translation understands what it’s doing as it goes word by word through the phrase seems kind of better now, don’t it?

Of course, there are many more fascinating problems with both approaches, and many more variations of both approaches, that I’d love to talk about. Some people even propose that automatic translation is impossible in principle: Can machines ever properly choose between truly ambiguous meanings? Does everything in every language have a guaranteed equivalent in every other language? (Puzzle over that for a while next time you’re bored.) The best summary is that no approach so far quite captures how humans translate.

And for those of you who were hoping I’d argue that you should indeed use Google Translate for your French composition—sorry, but the verdict is still no. That is, unless you want to accidentally tell your prof that you’ll always love him.