Beyond "Please fondle my buttocks"

As more Arabic speakers take up blogging, better translation programs could promote cross-cultural understanding -- and avoid Monty Python-like misunderstandings.

Published March 31, 2005 9:31PM (EST)

Tens of thousands of people around the Middle East read Ali Abdulemam's online discussion forum every day. Yet few Westerners knew that authorities in Bahrain detained Abdulemam over his Web site for two weeks in early March; although his detention sparked local protests, the story didn't grab much attention beyond the Arab world. Westerners who don't understand Arabic can't read the controversial posts (which criticized the government's treatment of the Shiite majority), nor can they read Abdulemam's personal blog, a firsthand look at life in the Middle East.

Current technology lets bloggers around the world make their voices heard, but only to a degree. If a blogger speaks Arabic and a reader speaks English, or vice versa, it doesn't matter how fast or how global the Internet is. A blogger and reader who can't understand each other might as well be living in the days of the Pony Express.

To push the frontier of cross-cultural communication and bridge the gap in understanding between citizens of the United States and the Middle East, some pioneers have called for an enhancement of blogging technology, suggesting that machine translation software might be able to crack the language barrier.

Tim Oren, a Silicon Valley venture capitalist, is not the only blogger who supports machine translation, but he may be the most prolific, having written about the idea on both his own blog, Due Diligence, and the group blog Winds of Change. He may also be the boldest. He's currently trying to persuade machine translation companies to integrate their programs into popular blogging tools. He even has a plan to make translated texts searchable by what his pseudonymous online collaborator, Lewy Katorz, dubs "Rosettabots."

As the name "Rosettabot" suggests, the idea of interlingual understanding predates blogs. Even the idea of promoting such understanding on the Internet predates blogs. When Oren worked for CompuServe over a decade ago, he and his colleagues applied an automatic translation program to electronic bulletin boards, the online meeting places of the time. In a forum devoted to multicultural exchange, messages appeared nearly simultaneously in English, Spanish, French and German.

The software available at that time produced such clunky output that many interlingual conversations involved jokes about the resulting translations. The forum blossomed nonetheless, attracting thousands of users and reaping large profits. "It was quite successful, both monetarily and socially," Oren says. "So I've at least got one proof that something like this can work."

Oren's current project aims at more than making money or encouraging friendship among strangers. At a Harvard conference on the Internet and society last December, Oren met some champions of the blooming Arabic blog scene. "The first wave of [Arabic bloggers] are incredibly articulate in English," Oren says. "But part of what everybody's trying to do is kick off an independent Arabic blogosphere. And I think it would be a great shame if that resulted in losing connections to some of those folks."

Many of the earliest Middle Eastern bloggers, such as Iraqis Salam Pax and Zeyad, write in fluent English, but a growing number plug Arabic characters into the English interfaces of popular blog-publishing tools. And at the Harvard conference, the nonprofit Spirit of America unveiled an Arabic-language publishing tool that has made blogging accessible to a wider swath of the population. While the disparate nature of the Web makes it hard to determine exactly how many Arabic blogs exist, the total is clearly mounting; for instance, more than 250 users have created blogs with the Spirit of America tool since its launch in December 2004.

Like the blogs in Farsi that have swept Iran, Arabic blogs have sneaked what substitutes for a free press into some countries that lack one. What's more, the Comment button on most blogs encourages discussion among people who would otherwise never meet, whether they live in the Middle East or the United States. Many people in the international blogging community see their vocation as a route to cross-cultural understanding, and some view machine translation as a vehicle for speeding down that route.

Janice Abraham, a technology consultant for the Spirit of America software project, has long supported machine translation as a way of helping children from different countries communicate. "It's hard to make up stories about people, or propaganda about people, if other people are perfectly capable of reading a large enough body of opinion," Abraham says.

Omar, an Iraqi blogger who worked with Abraham via instant messaging, sees the benefits of translation from a writer's point of view. He draws a worldwide audience to Iraq the Model, an English-language, generally U.S.-friendly blog he co-writes with his brother Mohammed, and translation could bring similar audiences to other blogs. "Bloggers and all writers in general always seek more readership for their posts," he writes in an e-mail, "and Arabic bloggers are not an exception."

On the other hand, Omar adds, many Arabic blogs include cultural references that don't resonate with non-Arabs. Moreover, machine-translated prose may not capture the real punch of Arabic, says fellow Iraqi "Riverbend," whose blog Baghdad Burning portrays the severity of war in expressive English. "Arabic is sort of a dramatic, flowery language," she e-mails, "and a literal translation -- as most automatic translations tend to be -- makes things seem very strange and overdramatized."

Machine translation technology has come a long way since Oren's CompuServe experiment, but has it come as far as our ability to contact a stranger in a strange land? As anyone who clicks on Google's translations can attest, some machine translations confuse more than enlighten. Yet translation technology is presently deployed in arguably more menacing zones than the blogosphere, given that the military funds a good deal of translation research.

Many existing systems, such as the one behind the Google translations, blindly follow grammatical rules originally fed in by human beings. Like travelers' phrasebooks, rule-based programs perform well in restricted domains, according to Marie Meteer, director of the Speech and Language Processing group at BBN Technologies in Cambridge, Mass. As the number of possible conversation topics grows, though, it simply becomes too difficult to codify rules for transforming one language into another, especially when those languages have very different structures.

Newer machine translation programs, meanwhile, learn languages by themselves, using statistical techniques that were recently extended from linguistics research to commercial software. They build up their own dictionaries of words and phrases by comparing many pairs of documents that translate each other, known as "parallel texts." (Few such matching documents exist, which is why the first statistical program, built at IBM Research in the early 1990s, worked off the bilingual proceedings of the Canadian Parliament.) In keeping with their name, statistical programs also track data on how often different words and phrases appear together, so that when it comes time to translate a never-before-seen sentence, they can calculate the most probable combination of words.
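For readers who want to see the dictionary-building step in miniature, here is a rough sketch of one classic word-alignment technique from the statistical school, commonly called IBM Model 1. It is an illustration under stated assumptions, not the code of any program mentioned in this article: the three-sentence French-English "parallel text," the function names train_model1 and best_translation, and the number of training passes are all invented for the example. The program starts out knowing nothing, then repeatedly re-estimates how likely each English word is to translate each French word, based only on which words turn up together in matching sentences.

```python
from collections import defaultdict

# Toy "parallel text": (French sentence, English sentence) pairs.
# Invented for illustration; real systems train on millions of sentence pairs.
PARALLEL = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("une fleur".split(), "a flower".split()),
]


def train_model1(pairs, iterations=10):
    """Estimate P(english_word | french_word) with IBM Model 1 style EM."""
    english_vocab = {e for _, e_sent in pairs for e in e_sent}
    uniform = 1.0 / len(english_vocab)
    # Start with no knowledge: every English word is equally likely.
    table = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per French word
        for f_sent, e_sent in pairs:
            for e in e_sent:
                # Split one unit of credit for e among the French words
                # in the same sentence, in proportion to current beliefs.
                norm = sum(table[(e, f)] for f in f_sent)
                for f in f_sent:
                    share = table[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the probabilities from the expected counts.
        table = defaultdict(
            float, {(e, f): c / total[f] for (e, f), c in count.items()}
        )
    return table


def best_translation(table, french_word):
    """Return the most probable English word, or None if the word is unseen."""
    candidates = {e: p for (e, f), p in table.items() if f == french_word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_model1(PARALLEL)
    for word in ("la", "maison", "fleur", "une"):
        print(word, "->", best_translation(model, word))
```

On this tiny corpus, "maison" at first looks equally likely to mean "the" or "house." Because "la" also appears alongside "the" in a second sentence, the program gradually gives "la" the credit for "the," which leaves "house" as the best match for "maison." A full system would add, among other things, the word-sequence statistics described above for choosing the most probable combination of words.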

Because statistical programs reason from real-life examples, their translations often sound more natural than those derived from grammar maxims. And because statistical programs teach themselves, they don't require an army of grammarians to stuff them with rules, which means that they can tackle more topics and more languages than older systems. For example, as part of a project funded by the Pentagon's Defense Advanced Research Projects Agency, BBN uses statistical software from the start-up Language Weaver to translate Al-Jazeera in real time.

But Meteer stresses that the output from machine translation software doesn't necessarily make for pleasant, or even accurate, reading. Users can scan the output to get the gist of a document, or they can search for keywords to locate a relevant piece of text, but the translations don't approach human quality. In fact, the primary customers of Language Weaver's software are not end readers, but translation companies that use the software to get a quick rough draft, which a human translator then smooths out.

"I think it's very dangerous to think that you're going to be chatting to people in different languages," Meteer says. "The real issue is, when something makes a mistake, you won't know what mistake it made." In other words, you have no way of knowing whether your English "Feel better soon" will be translated into the Arabic for "Feel better soon," some Arabic gibberish, or the Arabic for "I hate foreigners." Or at least you have no way of knowing before it's too late.

To improve translation technology, Oren suggests, perhaps software companies could harness the power of bloggers. Some bloggers, like the Iraqi Hammorabi and international volunteers at LiveJournal, already translate posts by hand. If translated texts had a standardized format, Oren surmises, then they could be harvested from the Web and used as parallel texts for training statistical translation software. Even better, Oren adds, if companies let loose their translation software on the blogosphere, bloggers could spruce up the machine-produced output, thereby generating a massive quantity of training material.

With this far-off goal in mind, Oren and Katorz, who know each other through Winds of Change, have proposed a new metadata format. Their HTML-like tags would mark pieces of text that translate each other, so that humans could add in easily identifiable translations, and special search engines -- or Rosettabots -- could seek these translations out.
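The proposal's actual tag names aren't given here, so the following is only a guess at how a Rosettabot might work, not a rendering of Oren and Katorz's format. It assumes a hypothetical convention in which each marked passage is an HTML span carrying an id and the standard lang attribute, and a translation points back to its original through an invented data-translation-of attribute. The class and function names (ParallelTextHarvester, harvest) and the sample page are likewise made up for the sketch, which uses Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical markup: a translation names its source passage by id.
# The attribute "data-translation-of" is an assumption made for this sketch.
SAMPLE = """
<p><span id="s1" lang="ar">مرحبا بالعالم</span></p>
<p><span id="t1" lang="en" data-translation-of="s1">Hello, world</span></p>
<p>An ordinary paragraph with no translation markup.</p>
"""


class ParallelTextHarvester(HTMLParser):
    """Collect marked-up passages; nested spans are not handled."""

    def __init__(self):
        super().__init__()
        self.segments = {}  # passage id -> passage record
        self._open = None   # passage currently being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and "id" in attributes and "lang" in attributes:
            self._open = {
                "id": attributes["id"],
                "lang": attributes["lang"],
                "source": attributes.get("data-translation-of"),
                "text": [],
            }

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            segment = self._open
            segment["text"] = "".join(segment["text"]).strip()
            self.segments[segment["id"]] = segment
            self._open = None


def harvest(html):
    """Return (source_lang, source_text, target_lang, target_text) tuples."""
    parser = ParallelTextHarvester()
    parser.feed(html)
    parser.close()
    pairs = []
    for segment in parser.segments.values():
        source = parser.segments.get(segment["source"])
        if source is not None:
            pairs.append(
                (source["lang"], source["text"], segment["lang"], segment["text"])
            )
    return pairs


if __name__ == "__main__":
    for pair in harvest(SAMPLE):
        print(pair)
```

Pairs gathered this way are the kind of human-made parallel text that Oren imagines feeding back into statistical translation software as training material.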

Even if technology could eventually offer perfect translations, however, their very existence would lead us into an unpredictable future, according to Jay Rosen, a journalism professor at New York University. Since language has always delimited our communities, instant translation would create a worldwide public for the first time. "While that's exciting in some ways, it's incredibly taxing," says Rosen, himself a blogger. "It's hard enough to belong to one public and stay informed."

On top of that, a worldwide public might turn out to be less utopian than it sounds. "Experience teaches me that when communication improves, it doesn't mean that people get along," says Rosen. New communication technologies may be like Bruce Willis' cellphone in the movie "Die Hard," he explains. When the bad guys get hold of the phone, the new point of contact lets them threaten Willis in a way they couldn't before.

Yet Oren and many others have more optimistic takes on the potential of machine translation. Oren looks back to the multilingual CompuServe forum as an example. "What I think I learned from that long-ago experiment is that there's a certain amount of this that isn't at all about technology," he says. The technology simply bootstraps conversation.

So the software might not spit out an ideal translation of the blog that landed Ali Abdulemam in jail. But the translation might be good enough for citizens of other countries to recognize that they share interests with Bahrainis, and Arabic and non-Arabic speakers alike could then cobble together a meaningful interaction. "That's the real goal, to get people to talk to each other," Oren continues. "And if the technology facilitates that either directly or just by attracting the right audience, then that's cool."

By Lauren Aaronson

Lauren Aaronson is a freelance writer and graduate student in science journalism in New York.
