Leaping over the language barrier with machine translation in Levantine Arabic

When a language you don’t understand appears in your Facebook news feed, you can click a button and translate it. This kind of language technology offers a way of communicating not just with the millions of people who speak your language, but with millions of others who speak something else.

Or at least it almost does.

Like so many other online machine translation systems, it comes with a caveat: it is only available in major languages.

TWB is working to eliminate that rather significant caveat through our language technology initiative, Gamayun. We named it after a mythical birdwoman figure in Slavic folklore, a magical creature that imparts words of wisdom to the few who can understand her. We think she’s a perfect advocate for language technology to increase digital equality and improve two-way communication in marginalized languages.

We have reached an important Gamayun milestone by leaping over the language barrier with a machine translation engine in Levantine Arabic. Here is how we got here, what we learned, and what is next.

What is behind developing a machine translation engine in Levantine Arabic?

In November 2019, we joined forces with a group of innovators and language engineers from PNGK and Prompsit to address WFP’s Humanitarian Action Challenge. Our goal was to use machine translation to enhance the way aid organizations understand the needs and concerns of Syrian refugees, to improve food security programming.

So we developed a text-to-text machine translation (MT) engine for Levantine Arabic tailored to the specifics of refugees’ experiences. To achieve this, we collaborated with Mercy Corps’ Khabrona.Info team. The team runs a Facebook page that gives Arabic-speaking Syrian refugees reliable information and answers, for example about accessing food and other support. We took content shared on the Khabrona.Info Facebook page and manually translated it into English to adapt the engine. The training data and a demonstration version of our MT are available on our Gamayun portal.

How well does this machine translation engine perform?

To answer this question, we conducted an evaluation based on tests widely used by MT researchers. We found that our MT engine produced better translations for Levantine Arabic than one of the most used online machine translation systems.

We first asked experienced translators to rate the translations for both accuracy and fluency. We provided them with ten randomly selected source texts and translations generated by humans, Google’s MT, and our MT. Scores ranged from zero for no errors to three for critical errors, and all translations were rated fairly good. Our MT engine performed slightly better than Google’s MT because it was adapted to the specifics of Levantine Arabic and its online colloquialisms about food security and other topics relevant to refugees’ experiences. The human translations performed slightly better than our MT, but were not perfect.

We also asked the experienced translators to rank the translations of each source text as best, second best, and worst. While the human translations were consistently ranked higher than both machine translation engines, our MT was preferred over Google’s MT 70% of the time.
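For readers who want to see how a pairwise preference figure like this can be derived from rankings, here is a minimal sketch in Python. The function name, the system labels, and the four example rankings are all invented for illustration; they are not our evaluation data or tooling.

```python
def preference_rate(rankings, system_a, system_b):
    """Share of source texts where system_a is ranked above system_b.

    Each ranking lists the systems from best to worst for one source text.
    """
    wins = sum(
        1 for order in rankings
        if order.index(system_a) < order.index(system_b)
    )
    return wins / len(rankings)


# Hypothetical rankings for four source texts (illustrative only).
rankings = [
    ["human", "our_mt", "google_mt"],
    ["human", "google_mt", "our_mt"],
    ["our_mt", "human", "google_mt"],
    ["human", "our_mt", "google_mt"],
]

# our_mt is ranked above google_mt in three of the four toy rankings.
print(preference_rate(rankings, "our_mt", "google_mt"))
```

With only a handful of source texts, a single ranking can move the rate a lot, so a figure like this is best read alongside the error ratings and automated scores.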

We then used BLEU, the standard metric for automated MT quality testing. BLEU (bilingual evaluation understudy) scores a machine translation according to how well it matches a reference human translation. Scores range from zero for no match to 1.0 for a perfect match, but few translations score 1.0 because different translators produce slightly different texts. Our generic MT engine, trained on publicly available parallel English-Arabic text, obtained a score of 0.195 on a test set of 200 social media posts. After further training on a small but specific dataset of Levantine Arabic and its online colloquialisms, it reached 0.248. By comparison, Google’s MT scored 0.212 on the same test set.
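To make the metric more concrete, here is a simplified sketch of how a BLEU score on the 0 to 1.0 scale can be computed. It is not the script we used for the scores above: it assumes one reference per sentence, whitespace tokenization, and no smoothing, and the example sentences are invented. Established tools handle tokenization and edge cases more carefully.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0.0 to 1.0), one reference per sentence."""
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams in the MT output, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # An n-gram is credited at most as often as it occurs in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any order with no matches gives zero
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: output shorter than the reference is penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Invented example sentences, for illustration only.
reference = ["food prices are very high this month"]
exact_copy = ["food prices are very high this month"]
close_output = ["the food prices are very high now"]

print(corpus_bleu(exact_copy, reference))    # identical text scores 1.0
print(corpus_bleu(close_output, reference))  # partial overlap scores between 0 and 1
```

Because the score depends on exact word overlap with one reference, a correct translation that is worded differently can still score low. That is one reason we paired BLEU with human evaluation.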

Take the short sentence أسعار المواد الغائية مرتفعة as an example: humans translated it as “food is expensive” and our MT returned “food prices are high,” while Google’s MT translated it as “the prices of the materials are high.” All three are grammatically correct, but our MT tended to pick up the nuances of informal speech better than Google’s MT. This may seem trivial, but it is critical if MT is used to quickly understand requests for help as they come in, or to monitor people’s concerns and complaints in order to adjust programming.

What makes these results possible?

We specifically designed our MT engine to provide reliable and accurate translations of unstructured data, such as the language used in social media posts. We involved linguists and domain experts in collating and editing the dataset to train the engine. This ensured a focus on both humanitarian domain language and colloquialisms in Levantine Arabic.

The agility of this approach means the engine can be used for various purposes, from conducting needs assessments to analyzing feedback information. The approach also meets the responsible data management requirements of the humanitarian sector.

What have we learned?

We have demonstrated that it is possible to build a translation engine of reasonable quality for a marginalized language like Levantine Arabic and to do so with a relatively small dataset. Our approach entailed engaging with the native language community and focusing on text scraped from social media. This holds great potential for building language technology tools that can spring into action in times of crisis and be adapted to any particular domain.

We also learned that even human translations for Levantine Arabic are not perfect. This shows the importance of building networks of translators for marginalized languages who can help build up and maintain language technology. Where there are not enough—if any—professional translators, a key first step is training bilingual people with the right skills and providing them with guidance on humanitarian response terminology. This type of capacity building can not only make technology work for marginalized language speakers in the longer term, but also ensure they have access to critical information in their languages in the shorter term.

What’s next?

With external support, we are refining our approach to achieve the full potential of language technology. We are currently working with the Harvard Humanitarian Initiative and IMPACT Initiatives, using natural language processing and machine learning to transcribe, translate, and analyze large sets of qualitative responses from multilingual data collection efforts. The aim is to inform humanitarian decision making. We have also joined the Translation Initiative for COVID-19 (TICO-19), alongside researchers at Carnegie Mellon and major tech companies including Amazon, Facebook, Google, and Microsoft, to develop and train state-of-the-art machine translation models for COVID-19 content in 37 different languages.

Stay tuned to learn how we move forward with these projects. We’ll continue to develop language technology solutions to enhance two-way communication in humanitarian crises and amplify the voices of millions of marginalized language speakers.

Written by Mia Marzotto, Senior Advocacy Officer for Translators without Borders.

21 February. This is the date chosen by UNESCO for International Mother Language Day, which has been observed worldwide since 2000. This year deserves special attention as 2019 is the International Year of Indigenous Languages. Both initiatives promote linguistic diversity and equal access to multilingual information and knowledge.

Languages can be a huge resource. At the same time, the mother language that people speak can be a barrier to accessing opportunities. People who speak marginalized mother languages often belong to remote or less prosperous communities and, as a result, they are more vulnerable when a crisis hits.

Yet, the humanitarian and development sector has been largely blind to the importance of language. International languages such as English, French, Arabic, and Spanish dominate, excluding the people who most need their voices heard. Marginalized language speakers are denied opportunities to communicate their needs and priorities, report abuse, or get the information they need to make decisions.

If aid organizations are to meet their high-level commitments to put people at the center of humanitarian action and leave no one behind, this needs to change. Two actions can lead our sector in the right direction by helping us better understand and address the language barriers facing marginalized communities.

Aerial view of Monguno, Borno State, Nigeria. Photo by Eric DeLuca, Translators without Borders.

Putting languages on the map

The first is language mapping. No comprehensive and readily accessible dataset exists on which languages people speak where.

TWB has started to fill that gap by creating maps from existing data and from our own research. Our interactive map shows the language and communication needs of internally displaced people in northeast Nigeria. The map uses data collected by the International Organization for Migration’s Displacement Tracking Matrix team. This data shows, for instance, that access to information is a serious problem at over half of sites where Marghi is the dominant language. Aid organizations can use this map to develop the right communication strategy for reaching people in need.
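As a rough illustration of the kind of aggregation behind a finding like this, the sketch below groups site records by dominant language and computes the share of sites reporting an information-access problem. The field names and records are hypothetical placeholders, not the Displacement Tracking Matrix schema or data.

```python
from collections import defaultdict


def share_with_info_problem(sites):
    """Per dominant language, the share of sites reporting an information-access problem."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for site in sites:
        language = site["dominant_language"]
        totals[language] += 1
        if site["info_access_problem"]:
            flagged[language] += 1
    return {language: flagged[language] / totals[language] for language in totals}


# Hypothetical site records (illustrative only).
sites = [
    {"dominant_language": "Language A", "info_access_problem": True},
    {"dominant_language": "Language A", "info_access_problem": True},
    {"dominant_language": "Language A", "info_access_problem": False},
    {"dominant_language": "Language B", "info_access_problem": False},
]

shares = share_with_info_problem(sites)
for language, share in shares.items():
    print(f"{language}: {share:.0%} of sites report an information-access problem")
```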

Humanitarian and development organizations can add some simple standard questions to their household surveys and other assessments to gather valuable language data. Aid workers will then understand the communication needs and preferences of the 176 million people in need of humanitarian assistance globally.

But communication in a crisis situation – or in any situation – should not be one-way. That’s where the second action comes in.

Building machine translation capacity in marginalized languages

Language technology has dramatically shifted two-way communication between people who speak different languages. To truly help people in need, and to listen to and understand them, we need to apply technology to their languages as well.

TWB is leading the Gamayun Language Equality Initiative to make it happen. We have built a closed-environment, domain-specific Levantine Arabic machine translation engine for the UN World Food Programme. This initiative will improve accountability to Syrian refugees facing food insecurity. Initial testing indicates that Gamayun will provide an efficient method for accessing local information sources. It will enable aid organizations to better understand the needs of their target populations, especially in hard-to-reach areas.

TWB Fulfulde Team Lead conducting comprehension research. Waterboard camp in Monguno, Borno State, Nigeria. Photo by Eric DeLuca, Translators without Borders.

We need to continue building the parallel language datasets from humanitarian and development content that make machine translation a viable option. That will expand the evidence that machine translation can enable better communication, including by empowering affected people to hold aid organizations to account in their own language.

Taking action

These two actions can help the humanitarian and development sector improve lives by promoting two-way communication with speakers of marginalized languages. These actions will need to be expanded to be truly effective, but International Mother Language Day in the Year of Indigenous Languages is a great time to start.

To read:

- The IFRC 2018 World Disasters Report, which includes clear and compelling recommendations about the importance of language to ensure that the world’s most vulnerable people are not “left behind”

- UNESCO’s commitments to multilingualism in cyberspace for inclusive sustainable development, as part of the Information For All Programme

- TWB’s white paper on the Gamayun Language Equality Initiative

To do:

- Consult our dashboard and think about how you can start collecting this data to inform your programs

- Follow our journey as we continue to move forward with Gamayun (and learn along the way!)

Email us if you have an idea to share or want to do more in this area: info@translatorswithoutborders.org

Written by Mia Marzotto, Senior Advocacy Officer for Translators without Borders.
