The latest from TWB’s language technology initiative

Leaping over the language barrier with machine translation in Levantine Arabic

When a language you don’t understand appears in your Facebook news feed, you can click a button and translate it. This kind of language technology offers a way of communicating not just with the millions of people who speak your language, but with millions of others who speak something else.

Or at least it almost does.

Like so many other online machine translation systems, it comes with a caveat: it is only available in major languages.

TWB is working to eliminate that rather significant caveat through our language technology initiative, Gamayun. We named it after a mythical birdwoman figure in Slavic folklore, a magical creature that imparts words of wisdom to the few who can understand her. We think she’s a perfect advocate for language technology that increases digital equality and improves two-way communication in marginalized languages.

We have reached an important Gamayun milestone by leaping over the language barrier with a machine translation engine in Levantine Arabic. Here is how we got here, what we learned, and what is next.

What is behind developing a machine translation engine in Levantine Arabic?

In November 2019, we joined forces with a group of innovators and language engineers from PNGK and Prompsit to address WFP’s Humanitarian Action Challenge. Our goal was to use machine translation to enhance the way aid organizations understand the needs and concerns of Syrian refugees, to improve food security programming.

So we developed a text-to-text machine translation (MT) engine for Levantine Arabic tailored to the specifics of refugees’ experiences. To achieve this, we collaborated with Mercy Corps’ Khabrona.Info team. The team runs a Facebook page that provides Syrian refugees with reliable information and answers, for example on accessing food and other support. We took content shared on the Khabrona.Info Facebook page and manually translated it into English to adapt the engine. The training data and a demonstration version of our MT engine are available on our Gamayun portal.

How well does this machine translation engine perform?

To answer this question, we conducted an evaluation based on tests widely used by MT researchers. We found that our MT engine produced better translations for Levantine Arabic than one of the most used online machine translation systems.

We first asked experienced translators to rate the translations for both accuracy and fluency. We provided them with ten randomly selected source texts and translations generated by humans, Google’s MT, and our MT. All translations were fairly good, with scores ranging from zero for no errors to three for critical errors. Our MT engine performed slightly better than Google’s MT because it was adapted to the specifics of Levantine Arabic and its online colloquialisms about food security and other topics relevant to refugees’ experiences. The human translations performed slightly better than our MT, but were not perfect.

We also asked the experienced translators to rank the best, second best, and worst translations based on each source text. While the human translations were consistently ranked higher than both machine translation engines, our MT was preferred 70% of the time over Google’s MT.

We then used BLEU, the standard metric for automated MT quality testing. BLEU, short for bilingual evaluation understudy, scores a machine translation according to how closely it matches a reference human translation. Scores range from zero for no match to 1.0 for a perfect match, but few translations score 1.0 because different translators produce slightly different texts. Our generic MT engine, trained on publicly available parallel English-Arabic text, obtained a score of 0.195 on a test set of 200 social media posts. After further training on a small but specific dataset of Levantine Arabic and its online colloquialisms, it reached 0.248. By comparison, Google’s MT scored 0.212 on the same test set.
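To make the metric more concrete, here is a minimal sketch of how a BLEU-style score can be computed for a single sentence. It assumes whitespace tokenization, one reference translation, n-grams up to length four, and no smoothing. The function names are illustrative, and this is not the scoring tool used in the evaluation above, which would normally compute BLEU over a whole test set rather than sentence by sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one candidate against one reference (0.0 to 1.0)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one n-gram order with no match zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


identical = bleu("food prices are high today", "food prices are high today")
paraphrase = bleu("food is expensive", "food prices are high")
```

In this sketch an identical translation scores 1.0, while a valid paraphrase that shares few words with the reference scores zero. That is one reason why BLEU is best read as a relative measure on a shared test set, alongside human evaluation.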

Take the short sentence أسعار المواد الغائية مرتفعة as an example: humans translated it as “food is expensive” and our MT returned “food prices are high;” Google’s MT, instead, translated it as “the prices of the materials are high.” All are grammatically correct results, but our MT tended to better pick up the nuances of informal speech than Google’s MT. This may seem trivial, but it is critical if MT is used to quickly understand requests for help as they come up or keep an eye on people’s concerns and complaints to adjust programming.

What makes these results possible?

We specifically designed our MT engine to provide reliable and accurate translations of unstructured data, such as the language used in social media posts. We involved linguists and domain experts in collating and editing the dataset to train the engine. This ensured a focus on both humanitarian domain language and colloquialisms in Levantine Arabic.

The agility of this approach means the engine can be used for various purposes, from conducting needs assessments to analyzing feedback information. The approach also meets the responsible data management requirements of the humanitarian sector.

What have we learned?

We have demonstrated that it is possible to build a translation engine of reasonable quality for a marginalized language like Levantine Arabic and to do so with a relatively small dataset. Our approach entailed engaging with the native language community and focusing on text scraped from social media. This holds great potential for building language technology tools that can spring into action in times of crisis and be adapted to any particular domain.

We also learned that even human translations for Levantine Arabic are not perfect. This shows the importance of building networks of translators for marginalized languages who can help build up and maintain language technology. Where there are not enough—if any—professional translators, a key first step is training bilingual people with the right skills and providing them with guidance on humanitarian response terminology. This type of capacity building can not only make technology work for marginalized language speakers in the longer term, but also ensure they have access to critical information in their languages in the shorter term.

What’s next?

We are refining our approach, augmented by external support, to achieve the full potential of language technology. We are currently working with the Harvard Humanitarian Initiative and IMPACT Initiatives using natural language processing and machine learning to transcribe, translate, and analyze large sets of qualitative responses in multilingual data collection efforts to inform humanitarian decision making. We have also joined the Translation Initiative for COVID-19 (TICO-19), alongside researchers at Carnegie Mellon and major tech companies including Amazon, Facebook, Google, and Microsoft to develop and train state-of-the-art machine translation models in 37 different languages on COVID-19.

Stay tuned to learn how we move forward with these projects. We’ll continue to develop language technology solutions to enhance two-way communication in humanitarian crises and amplify the voices of millions of marginalized language speakers.

Written by Mia Marzotto, Senior Advocacy Officer for Translators without Borders.

Transfer Learning Approaches for Machine Translation

This article was originally posted in the TWB Tech Blog on medium.com

TWB’s current research focuses on bringing language technology to marginalized communities

Translators without Borders (TWB) aims to empower people through access to critical information and two-way communication in their own language. We believe language technology such as machine translation is essential to achieving this. This is a challenging task, given that many of the languages we work with have little to no language data available to build such systems.

In this post, I’ll explain some methods for dealing with low-resource languages. I’ll also report on our experiments in obtaining a Tigrinya-English neural machine translation (NMT) model.

The progress in machine translation (MT) has reached many remarkable milestones over the last few years, and it is likely that it will progress further. However, the development of MT technology has mainly benefited a small number of languages.

Building an MT system relies on the availability of parallel data. The more present a language is digitally, the higher the probability of collecting large parallel corpora which are needed to train these types of systems. However, most languages do not have the amount of written resources that English, German, French and a few other languages spoken in highly developed countries have. The lack of written resources in other languages drastically increases the difficulty of bringing MT services to speakers of these languages.

Low-resource MT scenario

In scientific literature for machine translation, there is no particular consensus on which corpus size constitutes a low-resource scenario. But we can say roughly that a low-resource condition is when the size of the parallel training corpus is not sufficient for reaching an acceptable result with the standard MT approaches. This is usually judged with a standardized automatic evaluation metric called BLEU, which correlates with human translation assessments.

Figure 2, modified from Koehn and Knowles (2017), shows the relationship between the BLEU score and the corpus size for the three MT approaches.

A classic phrase-based MT model outperforms NMT for smaller training set sizes. Only after a corpus size threshold of 15M words, roughly equivalent to 1 million sentence pairs, does classic NMT show its superiority.

Low-resource MT, on the other hand, deals with corpus sizes that are around a couple of thousand sentences. Although this figure shows at first glance that there is no way to obtain anything useful for low resource languages, there are ways to leverage even small data sets. One of these is a deep learning technique called transfer learning, which makes use of the knowledge gained while solving one problem to apply it to a different but related problem.

Cross-lingual transfer learning

Zoph et al. (2016) applied transfer learning to machine translation and showed that prior knowledge of translating a separate language pair can improve the translation of a low-resource language.

Figure 3 illustrates their idea of cross-lingual transfer learning.

The researchers first trained an NMT model on a large parallel corpus (French–English) to create what they call the parent model. In a second stage, they continued to train this model, but fed it a considerably smaller parallel corpus of a low-resource language. The resulting child model inherits the knowledge of the parent model by reusing its parameters. Compared to the classic approach of training only on the low-resource language, they record an average improvement of 5.6 BLEU points over the four languages they experiment with. They further show that the child model reuses knowledge not only of the structure of the high-resource target language but also of the process of translation itself.
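The parent-child idea can be illustrated without a neural network. The sketch below is a toy analogy, not the NMT setup of the paper: a two-parameter line stands in for the model, a large dataset stands in for the high-resource language pair, and four points from a slightly different line stand in for the low-resource pair. All names, data, and settings are made up for illustration.

```python
def mse(params, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, steps, lr=0.05):
    """Plain gradient descent on the MSE, starting from the given parameters."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


# "High-resource" parent task: many examples of the line y = 2x + 1.
parent_data = [(i / 10, 2 * (i / 10) + 1) for i in range(50)]
# "Low-resource" child task: four examples of the related line y = 2.2x + 0.8.
child_data = [(x, 2.2 * x + 0.8) for x in (0.0, 1.0, 2.0, 3.0)]

# Stage 1: train the parent model from scratch on the large dataset.
parent = train((0.0, 0.0), parent_data, steps=2000)
# Stage 2: the child starts from the parent's parameters and trains briefly.
child = train(parent, child_data, steps=5)
# Baseline without transfer: same data and budget, but starting from zero.
baseline = train((0.0, 0.0), child_data, steps=5)

child_loss = mse(child, child_data)
baseline_loss = mse(baseline, child_data)
```

With the same small training budget, the child that starts from the parent's parameters ends with a lower error than the baseline, because the parent has already learned most of what the two tasks share.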

Which high-resource language to choose as the parent source language is a key parameter in this approach. The decision is usually made heuristically, judging by closeness to the low-resource language in terms of distance in the language family tree or shared linguistic properties. A more rigorous exploration of which parent language best suits a given language is made in Lin et al. (2019).

Multilingual training

The path that was cleared by cross-lingual transfer learning led naturally to the use of multiple parent languages. The straightforward approach, first described by Dong et al. (2015), mixes all the available parallel data in the languages of interest and sends them into training as illustrated in Figure 4.

What results from the example is one single model that translates from the four languages (French, Spanish, Portuguese and Italian) to English.
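The data-mixing step can be sketched in a few lines. This is a minimal illustration that assumes each corpus is a list of (source sentence, English sentence) pairs held in memory. The function name, the language codes used as keys, and the miniature corpora are hypothetical, and real systems typically add further steps such as balancing the languages.

```python
import random


def mix_corpora(corpora, seed=0):
    """Pool sentence pairs from several source languages and shuffle them.

    corpora maps a language code to a list of (source, english) pairs.
    """
    mixed = [
        {"lang": lang, "source": src, "target": tgt}
        for lang, pairs in corpora.items()
        for src, tgt in pairs
    ]
    random.Random(seed).shuffle(mixed)  # a fixed seed keeps the mix reproducible
    return mixed


# Hypothetical miniature corpora; real ones hold millions of sentence pairs.
corpora = {
    "fr": [("bonjour", "hello"), ("merci", "thank you")],
    "es": [("hola", "hello"), ("gracias", "thank you")],
    "pt": [("obrigado", "thank you")],
    "it": [("grazie", "thank you")],
}
training_set = mix_corpora(corpora)
```

A single model is then trained on `training_set`, so every batch can contain sentences from any of the four source languages.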

Multilingual NMT offers three main advantages. Firstly, it reduces the number of individual training processes needed to one, yet the resulting model can translate many languages at once. Secondly, transfer learning makes it possible for all languages to benefit from each other through the transfer of knowledge. And finally, the model serves as a more solid starting point for a possible low-resource language.

For instance, if we were interested in training MT for Galician, a low-resource Romance language, the model illustrated in Figure 4 would be a perfect fit, as it already knows how to translate well from four other high-resource Romance languages.

A solid report on the use of multilingual models is given by Neubig and Hu (2018). They use a “massively multilingual” corpus of 58 languages to leverage MT for four low-resource languages: Azeri, Belarusian, Galician, and Slovak. With a parallel corpus of only 4,500 sentences for Galician, they achieved a BLEU score of up to 29.1%, in contrast to the 22.3% and 16.2% obtained with classic single-language training using statistical machine translation (SMT) and NMT respectively.

Transfer learning also enables what is called a zero-shot translation, when no training data is available for the language of interest. For Galician, the authors report a BLEU score of 15.5% on their test set without the model seeing any Galician sentences before.

Case of Tigrinya NMT

Tigrinya is an Ethio-Semitic language spoken by around 7.9 million people in Eritrea and Ethiopia. It is not supported by any commercial MT provider, nor does it have any publicly available models. TWB is currently developing open datasets and MT for Tigrinya in cooperation with the Masakhane initiative.

Tigrinya is no longer in the very low-resource category thanks to the recently released JW300 dataset by Agic and Vulic. Nevertheless, we wanted to see if a higher resource language could help build a Tigrinya-to-English machine translation model. We used Amharic as a parent language, which is written with the same Ge’ez script as Tigrinya and has larger public data available.

The datasets that were available to us at the time of writing this post are listed below. After the JW300 dataset, the largest resource we found is Parallel Corpora for Ethiopian Languages.

Our transfer-learning-based training process consists of four phases. First, we train on a random mix of all the sets, adding up to 1.45 million sentences. Second, we fine-tune the model on Tigrinya using only the Tigrinya portion of the mix. Third, we fine-tune on the training partition of our in-house data. Finally, we test the model on 200 samples set aside earlier from the in-house corpus.
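The data flow through these phases can be sketched as follows. This is an illustration under stated assumptions, not our actual training code: sentence pairs are assumed to be dictionaries with a "lang" field ("ti" for Tigrinya, "am" for Amharic), the in-house corpus is assumed to be kept out of the phase 1 mix, and the function name and placeholder data are hypothetical.

```python
import random


def plan_phases(public_pairs, in_house_pairs, test_size=200, seed=0):
    """Return the data used in each phase of the training process described above."""
    rng = random.Random(seed)

    # Set the test samples aside from the in-house corpus before any training.
    in_house = list(in_house_pairs)
    rng.shuffle(in_house)
    test_set, in_house_train = in_house[:test_size], in_house[test_size:]

    # Phase 1: a random mix of all public sets (Tigrinya and Amharic alike).
    mixed = list(public_pairs)
    rng.shuffle(mixed)

    # Phase 2: only the Tigrinya portion of that mix.
    tigrinya_only = [pair for pair in mixed if pair["lang"] == "ti"]

    return {
        "1_multilingual_training": mixed,
        "2_tigrinya_fine_tuning": tigrinya_only,
        "3_in_house_fine_tuning": in_house_train,
        "4_testing": test_set,
    }


# Placeholder data standing in for the real corpora.
public = [{"lang": "am", "source": f"am-{i}", "target": f"en-am-{i}"} for i in range(6)]
public += [{"lang": "ti", "source": f"ti-{i}", "target": f"en-ti-{i}"} for i in range(4)]
in_house = [{"lang": "ti", "source": f"inhouse-{i}", "target": f"en-inhouse-{i}"} for i in range(250)]

phases = plan_phases(public, in_house)
```

Holding out the test samples first guarantees that none of them is seen during the third phase. The baseline corresponds to skipping phase 1 and starting directly from the Tigrinya data.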

As a baseline, we skip the first multilingual training step and use only Tigrinya data to train on.

We see a slight increase in the accuracy of the model on our in-house test set when we use the transfer learning approach. The results in various automatic evaluation metrics are as follows:

Conclusion

Neural machine translation is a data-hungry technology. Although this severely limits how far it can be extended to the majority of the world’s languages, we can still apply various techniques to make it available to more people than if we limited ourselves to approaches tuned towards high-resource languages. Methodologies like transfer learning and linguistically informed data mixing have a role to play in helping everyone communicate in their language.

Written by Alp Öktem, Computational Linguist for Translators without Borders

Digital development, language gaps, and a prophetic bird

Language technology can help those in need use technology to proactively communicate and access information.

We are in the midst of an unprecedented surge of increasingly powerful technologies that can help solve humanitarian and development challenges. Yet meaningful access to these technologies is not equally available to all people. Hundreds of millions of the world’s poorest, least educated, most vulnerable populations often find themselves on the wrong side of a dangerous digital divide.

Language can be the key that unlocks new digital opportunities for all.

Language is a barrier for technology use

Under the umbrella of information and communication technologies for development (ICT4D, or, simply, ICT), technology efforts have become commonplace in the development world over the past few decades. Emerging machine learning and artificial intelligence applications (“AI for Good”) promise to help achieve sustainable development goals globally. In Kenya, Ghana and Côte d’Ivoire, an app called “Eneza Education” delivers mobile courses to 5 million people. In India, Khushi Baby supplies low-cost wearable technology to monitor child vaccinations.

While these digital applications have the potential to shift communications and empower vulnerable people, they face a number of major hurdles. Access to hardware is an obvious issue, as is access to networks. But even when those issues are resolved, there is the more fundamental barrier of language. Typically digital technology requires basic literacy skills and often foreign language skills, especially considering that more than 50 percent of websites are in English. This turns into a self-fulfilling prophecy with speakers of marginalized languages unable to interact with new tools. Without thoughtful consideration of language barriers, new digital opportunities may only magnify inequality and further exclude marginalized communities, especially speakers of under-served languages.

The world’s most marginalized communities often live in complex linguistic contexts that can further complicate the use of technology. For example, there are 68 languages in Kenya and most people do not speak either Swahili or English, the languages generally used in ICT tools. Moreover, the digital divide is even wider for low-literate ICT users in oral-language communities, such as Berber women in Morocco. This is not a rare phenomenon: as many as 7,000 languages are spoken today, two-thirds of which do not have a written form.

Language technology for all

Language technology can address these barriers. Languages that are ‘commercially viable’ have seen enormous growth in digital tools, both for text and voice. Today, tools like Skype allow people to carry on lucid conversations even when they don’t speak the same language. The advent of neural machine translation and natural language processing has greatly increased communication among those languages in which for-profit companies have invested.

The trick is to include this language technology in the development of tools for the humanitarian and development sectors.

This is why Translators without Borders is overseeing Gamayun: The Language Equality Initiative.

Named after a prophetic bird of wisdom and knowledge from Slavic folklore, the initiative aims to create more equitable access to language technology that will lead to greater knowledge and wisdom for vulnerable people.

The initiative effectively elevates marginalized languages to the level of commercial languages by ensuring development of machine translation in voice and text in those languages. It also encourages humanitarian tech developers to integrate these engines into their tools and to measure whether they improve communications. Ultimately, the goal is for people in need to have direct access to these tools for their own use, thereby controlling the communications they provide and receive.

To accomplish this, Gamayun must first build a repository of spoken and written datasets for under-served languages. The data comes from humanitarian or development sources, making the resulting translation engines more useful in humanitarian- and development-specific contexts.

Successfully building these datasets requires a massive amount of human input. The data is presented as parallel sets in which a sentence or string of text in a language critical to the humanitarian world is paired with a “source” language. As Gamayun scales, we are seeking datasets from the translation and localization industry, and asking for terminology input from humanitarian sectors. Unstructured data, such as content from open social media outlets, also can be used to train the engines; and, importantly, linguists and context specialists are used to evaluate that data to help make the engines more fit for purpose.

TWB is building datasets in a wide range of languages, but the main focus at first is Bangla, Swahili, and Hausa. These languages are collectively spoken by 400 million people, and were selected because of their associated risk for crisis. The communities that speak these languages have a strong presence online; online communities in those languages will help build, maintain and improve the datasets and the engines.

Meanwhile, Gamayun looks at the integration of machine translation engines (voice and text) in applications and tools to evaluate their effectiveness in improving communications. TWB and its humanitarian partners are evaluating a number of machine translation use cases, including needs assessment tools, two-way communication bots, and call centers, as well as which types of fit-for-purpose machine translation engines are most useful. In some cases, ‘off the shelf’ engines from major technology companies work well; in other cases, it is important to contextualize the engine to get the best results.

Access is not enough – the shift of control

Building datasets and engines in marginalized languages, and integrating those engines into tools developed by the sector will improve language equality. But to truly bridge the gap, the tools need to be in the hands of those who are in need. Only they have the best sense of exactly what information they need and, likewise, what information they have and can share.

As a recent report by the Pathways for Prosperity Commission puts it, “impact is ultimately determined by usage; access alone is not sufficient.” While many other barriers to access remain, including hardware and bandwidth issues, in the area of language we are poised to greatly increase access and even move beyond it. Ultimately, reduction of language barriers through technology has the potential to shift control of communications to people in need. In such a world, vulnerable populations can use the same tools as those who speak ‘commercial’ languages, accessing any information they want, and providing their own information in the language they speak.

We must support speakers of under-served languages as technology continues to evolve and allows us all to be stewards of our own information and communication.

Written by Mia Marzotto, TWB's Senior Advocacy Officer.

#LanguageMatters. So Does Technology.

Improving access to information in the right languages for the world’s poorest, most vulnerable individuals is the core mission of Translators without Borders (TWB). Often, however, there are too few translators or interpreters available, especially during times of crisis when impacted populations and humanitarian responders do not speak the same language.

To alleviate the dearth of translators and interpreters, TWB invests in the skills of our 26,000-strong community of language professionals. We also invest in state-of-the-art tools and technology that enable us to serve many kinds of humanitarian needs.

TWB-trained translators in Guinea.

The right combination of skills and technology helps our translators deliver high-quality, accurate information to partner organizations such as Doctors without Borders and the International Federation of Red Cross and Red Crescent Societies, often under chaotic, time-sensitive conditions. Our volunteers work to industry standards, building marketable skills that may lead to paying jobs.

Over the long term, the data we’re creating will play a key role in bringing more underserved languages online and into the digital age.

Changing the world while sitting on your sofa

Changing the world through language

Listen to Translators without Borders Executive Director Aimee Ansari talk about changing the world through language at TEDxYouth@Bath in November 2016.