Language data fills a critical gap for humanitarians

Until now, humanitarians have not had access to data about the languages people speak. But a series of open-source language datasets is about to improve how we communicate with communities in crisis. Eric DeLuca and William Low explain how a seemingly simple question drove an innovative solution.

“Do you know what languages these new migrants speak?”

Lucia, an aid worker based in Italy, put this seemingly simple question to researchers from Translators without Borders in 2017. Her organization was providing rapid assistance to migrants as they arrived at the port in Sicily. Lucia and her colleagues were struggling to provide appropriate language support. They often lacked interpreters who spoke the right languages, and they asked migrants to fill out forms in languages that the migrants didn’t understand.

Unfortunately, there wasn’t a simple answer to Lucia’s question. In the six months prior to our conversation with Lucia, Italy registered migrants from 21 different countries. Even when we knew that people came from a particular region in one of these countries, there was no simple way to know what language they were likely to speak.

The problem wasn’t exclusive to the European refugee response. Translators without Borders partners with organizations around the world which struggle with a similar lack of basic language data.

Where is the data?

As we searched various linguistic and humanitarian resources, we were convinced that we were missing something. Surely there was a global language map? Or at least language data for individual countries?

The more we looked, the more we discovered how much we didn’t know. The language data that does exist is often protected by restrictive copyrights or locked behind paywalls. Languages are often visualized as discrete polygons or specific points on a map, which seems at odds with the messy spatial dynamics that we experience in the real world. 

In short, language data isn’t accessible, or easily verifiable, or in a format that humanitarians can readily use.

We are releasing language datasets for nine countries

Today we launch the first openly available language datasets for humanitarian use. This includes a series of static and dynamic maps and 23 datasets covering nine countries: DRC, Guatemala, Malawi, Mozambique, Nigeria, Pakistan, Philippines, Ukraine, and Zambia.

This work is based on a partnership between TWB and University College London. The pilot project received support from Research England’s Higher Education Innovation Fund, managed by UCL Innovation & Enterprise. With support from the Centre for Translation Studies at UCL, this project was the first of its kind in the world to systematically gather and share language data for humanitarian use.

The majority of these datasets are based on existing sources — census and other government data. We curated, cleaned, and reformatted the data to be more accessible for humanitarian purposes. We are exploring ways of deriving new language data in countries without existing sources, and extracting language information from digital sources.

This project is built on four main principles:

1. Language data should be easily accessible

We started analyzing existing government data because we realized there was a lot of quality information that was simply hard to access and analyze. The language indicators from the 2010 Philippines census, for example, were spread over 87 different spreadsheets. Many census bureaus also publish in languages other than English, making it difficult for humanitarians who work primarily in English to access the data. We have gone through the process of curating, translating, and cleaning these datasets to make them more accessible.

2. Language data should work across different platforms

We believe that data interoperability is important. That is, it should be easy to share and use data across different humanitarian systems. This requires data to be formatted in a consistent way and spatial parameters to be well documented. As much as possible, we applied a consistent geographic standard to these datasets. We avoided polygons and GPS points, opting instead to use OCHA administrative units and P-codes. At times this will reduce data precision, but it should make it easier to integrate the datasets into existing humanitarian workflows.

We worked with the Centre for Humanitarian Data to develop and apply consistent standards for coding. We built an HXL hashtag scheme to help simplify integration and processing. Language standardization was one of the most difficult aspects of the project, as governments do not always refer to languages consistently. The Malawi dataset, for example, distinguishes between “Chewa” and “Nyanja,” which are two different names for the same language. In some cases, we merged duplicate language names. In others, we left the discrepancies as they exist in the original dataset and made a note in the metadata.

Even when language names are consistent, the spelling isn’t always. In the DRC dataset, “Kiswahili” is displayed with its Bantu prefix. We have opted instead to use the more common English reference of “Swahili.”

Every dataset uses ISO 639-3 language codes and provides alternative names and spellings to alleviate some of the typical frustrations associated with inconsistent language references.
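To make the standardization step concrete, here is a minimal Python sketch of how alternative names and spellings could be mapped to a single ISO 639-3 code before counts are merged. The lookup table, the function names, the choice of “Nyanja” as the preferred label, and the speaker counts are all assumptions made for illustration; they are not taken from the published datasets. The codes themselves (nya for Nyanja/Chewa, swa for Swahili) are the standard ISO 639-3 identifiers.

```python
# Hypothetical alias table: lower-cased name or spelling -> (ISO 639-3 code, preferred English name).
# The preferred names are an assumption for this example, not the datasets' actual labels.
ALIASES = {
    "chewa": ("nya", "Nyanja"),
    "chichewa": ("nya", "Nyanja"),
    "nyanja": ("nya", "Nyanja"),
    "kiswahili": ("swa", "Swahili"),
    "swahili": ("swa", "Swahili"),
}


def normalize_language(name):
    """Return (code, preferred name) for a known alias, or None if the name is unknown."""
    return ALIASES.get(name.strip().lower())


def merge_counts(rows):
    """Sum counts for rows that name the same language differently.

    rows is an iterable of (language name, count) pairs.
    Returns (totals keyed by ISO 639-3 code, list of names that could not be matched).
    """
    totals = {}
    unmatched = []
    for name, count in rows:
        match = normalize_language(name)
        if match is None:
            # Leave unknown names for manual review rather than guessing.
            unmatched.append(name)
            continue
        code, _preferred = match
        totals[code] = totals.get(code, 0) + count
    return totals, unmatched


# Illustrative figures only.
totals, unmatched = merge_counts(
    [("Chewa", 100), ("Nyanja", 50), ("Kiswahili", 10), ("Unknown name", 1)]
)
```

Names that do not match are returned separately rather than dropped, which mirrors the idea of noting unresolved discrepancies in the metadata instead of silently merging them.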

3. Language data should be open and free to use

We have made all of these datasets available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (CC BY-NC-SA 4.0). This means that you are free to use and adapt them as long as you cite the source and do not use them for commercial purposes. You can also share derivatives of the data as long as you comply with the same license when doing so.

The datasets are all available in .xlsx and .csv formats on HDX, and detailed metadata clearly states the source of each dataset along with known limitations. 
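For readers who want to work with the .csv files programmatically, the sketch below shows one way to read an HXL-tagged file in Python using only the standard library. In HXL, a row of hashtags sits directly beneath the human-readable header row. The sample data, the column layout, and the specific hashtags here are invented for illustration and should not be read as the scheme used in the published datasets; check each file’s metadata for the real column definitions.

```python
import csv
import io

# Invented sample: a header row, an HXL hashtag row, then data rows.
# The P-code "XX01" and all values are placeholders.
SAMPLE = """Admin 1 P-code,Admin 1 name,Language,ISO 639-3,Share of speakers
#adm1+code,#adm1+name,#meta+lang+name,#meta+lang+code,#indicator+pct
XX01,Example Province,Swahili,swa,0.62
XX01,Example Province,Nyanja,nya,0.38
"""


def read_hxl(text):
    """Parse HXL-tagged CSV text into a list of dicts keyed by hashtag."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        raise ValueError("expected a header row and an HXL hashtag row")
    _headers, tags, data = rows[0], rows[1], rows[2:]
    if not all(tag.startswith("#") for tag in tags):
        raise ValueError("second row is not an HXL hashtag row")
    return [dict(zip(tags, row)) for row in data]


records = read_hxl(SAMPLE)
codes = [record["#meta+lang+code"] for record in records]
```

Keying records by hashtag rather than by header text means the same code keeps working if a column title is reworded or translated, which is the interoperability benefit the hashtag row is meant to provide.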

Importantly, everything is free to access and use.

4. Language data should not increase people’s vulnerability

Humanitarians often cite the potential sensitivities of language as the primary reason for not sharing language data. In many cases, language can be used as a proxy indicator for ethnicity. In some, the two factors are interchangeable.

As a result, we developed a thorough risk-review process for each dataset. This identifies specific risks associated with the data, which we can then mitigate. It also helps us to understand the potential benefits. Ultimately, we have to balance the benefits and risks of sharing the data. Sharing data helps humanitarian organizations and others to develop communication strategies that address the needs of minority language speakers.

In most cases, we aggregated the data to protect individuals or vulnerable groups. For each dataset, we describe the method we used to collect and clean the data, and specify potential limitations. In a few instances, we chose to not publish datasets at all.
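As a rough illustration of what aggregation for protection can look like, the Python sketch below rolls counts up to a larger administrative unit and withholds any total that falls below a minimum size. The threshold of 50, the function name, and the sample records are assumptions for the example only; the article does not describe the specific rules applied in the risk review.

```python
from collections import defaultdict


def aggregate(records, min_count=50):
    """Sum counts per (admin unit, language) and withhold small totals.

    records is an iterable of (admin unit code, ISO 639-3 code, count).
    min_count is a hypothetical threshold; totals below it are replaced by None.
    """
    totals = defaultdict(int)
    for admin_code, language, count in records:
        totals[(admin_code, language)] += count
    return {
        key: (total if total >= min_count else None)
        for key, total in totals.items()
    }


# Placeholder codes and counts.
published = aggregate(
    [("XX01", "swa", 40), ("XX01", "swa", 30), ("XX01", "nya", 10)]
)
```

Withholding a small total, rather than publishing it, trades some precision for a lower chance that a minority language group can be located at a fine geographic level.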

How can you help?

This is just the beginning of our effort to provide more accessible language data for humanitarian purposes. Our goal is to make language data openly available for every humanitarian crisis, and we can’t do it alone. We need your help to:

  1. Integrate and share this data. We are not looking to create another data portal. Our strategy is to make these datasets as accessible and interoperable as possible using existing platforms. But we need your feedback so we can improve and expand them.
  2. Add language-related questions into your ongoing surveys. Existing language data is often outdated and does not necessarily represent large-scale population movements. Over the past year, we have worked with partners such as IOM DTM, REACH, WFP, and UNICEF to integrate standard language questions into ongoing surveys. This is essential if we are to develop language data for the countries that don’t have regular censuses. The recent multi-sectoral needs assessment in Nigeria is a good example of how a few strategic language questions can lead to data-driven humanitarian decisions.
  3. Use this language data to improve humanitarian communication strategies. As we develop more data, we hope to provide the tools for Lucia and other humanitarians to design more appropriate communication strategies. Decisions to hire interpreters and field workers, develop radio messaging, or create new posters and flyers should all be data-driven. That’s only possible if we know which languages people speak. An inclusive and participatory humanitarian system requires two-way communication strategies that use languages and formats that people understand.

Clearly, the answer to Lucia’s question turned out to be more complicated than any of us expected. This partnership between TWB and the Centre for Translation Studies at UCL has finally made it possible to incorporate language data into humanitarian workflows. We have established a consistent format, an HXL coding scheme, and processes for standardizing language references. But the work does not stop with these nine countries. Over the next few months we will continue to curate and share existing language datasets for new countries. In the longer term we will be working with various partners to collect and share language data where it does not currently exist. We believe in a world where knowledge knows no language barriers. Putting language on the map is the first step to achieving that.

Eric DeLuca is the Monitoring, Evaluation, and Learning Manager at Translators without Borders.

William Low is a Senior Data and GIS Researcher at University College London.

Funding for this project was provided by Research England’s Higher Education Innovation Fund, managed by UCL Innovation & Enterprise.

Language Technology Could Help 157 Million People Get Access To Information

I was exhausted.  It had been a great week in Bangladesh, but the overload of language, smells, refugee camp, seeing old friends, meeting new friends, government, donors, and all the while pretending like I wasn’t jetlagged, was taking its toll.  I just wanted to go to sleep.

My last meeting was in Dhaka with someone in the Prime Minister’s office.  I had little hope of staying awake through the meeting.

And yet, I was captivated.

Bangladesh Help Desk Signage

The literacy rate in Bangladesh is considered low (72.8% according to UNESCO in 2016) but is just below the global average. Literacy among women is lower (69.9%), but in general the majority of people have at least basic literacy skills.  There is 90 percent mobile phone penetration, and 96 percent of internet access is via mobile phones. The International Mother Language Institute, the body in Bangladesh that supports the promotion, spread, and preservation of Bangla languages, says that 41 languages are spoken in the country, only five of which have written scripts.  In the humanitarian response for Rohingya refugees in Cox’s Bazar, Translators without Borders (TWB) finds the situation particularly difficult. Rohingya has no agreed written script. Very few of the refugees can read and write, and few people speak both Rohingya and another language well. Add to this mix low radio coverage: the Rohingya do not have radios, and even if they did, parts of the camps have no radio coverage. Add, too, about one million people living in poor and difficult conditions who speak many different dialects, and you begin to understand why communicating effectively is difficult.

It’s vitally important that there is two-way communication between the people – refugees and local Bangladeshis – and the government and aid workers. Take the issue of the coming monsoon. The formal and makeshift refugee camps have sprouted up all over the Cox’s Bazar district, an area that includes a national park and lush forest. But now the trees have been torn down to make room for shelters and for firewood.  This makes the soil very unstable and dangerous, with monsoon rains promising huge mud pits and the possibility of landslides. It is also a hilly area; tents are built on the sides of hills that will become slippery and unstable with heavy rains and wind. Refugees, as well as local residents, need to know where to go, what to do if there’s an emergency, how to get help for those needing medical attention, and what to do if food gets swept away.  

The challenges abound. The digital world seems a world away.    

And yet, enter Dr. Jami.  In a buzzy, busy office with a high level of excitement and a relatively good gender balance, I was suddenly in the middle of a high tech environment.  Dr. Jami launched directly into what he wanted us to know and do.

Dr. Jami runs the Access to Information (A2I, inevitably) project in the Prime Minister’s office. The aim is to help the people of Bangladesh quickly and easily get information on public services. One of A2I’s projects is the digitization of government institutions; they have developed over 1,000 key government websites.  Dr. Jami is not a language guy (he’s a solutions architect), but he proceeds to tell me quickly that Bangla was only standardized in Unicode five years ago, so there is very little data available from which to build good translation engines.  While there’s 90 percent mobile phone penetration, in 2018 GSMA estimated that only 28-30 percent of those were smartphones. Yet, 96 percent of internet access is via phones. Whaaa? How does that work? It’s also startling how little desktops and laptops are used to access the internet.  

I asked a taxi driver, who was using a smartphone, if he used his phone for the internet.  He replied, “No, but I use it for Facebook.”

There are no data charges for Facebook in Bangladesh – unless you want to see videos or pictures.  Internet use is Facebook and Facebook is only text. Those who are illiterate, or only barely literate, won’t have smartphones.

To Dr. Jami, who needs more people to have smartphones to help ensure they can get access to information, the cost is not the barrier:  There are very inexpensive smartphones in Bangladesh. He believes it is fear of technology, which he believes is associated with illiteracy. To reach his goal of migrating 70 percent of the current mobile phone users to smartphones, he must address fear.

Language is an issue.  With a population of over 157 million people, and one of the most widely spoken languages in the world, you’d think that the language technology for Bangla would be outstanding.  It’s not. That’s surprising. And without that technology, equipping 1,000 websites with dynamic information in Bangla is nearly impossible, not to mention making them interactive and/or adding audio.

The work that A2I is doing is globally relevant, of course.  Other countries are already seeking A2I’s support to bring better access to information to their people.  Dr. Jami mentions that the team is already working in South Sudan, which has the second-lowest literacy rate in the world.  Again, the language barrier is huge. And, again, there is little digital language data.

Dr. Jami has heard of TWB’s Gamayun project – can we help?  Can we be a neutral broker to bring together the limited language data out there and leverage our knowledge of language and the language industry to help Bangladeshis get access to information about basic services?  

Dr. Jami and the TWB team will continue this conversation – there are still many questions to be asked and answered.  But I was impressed by the enthusiasm and the accomplishments of his team. And I am really excited to see where Dr. Jami and other countries take this exciting initiative.

Written by Translators without Borders' Executive Director Aimee Ansari. This article was also published on HuffPost UK.


Read a related post on The #LanguageMatters blog, ‘Language: Our Collective Blind Spot in the Participation Revolution’.  In TWB’s last blog post, Executive Director Aimee Ansari explains why we need to create and disseminate a global dataset on language and communication for crisis-affected countries. 

Language: Our Collective Blind Spot in the Participation Revolution

Two years ago, I embarked on an amazing journey. I started working for Translators without Borders (TWB). While being a first-time Executive Director poses challenges, immersing myself in the world of language and language technology has by far been the more interesting and perplexing challenge.

 

Students practising to write Rohingya Zuban (Hanifi script) in Kutupalong Refugee Camp near Cox’s Bazar, Bangladesh.

Language issues in humanitarian response seem like a “no-brainer” to me. A lot of others in the humanitarian world feel the same way – “why didn’t I think of that before” is a common refrain. Still, we sometimes struggle to convince humanitarians that if people don’t understand the message, they aren’t likely to follow it. When I worked in South Sudan for another organisation, in one village, I spoke English, one of our team interpreted to Dinka or Nuer, and then a local teacher translated to the local language (I don’t even know what it was). I asked a question about how women save money; the response had something to do with the local school not having textbooks. It was clear that there was no communication happening. At the time, I didn’t know what to do to fix it. Now I do – and it’s not difficult or particularly expensive.

That’s the interesting part. TWB works in 300 languages, most of which I’d never heard of, and this is a very small percentage of the over 1,300 languages spoken in the 15 countries currently experiencing the most severe crises. There’s also no reliable data on where exactly each language is spoken. I’ve learned so much about language technology that my dog can almost talk about the importance of maintaining translation memories and clean parallel datasets.

Communicating with conflict-affected people

The International Committee of the Red Cross and the Harvard Humanitarian Initiative have just published a report about communicating with conflict-affected people that mentions language issues and flags challenges with digital communications. (Yay!) Here are some highlights:

  • Language is a consistent challenge in situations of conflict or other violence, but often overlooked amid other more tangible factors.

  • Humanitarians need to ‘consider how to build “virtual proximity” and “digital trust” to complement their physical proximity.’

  • Sensitive issues relating to sexual and gender-based violence are largely “lost in translation.” At the same time, key documents on this topic are rarely translated and usually exclusively available in English.

  • Translation is often poor, particularly in local languages. Some technology-based solutions have been attempted, for example, to provide multilingual information support to migrants in Europe. However, there is still a striking inability to communicate directly with most people affected by crises.

TWB’s work, focusing on comprehension and technology, has found that humanitarians are simply unaware of the language issues they face.

  • In north-east Nigeria, TWB research at five sites last year found that 79% of people wanted to receive information in their own language; less than 9% of the sample were mother-tongue Hausa speakers. Only 23% were able to understand simple written messages in Hausa or Kanuri, falling to just 9% among less educated women who were second-language speakers of those languages. Even so, 94% of internally displaced persons receive information chiefly in Hausa or Kanuri.

  • In Greece, TWB found that migrants relied on informal channels, such as smugglers, as their trusted sources of information in the absence of any other information they could understand.

  • TWB research in Turkey in 2017 found that organizations working with refugees were often assuming they could communicate with them in Arabic. That ignores the over 300,000 people who are Kurds or from other countries.

  • In Cox’s Bazar, Bangladesh, aid organizations supporting the Rohingya refugees were working on the assumption that the local Chittagonian language was mutually intelligible with Rohingya, to which it is related. Refugees interviewed by TWB estimate there is a 70-80% convergence; words such as ‘safe’, ‘pregnant’ and ‘storm’ fall into the other 20-30%.

What can we do?

Humanitarian response is becoming increasingly digital. How do we build trust, even when remote from people affected by crises?

‘They only hire Iranians to speak to us. They often can’t understand what I’m saying and I don’t trust them to say what I say.’ – Dari-speaking Afghan man in Chios, Greece.

Speak to people in their language and use a format they understand: communicating digitally – or any other way – will mean being even more sensitive to what makes people feel comfortable and builds trust. The right language is key to that. Communicating in the right language and format is key to encouraging participation and ensuring impact, especially if the relevant information is culturally or politically sensitive. The right language is the language spoken or understood and trusted by crisis-affected communities; the right format means information is accessible and comprehensible. Providing only written information can hamper communication and engagement efforts with all sectors of the community from the start – especially women, who are more likely to be illiterate.

Lack of data is the first problem: humanitarians do not routinely collect information about the languages people speak and understand, or whether they can read them. It is thus easy to make unsafe assumptions about how far humanitarian communication ‘with communities’ is reaching, and to imagine that national or international lingua francas are sufficient. Collecting this information can be done safely, without harming individuals or putting the community at risk.

Budgets: Language remains below the humanitarian radar and is often absent from humanitarian budgets. Budgeting for and mobilizing trained and impartial translators, interpreters and cultural mediators can ensure aid providers can listen and provide information to affected people in a language they understand.

Language tools: Language information fact-sheets and multilingual glossaries can help organizations better understand key characteristics of the languages affected people speak and ensure use of the most appropriate and accurate terminology to communicate with them. TWB’s latest glossary for Nigeria provides terminology in English/Hausa/Kanuri on general protection issues and housing, land and property rights.

A global dataset on language

TWB is exploring ways of fast-tracking the development and dissemination of a global dataset on language and communication for crisis-affected countries, as a basis for planning effective communication and engagement in the early stages of a response. We plan to complement this with data mining and mapping of new humanitarian language data.

TWB has seen some organizations take this on – the World Health Organization and the International Federation of Red Cross and Red Crescent Societies have both won awards for their approaches to communicating in the right language. Oxfam and Save the Children regularly prioritize language, and the International Organization for Migration and the United Nations Office for the Coordination of Humanitarian Affairs are starting to routinely include language and translation in their programs. A few donors are beginning to champion the issue, too.

TWB has only really been able to demonstrate the possibilities for two or three years – and it’s really taking off. It’s such a no-brainer, so cost-effective, it’s not surprising that so many organizations are taking it on. Our next step is to ensure that language and two-way communication are routinely considered, information is collected on the languages that crisis-affected people speak, accountability mechanisms support it, and we make the overall response accessible for those who need protection and assistance.

Written by Aimee Ansari, Executive Director, Translators without Borders.