The paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? is well known, not least because two of its authors apparently lost their jobs at Google for publishing it.
Here is an excerpt from it:
When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first. Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?
This refers to the energy cost of training a large language model such as ChatGPT (though the paper was written in 2021), implying that the resulting CO2 emissions may contribute to rising sea levels. Dhivehi is the language of the Maldives. Granted, a single training run is not going to noticeably raise the sea level around the Maldives even in the most pessimistic scenario, and certainly not as much as flying tourists there in the first place. But still one naturally wonders whether the Dhivehi language is supported by GPT-4.
I asked GPT-4 to translate some sentences from English to Dhivehi and back (in a new chat). On my small unscientific benchmark, "Please lock the door after 6PM, thanks" becomes "Give it to 6 people, thank you" after the round trip to Dhivehi. "I am feeling really bad, I think I need to go to the hospital" becomes "I felt sick, so I have to go to the hospital" (good enough). "Two hours of Thai massage is a bit too much" comes back as "You look tired from the mosque activities", which is… bizarre? Finally "Is this granola gluten-free?" is preserved word for word.
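The "benchmark" is just a round trip: translate English → Dhivehi → English and compare the result to the original. Here is a minimal sketch of that harness, with the model calls stubbed out — `fake_to_dhivehi` and `fake_to_english` are placeholders of my own, not real translators; a real run would prompt the model in their place.

```python
# Round-trip translation test: English -> target language -> English.
# `to_target` and `to_english` stand in for whatever model you query,
# e.g. a chat LLM prompted with "Translate the following to Dhivehi: ...".

def round_trip(sentence, to_target, to_english):
    """Translate a sentence into the target language and back."""
    return to_english(to_target(sentence))

# Placeholder translators, for illustration only.
def fake_to_dhivehi(s):
    return "[dv] " + s  # a real translator would return Thaana text here

def fake_to_english(s):
    return s.removeprefix("[dv] ")

print(round_trip("Is this granola gluten-free?", fake_to_dhivehi, fake_to_english))
# -> Is this granola gluten-free?
```

A sentence that survives the round trip unchanged (like the granola one did) is evidence the model handles it; a garbled return is evidence it does not, though of course two errors could also cancel out.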
So Dhivehi proficiency in GPT-4 is still hit-or-miss at best: granola and gluten-free are loanwords to begin with, but at least we get to go to the hospital, presumably by boat. For comparison I tried the same exercise with Bengali, a language whose script is also very different from the Latin alphabet but which has orders of magnitude more speakers (it is the seventh most spoken language in the world). The sentences come back essentially identical. So the issue is, very likely, the absence of Dhivehi text in GPT-4's training corpus.
To be fair to OpenAI, this is not the first time technology has been unkind to Dhivehi, whose native script, Thaana, was nearly obliterated by uncooperative telex machines back in the '70s. Interestingly, if Wikipedia is to be trusted, the Thaana script uses eastern Arabic numerals as letters, similarly to how the Cherokee syllabary uses western Arabic numerals; it must be one of the very few scripts to do so. It would have been beyond sad if it had been replaced by an improvised phonetic transcription into boring Latin characters, as nearly happened.
At any rate, I made this market on Manifold so you can help predict whether the next iteration of GPT will finally support Dhivehi. While you think about it, I leave you with a nice picture of "އަށަރު" (Ah'arru, i.e. the sea according to GPT-4).
Note that if the Dhivehi characters are rendered as boxes with or without a cross inside, you likely need to install the relevant fonts. This one worked for me: https://fontzone.net/font-details/mv-boli
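If you want to check programmatically that a string really is Thaana, rather than squinting at tofu boxes, the Thaana Unicode block is U+0780–U+07BF. A small sketch:

```python
# Check what fraction of a string's non-space characters fall in the
# Thaana Unicode block (U+0780 - U+07BF), which covers Thaana letters
# and vowel signs.

def is_thaana(ch):
    return 0x0780 <= ord(ch) <= 0x07BF

def thaana_ratio(text):
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_thaana(c) for c in chars) / len(chars)

print(thaana_ratio("އަށަރު"))  # -> 1.0, every character is in the Thaana block
print(thaana_ratio("hello"))   # -> 0.0
```

A ratio of 1.0 tells you the text is genuinely Thaana and any boxes you see are a font problem, not a model problem.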