Friday, March 3, 2017

Lexical Distance Among the Languages of Europe: a hoax, a political pamphlet or actual popular science?

It can only please a linguist to see a piece of linguistic knowledge going viral on social networks. When I first saw the image titled Lexical Distance Among the Languages of Europe below, I was not just happy - I thought: what an enormous amount of work must have been invested to produce this chart!

Lexical Distance Network Among the Major Languages of Europe

To a layman this may seem a simple thing to do: just compare the electronic dictionaries of the languages (ideally ones compiled with the same lexicographic methodology, to control for differences in dictionary-making practice), and count the words that are identical. But electronic dictionaries of different languages cannot be compared this way, because a word that is pronounced the same in two languages is often spelled differently in each of them (for instance, the word группа 'group' in Russian is pronounced the same and means the same as the word група 'group' in Ukrainian, but since, for arbitrary non-linguistic reasons, they are spelled differently, they won't be recognized as identical). Moreover, between two closely related languages certain words often occur with a slightly different pronunciation, but with the same meaning and almost identical morphological forms (e.g. Russian раздел and Ukrainian роздiл, pronounced almost the same, declined almost the same, and both meaning 'section'). A fair way to calculate the lexical distance would include thorough etymological research (research into the history of words), a measure of phonological and morphological change since the common ancestor word (how much the pronunciation and the different forms of the words have changed), and a correction factor for shared or related complex words (since their parts are already counted in other words), especially in the cases where complex words share only certain parts, to mention just a few of the complex measures involved, most of which are unavoidably subject to approximation.
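To make the spelling problem concrete, here is a minimal sketch in Python using the two word pairs above. It is only an illustration, not the method behind the chart: the function names levenshtein and normalized_distance are my own, and a plain edit distance over spellings is a crude proxy that ignores etymology, pronunciation and morphology altogether.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (0.0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# (Russian, Ukrainian) pairs taken from the paragraph above.
pairs = [("группа", "група"), ("раздел", "роздiл")]

for ru, uk in pairs:
    print(f"{ru} / {uk}: "
          f"identical spelling = {ru == uk}, "
          f"edit distance = {levenshtein(ru, uk)}, "
          f"normalized = {normalized_distance(ru, uk):.2f}")
```

Exact matching treats both pairs as entirely different words, while the edit distance shows that they differ by only one and two letters respectively. Even so, a number like this says nothing about whether two words are actually related, which is why the etymological work cannot be skipped.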

So upon seeing a chart like this, a linguist is also immediately prompted to look for the methodology, especially when some of the results seem utterly surprising. For instance, every Slavicist will be in disbelief seeing that Serbian is closer to Russian than Polish, Ukrainian or Belarusian are. Even more so seeing that Croatian is as close to Slovak as to Slovenian, and Slovak is as close to Croatian as to Czech. And that Romanian is related to Albanian, but not to Serbian or to Russian, while Albanian is related to Slovenian, but not to Serbian. Some of these peculiarities do have a plausible explanation, though one that comes from political science rather than linguistics: that Ukrainian is very different from Russian (Crimea, Donbass) and that Romanian is unaffected by it (Transnistria), that Croatian has different ties than Serbian, that Albanian is distant from and unaffected by Serbian (Kosovo).

The earliest occurrence of this chart is from 2013. There, as well as on several other pages carrying the chart, it is attributed to "K. Tyshchenko (1999), Metatheory of Linguistics", a book published in Ukrainian by a Ukrainian linguist (original title: Метатеорія мовознавства) and freely available online. I have carefully examined the book, and could not find even the lexical distance data, let alone the methodology by which it was gathered. All I did find was a lot of what struck me as obscure, arbitrary and problematic linguistic methodology, and even more classifications which didn't seem to make much sense. I also found one reference that might have to do with the chart, namely a reference to a set within a linguistic museum exhibition which represents the Indo-European languages and their relations (chapter Мiжфакультетський Лiнґвiстичний навчальний музей Київського унiверситету 'Interfaculty Linguistic Teaching Museum of Kyiv University', section 2.2. Ґалерея мов свiту 'Gallery of the world's languages').

Since the chart has been shared by a large number of people, including many linguists, I think it is important to state the outcome of my little investigation: unless someone manages to find a specification of its methodology (which I doubt will happen) and that methodology proves adequate, the chart should be treated as a hoax with a political agenda. But the idea behind it remains beautiful, and I would like to see a chart of the same kind plotted from real linguistic data.


  1. "unless someone manages to find the specification of its methodology - which I doubt will happen"
    This is my attempt:

  2. Thanks a lot, you gave a very clear explanation there. I also completely agree about the points of improvement that could bring about a more reliable estimate, but I would also add a much bigger and more carefully picked sample of words.