Exploring the Uniform Information Density Hypothesis

When we talk about the subject (S), object (O), and verb (V) in a simple clause, there are six possible orders in which these elements can be placed. If one particular order is predominantly used in a language, it can be said to be that language's canonical word order. This is not to say that other word orders are impossible in that language; they are often available, but may convey a different emphasis or voice in the clause. Some languages are even more flexible, as case markings on the subject and object help identify these two elements regardless of their position.

But the distribution of these word orders across the languages of the world is rather disproportionate. In fact, I focused on one of the rarest canonical word orders in an essay I wrote about a year ago.

In short, a large majority of languages, over 95% of them, have a canonical word order in which the subject precedes the object. Among these, SOV and SVO dominate, accounting for roughly 45% and 42% of languages respectively.

And the first question I had was why. There does not seem to be a distinct, one-size-fits-most pattern in the distribution of canonical word orders across and within language families, other than the general tendency for the subject to precede the object. Within the Austronesian language family, for example, Malay follows an SVO canonical word order, Māori follows VSO, and Malagasy follows the rather unusual VOS.

When I looked up the various hypotheses proposed to explain this distribution of word orders, one in particular stood out: the Uniform Information Density hypothesis. My first source was a publication at the 2010 conference Advances in Neural Information Processing Systems.

Within that paper, the authors argued against competing hypotheses for this distribution of canonical word orders, including universal grammar and the esoteric common ancestor of all languages, Proto-World. They noted that this class of hypotheses cannot explain the drift that led over half of the world's languages to use word orders other than SOV, the most common one.

Other explanations based on certain linguistic principles were criticised as circular; while they offer some account of the distribution, the authors argue that they do not adequately justify why this distribution exists in the first place.

So, what is Uniform Information Density (UID)?

In essence, when we convey language verbally, there is a tendency to transmit information at a constant rate. Another term appears here: entropy. In this scenario, entropy describes the uncertainty about the underlying meaning of an utterance. Under UID, each piece of information serves to reduce this uncertainty.

This, the authors argue, is observed in speech rates as well: when high-entropy content is transmitted, speakers tend to slow down, and conversely, when low-entropy content is transmitted, they tend to speed up. In fact, this hypothesis has also been invoked to explain contractions like “it’s” and “you’re”.

So how does UID explain word order? Welcome to the world of computational linguistics, where corpus data, language models, and, yes, a lot of mathematics come into play. I will leave out as many of the technicalities as possible, but we will still cover the main concepts at play here.

Firstly, the groundwork. Suppose there is a set of objects O and a set of actions A, where the actions express some relationship between the objects. An event is then defined as a triplet of a first object o1, an action a that links the objects, and a second object o2, written as (o1, a, o2). There are numerous objects in O, just as there are numerous actions in A. If we draw the elements of an event independently, so that one draw does not affect the probability of the others, we end up with a probability distribution over the set of events O × A × O.
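As a minimal sketch of this groundwork, here is how such a distribution over O × A × O could be built in Python. The object and action names and the uniform slot probabilities are my own illustrative assumptions, not taken from the paper:

```python
import itertools

# Hypothetical toy world; the names and uniform probabilities here are
# illustrative assumptions, not the paper's corpora.
objects = ["Alice", "Bob", "bread", "tea"]
actions = ["eat", "drink", "see"]

# Each slot of the (o1, a, o2) triplet is drawn independently, so an
# event's probability is the product of its three slot probabilities.
p_obj = 1 / len(objects)
p_act = 1 / len(actions)
event_probs = {
    (o1, a, o2): p_obj * p_act * p_obj
    for o1, a, o2 in itertools.product(objects, actions, objects)
}

print(len(event_probs))                      # → 48 events in O x A x O
print(round(sum(event_probs.values()), 10))  # → 1.0, a valid distribution
```

With 4 objects and 3 actions there are 4 × 3 × 4 = 48 possible events, and the independent draws guarantee the probabilities sum to one.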

From each event, a total of 6 possible word orders can be formed. The authors defined each utterance as a three-word sequence expressing one of these events. This rather abstract setup can be made more concrete by populating the sets with names, nouns, and actions such as Bob, Bread, and Drink.

Now we have a world in which events include things like ‘Alice eats bread’ and ‘Carol drinks tea’. This is where we have to talk about entropy, which in this case essentially boils down to uncertainty. Take the event ‘Alice eats bread’ and obscure all of its elements: we are now entirely uncertain about what the event is. Reveal one word, and some information about the event is conveyed, reducing the uncertainty over the event and hence the entropy. Reveal another word to reduce this uncertainty further, and the final word of the three-word event brings it to zero. With each element revealed, there is a drop in uncertainty and entropy as more information is conveyed.

With one word revealed, the entropy can be calculated from the relevant conditional probability distribution: what is the probability of the other two elements of the event, given the element that has been revealed? Expressed mathematically, this is P(o1, o2 | a), P(o1, a | o2), or P(o2, a | o1), depending on which element came first. Similarly, with two elements revealed, we ask: given those two, what is the probability of the last element, i.e. P(o1 | o2, a), P(a | o1, o2), or P(o2 | o1, a).
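The entropy trajectory described above can be sketched in code. The following self-contained example assumes uniform, independent slot probabilities in a toy world of my own invention (the paper instead estimates these quantities from corpora); it computes the expected remaining entropy after revealing the first k elements of an event in a given order:

```python
import itertools
from math import log2

# Toy joint distribution with independent, uniform slots; an assumption
# for illustration, not the paper's corpus-based estimates.
objects = ["Alice", "Bob", "bread", "tea"]
actions = ["eat", "drink"]
events = list(itertools.product(objects, actions, objects))
p = {e: 1 / len(events) for e in events}

def remaining_entropy(order, k):
    """Expected entropy (in bits) over the unrevealed elements after the
    first k slots of `order` (indices into the (o1, a, o2) triplet) are
    revealed, averaged over what those revealed slots could be."""
    groups = {}
    for e, q in p.items():
        key = tuple(e[i] for i in order[:k])
        groups.setdefault(key, []).append(q)
    h = 0.0
    for qs in groups.values():
        total = sum(qs)  # P(revealed prefix)
        h += total * -sum((q / total) * log2(q / total) for q in qs)
    return h

# SVO-style reveal order: o1 first, then the action, then o2.
svo = (0, 1, 2)
print([round(remaining_entropy(svo, k), 2) for k in range(4)])
# → [5.0, 3.0, 2.0, 0.0]
```

With 32 equally likely events the initial entropy is log2(32) = 5 bits, and each revealed element removes exactly the entropy of its own slot (2 bits per object slot, 1 bit for the action), ending at zero once the whole event is known.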

In a perfectly UID scenario, we would expect the entropy to decrease uniformly with each element revealed, i.e. with one word revealed, the entropy would be 2/3 of the initial value, and with two words revealed, 1/3. This allows us to draw an entropy trajectory for each word order and compare it against the ideal UID trajectory. But of course, we do not live in an ideal world, and deviations are expected. A deviation score is computed from the difference between a word order's entropy trajectory and the UID trajectory, with the lowest score marking the most UID-like word order and the highest the least UID-like one. And instead of a fictional world of events, which merely illustrates the theoretical rationale, the researchers used corpora of texts, transcripts, and the like.
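To make the comparison concrete, here is a sketch that scores all six word orders against the ideal uniform decline. The skewed toy distribution and the simple absolute-difference score are my own assumptions; the paper's corpora and exact deviation metric differ:

```python
import itertools
from math import log2

# Toy world with a skewed distribution (the numbers are invented):
# each action strongly predicts its second object, as 'drink' predicts 'tea'.
people = ["Alice", "Bob"]
things = ["bread", "tea", "water", "cake"]
actions = ["eat", "drink"]
preferred = {"eat": "bread", "drink": "tea"}

p = {}
for o1, a, o2 in itertools.product(people, actions, things):
    p_o2 = 0.85 if o2 == preferred[a] else 0.05
    p[(o1, a, o2)] = (1 / len(people)) * (1 / len(actions)) * p_o2

def trajectory(order):
    """Expected remaining entropy after revealing 0..3 slots in `order`."""
    traj = []
    for k in range(4):
        groups = {}
        for e, q in p.items():
            groups.setdefault(tuple(e[i] for i in order[:k]), []).append(q)
        h = sum(t * -sum((q / t) * log2(q / t) for q in qs)
                for qs in groups.values() for t in [sum(qs)])
        traj.append(h)
    return traj

def uid_deviation(order):
    """Total absolute gap between the order's entropy trajectory and a
    perfectly uniform decline (a simple illustrative score)."""
    traj = trajectory(order)
    ideal = [traj[0] * (3 - k) / 3 for k in range(4)]
    return sum(abs(h, ) if False else abs(h - i) for h, i in zip(traj, ideal))

# Slot indices into (o1, a, o2): subject = 0, verb = 1, object = 2.
orders = {"SOV": (0, 2, 1), "SVO": (0, 1, 2), "VSO": (1, 0, 2),
          "VOS": (1, 2, 0), "OVS": (2, 1, 0), "OSV": (2, 0, 1)}
for name in sorted(orders, key=lambda n: uid_deviation(orders[n])):
    print(name, round(uid_deviation(orders[name]), 3))
```

The printout ranks the six orders from most to least UID-like under these invented probabilities; swapping in real corpus statistics, as the paper does, is what produces its empirical ranking.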

From the results of this 2010 study, we can see that object-first word orders (that is, OSV and OVS) are the least favourable under UID, and that when the object precedes the verb, there is an information ‘trough’, especially in the fictional world of events. This could help explain why these word orders are so rare in natural languages. What I found more bewildering, however, is that the SOV word order is less favourable under UID than SVO, yet it is the most common canonical word order among natural languages. The authors did not quite theorise about what was going on, beyond mentioning the possibility of other factors that make this word order more common.

One of the main caveats of this study is that words do not convey information all that uniformly. For instance, mentioning an action like ‘drink’ signals that one of the objects in the event is likely semantically or semiotically connected to the action, and could, say, be a liquid. This raises the probability of that pretty restricted set of objects relative to other nouns. Going back to our fictitious world of limited events, if ‘tea’ is the only object semantically or semiotically connected to the action ‘drink’, then the probability of ‘tea’ being one of the objects in the event, given the action ‘drink’, would be 1, a certainty.
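This collapse of uncertainty is easy to demonstrate. In the hypothetical event set below (entirely my own construction), ‘tea’ is the only object that ever follows ‘drink’, so conditioning on the action leaves no uncertainty at all:

```python
# Toy illustration of the caveat: 'tea' is the only object compatible
# with 'drink' in this invented world, so conditioning on the action
# collapses the uncertainty about the second object entirely.
events = {
    ("Alice", "drink", "tea"): 0.2,
    ("Bob", "drink", "tea"): 0.2,
    ("Alice", "eat", "bread"): 0.3,
    ("Bob", "eat", "cake"): 0.3,
}

p_drink = sum(q for (o1, a, o2), q in events.items() if a == "drink")
p_drink_tea = sum(q for (o1, a, o2), q in events.items()
                  if a == "drink" and o2 == "tea")

# P(o2 = 'tea' | a = 'drink') = P(drink, tea) / P(drink)
print(p_drink_tea / p_drink)  # → 1.0
```

A conditional probability of 1 means zero remaining entropy in that slot, which is exactly the kind of non-uniform information flow the caveat points to.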

As such, it would be interesting to explore what makes the SOV word order so common, since UID alone cannot fully explain it. Could this be accounted for by other linguistic or behavioural characteristics of speakers of SOV languages, such as speaking rates in syllables per second? Other avenues of investigation could look into why sporadic examples of languages with object-first canonical word orders still exist despite their unfavourability under UID, and what features of these languages could mitigate that unfavourability.

I have some other ideas to explain this: perhaps in more complex sentences with more constituent elements, the entropy trajectory of the SOV word order comes closest to what the UID hypothesis posits. However, I do not yet have a way of testing this empirically. This might also open up more subtypes of word orders to study, such as the use of prepositions versus postpositions, clitics, and noun case systems. As it stands, though, the UID hypothesis alone does not explain well why SOV is the most common canonical word order among the world's languages.

Nevertheless, studies like this offer insight into the observed distribution of canonical word orders among the world's natural languages from a computational linguistics perspective. While the methodology is technical, I highly recommend taking a look at the publication and forming your own opinion on the extent to which UID can explain why some word orders are more common than others.
