If you have dabbled in quantitative linguistics, you may have encountered word frequency tables, which list how common each word is relative to the others. Scroll through these lists and a trend seems to emerge. In English, the most commonly used words include “the”, “of”, and “and”, while in Spanish, they are “de”, “la”, and “que”. These rankings of word frequencies are typically derived from huge compilations of written corpora, texts and other sources, on which statistical analyses are performed.
But what happens when we look at how often each word occurs relative to the next-ranked one? Here, we start to observe an interesting pattern. Take the Brown Corpus of American English text for example. The most frequently occurring word, “the”, makes up about 7% of the text, the second-most frequent word, “of”, about 3.5%, and the third-most frequent word, “and”, about 2.4%, and so on. We start to see that the frequency of any word is roughly inversely proportional to its rank in the frequency table. Simply speaking, the most frequent word appears about twice as often as the second-most frequent one, about thrice as often as the third, and so on. This inverse relationship between rank and frequency is what is normally referred to as Zipf’s Law.
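To make the arithmetic concrete, here is a minimal Python sketch of that rank-frequency relationship. It assumes the idealised form of Zipf’s Law with an exponent of exactly 1, and the function name `zipf_expected` is made up for illustration. The observed shares are simply the approximate Brown Corpus figures quoted above.

```python
def zipf_expected(top_share, max_rank):
    """Shares predicted for ranks 1..max_rank under an idealised Zipf's Law.

    Assumes an exponent of exactly 1, so the share at rank r is top_share / r.
    """
    return [top_share / rank for rank in range(1, max_rank + 1)]


# Approximate Brown Corpus shares quoted in the text, in percent.
observed = {"the": 7.0, "of": 3.5, "and": 2.4}
predicted = zipf_expected(top_share=7.0, max_rank=len(observed))

for (word, seen), expected in zip(observed.items(), predicted):
    print(f"{word}: observed {seen:.2f}%, predicted {expected:.2f}%")
```

The prediction for the third rank comes out at roughly 2.33%, which is close to, but not exactly, the 2.4% quoted for “and”. That is why the law is best read as an approximate tendency rather than an exact rule.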
As it seems, Zipf’s Law appears to apply to all languages, natural or constructed. However, no one is sure why this happens. Studies of randomly generated texts have produced conflicting evidence, with Ferrer-i-Cancho & Elvevåg (2010) finding a lack of support and Li (1992) finding support for Zipf’s Law in such texts. Zipf attempted to explain this distribution by proposing the principle of least effort, whereby neither speaker nor listener communicating in a particular language would want to put in more effort than necessary to understand each other. When the effort is evenly balanced between the two, the observed Zipf distribution would show up. Yet the pattern seems almost too regular and too universal to be true, right?
Studying Zipf’s Law involves heavy use of statistics to analyse word frequencies in various texts. Usually, the goodness of fit of the observed frequencies to a hypothesised power law distribution is checked with a statistical test known as the Kolmogorov-Smirnov test. Following that, the power law may be compared against other kinds of distributions using a log-likelihood ratio. There may be statistical packages available for R to test Zipf’s Law, and similar ones may exist for other statistical software like SPSS and Stata. Many of these tests are run on corpora and databases containing millions of words. However, I do want to see if a certain post series I have written follows Zipf’s Law. The dataset I have in mind is the Method Review series that I have written up to January 2021. What now?
The Method Review series would be an extremely small dataset, at thousands of words, compared to the millions of words taken from sites like Wikipedia and similar sources. Next up would be statistical know-how. While I have a statistical background, I do tend to get a little lazy with some aspects, or struggle to obtain the desired output for some statistical tests.
So, sacrificing professionalism for entertainment, we are going to use the frequency distribution calculator that we conveniently found on https://frequencydistributioncalculator.com/. A drawback is that this calculator only tells you whether your dataset follows Zipf’s Law, and does not report the statistical parameters you would often find in more professional texts. We might do a more thorough statistical test at some point down the road, but this is what we will make do with.
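For the curious, here is a rough Python sketch of the kind of number a more thorough check would start from: a Kolmogorov-Smirnov style distance between the observed rank-frequency distribution and an ideal Zipf distribution. This is not what the calculator does internally (I do not know that), and it is not a full significance test. The function name `zipf_ks_statistic` is made up, the exponent is fixed at 1 as an assumption, and a real analysis would estimate the exponent and compute a p-value as well.

```python
from collections import Counter


def zipf_ks_statistic(counts, exponent=1.0):
    """Largest gap between the empirical and ideal Zipf cumulative shares.

    counts: word counts in any order. exponent: assumed Zipf exponent
    (fixed here, not fitted). Returns a distance between 0 and 1, where
    smaller means closer to the ideal curve. This is only the distance,
    not a p-value.
    """
    freqs = sorted(counts, reverse=True)  # rank 1 = most frequent word
    total = sum(freqs)
    weights = [1.0 / rank ** exponent for rank in range(1, len(freqs) + 1)]
    norm = sum(weights)

    empirical_cdf = 0.0
    model_cdf = 0.0
    largest_gap = 0.0
    for freq, weight in zip(freqs, weights):
        empirical_cdf += freq / total
        model_cdf += weight / norm
        largest_gap = max(largest_gap, abs(empirical_cdf - model_cdf))
    return largest_gap


# Toy input, purely for illustration.
sample = "the cat and the dog and the bird saw the fish"
counts = Counter(sample.split()).values()
print(round(zipf_ks_statistic(counts), 3))
```

Counts that follow the 1 : 1/2 : 1/3 pattern exactly give a distance of essentially zero, while a flat distribution where every word is equally common gives a noticeably larger one.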
The eight posts on Method Review so far contain 15072 words and 93881 characters. After ignoring things like case and punctuation, we see that our post series does not follow Zipf’s Law, or at least that is what the calculator says. Anyway, the most frequently occurring words are, in descending podium positions, “the”, “to”, and “in”. There were 2198 unique words in the post series, although that count includes numerals. Interestingly, almost half of these unique words are marked HL, or hapax legomena, which are words that appear only once in the text under consideration.
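If you would like to reproduce counts like these yourself, a few lines of Python will do. This is only a sketch under my own assumptions: the tokenising rule (lowercase everything, keep runs of letters, digits and straight apostrophes) is a guess and not necessarily what the calculator uses, so the totals may differ slightly. The function name `word_stats` is made up for illustration.

```python
import re
from collections import Counter


def word_stats(text):
    """Count total words, unique words and hapax legomena in a text.

    Case and punctuation are ignored. Numerals are kept as words,
    as in the counts reported above.
    """
    # Assumed tokenising rule: runs of a-z, 0-9 and straight apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "total_words": len(words),
        "unique_words": len(counts),
        "hapax_legomena": len(hapaxes),
        "top_three": counts.most_common(3),
    }


# Toy input, purely for illustration.
sample = "The cat sat on the mat. The dog sat too!"
print(word_stats(sample))
```

On the toy sentence, this counts 10 words in total, 7 unique words, and 5 hapax legomena, with “the” on top at 3 occurrences.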
This was intended to be a random delve into a weird quirk of language, but if you want a proper statistical analysis of whether The Language Closet follows Zipf’s Law, feel free to let me know, and I will try to build up the statistical understanding behind distributions like these. If you really want to see if your own texts follow Zipf’s Law, the calculator makes checks like these more accessible to the public, who may lack the statistical background needed for a proper analysis. Thank you so much for reading.