Abstract:
In this thesis, we consider the strength of evidence for Zipf’s law in language. Historically, the statistical approaches taken to examine this have not generally been appropriate to the models proposed, although recent work has begun to change this. In Chapter 1, we introduce the basic concepts on which this work relies. In Chapter 2, we examine Zipf’s law in comparison to the mixed and interspersed geometric models, by means of likelihood ratios. This is a theory-heavy approach, and it illustrates that, even if Zipf’s law cannot be entirely dismissed, there is clear evidence that it fails to properly model real linguistic data. In Chapter 3, we apply significantly simpler indicators to both real linguistic data and ‘monkey models’, in which texts are produced by the random generation of characters. We find evidence that these models, which have occasionally been used as evidence against the Zipf distribution, do not reasonably model linguistic data, and we additionally find a richness of behaviour suggesting that claims about linguistic samples as a whole should be treated with great caution. In Chapter 4, we examine some of the theoretical justifications for Zipf’s law. Although we can see some linguistic influence here, the arguments are strained, and do not rise to the level where we can readily accept them without significant further work.
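As a concrete illustration of the ‘monkey model’ mentioned above, the following minimal Python sketch (our own illustration, not code from the thesis; the alphabet, space probability, and sample length are arbitrary choices) generates a text character by character and extracts its word rank–frequency sequence, the object that such models are compared against Zipf’s law on.

```python
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcdefghijklmnopqrstuvwxyz",
                p_space=0.2, seed=0):
    """Emit a space with probability p_space, otherwise a uniformly
    random letter; 'words' are the maximal runs between spaces."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return "".join(chars)

def rank_frequency(text):
    """Word frequencies sorted into rank order (rank 1 = most frequent)."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)

# The resulting sequence is non-increasing in rank, and can be compared
# against the power-law decay that Zipf's law would predict.
freqs = rank_frequency(monkey_text(100_000))
```

Under this scheme shorter words are generated far more often than longer ones, which is what gives monkey models their superficially Zipf-like rank–frequency curves.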