Tuesday, June 11, 2013

Word frequency lists

One of the things a Latin nerd is up against is poor word frequency lists. In English, there are lots of good lists. The frequency data from COCA is top-notch.

If you look at their big list, you'll see that somehow they manage to deal with inflected forms: be, which is English's most inflected word, is in second place. Properly so. But then you go to a pre-cooked Latin frequency list and the various forms of esse are scattered across the rankings as separate entries. And to a degree I understand. It's easier to build a frequency list that counts raw word forms and ignores which lemma each one belongs to. Quo could belong to quo or qui/quae/quod or quis/quid. I suppose there are ways around it, but then you have to program a computer to know the difference. I don't want to think about teaching a computer the difference between cum1 and cum2 (preposition versus conjunction). But to some degree that's small potatoes.
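To make the problem concrete, here is a minimal sketch of lemma-based counting in Python. Everything in it is an assumption for illustration: the tiny hand-made FORM_TO_LEMMAS table stands in for a real morphological lexicon, and the function name is made up. Forms that map to exactly one lemma get folded together; forms like quo, which could belong to several lemmas, are set aside rather than guessed at.

```python
from collections import Counter

# Hypothetical, hand-made lookup table (surface form -> candidate lemmas).
# A real list would need a full morphological lexicon; this is a toy.
FORM_TO_LEMMAS = {
    "est": ["sum"],
    "sunt": ["sum"],
    "esse": ["sum"],
    "erat": ["sum"],
    "quo": ["quo", "qui", "quis"],  # ambiguous without context
}

def lemma_frequencies(tokens):
    """Count lemmas for unambiguous forms; tally ambiguous forms separately."""
    counts = Counter()
    ambiguous = Counter()
    for token in tokens:
        form = token.lower()
        # Assumption: a form missing from the table is treated as its own lemma.
        lemmas = FORM_TO_LEMMAS.get(form, [form])
        if len(lemmas) == 1:
            counts[lemmas[0]] += 1
        else:
            ambiguous[form] += 1
    return counts, ambiguous

tokens = "est sunt esse erat quo est".split()
counts, ambiguous = lemma_frequencies(tokens)
print(counts.most_common())
print(ambiguous.most_common())
```

The point of the separate ambiguous tally is only to show where the hard part lives: deciding which lemma a given quo belongs to takes context, and this sketch does not attempt it.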

Perseus has a word frequency tool buried in the results page of the word study tool, and it's pretty cool. But as a frequency analysis for fax shows, it has an idiosyncratic approach to defining the corpus (e.g. de senectute is its own corpus, and so is epistulae ad familiares, and so on). So at Perseus you get an idea of frequency within a single work, so long as you're not interested in a broader vision of Latinity. Other lists give you an absolute ranking and no more. Some give you the lemma; others give you the assorted word forms. And then there's a super list that I love (it's true) from Dickinson College Commentaries.

In any case, I've not found one that's good at tracking down collocations—its own can of worms. Oh, woe to someone whose interest in Latin goes beyond the literary, historical or pedagogical.
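For what it's worth, the crudest possible stab at collocations is just counting which words turn up next to each other. The sketch below does only that, in Python, on a made-up token list; the function name and sample text are assumptions for illustration, and it ignores lemmas, word order freedom, and any statistical measure of association, all of which a serious Latin collocation list would need.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (a naive stand-in for collocations)."""
    lowered = [t.lower() for t in tokens]
    # zip pairs each token with the one that follows it
    return Counter(zip(lowered, lowered[1:]))

# Made-up sample, not drawn from any real corpus.
tokens = "res publica et res publica et senatus".split()
pairs = bigram_counts(tokens)
print(pairs.most_common(2))
```

Since Latin word order is so free, adjacent pairs will miss plenty of real collocations, which is part of why this is its own can of worms.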