Today’s blog post is for the geeks. Seriously, if you don’t think frequency analysis for its own sake can be fun, you probably won’t enjoy this much.
Stan Carey has a blog post about the length of the chemical name of the largest known protein, considered as though it were a word. It takes three and a half hours to read aloud, so it would easily be the longest word in the English language were it not for the fact that it doesn’t count.
I decided to play around. I took the chemical name, stripped the hyphens and whitespace from the raw text, and ran the result through a character frequency analyser. This told me that the letter L occurs 14645 times, accounting for 22.9% of the text. At the low end, the letter D occurs a measly 238 times, which is just 0.4%. The letters not present at all are B, F, J, K, Q, W, X and Z.
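For fellow geeks who want to try this at home, here is a minimal Python sketch of that kind of letter count. It is not the analyser I used; the function names and the short sample string are made up for illustration, and you would paste in the full chemical name yourself.

```python
from collections import Counter
import string


def letter_frequencies(name):
    """Return (letter, count, percentage) tuples, most frequent first."""
    # Keep letters only, so hyphens and whitespace don't skew the percentages.
    letters = [ch.upper() for ch in name if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return [(letter, n, 100 * n / total) for letter, n in counts.most_common()]


def missing_letters(name):
    """Return the letters of the alphabet that never appear in the name."""
    present = {ch.upper() for ch in name if ch.isalpha()}
    return sorted(set(string.ascii_uppercase) - present)


# Hypothetical short sample, standing in for the full name.
sample = "methionyl-threonyl-threonyl-glutaminyl"

for letter, n, pct in letter_frequencies(sample):
    print(f"{letter}: {n} ({pct:.1f}%)")

print("Absent:", ", ".join(missing_letters(sample)))
```

Dividing by the count of letters only, rather than the length of the raw text, is what keeps the percentages honest once the hyphens and spaces are gone.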
Noticing that the chemical name contains a multitude of components ending in ‘yl’, I inserted a space after each occurrence of that pair, then fed the result through a word frequency analyser. It gave me the following totals for each component word.
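The ‘yl’ trick is easy to reproduce too. The following is a rough Python sketch under the same caveats as before: the function name and sample string are hypothetical, and this is not the word frequency analyser I actually used.

```python
from collections import Counter


def component_counts(name):
    """Split a chemical name after every 'yl' and count the pieces."""
    # Drop hyphens, whitespace and anything else that isn't a letter.
    cleaned = "".join(ch for ch in name.lower() if ch.isalpha())
    # Insert a space after each 'yl', then split on whitespace.
    spaced = cleaned.replace("yl", "yl ")
    return Counter(spaced.split())


# Hypothetical short sample, standing in for the full name.
sample = "methionyl-threonyl-threonyl-glutaminyl"

for word, n in component_counts(sample).most_common():
    print(f"{word}: {n}")
```

One quirk of splitting this naively: any component with ‘yl’ in the middle gets cut in two, so ‘phenylalanyl’ comes out as ‘phenyl’ and ‘alanyl’. Whatever follows the last ‘yl’ in the name also ends up as a word of its own.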