If conversation were chemistry

Today’s blog post is for the geeks. Seriously, if you don’t think frequency analysis for its own sake can be fun, you probably won’t enjoy this much.

Stan Carey has a blog post about the length of the chemical name of the largest known protein, considered as though it were a word. It takes three and a half hours to read aloud, so it would easily be the longest word in the English language were it not for the fact that it doesn’t count.

I decided to play around. I started by taking the chemical name and, after stripping hyphens and whitespace from the raw text, running it through a character frequency analyser. This told me that the letter L occurs 14645 times, accounting for 22.9% of the text. At the low end, the letter D occurs a measly 238 times, which is just 0.4%. Letters not present at all are B, F, J, K, Q, W, X and Z.
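
For anyone who wants to repeat this step without an online tool, here is a minimal Python sketch of a character frequency count. It is not the analyser used for the figures above. The function name and the sample string are made up for illustration, and it drops every non-letter rather than only hyphens and whitespace.

```python
from collections import Counter

def letter_frequencies(raw_text):
    """Count letters after dropping hyphens, whitespace and other non-letters."""
    letters = [ch.upper() for ch in raw_text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    # Map each letter to (count, percentage of all letters), most common first.
    return {ch: (n, 100 * n / total) for ch, n in counts.most_common()}

# Hypothetical sample: a short made-up fragment, not the real chemical name.
sample = "methionyl-threonyl-\nglutaminyl-alanyl"
freqs = letter_frequencies(sample)
for letter, (count, percent) in freqs.items():
    print(f"{letter}: {count} ({percent:.1f}%)")
```

Feeding the full name in place of the sample should give a table comparable to the figures quoted above.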

Noticing that the chemical name contains a multitude of components ending in ‘yl’, I inserted a space after each occurrence of that pair, then fed the result through a word frequency analyser. It gave me the following totals for each component word.

leucyl: 1147 ... glutamyl: 971 ... alanyl: 777 ... 
glutaminyl: 763 ... seryl: 730 ... lysyl: 697 ... 
aspartyl: 465 ... valyl: 430 ... arginyl: 425 ... 
threonyl: 413 ... isoleucyl: 391 ... glycyl: 290 ... 
asparaginyl: 289 ... histidyl: 238 ... phenyl: 206 ... 
methionyl: 204 ... prolyl: 176 ... tyrosyl: 142 ... 
tryptophyl: 131 ... cysteinyl: 117 ... leucine: 1
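
The splitting step can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool used above: the function name and the sample fragment are hypothetical. Note that a component such as phenylalanyl gets split into phenyl and alanyl, which is presumably where the phenyl entry in the table comes from.

```python
import re
from collections import Counter

def component_counts(name):
    """Insert a space after every 'yl', then count the resulting words."""
    spaced = re.sub(r"yl", "yl ", name)
    return Counter(spaced.split())

# Hypothetical sample: a made-up fragment, not the real chemical name.
sample = "methionylthreonylphenylalanylalanylleucine"
counts = component_counts(sample)
for word, n in counts.most_common():
    print(f"{word}: {n}")
```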

My next step was to divide the component words into syllables, which I take to be:

leu-cyl ... glu-ta-myl ... a-la-nyl ... 
glu-ta-mi-nyl ... se-ryl ... ly-syl ... 
as-par-tyl ... va-lyl ... ar-gi-nyl ... 
thre-o-nyl ... i-so-leu-cyl ... gly-cyl ... 
as-pa-ra-gi-nyl ... his-ti-dyl ... phe-nyl ... 
me-thi-o-nyl ... pro-lyl ... ty-ro-syl ... 
tryp-to-phyl ... cys-tei-nyl ... leu-cine

I then computed syllable frequencies, as shown below. My calculations from this point on may contain errors, since I used the Windows calculator and other non-automated tools, but I assume I would have noticed any large errors. Feel free to check my work.

nyl: 3194 ... cyl: 1828 ... glu: 1734 ... ta: 1734 ... 
leu: 1539 ... myl: 971 ... syl: 839 ... a: 777 ... 
la: 777 ... mi: 763 ... as: 754 ... se: 730 ... 
ryl: 730 ... gi: 714 ... ly: 697 ... o: 617 ... 
lyl: 606 ... par: 465 ... tyl: 465 ... va: 430 ... 
ar: 425 ... thre: 413 ... i: 391 ... so: 391 ... 
gly: 290 ... pa: 289 ... ra: 289 ... his: 238 ... 
ti: 238 ... dyl: 238 ... phe: 206 ... me: 204 ... 
thi: 204 ... pro: 176 ... ty: 142 ... ro: 142 ... 
tryp: 131 ... to: 131 ... phyl: 131 ... cys: 117 ... 
tei: 117 ... cine: 1
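
For readers who would rather check the work with a script than a calculator, the sketch below recomputes the syllable totals from the word counts and the syllable divisions given earlier. The data is copied from the lists above; the variable and function names are only illustrative.

```python
from collections import Counter

# Word counts, keyed by the hyphenated syllable divisions listed above.
WORDS = {
    "leu-cyl": 1147, "glu-ta-myl": 971, "a-la-nyl": 777,
    "glu-ta-mi-nyl": 763, "se-ryl": 730, "ly-syl": 697,
    "as-par-tyl": 465, "va-lyl": 430, "ar-gi-nyl": 425,
    "thre-o-nyl": 413, "i-so-leu-cyl": 391, "gly-cyl": 290,
    "as-pa-ra-gi-nyl": 289, "his-ti-dyl": 238, "phe-nyl": 206,
    "me-thi-o-nyl": 204, "pro-lyl": 176, "ty-ro-syl": 142,
    "tryp-to-phyl": 131, "cys-tei-nyl": 117, "leu-cine": 1,
}

def syllable_frequencies(words):
    """Add each word's count to every syllable that word contains."""
    totals = Counter()
    for hyphenated, count in words.items():
        for syllable in hyphenated.split("-"):
            totals[syllable] += count
    return totals

syllables = syllable_frequencies(WORDS)
for syllable, n in syllables.most_common():
    print(f"{syllable}: {n}")
print("total syllables:", sum(syllables.values()))
```

For example, nyl collects the counts of alanyl, glutaminyl, arginyl, threonyl, asparaginyl, phenyl, methionyl and cysteinyl, which add up to the 3194 shown above.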

Each syllable can be divided into one or more components, which I’ll call onset (the initial consonant), glide (an r or l after the initial consonant), nucleus (the vowel) and coda (the final consonant). All the syllables have a nucleus, but they don’t necessarily have the other components. A summary of syllable types and their frequencies is:


Using a tilde (~) to represent null, onset frequencies are:

l: 3619 ... n: 3194 ... ~: 2964 ... t: 2958 ... 
g: 2738 ... s: 1960 ... c: 1946 ... m: 1938 ... 
r: 1161 ... p: 930 ... th: 617 ... v: 430 ... 
ph: 337 ... d: 238 ... h: 238

Glide frequencies are:

~: 22524
l: 2024
r: 720

The glide l occurs only after the onset g and, because glutamyl and glutaminyl are so common, the two occur together far more often than g occurs alone (2024 times against 714). The glide r can occur after th, p and t. There are more thr’s than th’s (413 against 204), but fewer pr’s than p’s (176 against 754) and far fewer tr’s than t’s (131 against 2827).

If we consider the onset and glide as part of the same thing, then the adjusted onset frequencies are:

l: 3619 ... n: 3194 ... ~: 2964 ... t: 2827 ... 
gl: 2024 ... s: 1960 ... c: 1946 ... m: 1938 ... 
r: 1161 ... p: 754 ... g: 714 ... v: 430 ... 
thr: 413 ... ph: 337 ... d: 238 ... h: 238 ... 
th: 204 ... pr: 176 ... tr: 131

Nucleus frequencies are:

y: 10379 ... a: 5940 ... i: 2549 ... u: 1734 ... 
e: 1553 ... eu: 1539 ... o: 1457 ... ei: 117

And coda frequencies are:

~: 14135 ... l: 9002 ... s: 1109 ... 
r: 890 ... p: 131 ... ne: 1
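
The component tallies above can be reproduced with a short script. The sketch below starts from the syllable frequencies listed earlier and splits each syllable with a regular expression. The pattern is an assumption about how the split was made: it treats th and ph as single onsets, l or r after an onset as the glide, eu and ei as single nuclei, and whatever follows the nucleus (including the ne of cine) as the coda. It also tallies syllable shapes, which come out as six distinct types.

```python
import re
from collections import Counter

# Syllable frequencies copied from the list above.
SYLLABLES = """
nyl:3194 cyl:1828 glu:1734 ta:1734 leu:1539 myl:971 syl:839 a:777
la:777 mi:763 as:754 se:730 ryl:730 gi:714 ly:697 o:617
lyl:606 par:465 tyl:465 va:430 ar:425 thre:413 i:391 so:391
gly:290 pa:289 ra:289 his:238 ti:238 dyl:238 phe:206 me:204
thi:204 pro:176 ty:142 ro:142 tryp:131 to:131 phyl:131 cys:117
tei:117 cine:1
"""

# Assumed split: optional onset, optional glide, nucleus, then the rest as coda.
PATTERN = re.compile(r"(th|ph|[^aeiouy])?([lr])?(eu|ei|[aeiouy])(.*)")

onsets, glides, nuclei, codas = Counter(), Counter(), Counter(), Counter()
merged_onsets, shapes = Counter(), Counter()

for item in SYLLABLES.split():
    syllable, number = item.split(":")
    count = int(number)
    onset, glide, nucleus, coda = PATTERN.fullmatch(syllable).groups(default="")
    # A tilde stands for a missing component, as in the lists above.
    onsets[onset or "~"] += count
    glides[glide or "~"] += count
    nuclei[nucleus] += count
    codas[coda or "~"] += count
    merged_onsets[(onset + glide) or "~"] += count
    # Shape of the syllable, e.g. "ON" (onset + nucleus) or "OGNC".
    parts = (onset, glide, nucleus, coda)
    shapes["".join(tag for tag, part in zip("OGNC", parts) if part)] += count

tables = [("onsets", onsets), ("glides", glides), ("adjusted onsets", merged_onsets),
          ("nuclei", nuclei), ("codas", codas), ("shapes", shapes)]
for title, table in tables:
    print(title, dict(table.most_common()))
```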

We now have all the information we need to use Mark Rosenfelder’s vocabulary generator to create text with comparable statistical properties to the chemical name we started with. We will ignore the nulls in the syllable component lists, taking the six syllable types listed earlier as an adequate summary.

If we let onsets and glides occur in any combination (which of course they don’t, but we’ll get to that), the settings to enter are:

Syllable types: (1=onset, 2=glide, 3=nucleus, 4=coda)


Rewrite rules: (recycling letters that don't occur in the chemical name)


Edit: You also need F|Th B|Ph Q|Eu J|Ei Z|Ne because Gen buggers up capitalisation.



For other settings I suggest dropoff=medium and monosyllables=rare. (Hey Zompist, if Gen sent form data by the GET method I wouldn’t have to type out the settings; all I’d need is a link.) When I tried this, I got the following nonsense text:

Thilpil relelotalra seule ryle ticeslila grynap. Nilma cylly yteilylily galsip laleis nullylrys. Lasli nisny nanyltul larri lyli satilyl. Talutu tameus gruleis gatlys nlelrulileil gyhaneinleutri! Lial tlyteislilne cleulygleusil. Lilgir tra sliveullyleslil leul tasa mypcasgyla nacanystilral nyly cyrri sypnyga. Lynesla lalu lryrlyslal neugynivi llutle luny atylalsisleu mylly. Nasnias lisnylasa leureu ucis nralru nelynuar. Tylla srali gicy rius llo ylyne. Gua lygaly? Nusglaeul lileuga meu teiyl tirny rlacasy. Ya nygu gy nyla tlululyl? Naga peiy gynu nenri? Rae ysa tallatlugu lonanry. Nallageulil nynegeigasine tlaleti nulmal selu lleutyllyrmeuly. Lyles ly legle ara nysnaniltra nipu. Teltar masytus tyryp lreusysyne. Lil ny tivy tiagrelly tlamlyl nisgyl. Nallyte geuri nymy iphycei. Lirlalli teilalel typlla tutholmi yrene paveusla. Lyphytyr nanu nisleutaneteiltane tultelai utrise nyllu. Tica vagyl natilry. Nalsru glatal lyla leucir lrasna musli. Leuptly menys lrui otis nacis tupnatusluny. Nusylceinupyr lepsa soduna syla gieus nanelar agur yslri llaa. Leulli a le llumotha slyganis nlatla. Lyleurgeup llylreu saliti lei nassy glililaga. Galyslys pyylgyne renlymleisy lanulalgyr lanynly yltiy? Grimra nilnlytyr timre apli sillyi niltuy. Eulia nuthri nisleur syni nliytru eule?
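
To see how frequencies like these turn into nonsense words, here is a rough Python stand-in for this first setup. It is not Gen and does not reproduce its settings such as dropoff: it is a plain weighted sampler. The onset, glide, nucleus and coda weights are copied from the lists above with the nulls left out. The syllable type weights are an assumption, tallied here from the syllable list, and the word length rule is invented for the example.

```python
import random

# Weights copied from the onset, glide, nucleus and coda lists above (no nulls).
ONSETS = {"l": 3619, "n": 3194, "t": 2958, "g": 2738, "s": 1960, "c": 1946,
          "m": 1938, "r": 1161, "p": 930, "th": 617, "v": 430, "ph": 337,
          "d": 238, "h": 238}
GLIDES = {"l": 2024, "r": 720}
NUCLEI = {"y": 10379, "a": 5940, "i": 2549, "u": 1734, "e": 1553, "eu": 1539,
          "o": 1457, "ei": 117}
CODAS = {"l": 9002, "s": 1109, "r": 890, "p": 131, "ne": 1}
SLOTS = {"1": ONSETS, "2": GLIDES, "3": NUCLEI, "4": CODAS}

# Syllable types (1=onset, 2=glide, 3=nucleus, 4=coda).
# ASSUMPTION: these weights are tallied from the syllable list, not taken from Gen.
TYPES = {"134": 9823, "13": 9737, "123": 2613, "3": 1785, "34": 1179, "1234": 131}

def pick(weights, rng):
    """Draw one key with probability proportional to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]

def make_syllable(rng):
    # Onsets and glides combine freely here, so unattested pairs like "tl" appear.
    return "".join(pick(SLOTS[slot], rng) for slot in pick(TYPES, rng))

def make_word(rng, max_syllables=4):
    # ASSUMPTION: word length is drawn uniformly from 1 to max_syllables.
    return "".join(make_syllable(rng) for _ in range(rng.randint(1, max_syllables)))

rng = random.Random(0)  # fixed seed so repeated runs give the same words
print(" ".join(make_word(rng) for _ in range(8)))
```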

But what if we only allow onset and glide combinations that are actually attested? Then the generator input would be:

Syllable types: (1=onset, 2=nucleus, 3=coda)


Rewrite rules:


Edit: You also need Α|Gl Β|Thr Γ|Pr Δ|Tr plus the supplements from earlier edit.



And the output I got was:

Tilphatiglaline la leu tipnili ascyr niglityalpeis? Nanel eisyl teuslalista gleulirteu cotur renicu. Lini ustalyy syllilya gly lylnusty naplel. Gli teillip leugluil ae sutylnalleil. Nesillei leigly? Ilyr ypyl glyteney anyr aneneu nyeunyty? Lylitus glala letieuti cespi glopeilysneu nysglolla lisu glulygirip nyla tygly glasgly? Leiglysne glalusal unyulil lysi tynynyni glumoglyr. Luteula leiral cyca cyl rilyrly. Ymyp alti aglys rurtelyelnul silusliny gly. Geucy enuupmu nitylyr leirlys euymas. Ynymisnu tanalyluta eirsyl leigly so ylyr. Cysly talisnathryneil gluta teussy liyleu. Lyalla iglinoslenesa leusis lalolpeir toetana nistysglyne lysra. Lelral taa urglytel nynales glunerysa lylcyta. Gleulyp asil neula ta lyrlap tilma. Turnuys glilse caitusca ligleu taglune laula glela isis. Mely caleirteu lutyl eusmu lyteulglelyr glurae. Vyrseral lonyr lyceule lynotylar ticatallala yurlul. Ismer leigleus illymeulygly thrycay camair y? Neusylneu rinytutepa illi ury eny teteir? Saltiscirla e yrnaceul salylylyl rarininu cicutylup. Segli tylnircy glama glaeistagli pyul teusnalo. Nanimiglar tarne ly eiglu. Lyylisilol oy nee illy? Lota puuyle nya glyli teti talei glary loylas ena. Lypal eiy cinatypyl sysris tiygle luny calglasly? Yney lyly gleunily eullise lurtystar einuse. Livi tuny siculeu trasgla nyply anys. Sisalalei nipseulayeu ratas tystyl a neuletiltugly.

If you’ve read this far, you may need to take some talisnathryneil. Cysly.


22 Responses to “If conversation were chemistry”

  1. Stan Says:

    Nice work, Adrian. This is very amusing, and not at all cylly. But it’s probably just as well languages generally aren’t engineered in a lab.

  2. Adrian Morgan Says:

    Thanks Stan. I’m not convinced it isn’t cylly; I may even have a ticatallala. But it’s gratifying that someone enjoyed it. (I’ve made some edits, but none that affect the numerical data.)

  3. Sofia // Papaya Pieces Says:

    And how do you pronounce that? ;)

  4. shubhamgoenka Says:

    Brilliance at its best!! Last line was the cherry .. :D
    how long did it take you to research, and write the post?

  5. katehobson Says:

    i’m surprised i read as far as i did. i could just barely follow, but hey, you gave me fair warning and i decided to be stubborn about it. impressive!

  6. awax1217 Says:

    You would have been a decoder in the war. Now get a job and figure out the next terrorist move.

  7. Midwestern Plant Girl Says:

    Um, yeah. I came up with 42. Which is the answer to everything =-)
    Thanks for the math blast, I will have to take some rgobenhrioubgenhrib nevgjeruiuervneibunhefbpeo, after reading that.
    Congrats on gettin’ pressed!!

  8. theadventuresofbeka Says:

    I love this! It combines two of my favorite things – biological science and language analysis.

  9. Adrian Morgan Says:

    Thanks for the comments.

    I was totally baffled when I learned this blog post would be featured on Freshly Pressed, because after completing the draft I didn’t think it was worth publishing! It is Stan Carey’s encouragement you have to thank for the fact that I published it at all. (Not wanting to waste the work I put into it was a lesser factor.) See the comment section of his post.

    I wouldn’t expect this to appeal to a broad audience, but I guess it appeals to a mindset that likes to ask, “I wonder what happens if we do THIS?” (which was essentially the driving question of the post). My blog stats indicate that not many people are following the links, though…

    Sofia: Your guess is as good as mine. I definitely don’t plan to record it. :-)

    shubhamgoenka: Thanks — I did the bulk of the work in just one morning session, saving my notes in a word processor document, and then it took another session or two to tidy things up, add or fix a few things, and write it in blog form.

    Midwestern Plant Girl: You’re welcome, and yeah, it’s probably as good a way as any to find the Ultimate Answer (maybe even a hint of the Question).

    theadventuresofbeka: Thanks — you’re obviously bang in the middle of the target audience here. :-)

    I’ve deleted a few spammy comments, including one that would have been fine with a smiley. Ambivalent about the generic compliments, as it’s hard to tell if they’re all sincere. [Edit: On reflection, I will delete overly generic compliments, even though some probably aren’t spam.]

  10. Michelle Rene Says:

    Reblogged this on Write Here Write Now and commented:
    This was so amusing! I am a huge fan of reasoning skills and really got a kick out of this!

  11. bristlehound Says:

    The biggest word I know is discombobulated, which as it happens applies now. Whatever happened to 42 as the answer to all things, and does your very attractive, very long word chemically describe 42? Big questions and love them all. Thanks for your hard work, very enlightening. B

  12. fluidimagery Says:

    I didn’t understand this at all. (Special education reading teacher)…. but I read it and loved it!

  13. quirkywritingcorner Says:

    I was sort of there at the beginning, but got lost somewhere in the middle. I love this sort of nonsense, but let me take another seizure pill and I might be able to translate the gobbledygook for you. Thank you for the fun read.

  14. ashokbhatia Says:

    Reminds me of the movie ‘A Beautiful Mind’.
    Since you are so very scientifically inclined, here is a post you may like:

  15. gautam241997 Says:

    This is brilliant, science is beautiful in so many different ways, how did you come up with this idea?

  16. cabbagetalk Says:

    Amusing. I don’t consider myself to be a geek. But I am stubborn. I did find it interesting and understood a minimal amount. But interesting to read all the same.

  17. Adrian Morgan Says:

    Thanks all. This blog’s record for the number of views in a single day was broken at around 10:00pm, just two hours before midnight. I haven’t managed to come up with personal replies to everyone, but I hope those who missed out don’t mind. I’m tired now. :-)

    bristlehound: Maybe you’d find the number 42 by measuring the brainwaves of someone who is discombobulated and enlightened at the same time! Quite a rare sensation, I should think. :-)

    quirkywritingcorner: That translation could be most interesting. :-) Google Translate’s autodetect thinks it’s Welsh, obviously because of all those L’s and Y’s.

    gautam241997: Well, I read Stan’s post, and it got me wondering what I might find if I analysed the chemical name with whatever tools I could think of. Frequency analysis was an obvious place to start, and everything flowed from there.

    cabbagetalk: I don’t think any of it is difficult to understand, but some things you can only understand if you are curious enough to look them up. I’m thinking of the vocabulary generator settings in particular. Anyway, in my opinion being a geek is more about what you enjoy than what you understand.

  18. Jane Says:

    This was a very great read, interesting and perfectly written. (i agree, you should become a decoder lol)

    I know this blog will get a lot of hits, so please don’t be mad at me for trying to spread the word on my friend’s fundraiser. Please just take the time to visit the page, and if possible, please share. Thank you.


  19. Gry Ranfelt Says:

    Lol, there is bound to be some interesting inspiration for fantasy names here or there

  20. krstokely Says:

    I’m a geek and I love chemistry and language – this lost me, it was a little over my head but I loved the notion! Always trust your geeky quest to figure things out and share your results!

  21. Adrian Morgan Says:

    Gry: Yeah, the guy who wrote the vocabulary generator is one of the gurus of online conlanging — inventing your own language, like Tolkien and others have done — and the generator is meant to help people to create words for their languages. Which includes fantasy names, obviously. Here’s a link to his WordPress blog, if you’re curious.

    krstokely: Let me know if there’s anything I can clarify. (I don’t expect readers to understand the vocabulary generator settings; I just put them in so people can try it for themselves. You’re allowed to ask, though.)

    Administrative comment: To reiterate what I said upthread, I’m deleting comments that are too vague. It’s not that I don’t appreciate the sentiment, and I understand most of them are sincere, but I don’t like it when comment threads are full of people saying “Great post!” without saying anything more. Also, probably a few of them were spam, but there’s no foolproof way to tell who is sincere and who isn’t, so all I can do is apply the same rules to everyone. I hope everyone understands that, and that no-one takes it personally. I’m not used to getting so many comments on a post.

    (I might also trim some of the reblog notifications at some point, but I don’t feel like making a decision about that just yet.)

  22. Adrian Morgan Says:

    Jane: Thanks for reading and commenting — your reference to the “decoder” comment proves that you really did take the time to read it, so you’re not just spamming. As for the fundraiser, I don’t currently have a youcaring account, but I’ve tweeted the link so it might reach a few more people there.
