Which Letter Occurs Most in English Words?
At the back of my mind I knew I could write a script to find out. My initial guess was the letter 'c' or the letter 's'. Wanna guess too?
The answer, run against /usr/dict/words on Solaris 10 (25143 words, 181519 characters in total):
e 20079
a 16403
i 13954
r 13410
t 12778
o 12692
n 12055
s 10161
l 10023
c 8207
u 6465
m 5815
d 5758
p 5507
h 5172
g 4119
b 4108
y 3618
f 2657
w 1946
k 1922
v 1883
x 613
z 429
j 426
q 375
I tested it only against words made up of at least two letters. While doing this, I realised that the Solaris versions of awk/nawk cannot use "split" with a null FS (field separator), so I had to use the "substr" function to break each word into individual characters. Below is the script:
#! /bin/sh
nawk '
/^[a-zA-Z][a-zA-Z]+$/ {
    word=tolower($0)
    # in Solaris, it does not support null as FS
    # n=split(word,a,"")
    n=length(word)
    for(i=1;i<=n;++i) {
        char=substr(word,i,1)
        ++stat[char]
    }
}
END {
    for(i in stat) {
        print i, stat[i]
    }
}' /usr/dict/words | sort -n -r -k 2
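The "substr" trick can also be tried in isolation on a single word. This is only a minimal sketch: split_chars is a made-up helper name, and plain awk is assumed to be a POSIX awk (on Solaris that would be nawk).

```shell
#!/bin/sh
# split_chars: hypothetical helper, not part of the original script.
# Prints the characters of its first argument separated by single spaces,
# using substr() instead of split() with a null separator.
split_chars() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        out = ""
        sep = ""
        for (i = 1; i <= n; ++i) {
            out = out sep substr($0, i, 1)
            sep = " "
        }
        print out
    }'
}

split_chars hello
```

Running it prints "h e l l o", one character per loop iteration.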
Now I know the letter 'e' occurs 20079 times in 25143 words, so what is the percentage of occurrence? Simple: just modify the above script to keep a running total of all letters counted and print each letter's percentage at the end.
#! /bin/sh
nawk '
/^[a-zA-Z][a-zA-Z]+$/ {
    word=tolower($0)
    # in Solaris, it does not support null as FS
    # n=split(word,a,"")
    n=length(word)
    # running total of all letters counted
    sum+=n
    for(i=1;i<=n;++i) {
        char=substr(word,i,1)
        ++stat[char]
    }
}
END {
    for(i in stat) {
        # %% prints a literal percent sign
        printf("%s %i %.2f%%\n",i,stat[i],100*stat[i]/sum)
    }
}' /usr/dict/words | sort -n -r -k 2
Result is:
e 20097 11.11%
a 16426 9.08%
i 13976 7.73%
r 13420 7.42%
t 12793 7.07%
o 12711 7.03%
n 12073 6.68%
s 10173 5.63%
l 10030 5.55%
c 8217 4.54%
u 6476 3.58%
m 5832 3.22%
d 5770 3.19%
p 5515 3.05%
h 5180 2.86%
g 4124 2.28%
b 4112 2.27%
y 3624 2.00%
f 2664 1.47%
w 1955 1.08%
k 1928 1.07%
v 1890 1.05%
x 619 0.34%
z 431 0.24%
j 429 0.24%
q 376 0.21%
The letter 'e' accounts for 11.11% of all the letters in these English words.
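A quick way to sanity-check the percentage logic is to run it on a tiny list that can be counted by hand. The sketch below is not the exact script from this post, just a minimal variant under a few assumptions: letter_freq is a made-up name, it reads words from standard input instead of a dictionary file, it uses plain awk (assumed to be a POSIX awk; on Solaris that would be nawk), and it adds a second sort key so that ties come out in a stable order.

```shell
#!/bin/sh
# letter_freq: hypothetical helper, reads one word per line on stdin.
# Counts letters in purely alphabetic words of at least two letters and
# prints "letter count percent%", most frequent letter first.
letter_freq() {
    awk '
    /^[a-zA-Z][a-zA-Z]+$/ {
        word = tolower($0)
        n = length(word)
        sum += n                       # total letters counted so far
        for (i = 1; i <= n; ++i)
            ++stat[substr(word, i, 1)]
    }
    END {
        for (c in stat)
            printf("%s %d %.2f%%\n", c, stat[c], 100 * stat[c] / sum)
    }' | sort -k2,2nr -k1,1            # by count descending, then by letter
}

# "a" is too short and "x-ray" is not purely alphabetic, so both are skipped.
printf 'bee\nsea\na\nx-ray\n' | letter_freq
```

Only "bee" and "sea" are counted, six letters in total, so the first output line is "e 3 50.00%" and the letters 'a', 'b' and 's' each show up as 1 and 16.67%.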
I just rebooted to try the same script (changing nawk to awk) on my Fedora Core 5 box, whose /usr/share/dict/words has 479625 words and 4471395 characters. The answer is still the letter 'e'. Surprisingly, the letter 'q' occurs less often than 'x'.
e 421846 10.75%
i 346205 8.82%
a 341938 8.71%
n 283216 7.22%
s 279915 7.13%
o 278915 7.11%
r 276519 7.05%
t 252848 6.44%
l 222301 5.67%
c 168539 4.30%
u 145006 3.70%
d 126998 3.24%
p 122902 3.13%
m 119398 3.04%
h 106289 2.71%
g 92384 2.35%
y 77498 1.98%
b 74046 1.89%
f 44212 1.13%
v 38780 0.99%
k 33997 0.87%
w 27340 0.70%
z 17233 0.44%
x 11464 0.29%
j 7287 0.19%
q 6541 0.17%
Comments:
It is not surprising that 'E' is the most common letter in English words, because it is a vowel.
I believe the most frequent letters will remain unchanged. But what about the less frequent ones, and the difference between the US and UK dictionaries?
---------------------------
Change to another language and it will be another letter. This was discussed in a book about the maths and history of cryptography.