Saturday, August 18, 2007

Which Letter Occurs Most in English Words?

The other day my son asked me a question about a Chinese character, and I answered with another question for him to think about. At the time I had no clue what the answer was. The question goes like this: which letter of the English alphabet occurs most frequently in English words?

In the back of my mind I knew I could write a script to find out. My initial guess was the letter 'c' or the letter 's'. Want to guess too?

The answer, run against /usr/dict/words on Solaris 10 (25143 words, 181519 characters in total):

e 20079
a 16403
i 13954
r 13410
t 12778
o 12692
n 12055
s 10161
l 10023
c 8207
u 6465
m 5815
d 5758
p 5507
h 5172
g 4119
b 4108
y 3618
f 2657
w 1946
k 1922
v 1883
x 613
z 429
j 426
q 375

I tested only the words with at least 2 letters. While doing this, I realised that "split" in the Solaris versions of awk/nawk does not work with a null FS (field separator), so I had to use the "substr" function to break each word into individual characters. Below is the script:

#! /bin/sh


nawk '
/^[a-zA-Z][a-zA-Z]+$/ {
        word=tolower($0)

        # in Solaris, it does not support null as FS
        # n=split(word,a,"")

        n=length(word)
        for(i=1;i<=n;++i) {
                char=substr(word,i,1)
                ++stat[char]
        }
}
END {
        for(i in stat) {
                print i, stat[i]
        }
}' /usr/dict/words | sort -n -r -k 2
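To see what the pattern keeps and what it skips, the same counting loop can be run on a few made-up lines. This is only a small sketch: the function name count_letters and the sample words are invented for illustration, and plain awk is assumed (use nawk on Solaris).

```shell
#!/bin/sh
# Count letters in words read from stdin, using the same pattern and
# substr() loop as the script above. count_letters is an illustrative
# name; plain awk is assumed here (nawk on Solaris).
count_letters() {
    awk '
    /^[a-zA-Z][a-zA-Z]+$/ {
        word = tolower($0)
        n = length(word)
        for (i = 1; i <= n; ++i)
            ++stat[substr(word, i, 1)]
    }
    END {
        for (c in stat)
            print c, stat[c]
    }' | sort -k2,2nr -k1,1
}

# Made-up sample input: "a" (a single letter) and "don't" (contains an
# apostrophe) do not match the pattern, so only "Bee" and "bet" count.
printf '%s\n' a Bee bet "don't" | count_letters
```

Only "Bee" and "bet" pass the filter, so the counts come from six letters in total. The second sort key just keeps letters with equal counts in a stable alphabetical order.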

Now I know the letter 'e' occurs 20079 times in 25143 words, so what is its percentage of occurrence? Simple: just modify the above script to keep a running sum of all the letters and print the percentage at the end.

#! /bin/sh


nawk '
/^[a-zA-Z][a-zA-Z]+$/ {
        word=tolower($0)

        # in Solaris, it does not support null as FS
        # n=split(word,a,"")

        n=length(word)
        sum+=n
        for(i=1;i<=n;++i) {
                char=substr(word,i,1)
                ++stat[char]
        }
}
END {
        for(i in stat) {
                printf("%s %i %.2f%%\n",i,stat[i],100*stat[i]/sum)
        }
}' /usr/dict/words | sort -n -r -k 2

Result is:

e 20097 11.11%
a 16426 9.08%
i 13976 7.73%
r 13420 7.42%
t 12793 7.07%
o 12711 7.03%
n 12073 6.68%
s 10173 5.63%
l 10030 5.55%
c 8217 4.54%
u 6476 3.58%
m 5832 3.22%
d 5770 3.19%
p 5515 3.05%
h 5180 2.86%
g 4124 2.28%
b 4112 2.27%
y 3624 2.00%
f 2664 1.47%
w 1955 1.08%
k 1928 1.07%
v 1890 1.05%
x 619 0.34%
z 431 0.24%
j 429 0.24%
q 376 0.21%

The letter 'e' accounts for 11.11% of all the letters in these English words.
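If the counts have already been produced, the percentages can also be added in a separate pass instead of editing the counting script. A rough sketch, assuming input lines of the form "letter count" as printed by the first script; the function name add_percent is made up for illustration, and plain awk is assumed (nawk on Solaris).

```shell
#!/bin/sh
# Read "letter count" lines from stdin and append each letter's share
# of the total. add_percent is an illustrative name, not part of the
# original script.
add_percent() {
    awk '
    # Remember every line in input order and keep a running total.
    { letter[NR] = $1; count[NR] = $2; sum += $2 }
    END {
        for (i = 1; i <= NR; ++i)
            printf("%s %d %.2f%%\n", letter[i], count[i], 100 * count[i] / sum)
    }'
}

# Made-up sample counts, six letters in total.
printf '%s\n' 'e 3' 'b 2' 't 1' | add_percent
```

The same arithmetic applies to the result table above: its counts add up to 180841 letters, and 100 * 20097 / 180841 rounds to 11.11 for 'e'.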

Just rebooted to try the same script (changing nawk to awk) on my Fedora Core 5, with 479625 words and 4471395 characters (/usr/share/dict/words). The answer is still the letter 'e'. Surprisingly, the letter 'q' occurs less often than 'x'.

e 421846 10.75%
i 346205 8.82%
a 341938 8.71%
n 283216 7.22%
s 279915 7.13%
o 278915 7.11%
r 276519 7.05%
t 252848 6.44%
l 222301 5.67%
c 168539 4.30%
u 145006 3.70%
d 126998 3.24%
p 122902 3.13%
m 119398 3.04%
h 106289 2.71%
g 92384 2.35%
y 77498 1.98%
b 74046 1.89%
f 44212 1.13%
v 38780 0.99%
k 33997 0.87%
w 27340 0.70%
z 17233 0.44%
x 11464 0.29%
j 7287 0.19%
q 6541 0.17%
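Since only the awk name and the word list path differed between the two machines, both could be probed at run time so that one script serves both. A minimal sketch, assuming a POSIX shell (the old Solaris /bin/sh may not accept the $(...) form used here); the helper names first_command and first_file are invented for illustration.

```shell
#!/bin/sh
# Print the first argument that names a command found on PATH.
first_command() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

# Print the first argument that names a readable file.
first_file() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Prefer nawk (Solaris) over awk, and try both word list locations
# mentioned in this post.
if AWK=$(first_command nawk awk) &&
   DICT=$(first_file /usr/dict/words /usr/share/dict/words); then
    echo "using $AWK on $DICT"
else
    echo "no awk or word list found" >&2
fi
```

With AWK and DICT set this way, the counting script can call "$AWK" on "$DICT" instead of hard-coding either name.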


1 Comment:

Blogger Ah Choo said...

It is not surprising that 'E' is the most common letter in English words, because it is a vowel.

I believe the most common letters will remain unchanged. But what about the less common letters, and the difference between the US and UK dictionaries?

Change to another language and it will be another letter. This was discussed in a book about the maths and history of cryptography.

10:23 AM  
