I recently started writing a game where you build words using letter tiles. To create the game, I needed to know the frequency of letters across regular words in the English language, so I could present a useful set of letter tiles. Letter frequency is discussed in various places, including on Wikipedia, but I wanted to calculate the letter frequency myself.
Linux provides a list of words in the /usr/share/dict/words
file, so I already have a list of likely words to use. The words
file contains lots of words that I want, but a few that I don't. I wanted a list of all words that weren't compound words (no hyphens or spaces) or proper nouns (no uppercase letters). To get that list, I can run the grep
command to pull out only the lines that consist solely of lowercase letters:
$ grep '^[a-z]*$' /usr/share/dict/words
This regular expression asks grep
to match patterns that are only lowercase letters. The characters ^
and $
in the pattern represent the start and end of the line, respectively. The [a-z]
grouping will match only the lowercase letters a to z.
Here's a quick sample of the output:
$ grep '^[a-z]*$' /usr/share/dict/words | head
a
aa
aaa
aah
aahed
aahing
aahs
aal
aalii
aaliis
And yes, those are all valid words. For example, "aahed" is the past tense exclamation of "aah," as in relaxation. And an "aalii" is a bushy tropical shrub.
Now I just need to write a gawk
script to do the work of counting the letters in each word, and then print the relative frequency of each letter it finds.
Counting letters
One way to count letters in gawk
is to iterate through each character in each input line and count occurrences of each letter a to z. The substr
function will return a substring of a given length, such as a single letter, from a larger string. For example, this code example will evaluate each character c
from the input:
{
len = length($0); for (i = 1; i <= len; i++) {
c = substr($0, i, 1);
}
}
If I start with a global string LETTERS
that contains the alphabet, I can use the index
function to find the location of a single letter in the alphabet. I'll expand the gawk
code example to evaluate only the letters a to z in the input:
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
{
len = length($0); for (i = 1; i <= len; i++) {
c = substr($0, i, 1);
ltr = index(LETTERS, c);
}
}
Note that the index function returns the first occurrence of the letter from the LETTERS
string, starting with 1 at the first letter, or zero if not found. If I have an array that is 26 elements long, I can use the array to count the occurrences of each letter. I'll add this to my code example to increment (using ++
) the count for each letter as it appears in the input:
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
{
len = length($0); for (i = 1; i <= len; i++) {
c = substr($0, i, 1);
ltr = index(LETTERS, c);
if (ltr > 0) {
++count[ltr];
}
}
}
Printing relative frequency
After the gawk
script counts all the letters, I want to print the frequency of each letter it finds. I am not interested in the total number of each letter from the input, but rather the relative frequency of each letter. The relative frequency scales the counts so that the letter with the fewest occurrences (such as the letter q) is set to 1, and other letters are relative to that.
I'll start with the count for the letter a, then compare that value to the counts for each of the other letters b to z:
END {
min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
if (count[ltr] < min) {
min = count[ltr];
}
}
}
At the end of that loop, the variable min
contains the minimum count for any letter. I can use that to provide a scale for the counts to print the relative frequency of each letter. For example, if the letter with the lowest occurrence is q, then min
will be equal to the q count.
Then I loop through each letter and print it with its relative frequency. I divide each count by min
to print the relative frequency, which means the letter with the lowest count will be printed with a relative frequency of 1. If another letter appears twice as often as the lowest count, that letter will have a relative frequency of 2. I'm only interested in integer values here, so 2.1 and 2.9 are the same as 2 for my purposes:
END {
min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
if (count[ltr] < min) {
min = count[ltr];
}
}
for (ltr = 1; ltr <= 26; ltr++) {
print substr(LETTERS, ltr, 1), int(count[ltr] / min);
}
}
Putting it all together
Now I have a gawk
script that can count the relative frequency of letters in its input:
#!/usr/bin/gawk -f
# only count a-z, ignore A-Z and any other characters
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
{
len = length($0); for (i = 1; i <= len; i++) {
c = substr($0, i, 1);
ltr = index(LETTERS, c);
if (ltr > 0) {
++count[ltr];
}
}
}
# print relative frequency of each letter
END {
min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
if (count[ltr] < min) {
min = count[ltr];
}
}
for (ltr = 1; ltr <= 26; ltr++) {
print substr(LETTERS, ltr, 1), int(count[ltr] / min);
}
}
I'll save that to a file called letter-freq.awk
so that I can use it more easily from the command line.
If you prefer, you can also use chmod +x
to make the file executable on its own. The #!/usr/bin/gawk -f
on the first line means Linux will run it as a script using the /usr/bin/gawk
program. And because the gawk
command line uses -f
to indicate which file it should use as a script, you need that hanging -f
so that executing letter-freq.awk
at the shell will be properly interpreted as running /usr/bin/gawk -f letter-freq.awk
instead.
I can test the script with a few simple inputs. For example, if I feed the alphabet into my gawk
script, each letter should have a relative frequency of 1:
$ echo abcdefghijklmnopqrstuvwxyz | gawk -f letter-freq.awk
a 1
b 1
c 1
d 1
e 1
f 1
g 1
h 1
i 1
j 1
k 1
l 1
m 1
n 1
o 1
p 1
q 1
r 1
s 1
t 1
u 1
v 1
w 1
x 1
y 1
z 1
Repeating that example but adding an extra instance of the letter e will print the letter e with a relative frequency of 2 and every other letter as 1:
$ echo abcdeefghijklmnopqrstuvwxyz | gawk -f letter-freq.awk
a 1
b 1
c 1
d 1
e 2
f 1
g 1
h 1
i 1
j 1
k 1
l 1
m 1
n 1
o 1
p 1
q 1
r 1
s 1
t 1
u 1
v 1
w 1
x 1
y 1
z 1
And now I can take the big step! I'll use the grep
command with the /usr/share/dict/words
file and identify the letter frequency for all words spelled entirely with lowercase letters:
$ grep '^[a-z]*$' /usr/share/dict/words | gawk -f letter-freq.awk
a 53
b 12
c 28
d 21
e 72
f 7
g 15
h 17
i 58
j 1
k 5
l 36
m 19
n 47
o 47
p 21
q 1
r 46
s 48
t 44
u 25
v 6
w 4
x 1
y 13
z 2
Of all the lowercase words in the /usr/share/dict/words
file, the letters j, q, and x occur least frequently. The letter z is also pretty rare. Not surprisingly, the letter e is the most frequently used.
Comments are closed.