Find common bigrams
A bigram is a pair of adjacent letters, such as "th" or "he". When designing a keyboard layout, it’s common to optimize for comfort and speed by analyzing which bigrams occur most often. Here’s a simple shell script to do a quick-and-dirty bigram analysis.
Start with a corpus
Download Shai’s corpus for Colemak:
cd ~/Downloads
curl -fsSLo corpus.txt.xz https://colemak.com/pub/corpus/iweb-corpus-samples-cleaned.txt.xz
Extract the .txt file:
unxz corpus.txt.xz
Split into individual words
Split the corpus into individual words, one per line, all in lowercase:
tr '[:upper:]' '[:lower:]' < corpus.txt | tr '[:space:]' '\n' > corpus_word_list.txt
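To see what that pipeline does without waiting on the full corpus, here is a small sketch run against a made-up sample string (the `sample` and `words` variables are just for illustration). It also adds `-s` to the second `tr`, an optional tweak that squeezes runs of whitespace so no blank lines end up in the output:

```shell
# Hypothetical sample text standing in for corpus.txt
sample='The  quick Brown
fox'

# Lowercase everything, then turn each run of whitespace into one newline
words=$(printf '%s\n' "$sample" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n')

printf '%s\n' "$words"
# the
# quick
# brown
# fox
```

Blank lines are harmless for the bigram count, so `-s` only makes the word list a bit smaller. Note that punctuation stays attached to the words either way.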
Count bigram frequency
Run this awk script to count every bigram in the word list and sort the results from most to least frequent:
awk '{
    # Slide a two-character window across each word
    for (i = 1; i < length($0); i++) {
        pair = substr($0, i, 2)
        pairs[pair]++
    }
}
END {
    # Print each bigram with its total count
    for (pair in pairs) {
        printf "%s: %d\n", pair, pairs[pair]
    }
}' corpus_word_list.txt | sort -k2,2nr > results.txt
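A quick way to convince yourself the counting logic is right is to run the same idea on a tiny word list where you can count by hand. This is only a sanity-check sketch; the three words and the `counts` variable are made up for the example:

```shell
# Three hypothetical words: "the" gives th+he, "then" gives th+he+en, "he" gives he
counts=$(printf '%s\n' the then he | awk '{
    # Same sliding window as the main script
    for (i = 1; i < length($0); i++) {
        pairs[substr($0, i, 2)]++
    }
}
END {
    for (pair in pairs) {
        printf "%s: %d\n", pair, pairs[pair]
    }
}' | sort -k2,2nr)

printf '%s\n' "$counts"
# he: 3
# th: 2
# en: 1
```

Bigrams never cross word boundaries here, because each word sits on its own line and the window stops at the end of the line.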
Finally, show the ten most common bigrams:
$ head -n 10 results.txt
th: 10712957
he: 8729312
in: 8166065
an: 6560562
er: 6377359
re: 6028056
on: 5209590
at: 4541673
or: 4383020
nd: 4375566
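Raw counts depend on the size of the corpus, so it can be handy to turn them into shares of the total. Here is a minimal sketch of that, assuming input in the same `pair: count` format as results.txt; the mini `results` data below is made up, and the `shares` variable is just for illustration:

```shell
# Hypothetical mini results in the same "pair: count" format as results.txt
results='he: 3
th: 2
en: 1'

# Remember every row while summing the counts, then print each share at the end
shares=$(printf '%s\n' "$results" | awk '
{ pair[NR] = $1; count[NR] = $2; total += $2 }
END {
    for (i = 1; i <= NR; i++) {
        printf "%s %.1f%%\n", pair[i], 100 * count[i] / total
    }
}')

printf '%s\n' "$shares"
# he: 50.0%
# th: 33.3%
# en: 16.7%
```

To try it on the real data, feed it `results.txt` instead of the sample. The percentages are then relative to every bigram counted, including any that contain punctuation.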