Version 1.0, 8 August 2005
Written by and copyright Neil Smith
Available for download and use under the GNU General Public License (GPL).
ngramwords is a simple Tcl program that generates random words based on a sample text.
The program reads a set of words and generates an n-gram model of the language. It then uses that model to generate new random words.
The idea is that the preceding few letters in a word determine what the next letter could be. Let's say we're looking at bigrams, sequences of two letters (n = 2). If we take all the words in the language sample we've got, we can list all the bigrams that occur in all the words. We can also list, for each bigram, the letters that come after it. We also record 'end of word' as being a possible successor letter for a bigram. We end up with a list of all the bigrams in the language sample, how frequent they are, and what letters follow them. We also keep a list of the initial bigrams, so we know how words are allowed to start. This is our model of the language.
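As an illustration of the model described above, here is a minimal sketch in Python (not the Tcl code that ngramwords itself uses). The names build_model and END, and the choice to skip words shorter than n, are assumptions made for this example only.

```python
from collections import defaultdict

END = None  # hypothetical 'end of word' marker, chosen for this sketch


def build_model(words, n=2):
    """Return (starts, successors) for a list of words.

    starts holds every initial n-gram, with repeats, so frequent
    openings are more likely to be picked later.
    successors maps each n-gram to the letters (or END) seen after it,
    again with repeats so that frequency is preserved.
    """
    starts = []
    successors = defaultdict(list)
    for word in words:
        if len(word) < n:
            continue  # assumption: words shorter than n are ignored
        starts.append(word[:n])
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            nxt = word[i + n] if i + n < len(word) else END
            successors[gram].append(nxt)
    return starts, dict(successors)


starts, successors = build_model(["the", "then", "this"], n=2)
print(successors["th"])  # ['e', 'e', 'i']
```

Keeping repeated entries in the lists is one simple way of recording how frequent each successor is: picking uniformly from the list then favours the common letters.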
To generate new words, we pick a random starting bigram from the list of initial bigrams. This gives us the first two letters of our word. We then look up that bigram in our main list of bigrams, which gives us a list of letters that can follow this bigram. We pick one of those at random, and that gives us the third letter of our word. We then take the bigram of the second and third letters and look it up in the list of bigrams; from this, we generate the fourth letter. We then use the third and fourth letter to generate the fifth, and so on until we choose an 'end of word' marker.
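The generation loop can be sketched in Python as follows. This is only an illustration of the procedure described above, not the program's own code: generate_word, END, the tiny hand-built model, and the way the length limit simply cuts the word off are all assumptions.

```python
import random

END = None  # hypothetical 'end of word' marker, chosen for this sketch


def generate_word(starts, successors, max_len=15, rng=random):
    """Generate one word from initial n-grams and successor lists."""
    gram = rng.choice(starts)       # random starting n-gram
    word = gram
    n = len(gram)
    while len(word) < max_len:      # assumption: stop at the length limit
        nxt = rng.choice(successors[gram])
        if nxt is END:
            break                   # 'end of word' was chosen
        word += nxt
        gram = word[-n:]            # slide along to the latest n letters
    return word


# Tiny hand-built bigram model, for illustration only.
starts = ["th", "th", "sh"]
successors = {
    "th": ["e", "e", "i"],
    "sh": ["e"],
    "he": [END, "n"],
    "hi": ["s", "n"],
    "en": [END],
    "is": [END],
    "in": [END],
}
print(generate_word(starts, successors))
```

With this model the only possible results are the, then, this, thin, she and shen. Of these, thin, she and shen use only transitions found in the other words, which is how new words arise from a sample.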
Using larger values for n means that the generated words conform more closely to the words in the language sample, but there is a tendency to recycle the existing words if the sample is small. I find that using trigrams (n = 3) works well when there are a few hundred words.
Installation and Use
Copy the ngramwords file to your computer, preferably on your path. Make the file executable. The program is command-line only, so you'll need to run a terminal to use it.
To invoke the program, call it with:
ngramwords [options] input-file
where input-file is a text file containing a list of words to build the language model from. Words must be separated by whitespace. If any words contain any characters outside the range [A-Z]|[a-z], including accented characters, those characters must be listed in the ligatures line. If no input file name is given, ngramwords will read words from standard input. Generated words will be sent to standard output.
Options
ngramwords accepts several command-line options. They are:
- -g n : Uses n as length of n-grams. Default 3
- -n n : Generates n words. Default 20
- -l n : Words are at most n letters long. Default 15
- -i f : Reads input from file f. Default is standard input
- -o f : Writes output to file f. Default is standard output
- -s : Show dictionary contents
- -c : Case sensitive (default is to convert all letters to lowercase)
- -v : Verbose mode
Ligatures and accented characters
Many languages, when transliterated into the Latin alphabet, use more than one Latin letter for each native letter. To accommodate this, this program allows each element of the n-gram to be a multi-letter 'token.' This is done by including in the input file a line such as:
ligatures= th ch ú
This must be on a separate line in the input file. It means that 'ú' will be recognized as a valid character in this language, and that 'th' and 'ch' will each be treated as a single letter. See the sample input file for an example.
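To show what treating 'th' and 'ch' as single letters could look like, here is a small Python sketch that splits a word into tokens. It is an assumption about one reasonable approach (try the longest ligature first), not a description of how ngramwords does it, and tokenize is a name made up for this example.

```python
def tokenize(word, ligatures):
    """Split a word into tokens, preferring the longest matching ligature."""
    # Try longer ligatures first, so a three-letter token would win
    # over a two-letter one that starts at the same position.
    ordered = sorted(ligatures, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for lig in ordered:
            if word.startswith(lig, i):
                tokens.append(lig)
                i += len(lig)
                break
        else:
            # No ligature matches here: take a single character.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("thúch", ["th", "ch", "ú"]))  # ['th', 'ú', 'ch']
```

With tokens in place of single letters, an n-gram becomes a sequence of n tokens, and the rest of the model works as before.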
Lines in the input file that start with a # character are treated as comments and ignored by the program.
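Putting the input file rules together, a reader might look something like the Python sketch below. It is illustrative only: parse_sample is a hypothetical name, the sample text is invented, and details such as tolerating leading whitespace before # or allowing more than one ligatures line are assumptions rather than documented behaviour.

```python
def parse_sample(text):
    """Return (words, ligatures) from the contents of a sample file."""
    words, ligatures = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.startswith("ligatures="):
            # Everything after the '=' is a whitespace-separated token list.
            ligatures.extend(stripped[len("ligatures="):].split())
            continue
        words.extend(stripped.split())  # words are separated by whitespace
    return words, ligatures


# Invented sample text, for illustration only.
sample = """# A made-up sample
ligatures= th ch ú
thorn chalk
súl
"""
print(parse_sample(sample))
# (['thorn', 'chalk', 'súl'], ['th', 'ch', 'ú'])
```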