I've done some investigating into the various names used by different cultures in and around Prax. The intention was that players should get an idea of a character's culture from his or her name. There are two main ways of doing this: the syllable-merging approach and the n-gram model approach. I've written a little program to do both.
Michael Harvey developed a program that created random words and names from a set of syllables (the original zip file is available). I translated the program to Delphi to give it a nicer user interface: hopefully it's intuitive to use! [ Zipped executable (runs under Win3.1 and later) | Source | How it works (Michael's original document) ]
Here are some example name element files (the raw outputs given need judgment before use):
- Praxian (use "Enhanced noun creation") [ sample output ]
- Sartarite male names and epithets, female names and epithets (based on Old English and Viking epithets) [ sample male output | sample female output ]
- Lunar male and female names (based on Latin: use male names as surnames) [ sample male output | sample female output ]
- Dara Happan (Sumerian / Assyrian) [ sample output ]
- Pavic (based on Gaelic) [ sample output ]
- Yelmalion male and female names (based on Basque) and family names [ sample male output | sample female output ]
- Oasis people (based on Inca) [ sample output ]
- Agamor (vaguely African) [ sample output ]
- Baboon [ sample output ]
- Boat People (based on hobbit names!) [ sample output ]
N-gram model approach
The other way to do this is to build a 'model' of the language. The idea is that the preceding few letters in a word determine what the next letter could be. Let's say we're looking at bigrams, sequences of two letters (n = 2). If we take all the words in the language sample we've got, we can list all the bigrams that occur in all the words. We can also list, for each bigram, the letters that come after it. We also record 'end of word' as being a possible successor letter for a bigram. We end up with a list of all the bigrams in the language sample, how frequent they are, and what letters follow. We also keep a list of the initial bigrams, so we know how words are allowed to start. This is our model of the language.
To generate new words, we pick a random starting bigram from the list of initial bigrams. This gives us the first two letters of our word. We then look up that bigram in our main list of bigrams, which gives us a list of letters that can follow this bigram. We pick one of those at random, and that gives us the third letter of our word. We then take the bigram of the second and third letters and look it up in the list of bigrams; from this, we generate the fourth letter. We then use the third and fourth letter to generate the fifth, and so on until we choose an 'end of word' marker.
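The build-and-generate cycle described above can be sketched in Python. The sample word list and function names here are my own, purely illustrative (ngramords itself is a Tcl script):

```python
import random
from collections import defaultdict

def build_model(words, n=2):
    """Build an n-gram model: for each n-gram, the letters that may
    follow it. None stands for the 'end of word' marker."""
    starts = []                    # initial n-grams (repeats kept, so frequency is preserved)
    followers = defaultdict(list)  # n-gram -> list of possible next letters
    for word in words:
        if len(word) < n:
            continue
        starts.append(word[:n])
        for i in range(n, len(word)):
            followers[word[i - n:i]].append(word[i])
        followers[word[-n:]].append(None)  # record end-of-word as a successor
    return starts, followers

def generate(starts, followers, n=2):
    """Generate one word by walking the model from a random initial n-gram."""
    word = random.choice(starts)
    while True:
        nxt = random.choice(followers[word[-n:]])
        if nxt is None:            # we chose the end-of-word marker
            return word
        word += nxt

# A tiny, made-up sample; a real run would use a few hundred words.
sample = ["karsa", "korath", "maran", "tarsh", "karath"]
starts, followers = build_model(sample)
print(generate(starts, followers))
```

Because the successor list keeps duplicates, common bigrams and common successor letters are picked more often, so the generated words follow the sample's letter frequencies.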
Using larger values of n means that the generated words conform more closely to the words in the language sample, but there is a tendency simply to recycle the existing words if the sample is small. I find that trigrams (n = 3) work well when there are a few hundred words in the sample.
ngramords (documentation) is a Tcl script that creates an n-gram model of a language and then generates new random words based on that model. It's released under the GNU General Public License (GPL), meaning that it is free and open-source software.
I've mainly used this program to generate words for Tekumel-based games (mainly Tsolyani ones). The word list I use is based on one posted to the Tekumel mailing list and on name lists at the Tekumel website. Here are some sample words. ngramords differs from lc in two respects:
- First, ngramords treats transliterated ligatures as single characters. Some languages, like Greek, have single letters (like theta and phi) that are transliterated into English as more than one letter ('th' and 'ph'). If these 'ligatures' are treated as separate letters, the language model loses information as the component parts of the ligature take up slots in the n-gram. If ngramords is told about these ligatures, it will treat them as single tokens.
- Second, if lc generates a word that is too long, it just returns what it's generated so far. This may not be a valid word if there are rules in the language for what the ends of words are like (for instance, all Japanese words end in 'n' or a vowel). In contrast, if ngramords generates a word that is too long, it abandons that word and starts again.
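As an illustration, both behaviours can be sketched like this. The ligature list and function names below are hypothetical, not taken from ngramords:

```python
# Hypothetical ligature list; the real set depends on the source language.
LIGATURES = ["th", "ph", "ch"]

def tokenise(word, ligatures=LIGATURES):
    """Split a word into tokens, treating each transliterated ligature
    as a single token so it occupies only one slot in an n-gram."""
    tokens, i = [], 0
    while i < len(word):
        for lig in ligatures:
            if word.startswith(lig, i):
                tokens.append(lig)
                i += len(lig)
                break
        else:                      # no ligature matched at this position
            tokens.append(word[i])
            i += 1
    return tokens

def generate_with_limit(gen, max_len=12, max_tries=100):
    """Abandon over-long words and start afresh rather than truncating."""
    for _ in range(max_tries):
        word = gen()
        if len(word) <= max_len:
            return word
    return None                    # gave up after max_tries attempts

print(tokenise("thorath"))         # ['th', 'o', 'r', 'a', 'th']
```

Tokenising before building the model means 'th' counts as one letter in each n-gram, and the restart-on-overlong policy guarantees that every word returned was ended by the model's own end-of-word marker.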