To run the learner, you need two text files: a file containing your learning data, and a feature file defining the segments in the initial state of the learner. Both are plain text files in Unicode. You do not have to use the IPA, but we recommend it: it will make your plots more readable to other people.
You can set your own values for the inseparability threshold (default 1.0) and the alpha level for the Fisher's exact test (default 0.05). We recommend starting with a threshold of 1.0, since it has produced reasonable results in all the cases we have studied.
Learning data file
The learner assumes that your data file has one word per line (we recommend Unix line breaks). Segments are separated from each other by spaces. You can use combinations of letters (digraphs or longer) to represent segments.
For example, if you want to test the English sequences [ts] and [tʃ], you would give the learner a data file where "pizza" and "chimp" are transcribed as follows:
p i t s ə
t ʃ ɪ m p
If all goes well, the learner will produce a learning data file where the unifiable sequences are rewritten as single segments, and true clusters are left as clusters:
p i t s ə
tʃ ɪ m p
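The one-word-per-line, space-separated format described above can be read with a few lines of Python. This is only an illustrative sketch, not the learner's own code: the function name `read_learning_data` is hypothetical, and the sample text simply reuses the two transcriptions shown above.

```python
def read_learning_data(text):
    """Turn learning data into a list of words, each a list of segments.

    Assumes one word per line and segments separated by whitespace.
    Blank lines are skipped.
    """
    words = []
    for line in text.splitlines():
        segments = line.split()
        if segments:
            words.append(segments)
    return words


# Input with [t ʃ] as two segments, and output with [tʃ] as one segment.
before = read_learning_data("p i t s ə\nt ʃ ɪ m p\n")
after = read_learning_data("p i t s ə\ntʃ ɪ m p\n")

print(before[1])  # ['t', 'ʃ', 'ɪ', 'm', 'p']
print(after[1])   # ['tʃ', 'ɪ', 'm', 'p']
```

Because segments are delimited only by spaces, a multi-character symbol such as "tʃ" is treated as a single segment as soon as the space inside it is removed.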
There are many examples of learning data files on the Demo Simulations page. There is no minimum or maximum limit on the number of words or on the length of the words themselves. However, because the Complex Segment Learner is a statistical learner, your results will be more interpretable if your data set is reasonably large.
What counts as "large"? The relevant number is segmental bigrams, not words. In one of the Russian simulations, the learning data file is a single line, created by transcribing a corpus of several million words and removing the word boundaries. We also got reasonable-looking results from smaller corpora of just a few hundred words in languages such as Quechua and Navajo. But a few hundred is on the small side. Aim for a data set of at least a few thousand words of typical length (6-20 characters).
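To get a rough sense of how large a data set is in the relevant unit, you can count its segmental bigrams. The sketch below is an assumption-laden illustration: it counts only word-internal bigrams (it adds no word-boundary symbols), which may or may not match how the learner itself counts, and the name `count_bigrams` is hypothetical.

```python
from collections import Counter


def count_bigrams(words):
    """Count adjacent segment pairs within each word.

    Assumption: bigrams are word-internal only; no boundary symbols are added.
    """
    counts = Counter()
    for segments in words:
        counts.update(zip(segments, segments[1:]))
    return counts


data = "p i t s ə\nt ʃ ɪ m p"
words = [line.split() for line in data.splitlines()]
bigrams = count_bigrams(words)

total_tokens = sum(bigrams.values())  # bigram tokens in the whole file
total_types = len(bigrams)            # distinct bigrams
print(total_tokens, total_types)
```

A word with n segments contributes n - 1 bigrams, so a single very long line (as in the Russian simulation) can supply as many bigrams as a file with many short lines.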
Feature file format
The feature file follows the format of the UCLA Phonotactic Learner. It must be tab-separated. The first row is feature names, and the subsequent rows have the segments and their values, which must be one of +, -, 0. Feature values cannot be blank. There are many example feature files for you to look at on the Demo Simulations page.
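If you want to check a feature file before running the learner, the requirements above (tab-separated, a first row of feature names, and every value one of +, -, 0) can be tested with a short script. This is a minimal sketch, not part of the learner: `check_feature_table` is a hypothetical helper, the feature names in the samples are made up, and it assumes the header row may begin with an empty cell above the segment column.

```python
ALLOWED = {"+", "-", "0"}


def check_feature_table(text):
    """Return a list of problems found; an empty list means none were found."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ["file is empty"]

    features = rows[0]
    if features[0] == "":  # assumption: empty cell above the segment column
        features = features[1:]

    problems = []
    for row_number, cells in enumerate(rows[1:], start=2):
        segment, values = cells[0], cells[1:]
        if len(values) != len(features):
            problems.append(
                f"row {row_number} ({segment}): expected "
                f"{len(features)} values, found {len(values)}"
            )
        for name, value in zip(features, values):
            if value not in ALLOWED:
                problems.append(
                    f"row {row_number} ({segment}): [{name}] is {value!r}"
                )
    return problems


good = "\tsyllabic\tvoice\np\t-\t-\na\t+\t+\n"
bad = "\tsyllabic\tvoice\np\t-\t\na\t+\tx\n"  # one blank value, one invalid value

print(check_feature_table(good))
for problem in check_feature_table(bad):
    print(problem)
```

Row numbers here count non-blank rows, with the header as row 1, so they can differ from editor line numbers if the file contains blank lines.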
If you run into repeated errors with your own feature file, try using this generic feature file with most of the IPA symbols uniquely identified.
Naming the features
You can name most of your features anything you like, but there is one requirement:
| Mismatching features | Value kept | Example | Language |
|---|---|---|---|
| [±lab(ial)] | [+labial] | [ɡb], [kp], [ɸʃ] | Ngbaka, Tswana |
| [±cor(onal)] | [+coronal] | [ɸʃ], [tʃk] | Tswana, Shona |
Types of features the learner can handle
You can use either binary features (+, -) or privative features (+/- vs. 0, or + vs. 0). The Complex Segment Learner benefits from the expressivity of binary features, because they allow it to enrich the segmental inventory with fewer features. The downside is that binary features increase the number of natural classes in your feature system, which can cause learners such as the UCLAPL to break.
The learner will issue a warning if two segments are distinguished solely by the presence vs. absence of a [+] or [-] value: for example, if [kp] is [+dorsal, +labial] and [p] is [+labial], then no constraint can refer to [p] without also referring to [kp]. This is probably the right result phonologically, but learners such as the 2008 implementation of the UCLA Phonotactic Learner will complain about such files, so our learner warns you when your feature file has this structure.
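One way to picture the condition behind this warning: a segment is affected when its non-zero feature values are a proper subset of another segment's. The sketch below is one reading of that condition, not the learner's actual check. The helper names and the toy three-segment table are hypothetical.

```python
def specified(values):
    """Return the set of (feature, value) pairs that are not 0."""
    return {(feature, value) for feature, value in values.items() if value != "0"}


def subset_pairs(table):
    """Find (a, b) where every specified value of a is also specified on b,
    and b has at least one more. Any natural class containing a then
    contains b as well."""
    pairs = []
    for a, values_a in table.items():
        for b, values_b in table.items():
            if a != b and specified(values_a) < specified(values_b):
                pairs.append((a, b))
    return pairs


# Toy feature table: [kp] is [+dorsal, +labial], [p] is [+labial] only.
table = {
    "p": {"labial": "+", "dorsal": "0"},
    "k": {"labial": "0", "dorsal": "+"},
    "kp": {"labial": "+", "dorsal": "+"},
}

print(subset_pairs(table))  # [('p', 'kp'), ('k', 'kp')]
```

If [p] were instead specified as [+labial, -dorsal], the pair ("p", "kp") would no longer be reported, because the two segments would then differ in the value of [dorsal] rather than in its presence.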