Complex Segment Learner

  1. Getting started
  2. Learning data file
  3. Feature file format
  4. Segment removal
  5. Naming the features
  6. Types of features the learner can handle
  7. Reading the plots
  8. Command line instructions
  9. What happens to your data
  10. Everything else

Getting started

To run the learner, you need two text files: a file containing your learning data, and a file of features defining the segments in the initial state of the learner. Both are plain text files in Unicode. You do not have to use the IPA, but we recommend it--it will make your plots more readable to people other than you.

You can set your own parameters for the inseparability threshold (default 1.0) and the alpha level for the Fisher's exact test (default 0.05). We recommend starting with a threshold of 1, since it has produced reasonable results in all the cases we have studied.

Learning data file

The learner assumes that your data file has one word per line (we recommend Unix line breaks). Segments are separated from each other by spaces. You can use combinations of letters (digraphs or longer) to represent segments. For example, if you want to test the English sequences [ts] and [tʃ], you would give the learner a data file where "pizza" and "chimp" are transcribed as follows:

p i t s ə
t ʃ ɪ m p

If all goes well, the learner will produce a learning data file where the unifiable sequences are rewritten as single segments, and true clusters are left as clusters:

p i t s ə
tʃ ɪ m p
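
The rewriting step can be sketched in a few lines (a hypothetical helper, not the learner's actual code): every adjacent occurrence of a pair judged unifiable is collapsed into a single concatenated segment, while all other sequences are left alone.

```python
def unify_bigram(line, first, second):
    """Rewrite every adjacent occurrence of (first, second) as a single
    concatenated segment. Segments are space-separated, as in the
    learning data file."""
    segs = line.split()
    out = []
    i = 0
    while i < len(segs):
        if i + 1 < len(segs) and segs[i] == first and segs[i + 1] == second:
            out.append(first + second)  # e.g. 't' + 'ʃ' -> 'tʃ'
            i += 2
        else:
            out.append(segs[i])
            i += 1
    return " ".join(out)

print(unify_bigram("t ʃ ɪ m p", "t", "ʃ"))  # tʃ ɪ m p
print(unify_bigram("p i t s ə", "t", "ʃ"))  # unchanged: p i t s ə
```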

There are many examples of Learning Data files on the Demo Simulations page. There is no minimum or maximum limit on the number of words or the length of the words themselves. But since the Complex Segment Learner is a statistical learner, your results will be more interpretable if your data set is reasonably large.

What counts as "large"? The relevant number is segmental bigrams, not words. In one of the Russian simulations, the learning data file is a single line, created by transcribing a corpus of several million words and removing the word boundaries. We also got reasonable-looking results from smaller corpora of just a few hundred words in languages such as Quechua and Navajo. But a few hundred is on the small side. Aim for a data set of at least a few thousand words of typical length (6-20 characters).
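
To measure your data set in the relevant unit, you can count segmental bigrams with a few lines of Python (an illustrative helper, not part of the learner):

```python
from collections import Counter

def bigram_counts(lines):
    """Count adjacent segment pairs in a space-tokenized learning data file."""
    counts = Counter()
    for line in lines:
        segs = line.split()
        counts.update(zip(segs, segs[1:]))
    return counts

# Toy two-word data set in the learner's format:
data = ["p i t s ə", "t ʃ ɪ m p"]
counts = bigram_counts(data)
print(sum(counts.values()))  # 8 bigram tokens in this toy data
print(counts[("t", "s")])    # 1
```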

Feature file format

The feature file follows the format of the UCLA Phonotactic Learner. It must be tab-separated. The first row is feature names, and the subsequent rows have the segments and their values, which must be one of +, -, 0. Feature values cannot be blank. There are many example feature files for you to look at on the Demo Simulations page.

If you run into repeated errors with your own feature file, try using this generic feature file with most of the IPA symbols uniquely identified.
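
Before a long run, it can save time to lint the feature file yourself. The following is a minimal sketch (not part of the learner), assuming the header row lists only the feature names and every later row begins with the segment:

```python
def check_feature_file(text):
    """Validate a tab-separated feature file: a header row of feature
    names, then one row per segment holding the segment followed by one
    value per feature, each of which must be +, -, or 0."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    n_feats = len(rows[0])  # assumes the header row lists feature names only
    problems = []
    for i, row in enumerate(rows[1:], start=2):
        if len(row) - 1 != n_feats:
            problems.append("line %d: expected %d values, got %d"
                            % (i, n_feats, len(row) - 1))
        for value in row[1:]:
            if value not in {"+", "-", "0"}:
                problems.append("line %d: illegal value %r" % (i, value))
    return problems

print(check_feature_file("son\tcont\np\t-\t-\nts\t-\t+"))  # [] (no problems)
```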

Segment removal

When the learner generates new versions of the learning data and feature files, it removes any segments that do not occur in the learning data. Thus, if you have segments in your initial feature file that do not occur in the data, they will not be kept in the final version of the feature file. If the learner does not find any complex segments, it does not alter your original feature file.

Naming the features

You can name most of your features anything you like, but there are a few conventions to follow:

  • You must have a [syll(abic)] feature that separates consonants from vowels. The learner tracks consonants, which are [-syllabic] segments (e.g., [w, q, p, t, l, ʃ]).
  • The learner will attempt to produce composite feature specifications for common complex segments (affricates, prenasalized stops, labialized/palatalized consonants). If you want the learner to provide sensible feature specifications for the new segments, you have to use the following feature names for common features:
    Mismatching feature    Value kept       Example             Language
    [±nas(al)]             [+nasal]         [mb], [nr]          Fijian
    [±son(orant)]          [-sonorant]      [mb], [tw]          Shona
    [±strid(ent)]          [+strident]      [ts], [pʃ]          Tswana
    [±cont(inuant)]        [-continuant]    [tw], [ts]          Tswana
    [±cons(onantal)]       [+consonantal]   [tw], [kj]          Shona
    [±lab(ial)]            [+labial]        [ɡb], [kp], [ɸʃ]    Ngbaka, Tswana
    [±dor(sal)]            [+dorsal]        [ɡb], [kp]          Ngbaka
    [±cor(onal)]           [+coronal]       [ɸʃ], [tʃk]         Tswana, Shona
  • If the learner cannot make a uniquely identifying specification for a segment, it will produce a warning and keep going. You will have to fix the resulting feature file by hand if you want to use it in another learning model.
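
The value-keeping rules in the table can be pictured with a short sketch (the function and dictionary names are hypothetical, abbreviated feature names stand in for the full ones, and the learner's own implementation may differ): where the two component segments disagree on a feature, the composite keeps the value listed in the table.

```python
# Which value wins when the two halves of a complex segment disagree,
# keyed by (abbreviated) feature name, per the table above.
KEPT_VALUE = {
    "nas": "+", "son": "-", "strid": "+", "cont": "-",
    "cons": "+", "lab": "+", "dor": "+", "cor": "+",
}

def compose(spec1, spec2):
    """Combine two feature dicts (assumed to share the same keys) into
    one specification for the unified segment: matching values carry
    over; mismatches are resolved by KEPT_VALUE where possible."""
    out = {}
    for feat in spec1:
        if spec1[feat] == spec2[feat]:
            out[feat] = spec1[feat]
        elif feat in KEPT_VALUE:
            out[feat] = KEPT_VALUE[feat]
        else:
            out[feat] = None  # no rule: fix by hand
    return out

t = {"son": "-", "cont": "-", "strid": "-"}
s = {"son": "-", "cont": "+", "strid": "+"}
print(compose(t, s))  # {'son': '-', 'cont': '-', 'strid': '+'}
```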

Types of features the learner can handle

You can use either binary features (+, -) or privative features (+/- vs. 0, or + vs. 0). The Complex Segment Learner benefits from the expressivity of binary features, because it allows the learner to enrich the segmental inventory with fewer features. This has the downside of increasing the number of natural classes in your feature system, which can cause learners such as the UCLAPL to break.

The learner will throw a warning if you have distinctions solely in the presence vs. absence of a [+] or [-] value: for example, if [kp] is [+dorsal, +labial] and [p] is [+labial], then no constraint can refer to [p] without also referring to [kp]. This is probably the right result phonologically, but learners such as the 2008 implementation of the UCLA Phonotactic Learner will complain about such files, so our learner will warn you if they have this structure.
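
The configuration that triggers this warning can be detected mechanically. A sketch (hypothetical helper names, with "0" treated as unspecified): one segment's specified values form a subset of another's, so no natural class picks out the first without also including the second.

```python
def subset_pairs(inventory):
    """Find pairs (a, b) where every specified (non-'0') value of a
    also holds of b, so no constraint can target a while excluding b."""
    pairs = []
    for a, spec_a in inventory.items():
        for b, spec_b in inventory.items():
            if a == b:
                continue
            specified = {f: v for f, v in spec_a.items() if v != "0"}
            if specified and all(spec_b.get(f) == v for f, v in specified.items()):
                pairs.append((a, b))
    return pairs

inv = {
    "p":  {"lab": "+", "dor": "0"},
    "kp": {"lab": "+", "dor": "+"},
}
print(subset_pairs(inv))  # [('p', 'kp')]: [p] cannot be referred to without [kp]
```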

Reading the plots

The plots show three kinds of information:

  1. the number of iterations;
  2. the inseparability numbers for each of the top 15 most inseparable clusters in each iteration;
  3. whether the cluster meets the frequency threshold needed for unification.

The last two points are conveyed via the color and shape of the markers:
  • Blue circles show inseparability values for sequences that pass the Fisher's Exact Test threshold for frequency.
  • Black Xes indicate that the sequence is not frequent enough to be unified (p>0.05).
  • The red line in each plot marks the threshold for unification (set to 1). The red line may not be visible when the highest inseparability values in a simulation are very high--this means that y = 1 is squashed so close to the x-axis that the line does not show in the plot.
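
The blue/black distinction rests on a Fisher's exact test over bigram counts. To make the test itself concrete, here is a from-scratch sketch of the one-sided p-value computed from the hypergeometric distribution (the function name is illustrative, and how the learner fills the 2x2 table from bigram counts is not shown here; in practice one would use SciPy's scipy.stats.fisher_exact):

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided ('greater') Fisher's exact test p-value for the 2x2
    table [[a, b], [c, d]]: the probability, under the hypergeometric
    null, of a top-left cell at least as large as a, given the margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)
    upper = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1)) / denom

# Classic "lady tasting tea" 2x2 table:
print(fisher_exact_greater(3, 1, 1, 3))  # 17/70, i.e. about 0.2429
```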

Command line instructions

You can run the learner from the command line as a Python 3 utility. To do so, first ensure you have the following dependencies installed:

  • SciPy
  • NLTK
  • NumPy
  • Matplotlib

You will also need Python 3.6 or newer.

To run the learner, download the module, navigate to its directory, and run it as follows (replace <module> with the name of the downloaded module file):

	$ python3 <module> --help

This will show you all the available options. A simple use case would be to give the learner a learning data file, a feature file, and an output directory--which you would do as follows:

	$ python3 <module> --ld /home/you/Desktop/LearningData.txt --feats /home/you/Desktop/Features.txt --outdir /home/you/Desktop

The learner will create a directory called 'simulation' inside the path you gave it for saving the results ('outdir'), and it will put some plots and data from the simulation there. Note that if a directory called 'simulation' already exists at the destination, it will be overwritten without a prompt.

What happens to your data

The online version of the learner uploads your learning data and feature files to our server in order to run the algorithm. If all goes well, you get a zipped file with the results that you can examine at your leisure. The files you upload are automatically deleted from the server on a regular basis, and they are not analyzed in any way. The site does not use cookies to track session information.

Everything else

If you have questions that are not answered here or in the paper, please email the authors.