Complex Segment Learner

Complex Segment Learner

European / Quebecois French (Indo-European)

European French is traditionally analyzed as having no complex segments, even though it is phonotactically fairly permissive, and allows [tʃ] in word-initial position (alongside [ps], [ks] and other stop-fricative clusters).

By contrast, Quebecois French is described as having the affricates [ts] and [dz], which occur before high front vowels [i, y]. This analysis is supported by some experimental evidence (Béland and Kolinsky 2005, Journal of Multilingual Communication Disorders 3:2, pp. 110-117)

We tried the learner on two data sources: a large corpus of child-directed speech, and the Lexique dictionary of French (where 71K words have transcriptions). We transcribed both datasets as either European French (unmodified transcriptions from Lexique and CHILDES) or as having the affrication rule (replacing t/d with ts/dz before {i, y}). Our learner finds the affricates in the CHILDES corpus, which represents token frequencies in connected speech. Training the learner on the Lexique transcribed as Quebecois does not bring the learner to threshold, although [d z] and [t s] are the most inseparable clusters in the data. (The same result obtains when we train the learner on CHILDES tokenized by word; no complex segments are found.)

Simulation data at a glance

Click on simulation name to view additional simulation details.

Simulation nameInitial state Learning DataInitial state features
Quebec Lexique LearningData.txt Features.txt
Euro Lexique LearningData.txt Features.txt
Quebec Childes_Cds_Type LearningData.txt Features.txt
Quebec Childes_Cds_Token LearningData.txt Features.txt
Euro Childes_Cds_Type LearningData.txt Features.txt
Euro Childes_Cds_Token LearningData.txt Features.txt

Simulation details for French quebec lexique

In this simulation, the learner does not unify [ts] and [dz], but they are at the top in terms of inseparability scores (dz = 0.91, ts = 0.64).

Input:

This dataset is Lexique, with {t, d} replaced by corresponding affricates {t s}, {d z} before {y/i}.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/quebec/lexique/simulation/insep_plots.png


Simulation details for French euro lexique

Input:

The data come from Lexique, which is a dictionary with transcriptions provided for about 71,000 words. We did not change anything else in the European version of the dataset.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/lexique/simulation/insep_plots.png


Simulation details for French quebec childes_cds_type

The learner does not find any complex segments to unify when trained on type frequencies from child-directed speech in Quebecois. Note that just as in Lexique, [dz] and [ts] are on top in terms of inseparability (though well below the threshold of 1).

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. This is the version of the data tokenized by word, so only type frequencies are represented, i.e., the word "ty" appears just once in the data instead of multiple times.

The Quebecois variant was generated by substituting all occurrences of [t] and [d] before [i, y] with [t s] and [d z], respectively.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/quebec/childes_cds_type/simulation/insep_plots.png


Simulation details for French quebec childes_cds_token

This simulation is the only one where the learner succeeds in identifying [ts] and [dz] in Quebecois.

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. Since the corpus consists of utterances rather than words, we split the utterances on commas, exclamation marks, question marks, and "unintelligible" signs, so each connected utterance appears without word breaks on its own line. This is the version of the data where token rather than type frequencies are represented, i.e., the word "ty" appears multiple times and not just once.

The Quebecois variant was generated by substituting all occurrences of [t] and [d] before [i, y] with [t s] and [d z], respectively.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 LearningData.txt Features.txt [download] [view] ts, dz None
2 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ ts dz

Simulation Plots

/media/french/quebec/childes_cds_token/simulation/insep_plots.png


Simulation details for French euro childes_cds_type

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. This is the version of the data tokenized by word, so it represents type frequencies. That is, the word "ty" appears just once in the data instead of multiple times.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/childes_cds_type/simulation/insep_plots.png


Simulation details for French euro childes_cds_token

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. Since the corpus consists of utterances rather than words, we split the utterances on commas, exclamation marks, question marks, and "unintelligible" signs, so each connected utterance appears without word breaks on its own line. The corpus represents token frequencies rather than type frequencies; i.e., the word "ty" (you) would appear multiple times in connected speech.

LearningData.txt | Features.txt

Summary of iterations:

IterationLearning Data producedFeatures producedInseparabilityNew Segments addedSegments removed
1 No new learning data No new features [download] [view] None None

Summary of inventory changes

StageConsonant set
Inputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Outputp b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/childes_cds_token/simulation/insep_plots.png