European / Quebecois French (Indo-European)

European French is traditionally analyzed as having no complex segments, even though it is phonotactically fairly permissive and allows [tʃ] in word-initial position (alongside [ps], [ks], and other stop-fricative clusters).

By contrast, Quebecois French is described as having the affricates [ts] and [dz], which occur before the high front vowels [i, y]. This analysis is supported by some experimental evidence (Béland and Kolinsky 2005, Journal of Multilingual Communication Disorders 3:2, pp. 110-117).

We tested the learner on two data sources: a large corpus of child-directed speech from CHILDES, and the Lexique dictionary of French (which provides transcriptions for about 71K words). We transcribed each dataset either as European French (unmodified transcriptions from Lexique and CHILDES) or as Quebecois French, applying the affrication rule (replacing t/d with ts/dz before {i, y}). The learner finds the affricates only in the Quebecois version of the CHILDES corpus of connected utterances, which represents token frequencies in connected speech. Training the learner on the Quebecois version of Lexique does not bring it to threshold, although [d z] and [t s] are the most inseparable clusters in the data. (The same result obtains when we train the learner on the Quebecois CHILDES data tokenized by word: no complex segments are found.)

Simulation data at a glance


Simulation name	Initial state learning data	Initial state features
Euro Lexique	LearningData.txt	Features.txt
Quebec Lexique	LearningData.txt	Features.txt
Euro Childes_Cds_Type	LearningData.txt	Features.txt
Euro Childes_Cds_Token	LearningData.txt	Features.txt
Quebec Childes_Cds_Type	LearningData.txt	Features.txt
Quebec Childes_Cds_Token	LearningData.txt	Features.txt

Simulation details for French euro lexique

Input:

The data come from Lexique, a dictionary that provides transcriptions for about 71,000 words. The European version of the dataset uses these transcriptions unmodified.

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/lexique/simulation/insep_plots.png

Simulation details for French quebec lexique

In this simulation, the learner does not unify [ts] and [dz], but they are at the top in terms of inseparability scores (dz = 0.91, ts = 0.64).

Input:

This dataset is Lexique, with {t, d} replaced by the corresponding affricates {t s, d z} before {i, y}.
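The substitution described above can be sketched roughly as follows. This is only an illustration, not the script used to prepare the data: it assumes that each line of the learning data is a sequence of space-separated segments, and the function names `affricate` and `affricate_line` are hypothetical.

```python
# Hypothetical sketch of the affrication rewrite: t -> t s, d -> d z before i or y.
# Assumes each transcription is a string of space-separated segments.

FRICATIVE = {"t": "s", "d": "z"}   # fricative release added after each stop
HIGH_FRONT = {"i", "y"}            # triggering vowels

def affricate(segments):
    """Return a new segment list with s/z inserted after t/d before i or y."""
    out = []
    for k, seg in enumerate(segments):
        out.append(seg)
        nxt = segments[k + 1] if k + 1 < len(segments) else None
        if seg in FRICATIVE and nxt in HIGH_FRONT:
            out.append(FRICATIVE[seg])
    return out

def affricate_line(line):
    """Apply the rewrite to one line of space-separated segments."""
    return " ".join(affricate(line.split()))

print(affricate_line("t y"))      # the word "ty" in the Quebecois variant
print(affricate_line("p ə t i"))
```

Because the fricative is written as a separate segment, the result is a [t s] or [d z] sequence rather than a single affricate symbol; whether to unify the sequence is left to the learner.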

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/quebec/lexique/simulation/insep_plots.png

Simulation details for French euro childes_cds_type

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. This is the version of the data tokenized by word, so it represents type frequencies. That is, the word "ty" appears just once in the data instead of multiple times.
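As a rough illustration of what tokenizing by word means here, the sketch below reduces a list of utterances to a list in which each distinct word appears once. It assumes that words within an utterance are separated by whitespace; the function name `to_types` is hypothetical, and this is not the script used to prepare the data.

```python
# Hypothetical sketch: collapse utterances into a type list (each word once).
# Assumes words within an utterance are separated by whitespace.

def to_types(utterances):
    """Return each distinct word once, in order of first appearance."""
    seen = set()
    types = []
    for utterance in utterances:
        for word in utterance.split():
            if word not in seen:
                seen.add(word)
                types.append(word)
    return types

# "ty" occurs in both utterances but is kept only once.
print(to_types(["ty vø", "ty di"]))
```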

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/childes_cds_type/simulation/insep_plots.png

Simulation details for French euro childes_cds_token

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. Since the corpus consists of utterances rather than words, we split the utterances on commas, exclamation marks, question marks, and "unintelligible" signs, so each connected utterance appears without word breaks on its own line. The corpus represents token frequencies rather than type frequencies; i.e., the word "ty" (you) would appear multiple times in connected speech.
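The splitting step can be sketched along the following lines. This is an illustration under stated assumptions, not the preprocessing script itself: it assumes that the "unintelligible" sign is the marker `xxx`, that words are separated by whitespace, and it leaves out any spacing between segments; the names `BREAKS` and `split_utterance` are hypothetical.

```python
import re

# Hypothetical sketch: split an utterance at commas, exclamation marks,
# question marks, and an assumed "unintelligible" marker (xxx), then
# remove the word breaks inside each resulting chunk.
BREAKS = re.compile(r"[,!?]|\bxxx\b")

def split_utterance(utterance):
    """Return the connected chunks of an utterance, one string per chunk."""
    chunks = []
    for piece in BREAKS.split(utterance):
        joined = "".join(piece.split())  # drop word breaks within the chunk
        if joined:                       # skip empty pieces between delimiters
            chunks.append(joined)
    return chunks

# Each chunk would go on its own line of the learning data.
for chunk in split_utterance("ty vø, ty di xxx wi!"):
    print(chunk)
```

Because no deduplication is applied, a frequent word such as "ty" recurs across chunks, which is what makes the resulting data reflect token frequencies.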

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/euro/childes_cds_token/simulation/insep_plots.png

Simulation details for French quebec childes_cds_type

The learner does not find any complex segments to unify when trained on type frequencies from child-directed speech in Quebecois. Note that just as in Lexique, [dz] and [ts] are on top in terms of inseparability (though well below the threshold of 1).

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. This is the version of the data tokenized by word, so only type frequencies are represented, i.e., the word "ty" appears just once in the data instead of multiple times.

The Quebecois variant was generated by substituting all occurrences of [t] and [d] before [i, y] with [t s] and [d z], respectively.

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ

Simulation Plots

/media/french/quebec/childes_cds_type/simulation/insep_plots.png

Simulation details for French quebec childes_cds_token

This simulation is the only one where the learner succeeds in identifying [ts] and [dz] in Quebecois.

Input:

The corpus of French infant-directed speech comes from CHILDES, and was prepared by Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, & Sharon Peperkamp (2018). We used their IPA correspondence table to transcribe the files. Since the corpus consists of utterances rather than words, we split the utterances on commas, exclamation marks, question marks, and "unintelligible" signs, so each connected utterance appears without word breaks on its own line. This is the version of the data where token rather than type frequencies are represented, i.e., the word "ty" appears multiple times and not just once.

The Quebecois variant was generated by substituting all occurrences of [t] and [d] before [i, y] with [t s] and [d z], respectively.

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	ts, dz	None
2	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ
Output	p b m f v t d s z n l ʃ ʒ ʁ j w ɥ k g ŋ ɲ ts dz

Simulation Plots

/media/french/quebec/childes_cds_token/simulation/insep_plots.png
