Complex Segment Learner

Standard Polish (Indo-European)

Polish is not discussed in the paper. While the language is related to Russian, it raises some very different issues. Just as in Russian, clear arguments in favor of complex segments are difficult to make on phonotactic grounds alone: both languages allow many combinations of stops and fricatives. But some of Trubetzkoy's other arguments, which fail for Russian, actually make sense for Polish. For one, Polish has a more symmetrical inventory of affricates: wherever there are contrastive strident fricatives, there are also voiced and voiceless affricates. Russian, on the other hand, has a gapped inventory with respect to voicing (all phonemic affricates are voiceless), as well as place (there is no retroflex affricate).

Polish also supplies an unusual argument in favor of an affricate analysis of [ʈʂ, ɖʐ], which famously contrast with the stop-fricative sequences [tʂ, dʐ] (the fricatives in these sequences derive from historical rhotics; see the Brooks 1964 citation in the paper). The simulations we report below reflect this contrast in the transcriptions, because it is represented in the orthography. One corollary of the contrast is that the retroflex affricates have categorically inseparable first halves: retroflex stops do not occur freely elsewhere. It is interesting, therefore, that our learner does not succeed in identifying [ɖʐ] in any of the simulations. The reason is that this affricate is somewhat defective in the language's phonology: it occurs only in loanwords and in morphophonologically derived environments, and it is rare overall (two orders of magnitude rarer than its voiceless counterpart). This shows that the "trick" of narrowly transcribing affricates, as in the English simulation, is not really a trick, since it does not always help the learner find the desired segments. Categorical inseparability is not the same as inseparability in our learner's calculations.
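The gap between categorical inseparability and a computed score can be illustrated with a toy calculation. The sketch below is not the learner's actual measure: it assumes a simple stand-in score (the share of first-half tokens followed by the second half, multiplied by the share of second-half tokens preceded by the first half), and the word list is invented for illustration rather than drawn from POLEX or CHILDES. The names `pair_proportions`, `toy_score`, and `TOY_WORDS` are hypothetical.

```python
from collections import Counter

def pair_proportions(words, c1, c2):
    """Return (share of c1 tokens followed by c2, share of c2 tokens preceded by c1).

    `words` are space-separated segment strings. This is an assumed,
    illustrative statistic, not the measure used in the simulations.
    """
    seg_counts = Counter()
    pair_count = 0
    for word in words:
        segs = word.split()
        seg_counts.update(segs)
        pair_count += sum(1 for a, b in zip(segs, segs[1:]) if (a, b) == (c1, c2))
    if seg_counts[c1] == 0 or seg_counts[c2] == 0:
        return 0.0, 0.0
    return pair_count / seg_counts[c1], pair_count / seg_counts[c2]

def toy_score(words, c1, c2):
    """Product of the two proportions (hypothetical stand-in for inseparability)."""
    first, second = pair_proportions(words, c1, c2)
    return first * second

# Invented toy lexicon: both retroflex stops occur only before their fricative,
# but the voiced sequence is rare while its fricative occurs freely elsewhere.
TOY_WORDS = [
    "ʈ ʂ a", "ʈ ʂ e", "ʈ ʂ o", "a ʈ ʂ", "ʂ a",
    "ɖ ʐ e", "ʐ a", "ʐ e", "ʐ o", "a ʐ", "o ʐ a",
]

for c1, c2 in [("ʈ", "ʂ"), ("ɖ", "ʐ")]:
    print(c1 + c2, pair_proportions(TOY_WORDS, c1, c2), toy_score(TOY_WORDS, c1, c2))
```

In this toy data the first half of both sequences is categorically inseparable (its proportion is 1.0), yet the voiced sequence still receives a much lower score, because most tokens of its second half occur outside the sequence.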

The comparison between Russian and Polish is typologically instructive: voiced affricates are dispreferred cross-linguistically (see Marzena Żygis's work on this), and in Polish, the voiced sequences are so type-rare that our learner has a difficult time unifying them.

Beyond affricates, one difference between Polish and Russian is that Russian has a clear and robust palatalization-velarization contrast in most of its consonants, whereas in Polish, the contrast is more difficult to analyze and is more controversial (see Gussmann 2007, Bethin 1992, and many others). In the POLEX simulations, we transcribed all of the controversial sequences as clusters except for prepalatals, which can occur in word-final position (e.g., 'stump' was transcribed as [p j e ɲ], not [pʲ e ɲ]). The learner does not unify any of the purported palatals.

Simulation data at a glance

Simulation name | Initial state learning data | Initial state features
Polex Narrow | LearningData.txt | Features.txt
Polex Broad | LearningData.txt | Features.txt
Childes_Cds Narrow | LearningData.txt | Features.txt
Childes_Cds Broad | LearningData.txt | Features.txt

Simulation details for Polish polex narrow

Input:

The data for this simulation come from POLEX (Vetulani 2000). All of the ~98,000 words are included. The transcription script for converting the POLEX ASCII pseudo-orthography into IPA is here. Voicing assimilation is transcribed in this simulation. The "narrow" and "broad" transcriptions differ in the representation of orthographic <ć, dź>: in the narrow transcription, they are transcribed as [cɕ, ɟʑ]; in the broad transcription, as [tɕ, dʑ].
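The difference between the two transcription styles for <ć, dź> can be written down as a small lookup. This is only a sketch of that one contrast, not the POLEX conversion script; the table and function names are hypothetical, and segments are joined with spaces on the assumption that the learning data separate segments that way.

```python
# Hypothetical lookup covering only the two graphemes whose transcription
# differs between the narrow and broad POLEX simulations.
PREPALATAL_AFFRICATES = {
    "narrow": {"ć": ["c", "ɕ"], "dź": ["ɟ", "ʑ"]},
    "broad": {"ć": ["t", "ɕ"], "dź": ["d", "ʑ"]},
}

def transcribe_affricate(grapheme, style):
    """Return the space-separated two-segment transcription for a grapheme."""
    return " ".join(PREPALATAL_AFFRICATES[style][grapheme])

for style in ("narrow", "broad"):
    for grapheme in PREPALATAL_AFFRICATES[style]:
        print(style, grapheme, transcribe_affricate(grapheme, style))
```

Under the narrow style, the first half of each sequence is a symbol that occurs nowhere else in the transcription; under the broad style, it is shared with plain [t] and [d].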

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | ts, ʈʂ, cɕ, ɟʑ | None
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ r l w j
Output | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ r l w j ts ʈʂ cɕ ɟʑ

Simulation Plots

/media/polish/polex/narrow/simulation/insep_plots.png


Simulation details for Polish polex broad

Input:

See the description of the "narrow" POLEX simulation for data provenance and the transcription script. "Broad" here refers to the transcription of orthographic <ć, dź> as [tɕ, dʑ]. The learner finds the voiceless affricate even if it is transcribed with [t] as its first half, but it finds the voiced affricate only when it is given a narrow transcription.

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | tɕ, ʈʂ | None
2 | LearningData.txt | Features.txt | [download] [view] | ts | None
3 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p b m f v k g x t d ʈ ɖ s z ʂ ʐ ɕ ʑ n ɲ r l w j
Output | p b m f v k g x t d ʈ ɖ s z ʂ ʐ ɕ ʑ n ɲ r l w j tɕ ʈʂ ts
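The iteration summary above shows segments being added over successive passes until a pass yields nothing new. A minimal sketch of such a loop follows. It is illustrative only: the score (a product of two conditional proportions), the threshold of 0.5, the restriction to consonant pairs, and the toy word list are all assumptions made for this example and are not taken from the simulations. The sketch shows one way a sequence can fall below a threshold on the first pass and clear it on the second, once tokens of its first half have been absorbed into another unified segment; it does not claim that this is why [ts] emerges late in the simulation itself.

```python
from collections import Counter

VOWELS = set("aeio")  # toy vowel set; pairs containing a vowel are not scored

def score_pairs(words):
    """Assumed score for each adjacent consonant pair:
    P(pair | first segment) * P(pair | second segment)."""
    segs, pairs = Counter(), Counter()
    for word in words:
        segs.update(word)
        pairs.update(zip(word, word[1:]))
    return {
        p: (n / segs[p[0]]) * (n / segs[p[1]])
        for p, n in pairs.items()
        if p[0] not in VOWELS and p[1] not in VOWELS
    }

def unify(word, pair):
    """Rewrite each occurrence of the two-segment sequence as a single segment."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def learn(words, threshold):
    """Repeat scoring and unification until no pair reaches the threshold.

    Simplification: pairs found in the same pass are applied one after another,
    which is only safe when they do not overlap, as in the toy data below.
    """
    history = []
    while True:
        new = sorted(p for p, s in score_pairs(words).items() if s >= threshold)
        if not new:
            return words, history
        for pair in new:
            words = [unify(w, pair) for w in words]
        history.append(["".join(p) for p in new])

# Invented toy data: four words with t+ɕ, three with t+s.
TOY_WORDS = [
    ["t", "ɕ", "a"], ["t", "ɕ", "e"], ["t", "ɕ", "o"], ["t", "ɕ", "i"],
    ["t", "s", "a"], ["t", "s", "e"], ["t", "s", "o"],
]

final_words, history = learn(TOY_WORDS, threshold=0.5)
print(history)
```

On the first pass of this toy run only the t+ɕ sequence reaches the threshold; after it is unified, the remaining [t] tokens all precede [s], so t+s reaches the threshold on the second pass, and the third pass finds nothing new.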

Simulation Plots

/media/polish/polex/broad/simulation/insep_plots.png


Simulation details for Polish childes_cds narrow

The learner does better on Polish CDS when the data are transcribed narrowly, with different "allophones" of the first halves of affricates for each place of articulation. The learner still fails to identify [ɖʐ] and [dz], which are well below the threshold. The affricate [ts] is just short of the threshold. This confirms that the voiced part of the affricate inventory is a challenge whether the learner is exposed to token or type frequencies.

Input:

The corpus of Polish Child-Directed Speech (CDS) comes from CHILDES; it is already fully transcribed. We removed word boundaries and replaced affricate symbols with the corresponding stop-fricative sequences.

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | ʈʂ, cɕ, ɟʑ | None
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ ŋ r l w j
Output | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ ŋ r l w j ʈʂ cɕ ɟʑ

Simulation Plots

/media/polish/childes_cds/narrow/simulation/insep_plots.png


Simulation details for Polish childes_cds broad

The inventory of Polish presents challenges for the learner regardless of the nature of the training data. In this simulation, the learner finds two of the affricates in the first iteration, but does not identify any unifiable sequences on the second iteration.

Input:

The corpus of Polish Child-Directed Speech (CDS) comes from CHILDES; it is already fully transcribed. We removed word boundaries and replaced affricate symbols with the corresponding stop-fricative sequences; we also replaced affricate symbols such as the prepalatal [ʤ] with the more accurate retroflex [ɖʐ]. The contrast between stop-retroflex fricative clusters and retroflex affricates was represented through the place of articulation difference in the first half of the sequence, i.e., the clusters were transcribed as [tʂ] and [dʐ].
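The two preprocessing steps described here, dropping word boundaries and splitting affricate symbols into two segments, can be sketched as follows. The input format (an utterance as a list of words, each a list of segments) is an assumption, as is the symbol table: only the [ʤ] to [ɖʐ] replacement is mentioned in the text, and the [ʧ] entry is added by analogy for illustration. The names `AFFRICATE_TO_SEQUENCE` and `flatten_utterance` are hypothetical.

```python
# Hypothetical symbol table. Only the ʤ -> ɖ ʐ replacement is mentioned in the
# text; the ʧ entry is an assumption added by analogy.
AFFRICATE_TO_SEQUENCE = {
    "ʤ": ["ɖ", "ʐ"],
    "ʧ": ["ʈ", "ʂ"],
}

def flatten_utterance(words):
    """Drop word boundaries and split affricate symbols into two segments.

    `words` is assumed to be a list of words, each a list of segment symbols.
    """
    segments = []
    for word in words:
        for seg in word:
            segments.extend(AFFRICATE_TO_SEQUENCE.get(seg, [seg]))
    return segments

# Toy utterance of two words; the result is one boundary-free segment list.
utterance = [["t", "o"], ["ʤ", "e", "m"]]
print(" ".join(flatten_utterance(utterance)))
```

Because the clusters keep plain [t] and [d] as their first half, a stop-fricative cluster in the input passes through unchanged and stays distinct from the sequences produced by splitting an affricate symbol.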

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | dʑ, ʈʂ | None
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ ŋ r l w j
Output | p b m f v k g x t d ʈ ɖ c ɟ s z ʂ ʐ ɕ ʑ n ɲ ŋ r l w j dʑ ʈʂ

Simulation Plots

/media/polish/childes_cds/broad/simulation/insep_plots.png