Complex Segment Learner

Standard Russian (Indo-European)

Russian has its own section in the paper as an example of a language where clear phonotactic arguments for affricates are lacking but statistical distributions are informative. Russian affricates are etymologically derived from dorsal and coronal stops, and they are frequent enough in the language that the learner identifies them without narrow transcriptions. The paper discusses the two dictionary simulations; we also ran a simulation on a different data set, a mock-up of a transcribed connected speech corpus, created by transcribing a collection of Russian novels and stories and removing the word boundaries. The main lesson from this exercise is that the learner is sensitive to the type/token frequency distinction: it does not learn the same inventories when trained on words vs. connected speech. We do not know which of the simulations produces the "right" result, however, since we do not know which representations Russian speakers have for these sounds.
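The type/token distinction can be illustrated with a minimal sketch. This is not the learner's own code or its inseparability statistic: the function name `bigram_counts` and the toy words are hypothetical, chosen only to show how counting over a word list differs from counting over running text with word boundaries removed.

```python
from collections import Counter

def bigram_counts(sequences):
    """Count adjacent segment pairs within each sequence of segments."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

# Hypothetical toy lexicon: each word is a list of segments, listed once (types).
lexicon = [["k", "o", "t"], ["s", "a", "t"], ["t", "s", "a", "r"]]

# Hypothetical running text: words repeat, and boundaries are removed (tokens).
text = [["k", "o", "t"], ["s", "a", "t"], ["k", "o", "t"], ["s", "a", "t"]]
running = [seg for word in text for seg in word]

type_counts = bigram_counts(lexicon)
token_counts = bigram_counts([running])

# In the lexicon, t+s occurs once, inside a word.
# In the running text, every t+s sequence straddles a former word boundary.
print(type_counts[("t", "s")], token_counts[("t", "s")])
```

In this toy case the word list and the boundary-free text disagree about how often, and where, the sequence t+s occurs, which is the kind of difference that can lead a distributional learner to different inventories.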

Simulation data at a glance

Simulation name | Initial state learning data | Initial state features
Novel_Corpus | LearningData.txt | Features.txt
Tikhonov | LearningData.txt | Features.txt
Stems | LearningData.txt | Features.txt
Zaliznjak | LearningData.txt | Features.txt
Morphemes | LearningData.txt | Features.txt

Simulation details for Russian novel_corpus

This simulation demonstrates that token frequencies are not sufficient for the learner to identify [ts] as an affricate in this corpus of Russian. The dictionaries make for a better data source than this connected speech model.

Input:

This corpus was put together from 13 Russian novels and stories by 11 authors: Gogol, Pushkin, Tolstoy, Turgenev, Dostoevsky, Gertsen, Karamzin, Kologrivova, Kovalevskaya, Rostopchina, and Xvoschinskaya. Each text was transcribed, its word boundaries and punctuation marks were removed by script, and all the texts were then combined into one file. The transcription process took into account the function vs. lexical status of orthographic words, but given the size of the corpus, it may still contain mistakes. For further details, consult GitHub.
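The boundary-removal step can be sketched as follows. This is an illustration under stated assumptions, not the script from the project's GitHub repository: the function names `strip_boundaries` and `build_corpus` are hypothetical, and joining the texts one per line is a guess about how they were combined.

```python
import unicodedata

def strip_boundaries(text):
    """Drop whitespace and punctuation, keeping letters and diacritics."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )

def build_corpus(texts):
    """Combine the stripped texts into one string, one text per line (assumed)."""
    return "\n".join(strip_boundaries(t) for t in texts)

# Hypothetical toy input standing in for transcribed texts.
print(build_corpus(["kot, sat!", "tsar"]))
```

Filtering on Unicode categories rather than a fixed ASCII punctuation list keeps IPA modifier letters such as ʲ intact while still removing dashes and quotation marks.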

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | | None
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ

Simulation Plots

/media/russian/novel_corpus/simulation/insep_plots.png


Simulation details for Russian tikhonov

Input:

The data source is Tikhonov's (2002) morphological dictionary of Russian, which can be downloaded in Cyrillic here. Transcription scripts are on GitHub.

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | | None
2 | LearningData.txt | Features.txt | [download] [view] | ts | None
3 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ ts

Simulation Plots

/media/russian/tikhonov/simulation/insep_plots.png


Simulation details for Russian stems

Input:

This corpus was created on the basis of Zaliznjak's paradigms and the Tikhonov morphologically parsed dictionary. It includes monomorphemic words (either in the genitive plural or the nominative singular, depending on the declension class).

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | | None
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ

Simulation Plots

/media/russian/stems/simulation/insep_plots.png


Simulation details for Russian zaliznjak

This simulation replicates the Tikhonov one, which is described in more detail in the paper.

Input:

This data set consists of the headwords from Zaliznjak's famous paradigm dictionary of Russian, transcribed into IPA using a script.

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | | None
2 | LearningData.txt | Features.txt | [download] [view] | ts | None
3 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ ts

Simulation Plots

/media/russian/zaliznjak/simulation/insep_plots.png


Simulation details for Russian morphemes

Input:

LearningData.txt | Features.txt

Summary of iterations:

Iteration | Learning data produced | Features produced | Inseparability | New segments added | Segments removed
1 | LearningData.txt | Features.txt | [download] [view] | | ɣ, ʑ
2 | No new learning data | No new features | [download] [view] | None | None

Summary of inventory changes

Stage | Consonant set
Input | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output | p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n nʲ r rʲ l lʲ j tɕ
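The inventory changes in this summary can be checked by comparing the two consonant sets directly. The sketch below is only an illustration: the helper `inventory_changes` is a hypothetical name, and the two strings are copied from the Input and Output rows of the morphemes simulation.

```python
def inventory_changes(before, after):
    """Return (added, removed) segments between two space-separated inventories."""
    before_set, after_set = set(before.split()), set(after.split())
    return sorted(after_set - before_set), sorted(before_set - after_set)

# Input and Output consonant sets of the morphemes simulation.
input_inventory = (
    "p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ "
    "s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j"
)
output_inventory = (
    "p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x xʲ t tʲ d dʲ "
    "s sʲ z zʲ ʂ ʐ ɕ n nʲ r rʲ l lʲ j tɕ"
)

added, removed = inventory_changes(input_inventory, output_inventory)
print("added:", added)      # added: ['tɕ']
print("removed:", removed)  # removed: ['ɣ', 'ʑ']
```

The same comparison applies to the other simulations by substituting their Input and Output rows.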

Simulation Plots

/media/russian/morphemes/simulation/insep_plots.png