Standard Russian (Indo-European)

Russian has its own section in the paper as an example of a language where clear phonotactic arguments for affricates are lacking but statistical distributions are informative. Russian affricates are etymologically derived from dorsal and coronal stops, and they are frequent enough in the language that the learner identifies them even without narrow transcriptions. The paper discusses the two dictionary simulations; we also ran a simulation on a different data set, a mock-up of a transcribed connected speech corpus, created by transcribing a collection of Russian novels and removing the word boundaries. The main lesson of this exercise is that the learner is sensitive to the type/token frequency distinction: it does not learn the same inventories when trained on words as when trained on connected speech. We do not know which of the simulations produces the "right" result, however, since we do not know which representations Russian speakers have for these sounds.
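To make the type/token distinction concrete, here is a minimal sketch of how the same segment sequence can have a different share of the data depending on whether each word counts once or is weighted by how often it occurs. The toy words, the frequencies, and the function name bigram_counts are hypothetical, and the space-separated segment format is an assumption; this is not the learner's own code or its inseparability calculation.

```python
from collections import Counter


def bigram_counts(words, weights=None):
    """Count adjacent segment pairs in space-separated transcriptions.

    With weights=None every word counts once (type frequency); otherwise
    each word's bigrams are weighted by its token frequency.
    """
    counts = Counter()
    for i, word in enumerate(words):
        segments = word.split()
        weight = 1 if weights is None else weights[i]
        for pair in zip(segments, segments[1:]):
            counts[pair] += weight
    return counts


# Hypothetical toy lexicon and token frequencies, for illustration only.
words = ["t s a rʲ", "o t s a", "k o t"]
token_freqs = [2, 1, 10]

type_counts = bigram_counts(words)
token_counts = bigram_counts(words, token_freqs)

for label, counts in (("type", type_counts), ("token", token_counts)):
    share = counts[("t", "s")] / sum(counts.values())
    print(f"{label}: t+s makes up {share:.2f} of all bigrams")
```

In this toy example the t+s sequence accounts for 0.25 of the bigrams by type but only about 0.10 by token, because the one frequent word does not contain it.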

Simulation data at a glance


Simulation name	Initial state learning data	Initial state features
Morphemes	LearningData.txt	Features.txt
Stems	LearningData.txt	Features.txt
Zaliznjak	LearningData.txt	Features.txt
Novel_Corpus	LearningData.txt	Features.txt
Tikhonov	LearningData.txt	Features.txt

Simulation details for Russian morphemes

Input:

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	tɕ	ɣ, ʑ
2	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n nʲ r rʲ l lʲ j tɕ
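The added and removed segments reported in the iteration summary can be checked directly against the Input and Output rows. The sketch below does this for the morphemes simulation; the helper name inventory_diff is hypothetical, and the two strings are copied from the table above.

```python
def inventory_diff(input_row, output_row):
    """Return (added, removed) segments between two space-separated inventories."""
    before = set(input_row.split())
    after = set(output_row.split())
    return sorted(after - before), sorted(before - after)


# Consonant sets copied from the morphemes inventory table.
morphemes_input = (
    "p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ "
    "s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j"
)
morphemes_output = (
    "p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x xʲ t tʲ d dʲ "
    "s sʲ z zʲ ʂ ʐ ɕ n nʲ r rʲ l lʲ j tɕ"
)

added, removed = inventory_diff(morphemes_input, morphemes_output)
print("added:", added)      # added: ['tɕ']
print("removed:", removed)  # removed: ['ɣ', 'ʑ']
```

The same comparison applies to the other simulations by substituting their Input and Output rows.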

Simulation Plots

/media/russian/morphemes/simulation/insep_plots.png

Simulation details for Russian stems

Input:

This corpus was created on the basis of Zaliznjak's paradigms and the Tikhonov morphologically parsed dictionary. It includes monomorphemic words (either in the genitive plural or the nominative singular, depending on the declension class).

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	tɕ	None
2	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ

Simulation Plots

/media/russian/stems/simulation/insep_plots.png

Simulation details for Russian zaliznjak

This simulation replicates the Tikhonov one, which is described in more detail in the paper.

Input:

This data set consists of the headwords from Zaliznjak's paradigm dictionary of Russian, transcribed into IPA by script.

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	tɕ	None
2	LearningData.txt	Features.txt	[download] [view]	ts	None
3	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ ts

Simulation Plots

/media/russian/zaliznjak/simulation/insep_plots.png

Simulation details for Russian novel_corpus

This simulation demonstrates that token frequencies are not sufficient for the learner to identify [ts] as an affricate in this corpus of Russian. The dictionaries make for a better data source than this connected speech model.

Input:

This corpus was put together from 13 Russian novels and stories by 11 authors: Gogol, Pushkin, Tolstoy, Turgenev, Dostoevsky, Gertsen, Karamzin, Kologrivova, Kovalevskaya, Rostopchina, and Xvoschinskaya. Each text was transcribed and stripped of word boundaries and punctuation marks by script, and then all the texts were combined into one file. Attention was paid to the function vs. lexical status of orthographic words in the transcription process, but the resulting corpus may still contain mistakes due to its sheer size. For further details, consult GitHub.
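The boundary-removal step can be sketched as follows. This is not the script from the GitHub repository: the function name to_connected_stream, the PUNCTUATION set, the nested-list input format, and the toy transcriptions are all assumptions made for illustration.

```python
# Assumed punctuation tokens; the real scripts may handle a different set.
PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def to_connected_stream(texts):
    """Flatten transcribed texts into one segment sequence with no word boundaries.

    Each text is a list of tokens and each token is a list of segments;
    punctuation marks are tokens whose single segment is in PUNCTUATION.
    """
    stream = []
    for text in texts:
        for token in text:
            stream.extend(seg for seg in token if seg not in PUNCTUATION)
    return stream


# Two hypothetical toy "texts", already transcribed into segments.
texts = [
    [["k", "o", "t"], [","], ["s", "a", "t"]],
    [["tɕ", "a", "j"], ["."]],
]
stream = to_connected_stream(texts)
print(" ".join(stream))  # k o t s a t tɕ a j
```

Once boundaries are gone, the final t of the first toy word and the initial s of the second become adjacent, so a corpus built this way contains segment sequences that a word list does not.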

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	tɕ	None
2	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ

Simulation Plots

/media/russian/novel_corpus/simulation/insep_plots.png

Simulation details for Russian tikhonov

Input:

The data source is Tikhonov's (2002) morphological dictionary of Russian, which can be downloaded in Cyrillic here. Transcription scripts are on GitHub.

LearningData.txt | Features.txt

Summary of iterations:

Iteration	Learning Data produced	Features produced	Inseparability	New Segments added	Segments removed
1	LearningData.txt	Features.txt	[download] [view]	tɕ	None
2	LearningData.txt	Features.txt	[download] [view]	ts	None
3	No new learning data	No new features	[download] [view]	None	None

Summary of inventory changes

Stage	Consonant set
Input	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j
Output	p pʲ b bʲ m mʲ f fʲ v vʲ k kʲ ɡ ɡʲ x ɣ xʲ t tʲ d dʲ s sʲ z zʲ ʂ ʐ ɕ n ʑ nʲ r rʲ l lʲ j tɕ ts

Simulation Plots

/media/russian/tikhonov/simulation/insep_plots.png

Complex Segment Learner
