Standard Russian (Indo-European)
Russian has its own section in the paper as an example of a language where clear phonotactic arguments for affricates are lacking but statistical distributions are informative. Russian affricates are etymologically derived from dorsal and coronal stops, and they are frequent enough in the language that the learner identifies them without narrow transcriptions. In addition to the two dictionary simulations discussed in the paper, we also tried a simulation on a different data set: a mock-up of a transcribed connected-speech corpus, created by transcribing a set of Russian novels and removing the word boundaries. The main lesson of this exercise is that the learner is sensitive to the type/token frequency distinction: it does not learn the same inventories when trained on words as when trained on connected speech. We do not know which of the simulations produces the "right" result, however, since we do not know which representations Russian speakers have for these sounds.
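The type/token distinction above can be illustrated with a toy count: a dictionary contributes each word once (type frequencies), while connected speech weights each word by how often it occurs (token frequencies). A minimal sketch, with invented words and counts purely for illustration:

```python
from collections import Counter

def segment_counts(words):
    """Count segment (character) frequencies over a list of words."""
    counts = Counter()
    for word in words:
        counts.update(word)
    return counts

# Toy "connected speech": one word is much more frequent than the other.
corpus_tokens = ["tsar", "tsar", "tsar", "sok"]
# Toy "dictionary": each word type appears exactly once.
dictionary_types = sorted(set(corpus_tokens))

token_freqs = segment_counts(corpus_tokens)
type_freqs = segment_counts(dictionary_types)

# The frequent word inflates token counts but not type counts.
print(token_freqs["t"], type_freqs["t"])  # 3 vs. 1
```

A learner whose statistics are computed over the first list can arrive at a different segment inventory than one trained on the second, which is the contrast the two simulations probe.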
Simulation data at a glance
Click on simulation name to view additional simulation details.
| Simulation name | Learning data | Initial state features |
Simulation details for Russian novel_corpus

This simulation demonstrates that token frequencies are not sufficient for the learner to identify [ts] as an affricate in this corpus of Russian. The dictionaries make for a better data source than this connected-speech model.
This corpus was put together by taking 13 Russian novels and stories by 11 authors: Gogol, Pushkin, Tolstoy, Turgenev, Dostoevsky, Gertsen, Karamzin, Kologrivova, Kovalevskaya, Rostopchina, and Xvoschinskaya. Each text was transcribed, its word boundaries and punctuation marks were removed by a script, and all the texts were then combined into one file. Attention was paid to the function vs. lexical status of orthographic words in the transcription process, but there may still be mistakes in the resulting corpus due to its sheer size. For further details, consult GitHub.
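The preprocessing described above (stripping punctuation and word boundaries, then concatenating the texts) can be sketched as follows. This is a simplified illustration, not the project's actual script; the function names are our own, and the transcription step itself is not shown.

```python
def strip_boundaries(text):
    """Remove punctuation, digits, and whitespace, leaving a
    continuous string of segments with no word boundaries."""
    return "".join(ch for ch in text if ch.isalpha())

def build_corpus(transcribed_texts):
    """Concatenate the boundary-stripped texts into one string,
    standing in for the single combined corpus file."""
    return "".join(strip_boundaries(t) for t in transcribed_texts)

# Toy input standing in for transcribed text files.
print(build_corpus(["na beregu,", "pustynnyx voln"]))
# -> naberegupustynnyxvoln
```

Since the learner never sees word boundaries in this corpus, any segment statistics it computes are over running speech rather than over isolated dictionary entries.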