Creating a Corpus of Child Speech 
by Monolingual and Bilingual Russian Speakers

The BiRCh corpus is a project in progress. The ultimate product will be twofold.

First, we are collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia and Ukraine, Germany, and the U.S. and Canada. We are aiming for data collection over a decade, and hoping that at least a few families from each geographical area will participate for 5-10 years. 
Transcripts of these data, amounting to several million words, time-aligned with the audio speech signal and fully text-searchable, will constitute the "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh Longitudinal)" (Dubinina, Malamud & Denisova-Schmidt).

Second, we are building a 1-million-word corpus based on a subset of this data, the "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (Parsed BiRCh)" (Malamud, Dubinina, Lưu & Xue), with two basic components:

  • Transcripts that are time-aligned with the audio speech signal and fully text-searchable.
  • A part-of-speech tagged and parsed version of the transcripts, also audio-aligned.

NEWS! Our project "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)" has been awarded an NSF grant. The funds will be used to create the 1-million-word corpus and to conduct research on passives, impersonals, and politeness markers in monolingual and bilingual Russian.

Project description

The purpose of this project is to create an online, freely available, audio-aligned parsed corpus of language produced by children acquiring Russian in monolingual and bilingual contexts. Though the language of immigrant communities is often stigmatized and deprecated (even by its speakers), it is of central importance to the cultural identity and practices of these communities, and its study is crucial to understanding the fundamental properties of linguistic knowledge, language acquisition, and language maintenance.

The first and major step in creating a corpus is data collection. Therefore, the main aim of this project is to collect and aggregate longitudinal data on the language development of monolingual children in Russia, Russian-English bilingual children in the US, and Russian-German bilingual children in Germany. The recordings will present a wealth of data on speech phenomena, such as speech rate and intonation, and time-aligning the digital recordings with the transcripts will allow researchers to rapidly find desired parts of the speech signal by searching the transcribed text.
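To illustrate how time alignment makes the audio searchable through the text, here is a minimal sketch in Python. It is not the project's actual tooling or data format: the `Utterance` record, the `find_in_audio` function, the speaker labels, and the toy utterances are all hypothetical and invented for this example. It assumes only that each transcribed utterance carries the start and end time of the corresponding stretch of audio.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One transcribed utterance with its position in the recording (hypothetical format)."""
    speaker: str   # hypothetical speaker label, e.g. "CHI" for the child
    start: float   # start time in seconds within the recording
    end: float     # end time in seconds within the recording
    text: str      # transcribed text of the utterance


def find_in_audio(utterances, query):
    """Return (start, end, speaker, text) for every utterance whose text contains query.

    Matching is case-insensitive, so "мячик" also finds "Мячик".
    """
    needle = query.casefold()
    return [
        (u.start, u.end, u.speaker, u.text)
        for u in utterances
        if needle in u.text.casefold()
    ]


# Invented toy data for illustration only; not taken from the corpus.
transcript = [
    Utterance("MOT", 0.00, 1.80, "Где мячик?"),
    Utterance("CHI", 2.10, 3.40, "Мячик там."),
    Utterance("MOT", 3.90, 5.20, "Молодец!"),
]

# Each hit gives the time span to jump to in the audio file.
for start, end, speaker, text in find_in_audio(transcript, "мячик"):
    print(f"{speaker} {start:.2f}-{end:.2f}s: {text}")
```

Because every hit comes back with its time span, a researcher studying intonation or speech rate can go straight to the relevant stretch of the recording instead of listening through it.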

The second aim of the project is the creation of a grammatically annotated corpus. Such a corpus will serve as a powerful research tool for investigating the grammar of Russian as spoken in Russia and by immigrants in Germany and the U.S., and the different factors influencing language acquisition in monolingual and immigrant bilingual contexts. This resource will ultimately help advance knowledge in the fields of linguistics, language acquisition, and bilingualism.

Research on grammar, meaning, and language use must be based on data that allow researchers to see linguistic structure, meaning, and context. As research in other subfields of linguistics has shown, large collections of language data annotated with information about linguistic structure can bring about major advances. For instance, parsed corpora of historical English (Kroch & Taylor 1999, Taylor et al. 2003, Kroch et al. 2004) led to groundbreaking discoveries about the processes that defined the shape of English today and allowed linguists to gain a greater understanding of the very nature of language change.

An annotated corpus of monolingual and bilingual child speech would provide crucial data for researchers investigating the culture and speech of immigrant and monolingual Russian communities, the development of heritage languages, and language acquisition more generally. It would also supply the necessary information for practitioners developing language materials for heritage learners, for parents raising bilingual children, and for policy makers drafting appropriate rules and procedures.

Citations for the BiRCh project

Dubinina, Irina, Sophia A. Malamud, and Elena Denisova-Schmidt. "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh Longitudinal)". 2013-present. 

Malamud, Sophia A., Irina Dubinina, Alex Lưu, and Nianwen Xue. "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (Parsed BiRCh)". 2017-present.

Collaborators working on the project

MILa faculty: 

  • Irina Dubinina, Associate Professor of Russian, Director of the Russian Language Program, Brandeis University 
  • Sophia Malamud, Associate Professor of Linguistics, Department of Computer Science, Brandeis University 

Other Faculty: 

  • Nianwen Xue, Associate Professor of Linguistics and Computer Science, Department of Computer Science, Brandeis University 

Research Assistants, Brandeis: 

  • Alex Lưu, PhD student: computational tools; corpus linguistics, pragmatics, and computational linguistics research, including converting SynTagRus into a Penn Treebank-style syntactically parsed corpus of Russian 
  • Benjamin Rozonoyer, undergraduate student: morphological and syntactic annotation, automatic parsing, segmentation/sentence tokenization, transcription, adjudication 
  • Masha Shaposhnikova, undergraduate student: transcription, adjudication, research of disfluencies 

Other Research Assistants:

  • Elena Bogomolova, Viktoria Bulavina, Alina Feklina, Yaroslava Fedorova, Olga Ivchenko, Hanna Komar, Alina Korovatskaya, Masha Kruk, Dmitry Leschiner, Vladislava Merkalova, Ekaterina Mironova, Dayana Mostovaya, Olena Prusikin, Ilya Rozonoyer, Olga Shtan', Anna Tarasova, Yan Shneyderman: transcription, disfluency annotation, checking, and pseudonymization
  • Ekaterina Mironova: morphological and syntactic annotation, segmentation/sentence tokenization, transcription, adjudication
  • Pavel Koval: morphological and syntactic annotation, segmentation checking 


This project is supported by 

  • A Leonardo da Vinci, EU grant to Elena Denisova-Schmidt [project BILIUM], 09/2012 - 07/2014 

  • Theodore and Jane Norman Award, Brandeis University to Irina Dubinina, summer 2014 

  • Provost Research Grant, Brandeis University to Sophia Malamud and Irina Dubinina, 07/2015 - 07/2016 

  • The Faculty Grant from the Mandel Foundation for Humanities to Sophia Malamud and Irina Dubinina, 01/2016 - 12/2017 

  • Provost Research Grant, Brandeis University to Irina Dubinina, 07/2016 - 07/2017 

  • Brandeis Dean of Arts and Sciences Collaborative Faculty-Student research award to Sophia Malamud, Irina Dubinina, Masha Shaposhnikova, Yan Shneyderman, spring 2017 

  • National Science Foundation Award BCS-1651083 to Sophia Malamud (PI), Irina Dubinina (co-PI), Nianwen Xue (co-PI), 08/2017 - 05/2022

  • Theodore and Jane Norman Award, Brandeis University to Sophia Malamud, 2018