Creating a Corpus of Child Speech 
by Monolingual and Bilingual Russian Speakers

The BiRCh corpus is a project in progress. The ultimate product will be twofold.

First, we are collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia and Ukraine, Germany, and the U.S. and Canada. We are aiming for data collection over a decade, and hoping that at least a few families from each geographical area will participate for 5-10 years. 
Transcripts of this data, amounting to several million words, time-aligned with the audio speech signal and fully text-searchable, will constitute the "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh Longitudinal)" (Dubinina, Malamud & Denisova-Schmidt).

Second, we are building a 1-million-word corpus based on a subset of this data, the "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (Parsed BiRCh)" (Malamud, Dubinina, Lưu & Xue), with two basic components:

  • Transcripts that are time-aligned with the audio speech signal and fully text-searchable.
  • A part-of-speech tagged and parsed version of the transcripts, also audio-aligned.

NEWS! Our project "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)" has been awarded an NSF grant. The funds will be used to create the 1-million-word corpus and to conduct research on passives, impersonals, and politeness markers in monolingual and bilingual Russian.

Project description

The purpose of this project is to create an online, freely available audio-aligned parsed corpus of language produced by children acquiring Russian in monolingual and bilingual contexts. Though the language of immigrant communities is often stigmatized and deprecated (even by its speakers), it is of central importance to the cultural identity and practices of these communities, and its study is crucial to understanding the fundamental properties of linguistic knowledge, language acquisition and maintenance.

The first and major step in creating a corpus is data collection. Therefore, the main aim of this project is to collect and aggregate longitudinal data on language development of monolingual children in Russia, Russian-English bilingual children in the US, and Russian-German bilingual children in Germany. The recordings will present a wealth of data on speech phenomena, such as speech rate and intonation, and time-aligning the digital recordings with the transcripts will allow researchers to rapidly find desired parts of the speech signal by searching the transcribed text.
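As a rough illustration of what time alignment makes possible, the sketch below shows how a text search over transcribed utterances can return the matching stretches of a recording. It is not the project's actual search engine or data format: the Utterance record, the find_in_audio function, the speaker codes, and the sample utterances are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str    # hypothetical speaker code, e.g. "CHI" for the child
    text: str       # transcribed words
    start_s: float  # where the utterance starts in the recording, in seconds
    end_s: float    # where the utterance ends, in seconds


def find_in_audio(utterances, query):
    """Return (start_s, end_s, speaker, text) for each utterance containing query."""
    needle = query.casefold()  # case-insensitive match, also for Cyrillic
    return [
        (u.start_s, u.end_s, u.speaker, u.text)
        for u in utterances
        if needle in u.text.casefold()
    ]


# Invented sample, not taken from the corpus.
transcript = [
    Utterance("MOT", "где мячик?", 12.40, 13.85),
    Utterance("CHI", "мячик там", 14.10, 15.02),
    Utterance("MOT", "молодец!", 15.30, 16.00),
]

hits = find_in_audio(transcript, "мячик")
for start_s, end_s, speaker, text in hits:
    print(f"{start_s:.2f}-{end_s:.2f}s {speaker}: {text}")
```

Each hit carries the start and end offsets of the utterance, so a researcher could jump straight to that part of the audio rather than listening through the whole recording.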

The second aim of the project is the creation of a grammatically annotated corpus. Such a corpus will serve as a powerful research tool for investigating the grammar of Russian as spoken in Russia and by immigrants in Germany and the U.S., and the different factors influencing language acquisition in monolingual and immigrant bilingual contexts. This resource will ultimately help advance knowledge in the fields of linguistics, language acquisition and bilingualism.

Research on the grammar, meaning and use of language must be based on data that allows researchers to see linguistic structure, meaning, and context. As research in other subfields of linguistics has shown, large collections of language data annotated with information about linguistic structure can bring about major advances. For instance, parsed corpora of historical English (Kroch & Taylor 1999, Taylor et al. 2003, Kroch et al. 2004) led to groundbreaking discoveries about the processes that defined the shape of English today and allowed linguists to gain a greater understanding of the very nature of language change.

An annotated corpus of monolingual and bilingual child speech would provide crucial data for researchers investigating the culture and speech of immigrant and monolingual Russian communities, the development of heritage languages, and language acquisition more generally. It would also supply the necessary information for practitioners developing language materials for heritage learners, for parents raising bilingual children, and for policy makers drafting appropriate rules and procedures.

Citations for the BiRCh project

Dubinina, Irina, Sophia A. Malamud, and Elena Denisova-Schmidt. "Audio-aligned Longitudinal Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh Longitudinal)". 2013-present. 

Malamud, Sophia A., Irina Dubinina, Alex Lưu, and Nianwen Xue. "Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (Parsed BiRCh)". 2017-present.

Collaborators working on the project

MILa faculty: 

  • Sophia Malamud (PI), Associate Professor of Linguistics, Department of Computer Science, Brandeis University
    • main contact for the project: email at smalamud AT brandeis DOT edu 
  • Irina Dubinina, Associate Professor of Russian, Director of the Russian Language Program, Brandeis University 

Other Faculty: 

  • Nianwen Xue, Associate Professor of Linguistics and Computer Science, Department of Computer Science, Brandeis University 

Research Assistants (*current team)

Brandeis students:

  • *Alex Lưu, PhD student: computational tools; corpus linguistics, pragmatics, and computational linguistics research, including converting SynTagRus into a Penn Treebank-style syntactically parsed corpus of Russian 
  • Parker Glenn, graduate student: deploying corpus search and visualisation engine
  • Dina Gorelik, undergraduate student: authorship annotation
  • Miriam Gölz, graduate student: transcription and annotation of German data
  • Bailey Johnson, graduate student: search interface evaluation
  • Erin Magill (REU), undergraduate student: syntactic annotation, research 
  • Dina Millerman, undergraduate student: authorship annotation
  • *Berta Muza (REU), undergraduate student: translating morphological guidelines, data preparation
  • *Ruth Rosenblum, graduate student: morphological and syntactic annotation, automatic parsing, segmentation/sentence tokenization, research
  • Benjamin Rozonoyer, undergraduate student: morphological and syntactic annotation, automatic parsing, segmentation/sentence tokenization, transcription, adjudication, research 
  • Keren Ruditsky (REU), undergraduate student: syntactic annotation, research
  • Masha Shaposhnikova, undergraduate student: transcription, adjudication, research
  • Yan Shneyderman, undergraduate student: transcription, disfluency annotation, checking, and pseudonymization, research
  • *Sasha Soboleva, undergraduate student: data preparation, research
  • *Anastasiia Tatlubaeva, graduate student: syntactic annotation

Other REU participants

  • Yana Miroshnychenko: data preparation, research
  • Yulia Zaborna: transcription, annotation, segmentation

Other Research Assistants

  • Lena Antsupova, Elena Bogomolova, Viktoria Bulavina, Kristina Bush, Natalia Dubinina, Alina Feklina, Yaroslava Fedorova, *Olga Ivchenko, Yaroslava Kharenko, Emma Kisselev, Hanna Komar, Alina Korovatskaya, Masha Kruk, Dmitry Leschiner, Vladislava Merkalova, Ekaterina Mironova, Dayana Mostovaya, Galina Paquette, Virginia Partridge, Olena Prusikin, Kristina Ragimova, Ilya Rozonoyer, Valiantsina Sokalava, Rossina Soyan, *Olga Shtan', Anna Tarasova, Maryana Zhelesnyak: transcription, disfluency annotation, checking, and pseudonymization
  • Ekaterina Mironova: morphological and syntactic annotation, segmentation/sentence tokenization, transcription, adjudication
  • *Pavel Koval: syntactic parsing team lead, morphological and syntactic annotation, segmentation checking 


This project is supported by 

  • A Leonardo da Vinci EU grant to Elena Denisova-Schmidt [project BILIUM], 09/2012 - 07/2014

  • Theodore and Jane Norman Award, Brandeis University to Irina Dubinina, summer 2014 

  • Provost Research Grant, Brandeis University to Sophia Malamud and Irina Dubinina, 07/2015 - 07/2016 

  • The Faculty Grant from the Mandel Foundation for Humanities to Sophia Malamud and Irina Dubinina, 01/2016 - 12/2017 

  • Provost Research Grant, Brandeis University to Irina Dubinina, 07/2016 - 07/2017 

  • Brandeis Dean of Arts and Sciences Collaborative Faculty-Student research award to Sophia Malamud, Irina Dubinina, Masha Shaposhnikova, Yan Shneyderman, spring 2017 

  • National Science Foundation Award BCS-1651083 to Sophia Malamud (PI), Irina Dubinina (co-PI), Nianwen Xue (co-PI), 08/2017 - 05/2022

  • Theodore and Jane Norman Award, Brandeis University to Sophia Malamud, 2018

  • We gratefully acknowledge support from the Center for German and European Studies at Brandeis University, which is supported by the German Academic Exchange Service (DAAD) with funds from the German Federal Foreign Office (Auswärtiges Amt).