The Common Voice corpus: optimism and pending challenges

The publication of the latest Common Voice dataset confirms the consolidation of Catalan as the language with the most recorded and validated hours on the platform. This is very good news for the Catalan linguistic community and for the development of AI tools in Catalan. The substantial growth in resources is cause for optimism, but it also reveals some pending challenges. The availability of speech data makes it possible to improve the quality and size of language models. This is the case of the text-to-speech model TTS CA Coqui Vits Multispeaker, which draws on the Common Voice V12 dataset, among other datasets. The new Common Voice V17 dataset now totals 3,500 hours, of which 75% are validated. These data come from the study led by Barcelona Supercomputing Center researcher Carme Armentano and presented at LREC-COLING 2024. Below are the key data on the presence of Catalan in the Common Voice initiative.

The Common Voice initiative, in data

The identification of the speakers and the contributions

The Common Voice platform is a key enabler of collaborative work to collect open-source, multilingual voice data. Even so, the data show that 30% of the contributed expressions are not accompanied by demographic data about the users. This poses a major challenge when analyzing and exploiting the voice corpus derived from contributions to Common Voice. In addition, as the researchers point out in "Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus", users have shown a greater willingness to record their voices than to validate the recorded fragments.

For this reason, incorporating textual data in the form of sentences becomes one of the crucial tasks of the process. The required validation of all contributions further increases this need, since it eliminates almost 89% of the expressions collected. However, the data provided reveal an important demographic bias, not only in the predominance of male voices but also in the speakers' origin. Even in the latest dataset, most contributions corresponded to the central dialect, while the rest did not even amount to 20% of the contributions. These factors, together with the legal restrictions on contributions, end up becoming challenges for the consolidation of Catalan as a language with a significant number of digital resources.

30 May 2024 | Scientific news