The Aina Project shared the latest technological advances during this year’s edition of LREC-Coling 2024. The event took place between May 20 and 25 in Turin, Italy. Among the various developments, researchers from the Language Technologies Unit of the Barcelona Supercomputing Center have presented the most advanced AI resources published in the open through presentations and posters. A fact that reflects leadership in the field of natural language processing from open data and in Catalan.

Presentation of data

One of the works, led by Aitor González-Agirre, is entitled “A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages. A work that proposes an innovative methodology for the extraction of training data for languages that do not have a great support of digital resources. The resulting CATalog dataset is an example of a high-quality dataset optimal for pretraining language models. Resources that reinforce accuracy and ensure a more accurate representation of the language.

Researchers Javier Saiz and Ferran Espuña present the CATalog project

Researchers Javier Saiz and Ferran Espuña present the CATalog project

Regarding data, the latest results of the Common Voice Initiative, led by researcher Carme Armentano, have also been presented. The document “Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus” shows the success of the collective campaign promoted, in part, also through the Aina Project. The publication summarizes the key milestones achieved, but also highlights the challenges ahead, in order to achieve more accuracy and have more information about recorded voices.

BSC researcher Carme Armentano presents the results of the Common Voice initiative

BSC researcher Carme Armentano presents the results of the Common Voice initiative

Presentation of models

During the LREC-Colling 2024, some of the models generated by Aina were also presented, such as the Flor 6.3B. An operating model that was presented last December and that accumulates thousands of downloads. The publication “FLOR: On the Effectiveness of Language Adaptation discusses the improvements of model pre-understanding with vocabulary adaptation compared to other techniques used for model pre-understanding of language technologies.

Researchers Severino Da Dalt and Joan Llop present the Flor model

Researchers Severino Da Dalt and Joan Llop present the Flor model

These are just some of the works present at the event. A key space for sharing knowledge and synergies between the international scientific community. Other projects have also been tackled, such as the construction of a data infrastructure for Catalan or the analysis of gender bias in automatic translation. The Aina Project aims to promote the presence of Catalan in the digital age. At the same time, with the infrastructure developed so far, the project becomes a reference for initiatives in other countries and for international operators interested in promoting non-global languages. The memorandum of understanding signed by the BSC with the GSMA, Veon and Beeline in the framework of the Aina Challenge is an example.

27 de May de 2024 | Scientific news |