Promoting the use of Catalan in the digital age

Aina's infrastructure

The Aina project is coordinated by the Barcelona Supercomputing Center (BSC-CNS). The center is a leading institution in the field of supercomputing and hosts the MareNostrum 5 supercomputer, which enables massive data processing and the execution of advanced, innovative language models.


The supercomputer has a capacity of 314 petaflops and is located on the UPC Campus Nord (Barcelona Tech). The pre-exascale power of MareNostrum 5 is 100 times greater than that of MareNostrum 4, a breakthrough that offers a great opportunity for running language resources and technologies that require high-performance computing.

Data management, a collective effort

To develop linguistic resources, Aina works with text and speech data from multiple sources. The project collaborates closely with entities from the language community and other sectors to collect and process large amounts of data.

Language models are key to the development of new applications. Aina works to generate and update these models, whether they are monolingual, multilingual, or multimodal.

The project also helps implement and include Catalan modules and libraries in reference environments and platforms, guaranteeing proper coverage of the language.
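As an illustration of this kind of integration, the sketch below loads a Catalan pipeline in spaCy, a widely used reference NLP library that ships Catalan models. The specific pipeline name (ca_core_news_sm) is an assumption about the installed package, not something stated in this text.

    # Minimal sketch: running a Catalan NLP pipeline in spaCy.
    # Assumes the pipeline has been installed first:
    #   pip install spacy
    #   python -m spacy download ca_core_news_sm
    import spacy

    nlp = spacy.load("ca_core_news_sm")
    doc = nlp("El projecte Aina impulsa el català en l'era digital.")

    # Print each token with its part of speech and lemma.
    for token in doc:
        print(token.text, token.pos_, token.lemma_)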


Join Common Voice and be part of the effort to position Catalan as one of the languages with the most resources available in the digital sphere.
Prominent organizations such as Google have drawn on the data collected in these repositories to train language models such as PaLM 2.
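For developers, the Catalan portion of Common Voice can also be explored programmatically. The sketch below uses the Hugging Face datasets library; the exact dataset ID and version (mozilla-foundation/common_voice_13_0) is an assumption, since the corpus is versioned and gated behind a terms-of-use acceptance.

    # Hedged sketch: streaming the Catalan subset of Common Voice.
    # The dataset ID/version below is an assumption; the corpus is gated,
    # so you must accept its terms on the Hugging Face Hub and log in
    # (huggingface-cli login) before loading it.
    from datasets import load_dataset

    cv_ca = load_dataset(
        "mozilla-foundation/common_voice_13_0",  # assumed version
        "ca",                                    # Catalan configuration
        split="train",
        streaming=True,  # avoid downloading the full corpus up front
    )

    # Inspect one sentence from the stream.
    sample = next(iter(cv_ca))
    print(sample["sentence"])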

Construction of pioneering models

Training linguistic models is a fundamental step toward building applications based on artificial intelligence. For this reason, Aina aims to train models that lead to progress in both safety and response effectiveness.

Building the models known as large language models (LLMs) is a progressive process that allows the creation of new models to evolve exponentially, reducing the cost and resources needed to train new, optimal models. One of the starting points is the Ǎguila 7B model, trained on a total of 26B tokens and comprising 7B parameters. It is the latest open-source LLM released by the researchers of the Language Technologies Unit of the Barcelona Supercomputing Center (BSC).

Figure 1: Example of a single-turn instruction with Ǎguila

The Flor 6.3B model is available on Hugging Face.

The Ǎguila 7B model is available on Hugging Face.
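Both checkpoints can be loaded with the Hugging Face transformers library. The sketch below assumes the repository IDs projecte-aina/FLOR-6.3B and projecte-aina/aguila-7b, which are not spelled out in this text; check the projecte-aina page on the Hub for the exact names.

    # Hedged sketch: generating Catalan text with one of the Aina models.
    # Repository IDs are assumptions; some checkpoints may additionally
    # require trust_remote_code=True when loading.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="projecte-aina/FLOR-6.3B",  # or "projecte-aina/aguila-7b"
        device_map="auto",  # place weights on available GPUs, if any
    )

    prompt = "El projecte Aina treballa per"
    outputs = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
    print(outputs[0]["generated_text"])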

The design of the multilingual LLM makes it possible to boost the artificial intelligence and natural language processing sector in Catalonia: it helps optimize processes and internationalize products, improves the services of public administrations, gives citizens access to digital content and training in Catalan, and promotes the exchange of technology and knowledge between researchers and the wider scientific community.
