Infrastructures of Aina
The Aina project is coordinated by the Barcelona Supercomputing Center (BSC-CNS). This center is a leading institution in the field of supercomputing and hosts the MareNostrum 5 supercomputer, which enables massive data processing and the execution of advanced, innovative language models.
The supercomputer has a capacity of 314 petaflops and is located in the UPC-Campus Nord-Barcelona Tech area. The pre-exascale power of MareNostrum 5 is 100 times that of MareNostrum 4. This leap offers a great opportunity for running language resources and technologies that require high-performance computing.
Data management, a collective effort
To develop linguistic resources, Aina works with text and voice data from multiple sources. The project collaborates closely with entities from the language community and other sectors to collect and process large amounts of data.
Language models are key to the development of new applications. Aina works to generate and update these models, whether monolingual, multilingual, or multimodal.
The project also helps implement and include Catalan modules and libraries in reference environments and platforms to ensure adequate coverage of our language.
Join Common Voice and be part of the effort to position Catalan as one of the languages with the most resources available in the digital sphere.
Prominent organizations such as Google have fed their language models, such as PaLM 2, with data collected in these repositories.
Construction of pioneering models
Training language models is a fundamental step in building applications based on artificial intelligence. For this reason, Aina's goal is to train models that advance both safety and response effectiveness.
Developing the models known as large language models (LLMs) is a progressive process: each generation makes it possible to evolve rapidly in creating new models, reducing the cost and resources needed to train new, optimal ones. One starting point is the Aguila 7b model, trained on a total of 26B tokens and built with 7B parameters. It is the latest preliminary open-source LLM released by researchers in the Language Technologies Unit of the Barcelona Supercomputing Center (BSC).
The design of the multilingual LLM makes it possible to boost the artificial intelligence and natural language processing sector in Catalonia. It also helps optimize processes and internationalize products, improve the services of public administrations, give citizens access to Catalan-language content in the digital and training spheres, and promote the exchange of technology and knowledge between researchers and the scientific community.