Las Vegas 2025

National LLMs – Towards Moderate-Size Models That Are Focused on the Local Culture and Language

Generative Artificial Intelligence (GAI) has revolutionized information technologies, and Large Language Models (LLMs) are now at the frontier of this revolution. LLMs enable us to boost our systems and the way we process data. However, they also present significant challenges, including legal issues (e.g. copyright and related rights, especially crucial in the EU), transparency (we know very little about what the instruction training sets of most LLM vendors look like), and the non-uniform distribution of languages in training sets (most LLMs are trained mainly on English and Chinese texts; other languages are in the minority).


Large language models show high proficiency in processing and generating multilingual texts. However, language and culture understanding (including historical events, traditions, folklore, literature, and pop culture) is a more complex problem, much harder than simple text analysis. A lack of such knowledge can lead to misinterpretations and subtle, hard-to-detect errors. There is a need for moderate-sized LLMs focused on local culture and history.


In my talk, I will present how we built local LLMs in Poland that are trained mainly on organic (human-written) instructions and how we tune our models to focus on regional and national aspects, such as local history and culture.

Full transcript

The complete talk, organized by section.

Marek Kozłowski

Okay. Hello everyone. I know the last days of the conference are usually tough days. The rooms are usually empty or almost empty.

First of all, I noticed that most of the speakers at ETLS are from North America. I am from the European Union, from the country called Poland. What am I doing here? First, we have an important competition called IT Manager of Tomorrow. The people who won the main award, who got to the final and got the award, received seven days in Las Vegas plus the chance to speak at ETLS.

My topic is national LLMs. This is the topic that I have worked on for many years, and now there are products from this work. I will talk a little bit about that.

First of all, I am the head of AI Lab. It is a very small research department, about 10 people, and we focus on two topics. First, LLMs, but LLMs focused on the Polish language and Polish culture; that means localized LLMs. Of course, localized LLMs are still natural language processing. Plus, we do computer vision for huge factory halls.

We work for the biggest companies in Southeastern Europe. We work for the biggest bank in Central and Eastern Europe, for the biggest recycling company, for different public offices, and for the French government in the Alliance for Language Technologies.

What is crucial is that many, many companies, or currently maybe all companies, are deploying or planning to deploy LLMs. There are strategic decisions that have to be made when you try to deploy LLMs. First, whether to use cloud-based LLMs or on-premises LLMs. Whether to use text-only, cheaper models, or multimodal models. How large a model is good enough. Maybe it is better to take a smaller model to have a better cost factor, because the efficiency is good enough. Some business values can be reached without using the biggest LLMs.

Another question is whether to use models in a few-shot mode, the easiest one, prompt engineering, or whether you have the people, abilities, and data sets to fine-tune the models to become better in some specific niche tasks. The last question, which is the topic of today, is whether to choose multilingual models or single-language models: broad language coverage or specialization.

01 Localized Models

Localized models: are they worth it?

Nowadays, most of the LLMs from Anthropic, OpenAI, Google, and many others are generative and multilingual. However, you have to remember that neural language models are not only generative models. Neural language models are divided into two main categories: representation models and generative models.

Representation models are like the BERT models. Some years ago they were called the first LLMs, the BERT models. Of course, now they are relatively small, but they are still language models. As I mentioned, neural language models have two categories. The representation models are orders of magnitude smaller than the generative ones, but very good in classification and regression tasks: BERT, RoBERTa, and many others based on BERT.

Generative models are the artifact of the current revolution. They are decoder-only plus encoder-decoder models. Decoder-only models are now more popular because they scale better than the previous encoder-decoder architecture. But the previous architecture is also very good, and you have to remember that there are still places where it is worth deploying it.

Localized models, Polish localization or any other language localization, means improving understanding of the language and creating high-quality content in this language.

For example, in my department we build our own representation models. We had Polish RoBERTa, meaning we created our RoBERTa models focused on the Polish language in a classical context length and also in an expanded context. These models are better than multilingual ones like multilingual modern BERT or EuroBERT. It shows that building specialized models gives you better efficiency, of course in some specialized domain. For example, in the domain of banking language or the Polish language, you have to specialize to some niche to be better than the multilingual models that are more general.

We also took part in national consortium projects where we built generative models for the Polish language. They range from 8 billion to 70 billion parameters. What is crucial: maybe they are not so broad in instruction skills as the OpenAI and Anthropic ones, but generally they are better in producing content in a high-quality manner in the case of the Polish language.

When you compare even the cloud-based LLMs and ask them to write an email in the Polish language, and compare this email written by general-purpose cloud-based LLMs to our models, our models are able to produce better content in the dimension of linguistic quality. One example is that when you use a general-purpose LLM and ask it to write an email in Polish, usually it has under-the-hood transfer learning. It translates some statements on the fly from English to Polish. For example, a popular English statement in emails is, "I hope you stay in good health." When you translate this on the fly into Polish, there is a statement that seems weird, something that is not normal for Polish people.

Usually when you use multilingual models, there is always the risk that those localizations on the fly, based on transfer learning under the hood, will not be adjusted to the cultural, historical, or language conditions.

Why do multilingual models suffer from not speaking fluently in every language? It is a question of data sets. Ninety percent of the data sets used by OpenAI, Google, and others are English and Chinese. Only ten percent is other languages. For example, the Polish language is about a per mille of this corpus, roughly one part in a thousand.

Of course, all our models are on Hugging Face. If we produce models for banks, the banks have accounts on Hugging Face. If we produce models for the Ministry of Digital Affairs in Poland, the ministry has a Hugging Face account. All the models we produce are open source and are released on Hugging Face.

02 Why Build Polish LLMs

Now there is the idea in Poland to create our Polish large language models, in this case the generative ones. What are the reasons why we decided to produce our own models?

Because AI should speak not only English and Chinese fluently. As I mentioned, 90% of data sets are English and Chinese, and this imbalance causes the models not to be good in other languages, especially niche languages. For example, languages in countries that are not as huge as the United States, the English world, the Chinese world, or the Spanish world suffer from this imbalance.

Because AI should be easily accessible to national sectors where privacy and security are crucial. There are some sectors where we are not able to use clouds, even if the cloud LLMs are very good and can be used as they are. There are constraints, for example privacy and security. You have to create on-premises deployments.

If you create on-premises deployments, you always have the trade-off between cost and efficiency. When you deploy models on on-premises infrastructure, there is usually a problem with the number of GPUs used, for example because of energy factors or even buying these GPUs. You can of course use DeepSeek, which seems to be somehow similar quality to the OpenAI models. But even if you take DeepSeek V3 or DeepSeek R1, it is about 600 billion parameters. To deploy such a model, you have to use 16 Hopper H100 or H200 GPUs, and even that is only to launch the model. If you have to create more scalability, you have to use 32 GPUs. Most companies are not able to buy such on-premises configurations.

Therefore, there is a need, a huge demand in small and medium factories, to deploy models sized so they are able to run on two or four GPUs, not more.
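A rough sketch of the arithmetic behind this sizing argument. It assumes 16-bit weights (2 bytes per parameter) and 80 GB of memory per GPU, neither of which is stated in the talk, and it counts the weights only, so real deployments need extra room for the KV cache and activations. The function names are illustrative.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: float,
                         gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed setup: 16-bit weights (2 bytes each) and 80 GB per GPU.
for name, params in [("8B", 8e9), ("70B", 70e9), ("600B", 600e9)]:
    gpus = min_gpus_for_weights(params, bytes_per_param=2, gpu_memory_gb=80)
    size = weight_memory_gb(params, 2)
    print(f"{name}: {size:.0f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions a 600-billion-parameter model needs 15 GPUs for the weights alone, which is in line with the 16 mentioned above once any working memory is added, while a 70-billion model fits on two.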

Conversational abilities involve not only words and sentences. Very often in our conversations, historical and cultural aspects are also very important and should be taken into account.

The last statement is a very American one: we can sometimes do more than we think. If you can do it from scratch, if you can build your models from scratch, you have abilities and technical skills that you can leverage in the future.

Polish is not only the Polish language. It is not only the words. There are people, historical events, even pierogi, popular in the USA. There are writers, sports celebrities, and many other issues. All these cultural and historical issues influence how we converse with models and how comfortable we are when we talk with models.

03 PLLuM

Now a little bit about our generative models called PLLuM.

We did not start from scratch. That means we did not start pre-training from random weights. Why? Because if you would like to create a model from scratch, starting from random weights, you have to get a huge number of tokens in your corpora for pre-training. For example, if you would like to create a stable, moderate-quality, or maybe good-quality model, you need at least 1 trillion tokens in the corpus. One trillion tokens is the threshold to create models from scratch.

In our case, in Poland, after a one-year project we were able to gather between 300 and 400 billion tokens. As I mentioned, to be able to create a model from scratch, you need at least 1 trillion tokens. That means it is not enough to create quality models with such a number of tokens. We changed the decision and started from existing models, of course multilingual open-source ones, and started language adaptation. That means we used continuous pre-training. Using our data, we continuously pre-trained existing open-source models.

Remember that the last training is the freshest. The model remembers the data that it saw last more comfortably than the previous data.

The huge advantage of our model is that we used mostly organic Polish-written instructions and preferences. When you create models, you have three stages. First is pre-training. It can be pre-training from scratch or pre-training from an existing checkpoint, called continuous pre-training. You teach the model the language, like a child. You teach the model to understand the language, words, sentences, and some basic information.

After this stage is fine-tuning. In fine-tuning, we use instructions. We teach the model how to resolve problems: write a poem, compute a mathematical formula, and so on. It is like a child at school. There are subjects like geography, math, and physics. You teach children to behave in such a manner, for example to resolve complex mathematical formulas. That is fine-tuning.

After fine-tuning is stage number three: alignment. In OpenAI, there is reinforcement learning from human feedback, but there are different approaches, such as DPO. Having some preferences, you can teach the model to behave better, to behave more efficiently, to be more secure.
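To make the DPO option concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is the textbook formula, not necessarily the exact setup used for PLLuM. The inputs are assumed to be summed log-probabilities of the whole chosen and rejected answers under the model being trained (policy) and under a frozen reference model; the default `beta` value is an arbitrary choice for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of answers to the same prompt."""
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen answer: the loss is lower.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the chosen answer and lowers the rejected one relative to the reference, which is how the preference data steers the model without a separate reward model.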

The three stages are: pre-training, where you learn the language, words, sentences, grammar, and pieces of information; supervised fine-tuning, where you teach the model instructions and how to resolve different downstream tasks; and alignment, where you learn on preferences how to do it better.
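The three stages differ mainly in what one training record looks like. The sketch below shows one plausible shape for each; the class names, field names, and the Polish sample sentences are made up for illustration and are not taken from the PLLuM data sets.

```python
from dataclasses import dataclass


@dataclass
class PretrainingDocument:
    """Stage 1: raw text; the model learns to predict the next token."""
    text: str


@dataclass
class InstructionExample:
    """Stage 2: supervised fine-tuning pair, here written by a human annotator."""
    instruction: str
    response: str


@dataclass
class PreferenceExample:
    """Stage 3: alignment data; two answers to one prompt, ranked by a human."""
    prompt: str
    chosen: str
    rejected: str


STAGE_BY_TYPE = {
    PretrainingDocument: "pre-training",
    InstructionExample: "supervised fine-tuning",
    PreferenceExample: "alignment",
}


def stage_of(record: object) -> str:
    """Return the training stage a record belongs to."""
    return STAGE_BY_TYPE[type(record)]


# Illustrative records only.
records = [
    PretrainingDocument(text="Warszawa jest stolicą Polski."),
    InstructionExample(
        instruction="Napisz krótki e-mail z podziękowaniem za spotkanie.",
        response="Dzień dobry, dziękuję za dzisiejsze spotkanie.",
    ),
    PreferenceExample(
        prompt="Zacznij e-mail do klienta.",
        chosen="Dzień dobry, dziękuję za spotkanie.",
        # A literal calque of an English opener, grammatical but unnatural.
        rejected="Mam nadzieję, że pozostaje Pan w dobrym zdrowiu.",
    ),
]

for record in records:
    print(stage_of(record), "->", type(record).__name__)
```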

What is very popular currently is that most companies use very big models and distill these models, creating instructions. You ask DeepSeek or other models: create me 1,000 instructions for such a task, create me 10,000 instructions for such a task. It is called distillation. We distill the very large models, create instructions the same way they do, and train the smaller models on them.

But usually in this approach, when you perform distillation from very big models, there is a problem of counter-adaptation. Counter-adaptation means that after pre-training on our high-quality Polish text, our model speaks better in Polish. But after we show this model instructions generated by other models, which are not very good at Polish, we roll back this ability to be fluent in Polish. Showing the model synthetic instructions causes the model to be counter-adapted.

We decided to write Polish-written organic instructions and preferences. There is very little information on the internet about OpenAI instruction corpora, Anthropic instruction corpora, or even Mistral, because it is business value or intellectual property. I think if they showed them, they would lose the advantage. I am almost sure that most of the companies that are now at the frontier of the AI revolution have their own organic instructions written by hundreds or thousands of people in Kenya, Bangladesh, the USA, and other countries, because only organic data sets give you high-quality models.

The last point is that our models are not fully capable instruction models. They cover hundreds of instruction types, whereas the OpenAI models, for example, cover thousands. However, each of them provides a relevant advantage. For example, email generation: when we would like to send one million emails to our citizens, it is much easier for us to generate them from our models than using other models that are not good at the Polish language.

As I mentioned, high quality and fluency in the Polish language give you the advantage when you create content in Polish, for example emails, summaries, reports, and so on. If you talk about analytical issues, for example comparing documents or extracting information, you can use multilingual models because it is enough to get a sufficient solution.

When you teach language models, models first learn how to understand, next how to write. It is the same as people. When we learn a new language, first we are able to hear and listen, next we are able to speak and write. The same is true in models. It is only worth using localized LLMs when you create content in this language that is longer than a paragraph or a page, something that demands complex and very high linguistic quality.

04 Benchmark And Polish Context

Now we have our benchmark. It is called PLCC, Polish Linguistic and Cultural Competencies. We tested around 1,000 questions, and we tested how good our models are at Polish language and Polish culture and history knowledge.

As you can see, after our training procedure, our 12-billion model and our 8x7 mixture model (which has not 56 billion but about 45 billion unique parameters) are orders of magnitude smaller than DeepSeek or GPT, but they are at the same level if you talk about knowledge of Polish culture, history, and language. That means that using this procedure, you can produce models that are orders of magnitude smaller but are still high-quality generators of text in this language.
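Why an "8x7" mixture model has fewer than 56 billion parameters can be checked with a short count: only the feed-forward experts are replicated eight times, while attention layers and embeddings are shared. The sketch below assumes the publicly documented Mixtral-8x7B layer sizes; the talk does not state these numbers, so treat the configuration as an assumption.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Mixtral-8x7B-style values; assumed here, not stated in the talk.
    hidden: int = 4096
    ffn: int = 14336
    layers: int = 32
    heads: int = 32
    kv_heads: int = 8
    vocab: int = 32000
    experts: int = 8
    experts_per_token: int = 2


def count_params(cfg: MoEConfig, experts_used: int) -> int:
    """Count weights, including `experts_used` feed-forward experts per layer."""
    head_dim = cfg.hidden // cfg.heads
    kv_dim = cfg.kv_heads * head_dim  # grouped-query attention: smaller K and V
    attention = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim  # q, o, k, v
    expert = 3 * cfg.hidden * cfg.ffn  # gate, up and down projections
    router = cfg.hidden * cfg.experts
    norms = 2 * cfg.hidden
    per_layer = attention + experts_used * expert + router + norms
    embeddings = 2 * cfg.vocab * cfg.hidden  # input embedding + output head
    return cfg.layers * per_layer + embeddings + cfg.hidden  # + final norm


cfg = MoEConfig()
total = count_params(cfg, cfg.experts)
active = count_params(cfg, cfg.experts_per_token)
print(f"unique parameters: {total / 1e9:.1f}B")
print(f"used per token:    {active / 1e9:.1f}B")
```

With these assumed sizes the count comes to roughly 46.7 billion unique parameters, of which about 12.9 billion are used for any single token, which is the same ballpark as the figure quoted above and well below 8 x 7 = 56 billion.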

There is the point that Poland is now the country almost on the border between Russia and Western Europe. We are always there. We are between two huge powers, the West, Germany, and Russia, and we are still there.

It was trained in Polish, by Poles, and for Polish users. Polish education has very good rankings if you talk about math Olympiads and software coding Olympiads; we are always in the top three for many years. For example, if you look at the OpenAI team, the core team of OpenAI, half of the members are Poles. Of course, they were bought by others and they are immigrants now, but they spent their first years in Poland. I think their PhD studies were in universities in the USA or UK, but up to the first level of university degree they spent time in Poland. Even now Poland has very talented high school students, and I think it is still a very good place to invest.

It was no solo act. We built models. Our government created a consortium consisting of six high-level education and research institutes. They combined in the consortium to produce models for the Polish government, for Polish citizens, and for Polish companies. It is a public initiative.

05 Startup Challenge And Deployments

The startup challenge had an 11-month implementation period and over 100 people. The goals were open and legally compliant Polish large language models, because we have various model sizes, and a Polish-speaking intelligent assistant: an open general-purpose assistant and RAG, a domain-oriented assistant focused on government documents and databases.

What makes us different? Legality. There are always problems if you would like to have a very large corpus of text: you always have problems with licenses and authorship rights. For example, the top vendors I mentioned, OpenAI, Google, Anthropic, Cohere, and others, I think they are not fully legalized compared to European Union legislation and demands. The European Union is very strict. Your data sets should be very clear about authorship rights and licenses.

We gathered two corpora. We have the grey corpora, where we are not fully sure how good the data are in legal terms; based on them we create only scientific prototypes. We also have the white corpora, which are very clear and legal sub-corpora, and we use them to create the models for public and commercial usage.
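One simple way to picture the white/grey split is a filter on each document's license. This is only a sketch: the `Document` fields and the allow-list of licenses are hypothetical, and a real project would take the cleared list from legal review rather than from code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical allow-list of licenses considered safe for commercial use.
CLEARED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "public-domain"}


@dataclass
class Document:
    doc_id: str
    license: Optional[str]  # None means the license is unknown


def split_corpora(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Return (white, grey): clearly licensed documents vs. everything else."""
    white = [d for d in docs if d.license in CLEARED_LICENSES]
    grey = [d for d in docs if d.license not in CLEARED_LICENSES]
    return white, grey


docs = [
    Document("a", "CC0-1.0"),
    Document("b", None),
    Document("c", "all-rights-reserved"),
    Document("d", "CC-BY-4.0"),
]
white, grey = split_corpora(docs)
print("white:", [d.doc_id for d in white])
print("grey: ", [d.doc_id for d in grey])
```

Documents with an unknown license fall into the grey set by default, which matches the cautious rule that only clearly legal text feeds the public and commercial models.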

Organic: as I mentioned, we do not have any distillation. We have 50,000 instructions and about 100,000 preferences. All of them were done manually.

Various: we have a family of models, from the little ones, for example 8 billion, to the 70-billion-parameter models.

Secure: we embed security during the preference stage. We have preference data sets that are crucial to stop the model from generating unethical content.

Of course, we do not have thousands and tens of thousands of GPUs like OpenAI or other companies. We have only 400 Hoppers. Our training sessions using such 400 Hoppers usually last from two to three weeks.

For commercial deployments, our models this year were deployed in one of the IT leaders in Central and Eastern Europe, an e-commerce company, and in the biggest bank in Central and Eastern Europe. There are two big commercial deployments this year.

For public deployments, we deployed our models in the application called mObywatel. It is a mobile application where you can have your ID, driver's license, fees, penalty points, everything stored there. It is the gateway to citizen e-services, and our model works there in a chatbot solution. We also build chatbot solutions based on our models for city offices. For example, each city can have its own chatbot powered with LLMs.

We have one million prompts. Our general-purpose engine was released in March, and from March until now we have one million prompts from citizens that can enable us to improve our models.

06 Closing

Why care about Polish AI models? For us there are three main factors: technological sovereignty, language is identity, and democratizing access, meaning open source and transparency. We would like to be as clear as possible and as legal as possible.

This is not just tech. It is a mission. AI should serve citizens and local companies.

What is in it for Poland? New opportunities for companies and the public sector, a boost for innovation, and technological skills that can be leveraged in the future.

Language localization is now a very popular topic because we would like not to create one for all, but to adapt products and translations to a specific country or region. For many e-commerce systems and deployments, localization is a must. You have to localize your product to a specific market, for example the Polish market, Czech market, French market, German market, or Asian market. You have to adjust your product or your process to the local environments.

I think the next step in the future will be localized LLMs as a trend in the next years. My last statement today: localizations are a must. Small localized models will become a part of this process in the future. I think localized LLMs are the future, because people will try to do it as small as possible, because the cost of energy and the cost of the assets you have to buy will be too high just to prototype it.

Okay. That is all. Thank you very much.