You ship a chatbot for your German team. The UI is German. The source docs are partly German, partly English. The tools behind the assistant expect English enum values, English function descriptions, English product names, English everything.
Then someone asks a perfectly normal question in German, the model picks the right tool, passes one argument in German instead of English, and the whole thing quietly falls apart.
I've seen versions of this enough times now that I no longer think of it as a translation issue. It is an execution issue.
My current default is simple: let users speak whatever language they want, but let the LLM do its retrieval, reasoning, and tool work in English whenever reliability actually matters. Then localize the answer outside that loop.
That sounds slightly heretical at first. It is also, in my opinion, the most practical thing you can do right now.
The research is starting to say this out loud
The numbers are not subtle anymore.
In the MASSIVE-Agents benchmark, researchers evaluated multilingual function calling across 52 languages, 47,020 samples, and 21 models. The best average score across all languages was just 34.05%. English reached 57.37%. Amharic dropped to 6.81%.
That is not a small quality wobble. That is a reliability cliff.
Then there is Lost in Execution, which gets even closer to the real systems problem. The paper shows that many multilingual tool-calling failures happen after the model already understood the intent and selected the correct tool. The dominant issue was parameter value language mismatch. In plain English, the model knew what to do, but expressed the executable bits in the user's language instead of the interface language, so the call failed anyway.
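This failure mode is easy to guard against once you name it. Here is a minimal sketch of a normalization layer for one hypothetical tool argument; the tool, enum values, and alias table are all illustrative, not from any real API:

```python
# English interface contract for a hypothetical ticket tool.
ALLOWED_STATUS = {"open", "closed", "pending"}

# Illustrative alias table; a real deployment would generate one per locale.
STATUS_ALIASES = {
    "offen": "open",          # German
    "geschlossen": "closed",
    "ausstehend": "pending",
}

def normalize_status(value: str) -> str:
    """Map a possibly localized enum value back onto the English contract."""
    v = value.strip().lower()
    if v in ALLOWED_STATUS:
        return v
    if v in STATUS_ALIASES:
        return STATUS_ALIASES[v]
    raise ValueError(f"status {value!r} is not in the tool's interface language")

# The model called the right tool with status="offen": correct intent,
# wrong language in the executable bits.
print(normalize_status("offen"))  # -> open
```

A guard like this catches the known aliases, but it is a patch, not a fix; the alias table grows forever, which is part of why normalizing the whole request earlier is attractive.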
And this is not limited to tool calling. In Do Multilingual Language Models Think Better in English?, Etxaniz and colleagues found that self-translation into English consistently beat direct non-English inference across five tasks. Their phrasing is refreshingly blunt: models are "unable to leverage their full multilingual potential when prompted in non-English languages."
So yes, multilingual models are impressive. But if your bar is not "sounds pretty good" and is instead "must behave correctly in production," English still looks like the safer operating language remarkably often.
Why RAG breaks in the same place
People usually hear this argument and think of agents first. Function calling, structured output, API execution, that kind of thing.
RAG has the same weakness, just one layer earlier.
If your retrieval layer has to match a user's local phrasing against content written in mixed languages, with inconsistent terminology, translated product names, and half-localized taxonomy labels, you create more chances for the system to drift before generation even starts. Honestly, this is where a lot of "the model is unreliable" complaints come from. The model may be fine. The content interface is not.
I would rather normalize early.
Translate the question into English. Retrieve against an English canonical corpus. Let the model reason over one stable terminology layer. Generate an answer draft in English if needed. Then translate or localize the final response for the user.
That gives you one place where naming stays stable:
- one canonical document title
- one canonical product vocabulary
- one canonical tool schema
- one canonical set of retrieval labels
You can still support every user language on the outside. You just stop asking the core execution path to be perfectly multilingual at every step.
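The flow above fits in a few lines. In this sketch, `translate`, `retrieve`, and `generate` are toy stand-ins for whatever MT model, vector store, and LLM you actually run; every name is illustrative. The point is where the language boundary sits:

```python
# Toy stand-ins so the sketch runs end to end; none of these are real APIs.
def translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"               # pretend MT

def retrieve(query_en: str, corpus: str) -> list[str]:
    return [f"doc from {corpus} matching: {query_en}"]  # pretend retrieval

def generate(query_en: str, context: list[str]) -> str:
    return f"answer({query_en}; {len(context)} docs)"   # pretend LLM

def answer(user_query: str, user_lang: str) -> str:
    query_en = translate(user_query, user_lang, "en")   # normalize early
    docs = retrieve(query_en, corpus="canonical_en")    # English-only corpus
    draft_en = generate(query_en, context=docs)         # reason in English
    return translate(draft_en, "en", user_lang)         # localize at the edge

print(answer("Wie setze ich mein Passwort zurück?", "de"))
```

Note that only the first and last lines of `answer` ever see the user's language. Everything in between operates on one terminology layer.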
This is not anti-localization
Quite the opposite.
Bad multilingual AI architecture usually hurts local users first. They get the nice localized interface, then the hidden English-centric system underneath behaves inconsistently and makes them pay the price.
Proper localization means being honest about where language should flex and where it should not.
For me, the split looks like this:
- Localize the UI, prompts, help text, onboarding, and final answers.
- Localize the source content people read directly when that content needs to exist in-market.
- Keep internal tool definitions, canonical identifiers, retrieval labels, and reasoning pivots in English if that is the most stable layer.
- Add explicit post-processing or human review where a localized output has legal, regulatory, or contractual weight.
That last point matters more than teams like to admit. If the model is talking to a human, localization is a user experience decision. If the model is talking to another system, language is an interface contract.
Those are not the same thing.
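One way to make that contract explicit in code is to keep the machine-facing parts of a tool definition in English and localize only the strings a human will read. The definition below is a hypothetical example loosely shaped like a function-calling schema, not any vendor's real API:

```python
# Hypothetical tool definition. The parameter names and enum values are
# the interface contract and stay English.
TOOL_DEF = {
    "name": "update_ticket_status",
    "description": "Set the status of a support ticket.",  # model-facing
    "parameters": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string"},
            "status": {"type": "string",
                       "enum": ["open", "closed", "pending"]},  # contract
        },
        "required": ["ticket_id", "status"],
    },
}

# Human-facing strings live in a separate, localized layer.
CONFIRMATIONS = {
    "en": "Ticket {ticket_id} has been updated.",
    "de": "Ticket {ticket_id} wurde aktualisiert.",
}

def confirm(lang: str, **call_args) -> str:
    return CONFIRMATIONS[lang].format(**call_args)

print(confirm("de", ticket_id="T-42"))  # -> Ticket T-42 wurde aktualisiert.
```

The split is the whole point: translators can touch `CONFIRMATIONS` freely, and nobody should ever translate `TOOL_DEF`.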
The architecture I trust most right now
This is the version I would bet on today for multilingual AI products:
- User asks in their language.
- System translates or normalizes the request into English.
- Retrieval, reasoning, ranking, and tool calls happen against English canonical data.
- Final answer is localized back into the user's language.
- High-risk outputs get an extra validation step before they leave the system.
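That last validation step does not have to be another model call. It can be a deterministic check before release; for example, asserting that canonical identifiers survive localization untouched. The identifier formats below are assumptions for illustration:

```python
import re

# Assumed identifier formats; use whatever your canonical layer defines.
CANONICAL_ID = re.compile(r"\b(?:INV-\d{6}|SKU-[A-Z0-9]{6})\b")

def ids_preserved(draft_en: str, localized: str) -> bool:
    """Every canonical identifier in the English draft must survive
    localization verbatim; if not, hold the answer for review."""
    return all(ident in localized for ident in CANONICAL_ID.findall(draft_en))

print(ids_preserved("Invoice INV-000123 is overdue.",
                    "Rechnung INV-000123 ist überfällig."))   # -> True
print(ids_preserved("Invoice INV-000123 is overdue.",
                    "Rechnung INV-000 123 ist überfällig."))  # -> False
```

Cheap checks like this will not catch every translation error, but they catch the ones that break downstream systems, which is where the real cost usually lives.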
It is not philosophically pure. It is operationally sane.
The nice thing is that recent research points in the same direction. Lost in Execution found that pre-translation of user queries generally reduced language-mismatch errors more effectively than post-hoc fixes, even if it still did not fully recover English-level performance. That matches what many builders already suspect in practice. If you wait until the end to clean up multilingual inconsistency, you are usually too late.
And yes, there are exceptions. If you are building for low-resource languages, domain-specific language, or culturally dependent phrasing, translating everything into English can introduce drift. The paper above explicitly warns about that. So do not turn this into dogma.
But as a default for enterprise copilots, internal assistants, multilingual RAG, and tool-using agents, I think the rule holds surprisingly well.
What this means for Rasepi
This is exactly why I care so much about canonical content structure.
If your knowledge base has one clean source layer, stable terminology, and controlled localization on top, AI gets easier to trust. If every language version drifts independently inside the execution path, you are asking the model to improvise where your system should be precise.
Rasepi's whole approach is built around separating those concerns cleanly. Keep a canonical core. Localize deliberately. Track where variants exist. Do not pretend every layer of the stack should be equally multilingual just because the UI is.
I used to think the best multilingual AI experience meant "do everything in the user's language." I do not think that anymore. Not for systems that have to retrieve the right paragraph, choose the right tool, and return something you can trust.
The practical rule is simple: users should stay local, but the LLM's execution path should stay stable. Right now, that usually means English in the middle and localization at the edges.
That will change over time. I hope it changes quickly. But if you are shipping today and reliability matters more than aesthetics, I would let the model think in English and let your product speak the user's language.