Comparison of Hugo Page Translation quality - LLMs on Ollama
qwen3 8b, 14b and 30b, devstral 24b, mistral small 24b
In this test I’m comparing how different LLMs hosted on Ollama translate a Hugo page from English to German. The three pages I tested covered different topics and contained nicely structured Markdown: headers, lists, tables, links, etc.
All these models ran on an NVidia GPU with 16GB VRAM. Some of them fit completely into VRAM; some (24b and 30b) didn’t and spilled over to the CPU.
Nevertheless, I tested those too.
The Ollama models I’ve tested:
- qwen3:8b
- qwen3:14b
- qwen3:30b-a3b
- devstral:24b
- mistral-small3.1:24b
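A run like this can be scripted against Ollama’s local HTTP API. The sketch below is my reconstruction, not the exact setup used for this test: the prompt wording and the `index.md` file name are assumptions, and it assumes Ollama listening on the default `localhost:11434`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// buildPrompt wraps a page's Markdown in a translation instruction.
// The exact wording is an assumption; any similar prompt should work.
func buildPrompt(markdown string) string {
	return "Translate the following Hugo page from English to German. " +
		"Keep the front matter, Markdown structure, links and Hugo shortcodes intact. " +
		"Output only the translated page.\n\n" + markdown
}

// translate sends a non-streaming generate request to a local Ollama server.
func translate(model, markdown string) (string, error) {
	payload, err := json.Marshal(map[string]any{
		"model":  model,
		"prompt": buildPrompt(markdown),
		"stream": false,
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Response, nil
}

func main() {
	page, err := os.ReadFile("index.md")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read page:", err)
		return
	}
	models := []string{"qwen3:8b", "qwen3:14b", "qwen3:30b-a3b",
		"devstral:24b", "mistral-small3.1:24b"}
	for _, model := range models {
		text, err := translate(model, string(page))
		if err != nil {
			fmt.Fprintln(os.Stderr, model, "failed:", err)
			continue
		}
		// e.g. index.de.qwen3-14b.md (":" is not filename-friendly)
		name := "index.de." + strings.ReplaceAll(model, ":", "-") + ".md"
		if err := os.WriteFile(name, []byte(text), 0o644); err != nil {
			fmt.Fprintln(os.Stderr, model, "write failed:", err)
		}
	}
}
```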
After running the translation prompts on all of those models, I asked a capable Anthropic model to analyse the quality of the translations, compare them, and provide metrics for each model.
I also tried magistral:24b, but it was thinking too much, so it was slow and ate up all the context.
So: not for my GPU.
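The review step can be scripted too. Below is a minimal Go sketch against the Anthropic Messages API; the prompt wording and the model name are my assumptions, not the exact ones used for this post.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// buildReviewPrompt asks the reviewer model to rate each translation.
// The wording is a reconstruction, not the exact prompt used here.
func buildReviewPrompt(original string, translations map[string]string) string {
	p := "Compare the following German translations of one English Hugo page. " +
		"Rate each file 1-10 and list strengths and weaknesses.\n\nORIGINAL:\n" + original
	for name, text := range translations {
		p += "\n\nFILE " + name + ":\n" + text
	}
	return p
}

// review sends the prompt to the Anthropic Messages API.
func review(prompt string) (string, error) {
	payload, err := json.Marshal(map[string]any{
		"model":      "claude-sonnet-4-20250514", // any capable Claude model
		"max_tokens": 4096,
		"messages": []map[string]string{
			{"role": "user", "content": prompt},
		},
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest("POST",
		"https://api.anthropic.com/v1/messages", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("x-api-key", os.Getenv("ANTHROPIC_API_KEY"))
	req.Header.Set("anthropic-version", "2023-06-01")
	req.Header.Set("content-type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Content []struct {
			Text string `json:"text"`
		} `json:"content"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Content) == 0 {
		return "", fmt.Errorf("empty response")
	}
	return out.Content[0].Text, nil
}

func main() {
	prompt := buildReviewPrompt("# Page",
		map[string]string{"index.de.qwen3-8b.md": "# Seite"})
	text, err := review(prompt)
	if err != nil {
		fmt.Fprintln(os.Stderr, "review failed:", err)
		return
	}
	fmt.Println(text)
}
```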
The simple conclusion (TL;DR)
The best one is Mistral Small 3.1 24b. It doesn’t fit into 16GB VRAM, so it was a bit slow on my hardware.
The next two best: Qwen 3 14b and Qwen 3 30b. The 14b fits nicely into VRAM; the 30b is a MoE model, so it is quite fast too. On my hardware their speeds were similar.
Test 1: Philosophical topic
The markdown of a Hugo-based website page on an awesome philosophical topic: a description and examples of the Ad Baculum logical fallacy.
index.de.devstral-24b.md - Rating: 7/10 Strengths:
- Good technical terminology translation (“argumentum ad baculum Fehlschluss”)
- Maintains proper link formatting and structure
- Accurate translation of complex philosophical concepts
- Proper German grammar and syntax
Weaknesses:
- “verunstaltet” for “tainted” is somewhat awkward; “getrübt” would be more natural
- Some phrases feel slightly stiff or overly literal
- Minor flow issues in certain sentences
index.de.mistral-small3.1-24b.md - Rating: 8.5/10 Strengths:
- Very natural German language flow
- Excellent translation of technical terms
- Good preservation of original meaning and tone
- Proper grammar and sentence structure
- Uses “getrübt” for “tainted” which is more natural
Weaknesses:
- Very minor inconsistencies in terminology
- Occasional slight deviations from source structure
index.de.qwen3-8b.md - Rating: 6.5/10 Strengths:
- Accurate translation of main concepts
- Technical terms properly handled
- Maintains article structure
Weaknesses:
- Some awkward phrasing (“kann sich auch in alltäglichen Interaktionen breiten”)
- Less natural German flow in several passages
- Some grammatical constructions feel forced
- Missing word in “ein Werkzeug zur Unterdrückung von Widerstand und zur Gestaltung von Ergebnissen in Gunst der Machtbesitzer darstellt” (grammar error)
index.de.qwen3-14b.md - Rating: 8/10 Strengths:
- Natural German language flow
- Excellent handling of technical terminology
- Good preservation of original meaning
- Proper grammar and syntax
- Consistent style throughout
Weaknesses:
- Minor variations in terminology consistency
- Some sentences could be more concise
index.de.qwen3-30b-a3b.md - Rating: 7.5/10 Strengths:
- Good overall translation quality
- Natural German expression
- Technical terms well translated
- Maintains readability and flow
- Good preservation of meaning
Weaknesses:
- Some minor inconsistencies in word choice
- Occasional slightly awkward phrasing
- Missing article in final paragraph “über das [Terrain der] Argumentation”
Summary Rankings
| File | Translation Quality Score |
|---|---|
| index.de.mistral-small3.1-24b.md | 8.5/10 |
| index.de.qwen3-14b.md | 8.0/10 |
| index.de.qwen3-30b-a3b.md | 7.5/10 |
| index.de.devstral-24b.md | 7.0/10 |
| index.de.qwen3-8b.md | 6.5/10 |
Best Translation: index.de.mistral-small3.1-24b.md
- Most natural German flow with excellent technical accuracy
Areas for Improvement: The Qwen3-8b version needs grammatical corrections and more natural phrasing, while others mainly need minor refinements in word choice and consistency.
Test 2: Translation of the page about Qwen3 Embedding & Reranker Models on Ollama.
Based on my analysis of the German translations compared to the original English text, here are the quality ratings on a scale of 1-10:
| File | LLM Model | Translation Quality Score | Comments |
|---|---|---|---|
| index.de.devstral-24b.md | Devstral 24B | 8.5/10 | Excellent overall translation with natural German flow, proper technical terminology, and complete Hugo shortcode preservation. Minor deductions for some slightly awkward phrasing. |
| index.de.mistral-small3.1-24b.md | Mistral Small 3.1 24B | 8.7/10 | Very high quality translation with consistent terminology, natural German language, and perfect formatting preservation. Slightly better flow than Devstral. |
| index.de.qwen3-8b.md | Qwen3 8B | 6.5/10 | Mixed quality - contains English-German language mixing in the opening paragraph (“The [Qwen3 Embedding and Reranker models]” instead of full German). The rest is well translated but inconsistent. |
| index.de.qwen3-14b.md | Qwen3 14B | 8.2/10 | Good German translation with proper technical terminology and natural flow. Well-structured and consistent throughout. |
| index.de.qwen3-30b-a3b.md | Qwen3 30B | 9.0/10 | Excellent translation quality with very natural German language, perfect technical terminology consistency, and outstanding readability. Best overall translation. |
Key Evaluation Criteria:
Strengths observed across translations:
- Hugo shortcode preservation (`{{< ref >}}` syntax)
- Technical terminology accuracy
- Markdown formatting maintenance
- Table structure preservation
- Image reference consistency
Quality differentiators:
- Language consistency - Qwen3 8B had English-German mixing issues
- Natural German flow - Qwen3 30B achieved the most natural German expression
- Technical precision - All models handled technical terms well
- Formatting preservation - All maintained Hugo structure perfectly
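Formatting preservation is easy to verify mechanically rather than by eye. Here is a small Go sketch of such a check, counting structural features that a faithful translation should keep one-to-one; the feature set and regexes are my choice, not the method used for the ratings above.

```go
package main

import (
	"fmt"
	"regexp"
)

// formatCounts tallies structural Markdown/Hugo features that a faithful
// translation should preserve one-to-one.
func formatCounts(md string) map[string]int {
	return map[string]int{
		"shortcodes": len(regexp.MustCompile(`\{\{<.*?>\}\}`).FindAllString(md, -1)),
		"links":      len(regexp.MustCompile(`\[[^\]]*\]\([^)]*\)`).FindAllString(md, -1)),
		"headers":    len(regexp.MustCompile(`(?m)^#{1,6} `).FindAllString(md, -1)),
	}
}

// checkPreservation compares original and translation and reports
// any feature whose count changed.
func checkPreservation(src, dst string) []string {
	srcC, dstC := formatCounts(src), formatCounts(dst)
	var problems []string
	for feature, n := range srcC {
		if dstC[feature] != n {
			problems = append(problems,
				fmt.Sprintf("%s: %d -> %d", feature, n, dstC[feature]))
		}
	}
	return problems
}

func main() {
	src := "# Title\n\nSee [docs]({{< ref \"other-page\" >}}).\n"
	dst := "# Titel\n\nSiehe Dokumentation.\n" // link and shortcode dropped
	fmt.Println(checkPreservation(src, dst))
}
```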
Ranking by Quality:
- Qwen3 30B (9.0/10) - Best overall
- Mistral Small 3.1 24B (8.7/10) - Very strong
- Devstral 24B (8.5/10) - Excellent
- Qwen3 14B (8.2/10) - Good
- Qwen3 8B (6.5/10) - Adequate with issues
The larger models (Qwen3 30B and Mistral Small 3.1 24B) produced the highest quality translations, while the smaller Qwen3 8B model showed clear limitations in consistency.
Test 3: Reranking with Ollama and Qwen3 Embedding model - in Go
Basically, it is a previous post on this website: Reranking text documents with Ollama and Qwen3 Embedding model - in Go.
Based on my detailed analysis of the German translations, here’s the comparative assessment:
Translation Quality Analysis (1-10 Scale)
| File | Model | Score | Key Assessment |
|---|---|---|---|
| index.de.devstral-24b.md | DeVistral 24B | 9/10 | Excellent - Most consistent and professional |
| index.de.mistral-small3.1-24b.md | Mistral Small 3.1 24B | 8.5/10 | Very Good - Natural but minor inconsistencies |
| index.de.qwen3-14b.md | Qwen3 14B | 7/10 | Good - Solid with some mixed elements |
| index.de.qwen3-30b-a3b.md | Qwen3 30B | 6.5/10 | Fair - Contains untranslated English |
| index.de.qwen3-8b.md | Qwen3 8B | 5.5/10 | Poor - Multiple translation failures |
Detailed Comparative Analysis:
🥇 DeVistral 24B (9/10) - Winner Strengths:
- Technical Precision: Uses “ollama API-Fehler” (lowercase, more natural)
- Consistent Translation: “handelt mit parallelen Anfragen” (sophisticated phrasing)
- Professional Quality: Complete German translation throughout
- Formatting: Perfect preservation of markdown structure
Minor Issues: Very few, mostly stylistic preferences
🥈 Mistral Small 3.1 24B (8.5/10) Strengths:
- Natural German flow
- Good technical terminology: “Ollama-API-Fehler” (hyphenated, formal style)
- Consistent “verarbeitet parallele Anfragen”
Issues:
- Slightly less sophisticated phrasing in some technical contexts
- Minor formatting variations
🥉 Qwen3 14B (7/10) Strengths:
- Generally accurate meaning preservation
- Proper German grammar structure
Issues:
- Some awkward constructions
- Mixed quality in technical terminology
- Starts with a stray "```" code fence and has other formatting issues
Qwen3 30B (6.5/10) Critical Flaw:
- Contains untranslated English: “This little” appears directly in German text
- Despite larger model size, shows translation inconsistencies
- Technical accuracy is present but delivery is flawed
Qwen3 8B (5.5/10) Major Issues:
- Untranslated sections: “This little” left in English
- Inconsistent terminology: Uses “RANGORDNUNG” vs standard “RANGIERUNG”
- Poor flow: Less natural German constructions
- Shows clear limitations of smaller model size
Key Differentiators Found:
- API Error Translation:
  - DeVistral: `ollama API-Fehler` (natural, lowercase)
  - Others: `Ollama-API-Fehler` (formal, hyphenated)
- Parallel Requests:
  - DeVistral: “handelt mit parallelen Anfragen” (sophisticated)
  - Mistral/Qwen3-14B/30B: “verarbeitet parallele Anfragen” (standard)
  - Qwen3-8B: “behandelt parallele Anfragen” (less precise)
- Section Headers:
  - Most: “RANGIERUNG NACH ÄHNLICHKEIT” (standard)
  - Qwen3-8B: “RANGORDNUNG DURCH ÄHNLICHKEIT” (awkward)
- English Contamination:
  - Qwen3-30B & Qwen3-8B: leave “This little” untranslated
  - Others: properly translate it to “Dieses kleine”
Winner: DeVistral 24B - Shows the most consistent professional-level German translation with sophisticated technical terminology and complete language conversion.
Hehe, “DeVistral” :). But that misspelling is another LLM’s error, not mine.
Useful links
- Reranking text documents with Ollama and Qwen3 Embedding model - in Go
- Ollama cheatsheet
- Qwen3 Embedding & Reranker Models on Ollama: State-of-the-Art Performance
- Install and Configure Ollama models location
- How Ollama Handles Parallel Requests
- LLMs comparison: Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, LLama3 and Phi - On Ollama
- Test: How Ollama is using Intel CPU Performance and Efficient Cores
- Comparing LLM Summarising Abilities