Comparison of Hugo Page Translation Quality - LLMs on Ollama

qwen3 8b, 14b and 30b, devstral 24b, mistral small 24b


In this test I’m comparing how different LLMs hosted on Ollama translate a Hugo page from English to German. The three pages I tested covered different topics and had nicely structured markdown: headers, lists, tables, links, etc.

All these models ran on an NVIDIA GPU with 16GB of VRAM. Some of them fit completely into VRAM; some (the 24b and 30b ones) didn’t and spilled over to the CPU.

Nevertheless, I tested those too.

For two of the pages, the translation is still in progress.

The Ollama models I’ve tested:

  • qwen3:8b
  • qwen3:14b
  • qwen3:30b-a3b
  • devstral:24b
  • mistral-small3.1:24b

After running the translation prompts on all of those models, I asked a smart Anthropic model to analyse the quality of the translations, compare them, and provide metrics for each model.
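
For reference, here is a minimal sketch of how such a translation prompt can be sent to a locally running Ollama model through its /api/generate endpoint. The model name, prompt wording and file paths are illustrative, not the exact ones used in these tests:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Read the English source page (hypothetical path).
	page, err := os.ReadFile("content/post/example/index.md")
	if err != nil {
		panic(err)
	}

	// Build a non-streaming request for Ollama's /api/generate endpoint.
	// Model name and prompt wording are illustrative only.
	reqBody, err := json.Marshal(map[string]any{
		"model": "qwen3:14b",
		"prompt": "Translate the following Hugo markdown page from English to German. " +
			"Keep the front matter, markdown structure and Hugo shortcodes intact.\n\n" + string(page),
		"stream": false,
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// With stream=false the full model output arrives in the "response" field.
	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	// Save as the German variant, e.g. index.de.qwen3-14b.md (hypothetical name).
	if err := os.WriteFile("index.de.qwen3-14b.md", []byte(out.Response), 0o644); err != nil {
		panic(err)
	}
	fmt.Println("translation written")
}
```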

I also tried magistral:24b, but it was thinking too much, so it was slow and ate up all the context. So: not for my GPU.

The simple conclusion: TL;DR

The best one is Mistral Small 3.1 24b. It doesn’t fit into 16GB of VRAM, so it was a bit slow on my hardware.

The next two best: Qwen 3 14b and Qwen 3 30b. The 14b fits nicely into VRAM, while the 30b is an MoE model, so it is quite fast too. On my hardware their speed was similar.

Test 1: Philosophical topic

The markdown of a Hugo-based website page on an awesome philosophical topic - a description of the Ad Baculum logical fallacy, with examples.

index.de.devstral-24b.md - Rating: 7/10 Strengths:

  • Good technical terminology translation (“argumentum ad baculum Fehlschluss”)
  • Maintains proper link formatting and structure
  • Accurate translation of complex philosophical concepts
  • Proper German grammar and syntax

Weaknesses:

  • “verunstaltet” for “tainted” is somewhat awkward; “getrübt” would be more natural
  • Some phrases feel slightly stiff or overly literal
  • Minor flow issues in certain sentences

index.de.mistral-small3.1-24b.md - Rating: 8.5/10 Strengths:

  • Very natural German language flow
  • Excellent translation of technical terms
  • Good preservation of original meaning and tone
  • Proper grammar and sentence structure
  • Uses “getrübt” for “tainted” which is more natural

Weaknesses:

  • Very minor inconsistencies in terminology
  • Occasional slight deviations from source structure

index.de.qwen3-8b.md - Rating: 6.5/10 Strengths:

  • Accurate translation of main concepts
  • Technical terms properly handled
  • Maintains article structure

Weaknesses:

  • Some awkward phrasing (“kann sich auch in alltäglichen Interaktionen breiten”)
  • Less natural German flow in several passages
  • Some grammatical constructions feel forced
  • Missing word in “ein Werkzeug zur Unterdrückung von Widerstand und zur Gestaltung von Ergebnissen in Gunst der Machtbesitzer darstellt” (grammar error)

index.de.qwen3-14b.md - Rating: 8/10 Strengths:

  • Natural German language flow
  • Excellent handling of technical terminology
  • Good preservation of original meaning
  • Proper grammar and syntax
  • Consistent style throughout

Weaknesses:

  • Minor variations in terminology consistency
  • Some sentences could be more concise

index.de.qwen3-30b-a3b.md - Rating: 7.5/10 Strengths:

  • Good overall translation quality
  • Natural German expression
  • Technical terms well translated
  • Maintains readability and flow
  • Good preservation of meaning

Weaknesses:

  • Some minor inconsistencies in word choice
  • Occasional slightly awkward phrasing
  • Missing article in final paragraph “über das [Terrain der] Argumentation”

Summary Rankings

| File                             | Translation Quality Score |
|----------------------------------|---------------------------|
| index.de.mistral-small3.1-24b.md | 8.5/10                    |
| index.de.qwen3-14b.md            | 8.0/10                    |
| index.de.qwen3-30b-a3b.md        | 7.5/10                    |
| index.de.devstral-24b.md         | 7.0/10                    |
| index.de.qwen3-8b.md             | 6.5/10                    |

Best Translation: index.de.mistral-small3.1-24b.md - Most natural German flow with excellent technical accuracy

Areas for Improvement: The Qwen3-8b version needs grammatical corrections and more natural phrasing, while others mainly need minor refinements in word choice and consistency.

Test 2: Translation of the page about Qwen3 Embedding & Reranker Models on Ollama.

Based on my analysis of the German translations compared to the original English text, here are the quality ratings on a scale of 1-10:

| File | LLM Model | Translation Quality Score | Comments |
|------|-----------|---------------------------|----------|
| index.de.devstral-24b.md | Devstral 24B | 8.5/10 | Excellent overall translation with natural German flow, proper technical terminology, and complete Hugo shortcode preservation. Minor deductions for some slightly awkward phrasing. |
| index.de.mistral-small3.1-24b.md | Mistral Small 3.1 24B | 8.7/10 | Very high quality translation with consistent terminology, natural German language, and perfect formatting preservation. Slightly better flow than Devstral. |
| index.de.qwen3-8b.md | Qwen3 8B | 6.5/10 | Mixed quality - contains English-German language mixing in the opening paragraph (“The [Qwen3 Embedding and Reranker models]” instead of full German). The rest is well translated but inconsistent. |
| index.de.qwen3-14b.md | Qwen3 14B | 8.2/10 | Good German translation with proper technical terminology and natural flow. Well-structured and consistent throughout. |
| index.de.qwen3-30b-a3b.md | Qwen3 30B | 9.0/10 | Excellent translation quality with very natural German language, perfect technical terminology consistency, and outstanding readability. Best overall translation. |

Key Evaluation Criteria:

Strengths observed across translations:

  • Hugo shortcode preservation ({{< ref >}} syntax)
  • Technical terminology accuracy
  • Markdown formatting maintenance
  • Table structure preservation
  • Image reference consistency

Quality differentiators:

  1. Language consistency - Qwen3 8B had English-German mixing issues
  2. Natural German flow - Qwen3 30B achieved the most natural German expression
  3. Technical precision - All models handled technical terms well
  4. Formatting preservation - All maintained Hugo structure perfectly (a simple mechanical check is sketched below)
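
The review above is an LLM-based, qualitative assessment; as a purely mechanical complement, a tiny script can compare the number of Hugo shortcodes in the source page and in a translated file. This is only a sketch with hypothetical file names, not part of the evaluation done here:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// countShortcodes counts occurrences of Hugo shortcodes like {{< ref "..." >}} in a file.
func countShortcodes(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	re := regexp.MustCompile(`\{\{<[^>]*>\}\}`)
	return len(re.FindAll(data, -1)), nil
}

func main() {
	// Hypothetical file names: the English source and one translated variant.
	src, err := countShortcodes("index.md")
	if err != nil {
		panic(err)
	}
	dst, err := countShortcodes("index.de.qwen3-14b.md")
	if err != nil {
		panic(err)
	}

	fmt.Printf("shortcodes: source=%d, translation=%d\n", src, dst)
	if src != dst {
		fmt.Println("warning: shortcode count differs - some formatting may have been lost")
	}
}
```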

Ranking by Quality:

  1. Qwen3 30B (9.0/10) - Best overall
  2. Mistral Small 3.1 24B (8.7/10) - Very strong
  3. Devstral 24B (8.5/10) - Excellent
  4. Qwen3 14B (8.2/10) - Good
  5. Qwen3 8B (6.5/10) - Adequate with issues

The larger models (Qwen3 30B and Mistral Small 3.1 24B) produced the highest quality translations, while the smaller Qwen3 8B model showed clear limitations in consistency.

Test 3: Reranking with Ollama and Qwen3 Embedding model - in Go

Basically, it is a previous post on this website: Reranking text documents with Ollama and Qwen3 Embedding model - in Go.

Based on my detailed analysis of the German translations, here’s the comparative assessment:

Translation Quality Analysis (1-10 Scale)

| File | Model | Score | Key Assessment |
|------|-------|-------|----------------|
| index.de.devstral-24b.md | DeVistral 24B | 9/10 | Excellent - Most consistent and professional |
| index.de.mistral-small3.1-24b.md | Mistral Small 3.1 24B | 8.5/10 | Very Good - Natural but minor inconsistencies |
| index.de.qwen3-14b.md | Qwen3 14B | 7/10 | Good - Solid with some mixed elements |
| index.de.qwen3-30b-a3b.md | Qwen3 30B | 6.5/10 | Fair - Contains untranslated English |
| index.de.qwen3-8b.md | Qwen3 8B | 5.5/10 | Poor - Multiple translation failures |

Detailed Comparative Analysis:

🥇 DeVistral 24B (9/10) - Winner Strengths:

  • Technical Precision: Uses “ollama API-Fehler” (lowercase, more natural)
  • Consistent Translation: “handelt mit parallelen Anfragen” (sophisticated phrasing)
  • Professional Quality: Complete German translation throughout
  • Formatting: Perfect preservation of markdown structure

Minor Issues: Very few, mostly stylistic preferences

🥈 Mistral Small 3.1 24B (8.5/10) Strengths:

  • Natural German flow
  • Good technical terminology: “Ollama-API-Fehler” (hyphenated, formal style)
  • Consistent “verarbeitet parallele Anfragen”

Issues:

  • Slightly less sophisticated phrasing in some technical contexts
  • Minor formatting variations

🥉 Qwen3 14B (7/10) Strengths:

  • Generally accurate meaning preservation
  • Proper German grammar structure

Issues:

  • Some awkward constructions
  • Mixed quality in technical terminology
  • Contains “```” at start and formatting issues

Qwen3 30B (6.5/10) Critical Flaw:

  • Contains untranslated English: “This little” appears directly in German text
  • Despite larger model size, shows translation inconsistencies
  • Technical accuracy is present but delivery is flawed

Qwen3 8B (5.5/10) Major Issues:

  • Untranslated sections: “This little” left in English
  • Inconsistent terminology: Uses “RANGORDNUNG” vs standard “RANGIERUNG”
  • Poor flow: Less natural German constructions
  • Shows clear limitations of smaller model size

Key Differentiators Found:

  1. API Error Translation:

    • DeVistral: ollama API-Fehler (natural, lowercase)
    • Others: Ollama-API-Fehler (formal, hyphenated)
  2. Parallel Requests:

    • DeVistral: “handelt mit parallelen Anfragen” (sophisticated)
    • Mistral/Qwen3-14B/30B: “verarbeitet parallele Anfragen” (standard)
    • Qwen3-8B: “behandelt parallele Anfragen” (less precise)
  3. Section Headers:

    • Most: “RANGIERUNG NACH ÄHNLICHKEIT” (standard)
    • Qwen3-8B: “RANGORDNUNG DURCH ÄHNLICHKEIT” (awkward)
  4. English Contamination:

    • Qwen3-30B & Qwen3-8B: Leave “This little” untranslated
    • Others: Properly translate to “Dieses kleine”

Winner: DeVistral 24B - Shows the most consistent professional-level German translation with sophisticated technical terminology and complete language conversion.


Hehe DeVistral :). But that’s another LLM’s error.