Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, Llama 3 and Phi - LLM Test

Next round of LLM tests


Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.

Previously we did:

Car is speeding

How we test

Here we test the summarisation capabilities of LLMs:

  • we have 40 sample texts, and we run each LLM with a Question and Summarisation prompt (similar to the Perplexica approach); see the sketch after this list
  • the summaries are reranked with embedding models
  • the number of correct answers divided by the total number of questions gives us the performance of the model
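
To make this concrete, below is a minimal sketch of such a loop using the Ollama Python client (`pip install ollama`). The prompt wording, the `nomic-embed-text` embedding model, the similarity threshold, and the sample format are illustrative assumptions, not the exact harness used here.

```python
import ollama


def summarise(model: str, text: str, question: str) -> str:
    """Run the question + summarisation prompt against one model."""
    prompt = (
        "Summarise the text below so that it answers the question.\n\n"
        f"Text:\n{text}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm


def rerank_score(summary: str, expected: str) -> float:
    """Compare a summary with the expected answer via an embedding model."""
    def embed(t: str) -> list[float]:
        # Embedding model is an assumption for illustration.
        return ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
    return cosine(embed(summary), embed(expected))


def score_model(model: str, samples: list[dict], threshold: float = 0.8) -> float:
    """Percent of samples whose summary is judged correct."""
    correct = sum(
        1
        for s in samples
        if rerank_score(summarise(model, s["text"], s["question"]), s["expected"]) >= threshold
    )
    return 100 * correct / len(samples)
```

In this sketch a summary counts as correct when its embedding similarity to the expected answer clears the threshold; the final score is the share of the 40 questions answered correctly.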

Test Results

Top 5 places with average % of correct answers:

  1. 82%: phi3 - 14b-medium-128k-instruct-q4_0
  2. 81%: llama3.1 - 8b-instruct-q8_0
  3. 80%: mistral-small - 22b-instruct-2409-q4_0
  4. 79%: mistral-nemo - 12b-instruct-2407-q6_K
  5. 79%: llama3.2 - 3b-instruct-q8_0

All these models have shown good performance.

I would direct some attention towards the Mistral model group: the quality of their language is a bit better than average.

Another point: the little 3B model llama3.2:3b-instruct-q8_0 showed a very good result for its size, and it’s the fastest of them all.

Detailed test results

| Model (name:params-quant)             | Size  | Test 1, % | Test 2, % | Avg, % |
|---------------------------------------|-------|-----------|-----------|--------|
| llama3.2:3b-instruct-q8_0             | 4 GB  | 80        | 79        | 79     |
| llama3.1:8b-instruct-q8_0             | 9 GB  | 76        | 86        | 81     |
| gemma2:27b-instruct-q3_K_S            | 12 GB | 76        | 72        | 74     |
| mistral-nemo:12b-instruct-2407-q6_K   | 10 GB | 76        | 82        | 79     |
| mistral-small:22b-instruct-2409-q4_0  | 12 GB | 85        | 75        | 80     |
| phi3:14b-medium-128k-instruct-q4_0    | 9 GB  | 76        | 89        | 82     |
| qwen2.5:14b-instruct-q5_0             | 10 GB | 66        | 75        | 70     |
| qwen2.5:32b-instruct-q3_K_S           | 14 GB | 80        | 75        | 77     |
| qwen2.5:32b-instruct-q4_0             | 18 GB | 76        | 79        | 77     |
| llama3.1:70b-instruct-q3_K_M          | 34 GB | 76        | 75        | 75     |
| qwen2.5:72b-instruct-q4_1             | 45 GB | 76        | 75        | 75     |
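
The model names above follow Ollama’s model:tag convention, so any row can be re-run locally once the tag has been pulled. A quick sanity check through the same Python client might look like this (the prompt is just a placeholder):

```python
import ollama

# Tag taken from the table above; assumes `ollama pull` has already fetched it.
reply = ollama.chat(
    model="mistral-small:22b-instruct-2409-q4_0",
    messages=[{"role": "user", "content": "Summarise: ..."}],
)
print(reply["message"]["content"])
```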