Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, Llama 3 and Phi - LLM Test
Next round of LLM tests
Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.
Previously we did:
- Testing logical fallacy detection by new LLMs: gemma2, qwen2 and mistralNemo
- Test: Best LLM for Perplexica
How we test
Here we test the summarisation capabilities of LLMs:
- we have 40 sample texts, and we run each LLM with a Question and a Summarisation prompt (similar to the Perplexica approach)
- the summaries are reranked with embedding models
- the number of correct answers divided by the total number of questions gives the performance of the model (a minimal sketch of this loop is shown below)
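To make the procedure concrete, here is a minimal sketch of the scoring loop against a local Ollama server. The file name `samples.json`, the embedding model `nomic-embed-text` and the 0.75 similarity threshold are illustrative assumptions, not the exact setup used in the test.

```python
# Minimal sketch of the scoring loop, assuming a local Ollama server and a
# hypothetical samples.json with {"text", "question", "expected"} entries.
import json
import requests

OLLAMA = "http://localhost:11434"
MODEL = "mistral-small:22b-instruct-2409-q4_0"
EMBED_MODEL = "nomic-embed-text"  # assumption: any embedding model served by Ollama

def generate_summary(text: str, question: str) -> str:
    """Ask the model to summarise the text with the question in mind (Perplexica-style prompt)."""
    prompt = (
        "Summarise the following text so that it answers the question.\n"
        f"Question: {question}\n\nText:\n{text}"
    )
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

samples = json.load(open("samples.json"))  # 40 question/text/expected triples
correct = 0
for s in samples:
    summary = generate_summary(s["text"], s["question"])
    # Rerank: a summary counts as correct if it lands close enough to the expected
    # answer in embedding space (the threshold here is an assumption).
    if cosine(embed(summary), embed(s["expected"])) > 0.75:
        correct += 1

print(f"{MODEL}: {100 * correct / len(samples):.0f}% correct")
```

Running this per model and averaging over two test runs gives the percentages reported below.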
Test Result
Top 5 places with average % of correct answers:
- 82%: phi3 - 14b-medium-128k-instruct-q4_0
- 81%: llama3.1 - 8b-instruct-q8_0
- 81%: mistral-small - 22b-instruct-2409-q4_0
- 79%: mistral-nemo - 12b-instruct-2407-q6_K
- 79%: llama3.2 - 3b-instruct-q8_0
All of these models showed good performance.
I would direct some attention towards the Mistral model group: the quality of their language is a bit better than average.
Another point: the little llama3.2:3b-instruct-q8_0 model showed a very good result for its size, and it’s the fastest of them all.
Detailed test result
| Model (name, params, quant) | Size | Test 1 | Test 2 | Avg |
|---|---|---|---|---|
| llama3.2:3b-instruct-q8_0 | 4 GB | 80 | 79 | 79 |
| llama3.1:8b-instruct-q8_0 | 9 GB | 76 | 86 | 81 |
| gemma2:27b-instruct-q3_K_S | 12 GB | 76 | 72 | 74 |
| mistral-nemo:12b-instruct-2407-q6_K | 10 GB | 76 | 82 | 79 |
| mistral-small:22b-instruct-2409-q4_0 | 12 GB | 85 | 75 | 80 |
| phi3:14b-medium-128k-instruct-q4_0 | 9 GB | 76 | 89 | 82 |
| qwen2.5:14b-instruct-q5_0 | 10 GB | 66 | 75 | 70 |
| qwen2.5:32b-instruct-q3_K_S | 14 GB | 80 | 75 | 77 |
| qwen2.5:32b-instruct-q4_0 | 18 GB | 76 | 79 | 77 |
| llama3.1:70b-instruct-q3_K_M | 34 GB | 76 | 75 | 75 |
| qwen2.5:72b-instruct-q4_1 | 45 GB | 76 | 75 | 75 |