Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, Llama 3 and Phi - LLM Test

Next round of LLM tests


Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.

Previously we did:

Car is speeding

How we test

Here we test the summarisation capabilities of LLMs:

  • we have 40 sample texts, and we run each LLM with a Question and Summarisation prompt (similar to the Perplexica approach); see the sketch after this list
  • the summaries are reranked with embedding models
  • the number of correct answers divided by the total number of questions gives us the performance of the model
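
To make this concrete, below is a minimal sketch of such a loop using the Ollama Python client (`pip install ollama`). The prompt wording, the `nomic-embed-text` embedding model, the similarity threshold, and the sample format are illustrative assumptions, not the exact harness used here.

```python
import ollama


def summarise(model: str, text: str, question: str) -> str:
    """Run the question + summarisation prompt against one model."""
    prompt = (
        "Summarise the text below so that it answers the question.\n\n"
        f"Text:\n{text}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm


def rerank_score(summary: str, expected: str) -> float:
    """Compare a summary with the expected answer via an embedding model."""
    def embed(t: str) -> list[float]:
        # Embedding model is an assumption for illustration.
        return ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
    return cosine(embed(summary), embed(expected))


def score_model(model: str, samples: list[dict], threshold: float = 0.8) -> float:
    """Percent of samples whose summary is judged correct."""
    correct = sum(
        1
        for s in samples
        if rerank_score(summarise(model, s["text"], s["question"]), s["expected"]) >= threshold
    )
    return 100 * correct / len(samples)
```

In this sketch a summary counts as correct when its embedding similarity to the expected answer clears the threshold; the final score is the share of the 40 questions answered correctly.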

Test Results

Top 5 places with average % of correct answers:

  1. 82%: phi3 - 14b-medium-128k-instruct-q4_0
  2. 81%: llama3.1 - 8b-instruct-q8_0
  3. 80%: mistral-small - 22b-instruct-2409-q4_0
  4. 79%: mistral-nemo - 12b-instruct-2407-q6_K
  5. 79%: llama3.2 - 3b-instruct-q8_0

All these models have shown good performance.

I would direct some attention towards the Mistral model group: the quality of their language is a bit better than average.

Another point: the little 3B model llama3.2:3b-instruct-q8_0 showed a very good result for its size, and it’s the fastest of them all.

Detailed test results

| Model (name:params-quant)             | Size  | Test 1, % | Test 2, % | Avg, % |
|---------------------------------------|-------|-----------|-----------|--------|
| llama3.2:3b-instruct-q8_0             | 4 GB  | 80        | 79        | 79     |
| llama3.1:8b-instruct-q8_0             | 9 GB  | 76        | 86        | 81     |
| gemma2:27b-instruct-q3_K_S            | 12 GB | 76        | 72        | 74     |
| mistral-nemo:12b-instruct-2407-q6_K   | 10 GB | 76        | 82        | 79     |
| mistral-small:22b-instruct-2409-q4_0  | 12 GB | 85        | 75        | 80     |
| phi3:14b-medium-128k-instruct-q4_0    | 9 GB  | 76        | 89        | 82     |
| qwen2.5:14b-instruct-q5_0             | 10 GB | 66        | 75        | 70     |
| qwen2.5:32b-instruct-q3_K_S           | 14 GB | 80        | 75        | 77     |
| qwen2.5:32b-instruct-q4_0             | 18 GB | 76        | 79        | 77     |
| llama3.1:70b-instruct-q3_K_M          | 34 GB | 76        | 75        | 75     |
| qwen2.5:72b-instruct-q4_1             | 45 GB | 76        | 75        | 75     |
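
The model names above follow Ollama’s model:tag convention, so any row can be re-run locally once the tag has been pulled. A quick sanity check through the same Python client might look like this (the prompt is just a placeholder):

```python
import ollama

# Tag taken from the table above; assumes `ollama pull` has already fetched it.
reply = ollama.chat(
    model="mistral-small:22b-instruct-2409-q4_0",
    messages=[{"role": "user", "content": "Summarise: ..."}],
)
print(reply["message"]["content"])
```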