Large Language Models Speed Test
Let's test the LLMs' speed on GPU vs CPU
Comparing the prediction speed of several versions of the LLMs llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google) and mistral (open source) on CPU and GPU.
I’m using the same sample text as in the previous test, where I compared these LLMs’ ability to detect logical fallacies:
Look, on first blush, it all sounds perfectly reasonable:
too many people, not enough houses.
But it is never that simple,
as a former home affairs minister should know.
TL;DR
On a GPU the LLMs run approximately 20 times faster, but on a CPU they are still quite manageable.
Test Setup
I ran the Large Language Models below on two PCs:
- an old one with a 4-core 4th-gen i5 CPU (i5-4460, released in 2014), and
- a new one with an RTX 4080 GPU (released in 2022) with 9728 CUDA cores and 304 Tensor cores.
Test Results
The results are in the table below:
| Model name and version | GPU RAM | GPU duration | GPU performance | Main RAM | CPU duration | CPU performance | Performance difference |
|---|---|---|---|---|---|---|---|
| llama3:8b-instruct-q4_0 | 5.8GB | 2.1s | 80t/s | 4.7GB | 49s | 4.6t/s | 17.4x |
| llama3:8b-instruct-q8_0 | 9.3GB | 3.4s | 56t/s | 8.3GB | 98s | 2.7t/s | 20.7x |
| phi3:3.8b | 4.5GB | 3.6s | 98t/s | 3.0GB | 83s | 7.2t/s | 13.6x |
| phi3:3.8b-mini-4k-instruct-q8_0 | 6.0GB | 6.9s | 89t/s | 4.6GB | 79s | 5.3t/s | 16.8x |
| phi3:3.8b-mini-instruct-4k-fp16 | 9.3GB | 4.2s | 66t/s | 7.9GB | 130s | 2.9t/s | 22.8x |
| phi3:14b | 9.6GB | 4.2s | 55t/s | 7.9GB | 96s | 2.7t/s | 21.2x |
| phi3:14b-medium-4k-instruct-q6_K | 12.5GB | 8.9s | 42t/s | 11.1GB | 175s | 1.9t/s | 21.8x |
| mistral:7b-instruct-v0.3-q4_0 | 5.4GB | 2.1s | 87t/s | 4.1GB | 36s | 4.9t/s | 17.8x |
| mistral:7b-instruct-v0.3-q8_0 | 8.7GB | 2.3s | 61t/s | 7.5GB | 109s | 2.9t/s | 21.0x |
| gemma:7b-instruct-v1.1-q4_0 | 7.4GB | 1.8s | 82t/s | 7.5GB | 25s | 4.4t/s | 18.6x |
| gemma:7b-instruct-v1.1-q6_K | 9.1GB | 1.6s | 66t/s | 7.5GB | 40s | 3.0t/s | 22.0x |
Model performance is shown in the “GPU performance” and “CPU performance” columns, and the speed gain when moving from CPU to GPU is in the “Performance difference” column.
We shouldn’t pay much attention to the “duration” columns: this metric depends both on model performance and on the length of the produced text, and each model produces text of a different length. These columns just give an indicative wait time.
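For reference, the “performance” figures are simply generated tokens divided by generation time, and the “Performance difference” column is the GPU rate divided by the CPU rate (e.g. 80 / 4.6 ≈ 17.4 for llama3:8b-instruct-q4_0). Here is a minimal sketch of how such a measurement could be taken, assuming the models are served by a local Ollama instance (see the links below) and using the eval_count / eval_duration fields returned by Ollama’s /api/generate endpoint; the prompt is just the sample text from above, not the exact prompt used in the test.

```python
import requests  # assumes a local Ollama server on its default port 11434

# The sample text from above; the real test prompt asked the models to analyse it.
PROMPT = ("Look, on first blush, it all sounds perfectly reasonable: "
          "too many people, not enough houses. But it is never that simple, "
          "as a former home affairs minister should know.")

def measure(model: str) -> None:
    """Run one prediction and print duration and tokens/s for the generation phase."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    tokens = resp["eval_count"]            # number of generated tokens
    seconds = resp["eval_duration"] / 1e9  # generation time, nanoseconds -> seconds
    print(f"{model}: {tokens} tokens in {seconds:.1f}s = {tokens / seconds:.1f} t/s")

# Run this on the GPU box and on the CPU box; the ratio of the two rates
# gives the "Performance difference" column.
for name in ("llama3:8b-instruct-q4_0", "phi3:3.8b", "mistral:7b-instruct-v0.3-q4_0"):
    measure(name)
```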
Conclusion 1 - Performance difference
The GPU vs CPU speed difference is not as big as I expected.
Seriously? All the legions (10k+) of Ada Tensor and CUDA cores against 4 Haswell spartans, and only about a 20x difference. I thought it would be 100-1000 times.
Conclusion 2 - Cost per prediction is almost the same
- the new PC’s price tag is around 3500 AUD,
- the old PC now costs probably 200 AUD.
From PCCaseGear’s site:
From eBay (you might want to add an extra 8GB of RAM to make it 16GB in total, so let’s round it up to 200 AUD):
You might need about 20 of those old PCs to get the same throughput, so 200 AUD * 20 = 4000 AUD.
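Treating Python as a calculator, the comparison works out like this (the prices and the ~20x factor are the rough numbers from above):

```python
new_pc_aud = 3500   # RTX 4080 machine
old_pc_aud = 200    # second-hand i5-4460 machine with the extra RAM
speedup = 20        # approximate GPU-vs-CPU factor from the table

old_pcs_needed = speedup                  # old PCs to match the new PC's throughput
fleet_cost = old_pcs_needed * old_pc_aud
print(f"{old_pcs_needed} old PCs = {fleet_cost} AUD vs {new_pc_aud} AUD for the new PC")
# -> 20 old PCs = 4000 AUD vs 3500 AUD for the new PC
```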
Conclusion 3 - Moore’s law
Moore’s law implies that a computer’s performance doubles every two years.
Intel started production of the i5-4460 in 2014, and Nvidia started production of the RTX 4080 in 2022. The expected performance rise over those 8 years should be ~16 times.
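That figure follows from four doublings over the 8-year gap; a quick sanity check:

```python
years = 2022 - 2014      # i5-4460 launch to RTX 4080 launch
doublings = years / 2    # one doubling every two years
expected = 2 ** doublings
print(f"expected speed-up: ~{expected:.0f}x (observed: ~20x)")
# -> expected speed-up: ~16x (observed: ~20x)
```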
I’d say Moore’s law still works.
But keep in mind that the Dell 9020 was a basic workstation at the time, while a PC with an RTX 4080 is now, I’d say, an advanced graphics/gaming PC. A slightly different weight class.
Useful links
- Logical Fallacy Detection with LLMs
- Testing logical fallacy detection by new LLMs: gemma2, qwen2 and mistralNemo
- Comparing LLM Summarising Abilities
- Large Language Models: https://en.wikipedia.org/wiki/Large_language_model
- Logical Fallacies: https://www.logical-fallacy.com
- Logical Fallacy detector Android App: https://www.logical-fallacy.com/articles/detector-android-app/
- Ollama: https://ollama.com/
- Move Ollama Models to Different Drive or Folder
- Writing effective prompts for LLMs
- Self-hosting Perplexica - with Ollama