Large Language Models Speed Test

Let's test the LLMs' speed on GPU vs CPU


Comparing the prediction speed of several versions of LLMs: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google) and mistral (open source), on both CPU and GPU.

(Image: a stopwatch - testing the speed of large language models at detecting logical fallacies)

I’m using the same sample text as in the previous test, where I compared these LLMs’ quality at detecting logical fallacies.

Look, on first blush, it all sounds perfectly reasonable:
too many people, not enough houses.

But it is never that simple,
as a former home affairs minister should know.

TL;DR

On the GPU, LLMs run approximately 20 times faster, but on the CPU they are still quite manageable.

Test Stand Description

I ran the Large Language Models below on two PCs (a sketch of a possible timing harness follows the list):

  • an old one with a 4-core 4th-gen Intel i5 CPU (i5-4460, released in 2014), and
  • a new one with an RTX 4080 GPU (released in 2022), with 9,728 CUDA cores and 304 Tensor Cores.
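
The post doesn’t show the exact harness, but the model tags in the table below are Ollama tags, so I’ll assume the models were served by a local Ollama instance. A minimal timing sketch under that assumption (the prompt shown is a placeholder for the fallacy sample text above):

```python
import requests

# Assumption: a local Ollama server on its default endpoint
OLLAMA_URL = "http://localhost:11434/api/generate"

# A few of the model tags from the results table below
MODELS = [
    "llama3:8b-instruct-q4_0",
    "phi3:3.8b",
    "mistral:7b-instruct-v0.3-q4_0",
    "gemma:7b-instruct-v1.1-q4_0",
]

# Placeholder: the actual prompt is the sample text from the previous post
PROMPT = "Find the logical fallacies in the following text: ..."

for model in MODELS:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    # Ollama reports eval_count (generated tokens) and
    # eval_duration (nanoseconds), so tokens per second is:
    tps = resp["eval_count"] / resp["eval_duration"] * 1e9
    print(f"{model}: {resp['eval_count']} tokens in "
          f"{resp['eval_duration'] / 1e9:.1f}s -> {tps:.1f} t/s")
```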

Test Results

Here are the results:

| Model name:version | GPU RAM | GPU duration | GPU performance | Main RAM | CPU duration | CPU performance | Performance difference |
|---|---|---|---|---|---|---|---|
| llama3:8b-instruct-q4_0 | 5.8 GB | 2.1 s | 80 t/s | 4.7 GB | 49 s | 4.6 t/s | 17.4x |
| llama3:8b-instruct-q8_0 | 9.3 GB | 3.4 s | 56 t/s | 8.3 GB | 98 s | 2.7 t/s | 20.7x |
| phi3:3.8b | 4.5 GB | 3.6 s | 98 t/s | 3.0 GB | 83 s | 7.2 t/s | 13.6x |
| phi3:3.8b-mini-4k-instruct-q8_0 | 6.0 GB | 6.9 s | 89 t/s | 4.6 GB | 79 s | 5.3 t/s | 16.8x |
| phi3:3.8b-mini-instruct-4k-fp16 | 9.3 GB | 4.2 s | 66 t/s | 7.9 GB | 130 s | 2.9 t/s | 22.8x |
| phi3:14b | 9.6 GB | 4.2 s | 55 t/s | 7.9 GB | 96 s | 2.7 t/s | 21.2x |
| phi3:14b-medium-4k-instruct-q6_K | 12.5 GB | 8.9 s | 42 t/s | 11.1 GB | 175 s | 1.9 t/s | 21.8x |
| mistral:7b-instruct-v0.3-q4_0 | 5.4 GB | 2.1 s | 87 t/s | 4.1 GB | 36 s | 4.9 t/s | 17.8x |
| mistral:7b-instruct-v0.3-q8_0 | 8.7 GB | 2.3 s | 61 t/s | 7.5 GB | 109 s | 2.9 t/s | 21.0x |
| gemma:7b-instruct-v1.1-q4_0 | 7.4 GB | 1.8 s | 82 t/s | 7.5 GB | 25 s | 4.4 t/s | 18.6x |
| gemma:7b-instruct-v1.1-q6_K | 9.1 GB | 1.6 s | 66 t/s | 7.5 GB | 40 s | 3.0 t/s | 22.0x |

Model performance is in the “GPU performance” and “CPU performance” columns.

The speed gain when moving from CPU to GPU is in the “Performance difference” column.

We shouldn’t pay much attention to the “duration” columns: this metric depends both on the model’s performance and on the length of the text it produced, and every model produces text of a different length. These columns only give an indicative wait time.
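
As a quick sanity check, the “Performance difference” column is simply GPU tokens/s divided by CPU tokens/s; recomputing it for a few rows of the table:

```python
# (model, GPU t/s, CPU t/s) taken from the results table above
rows = [
    ("llama3:8b-instruct-q4_0", 80.0, 4.6),
    ("phi3:3.8b", 98.0, 7.2),
    ("mistral:7b-instruct-v0.3-q4_0", 87.0, 4.9),
    ("gemma:7b-instruct-v1.1-q4_0", 82.0, 4.4),
]

for model, gpu_tps, cpu_tps in rows:
    # Speed gain when moving from CPU to GPU
    print(f"{model}: {gpu_tps / cpu_tps:.1f}x")
# Prints 17.4x, 13.6x, 17.8x, 18.6x - matching the table
```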

Conclusion 1 - Performance difference

The GPU vs CPU speed difference is not as big as I expected.

Seriously? All those legions (10k+) of Ada Tensor and CUDA cores against 4 Haswell spartans, and only a 20x difference? I thought it would be 100-1000 times.

Conclusion 2 - Cost per prediction is almost the same

  • the new PC’s price tag is around 3,500 AUD
  • the old PC now costs perhaps 200 AUD

From PCCaseGear’s site:

(Screenshot: price of a PC with an RTX 4080 Super from PCCaseGear)

From eBay (you might want to add an extra 8GB of RAM to bring it to 16GB total, so let’s round it up to 200 AUD):

(Screenshot: Dell 9020 listing on eBay)

You would need about 20 of those old PCs to get the same throughput, so 200 AUD × 20 = 4,000 AUD.
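
A rough back-of-the-envelope check of the same point: dividing price by throughput (using the llama3:8b-instruct-q4_0 figures from the table) gives nearly the same cost per token/s on both machines.

```python
# Price per unit of throughput (AUD per token/s),
# using llama3:8b-instruct-q4_0 figures from the table
new_pc_aud, new_tps = 3500, 80.0   # PC with RTX 4080
old_pc_aud, old_tps = 200, 4.6     # Dell 9020 with i5-4460

print(f"new PC: {new_pc_aud / new_tps:.1f} AUD per t/s")  # ~43.8
print(f"old PC: {old_pc_aud / old_tps:.1f} AUD per t/s")  # ~43.5
```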

Conclusion 3 - Moore’s law

Moore’s law implies that computer performance doubles every two years.

Intel started producing the i5-4460 in 2014; Nvidia started producing the RTX 4080 in 2022. The expected performance rise is therefore ~16 times.
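
Spelled out: 2022 − 2014 = 8 years, i.e. four two-year doubling periods, so

    expected gain = 2^((2022 − 2014) / 2) = 2^4 = 16x

which is in the same ballpark as the measured ~20x.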

I’d say, Moore’s law still works.

But keep in mind that the Dell 9020 was at the time a basic workstation, while a PC with an RTX 4080 is today, I’d say, an advanced graphics/gaming PC. A slightly different weight class.