Large Language Models Speed Test
Let's test the LLMs' speed on GPU vs CPU
Comparing the prediction speed of several versions of the LLMs llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google) and mistral (open source) on CPU and GPU.
I’m using the same sample text as in the previous test, where I compared these LLMs’ ability to detect logical fallacies:
Look, on first blush, it all sounds perfectly reasonable:
too many people, not enough houses.
But it is never that simple,
as a former home affairs minister should know.
TL;DR
On a GPU the LLMs run approximately 20 times faster, but on a CPU they are still quite manageable.
Test Setup
I ran the Large Language Models below on two PCs:
- an old one with a 4-core 4th-gen i5 CPU (i5-4460, released in 2014), and
- a new one with an RTX 4080 GPU (released in 2022) with 9728 CUDA cores and 304 Tensor cores.
Test Results
The results are in the table below:
| Model name and version | GPU RAM | GPU duration | GPU performance | Main RAM | CPU duration | CPU performance | Performance difference |
|---|---|---|---|---|---|---|---|
| llama3:8b-instruct-q4_0 | 5.8GB | 2.1s | 80t/s | 4.7GB | 49s | 4.6t/s | 17.4x |
| llama3:8b-instruct-q8_0 | 9.3GB | 3.4s | 56t/s | 8.3GB | 98s | 2.7t/s | 20.7x |
| phi3:3.8b | 4.5GB | 3.6s | 98t/s | 3.0GB | 83s | 7.2t/s | 13.6x |
| phi3:3.8b-mini-4k-instruct-q8_0 | 6.0GB | 6.9s | 89t/s | 4.6GB | 79s | 5.3t/s | 16.8x |
| phi3:3.8b-mini-instruct-4k-fp16 | 9.3GB | 4.2s | 66t/s | 7.9GB | 130s | 2.9t/s | 22.8x |
| phi3:14b | 9.6GB | 4.2s | 55t/s | 7.9GB | 96s | 2.7t/s | 21.2x |
| phi3:14b-medium-4k-instruct-q6_K | 12.5GB | 8.9s | 42t/s | 11.1GB | 175s | 1.9t/s | 21.8x |
| mistral:7b-instruct-v0.3-q4_0 | 5.4GB | 2.1s | 87t/s | 4.1GB | 36s | 4.9t/s | 17.8x |
| mistral:7b-instruct-v0.3-q8_0 | 8.7GB | 2.3s | 61t/s | 7.5GB | 109s | 2.9t/s | 21.0x |
| gemma:7b-instruct-v1.1-q4_0 | 7.4GB | 1.8s | 82t/s | 7.5GB | 25s | 4.4t/s | 18.6x |
| gemma:7b-instruct-v1.1-q6_K | 9.1GB | 1.6s | 66t/s | 7.5GB | 40s | 3.0t/s | 22.0x |
Model performance is shown in the “GPU performance” and “CPU performance” columns, and the speed gain when moving from CPU to GPU is in the “Performance difference” column.
We shouldn’t pay much attention to the “duration” columns: this metric depends both on model performance and on the length of the produced text, and each model produces text of a different length. These columns just give an indicative wait time.
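For reference, the “performance” figures are simply generated tokens divided by generation time, and the “Performance difference” column is the GPU rate divided by the CPU rate (e.g. 80 / 4.6 ≈ 17.4 for llama3:8b-instruct-q4_0). Here is a minimal sketch of how such a measurement could be taken, assuming the models are served by a local Ollama instance (see the links below) and using the eval_count / eval_duration fields returned by Ollama’s /api/generate endpoint; the prompt is just the sample text from above, not the exact prompt used in the test.

```python
import requests  # assumes a local Ollama server on its default port 11434

# The sample text from above; the real test prompt asked the models to analyse it.
PROMPT = ("Look, on first blush, it all sounds perfectly reasonable: "
          "too many people, not enough houses. But it is never that simple, "
          "as a former home affairs minister should know.")

def measure(model: str) -> None:
    """Run one prediction and print duration and tokens/s for the generation phase."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    ).json()
    tokens = resp["eval_count"]            # number of generated tokens
    seconds = resp["eval_duration"] / 1e9  # generation time, nanoseconds -> seconds
    print(f"{model}: {tokens} tokens in {seconds:.1f}s = {tokens / seconds:.1f} t/s")

# Run this on the GPU box and on the CPU box; the ratio of the two rates
# gives the "Performance difference" column.
for name in ("llama3:8b-instruct-q4_0", "phi3:3.8b", "mistral:7b-instruct-v0.3-q4_0"):
    measure(name)
```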
Conclusion 1 - Performance difference
The GPU vs CPU speed difference is not as big as I expected.
Seriously? All the legions (10k+) of Ada Tensor and CUDA cores against 4 Haswell spartans, and only about a 20x difference. I thought it would be 100-1000 times.
Conclusion 2 - Cost per prediction is almost the same
- the new PC’s price tag is around 3500 AUD,
- the old PC now costs probably 200 AUD.
From PCCaseGear’s site:
From eBay (you might want to add an extra 8GB of RAM to make it 16GB in total, so let’s round it up to 200 AUD):
You might need about 20 of those old PCs to get the same throughput, so 200 AUD * 20 = 4000 AUD.
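Treating Python as a calculator, the comparison works out like this (the prices and the ~20x factor are the rough numbers from above):

```python
new_pc_aud = 3500   # RTX 4080 machine
old_pc_aud = 200    # second-hand i5-4460 machine with the extra RAM
speedup = 20        # approximate GPU-vs-CPU factor from the table

old_pcs_needed = speedup                  # old PCs to match the new PC's throughput
fleet_cost = old_pcs_needed * old_pc_aud
print(f"{old_pcs_needed} old PCs = {fleet_cost} AUD vs {new_pc_aud} AUD for the new PC")
# -> 20 old PCs = 4000 AUD vs 3500 AUD for the new PC
```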
Conclusion 3 - Moore’s law
Moore’s law implies that a computer’s performance doubles every two years.
Intel started production of the i5-4460 in 2014, and Nvidia started production of the RTX 4080 in 2022. The expected performance rise over those 8 years should be ~16 times.
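That figure follows from four doublings over the 8-year gap; a quick sanity check:

```python
years = 2022 - 2014      # i5-4460 launch to RTX 4080 launch
doublings = years / 2    # one doubling every two years
expected = 2 ** doublings
print(f"expected speed-up: ~{expected:.0f}x (observed: ~20x)")
# -> expected speed-up: ~16x (observed: ~20x)
```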
I’d say Moore’s law still works.
But keep in mind that the Dell 9020 was a basic workstation at the time, while a PC with an RTX 4080 is now, I’d say, an advanced graphics/gaming PC. A slightly different weight class.
Useful links
- Logical Fallacy Detection with LLMs
- Testing logical fallacy detection by new LLMs: gemma2, qwen2 and mistralNemo
- Comparing LLM Summarising Abilities
- Large Language Models: https://en.wikipedia.org/wiki/Large_language_model
- Logical Fallacies: https://www.logical-fallacy.com
- Logical Fallacy detector Android App: https://www.logical-fallacy.com/articles/detector-android-app/
- Ollama: https://ollama.com/
- Move Ollama Models to Different Drive or Folder
- Writing effective prompts for LLMs
- Self-hosting Perplexica - with Ollama