DeepSeek local model graphics card review: without enough video memory, computing power is useless

Running DeepSeek locally is a popular way to use it nowadays. Besides avoiding busy servers, local operation also goes a long way toward protecting user privacy. DeepSeek currently comes in many versions, and the largest model can be dozens of times the size of the smallest. Choosing the version that suits your own hardware has always been a headache for users.

Today we will use four RTX 50 series graphics cards, the RTX 5090D, RTX 5080, RTX 5070 Ti and RTX 5070, to measure the actual performance gap between them.

First, let’s introduce the test platform. In addition to the four graphics cards tested this time, the processor is an AMD Ryzen 7 9800X3D and the memory is 48GB of DDR5 at 6000MHz.

We won’t go through the steps of local deployment here. Interested readers can refer to our previous articles.

The test uses LM Studio without any acceleration framework, relying entirely on each graphics card's own computing power. After all, different acceleration frameworks optimize for graphics cards from different manufacturers to different degrees, which would introduce too many variables into the test.

We first select the DeepSeek-R1-Distill-Qwen-32B model.

GPU offload is set to maximum, which means the DeepSeek model is computed entirely on the GPU; the other parameters are left at their defaults. Since the model's answers differ on every run, we ask three questions and take the average.
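The averaging step can be sketched as follows. This is only a minimal illustration: the function names are hypothetical, and the token counts and timings are placeholder values, not measurements from this test.

```python
from statistics import mean


def tokens_per_second(tokens_generated: int, seconds: float) -> float:
    """Generation speed of a single answer in tok/sec."""
    return tokens_generated / seconds


def average_tok_per_sec(runs: list[tuple[int, float]]) -> float:
    """Average tok/sec over several (tokens, seconds) runs."""
    return mean(tokens_per_second(tokens, secs) for tokens, secs in runs)


# Placeholder numbers for three questions, NOT results from this review.
example_runs = [(600, 10.0), (900, 15.0), (1200, 20.0)]
print(f"average: {average_tok_per_sec(example_runs):.1f} tok/sec")
```

Averaging the per-question speeds, rather than dividing total tokens by total time, gives each question equal weight regardless of how long its answer is.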

Another thing to note is that the questions we set are narrowly scoped, so that the AI does not diverge too much while thinking about its answer. If you ask an unscoped question such as "What is philosophy?", the answers vary so much from run to run that the results cannot be meaningfully compared.

With the 32B model, you can see that the RTX 5090D's tok/sec is very fast. After all, it is the flagship product of this generation, and its large 32GB of video memory is well suited to AI work.

But a problem occurred when we tested the RTX 5080. When the RTX 5080 answered the question, the thinking time reached 348 seconds, which is nearly 6 minutes.

It is worth mentioning here that there is a rough formula for estimating the video memory that a model of a given size requires, namely:

Model size (B) ÷ 2 × 1.15 = Video memory (GB)

Therefore, the 32B model requires at least about 18.4GB of video memory (32 ÷ 2 × 1.15), which exceeds the 16GB of the RTX 5080. The roughly 2.4GB that overflows is made up by system memory.
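As a minimal sketch, the rule of thumb above can be written as a small helper. The function names are hypothetical, and the 1.15 factor is simply the rough overhead used in the formula, not an exact figure.

```python
def estimate_vram_gb(params_billion: float, overhead: float = 1.15) -> float:
    """Rough estimate from the formula above: size (B) / 2 * 1.15."""
    return params_billion / 2 * overhead


def overflow_gb(params_billion: float, card_vram_gb: float) -> float:
    """GB that spill into system memory (0 if the model fits on the card)."""
    return max(0.0, estimate_vram_gb(params_billion) - card_vram_gb)


need = estimate_vram_gb(32)
spill = overflow_gb(32, 16)
print(f"32B model: about {need:.1f} GB needed, {spill:.1f} GB over a 16 GB card")
```

For the 32B model on a 16GB card this reports about 18.4 GB needed and 2.4 GB of overflow, matching the calculation above.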

But for the model, once the video memory overflows, it makes no difference how much or how little memory is "borrowed": the whole model runs at the speed of the slowest part.

When we ran the 32B model on a colleague's RTX 2060, which has to "borrow" far more system memory, the thinking time was still about 5 minutes.

Results obtained with overflowed video memory are of little value for this test, so we switched to a smaller 8B model so that the remaining graphics cards can complete the test entirely within video memory.

According to the above formula, it can be inferred that the 8B model only requires approximately 4.6GB of video memory to meet the computing needs.

After changing the model, all graphics cards can be tested normally, and the results are summarized above.

Judging from the results, tok/sec is closely related to a graphics card's video memory and computing power, and it scales between cards in the expected order. Time to first token and thinking time show no clear pattern. Below we have summarized the tok/sec results of each graphics card in a bar chart so that everyone can see them more clearly.

The RTX 5090D, with its large video memory and high computing power, comes out on top without any surprise, while the RTX 5080 and RTX 5070 Ti have the same amount of video memory and the gap between them is small. For reference, the AI computing power of each graphics card is:

RTX 5090D (AI TOPS: 2375);

RTX 5080 (AI TOPS: 1801);

RTX 5070 Ti (AI TOPS: 1406);

RTX 5070 (AI TOPS: 988)

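The RTX 5080 has roughly 28% more AI TOPS than the RTX 5070 Ti, yet with the same amount of video memory the two perform similarly. So, at least for the DeepSeek large language models, the most important requirement is not AI computing power but video memory. As long as the video memory is large enough, a card has an overwhelming advantage in inference.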
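To make the comparison concrete, here is a rough sketch that checks which cards fit each model under the rule-of-thumb formula and compares AI TOPS. The AI TOPS values are those listed above; the 12GB figure for the RTX 5070 is taken from its public specification rather than from this test, and the helper names are hypothetical.

```python
# name: (video memory in GB, AI TOPS)
CARDS = {
    "RTX 5090D": (32, 2375),
    "RTX 5080": (16, 1801),
    "RTX 5070 Ti": (16, 1406),
    "RTX 5070": (12, 988),  # 12 GB is an assumption from the public spec
}


def estimate_vram_gb(params_billion: float) -> float:
    """Rule of thumb from the article: size (B) / 2 * 1.15."""
    return params_billion / 2 * 1.15


def fits(params_billion: float, vram_gb: float) -> bool:
    """True if the estimated requirement fits entirely in video memory."""
    return estimate_vram_gb(params_billion) <= vram_gb


for name, (vram, tops) in CARDS.items():
    print(f"{name}: 32B fits={fits(32, vram)}, 8B fits={fits(8, vram)}, AI TOPS={tops}")

# Compute gap between the two 16 GB cards.
tops_ratio = CARDS["RTX 5080"][1] / CARDS["RTX 5070 Ti"][1]
print(f"RTX 5080 vs RTX 5070 Ti AI TOPS ratio: {tops_ratio:.2f}")
```

Under these assumptions only the RTX 5090D holds the 32B model entirely in video memory, while all four cards fit the 8B model.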

Finally, let’s summarize the key points of this DeepSeek test for everyone to quickly remember:

1. DeepSeek large language model’s demand for GPU: video memory > computing power

2. Conversion formula for a model’s video memory requirement: (x)B ÷ 2 × 1.15 = Video memory (GB)

3. When the video memory cannot meet the minimum requirements of the model, no amount of AI computing power will help.

4. Thinking time depends less on the GPU than on how open-ended the question is.

We chose LM Studio for this test in order to measure the real computing power of each graphics card without acceleration. However, there are now many acceleration frameworks for different architectures, and with them even laptops can run large models at full performance. You may wish to try them yourself.
