Deep Understanding of DeepSeek and Enterprise Practices (Part 2): Principles of 32B Multi-GPU Inference, Hardware Cooling, and Performance Testing

2025-02-13 19:56


Foreword

In the article Deep Understanding of DeepSeek and Enterprise Practices (Part 1): Distillation, Deployment, and Evaluation, we explored model distillation and quantization techniques, as well as the basics of deploying a 7B model. A single GPU’s memory can typically hold the full parameters of a 7B model. However, once the parameter count scales up to the 32B (32 billion) level, a single GPU’s memory often cannot support full operation. Multi-GPU parallel inference then becomes necessary, along with the question of whether the server’s hardware architecture can accommodate multiple GPUs.

This article takes the deployment of DeepSeek-Distilled-Qwen-32B as an example. We’ll dive into the principles of multi-GPU parallelism and key considerations for deploying multiple GPUs in a server. Additionally, we’ll evaluate the 32B model’s runtime performance and reasoning capabilities, offering analysis and recommendations for its suitable use cases.

I. Memory Requirements Assessment for 32B Model Deployment

When deploying a 32B model, factors like precision, context length, and batch size significantly affect memory and computational demands. We covered the core influencing factors in the previous article, so we won’t repeat them here. Instead, we’ll directly provide the assessed values:

[Table: assessed memory requirements for 32B deployment]

Given the complexity of modern quantization methods (e.g., data packing, FP8 format quantization, etc.), labeling them as Int8 or Int4 is less precise. Thus, we’ll use 8-bit quantization and 4-bit quantization for estimation here.

Additional variations may arise due to quantization strategies for different layers, data structure precision, whether KV Cache quantization is enabled, or the use of different inference frameworks.
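
As a rough illustration of how such estimates are produced, the sketch below adds up weight memory and KV cache memory for a 32B model. The layer count, KV-head count, head dimension, and context length are assumptions chosen to resemble a Qwen-32B-class architecture rather than exact values for this model, and activations plus framework overhead are ignored.

```python
# Rough VRAM estimate for a 32B-parameter model (illustrative assumptions only).
# Weights: params * bytes_per_param
# KV cache: 2 * layers * kv_heads * head_dim * kv_bytes * context_len * batch_size

def estimate_vram_gb(params_b=32, bytes_per_param=2,       # 2 = FP16/BF16, 1 = 8-bit, 0.5 = 4-bit
                     layers=64, kv_heads=8, head_dim=128,   # assumed Qwen-32B-like GQA layout
                     context_len=8192, batch_size=1, kv_bytes=2):
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context_len * batch_size
    return (weights + kv_cache) / 1024**3

for label, bpp in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label}: ~{estimate_vram_gb(bytes_per_param=bpp):.1f} GB "
          "(weights + KV cache, excluding activations and framework overhead)")
```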

[Figure: LLM VRAM usage calculator]

II. Principles of Multi-GPU Inference Explained

From the calculations above, it’s clear that with large contexts—especially at higher data precision—a single GPU struggles to meet memory needs. Common consumer GPUs typically offer up to 24GB of memory, while inference-focused cards reach 48GB. Only a few high-end GPUs provide 64–141GB.

Thus, for models with 32B parameters or more, multi-GPU inference is nearly unavoidable. The main multi-GPU parallel strategies today are Tensor Parallelism and Pipeline Parallelism.

  1. Tensor Parallelism
    Splits a single tensor along a dimension, enabling parallel computation of the same operation across multiple GPUs (see the sketch below).
    a. Advantages: Computation and communication can overlap, boosting efficiency.
    b. Disadvantages: High implementation complexity; it demands high inter-GPU communication bandwidth and low latency, and the split must follow powers of 2 (e.g., 2, 4, 8, 16).

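To make the idea concrete, here is a minimal sketch that splits a linear layer's weight matrix column-wise across two simulated devices and then gathers the partial results. Real frameworks do this on actual GPUs with collective communication; this is purely an illustration of the splitting scheme.

```python
import numpy as np

# Minimal illustration of column-wise tensor parallelism on two "devices".
# A linear layer y = x @ W is split so each device holds half of W's columns;
# the partial outputs are concatenated (an all-gather in a real system).

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))            # batch of activations
W = rng.standard_normal((1024, 4096))         # full weight matrix

W_dev0, W_dev1 = np.split(W, 2, axis=1)       # shard columns across 2 devices

y_dev0 = x @ W_dev0                           # computed on GPU 0
y_dev1 = x @ W_dev1                           # computed on GPU 1
y = np.concatenate([y_dev0, y_dev1], axis=1)  # gather the partial results

assert np.allclose(y, x @ W)                  # matches the single-device result
```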

  2. Pipeline Parallelism
    Assigns different model layers to different GPUs. Upstream and downstream GPUs pass activation values sequentially, like an assembly line (see the sketch below).
    a. Advantages: Reduces synchronization and communication overhead, with lower demands on bandwidth and latency.
    b. Disadvantages: Pipeline bubbles may occur, wasting compute resources.

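The following sketch mimics pipeline parallelism by assigning consecutive layer ranges to two simulated devices and passing activations from stage to stage. A real deployment would interleave micro-batches to hide the pipeline bubbles mentioned above, and the layer shapes here are arbitrary.

```python
import numpy as np

# Minimal illustration of pipeline parallelism: layers 0-15 live on "GPU 0",
# layers 16-31 on "GPU 1"; activations flow stage by stage like an assembly line.

rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)) * 0.05 for _ in range(32)]
stages = [layers[:16], layers[16:]]              # layer ranges assigned to each device

def run_stage(stage_layers, activations):
    for W in stage_layers:
        activations = np.tanh(activations @ W)   # stand-in for a transformer layer
    return activations

x = rng.standard_normal((8, 256))                # one micro-batch
for device_id, stage in enumerate(stages):
    x = run_stage(stage, x)                      # activations handed to the next device
    print(f"stage {device_id} (simulated GPU {device_id}) finished")
print("final activations shape:", x.shape)
```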

  3. Comparison of Parallel Strategies

[Table: comparison of Tensor Parallelism and Pipeline Parallelism]

As the table above shows, Tensor Parallelism is better at boosting overall throughput, while Pipeline Parallelism is simpler to implement and better suited to mixed CPU-GPU inference scenarios. This is why llama.cpp (the inference engine used by ollama) opts for Pipeline Parallelism, and it also explains why llama.cpp’s multi-GPU performance is relatively weaker.

III. Server Hardware Deployment and GPU Configuration

  1. Challenges of Installing Multiple GPUs in a 2U Server
    As noted earlier, installing GPUs in powers of 2 (e.g., 2, 4, 8, 16) optimizes Tensor Parallelism performance. For commonly used 2U servers, fitting 2 GPUs is usually straightforward. However, 4 GPUs pose challenges.


Most GPUs are double-width, occupying two PCIe slots. Even without other devices taking slots, a typical 2U server can only fit three GPUs. Since this doesn’t align with powers of 2, only two GPUs can be fully utilized.

  2. Solutions

Reduce Front-Panel Drive Count: Free up space and improve cooling by using larger-capacity drives instead of multiple smaller ones.

Use Multi-GPU Modules: Some server vendors offer dedicated GPU modules. These reserve the entire upper 1U space for GPUs, allowing up to 4 double-width GPUs side by side.

With this layout, the front panel must reserve space for cooling airflow, so it can hold only eight 3.5-inch drives; larger-capacity drives are then needed to ensure sufficient storage.


For more drives or better cooling, a 3U, 4U, or taller server is required. The optimal setup depends on cabinet power supply and GPU power consumption.

IV. One-Click Deployment of DeepSeek-Distilled-Qwen-32B on AIOS Helix

  1. Deployment Steps
    Environment Configuration

[Table: environment configuration]

    Deployment Steps

  1. Environment Preparation: Install ZStack AIOS Helix and ensure the system meets the runtime requirements.
  2. One-Click Deployment:
    a. Use ZStack AIOS Helix to select and load the model.
    b. Specify the GPU and compute specs for the deployment.
  3. Test Run: Try a chat in the demo box or connect other applications via the API (a minimal client example is sketched below).
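
For the API connection in step 3, assuming the deployment exposes an OpenAI-compatible endpoint (common for vLLM-based serving backends), a minimal client call might look like the sketch below. The host, port, API key, and model name are placeholders; use the values shown in your ZStack AIOS Helix deployment details.

```python
import requests

# Hypothetical endpoint and model name -- replace with the values shown
# by your ZStack AIOS Helix deployment.
BASE_URL = "http://<your-aios-helix-host>:8000/v1"
payload = {
    "model": "DeepSeek-Distilled-Qwen-32B",
    "messages": [{"role": "user", "content": "Briefly explain tensor parallelism."}],
    "max_tokens": 256,
}

resp = requests.post(f"{BASE_URL}/chat/completions",
                     headers={"Authorization": "Bearer <your-api-key>"},
                     json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```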

  2. Performance Evaluation
    Using ZStack AIOS Helix’s performance testing, we quickly assessed the model’s performance on the current hardware. The data is summarized as follows:

[Table: performance test results]

[Figure: performance evaluation UI]

Combining these results, we can analyze the current environment:

Throughput (TPS) vs. Concurrency

  1. TPS rises sharply from 1 to 16 concurrency (23→256). At 32 concurrency, growth slows significantly (only 15% increase).
  2. Recommended Concurrency Range: 4–16 offers good throughput gains.
  3. Peak Turning Point: Beyond 16, the system nears a performance bottleneck.

Key Findings on Response Latency

  1. TTFT (Time to First Token) spikes to 25 seconds at 32 concurrency (vs. 0.06 seconds at 1).
  2. Total latency exceeds 64 seconds at 32 concurrency—2.7 times higher than low concurrency.
  3. Real-Time Scenario Advice: For latency-sensitive cases (e.g., chat systems), keep concurrency ≤4.

Resource Efficiency Analysis

  1. Per-session throughput is 23.248 tokens/s at 1 concurrency but drops to 9.198 tokens/s per session at 32 concurrency (a roughly 60% decline).
  2. The marginal benefit of each added session fades after 16 concurrency.
  3. Optimization Advice: Scale out with multiple instances at 16 concurrency each rather than pushing a single instance to high concurrency.
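
To sanity-check figures like these outside the built-in evaluation, a rough load-test sketch against the same (assumed) OpenAI-compatible endpoint is shown below. It fires a fixed number of concurrent requests and reports aggregate throughput and average latency; the endpoint and model name are placeholders, and it assumes the server reports token counts in a usage field. It is not a substitute for ZStack AIOS Helix’s own performance testing.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://<your-aios-helix-host>:8000/v1"   # placeholder endpoint
MODEL = "DeepSeek-Distilled-Qwen-32B"
CONCURRENCY = 16                                     # sweep 1, 4, 16, 32, ...

def one_request(prompt: str):
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": MODEL, "max_tokens": 256,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=300,
    )
    resp.raise_for_status()
    # Assumes an OpenAI-style "usage" block in the response.
    tokens = resp.json()["usage"]["completion_tokens"]
    return tokens, time.perf_counter() - start

prompts = [f"Summarize pipeline parallelism (variant {i})." for i in range(CONCURRENCY)]
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, prompts))
elapsed = time.perf_counter() - t0

total_tokens = sum(tokens for tokens, _ in results)
print(f"concurrency={CONCURRENCY}  throughput={total_tokens / elapsed:.1f} tok/s  "
      f"avg latency={sum(lat for _, lat in results) / len(results):.2f}s")
```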

Recommended Configurations for Different Scenarios

[Table: recommended configurations for different scenarios]

By combining ZStack AIOS Helix’s evaluation tools with real-world conditions, it becomes easier to find the right business plan and deployment model.
Note: Tests show 16 concurrency as the optimal throughput/latency balance. Beyond this, performance degrades noticeably. Validate with stress tests based on hardware resources during actual deployment.

V. Capability Evaluation: MMLU, HumanEval, and Other Benchmarks

  1. Test Metrics
    1. Answer Accuracy: Performance on professional knowledge Q&A (MMLU), reflecting comprehensive knowledge ability.
    2. Code Generation: Assessed on HumanEval, where the generated code must compile and pass the unit tests (a simplified illustration follows after this list).
    3. Math Reasoning: Tested on the MATH dataset, showcasing mathematical problem understanding and reasoning.
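
As a simplified illustration of how a HumanEval-style check works, the toy example below counts a completion as passing only if it executes and satisfies the unit tests. The problem stub, completion, and tests are made up for this example, and the real harness runs candidates in an isolated sandbox.

```python
# Simplified HumanEval-style check: a completion passes only if it runs
# and satisfies the problem's unit tests. (The real benchmark sandboxes
# execution and uses its own problem set; this toy example skips that.)

PROBLEM_PROMPT = "def add(a, b):\n"          # hypothetical problem stub
MODEL_COMPLETION = "    return a + b\n"      # what the model generated

def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0

namespace = {}
try:
    exec(PROBLEM_PROMPT + MODEL_COMPLETION, namespace)   # must "compile" and run
    check(namespace["add"])                              # must pass the unit tests
    passed = True
except Exception:
    passed = False

print("pass@1 contribution:", passed)
```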

  2. Evaluation Results

[Table: evaluation results]

VI. Application Scenarios and Outlook for the 32B Model

The 32B model excels in multiple areas:

  • Reasoning Speed: Optimized multi-GPU parallelism boosts inference speed, balancing cost and capability.
  • Math Ability: Shines in complex calculations and formula derivations.
  • Logical Reasoning: Grasps and reasons through intricate logical relationships.
  • Code Generation: Offers high-quality code writing and fixes. However, it lags slightly behind larger models in generating long, complete code. It’s better suited for code review and completion.

Thus, we’ve identified potential use cases for DeepSeek-Distilled-Qwen-32B:

  • Educational Support: Leverages its knowledge and comprehension for teaching assistance, explanations, and Q&A.
  • Code Review: Uses code understanding and generation to automate reviews, spot issues, and suggest improvements.
  • Specialized Domains: Delivers high-quality text generation, knowledge retrieval, and decision support in fields like law, medicine, and finance.

VII. Outlook: Deployment Strategies for Larger Parameter Models

Through this exploration, we’ve gained deep insights into the multi-GPU deployment of DeepSeek-Distilled-Qwen-32B, hardware requirements, and performance across precision and parallel strategies. The 32B model’s robust capabilities open new possibilities for enterprise applications. Looking ahead, we anticipate seeing its value and potential in more real-world scenarios.

In future articles, we’ll cover:

  • Quantized Deployment of DeepSeek R1 Models: Deploying 671B-scale models with limited resources.
  • Full-Precision Deployment of DeepSeek R1 Models: Maximizing large models in high-performance environments.

By comparing models of varying sizes and precisions, we aim to provide comprehensive, detailed deployment plans for enterprise use. This will help industries adopt large language model technology quickly, unlocking business value.
Note: Some data in this article is illustrative. Actual conditions may vary. Detailed testing and validation are recommended during implementation.

 
