Common Errors

Fix: Docker cannot find the GPU

Resolve "could not select device driver" or "nvidia-container-cli" errors

Error messages

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

nvidia-container-cli: initialization error

torch.cuda.is_available() returns False
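As a rough aid, the three symptoms above can be mapped to the step most likely to resolve each one. The sketch below is a heuristic only: the function name suggest_fix and the mapping itself are assumptions made for illustration, and a real failure may have more than one cause.

```shell
# Heuristic mapping from an error message to the step most likely to fix it.
# The mapping is an illustrative assumption, not an exhaustive diagnosis.
suggest_fix() {
    case "$1" in
        *"could not select device driver"*)
            echo "Install NVIDIA Container Toolkit and configure Docker (steps 2 and 3)" ;;
        *"nvidia-container-cli"*)
            echo "Check the host driver with nvidia-smi (step 1)" ;;
        *"torch.cuda.is_available"*)
            echo "Start the container with --gpus all (step 5)" ;;
        *)
            echo "Unrecognized message: work through the quick checklist" ;;
    esac
}
```

For example, suggest_fix "nvidia-container-cli: initialization error" points at the host driver check.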

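Root cause

Docker cannot access the host GPU. Common causes:

  • NVIDIA Container Toolkit is not installed - required for GPU passthrough
  • The Docker daemon is not configured - the nvidia runtime configuration is missing
  • The --gpus flag is missing - the container did not request GPU access at startup
  • The driver is not loaded - the NVIDIA kernel module is not active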
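Three of the four causes listed above can be checked from the host before starting any container. The following is a minimal sketch, assuming a Linux host; the function name check_gpu_prereqs is hypothetical, and the fourth cause (a missing --gpus flag) can only be checked on the docker run command line itself.

```shell
# Report the state of each host-side prerequisite, one line per check.
check_gpu_prereqs() {
    # Driver loaded: the NVIDIA kernel module exposes this file when active.
    if [ -r /proc/driver/nvidia/version ]; then
        echo "driver: ok"
    else
        echo "driver: NOT loaded"
    fi
    # Toolkit installed: nvidia-ctk ships with nvidia-container-toolkit.
    if command -v nvidia-ctk >/dev/null 2>&1; then
        echo "toolkit: ok"
    else
        echo "toolkit: NOT installed"
    fi
    # Docker configured: docker info lists the registered runtimes.
    if command -v docker >/dev/null 2>&1 &&
       docker info 2>/dev/null | grep -qi 'runtimes:.*nvidia'; then
        echo "runtime: ok"
    else
        echo "runtime: NOT configured"
    fi
}
```

Calling check_gpu_prereqs prints three lines; any line containing NOT points at the step to revisit.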

Solution

Step 1: Verify the NVIDIA driver

nvidia-smi

This should list the GPU and the driver version. If it does not, install the NVIDIA driver first.

Step 2: Install NVIDIA Container Toolkit

# Add the repository

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

Step 3: Configure Docker

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker
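To confirm that the configure command took effect, one option is to look for an nvidia entry in the Docker daemon configuration. The sketch below assumes the configuration lives at the default path /etc/docker/daemon.json and uses a plain text match rather than a JSON parser; the function name has_nvidia_runtime is hypothetical.

```shell
# Return success if the given daemon.json registers a runtime named "nvidia".
# This is a plain text match; a JSON-aware tool such as jq would be stricter.
has_nvidia_runtime() {
    config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] && grep -q '"nvidia"[[:space:]]*:' "$config"
}
```

A typical call is has_nvidia_runtime && echo "nvidia runtime is registered". If the check fails, rerun the configure command and restart Docker.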

Step 4: Test GPU access

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Step 5: Run the container

Always pass the --gpus all flag:

docker run --gpus all your-image python -c "import torch; print(torch.cuda.is_available())"

Quick checklist

  • nvidia-smi works on the host
  • nvidia-container-toolkit is installed
  • The Docker daemon was restarted after installation
  • Containers are started with --gpus all or --gpus "device=0"
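The last checklist item allows two forms of the flag: all GPUs, or specific devices. A small helper can produce the right value for either case. This is a sketch under the assumption that devices are given as a comma-separated list of indices; the name gpus_value is hypothetical. The inner double quotes are kept in the output because Docker expects them when the device list contains a comma.

```shell
# Build the value for docker's --gpus flag.
#   no argument or "all" -> all
#   a device list        -> "device=<list>" (with literal inner quotes)
gpus_value() {
    case "${1:-all}" in
        all) echo "all" ;;
        *)   printf '"device=%s"\n' "$1" ;;
    esac
}
```

Usage: docker run --rm --gpus "$(gpus_value 0,1)" your-image nvidia-smi, where your-image stands for your own image name.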

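Generate a GPU-enabled Dockerfile

Configuration options

Local GPU or CPU environment

Recommended for 2025: native Blackwell (compute capability 10.0) support, using the official cu128 builds

Requires NVIDIA driver version >= 570.26.00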
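The minimum driver version stated above can be checked on the host before building. The sketch below assumes GNU sort (for version ordering with -V) and an nvidia-smi that supports the --query-gpu option; the function names are hypothetical.

```shell
# Return success when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the host driver version reported by nvidia-smi (first GPU only).
driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}

# Compare a driver version against the 570.26.00 minimum.
driver_ok() {
    version_ge "$1" "570.26.00"
}
```

Usage: driver_ok "$(driver_version)" && echo "driver is new enough". If the check fails, upgrade the host driver before running the image.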
Dockerfile
# syntax=docker/dockerfile:1
# ^ Required for BuildKit cache mounts and advanced features

# Generated by DockerFit (https://tools.eastondev.com/docker)
# PYTORCH 2.9.1 + CUDA 12.8 | Python 3.11
# Multi-stage build for optimized image size

# ==============================================================================
# Stage 1: Builder - Install dependencies and compile
# ==============================================================================
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 AS builder

# Build arguments
ARG DEBIAN_FRONTEND=noninteractive

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;10.0"

# Install Python 3.11 from deadsnakes PPA (Ubuntu 24.04)
RUN apt-get update && apt-get install -y --no-install-recommends \
    software-properties-common \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y --no-install-recommends \
    python3.11 \
    python3.11-venv \
    python3.11-dev \
    build-essential \
    git \
    && rm -rf /var/lib/apt/lists/*

# Create virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3.11 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Upgrade pip
RUN pip install --no-cache-dir --upgrade pip setuptools wheel

# Install PyTorch with BuildKit cache
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu128

# Install project dependencies
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

# ==============================================================================
# Stage 2: Runtime - Minimal production image
# ==============================================================================
FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu24.04 AS runtime

# Build arguments (not inherited from the builder stage)
ARG DEBIAN_FRONTEND=noninteractive

# Labels
LABEL maintainer="Generated by DockerFit"
LABEL version="2.9.1"
LABEL description="PYTORCH 2.9.1 + CUDA 12.8"

# Environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Install Python 3.11 runtime from deadsnakes PPA (Ubuntu 24.04)
RUN apt-get update && apt-get install -y --no-install-recommends \
    software-properties-common \
    && add-apt-repository -y ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y --no-install-recommends \
    python3.11 \
    libgomp1 \
    && apt-get remove -y software-properties-common \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/*

# Create non-root user for security
# Ubuntu 24.04 base images ship a default "ubuntu" user with UID/GID 1000;
# remove it first so the IDs below are free
ARG USERNAME=appuser
ARG USER_UID=1000
ARG USER_GID=$USER_UID
RUN if id -u ubuntu >/dev/null 2>&1; then userdel -r ubuntu; fi \
    && groupadd --gid $USER_GID $USERNAME \
    && useradd --uid $USER_UID --gid $USER_GID -m $USERNAME

# Copy virtual environment from builder
COPY --from=builder --chown=$USERNAME:$USERNAME /opt/venv /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY --chown=$USERNAME:$USERNAME . .

# Switch to non-root user
USER $USERNAME

# Expose port
EXPOSE 8000

# Default command
CMD ["python", "main.py"]
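Once the Dockerfile is saved next to requirements.txt and main.py, building and smoke-testing the image might look like the sketch below. The image tag pytorch-gpu:latest and the function names are hypothetical, and the DOCKER variable is an assumption added so the commands can be previewed (for example with DOCKER=echo) without running them.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

# Build the image from the Dockerfile in the current directory.
# BuildKit is enabled explicitly because the Dockerfile uses cache mounts.
build_image() {
    DOCKER_BUILDKIT=1 $DOCKER build -t "${1:-pytorch-gpu:latest}" .
}

# Run a one-off container and print whether PyTorch can see a GPU.
smoke_test() {
    $DOCKER run --rm --gpus all "${1:-pytorch-gpu:latest}" \
        python -c "import torch; print(torch.cuda.is_available())"
}
```

Running build_image followed by smoke_test should print True on a correctly configured host. A result of False usually means one of the checklist items above is still unmet.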