# Fix: CUDA Out of Memory

Resolve "CUDA out of memory" errors when training or running inference.

## Error Message

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
```

or

```
RuntimeError: CUDA error: out of memory
```

or

```
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor
```
## Root Cause

GPU memory is exhausted. This happens when:

- **Model too large** - model parameters exceed available VRAM
- **Batch size too high** - processing too many samples at once
- **Memory fragmentation** - inefficient memory allocation patterns
- **Memory leaks** - tensors not properly released
- **Multiple processes** - other processes are using GPU memory
## Model Memory Requirements
| Model Size | FP32 | FP16/BF16 | INT8 | Recommended GPU |
|---|---|---|---|---|
| 7B params | ~28GB | ~14GB | ~7GB | T4 16GB, A10G 24GB |
| 13B params | ~52GB | ~26GB | ~13GB | A10G 24GB, L4 24GB |
| 70B params | ~280GB | ~140GB | ~70GB | A100 80GB, H100 80GB |
* Inference only. Training requires 2-3x more memory for gradients and optimizer states.
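As a rough rule of thumb, inference memory is parameter count times bytes per parameter, plus some headroom for activations and the KV cache. A minimal sketch of that back-of-the-envelope estimate; the 20% overhead factor is an assumption, not a measured value:

```python
# Rough VRAM estimate: parameters x bytes per parameter, plus overhead.
# The 20% overhead factor is an assumption to cover activations / KV cache.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def estimate_vram_gb(num_params: float, dtype: str = "fp16", overhead: float = 1.2) -> float:
    """Approximate inference VRAM in GB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 1e9

print(f"7B in fp16:  ~{estimate_vram_gb(7e9, 'fp16'):.0f} GB")   # ~17 GB with overhead
print(f"70B in int8: ~{estimate_vram_gb(70e9, 'int8'):.0f} GB")  # ~84 GB with overhead
```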
## Solutions

### 1. Reduce Batch Size

The quickest fix - halve your batch size until it fits:

```python
# Instead of batch_size=32, try batch_size=16 or batch_size=8
```
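If the smaller batch hurts convergence, gradient accumulation keeps the effective batch size while holding only one small micro-batch in GPU memory at a time. A minimal PyTorch sketch; the toy model, data, and loss are stand-ins for your own:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the example runs end to end (replace with your own model/data).
model = nn.Linear(128, 10).cuda()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=8)  # small micro-batch that fits in memory

# Accumulate 4 micro-batches -> effective batch size 32 without the memory cost.
accumulation_steps = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs.cuda())
    loss = loss_fn(outputs, targets.cuda()) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```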
### 2. Use Mixed Precision (FP16/BF16)

Roughly halves memory use with minimal accuracy loss:

```python
import torch
from torch.cuda.amp import autocast

# Run the forward pass in mixed precision
with autocast():
    output = model(inputs)

# Or cast the whole model to BF16 on Ampere or newer GPUs
model = model.to(torch.bfloat16)
```
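For FP16 training specifically, pair autocast with a gradient scaler so small gradients don't underflow in the backward pass. A minimal sketch of one training step; the toy model and data are placeholders for your own:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Toy model/optimizer so the snippet runs; substitute your own.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow

inputs = torch.randn(8, 128, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad()
with autocast():                       # forward pass runs in mixed precision
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()          # backward on the scaled loss
scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
scaler.update()
```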
### 3. Gradient Checkpointing

Trade compute for memory during training:

```python
# Hugging Face Transformers models
model.gradient_checkpointing_enable()

# Or enable it through the Trainer
from transformers import TrainingArguments
args = TrainingArguments(output_dir="out", gradient_checkpointing=True)
```
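In plain PyTorch (outside Transformers), the same trade-off is available via `torch.utils.checkpoint`, which recomputes a block's activations during the backward pass instead of storing them. A minimal sketch with a toy block:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Toy two-block model; checkpointing avoids storing block1's activations.
block1 = nn.Sequential(nn.Linear(128, 512), nn.ReLU()).cuda()
block2 = nn.Sequential(nn.Linear(512, 10)).cuda()

x = torch.randn(8, 128, device="cuda", requires_grad=True)
h = checkpoint(block1, x, use_reentrant=False)  # activations of block1 are not kept
out = block2(h)
out.sum().backward()                            # block1 is re-run here to get gradients
```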
### 4. Use Quantization (INT8/INT4)

Reduce model size by 2-4x:

```python
# bitsandbytes 8-bit loading via Transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    load_in_8bit=True,
    device_map="auto",
)
```
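Recent Transformers versions pass quantization options through `BitsAndBytesConfig` instead of the bare `load_in_8bit` flag. A sketch that loads a model in 4-bit NF4; `"model_name"` remains a placeholder for your model's Hub ID:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly 4x smaller than FP16 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in BF16 for accuracy
)

model = AutoModelForCausalLM.from_pretrained(
    "model_name",                  # placeholder: use your model's Hub ID
    quantization_config=bnb_config,
    device_map="auto",
)
```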
### 5. Clear Cache Periodically

Release memory that is no longer referenced and return cached blocks to the driver:

```python
import gc
import torch

# Collect unreferenced Python objects first, then release PyTorch's cached blocks
gc.collect()
torch.cuda.empty_cache()
```
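Note that `empty_cache()` only returns memory PyTorch has cached but is no longer using; tensors still referenced from Python stay allocated. A sketch of dropping references before clearing, inside a hypothetical evaluation loop:

```python
import gc
import torch

def run_eval_step(batch: torch.Tensor) -> float:
    """Placeholder for your real inference/eval code."""
    with torch.no_grad():
        return batch.cuda().sum().item()

batches = [torch.randn(1024, 1024) for _ in range(100)]
for i, batch in enumerate(batches):
    _ = run_eval_step(batch)
    del batch                      # drop the Python reference first
    if i % 25 == 0:                # occasional cleanup, not every iteration
        gc.collect()
        torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
```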
### 6. Monitor GPU Memory

Check what's using memory:

```bash
# Terminal: refresh GPU usage every second
watch -n 1 nvidia-smi
```

```python
# Python: PyTorch allocator statistics
import torch
print(torch.cuda.memory_summary())
```
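For a quick programmatic check, the allocator counters and the driver-level free/total numbers can be combined into one helper; the function name below is illustrative, not a PyTorch API:

```python
import torch

def print_gpu_memory(device: int = 0) -> None:
    """Illustrative helper: print allocated/reserved/free VRAM in GiB."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib   # tensors currently in use
    reserved = torch.cuda.memory_reserved(device) / gib     # cached by the allocator
    free, total = (x / gib for x in torch.cuda.mem_get_info(device))
    print(f"allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB  "
          f"free={free:.2f}/{total:.2f} GiB")

print_gpu_memory()
```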
## vLLM-Specific Solutions

### Limit GPU Memory Usage

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B \
  --gpu-memory-utilization 0.8
```
### Use Tensor Parallelism for Large Models

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2
```
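The same settings are available through vLLM's offline Python API. A minimal sketch combining the memory cap and tensor parallelism from the two commands above; the values are examples, not tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Offline equivalent of the CLI flags above; adjust to your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    gpu_memory_utilization=0.8,    # cap vLLM at 80% of each GPU's VRAM
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```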