
Fix: CUDA Out of Memory

Resolve "CUDA out of memory" errors when training or running inference

Error Message

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB

or

RuntimeError: CUDA error: out of memory

or

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor

Root Cause

GPU memory is exhausted. This happens when:

  • Model too large - Model parameters exceed available VRAM
  • Batch size too high - Processing too many samples at once
  • Memory fragmentation - Inefficient memory allocation patterns
  • Memory leaks - Tensors not properly released
  • Multiple processes - Other processes using GPU memory

Model Memory Requirements

Model Size | FP32   | FP16/BF16 | INT8  | Recommended GPU
7B params  | ~28GB  | ~14GB     | ~7GB  | T4 16GB, A10G 24GB
13B params | ~52GB  | ~26GB     | ~13GB | A10G 24GB, L4 24GB
70B params | ~280GB | ~140GB    | ~70GB | A100 80GB, H100 80GB

* Inference only. Training requires 2-3x more memory for gradients and optimizer states.
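
These figures follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter (4 for FP32, 2 for FP16/BF16, 1 for INT8), before activations and KV cache are counted. A minimal sketch (the helper name is illustrative, not from any library):

def estimate_weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Lower bound: model weights only, no activations, gradients, or KV cache."""
    return num_params * bytes_per_param / 1e9

print(estimate_weight_memory_gb(7e9, 2))   # 7B params in FP16 -> ~14 GB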

Solutions

1. Reduce Batch Size

The quickest fix is to halve your batch size until the step fits:

# Instead of batch_size=32, try batch_size=16 or batch_size=8
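
To find the largest batch that fits empirically, a common pattern is to catch the OOM error and retry with half the batch size. A rough sketch, assuming you supply a run_step(model, batch_size) callable that performs one forward/backward pass (both names are illustrative):

import torch

def find_max_batch_size(model, run_step, start=32, minimum=1):
    """Halve the batch size until a single step succeeds, then return it."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(model, batch_size)       # user-supplied: one forward/backward pass
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()          # release memory from the failed attempt
            batch_size //= 2
    raise RuntimeError("Even the minimum batch size does not fit in GPU memory")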

2. Use Mixed Precision (FP16/BF16)

Roughly halves memory use with minimal accuracy loss:

# PyTorch
from torch.cuda.amp import autocast

with autocast():
    output = model(input)

# Or use BF16 on Ampere+ GPUs
model = model.to(torch.bfloat16)
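
For FP16 training specifically, autocast is usually paired with a gradient scaler so small gradients do not underflow to zero. A minimal training-step sketch; model, optimizer, and loader are assumed to already exist and live on the GPU:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                      # scales the loss to keep FP16 gradients representable

for inputs, targets in loader:             # batches assumed to be on the GPU already
    optimizer.zero_grad(set_to_none=True)
    with autocast():                       # forward pass runs in reduced precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
    scaler.update()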

3. Gradient Checkpointing

Trade compute for memory during training:

# Hugging Face Transformers
model = AutoModel.from_pretrained(...)
model.gradient_checkpointing_enable()

# Plain PyTorch: torch.utils.checkpoint (see the sketch below)
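
In plain PyTorch, outside of Transformers, the same idea is available through torch.utils.checkpoint. A sketch on an illustrative nn.Sequential model, splitting it into 4 checkpointed segments:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()
x = torch.randn(16, 1024, device="cuda", requires_grad=True)

# Activations inside each segment are recomputed during backward instead of stored
out = checkpoint_sequential(model, 4, x)
out.sum().backward()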

4. Use Quantization (INT8/INT4)

Reduce model size by 2-4x:

# bitsandbytes 8-bit
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    load_in_8bit=True,
    device_map="auto",
)
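
The heading also mentions INT4; with a recent transformers and bitsandbytes install, 4-bit loading goes through BitsAndBytesConfig. A sketch, with "model_name" as a placeholder:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat, a common default
    bnb_4bit_compute_dtype=torch.bfloat16, # matmuls run in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "model_name",                          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)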

5. Clear Cache Periodically

Free up fragmented memory:

import gc
import torch

gc.collect()               # drop Python references to unreachable tensors first
torch.cuda.empty_cache()   # then return cached blocks to the CUDA driver
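
Note that empty_cache() cannot free tensors that Python still references. A classic leak, related to the "memory leaks" cause above, is accumulating the loss tensor itself, which keeps every step's computation graph alive; store a plain Python number instead:

total_loss += loss          # leaks: retains the graph for every step
total_loss += loss.item()   # fine: keeps only a Python float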

6. Monitor GPU Memory

Check what's using memory:

# Terminal
watch -n 1 nvidia-smi

# Python
print(torch.cuda.memory_summary())
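
For a quick in-code check, torch.cuda also exposes allocation counters for the current device:

import torch

allocated = torch.cuda.memory_allocated() / 1e9   # memory held by live tensors
reserved = torch.cuda.memory_reserved() / 1e9     # memory held by PyTorch's caching allocator
print(f"allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")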

vLLM-Specific Solutions

Limit GPU Memory Usage

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B --gpu-memory-utilization 0.8

Use Tensor Parallelism for Large Models

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B --tensor-parallel-size 2
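
The same knobs exist in vLLM's offline Python API if you are not running the OpenAI-compatible server; a sketch (defaults vary slightly across vLLM versions):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-3B",
    gpu_memory_utilization=0.8,    # cap the fraction of VRAM vLLM pre-allocates
    # tensor_parallel_size=2,      # uncomment to shard a larger model across 2 GPUs
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)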