# Fix: CUDA Out of Memory

Resolve "CUDA out of memory" errors when training or running inference.

## Error Message

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
```

or

```
RuntimeError: CUDA error: out of memory
```

or

```
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor
```
## Root Cause

GPU memory is exhausted. This happens when:

- **Model too large** - model parameters exceed available VRAM
- **Batch size too high** - processing too many samples at once
- **Memory fragmentation** - inefficient memory allocation patterns
- **Memory leaks** - tensors not properly released
- **Multiple processes** - other processes are using GPU memory
## Model Memory Requirements
| Model Size | FP32 | FP16/BF16 | INT8 | Recommended GPU |
|---|---|---|---|---|
| 7B params | ~28GB | ~14GB | ~7GB | T4 16GB, A10G 24GB |
| 13B params | ~52GB | ~26GB | ~13GB | A10G 24GB, L4 24GB |
| 70B params | ~280GB | ~140GB | ~70GB | A100 80GB, H100 80GB |
* Inference only. Training requires 2-3x more memory for gradients and optimizer states.
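As a rough rule of thumb, inference memory is parameter count times bytes per parameter, plus some headroom for activations and the KV cache. A minimal sketch of that back-of-the-envelope estimate; the 20% overhead factor is an assumption, not a measured value:

```python
# Rough VRAM estimate: parameters x bytes per parameter, plus overhead.
# The 20% overhead factor is an assumption to cover activations / KV cache.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def estimate_vram_gb(num_params: float, dtype: str = "fp16", overhead: float = 1.2) -> float:
    """Approximate inference VRAM in GB for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 1e9

print(f"7B in fp16:  ~{estimate_vram_gb(7e9, 'fp16'):.0f} GB")   # ~17 GB with overhead
print(f"70B in int8: ~{estimate_vram_gb(70e9, 'int8'):.0f} GB")  # ~84 GB with overhead
```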
## Solutions

### 1. Reduce Batch Size

The quickest fix - halve your batch size until it fits:

```python
# Instead of batch_size=32, try batch_size=16 or batch_size=8
```
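If the smaller batch hurts convergence, gradient accumulation keeps the effective batch size while holding only one small micro-batch in GPU memory at a time. A minimal PyTorch sketch; the toy model, data, and loss are stand-ins for your own:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the example runs end to end (replace with your own model/data).
model = nn.Linear(128, 10).cuda()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=8)  # small micro-batch that fits in memory

# Accumulate 4 micro-batches -> effective batch size 32 without the memory cost.
accumulation_steps = 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    outputs = model(inputs.cuda())
    loss = loss_fn(outputs, targets.cuda()) / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```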
### 2. Use Mixed Precision (FP16/BF16)

Roughly halves memory use with minimal accuracy loss:

```python
import torch
from torch.cuda.amp import autocast

# Run the forward pass in mixed precision
with autocast():
    output = model(inputs)

# Or cast the whole model to BF16 on Ampere or newer GPUs
model = model.to(torch.bfloat16)
```
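For FP16 training specifically, pair autocast with a gradient scaler so small gradients don't underflow in the backward pass. A minimal sketch of one training step; the toy model and data are placeholders for your own:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Toy model/optimizer so the snippet runs; substitute your own.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = GradScaler()  # scales the loss so FP16 gradients don't underflow

inputs = torch.randn(8, 128, device="cuda")
targets = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad()
with autocast():                       # forward pass runs in mixed precision
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()          # backward on the scaled loss
scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
scaler.update()
```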
### 3. Gradient Checkpointing

Trade compute for memory during training:

```python
# Hugging Face Transformers models
model.gradient_checkpointing_enable()

# Or enable it through the Trainer
from transformers import TrainingArguments
args = TrainingArguments(output_dir="out", gradient_checkpointing=True)
```
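In plain PyTorch (outside Transformers), the same trade-off is available via `torch.utils.checkpoint`, which recomputes a block's activations during the backward pass instead of storing them. A minimal sketch with a toy block:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Toy two-block model; checkpointing avoids storing block1's activations.
block1 = nn.Sequential(nn.Linear(128, 512), nn.ReLU()).cuda()
block2 = nn.Sequential(nn.Linear(512, 10)).cuda()

x = torch.randn(8, 128, device="cuda", requires_grad=True)
h = checkpoint(block1, x, use_reentrant=False)  # activations of block1 are not kept
out = block2(h)
out.sum().backward()                            # block1 is re-run here to get gradients
```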
### 4. Use Quantization (INT8/INT4)

Reduce model size by 2-4x:

```python
# bitsandbytes 8-bit loading via Transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model_name",
    load_in_8bit=True,
    device_map="auto",
)
```
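Recent Transformers versions pass quantization options through `BitsAndBytesConfig` instead of the bare `load_in_8bit` flag. A sketch that loads a model in 4-bit NF4; `"model_name"` remains a placeholder for your model's Hub ID:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly 4x smaller than FP16 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in BF16 for accuracy
)

model = AutoModelForCausalLM.from_pretrained(
    "model_name",                  # placeholder: use your model's Hub ID
    quantization_config=bnb_config,
    device_map="auto",
)
```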
### 5. Clear Cache Periodically

Release memory that is no longer referenced and return cached blocks to the driver:

```python
import gc
import torch

# Collect unreferenced Python objects first, then release PyTorch's cached blocks
gc.collect()
torch.cuda.empty_cache()
```
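Note that `empty_cache()` only returns memory PyTorch has cached but is no longer using; tensors still referenced from Python stay allocated. A sketch of dropping references before clearing, inside a hypothetical evaluation loop:

```python
import gc
import torch

def run_eval_step(batch: torch.Tensor) -> float:
    """Placeholder for your real inference/eval code."""
    with torch.no_grad():
        return batch.cuda().sum().item()

batches = [torch.randn(1024, 1024) for _ in range(100)]
for i, batch in enumerate(batches):
    _ = run_eval_step(batch)
    del batch                      # drop the Python reference first
    if i % 25 == 0:                # occasional cleanup, not every iteration
        gc.collect()
        torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
```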
### 6. Monitor GPU Memory

Check what's using memory:

```bash
# Terminal: refresh GPU usage every second
watch -n 1 nvidia-smi
```

```python
# Python: PyTorch allocator statistics
import torch
print(torch.cuda.memory_summary())
```
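For a quick programmatic check, the allocator counters and the driver-level free/total numbers can be combined into one helper; the function name below is illustrative, not a PyTorch API:

```python
import torch

def print_gpu_memory(device: int = 0) -> None:
    """Illustrative helper: print allocated/reserved/free VRAM in GiB."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib   # tensors currently in use
    reserved = torch.cuda.memory_reserved(device) / gib     # cached by the allocator
    free, total = (x / gib for x in torch.cuda.mem_get_info(device))
    print(f"allocated={allocated:.2f} GiB  reserved={reserved:.2f} GiB  "
          f"free={free:.2f}/{total:.2f} GiB")

print_gpu_memory()
```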
## vLLM-Specific Solutions

### Limit GPU Memory Usage

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B \
  --gpu-memory-utilization 0.8
```
### Use Tensor Parallelism for Large Models

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2
```
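The same settings are available through vLLM's offline Python API. A minimal sketch combining the memory cap and tensor parallelism from the two commands above; the values are examples, not tuned recommendations:

```python
from vllm import LLM, SamplingParams

# Offline equivalent of the CLI flags above; adjust to your hardware.
llm = LLM(
    model="meta-llama/Llama-3.1-70B",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    gpu_memory_utilization=0.8,    # cap vLLM at 80% of each GPU's VRAM
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```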