Precision:
Model parameters can be taken from the config.json
on HuggingFace or directly from the model via model.config
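As a minimal sketch of reading such parameters, the snippet below writes and then reads an illustrative config.json; the field values are assumptions for the example, but the field names follow the common transformers convention:

```python
import json

# Illustrative config.json excerpt (values are assumptions, field names
# follow the usual HuggingFace transformers convention).
example = {
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "vocab_size": 51200,
}
with open("config.json", "w") as f:
    json.dump(example, f)

# Read the parameters back, as one would from a downloaded config.json.
with open("config.json") as f:
    cfg = json.load(f)

print(cfg["hidden_size"])
# With a model already loaded, the same value is available as
#   model.config.hidden_size
```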
The expanded dimensionality within the MLP block. It is usually 4 × hidden size.
May differ from the number of attention heads when Grouped Query Attention is used
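To illustrate why this matters for memory, here is a rough KV-cache estimate under Grouped Query Attention; all shapes and counts below are assumed for the example, not taken from the calculator:

```python
# With GQA the KV cache stores num_key_value_heads heads per layer,
# not num_attention_heads. All numbers here are illustrative assumptions.
num_attention_heads = 32
num_key_value_heads = 8    # GQA: every 4 query heads share one K/V head
head_dim = 128
num_layers = 32
seq_len = 512
batch = 1
bytes_per_value = 2        # fp16

# 2 tensors (K and V) x batch x seq x kv_heads x head_dim x layers
kv_cache_bytes = (2 * batch * seq_len * num_key_value_heads
                  * head_dim * num_layers * bytes_per_value)
kv_cache_mib = kv_cache_bytes / 2**20
print(f"{kv_cache_mib:.0f} MiB")  # 4x smaller than with full multi-head attention
```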
Total VRAM usage is 4401 MiB
When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM
Number of Parameters (1.418 billion) × number of bytes per parameter (2)
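The weights term above works out as follows (a direct restatement of the arithmetic, converted to MiB):

```python
# Model weights: 1.418 billion parameters, 2 bytes each (fp16/bf16).
num_params = 1.418e9
bytes_per_param = 2
weights_mib = num_params * bytes_per_param / 2**20
print(f"{weights_mib:.0f} MiB")  # about 2705 MiB
```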
Size of the largest tensor within the forward pass. It is estimated as the sum of all intermediate tensors within the computation of a single layer. Activation size grows quadratically with sequence length.
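The quadratic term comes from tensors like the attention score matrix, which has shape (batch, heads, seq, seq). A small sketch with assumed shapes (not the calculator's exact breakdown):

```python
# The attention score matrix grows as seq**2: doubling the sequence
# length quadruples this tensor. Shapes below are illustrative assumptions.
batch, heads, seq = 4, 16, 512
bytes_per_value = 2  # fp16
scores_bytes = batch * heads * seq * seq * bytes_per_value
print(scores_bytes / 2**20, "MiB")
```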
Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per value (4). Even if the model is run in half precision, the output logits are almost always cast to fp32 within the model itself.
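Restating the logits arithmetic above in code:

```python
# Output logits: batch x seq x vocab, stored in fp32 (4 bytes per value).
batch, seq_len, vocab = 4, 512, 51200
bytes_fp32 = 4
logits_mib = batch * seq_len * vocab * bytes_fp32 / 2**20
print(f"{logits_mib:.0f} MiB")  # 400 MiB
```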
If you find something wrong or have suggestions, please create an issue/PR in the above repository.