Can I run it? Can you run it? LLM Memory Calculator

Estimate GPU VRAM usage of transformer-based models.
Running Parameters

  • Mode: Inference or Training

  • Precision: fp16/bf16 or fp32
Model Parameters

Model parameters can be taken from config.json on Hugging Face or pulled directly from the model via model.config, as shown in the sketch below.

  • Intermediate Size: the expanded dimensionality inside the MLP block. Usually it is 4 × hidden size.

  • Number of Key/Value Heads: might differ from the number of attention heads when the model uses Grouped Query Attention.
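
A minimal sketch of pulling these fields from a model config with the Hugging Face transformers library. The model id is just an arbitrary example, and the getattr fallbacks are assumptions for configs that omit a field:

```python
from transformers import AutoConfig

# "EleutherAI/pythia-1.4b" is only an example model id.
config = AutoConfig.from_pretrained("EleutherAI/pythia-1.4b")

hidden_size = config.hidden_size
num_layers = config.num_hidden_layers
num_attention_heads = config.num_attention_heads
vocab_size = config.vocab_size

# Usually 4 × hidden size; fall back to that if the config omits the field.
intermediate_size = getattr(config, "intermediate_size", 4 * hidden_size)

# Equals the number of attention heads unless the model uses Grouped Query Attention.
num_key_value_heads = getattr(config, "num_key_value_heads", num_attention_heads)

print(hidden_size, intermediate_size, num_layers,
      num_attention_heads, num_key_value_heads, vocab_size)
```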

Estimation Result
  • Total VRAM usage is 4401 MiB (this breakdown is reproduced in the sketch after this list)

  • CUDA Kernels use 1000 MiB of VRAM

    When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM

  • Parameters use 2705 MiB of VRAM

    Number of Parameters (1.418 billion) × number of bytes per parameter (2)

  • Activations use 296 MiB of VRAM

    The size of the largest tensor within the forward pass, estimated as the sum of all intermediate tensors computed within a single layer. Activation size has a quadratic dependence on Sequence Length because of the seq_len × seq_len attention score matrix.

  • Output tensor uses 400 MiB of VRAM

    Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per element (4). Even when the model runs inference in half precision, the output logits are almost always cast to fp32 within the model itself.
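
As a rough check, the breakdown above can be reproduced with a few lines of arithmetic. This is only a sketch of the inference-time estimate; the activation figure is copied from the calculator's output, since its per-layer formula depends on architecture details not listed here:

```python
# Minimal sketch reproducing the inference-time example above.
MIB = 2 ** 20

num_params = 1.418e9      # model size in parameters
bytes_per_param = 2       # fp16/bf16
batch_size, seq_len, vocab_size = 4, 512, 51200

cuda_kernels_mib = 1000                                   # first CUDA use: ~300 MiB to 2 GiB
params_mib = num_params * bytes_per_param / MIB           # ~2705 MiB
activations_mib = 296                                     # largest per-layer intermediate sum (from above)
output_mib = batch_size * seq_len * vocab_size * 4 / MIB  # logits cast to fp32 -> 400 MiB

total_mib = cuda_kernels_mib + params_mib + activations_mib + output_mib
print(f"Total VRAM: {total_mib:.0f} MiB")                 # ~4401 MiB
```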


Wondering "Can I run this LLM on my GPU?" This calculator helps you find out! It uses a comprehensive memory estimation model that accounts for model parameters, activations, CUDA kernels, and optimizer states. The calculations are based on PyTorch memory patterns and real-world LLM deployments. For an in-depth explanation and the logic behind these numbers, check out this blog post and the source code repository.

If you find something wrong or have suggestions, please create an issue/PR in the above repository.