Precision:
Model parameters can be taken from the config.json
on HuggingFace or directly from the model via model.config
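As a minimal sketch of reading such parameters, the snippet below writes and then reads an illustrative config.json; the field values are assumptions for the example, but the field names follow the common transformers convention:

```python
import json

# Illustrative config.json excerpt (values are assumptions, field names
# follow the usual HuggingFace transformers convention).
example = {
    "hidden_size": 2048,
    "intermediate_size": 8192,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "vocab_size": 51200,
}
with open("config.json", "w") as f:
    json.dump(example, f)

# Read the parameters back, as one would from a downloaded config.json.
with open("config.json") as f:
    cfg = json.load(f)

print(cfg["hidden_size"])
# With a model already loaded, the same value is available as
#   model.config.hidden_size
```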
The expanded dimensionality within the MLP block. It is usually 4 × hidden size.
May differ from the number of attention heads when Grouped Query Attention is used
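To illustrate why this matters for memory, here is a rough KV-cache estimate under Grouped Query Attention; all shapes and counts below are assumed for the example, not taken from the calculator:

```python
# With GQA the KV cache stores num_key_value_heads heads per layer,
# not num_attention_heads. All numbers here are illustrative assumptions.
num_attention_heads = 32
num_key_value_heads = 8    # GQA: every 4 query heads share one K/V head
head_dim = 128
num_layers = 32
seq_len = 512
batch = 1
bytes_per_value = 2        # fp16

# 2 tensors (K and V) x batch x seq x kv_heads x head_dim x layers
kv_cache_bytes = (2 * batch * seq_len * num_key_value_heads
                  * head_dim * num_layers * bytes_per_value)
kv_cache_mib = kv_cache_bytes / 2**20
print(f"{kv_cache_mib:.0f} MiB")  # 4x smaller than with full multi-head attention
```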
Total VRAM usage is 4401 MiB
When PyTorch uses CUDA for the first time, it allocates between 300 MiB and 2 GiB of VRAM
Number of Parameters (1.418 billion) × number of bytes per parameter (2)
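The weights term above works out as follows (a direct restatement of the arithmetic, converted to MiB):

```python
# Model weights: 1.418 billion parameters, 2 bytes each (fp16/bf16).
num_params = 1.418e9
bytes_per_param = 2
weights_mib = num_params * bytes_per_param / 2**20
print(f"{weights_mib:.0f} MiB")  # about 2705 MiB
```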
Size of the largest tensor within the forward pass. It is estimated as the sum of all intermediate tensors within the computation of a single layer. Activation size grows quadratically with sequence length.
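The quadratic term comes from tensors like the attention score matrix, which has shape (batch, heads, seq, seq). A small sketch with assumed shapes (not the calculator's exact breakdown):

```python
# The attention score matrix grows as seq**2: doubling the sequence
# length quadruples this tensor. Shapes below are illustrative assumptions.
batch, heads, seq = 4, 16, 512
bytes_per_value = 2  # fp16
scores_bytes = batch * heads * seq * seq * bytes_per_value
print(scores_bytes / 2**20, "MiB")
```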
Batch Size (4) × Sequence Length (512) × Vocabulary Size (51200) × number of bytes per value (4). Even if the model is run in half precision, the output logits are almost always cast to fp32 within the model itself.
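Restating the logits arithmetic above in code:

```python
# Output logits: batch x seq x vocab, stored in fp32 (4 bytes per value).
batch, seq_len, vocab = 4, 512, 51200
bytes_fp32 = 4
logits_mib = batch * seq_len * vocab * bytes_fp32 / 2**20
print(f"{logits_mib:.0f} MiB")  # 400 MiB
```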
If you find something wrong or have suggestions, please create an issue/PR in the above repository.