Training Compute-Optimal Large Language Models (Chinchilla)
Capabilities (5 decomposed)
compute-optimal model scaling with token-to-parameter ratio optimization
Medium confidence
Determines the optimal allocation of a fixed training compute budget between model parameters and training tokens using empirical scaling laws fit to training runs across many model sizes. The approach fits power-law relationships to observed loss curves, then solves for the compute-optimal allocation, finding that parameters and tokens should scale equally with compute (N_opt ∝ C^0.5 and D_opt ∝ C^0.5 under the approximation C ≈ 6ND, which works out to roughly 20 training tokens per parameter). This differs from the prior Kaplan et al. scaling laws, which scaled parameters much faster than data and therefore produced undertrained models.
Empirically derives compute-optimal scaling laws by training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens, discovering that parameter count and token count should scale equally with compute budget (contrary to Kaplan et al., whose recommendations led to significantly undertrained models). Uses power-law fitting to loss curves across multiple scales to establish generalizable relationships.
More compute-efficient than the prior Kaplan recommendations: at the same budget as the 280B-parameter Gopher, the compute-optimal 70B Chinchilla trained on roughly 4x the data achieves lower loss and better downstream performance. Provides empirically grounded recommendations rather than theoretical extrapolations, making it more reliable for practical training budget allocation decisions.
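This allocation rule is easy to sketch in code. The snippet below is a minimal illustration, assuming the standard FLOP approximation C ≈ 6ND and the paper's roughly-20-tokens-per-parameter finding; `compute_optimal_allocation` is a hypothetical helper name, not from the paper.

```python
def compute_optimal_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C between parameters N and training tokens D.

    Assumes C ~= 6*N*D (forward + backward pass) and the Chinchilla
    finding that the compute-optimal ratio D/N is roughly 20.
    Substituting D = 20*N gives 120*N**2 = C, so N = sqrt(C / 120).
    """
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Gopher/Chinchilla-scale budget of ~5.76e23 FLOPs recovers
# roughly 70B parameters and 1.4T tokens.
n, d = compute_optimal_allocation(5.76e23)
```

At Gopher's budget this reproduces the paper's headline numbers (a ~70B model on ~1.4T tokens), which is a useful sanity check on the rule of thumb.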
loss prediction across model scales via empirical scaling law interpolation
Medium confidence
Predicts training loss for unseen model sizes by fitting the parametric form L(N, D) = E + A/N^α + B/D^β to loss measurements from trained models at multiple scales, then interpolating or extrapolating to new parameter/token combinations. The model captures how loss decreases with both parameter count and data size, enabling loss prediction without retraining. The reported fit gives α ≈ 0.34 and β ≈ 0.28, with E ≈ 1.69 as the irreducible loss of natural text.
Fits a joint power law (loss as a function of both parameters and tokens) rather than a unidirectional extrapolation; substituting the compute constraint C ≈ 6ND into the fitted form yields a single compute-dependent optimum in which parameters and tokens contribute almost equally, enabling unified compute-optimal recommendations.
More accurate than prior Kaplan scaling laws for predicting loss at new scales because it accounts for parameter and token scaling simultaneously; enables loss prediction without retraining, saving weeks of compute compared to empirical validation.
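As a sketch, the fitted parametric form can be evaluated directly. The constants below are the published Chinchilla fit (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); `predicted_loss` is an illustrative name, not an API from the paper.

```python
# Constants from the published Chinchilla parametric fit (Hoffmann et al. 2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Loss falls as either parameters or tokens grow, approaching the
# irreducible term E; no retraining is needed to get these estimates.
small_run = predicted_loss(1e9, 2e10)    # 1B params, 20B tokens
big_run = predicted_loss(7e10, 1.4e12)   # ~Chinchilla scale
```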
compute budget allocation solver for parameter-token tradeoff
Medium confidence
Given a fixed training compute budget (measured in FLOPs), solves for the optimal split between model parameters (N) and training tokens (D) by applying the derived scaling law relationships. The solver uses the constraint C ≈ 6ND (accounting for forward and backward passes) and the empirical finding that the optimal allocation scales both quantities equally, N_opt ∝ C^0.5 and D_opt ∝ C^0.5, with roughly 20 tokens per parameter. This produces a deterministic recommendation for model size and dataset size given a compute budget.
Solves the parameter-token allocation problem as a constrained optimization using empirically derived scaling laws, producing deterministic recommendations rather than heuristics. The key insight is that equal scaling of parameters and tokens (N ∝ D ∝ √C) is optimal, contrary to the prior practice of training heavily undertrained large models.
Provides data-driven allocation recommendations versus rule-of-thumb approaches; accounts for parameter and token scaling simultaneously rather than treating them independently, yielding markedly better compute efficiency than Kaplan-based allocations (Chinchilla outperforms the same-budget Gopher).
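A minimal version of such a solver, sketched under the assumption that we minimize the published parametric loss subject to C ≈ 6ND; the constraint eliminates D, so the search is one-dimensional. Note that the parametric fit's exact optimum can drift from the 20-tokens-per-parameter rule of thumb, since the paper's three estimation approaches agree only approximately.

```python
# Published Chinchilla parametric fit (Hoffmann et al. 2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def optimal_split(flops: float, grid: int = 2000):
    """Minimize L(N, C/(6N)) by scanning log-spaced parameter counts.

    The constraint C = 6*N*D eliminates D, turning the constrained
    optimization into a 1-D search over N (here, 1e6 to 1e14 params).
    """
    best = None
    for i in range(grid):
        n = 10 ** (6 + 8 * i / (grid - 1))
        d = flops / (6.0 * n)
        cur = (loss(n, d), n, d)
        if best is None or cur[0] < best[0]:
            best = cur
    return best

best_loss, best_n, best_d = optimal_split(5.76e23)
```

A grid scan is deliberately simple; any 1-D minimizer would do once the constraint has been substituted in.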
empirical scaling law fitting and validation across model scales
Medium confidence
Trains over 400 model instances ranging from 70M to over 16B parameters with varying token counts (5B to 500B tokens), measures training loss curves, and fits power-law functions to the observed data. The fitting process regresses on log-log scale to extract scaling exponents and coefficients, then validates the fit by comparing predicted versus observed loss, most notably by training the 70B-parameter Chinchilla at the predicted compute-optimal allocation. This creates an empirical foundation for all downstream scaling law predictions and recommendations.
Conducts systematic empirical training across hundreds of runs spanning 70M to over 16B parameters with multiple token counts per scale, fitting joint power-law relationships rather than relying on theoretical extrapolation, and validates the fitted laws at the much larger 70B scale.
More comprehensive than the prior Kaplan et al. study, which fit its laws on models up to roughly 1.5B parameters and varied parameters and tokens largely independently; provides empirically grounded exponents rather than theoretical predictions.
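A toy version of the fitting step, using synthetic runs and an assumed-known irreducible term E so that ordinary least squares on the log-log plot recovers the exponent (the paper fits all constants jointly with a robust objective; this sketch only illustrates the log-log regression idea).

```python
import math, random

random.seed(0)

# Synthetic "training runs": loss = E + b * N**(-m), with small noise on
# the power-law term. The true exponent m is what the fit should recover.
E_TRUE, B_TRUE, M_TRUE = 1.7, 400.0, 0.34
sizes = [7e7, 4e8, 1e9, 7e9, 1.6e10]     # 70M .. 16B, as in the paper
losses = [E_TRUE + B_TRUE * n**-M_TRUE * math.exp(random.gauss(0, 0.01))
          for n in sizes]

# Least squares on the log-log plot: log(L - E) = log(b) - m * log(N).
# With E assumed known, the slope of the straight-line fit gives -m.
xs = [math.log(n) for n in sizes]
ys = [math.log(l - E_TRUE) for l in losses]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
m_fit = -slope
```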
training efficiency benchmarking and comparison across scales
Medium confidence
Measures and compares training efficiency across different model sizes and token counts, quantified as the loss achieved per unit of compute (FLOPs), enabling direct comparison of whether larger models or more tokens provide better returns on compute investment. The benchmarking reveals that compute-optimal allocation (equal parameter-token scaling) achieves lower loss at a given budget than either parameter-heavy or token-heavy alternatives.
Systematically benchmarks training efficiency across model sizes from 70M to over 16B parameters (validated at 70B) and a wide range of token counts, revealing that the compute-optimal allocation of roughly 20 tokens per parameter beats both undertrained and overtrained alternatives at fixed compute. Provides empirical efficiency curves rather than theoretical predictions.
More comprehensive efficiency analysis than prior work because it varies parameters and tokens jointly; reveals that equal scaling is optimal, contradicting the prior practice of training very large, undertrained models.
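The efficiency comparison can be sketched by evaluating the published parametric fit at three ways of spending the same budget. The specific ratios and the `benchmark` helper are illustrative choices, not from the paper; under this fit the exact optimum ratio drifts with the budget, but both extremes clearly waste compute.

```python
# Published Chinchilla parametric fit (Hoffmann et al. 2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def benchmark(flops: float) -> dict:
    """Loss reached by three allocations of the same budget C = 6*N*D."""
    ratios = {
        "param_heavy": 2.0,     # big undertrained model, Gopher-style
        "balanced": 20.0,       # Chinchilla's ~20 tokens per parameter
        "token_heavy": 2000.0,  # small, heavily overtrained model
    }
    results = {}
    for name, ratio in ratios.items():
        n = (flops / (6.0 * ratio)) ** 0.5   # solve 6*N*(ratio*N) = C
        results[name] = loss(n, ratio * n)
    return results

results = benchmark(5.76e23)   # balanced allocation wins at fixed compute
```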
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Training Compute-Optimal Large Language Models (Chinchilla), ranked by overlap. Discovered automatically through the match graph.
ultrascale-playbook
ultrascale-playbook — AI demo on HuggingFace
OPT
Open Pretrained Transformers (OPT) by Facebook is a suite of decoder-only pre-trained transformers. [Announcement](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/).
CS324 - Advances in Foundation Models - Stanford University

CS25: Transformers United V2 - Stanford University

GPTSwarm
Language Agents as Optimizable Graphs
Best For
- ✓ ML researchers designing new LLM training runs
- ✓ ML engineers planning compute budgets for foundation models
- ✓ Teams deciding between scaling parameters vs data for fixed compute
- ✓ ML researchers planning experiments and allocating compute budgets
- ✓ Teams evaluating whether to scale up existing models
- ✓ Organizations comparing multiple model architecture proposals
- ✓ ML engineers planning training infrastructure and hardware procurement
- ✓ Teams with fixed compute budgets (e.g., limited GPU hours or cloud credits)
Known Limitations
- ⚠ Scaling laws derived from specific training setups (dense transformer architectures, particular optimization hyperparameters); may not transfer perfectly to sparse models, mixture-of-experts, or different training regimes
- ⚠ Empirical fitting introduces uncertainty bounds; predictions degrade beyond the range of observed model sizes (roughly 70M to 16B parameters in the paper, validated at 70B)
- ⚠ Does not account for inference cost optimization or downstream task performance variance; optimizes only for training loss
- ⚠ Assumes homogeneous data quality and standard cross-entropy loss; does not model domain-specific or curriculum learning effects
- ⚠ Accuracy degrades outside the range of observed scales; extrapolation far beyond the 70B validation point is unreliable
- ⚠ Assumes smooth power-law relationships; does not capture phase transitions, emergent capabilities, or task-specific performance cliffs
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Categories
Alternatives to Training Compute-Optimal Large Language Models (Chinchilla)
Data Sources