`<think>` blocks) is preserved at each timestep. Consider a two-turn math conversation:
**Timestep 1:**
```
User: What is 2+2?
Assistant: <think>Let me calculate...</think> 4
User:
```
**Timestep 2:**
```
User: What is 2+2?
Assistant: <think>Let me calculate...</think> 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
```
Notice that the observation (green) at timestep 2 contains the entire timestep 1 sequence as a prefix. The new observation just appends `What is 3+3?\n\nAssistant: ` to the end. This is the **extension property**.
Because extension holds, the RL code can merge both timesteps into a **single Datum**:
```
User: What is 2+2?
Assistant: <think>Let me calculate...</think> 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
```
Green = observation tokens (loss weight = 0). Red = action tokens (loss weight > 0).
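In code, the merged result is conceptually one token sequence with per-token loss weights. Here is a minimal sketch (strings stand in for token IDs; this is not the actual tinker `Datum` API):

```python
# Minimal sketch: the merged trajectory as one sequence of
# (segment, loss_weight) pairs. Observation segments carry weight 0.0;
# sampled action segments carry weight > 0.
merged = [
    ("User: What is 2+2?\n\nAssistant: ", 0.0),     # observation
    ("<think>Let me calculate...</think> 4", 1.0),  # action
    ("\n\nUser: What is 3+3?\n\nAssistant: ", 0.0), # appended observation
    ("<think>Let me calculate...</think> 6", 1.0),  # action
]

# Only action segments contribute to the policy-gradient loss:
trained = [seg for seg, weight in merged if weight > 0]
print(len(trained))  # 2
```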
## Example 2: Qwen3 with Thinking Hidden (Extension Breaks)
When using `Qwen3Renderer` with the default `strip_thinking_from_history=True`, the `<think>...</think>` blocks are stripped from previous assistant messages. This matches how Qwen3 models were post-trained by the Qwen team.
**Timestep 1:**
```
User: What is 2+2?
Assistant: <think>Let me calculate...</think> 4
User:
```
**Timestep 2:**
```
User: What is 2+2?
Assistant: 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
```
The observation at timestep 2 is **not** an extension of timestep 1's full sequence. The `Let me calculate...` portion was stripped, so the prefix doesn't match. The RL code must create **two separate Datums**:
**Datum 1:**
```
User: What is 2+2?
Assistant: <think>Let me calculate...</think> 4
User:
```
**Datum 2:**
```
User: What is 2+2?
Assistant: 4
User: What is 3+3?
Assistant: <think>Let me calculate...</think> 6
User:
```
This results in more compute during training (two forward/backward passes instead of one) and prevents KV-cache reuse during sampling. For a trajectory of T timesteps, compute scales as O(T²) instead of O(T).
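To see where the quadratic factor comes from, suppose each turn adds roughly the same number of tokens (the numbers below are purely illustrative):

```python
k = 100  # tokens added per turn (illustrative)
T = 20   # trajectory length in timesteps

# Extension holds: one merged Datum, every token processed once.
merged_tokens = k * T

# Extension broken: timestep t becomes its own Datum containing the
# full t-turn prefix, so prefixes are reprocessed over and over.
separate_tokens = sum(k * t for t in range(1, T + 1))  # ~ k * T**2 / 2

print(merged_tokens, separate_tokens)  # 2000 21000
```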
## The Tradeoff
**Keeping thinking visible** (`strip_thinking_from_history=False`) gives you O(T) compute scaling, allows packing sequences together in training batches, and enables KV-cache reuse during sampling. The downside is that context grows faster since all thinking tokens are retained, so you may hit context length limits sooner.
**Stripping thinking** (`strip_thinking_from_history=True`, the default) keeps context smaller but breaks the extension property, leading to O(T²) compute scaling.
Note that while stripping thinking matches Qwen3's original post-training distribution, with RL fine-tuning the model should quickly adapt to the new situation where thinking is preserved. So "distribution match" might not be a major concern in practice.
## How the RL Code Handles This
The RL training code in `data_processing.py` automatically detects whether consecutive timesteps satisfy the extension property. The key function is `trajectory_to_data`:
```python
def trajectory_to_data(traj: Trajectory, traj_advantage: float) -> list[tinker.Datum]:
    """
    Return one or more Datum objects corresponding to the trajectory.
    If the sequence grows by appending, i.e., each successive observation contains
    the previous observation+action as a prefix, then we can return a single Datum.
    However, if we get a sequence that's not an extension of the previous sequence,
    then that results in a new Datum.
    """
```
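The merging decision boils down to a token-level prefix check. A hypothetical sketch (`is_extension` is not a real function in the codebase):

```python
def is_extension(prev_sequence: list[int], new_observation: list[int]) -> bool:
    """Does the new observation contain the previous observation+action
    tokens as a prefix? If so, the timesteps can share one Datum."""
    return new_observation[: len(prev_sequence)] == prev_sequence

# Extension holds: the new observation appends to the old sequence.
print(is_extension([1, 2, 3], [1, 2, 3, 4, 5]))  # True
# Extension breaks: e.g. thinking tokens were stripped from history.
print(is_extension([1, 2, 3], [1, 3, 4, 5]))     # False
```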
When rendering your conversations, be aware of whether your renderer has the extension property. For `Qwen3Renderer`:
- `strip_thinking_from_history=False` → Extension holds
- `strip_thinking_from_history=True` (default) → Extension breaks
**Note on sampling:** The training code automatically merges timesteps when possible. Sampling infrastructure doesn't yet adjust billing based on KV-cache hits, but this is planned for a future release.
## Advanced: Periodic Compaction
A hybrid approach is to use **periodic compaction**: keep thinking visible most of the time (preserving extension), but periodically clear old thinking blocks from the context.
**How it works:**
- For turns 1-10, keep all thinking visible (extension holds, single datum)
- At turn 11, strip thinking from turns 1-10 (extension breaks once, new datum starts)
- For turns 11-20, keep thinking visible again (extension holds)
- Repeat every N turns
Here's what the datums look like with compaction every 3 turns:
**Datum 1 (turns 1-3):**
```
User: Q1
Assistant: <think>...</think> A1
User: Q2
Assistant: <think>...</think> A2
User: Q3
Assistant: <think>...</think> A3
User:
```
**Datum 2 (turns 4-6, thinking from turns 1-3 stripped):**
```
User: Q1
Assistant: A1
User: Q2
Assistant: A2
User: Q3
Assistant: A3
User: Q4
Assistant: <think>...</think> A4
User: Q5
Assistant: <think>...</think> A5
User: Q6
Assistant: <think>...</think> A6
User:
```
This approach breaks extension only every N timesteps instead of every timestep, keeps context size bounded (old thinking doesn't accumulate forever), and amortizes the recomputation cost over N turns.
To implement this, you would modify your environment or renderer to periodically transform the conversation history, stripping `<think>` blocks from messages older than N turns.
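As a rough sketch of such a transform (a hypothetical helper, assuming chat messages as role/content dicts; the real renderer hooks may look different):

```python
import re

def compact_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Strip <think>...</think> blocks from assistant messages older than
    the last `keep_last` messages. Hypothetical helper for illustration."""
    cutoff = len(messages) - keep_last
    out = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i < cutoff:
            content = re.sub(r"<think>.*?</think>\s*", "", msg["content"], flags=re.DOTALL)
            msg = {**msg, "content": content}
        out.append(msg)
    return out

history = [
    {"role": "assistant", "content": "<think>2+2</think> 4"},
    {"role": "user", "content": "What is 3+3?"},
    {"role": "assistant", "content": "<think>3+3</think> 6"},
]
print(compact_history(history, keep_last=1))
```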
## Summary
For `Qwen3Renderer`:
- `strip_thinking_from_history=False` → Extension holds → Use for long trajectories where compute efficiency matters
- `strip_thinking_from_history=True` (default) → Extension breaks → Use for short trajectories, or when you want minimal changes from base model behavior
- Periodic compaction → Best of both worlds when you need efficiency with bounded context
When designing your RL environment, consider how many turns you expect and whether the O(T) vs O(T²) difference will be significant for your use case.
---
## File: rl/rl-hyperparams.mdx
# RL Hyperparameters
This guide covers the key hyperparameters for reinforcement learning training, from core settings to advanced configurations.
## Core Hyperparameters
### Learning Rate
Similar to the [supervised learning setting](../supervised-learning/sl-hyperparams), the learning rate is the most critical hyperparameter choice. We recommend using the guidance presented there as a starting point for RL experiments as well.
### Batch and Group Sizes
As described in our [RL environments](../rl/rl-envs) documentation, we use two key parameters:
- **`batch_size`**: The number of unique environments or problems used for training
- **`group_size`**: The number of rollouts performed per unique environment
If you have limited environments or problems available for training, increase the `group_size` to generate more training data. While the total number of rollouts depends on both parameters, we recommend scaling the learning rate as $\text{LR} \propto \sqrt{\text{batch\_size}}$.
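For example, a small helper applying this square-root rule (the function name is illustrative):

```python
import math

def scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate as LR ∝ sqrt(batch_size)."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)

# Quadrupling the batch size from 128 to 512 doubles the learning rate:
print(scaled_lr(1e-5, 128, 512))  # 2e-05
```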
## Multiple Updates per Sampling Iteration
The `num_substeps` parameter controls how many policy weight updates are performed on data sampled from the last policy iteration, similar to PPO and GRPO.
### How it works:
- **`num_substeps = 1` (default)**: Each batch of collected trajectories is used for exactly one optimizer update
- **`num_substeps > 1`**: The batch of unique environments is split into `num_substeps` mini-batches, where each environment/problem has `group_size` rollouts (we pack all rollouts for a particular environment/problem in the same minibatch). We do a single update step on each mini-batch. Note that our implementation still takes only a single epoch through the data.
### Usage Guidelines:
- The batch size must be divisible by `num_substeps`
- Our experiments show that `num_substeps = 1` already gives decent performance, but if you would like to experiment with this parameter, we recommend starting with a low value of 2-4 and using the PPO objective.
- Higher values can lead to update steps that are too out-of-distribution for the policy. Consider limiting the number of updates or decreasing the learning rate when using multiple update steps.
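A sketch of the split described above (illustrative, not the actual implementation): environment groups are partitioned evenly, with each group's `group_size` rollouts kept together.

```python
def split_into_substeps(env_groups: list, num_substeps: int) -> list[list]:
    """Partition a batch of environment groups into mini-batches, one
    optimizer update per mini-batch. Each element of `env_groups` stands
    for one environment together with all of its group_size rollouts."""
    assert len(env_groups) % num_substeps == 0, "batch size must be divisible by num_substeps"
    per = len(env_groups) // num_substeps
    return [env_groups[i * per:(i + 1) * per] for i in range(num_substeps)]

print(split_into_substeps(["env0", "env1", "env2", "env3"], num_substeps=2))
# [['env0', 'env1'], ['env2', 'env3']]
```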
## Advanced Training Configurations
⚠️ **Note**: These features are experimental and may be subject to instabilities. They are currently disabled by default.
### Streaming Minibatch Training
Enable streaming minibatch training by specifying the `StreamMinibatchConfig`. This approach overlaps trajectory sampling and model training, improving overall throughput by submitting training requests as soon as enough rollouts complete, without waiting for all sampling jobs to finish.
**Configuration Parameters:**
- **`groups_per_batch`**: Same as batch size
- **`num_minibatches`**: Number of minibatches per substep. Controls how many individual forward-backward requests we submit, i.e., how the work is split across requests.
**Important**: This remains on-policy training and is strictly a pipeline efficiency improvement.
### Async Off-Policy Training
Async training allows the model to train on trajectories generated with slightly older model versions, enabling higher throughput at the cost of some off-policy bias. While Tinker doesn't currently support in-flight weight changes, it supports the "off-by-K" async RL approach where multiple model iterations generate data simultaneously. Configure this by setting the `AsyncConfig` object.
**Configuration Parameters:**
- **`max_steps_off_policy`**: Maximum age (in training steps) of trajectories before they're discarded. Essentially, trajectories from policy iterations older than `max_steps_off_policy` steps will not be used.
- **`groups_per_batch`**: Number of new trajectory groups to accumulate (with a `group_size` number of rollouts each) before updating the current iteration of the model. Note: This is separate from the batch size used for dataset construction.
**Usage Guidelines:**
- Async RL is appropriate for applications with long and heterogeneous rollouts, such as very long CoT models, multi-hop tool use, or agentic workflows
- Start with a small value for `max_steps_off_policy` (less than 5)
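The staleness rule can be pictured as a simple filter (the semantics below are our reading of `max_steps_off_policy`):

```python
def is_usable(trajectory_policy_step: int, current_step: int, max_steps_off_policy: int) -> bool:
    """Keep a trajectory only if its generating policy iteration is at most
    `max_steps_off_policy` steps behind the current learner."""
    return current_step - trajectory_policy_step <= max_steps_off_policy

print(is_usable(97, current_step=100, max_steps_off_policy=4))  # True
print(is_usable(90, current_step=100, max_steps_off_policy=4))  # False
```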
## Monitoring and Run Health
Using policy-gradient algorithms with off-policy data can significantly degrade performance or even crash the policy, making monitoring essential during training.
### KL Divergence Monitoring
The current implementation logs the KL divergence between the data generation policy and the current learner: $\mathbb{D}_{KL}[\pi_{\text{sampler}}(\cdot|x)||\pi_{\theta}(\cdot|x)]$ using two separate estimators ([Schulman 2020](http://joschu.net/blog/kl-approx.html)):
- `kl_sample_train_v1`
- `kl_sample_train_v2`
A few important notes to keep in mind:
- Even with full on-policy training, the divergence between sampling and learning policies will not be exactly zero ([He 2025](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)) due to implementation details
- In our experience training is stable with KL divergence below 0.01
- If KL divergence rises well above this level, it usually indicates numerical instability or another problem with the training run
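The two logged estimators plausibly correspond to the k1 and k2 estimators from the linked note (this mapping is our assumption). Per sampled token, given log-probs under each policy:

```python
def kl_v1(logp_sampler: float, logp_learner: float) -> float:
    # k1 estimator: unbiased but high-variance; can be negative per sample.
    return logp_sampler - logp_learner

def kl_v2(logp_sampler: float, logp_learner: float) -> float:
    # k2 estimator: slightly biased, lower variance, always non-negative.
    return 0.5 * (logp_sampler - logp_learner) ** 2

# Identical policies give zero divergence under both estimators:
print(kl_v1(-1.5, -1.5), kl_v2(-1.5, -1.5))  # 0.0 0.0
```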
---
## File: rl/rl-loops.mdx
import { CookbookLink } from '../../components/CookbookLink'
# Reinforcement Learning Training Loop
We've provided a simple RL training loop in `rl_loop.py`, which avoids using our environment classes and instead defines the data loading and rollouts in a more self-contained way. This is for people who like to write their own training loops or want to learn how things work under the hood. Our more performant implementation in `rl/train.py` does basically the same thing, but with some performance optimizations and additional features like periodic evals.
You can run the RL training loop using:
```bash
python -m tinker_cookbook.recipes.rl_loop
```
The default config should write the results to `/tmp/tinker-examples/rl-loop`. The experiment should be completed after 57 steps of training. You can plot the reward curve as follows:
```python
import pandas
import matplotlib.pyplot as plt
metrics_path = "/tmp/tinker-examples/rl-loop/metrics.jsonl"
df = pandas.read_json(metrics_path, lines=True)
plt.plot(df["reward/total"], label="reward/total")
plt.legend()
plt.show()
```
You should see a plot like this:

---
## File: rl/rl-envs.mdx
import { CookbookLink } from '../../components/CookbookLink'
# RL Environments
Here, we'll explain how to create your own RL environments and train on them. First, let's look at the basic classes, which can be found in `tinker_cookbook.rl.types`. As you can see, there's an `Env` interface, corresponding to an RL environment. To write an environment, you need to implement two methods: `initial_observation` and `step`.
```python
class Env:
    """
    Stateful environment that a single agent interacts with.
    Discard after running for one episode.
    """

    async def initial_observation(self) -> tuple[Observation, StopCondition]:
        raise NotImplementedError

    async def step(self, action: Action) -> StepResult:
        raise NotImplementedError
```
Note that this `Env` operates on *tokens*, rather than strings or messages. Why define it this way, when it's usually more natural to define the logic in terms of strings or messages? We've defined `Env` this way because this interface is what's needed by the *training* code, which needs to know the exact tokens that were sampled, and their logprobs.
We need to write two more small classes to use this environment in the RL training code. First, since the environment is discarded after a single episode, we need to be able to instantiate new environments in the training loop. We actually build a *group* of environments at a time, which enables multi-agent training or objectives that compare multiple samples (for example, a reward model that acts on a pair of samples).
```python
class EnvGroupBuilder:
    """
    Builds a group of environments.
    """

    async def make_envs(self) -> Sequence[Env]:
        raise NotImplementedError
```
This object creates a group of environments. Often it does the trivial thing of returning a list of copies of the same environment.
Finally, we need a dataset of these EnvGroupBuilders.
```python
class RLDataset:
    """
    Dataset of EnvGroupBuilders.
    """

    def get_batch(self, index: int) -> list[EnvGroupBuilder]:
        raise NotImplementedError
```
That's a lot of classes! But their combination gives us a lot of flexibility. In earlier RL interfaces (like OpenAI Gym), the dataset is implicitly part of the environment; this structure is more modular and gives us more control over data loading.
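To see how the pieces compose, here is a toy end-to-end wiring with simplified stand-ins (strings instead of token types; none of these concrete classes exist in the cookbook):

```python
import asyncio

class ConstantEnv:
    """Single-episode env: reward 1.0 iff the action matches a target."""
    def __init__(self, target: str):
        self.target = target
    async def initial_observation(self):
        return ("prompt-tokens", "stop-condition")
    async def step(self, action: str):
        return {"reward": float(action == self.target), "episode_done": True}

class ConstantEnvGroupBuilder:
    """Trivial group builder: group_size copies of the same env."""
    def __init__(self, target: str, group_size: int):
        self.target, self.group_size = target, group_size
    async def make_envs(self):
        return [ConstantEnv(self.target) for _ in range(self.group_size)]

class ListRLDataset:
    """Dataset of group builders, served in fixed-size batches."""
    def __init__(self, builders: list, batch_size: int):
        self.builders, self.batch_size = builders, batch_size
    def get_batch(self, index: int) -> list:
        start = index * self.batch_size
        return self.builders[start:start + self.batch_size]

async def demo() -> int:
    dataset = ListRLDataset(
        [ConstantEnvGroupBuilder(t, group_size=4) for t in ("cat", "dog")],
        batch_size=2,
    )
    envs = await dataset.get_batch(0)[0].make_envs()
    return len(envs)

print(asyncio.run(demo()))  # 4
```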
## Building a simple example
You can find an example of writing a new RL environment in the Twenty Questions directory.
Here, we define a multi-step environment, where we're training a question-asking agent, which asks questions to another agent to guess a hidden word.
In this case, the answerer model is fixed and is Llama-3.1-8B-Instruct.
The player model (which we fine-tune) is also based on that same model.
You can run the training script as follows:
```bash
python -m tinker_cookbook.recipes.twenty_questions.train
```
---
## File: supervised-learning/sl-hyperparams.mdx
# Supervised Learning Hyperparameters
Successful LLM fine-tuning requires careful hyperparameter tuning. While the most accurate approach is to sweep over a range of values for each hyperparameter and select those that minimize loss or maximize eval performance, this is often time-consuming and expensive. This guide provides some starting recommendations for the most important hyperparameters.
## Learning rate
The most important hyperparameter is generally the learning rate (LR). Our current best estimate of optimal LR for a model $m$ is the following:
$$ LR(m) = lr_{base} \cdot M_{LoRA} \cdot \Big(\frac{2000}{H_m}\Big)^{P_m} $$
where $lr_{base}$ is a constant base LR, $M_{LoRA}$ is a multiplier applied when using LoRA (1 if using full-finetuning), $H_m$ is the hidden size of the model $m$, and $P_m$ is a model-specific exponent adjustment. Importantly, this function is independent of the LoRA rank.
Our current best estimates are: $lr_{base} = 5 \times 10^{-5}$, $M_{LoRA} = 10$, $P_m = 0.0775$ for Qwen models, and $P_m = 0.781$ for Llama models.
### Getting the recommended learning rate
You can use the following function to get the recommended LR for any Llama or Qwen model:
```python
from tinker_cookbook.hyperparam_utils import get_lr
model_name = "meta-llama/Llama-3.2-1B"
recommended_lr = get_lr(model_name)
print(f"Recommended LR: {recommended_lr}")
```
### Validation
We validated this formula across diverse supervised fine-tuning experiments, varying datasets, dataset sizes, batch sizes, and LoRA ranks.
Using our LR estimates resulted in \<0.5% regret compared to exhaustive hyperparameter sweeps, where the regret of using a given $lr'$ is defined as:
$$\mathrm{regret}(lr') = \frac{loss(lr') - \min_{lr} loss(lr)}{\min_{lr} loss(lr)}$$
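A quick numeric illustration of this definition, using made-up losses:

```python
# Made-up final losses from a hypothetical LR sweep.
losses = {1e-5: 1.82, 3e-5: 1.76, 1e-4: 1.71, 3e-4: 1.70, 1e-3: 1.74}
best = min(losses.values())

def regret(lr: float) -> float:
    return (losses[lr] - best) / best

# 3e-4 attains the minimum, so its regret is zero; 1e-4 is ~0.6% worse.
print(f"{regret(3e-4):.4f} {regret(1e-4):.4f}")  # 0.0000 0.0059
```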
## Batch size
Batch size is the second-most important hyperparameter; it significantly affects both training efficiency and final performance.
For small batch sizes, there's a phenomenon of *perfect scaling*, where the LR and batch size should be varied together as $LR \propto \sqrt{B}$, and the learning curve only depends on $\frac{LR}{\sqrt{B}}$. See [Shallue et al. (2018)](https://arxiv.org/abs/1811.03600) for an example in the training-from-scratch setting.
When fine-tuning LLMs, we're often in a regime where smaller batch sizes give better performance, at the cost of longer training time; moreover, the $LR \propto \sqrt{B}$ scaling doesn't always hold. When doing SL fine-tuning, we recommend using smaller batch sizes like 128, depending on your tolerance for longer training time.
For best results, aim for at least 100 steps of training (you'll usually get the best results with 1000 or more).
⚠️ Note: Our batch size recommendations are based on preliminary findings and ongoing research. We're not confident about them!
---
## File: supervised-learning/sl-basic.mdx
import { CookbookLink } from '../../components/CookbookLink'
# Basic Supervised Learning
This guide walks you through running your first supervised learning experiment using Tinker's built-in training loop.
## Quick start
We've provided an implementation of supervised learning in `train_cli.py`. To use this training loop, you'll need to create a `Config` object with the data and parameters.
We've provided a ready-to-run example that fine-tunes Llama-3.1-8B on a small instruction-following dataset in `sl_basic.py`. You can run it from the command line as follows:
```bash
python -m tinker_cookbook.recipes.sl_basic
```
This script fine-tunes the base (pretrained) model on a small dataset called [NoRobots](https://huggingface.co/datasets/HuggingFaceH4/no_robots), created by Hugging Face.
### What you'll see during training
- Each step you should see a printout of the train and test loss, along with other stats like timing.
- The training script will also print out what the data looks like, with predicted tokens (weight=1) in green and context tokens (weight=0) in yellow.
- The training script will write various logs and checkpoint info to the `log_path` directory, which is set to `/tmp/tinker-examples/sl_basic` in the example script.
### Understanding the output files
Looking at the `log_path` directory, you will find several files of interest:
- `metrics.jsonl`: the training metrics that also were printed to the console. You can load and plot them like this:
```python
import pandas
import matplotlib.pyplot as plt
df = pandas.read_json("/tmp/tinker-examples/sl_basic/metrics.jsonl", lines=True)
plt.plot(df['train_mean_nll'], label='train_loss')
plt.plot(df['test/nll'].dropna(), label='test_loss')
plt.legend()
plt.show()
```
You should see a plot like this:

- `checkpoints.jsonl`: the checkpoints that were saved during training. Recall from [Saving and Loading](/save-load) that there are (currently) two kinds of checkpoints: one that has "/sampler_weights/" in the path (used for sampling), and the other that has "/weights/" in the path (includes full optimizer state, used for resuming training). If you interrupt the training script, then run it again, it will ask you if you want to resume training. If you choose to do so, it'll load the last (full state) checkpoint from this file.
- `config.json`: the configuration that you used for training.
In the `sl_basic` script, you'll see that there's also some disabled code (under `if 0:`) that shows how to use your own dataset, specified as a JSONL file in the format of `conversations.jsonl`.
---
## File: supervised-learning/prompt-distillation.mdx
import { CookbookLink } from '../../components/CookbookLink'
# Prompt Distillation
Prompt distillation is a training technique in which a model is optimized to behave as though it had been provided with a long and complex prompt, without requiring access to that prompt during inference.
At a high level, this procedure involves two main steps:
- **Creation of distillation data**: A teacher prompt, which is typically lengthy and highly detailed, provides explicit, step-by-step instructions. A teacher model uses this prompt to generate responses for a set of queries.
- **Training the student model**: A student model is then trained (or fine-tuned) on the distilled dataset, thereby learning to reproduce the essential behaviors and reasoning encoded in the teacher’s instructions.
---
## Overview
Let $f_T$ and $f_S$ denote the teacher and student models, respectively. Given an instruction prompt $P$ and a query $q_i$, the teacher model generates a response $r_i$:
$$
r_i = f_T([P, q_i])
$$
Here, the prompt $P$ and the query $q_i$ are concatenated to form the input to the teacher model $f_T$. For a dataset of queries $Q = \{q_i \mid 1 \leq i \leq D\}$, we obtain a corresponding set of teacher responses $R = \{r_i \mid 1 \leq i \leq D\}$.
The distillation training dataset is defined as the set of query–response pairs (excluding the original prompt):
$$
T = \{(q_i, r_i) \mid 1 \leq i \leq D\}.
$$
The student model $f_S$ is then trained to minimize the cross-entropy loss:
$$
\ell(f_S(q_i), r_i) = \ell(f_S(q_i), f_T([P, q_i])).
$$
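The two steps can be sketched as follows, with a stubbed `teacher` function standing in for sampling from $f_T$ (everything here is illustrative):

```python
def teacher(prompt: str, query: str) -> str:
    # Stand-in for f_T([P, q_i]); a real teacher samples from a model
    # that sees the full instruction prompt.
    return f"label-for:{query}"

P = "You are a careful classifier. <long, detailed instructions>"
queries = ["Bonjour le monde", "Hello world"]

# Step 1: the teacher generates responses with P in context.
responses = [teacher(P, q) for q in queries]

# Step 2: the student trains on (q_i, r_i) pairs with P dropped,
# minimizing cross-entropy of r_i given q_i alone.
train_set = list(zip(queries, responses))
print(train_set[0])  # ('Bonjour le monde', 'label-for:Bonjour le monde')
```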
---
## Example
The Tinker Cookbook provides a prompt distillation recipe tailored for a language classification task. The objective is straightforward: given a text query, the model should predict a two-character code corresponding to the language of the input. The set of possible labels is:
```
ar (Arabic), de (German), el (Greek), en (English), es (Spanish), fr (French), hi (Hindi), ru (Russian), tr (Turkish), ur (Urdu), vi (Vietnamese), zh (Chinese - Simplified), ot (Other/Unknown).
```
The recipe in `recipes/prompt_distillation/create_data.py` also includes handling strategies for inputs containing code, numerical content, or multiple languages.
In the example below, the same model (`Qwen/Qwen3-30B-A3B`) is used as both teacher and student, though in general they need not be identical.
---
### Step 1: Generate Training Data
Create prompt distillation data with the teacher model using `recipes/prompt_distillation/create_data.py`:
```bash
python -m tinker_cookbook.recipes.prompt_distillation.create_data \
    output_file=/tmp/tinker-datasets/prompt_distillation_lang.jsonl
```
This command will:
- Use the configured teacher model to generate language classification examples
- Save the distilled dataset to the specified output file
- Create diverse training examples suitable for student model fine-tuning
### Step 2: Train the Student Model
Fine-tune a student model on the distillation data using `recipes/prompt_distillation/train.py`:
```bash
python -m tinker_cookbook.recipes.prompt_distillation.train
```
The training script will:
- Load the generated distillation dataset
- Apply optimized training configurations
- Fine-tune the student model for language classification
### Step 3: Test Your Model
Once training is complete, you can test your distilled model by sampling from the trained model to verify its performance on language classification tasks.
## Advanced Configuration
The prompt distillation recipe can be customized for different scenarios:
- **Teacher model selection**: Choose different base models based on your requirements
- **Sampling strategies**: Adjust temperature and other generation parameters
- **Data volume**: Scale the number of generated examples based on your needs
- **Training hyperparameters**: Fine-tune learning rates and other training settings
---
## File: supervised-learning/sweep-case-study.mdx
import { CookbookLink } from '../../components/CookbookLink'
# Sweep case study
In [Supervised Learning Hyperparameters](./sl-hyperparams), we introduced default hyperparameters as a starting point. While defaults are useful, optimal values are often task-specific. A hyperparameter sweep---systematically testing values across a range---is a more reliable way to identify the best settings for your use case.
This guide demonstrates how to sweep over the **learning rate (LR)** to find an optimal value.
## Why sweep the learning rate?
The learning rate is typically the most impactful hyperparameter. While our default recommendations perform well (usually \<0.5% regret), you can often achieve even better results by sweeping to find the task-specific optimum.
## Setup
We use the simple supervised learning training loop in `sl_loop.py`, which trains a Llama-3.1-8B model.
To retrieve the model’s default learning rate recommendation:
```python
from tinker_cookbook.hyperparam_utils import get_lr
print(get_lr("meta-llama/Llama-3.1-8B"))
```
This should output:
```
0.0002856415043086949 # ≈ 2.8e-4
```
This default value provides a baseline. A common best practice is to sweep one order of magnitude above and below the default. For this case, we sweep over: $LR \in [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3]$
## Running the sweep
Launch experiments in parallel, using separate terminal windows for each LR value. For example:
```bash
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.003 log_path=/tmp/sft-lr-sweep/lr-0.003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.001 log_path=/tmp/sft-lr-sweep/lr-0.001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0003 log_path=/tmp/sft-lr-sweep/lr-0.0003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.0001 log_path=/tmp/sft-lr-sweep/lr-0.0001
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00003 log_path=/tmp/sft-lr-sweep/lr-0.00003
python -m tinker_cookbook.recipes.sl_loop learning_rate=0.00001 log_path=/tmp/sft-lr-sweep/lr-0.00001
```
You can also automate this process by writing a script that spawns multiple tmux windows and launches experiments programmatically. This is especially useful for larger sweeps.
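For instance, a minimal launcher sketch (a dry run: it only prints the commands; drop the `echo` or pipe each line into tmux to actually launch):

```shell
# Print one launch command per learning rate (dry run).
for lr in 0.003 0.001 0.0003 0.0001 0.00003 0.00001; do
  echo python -m tinker_cookbook.recipes.sl_loop \
    learning_rate="$lr" log_path=/tmp/sft-lr-sweep/lr-"$lr"
done
```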
## Collecting Results
After the experiments are complete, you can read the `metrics.jsonl` files:
```python
from glob import glob
import pandas
import os
import json

data = []
for fname in sorted(glob(os.path.expanduser("/tmp/sft-lr-sweep/*/metrics.jsonl"))):
    df = pandas.read_json(fname, lines=True)
    # make sure the experiment is completed
    if len(df) == 0 or df["progress"].iloc[-1] < 0.98:
        continue
    config_fname = fname.replace("metrics.jsonl", "config.json")
    with open(config_fname, "rb") as f:
        metadata = json.load(f)
    data.append({
        "fname": fname,
        "learning_rate": metadata["learning_rate"],
        "final_loss": df["train_mean_nll"].iloc[-1].item(),
    })
print(f"Read metrics for {len(data)} experiments")
```
If all the experiments are completed, the above code should print:
```
Read metrics for 6 experiments
```
## Visualizing the Sweep
Plot the `final_loss` as a function of `learning_rate`:
```python
import matplotlib.pyplot as plt
df = pandas.DataFrame(data)
plt.plot(df["learning_rate"], df["final_loss"], marker='o')
plt.axhline(y=df["final_loss"].min(), color="green", linestyle="--")
plt.ylim(1.65, 1.8)
plt.xscale("log")
plt.xlabel("Learning Rate (log scale)")
plt.ylabel("Final Loss")
plt.title("Final Loss vs Learning Rate")
plt.show()
```
You should see a U-shaped curve, similar to this:

If the full U-curve is not visible in your setting, expand the sweep range by adding more LR values.
## Determining the Optimal LR
The optimal learning rate is the one that minimizes the loss. The plot above shows that the optimal LR is `3e-4`, which you can also compute by finding the minimum:
```python
optimal_lr = df["learning_rate"][df["final_loss"].idxmin()]
print(f"The optimal LR is {optimal_lr:.2e}")
```
Expected output:
```
The optimal LR is 3.00e-04
```
Note that the optimal LR in our sweep (`3e-4`) is very close to the default LR (`2.8e-4`). However, task-specific sweeps can still provide marginal improvements and greater confidence in your hyperparameter choices.
## Next steps
Now that you've identified the optimal learning rate:
1. Retrain with the optimal LR for your production run
2. Consider sweeping other hyperparameters like batch size, warmup steps, or weight decay
3. Use the optimal LR as a baseline for future experiments on similar tasks
---
## File: supervised-learning/sl-loop.mdx
import { CookbookLink } from '../../components/CookbookLink'
# Supervised Learning Training Loop
We've provided a simple SL training loop in `sl_loop.py`, which avoids using our dataset classes and instead defines the data loading in a more self-contained way. This is for people who like to write their own training loops or want to learn how things work under the hood. Our more performant implementation in `supervised/train.py` does basically the same thing, but with some performance optimizations and additional features like periodic evals.
---
## File: compatible-apis/openai.mdx
# OpenAI API Compatible Inference (in beta)
OpenAI-compatible inference lets you interact with any model checkpoint in Tinker, using an endpoint compatible with the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/chat). It’s designed to let you easily “poke at” your model while you're training it.
For inference within your training runs (e.g. RL), we recommend using Tinker’s standard [sampling client](/training-sampling).
Currently, OpenAI-compatible inference is meant for testing and internal use with low internal traffic, rather than large, high-throughput, user-facing deployments. Latency and throughput may vary by model and may change without notice during the beta. If you need higher or more stable throughput, contact the Tinker team in [our Discord](https://discord.gg/KqqEZNX88c) for guidance on larger-scale setups.
## Use Cases
OpenAI-compatible inference is designed for:
- **Fast feedback while training**: Start sampling very quickly from any sampler checkpoint obtained during training.
- **Sampling while training continues**: Sample even while the training job is still running on that experiment.
- **Developer & internal workflows**: Intended for testing, evaluation, and internal tools.
We will release production-grade inference soon and will update our users then.
## Using OpenAI compatible inference from an OpenAI client
The new interface exposes an OpenAI-compatible HTTP API. You can use any OpenAI SDK or HTTP client that lets you override the base URL.
1\. Set the base URL of your OpenAI-compatible client to:
```
https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1
```
2\. Use a Tinker sampler weight path as the model name. For example:
```
tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080
```
Any valid Tinker sampler checkpoint path works here. You can keep training and sample from the same checkpoint simultaneously.
3\. Authenticate with your Tinker API key, by passing the same key used for Tinker as the API key to the OpenAI client.
**Note:** We support both `/completions` and `/chat/completions` endpoints. Chat requests are rendered with the model’s default Hugging Face chat template; if your checkpoint expects a different renderer, render the prompt yourself (see [Rendering](/rendering)) and use `/completions`.
## Code Example
```py
from os import getenv
from openai import OpenAI
BASE_URL = "https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1"
MODEL_PATH = "tinker://0034d8c9-0a88-52a9-b2b7-bce7cb1e6fef:train:0/sampler_weights/000080"
api_key = getenv("TINKER_API_KEY")
client = OpenAI(
base_url=BASE_URL,
api_key=api_key,
)
response = client.completions.create(
model=MODEL_PATH,
prompt="The capital of France is",
max_tokens=50,
temperature=0.7,
top_p=0.9,
)
print(response.choices[0].text)
```
Notes:
* `BASE_URL` points to the OpenAI-compatible inference endpoint.
* `MODEL_PATH` is a Tinker sampler checkpoint path (the `tinker://` URI shown above).
* The rest of the arguments (`prompt`, `max_tokens`, `temperature`, `top_p`) behave like they do in the OpenAI Completions API.
* You can swap `MODEL_PATH` to any other sampler checkpoint to compare runs quickly in your evals or notebooks.
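The note above suggests rendering chat prompts yourself and using `/completions` when your checkpoint expects a renderer other than the default Hugging Face chat template. A minimal sketch of hand-rendering is below; the ChatML-style tags are an illustrative assumption, so substitute the renderer your checkpoint was actually trained with (see [Rendering](/rendering)):

```python
# Hand-render a chat conversation into a single prompt string for /completions.
# The ChatML-style tags below are an assumption for illustration; use the
# renderer your checkpoint was trained with.
def render_chatml(messages: list[dict[str, str]]) -> str:
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)

prompt = render_chatml([
    {"role": "user", "content": "What is 2+2?"},
])
# Pass `prompt` to client.completions.create(...) instead of calling
# /chat/completions, so you control the exact token-level formatting.
```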
## Related docs
* [Getting a `TINKER_API_KEY`](/install)
* [Security and Privacy](https://thinkingmachines.ai/legal/terms/)
* [Training and Sampling](/training-sampling)
---
# PART 2: TYPE DEFINITIONS
Total types collected: 30
## Type: AdamParams
```python
class AdamParams(StrictBase):
learning_rate: float = 0.0001
"""Learning rate for the optimizer"""
beta1: float = 0.9
"""Coefficient used for computing running averages of gradient"""
beta2: float = 0.95
"""Coefficient used for computing running averages of gradient square"""
eps: float = 1e-12
"""Term added to the denominator to improve numerical stability"""
```
## Type: CreateModelResponse
```python
class CreateModelResponse(BaseModel):
model_id: ModelID
type: Literal["create_model"] = "create_model"
```
## Type: Datum
```python
class Datum(StrictBase):
loss_fn_inputs: LossFnInputs
"""Dictionary mapping field names to tensor data"""
model_input: ModelInput
@model_validator(mode="before")
@classmethod
def convert_tensors(cls, data: Any) -> Any:
"""Convert torch.Tensor and numpy arrays to TensorData in loss_fn_inputs during construction."""
if isinstance(data, dict) and "loss_fn_inputs" in data:
loss_fn_inputs = data["loss_fn_inputs"]
if isinstance(loss_fn_inputs, dict):
converted_inputs = {}
for key, value in loss_fn_inputs.items():
converted_inputs[key] = cls._maybe_convert_array(key, value)
data = dict(data) # Make a copy
data["loss_fn_inputs"] = converted_inputs
return data
@classmethod
def _maybe_convert_array(cls, key: str, value: Any) -> Any:
"""Convert torch.Tensor, numpy array, or 1-D list to TensorData if needed."""
if _HAVE_TORCH and isinstance(value, torch.Tensor):
return TensorData.from_torch(value)
elif isinstance(value, np.ndarray):
return TensorData.from_numpy(value)
elif isinstance(value, list):
# assume it's 1d and infer the dtype from the key
return TensorData(data=value, dtype=_key_to_type[key], shape=[len(value)])
else:
return value
_key_to_type = {
"target_tokens": "int64",
"weights": "float32",
"advantages": "float32",
"logprobs": "float32",
"clip_low_threshold": "float32",
"clip_high_threshold": "float32",
}
```
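Because of the `model_validator` above, `loss_fn_inputs` can be passed as plain Python lists for the known keys, and each list is wrapped into a `TensorData` with the dtype from `_key_to_type`. The sketch below (plain lists, no SDK required; the token IDs are made up) shows the typical layout for `cross_entropy`, where weights of `0.0` mask observation tokens and positive weights mark action tokens:

```python
# Hypothetical token IDs for a prompt and its completion.
prompt_tokens = [101, 2054, 2003]   # observation tokens (loss weight 0)
completion_tokens = [1018, 102]     # action tokens (loss weight > 0)

target_tokens = prompt_tokens + completion_tokens
weights = [0.0] * len(prompt_tokens) + [1.0] * len(completion_tokens)

# Datum's validator would wrap these lists as TensorData, inferring
# dtype "int64" for "target_tokens" and "float32" for "weights".
loss_fn_inputs = {"target_tokens": target_tokens, "weights": weights}
```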
## Type: EncodedTextChunk
```python
class EncodedTextChunk(StrictBase):
tokens: Sequence[int]
"""Array of token IDs"""
type: Literal["encoded_text"] = "encoded_text"
@property
def length(self) -> int:
return len(self.tokens)
```
## Type: ForwardBackwardInput
```python
class ForwardBackwardInput(StrictBase):
data: List[Datum]
"""Array of input data for the forward/backward pass"""
loss_fn: LossFnType
"""Fully qualified function path for the loss function"""
loss_fn_config: Optional[Dict[str, float]] = None
"""Optional configuration parameters for the loss function (e.g., PPO clip thresholds, DPO beta)"""
```
## Type: ForwardBackwardOutput
```python
class ForwardBackwardOutput(BaseModel):
loss_fn_output_type: str
"""The type of the ForwardBackward output. Can be one of [...] TODO"""
loss_fn_outputs: List[LossFnOutput]
"""Dictionary mapping field names to tensor data"""
metrics: Dict[str, float]
"""Training metrics as key-value pairs"""
```
## Type: GetInfoResponse
```python
class GetInfoResponse(BaseModel):
type: Optional[Literal["get_info"]] = None
model_data: ModelData
model_id: ModelID
is_lora: Optional[bool] = None
lora_rank: Optional[int] = None
model_name: Optional[str] = None
if PYDANTIC_V2:
# allow fields with a `model_` prefix
model_config = ConfigDict(protected_namespaces=tuple())
```
## Type: GetServerCapabilitiesResponse
```python
class GetServerCapabilitiesResponse(BaseModel):
supported_models: List[SupportedModel]
```
## Type: ImageAssetPointerChunk
```python
class ImageAssetPointerChunk(StrictBase):
format: Literal["png", "jpeg"]
"""Image format"""
location: str
"""Path or URL to the image asset"""
expected_tokens: int | None = None
"""Expected number of tokens this image represents.
This is only advisory: the tinker backend will compute the number of tokens
from the image, and requests can fail quickly if the computed count does not
match expected_tokens."""
type: Literal["image_asset_pointer"] = "image_asset_pointer"
@property
def length(self) -> int:
if self.expected_tokens is None:
raise ValueError("ImageAssetPointerChunk expected_tokens needs to be set in order to compute the length")
return self.expected_tokens
```
## Type: ImageChunk
```python
class ImageChunk(StrictBase):
data: bytes
"""Image data as bytes"""
format: Literal["png", "jpeg"]
"""Image format"""
expected_tokens: int | None = None
"""Expected number of tokens this image represents.
This is only advisory: the tinker backend will compute the number of tokens
from the image, and requests can fail quickly if the computed count does not
match expected_tokens."""
type: Literal["image"] = "image"
@field_validator("data", mode="before")
@classmethod
def validate_data(cls, value: Union[bytes, str]) -> bytes:
"""Deserialize base64 string to bytes if needed."""
if isinstance(value, str):
return base64.b64decode(value)
return value
@field_serializer("data")
def serialize_data(self, value: bytes) -> str:
"""Serialize bytes to base64 string for JSON."""
return base64.b64encode(value).decode("utf-8")
@property
def length(self) -> int:
if self.expected_tokens is None:
raise ValueError("ImageChunk expected_tokens needs to be set in order to compute the length")
return self.expected_tokens
```
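The `data` field round-trips through base64 for JSON transport: the field validator decodes an incoming base64 string to bytes, and the serializer encodes bytes back to a string. The same transformation can be sketched with the standard library alone:

```python
import base64

# Mirror of ImageChunk's data handling: JSON carries the image as a base64
# string; validate_data decodes it to bytes, serialize_data encodes it back.
raw = b"\x89PNG\r\n\x1a\n"                    # first bytes of a PNG header
wire = base64.b64encode(raw).decode("utf-8")  # what appears in the JSON payload
decoded = base64.b64decode(wire)              # what validate_data produces
```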
## Type: LoadWeightsResponse
```python
class LoadWeightsResponse(BaseModel):
path: Optional[str] = None
"""A tinker URI for model weights at a specific step"""
type: Optional[Literal["load_weights"]] = None
```
## Type: LoraConfig
```python
class LoraConfig(StrictBase):
rank: int
"""LoRA rank (dimension of low-rank matrices)"""
seed: Optional[int] = None
"""Seed used for initialization of LoRA weights.
Useful if you need deterministic or reproducible initialization of weights.
"""
train_unembed: bool = True
"""Whether to apply LoRA to the unembedding layer"""
train_mlp: bool = True
"""Whether to apply LoRA to the MLP layers (including MoE layers)"""
train_attn: bool = True
"""Whether to apply LoRA to the attention layers"""
```
## Type: LossFnInputs
```python
LossFnInputs: TypeAlias = Dict[str, TensorData]
```
## Type: LossFnOutput
```python
LossFnOutput: TypeAlias = Dict[str, TensorData]
```
## Type: LossFnType
```python
LossFnType: TypeAlias = Literal["cross_entropy", "importance_sampling", "ppo", "cispo", "dro"]
```
## Type: ModelData
```python
class ModelData(BaseModel):
arch: Optional[str] = None
model_name: Optional[str] = None
tokenizer_id: Optional[str] = None
```
## Type: ModelID
```python
ModelID: TypeAlias = str
```
## Type: ModelInput
```python
class ModelInput(StrictBase):
chunks: List[ModelInputChunk]
"""Sequence of input chunks (formerly TokenSequence)"""
@classmethod
def from_ints(cls, tokens: List[int]) -> "ModelInput":
"""
Create a ModelInput from a list of ints (tokens).
"""
return cls(chunks=[EncodedTextChunk(tokens=tokens)])
def to_ints(self) -> List[int]:
"""
Convert the ModelInput to a list of ints (tokens)
Raises ValueError if any chunk is not an EncodedTextChunk.
"""
if not all(isinstance(chunk, EncodedTextChunk) for chunk in self.chunks):
raise ValueError(f"to_ints only supported for ModelInput with EncodedTextChunks, got {[type(chunk) for chunk in self.chunks]}")
return [token for chunk in self.chunks for token in chunk.tokens]
@property
def length(self) -> int:
"""
Return the total context length used by this ModelInput.
"""
return sum(chunk.length for chunk in self.chunks)
@classmethod
def empty(cls) -> "ModelInput":
"""
Create an empty ModelInput.
"""
return cls(chunks=[])
def append(self, chunk: ModelInputChunk) -> "ModelInput":
"""
Add a new chunk, return a new ModelInput.
"""
return ModelInput(chunks=self.chunks + [chunk])
def append_int(self, token: int) -> "ModelInput":
"""
Add a new token, return a new ModelInput.
"""
return self.append(EncodedTextChunk(tokens=[token]))
```
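The helpers above are simple transformations over chunks: `to_ints` concatenates the per-chunk token lists and `length` sums their lengths, while `append`/`append_int` return a new `ModelInput` rather than mutating in place. A standalone sketch of the flattening behavior with plain lists (not the SDK class):

```python
# Three hypothetical EncodedTextChunk token lists.
chunks = [[1, 2, 3], [4], [5, 6]]

flattened = [token for chunk in chunks for token in chunk]  # like to_ints()
total_length = sum(len(chunk) for chunk in chunks)          # like .length
```

Because `append` returns a new object, a shared prompt prefix can be reused across many `ModelInput`s without copying concerns.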
## Type: ModelInputChunk
```python
ModelInputChunk: TypeAlias = Annotated[
Union[EncodedTextChunk, ImageAssetPointerChunk, ImageChunk], PropertyInfo(discriminator="type")
]
```
## Type: OptimStepResponse
```python
class OptimStepResponse(BaseModel):
metrics: Optional[Dict[str, float]] = None
"""Optimization step metrics as key-value pairs"""
```
## Type: SampleResponse
```python
class SampleResponse(BaseModel):
sequences: Sequence[SampledSequence]
type: Literal["sample"] = "sample"
prompt_logprobs: Optional[List[Optional[float]]] = None
"""
If prompt_logprobs was set to true in the request, logprobs are computed for
every token in the prompt. The `prompt_logprobs` response contains a float32
value for every token in the prompt.
"""
topk_prompt_logprobs: Optional[list[Optional[list[tuple[int, float]]]]] = None
"""
If topk_prompt_logprobs was set to a positive integer k in the request,
the top-k logprobs are computed for every token in the prompt. The
`topk_prompt_logprobs` response contains, for every token in the prompt,
a list of up to k (token_id, logprob) tuples.
"""
```
## Type: SampledSequence
```python
class SampledSequence(BaseModel):
stop_reason: StopReason
"""Reason why sampling stopped"""
tokens: List[int]
"""List of generated token IDs"""
logprobs: Optional[List[float]] = None
"""Log probabilities for each token (optional)"""
```
## Type: SamplingParams
```python
class SamplingParams(BaseModel):
max_tokens: Optional[int] = None
"""Maximum number of tokens to generate"""
seed: Optional[int] = None
"""Random seed for reproducible generation"""
stop: Union[str, Sequence[str], Sequence[int], None] = None
"""Stop sequences for generation"""
temperature: float = 1
"""Sampling temperature"""
top_k: int = -1
"""Top-k sampling parameter (-1 for no limit)"""
top_p: float = 1
"""Nucleus sampling probability"""
```
## Type: SaveWeightsForSamplerResponse
```python
class SaveWeightsForSamplerResponse(BaseModel):
path: str
"""A tinker URI for model weights for sampling at a specific step"""
type: Optional[Literal["save_weights_for_sampler"]] = None
```
## Type: SaveWeightsResponse
```python
class SaveWeightsResponse(BaseModel):
path: str
"""A tinker URI for model weights at a specific step"""
type: Optional[Literal["save_weights"]] = None
```
## Type: StopReason
```python
StopReason: TypeAlias = Literal["length", "stop"]
```
## Type: SupportedModel
```python
class SupportedModel(BaseModel):
model_name: Optional[str] = None
```
## Type: TensorData
```python
class TensorData(StrictBase):
data: Union[List[int], List[float]]
"""Flattened tensor data as array of numbers."""
dtype: TensorDtype
shape: Optional[List[int]] = None
"""Optional.
The shape of the tensor (see PyTorch tensor.shape); the shape of a
one-dimensional list of length N is `(N,)`. If not provided, the shape is
inferred, generally as a 1D tensor.
"""
@classmethod
def from_numpy(cls, array: npt.NDArray[Any]) -> "TensorData":
return cls(
data=array.flatten().tolist(),
dtype=_convert_numpy_dtype_to_tensor(array.dtype),
shape=list(array.shape),
)
@classmethod
def from_torch(cls, tensor: "torch.Tensor") -> "TensorData":
return cls(
data=tensor.flatten().tolist(),
dtype=_convert_torch_dtype_to_tensor(tensor.dtype),
shape=list(tensor.shape),
)
def to_numpy(self) -> npt.NDArray[Any]:
"""Convert TensorData to numpy array."""
numpy_dtype = _convert_tensor_dtype_to_numpy(self.dtype)
arr = np.array(self.data, dtype=numpy_dtype)
if self.shape is not None:
arr = arr.reshape(self.shape)
return arr
def to_torch(self) -> "torch.Tensor":
"""Convert TensorData to torch tensor."""
if not _HAVE_TORCH:
raise ImportError("PyTorch is not installed. Cannot convert to torch tensor.")
torch_dtype = _convert_tensor_dtype_to_torch(self.dtype)
tensor = torch.tensor(self.data, dtype=torch_dtype)
if self.shape is not None:
tensor = tensor.reshape(self.shape)
return tensor
def tolist(self) -> List[Any]:
return self.to_numpy().tolist()
def _convert_tensor_dtype_to_numpy(dtype: TensorDtype) -> npt.DTypeLike:
"""Convert TensorDtype to numpy dtype-like."""
if dtype == "float32":
return np.float32
elif dtype == "int64":
return np.int64
else:
raise ValueError(f"Unsupported TensorDtype: {dtype}")
def _convert_tensor_dtype_to_torch(dtype: TensorDtype) -> "torch.dtype":
"""Convert TensorDtype to torch dtype."""
if not _HAVE_TORCH:
raise ImportError("PyTorch is not installed. Cannot convert to torch dtype.")
import torch
if dtype == "float32":
return torch.float32
elif dtype == "int64":
return torch.int64
else:
raise ValueError(f"Unsupported TensorDtype: {dtype}")
def _convert_numpy_dtype_to_tensor(dtype: np.dtype[Any]) -> TensorDtype:
"""Convert numpy dtype to TensorDtype."""
if dtype.kind == "f":
return "float32"
elif dtype.kind == "i":
return "int64"
else:
raise ValueError(f"Unsupported numpy dtype: {dtype}")
def _convert_torch_dtype_to_tensor(dtype: "torch.dtype") -> TensorDtype:
"""Convert torch dtype to TensorDtype."""
# torch.dtype objects have .is_floating_point
if getattr(dtype, "is_floating_point", False):
return "float32"
else:
return "int64"
```
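`from_numpy` flattens the array and records its shape; `to_numpy` rebuilds the array and reshapes it. The round trip can be sketched with numpy alone, mirroring what the two methods do without the SDK class:

```python
import numpy as np

# Mirror of TensorData.from_numpy -> to_numpy.
array = np.arange(6, dtype=np.int64).reshape(2, 3)

data = array.flatten().tolist()   # what from_numpy stores in `data`
shape = list(array.shape)         # what from_numpy stores in `shape`

rebuilt = np.array(data, dtype=np.int64).reshape(shape)  # what to_numpy returns
```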
## Type: TensorDtype
```python
TensorDtype: TypeAlias = Literal["int64", "float32"]
```
## Type: UnloadModelResponse
```python
class UnloadModelResponse(BaseModel):
model_id: ModelID
type: Optional[Literal["unload_model"]] = None
```