Fine-Tuning Embedding Models#
Fine-tuning adapts a pretrained embedding model to your specific data, producing embeddings that better capture the distinctions that matter for your task. OrcaCloud handles the training infrastructure – you provide a name, training data, and optionally the parameters you want to control.
For background on how embedding models work in OrcaCloud and which pretrained models are available, see the Embedding Models guide.
Prerequisites#
Before you start, make sure you have:
- OrcaSDK installed and an API key configured (see the Quick Start if you haven’t set this up yet)
- Training data uploaded as a `Datasource` or stored in a `LabeledMemoryset`/`ScoredMemoryset`
Quick Start#
The simplest fine-tuning call only needs a name and training data. Everything else – loss function, learning rate, batch size, number of epochs – is set to sensible defaults automatically.
- You can fine-tune any pretrained model. `F2LLM_160M` is a good general-purpose starting point.
- Every finetuned model needs a unique name.
- When you pass a `LabeledMemoryset`, the task type (classification) and target column are inferred automatically. You can also pass a `Datasource` with an explicit `label_column` or `score_column`.
- `if_exists="open"` returns the existing model if one with this name already exists. This is especially useful in Jupyter notebooks where cells may be re-executed.
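Putting the pieces above together, a minimal call might look like the following sketch. The class and attribute names (`LabeledMemoryset.open`, `PretrainedEmbeddingModel.F2LLM_160M`) are assumed from names used elsewhere in these docs – check the SDK reference for the exact API:

```python
# Sketch only – OrcaSDK class and method names here are assumed, not verified
memoryset = LabeledMemoryset.open("my-labeled-memoryset")

model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my-finetuned-model",  # every finetuned model needs a unique name
    memoryset,             # task type and target column are inferred
    if_exists="open",      # reuse an existing model with this name
)
```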
With defaults, this runs a single training epoch using the "prediction" loss (a linear classification head on top of the embeddings). An eval split is automatically held out from the training data.
Once training completes, use the finetuned model by cloning your memoryset with it:
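The exact clone signature is not shown in this guide, so treat this as an assumed shape:

```python
# Assumed API shape – verify the clone signature in the SDK reference
finetuned_memoryset = memoryset.clone(
    "my-memoryset-finetuned",
    embedding_model=model,
)
```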
Training Data#
The finetune method accepts three types of training data:
Pass a LabeledMemoryset directly. The task type is set to classification and the label column is inferred from the memoryset.
Pass a ScoredMemoryset directly. The task type is set to regression and the score column is inferred from the memoryset.
Pass a Datasource with an explicit target column. Use label_column for classification or score_column for regression. If your text column is not called "value", specify value_column as well.
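The three options might be invoked as follows. These calls are illustrative sketches – the argument names follow the descriptions above, and `labeled_memoryset`, `scored_memoryset`, and `datasource` stand in for objects you have already created:

```python
# Illustrative sketches – argument names follow the descriptions above
base_model.finetune("clf-model", labeled_memoryset)   # classification, label column inferred
base_model.finetune("reg-model", scored_memoryset)    # regression, score column inferred
base_model.finetune(
    "from-datasource",
    datasource,
    label_column="label",  # or score_column=... for regression
    value_column="text",   # only needed if the text column is not "value"
)
```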
Evaluation data
When you omit eval_datasource, a split is automatically held out from the training data for evaluation. For more reliable evaluation – especially with small datasets – provide a separate Datasource via eval_datasource.
Customizing Training#
Loss Functions#
The loss parameter controls how the model learns from your data. Each loss function shapes the embedding space differently:
| Loss | Description | Task types |
|---|---|---|
| `"prediction"` | Trains a linear head on top of the embeddings. Simple and fast – a good starting point. | Classification, Regression |
| `"contrastive"` | In-batch contrastive loss. Trains embeddings directly for similarity. Often produces better embeddings for retrieval tasks and scales well to large batches. | Classification, Regression |
| `"triplet"` | Batch-hard triplet loss. Pulls same-class embeddings together. Simpler than contrastive but requires the full batch to fit on one GPU. | Classification |
| `"proxy"` | Proxy-anchor loss. Learns class proxies in embedding space. Particularly useful for class-imbalanced datasets. | Classification |
Choosing a loss function
Start with "prediction" (the default) for your first run. It is the fastest to train and gives you a baseline to compare against.
Try "contrastive" when you want to optimize the embedding space directly for similarity-based retrieval. It tends to produce better embeddings than prediction loss for nearest-neighbor lookups, especially with larger batch sizes.
Try "triplet" as a simpler alternative to contrastive. It works well when you have well-separated classes and moderate dataset sizes.
Try "proxy" when your dataset has significant class imbalance. The learned proxies help the model focus on underrepresented classes.
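To make the metric losses less abstract, here is a minimal, framework-free sketch of batch-hard triplet mining – the mechanic behind `"triplet"`. This is an illustration of the general technique, not OrcaCloud's implementation:

```python
import math

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Toy batch-hard triplet loss: for each anchor, compare its farthest
    same-class sample against its closest other-class sample."""
    losses = []
    n = len(labels)
    for i in range(n):
        pos = [math.dist(embeddings[i], embeddings[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [math.dist(embeddings[i], embeddings[j])
               for j in range(n) if labels[j] != labels[i]]
        if pos and neg:  # anchor needs at least one positive and one negative
            losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Well-separated classes incur no loss; overlapping classes do
print(batch_hard_triplet_loss([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1]))  # 0.0
```

Because every anchor needs its positives and negatives in the same batch, the whole batch has to sit on one device – which is why the table above notes that `"triplet"` cannot be split across GPUs the way `"contrastive"` can.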
Hyperparameters#
Pass a hyperparams dict to override any training parameter. Only specify what you want to change – everything else keeps its loss-specific default.
The most commonly tuned parameters are:
| Parameter | Description | Default |
|---|---|---|
| `epochs` | Number of full passes over the training data | 1 (single run), 2 (sweep), 3 (early stopping) |
| `learning_rate` | Peak learning rate after warmup | 5e-5 |
| `batch_size` | Effective batch size – the number of samples the loss sees per optimizer step. OrcaCloud automatically splits this across GPUs and gradient accumulation steps, so you only pick the batch size that is best for learning, not what fits on a single device. See Sequence Length, Batch Size, and Memory. | 64 (prediction/triplet), 128 (contrastive/proxy) |
| `max_seq_length` | Maximum token length for input text, or a percentile string (`"p90"`, `"p95"`, `"p99"`, `"max"`). Longer sequences preserve more text but use quadratically more memory and training time. See Sequence Length, Batch Size, and Memory. | `"p99"` (covers 99% of your training samples) |
| `truncation_side` | Which end to drop when an input exceeds `max_seq_length`. `"right"` keeps the beginning of the text; `"left"` keeps the ending. Flip to `"left"` when the task-relevant signal is at the end (e.g. a verdict at the end of a customer-support chat). | `"right"` |
| `warmup` | Learning rate warmup: int = steps, float = fraction of total steps | Varies by loss |
| `learning_rate_scheduler` | How the learning rate changes after warmup: `"linear"`, `"cosine"`, or `"constant"` | `"linear"` |
| `early_stopping` | Stop training when the eval metric plateaus: `True` = patience 2, int = custom patience | Auto-enabled for fast eval methods |
| `loss_scale` | Inverse temperature for contrastive and proxy losses | 20.0 (contrastive), 30.0 (proxy) |
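For example, a run that overrides only the loss function and a few hyperparameters might look like this sketch (call shape assumed from the parameters described above):

```python
# Sketch – override only what you want to change; the rest keeps its default
model = base_model.finetune(
    "tuned-model",
    memoryset,
    loss="contrastive",
    hyperparams={
        "epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": 256,
        "max_seq_length": "p95",
    },
)
```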
Additional hyperparameters
| Parameter | Description |
|---|---|
| `max_steps` | Maximum training steps. Overrides `epochs` when set. |
| `weight_decay` | L2 regularization strength (typical range: 0.0 to 0.1) |
| `normalize_embeddings` | L2-normalize embeddings before the head. Only applies to `"prediction"` loss. |
| `early_stopping_threshold` | Minimum improvement to count as progress for early stopping |
| `eval_method` | How to measure model quality during training: `"head"`, `"neighbor"`, or `"loss"`. See Evaluation During Training. |
| `eval_steps` | How often to evaluate: int = every N steps, `"epoch"`, `"end"`, or `"off"` |
See the EmbeddingFinetuneHyperparams reference for the full list.
Sequence Length, Batch Size, and Memory#
Three parameters – max_seq_length, batch_size, and the model itself – together determine how much GPU memory a training step needs. Understanding how they interact makes it much easier to dial in a fast, stable run.
- `max_seq_length` has the biggest effect. Transformer memory and compute scale roughly quadratically with sequence length, so halving it can more than halve training time. OrcaCloud's default of `"p99"` picks the sequence length that fits 99% of your samples exactly, with the remaining 1% truncated from the `truncation_side`. For most text classification workloads this is dramatically cheaper than the model's raw `max_seq_length` with almost no quality loss. Use `"p95"` to trim more aggressively, or pass an integer (e.g. `max_seq_length=256`) when you know the right cutoff for your data.
- `batch_size` is the effective batch size – the number of samples the loss function sees per optimizer step. Larger batches give more stable gradients, and contrastive and triplet losses specifically benefit from more in-batch negatives. You don't need to tune a separate "micro-batch" or gradient-accumulation value: OrcaCloud auto-detects how many samples actually fit on each GPU and splits your effective batch into per-device micro-batches with gradient accumulation as needed. So `batch_size=128` always trains as if 128 samples produced one gradient step, regardless of which GPU you land on.
- Mixed precision (`bf16`) and gradient checkpointing are also auto-enabled when they help. On supported GPUs (A100/H100 class) OrcaCloud trains in bfloat16 to roughly halve memory, and it turns on gradient checkpointing automatically when a configuration would otherwise not fit (e.g. a large contrastive batch) – at the cost of some extra compute.
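The two mechanisms can be made concrete with a rough, standalone sketch: one helper resolves a percentile spec to a token length, the other splits an effective batch into per-device micro-batches plus gradient-accumulation steps. The service's actual logic may differ; this only illustrates the arithmetic:

```python
import math

def resolve_seq_length(token_lengths, spec="p99"):
    """Resolve "p90"/"p95"/"p99"/"max" to a concrete max_seq_length
    using the nearest-rank percentile of observed token lengths."""
    if spec == "max":
        return max(token_lengths)
    pct = int(spec.lstrip("p"))  # "p99" -> 99
    ordered = sorted(token_lengths)
    idx = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[max(idx, 0)]

def split_effective_batch(batch_size, per_device_capacity, num_gpus=1):
    """Split an effective batch so that one optimizer step still sees
    `batch_size` samples: micro-batch per device x GPUs x accumulation."""
    per_step = per_device_capacity * num_gpus
    accumulation_steps = math.ceil(batch_size / per_step)
    micro_batch = math.ceil(batch_size / (accumulation_steps * num_gpus))
    return micro_batch, accumulation_steps

lengths = list(range(1, 101))                       # samples of 1..100 tokens
print(resolve_seq_length(lengths, "p95"))           # 95
print(split_effective_batch(128, 16, num_gpus=2))   # (16, 4)
```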
If a run fails with out-of-memory
The most effective levers, in order, are:
- Lower `max_seq_length` (e.g. from `"p99"` to `"p95"`, or from 512 to 256). Because cost is quadratic in sequence length, this often recovers far more memory than reducing batch size.
- Lower `batch_size`. Since OrcaCloud already grad-accumulates to fit a large effective batch on small devices, you usually only need to reduce `batch_size` when the resulting micro-batch for a single loss step (e.g. in contrastive or triplet losses, which can't split arbitrarily) won't fit.
- Pick a smaller base model. For instance, use `F2LLM_80M` instead of `F2LLM_160M` or `F2LLM_330M`. Smaller models trade a bit of quality for a lot of headroom.
Instructions When Fine-Tuning#
Instruction-tuned base models such as F2LLM_160M, QWEN_600M, and the HARRIER_* family perform best when you pair them with a task-specific instruction. The instruction is attached to the memoryset (not the finetune call) so that it is applied consistently every time text is embedded – see Using Instructions in the embeddings guide for the full walkthrough.
When you fine-tune from a memoryset and you do not pass an instruction in hyperparams, OrcaCloud copies the memoryset’s instruction into the training config if the memoryset has one and the base model supports instructions. That keeps training aligned with how memories in that memoryset were embedded. If you omit instruction on both the memoryset and hyperparams, training uses the base model’s default prompt.
When you fine-tune from a datasource, there is no memoryset instruction to copy; set hyperparams={"instruction": "..."} when you want a custom training instruction on an instruction-capable base model.
Passing instruction explicitly in hyperparams always wins over the memoryset default for that run.
Hyperparameter Sweeps#
When you are not sure which hyperparameter values work best, you can run an automated sweep. OrcaCloud uses Optuna to search the hyperparameter space efficiently.
Sweeps are driven by trial_count. Any value greater than 1 activates a sweep, and you have two ways to control what is searched:
- Auto sweep – only set `trial_count`. OrcaCloud automatically picks which parameters to sweep and injects sensible, loss-specific search ranges for them. This is the easiest way to start.
- Custom sweep – set `trial_count` and pass explicit ranges in `hyperparams`. Any parameter whose value is a tuple or list is swept; any parameter set to a scalar is held fixed for every trial. Parameters you don't mention keep their defaults and are excluded from the search.
In other words: the system always auto-selects which parameters to sweep – passing tuples/lists in hyperparams simply overrides the default selection for those specific parameters.
Auto Sweep#
The easiest way to sweep is to just set trial_count. OrcaCloud picks the highest-impact parameters for your loss function (typically learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) and injects loss-specific ranges:
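An auto sweep is a one-line change to the sketch used earlier – only `trial_count` is added (call shape assumed):

```python
# Auto sweep sketch – only trial_count is set; parameters and ranges are chosen by the service
model = base_model.finetune("auto-swept-model", memoryset, trial_count=9)
```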
A good rule of thumb: budget roughly 2n + 1 trials for n search parameters. With 9 trials, the system can effectively explore up to 4 parameters.
Custom Sweep#
For finer control, add explicit search ranges to hyperparams. The syntax per-parameter determines what happens:
- Tuples `(min, max)` define a continuous range (log-uniform for `learning_rate`, uniform otherwise).
- Lists `[a, b, c]` define categorical choices – Optuna picks one per trial.
- Scalars fix a parameter to a single value (excluded from the search, even if it would normally be auto-swept).
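A custom sweep combining all three forms might look like this sketch (call shape assumed; the tuple/list/scalar semantics follow the rules above):

```python
# Custom sweep sketch – tuples are ranges, lists are choices, scalars are fixed
model = base_model.finetune(
    "custom-swept-model",
    memoryset,
    trial_count=9,
    hyperparams={
        "learning_rate": (1e-5, 1e-4),  # continuous range, log-uniform
        "batch_size": [64, 128, 256],   # categorical choices
        "epochs": 3,                    # held fixed for every trial
    },
)
```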
Start with auto sweep
If you are new to sweeping, start by only setting trial_count and letting the system choose what to search. You can always narrow down the search space later once you see which parameters matter most.
Inspecting Trials#
After a sweep completes, inspect the individual trials to understand which configurations performed best:
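A sketch of how this might look – the `trials` attribute name is an assumption based on the output shown below, so check the SDK reference:

```python
# Sketch – the attribute name `trials` is assumed, not verified
for trial in model.trials:
    print(trial)
```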
```
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 5.2e-05, 'batch_size': 64, 'epochs': 3, 'warmup': 142},
    metrics: {'f1_score': 0.91, 'accuracy': 0.91}
})
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 1.8e-05, 'batch_size': 128, 'epochs': 5, 'warmup': 87},
    metrics: {'f1_score': 0.89, 'accuracy': 0.89}
})
```
The best trial is automatically selected as the final model.
Evaluation During Training#
OrcaCloud evaluates your model during training to track progress and enable early stopping. The eval_method parameter controls how evaluation is performed:
| Method | Description | Default for |
|---|---|---|
| `"head"` | Uses the trained prediction head for fast evaluation. Auto-enables early stopping. | Single-run prediction loss |
| `"neighbor"` | Nearest-neighbor evaluation (FAISS KNN). More representative of retrieval performance but slower. | Sweeps and metric losses |
| `"loss"` | Monitors the training loss on the eval set. Lightweight but less informative. | – |
You rarely need to set eval_method explicitly – the defaults are chosen based on your loss function and whether you are running a sweep.
Early stopping
For single runs with prediction loss, early stopping is auto-enabled with a patience of 2 and defaults to 3 epochs. You can configure it explicitly with "early_stopping": True (patience 2) or "early_stopping": 5 (patience 5) in hyperparams.
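The patience mechanic can be sketched in a few lines. This is illustrative only – the service tracks its own eval metric and schedule:

```python
def should_stop(eval_scores, patience=2, threshold=0.0):
    """Return True once the eval metric fails to improve (by more than
    `threshold`) for `patience` consecutive evaluations."""
    best = float("-inf")
    evals_since_improvement = 0
    for score in eval_scores:
        if score > best + threshold:
            best = score
            evals_since_improvement = 0
        else:
            evals_since_improvement += 1
            if evals_since_improvement >= patience:
                return True
    return False

print(should_stop([0.60, 0.70, 0.69, 0.68]))  # True  (two evals without improvement)
print(should_stop([0.60, 0.70, 0.69, 0.71]))  # False (recovered before patience ran out)
```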
Background Jobs#
Fine-tuning can take a while depending on your dataset size and configuration. Use background=True to get a Job handle immediately and continue working:
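A sketch of the background pattern – the `Job` method for retrieving the result is an assumption, so check the SDK reference:

```python
# Sketch – Job handle usage; the result-retrieval method name is assumed
job = base_model.finetune("bg-model", memoryset, background=True)
# ...do other work while training runs...
model = job.result()  # assumed to block until the job finishes
```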
Managing Finetuned Models#
Open an Existing Model#
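A sketch, assuming a `FinetunedEmbeddingModel` class named consistently with `FinetunedEmbeddingModelTrial` and `EmbeddingFinetuneHyperparams` elsewhere in these docs:

```python
# Sketch – class name assumed from names used elsewhere in these docs
model = FinetunedEmbeddingModel.open("my-finetuned-model")
```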
List All Models#
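A sketch – the listing method name is an assumption:

```python
# Sketch – method name assumed, not verified
for model in FinetunedEmbeddingModel.all():
    print(model.name)
```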
Delete a Model#
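A sketch – the delete method name is an assumption, while `if_not_exists="ignore"` comes from the notes below:

```python
# Sketch – method name assumed; if_not_exists matches the note below
FinetunedEmbeddingModel.drop("my-finetuned-model", if_not_exists="ignore")
```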
- You cannot delete a model that is currently in use by a memoryset.
- The `if_not_exists="ignore"` option prevents an error if the model has already been deleted.
Tips and Best Practices#
Start simple
Run fine-tuning with all defaults first to establish a baseline. Then iterate on loss function, hyperparameters, or sweep configuration based on results.
Use `if_exists="open"` in notebooks
This prevents errors when re-running cells and avoids accidentally re-training a model.
Contrastive loss for retrieval
If your primary goal is to improve nearest-neighbor retrieval quality (e.g. for a classification model backed by a memoryset), try loss="contrastive". It directly optimizes embedding similarity rather than training a prediction head.
Provide separate eval data for small datasets
When your training set is small, the automatic eval split may not be representative. Pass an explicit eval_datasource for more reliable evaluation.
Early stopping prevents overfitting
For longer training runs, set "early_stopping": True in hyperparams. This automatically stops training when the eval metric plateaus, saving time and avoiding overfitting.