
Fine-Tuning Embedding Models#

Fine-tuning adapts a pretrained embedding model to your specific data, producing embeddings that better capture the distinctions that matter for your task. OrcaCloud handles the training infrastructure – you provide a name, training data, and optionally the parameters you want to control.

For background on how embedding models work in OrcaCloud and which pretrained models are available, see the Embedding Models guide.

Prerequisites#

Before you start, make sure you have the orca_sdk package installed and a memoryset (or datasource) containing your training data.

Quick Start#

The simplest fine-tuning call only needs a name and training data. Everything else – loss function, learning rate, batch size, number of epochs – is set to sensible defaults automatically.

from orca_sdk import PretrainedEmbeddingModel, LabeledMemoryset

memoryset = LabeledMemoryset.open("my_memoryset")

finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune( # (1)!
    "my_finetuned_model", # (2)!
    memoryset, # (3)!
    if_exists="open", # (4)!
)

  1. You can fine-tune any pretrained model. F2LLM_160M is a good general-purpose starting point.
  2. Every finetuned model needs a unique name.
  3. When you pass a LabeledMemoryset, the task type (classification) and target column are inferred automatically. You can also pass a Datasource with an explicit label_column or score_column.
  4. if_exists="open" returns the existing model if one with this name already exists. This is especially useful in Jupyter notebooks where cells may be re-executed.

With defaults, this runs a single training epoch using the "prediction" loss (a linear classification head on top of the embeddings). An eval split is automatically held out from the training data.

Once training completes, use the finetuned model by cloning your memoryset with it:

finetuned_memoryset = memoryset.clone(
    "my_finetuned_memoryset",
    embedding_model=finetuned_model,
    if_exists="open",
)
LabeledMemoryset({
    name: 'my_finetuned_memoryset',
    length: 2500,
    label_names: ['neg', 'pos'],
    embedding_model: FinetunedEmbeddingModel({name: my_finetuned_model, embedding_dim: 768, max_seq_length: 40960}),
})

Training Data#

The finetune method accepts three types of training data:

Pass a LabeledMemoryset directly. The task type is set to classification and the label column is inferred from the memoryset.

memoryset = LabeledMemoryset.open("my_memoryset")
model.finetune("my_model", memoryset)

Pass a ScoredMemoryset directly. The task type is set to regression and the score column is inferred from the memoryset.

memoryset = ScoredMemoryset.open("my_scored_memoryset")
model.finetune("my_model", memoryset)

Pass a Datasource with an explicit target column. Use label_column for classification or score_column for regression. If your text column is not called "value", specify value_column as well.

datasource = Datasource.open("my_datasource")
model.finetune("my_model", datasource, value_column="text", label_column="sentiment")
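For the regression case, the same call takes score_column instead. A sketch, assuming a hypothetical datasource with a numeric "rating" column:

```python
# Setting score_column instead of label_column makes the task type regression.
datasource = Datasource.open("my_datasource")
model.finetune("my_regression_model", datasource, value_column="text", score_column="rating")
```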

Evaluation data

When you omit eval_datasource, a split is automatically held out from the training data for evaluation. For more reliable evaluation – especially with small datasets – provide a separate Datasource via eval_datasource.
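A run with a held-out eval set might look like this (a sketch; "my_eval_datasource" is a hypothetical name):

```python
# A separate datasource reserved for evaluation keeps the eval split
# independent of the training data.
eval_datasource = Datasource.open("my_eval_datasource")

finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_model",
    memoryset,
    eval_datasource=eval_datasource,
)
```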

Customizing Training#

Loss Functions#

The loss parameter controls how the model learns from your data. Each loss function shapes the embedding space differently:

| Loss | Description | Task types |
| --- | --- | --- |
| `"prediction"` | Trains a linear head on top of the embeddings. Simple and fast – a good starting point. | Classification, Regression |
| `"contrastive"` | In-batch contrastive loss. Trains embeddings directly for similarity. Often produces better embeddings for retrieval tasks and scales well to large batches. | Classification, Regression |
| `"triplet"` | Batch-hard triplet loss. Pulls same-class embeddings together. Simpler than contrastive but requires the full batch to fit on one GPU. | Classification |
| `"proxy"` | Proxy-anchor loss. Learns class proxies in embedding space. Particularly useful for class-imbalanced datasets. | Classification |

Choosing a loss function

Start with "prediction" (the default) for your first run. It is the fastest to train and gives you a baseline to compare against.

Try "contrastive" when you want to optimize the embedding space directly for similarity-based retrieval. It tends to produce better embeddings than prediction loss for nearest-neighbor lookups, especially with larger batch sizes.

Try "triplet" as a simpler alternative to contrastive. It works well when you have well-separated classes and moderate dataset sizes.

Try "proxy" when your dataset has significant class imbalance. The learned proxies help the model focus on underrepresented classes.

Hyperparameters#

Pass a hyperparams dict to override any training parameter. Only specify what you want to change – everything else keeps its loss-specific default.

finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_custom_model",
    memoryset,
    loss="contrastive",
    hyperparams={
        "learning_rate": 3e-5,
        "batch_size": 128,
        "epochs": 3,
        "warmup": 100,
        "early_stopping": True,
    },
)

The most commonly tuned parameters are:

| Parameter | Description | Default |
| --- | --- | --- |
| `epochs` | Number of full passes over the training data | 1 (single run), 2 (sweep), 3 (early stopping) |
| `learning_rate` | Peak learning rate after warmup | 5e-5 |
| `batch_size` | Effective batch size – the number of samples the loss sees per optimizer step. OrcaCloud automatically splits this across GPUs and gradient accumulation steps, so you only pick the batch size that is best for learning, not what fits on a single device. See Sequence Length, Batch Size, and Memory. | 64 (prediction/triplet), 128 (contrastive/proxy) |
| `max_seq_length` | Maximum token length for input text, or a percentile string (`"p90"`, `"p95"`, `"p99"`, `"max"`). Longer sequences preserve more text but use quadratically more memory and training time. See Sequence Length, Batch Size, and Memory. | `"p99"` (covers 99% of your training samples) |
| `truncation_side` | Which end to drop when an input exceeds `max_seq_length`. `"right"` keeps the beginning of the text; `"left"` keeps the ending. Flip to `"left"` when the task-relevant signal is at the end (e.g. a verdict at the end of a customer-support chat). | `"right"` |
| `warmup` | Learning rate warmup: int = steps, float = fraction of total steps | Varies by loss |
| `learning_rate_scheduler` | How the learning rate changes after warmup: `"linear"`, `"cosine"`, or `"constant"` | `"linear"` |
| `early_stopping` | Stop training when the eval metric plateaus. `True` = patience 2, int = custom patience | Auto-enabled for fast eval methods |
| `loss_scale` | Inverse temperature for contrastive and proxy losses | 20.0 (contrastive), 30.0 (proxy) |

Additional hyperparameters
| Parameter | Description |
| --- | --- |
| `max_steps` | Maximum training steps. Overrides `epochs` when set. |
| `weight_decay` | L2 regularization strength (typical range: 0.0 to 0.1) |
| `normalize_embeddings` | L2-normalize embeddings before the head. Only applies to `"prediction"` loss. |
| `early_stopping_threshold` | Minimum improvement to count as progress for early stopping |
| `eval_method` | How to measure model quality during training: `"head"`, `"neighbor"`, or `"loss"`. See Evaluation During Training. |
| `eval_steps` | How often to evaluate: int = every N steps, `"epoch"`, `"end"`, or `"off"` |

See the EmbeddingFinetuneHyperparams reference for the full list.

Sequence Length, Batch Size, and Memory#

Three parameters – max_seq_length, batch_size, and the model itself – together determine how much GPU memory a training step needs. Understanding how they interact makes it much easier to dial in a fast, stable run.

  • max_seq_length has the biggest effect. Transformer memory and compute scale roughly quadratically with sequence length, so halving it can more than halve training time. OrcaCloud’s default of "p99" picks the sequence length that fits 99% of your samples exactly, with the remaining 1% truncated from the truncation_side. For most text classification workloads this is dramatically cheaper than the model’s raw max_seq_length with almost no quality loss. Use "p95" to trim more aggressively, or pass an integer (e.g. max_seq_length=256) when you know the right cutoff for your data.
  • batch_size is the effective batch size – the number of samples the loss function sees per optimizer step. Larger batches give more stable gradients, and contrastive and triplet losses specifically benefit from more in-batch negatives. You don’t need to tune a separate “micro-batch” or gradient-accumulation value: OrcaCloud auto-detects how many samples actually fit on each GPU and splits your effective batch into per-device micro-batches with gradient accumulation as needed. So batch_size=128 always trains as if 128 samples produced one gradient step, regardless of which GPU you land on.
  • Mixed precision (bf16) and gradient checkpointing are also auto-enabled when they help. On supported GPUs (A100/H100 class) OrcaCloud trains in bfloat16 to roughly halve memory, and it turns on gradient checkpointing automatically when a configuration would otherwise not fit (e.g. a large contrastive batch) – at the cost of some extra compute.

If a run fails with out-of-memory

The most effective levers, in order, are:

  1. Lower max_seq_length (e.g. from "p99" to "p95", or from 512 to 256). Because cost is quadratic in sequence length, this often recovers far more memory than reducing batch size.
  2. Lower batch_size. Since OrcaCloud already grad-accumulates to fit a large effective batch on small devices, you usually only need to reduce batch_size when the resulting micro-batch for a single loss step (e.g. in contrastive or triplet losses, which can’t split arbitrarily) won’t fit.
  3. Pick a smaller base model. For instance, use F2LLM_80M instead of F2LLM_160M or F2LLM_330M. Smaller models trade a bit of quality for a lot of headroom.
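The effective-batch bookkeeping described above can be sketched in plain Python (illustrative only; the real scheduler runs server-side and also accounts for model size and sequence length):

```python
import math

def split_batch(effective_batch: int, per_device_capacity: int, num_gpus: int) -> tuple[int, int]:
    """Return (micro_batch_per_device, grad_accumulation_steps) such that
    micro_batch * num_gpus * accumulation >= effective_batch."""
    micro = min(per_device_capacity, math.ceil(effective_batch / num_gpus))
    accumulation = math.ceil(effective_batch / (micro * num_gpus))
    return micro, accumulation

# An effective batch of 128 on one GPU that fits 16 samples:
# 16 samples per step, gradients accumulated over 8 steps.
print(split_batch(128, per_device_capacity=16, num_gpus=1))  # (16, 8)

# The same effective batch on two roomier GPUs needs no accumulation.
print(split_batch(128, per_device_capacity=64, num_gpus=2))  # (64, 1)
```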

Instructions When Fine-Tuning#

Instruction-tuned base models such as F2LLM_160M, QWEN_600M, and the HARRIER_* family perform best when you pair them with a task-specific instruction. The instruction is attached to the memoryset (not the finetune call) so that it is applied consistently every time text is embedded – see Using Instructions in the embeddings guide for the full walkthrough.

When you fine-tune from a memoryset and you do not pass an instruction in hyperparams, OrcaCloud copies the memoryset’s instruction into the training config if the memoryset has one and the base model supports instructions. That keeps training aligned with how memories in that memoryset were embedded. If you omit instruction on both the memoryset and hyperparams, training uses the base model’s default prompt.

When you fine-tune from a datasource, there is no memoryset instruction to copy; set hyperparams={"instruction": "..."} when you want a custom training instruction on an instruction-capable base model.

Passing instruction explicitly in hyperparams always wins over the memoryset default for that run.
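Putting this together for the datasource case, a run with a custom training instruction might look like this (a sketch; the instruction text and column names are hypothetical):

```python
# On an instruction-capable base model, the instruction in hyperparams
# is used for every embedding computed during training.
datasource = Datasource.open("my_datasource")

finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_instructed_model",
    datasource,
    value_column="text",
    label_column="sentiment",
    hyperparams={
        "instruction": "Classify the sentiment of the customer review.",
    },
)
```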

Hyperparameter Sweeps#

When you are not sure which hyperparameter values work best, you can run an automated sweep. OrcaCloud uses Optuna to search the hyperparameter space efficiently.

Sweeps are driven by trial_count. Any value greater than 1 activates a sweep, and you have two ways to control what is searched:

  • Auto sweep – only set trial_count. OrcaCloud automatically picks which parameters to sweep and injects sensible, loss-specific search ranges for them. This is the easiest way to start.
  • Custom sweep – set trial_count and pass explicit ranges in hyperparams. Any parameter whose value is a tuple or list is swept over that range; any parameter set to a scalar is held fixed for every trial. Parameters you don't mention keep their auto-sweep behavior: if they would be auto-selected for your loss, they are still searched with default ranges; otherwise they keep their defaults.

In other words: the system always auto-selects which parameters to sweep – passing tuples/lists in hyperparams simply overrides the default selection for those specific parameters.

Auto Sweep#

The easiest way to sweep is to just set trial_count. OrcaCloud picks the highest-impact parameters for your loss function (typically learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) and injects loss-specific ranges:

finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_sweep_model",
    memoryset,
    trial_count=9, # (1)!
)

  1. A good rule of thumb: budget roughly 2n + 1 trials for n search parameters. With 9 trials, the system can effectively explore up to 4 parameters.
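The 2n + 1 rule of thumb is easy to invert: given a trial budget, how many parameters can the sweep effectively explore? A small sketch:

```python
def max_sweep_params(trial_count: int) -> int:
    """Parameters explorable under the 2n + 1 trials-per-parameter rule of thumb."""
    return max((trial_count - 1) // 2, 0)

print(max_sweep_params(9))   # 4 parameters
print(max_sweep_params(31))  # 15 parameters
```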

Custom Sweep#

For finer control, add explicit search ranges to hyperparams. The per-parameter syntax determines what happens:

  • Tuples (min, max) define a continuous range (log-uniform for learning_rate, uniform otherwise).
  • Lists [a, b, c] define categorical choices – Optuna picks one per trial.
  • Scalars fix a parameter to a single value (excluded from the search, even if it would normally be auto-swept).
finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_custom_sweep_model",
    memoryset,
    loss="contrastive",
    trial_count=15,
    hyperparams={
        "learning_rate": (1e-5, 1e-3),   # continuous range
        "batch_size": [32, 64, 128],      # categorical choices
        "epochs": 4,                      # fixed, not swept
        "loss_scale": (16.0, 40.0),       # continuous range
    },
)

Start with auto sweep

If you are new to sweeping, start by only setting trial_count and letting the system choose what to search. You can always narrow down the search space later once you see which parameters matter most.

Inspecting Trials#

After a sweep completes, inspect the individual trials to understand which configurations performed best:

for trial in finetuned_model.trials:
    print(trial)
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 5.2e-05, 'batch_size': 64, 'epochs': 3, 'warmup': 142},
    metrics: {'f1_score': 0.91, 'accuracy': 0.91}
})
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 1.8e-05, 'batch_size': 128, 'epochs': 5, 'warmup': 87},
    metrics: {'f1_score': 0.89, 'accuracy': 0.89}
})

The best trial is automatically selected as the final model.
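If you want to rank trials yourself rather than rely on the automatic selection, you can sort them by a metric. A plain-Python sketch using dicts that mirror the printed trials above (the real trial objects may expose these fields differently):

```python
# Stand-ins for the trial objects shown above; field names are assumed.
trials = [
    {"hyperparameters": {"learning_rate": 5.2e-05, "batch_size": 64},
     "metrics": {"f1_score": 0.91}},
    {"hyperparameters": {"learning_rate": 1.8e-05, "batch_size": 128},
     "metrics": {"f1_score": 0.89}},
]

# Pick the trial with the highest F1 score.
best = max(trials, key=lambda t: t["metrics"]["f1_score"])
print(best["metrics"]["f1_score"])  # 0.91
```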

Evaluation During Training#

OrcaCloud evaluates your model during training to track progress and enable early stopping. The eval_method parameter controls how evaluation is performed:

| Method | Description | Default for |
| --- | --- | --- |
| `"head"` | Uses the trained prediction head for fast evaluation. Auto-enables early stopping. | Single-run prediction loss |
| `"neighbor"` | Nearest-neighbor evaluation (FAISS KNN). More representative of retrieval performance but slower. | Sweeps and metric losses |
| `"loss"` | Monitors the training loss on the eval set. Lightweight but less informative. | – |

You rarely need to set eval_method explicitly – the defaults are chosen based on your loss function and whether you are running a sweep.

Early stopping

For single runs with prediction loss, early stopping is auto-enabled with a patience of 2 and defaults to 3 epochs. You can configure it explicitly with "early_stopping": True (patience 2) or "early_stopping": 5 (patience 5) in hyperparams.
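For example, to allow a longer run while letting the eval metric decide when to stop (a sketch; the epoch count is illustrative):

```python
finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_early_stop_model",
    memoryset,
    hyperparams={
        "epochs": 10,
        "early_stopping": 3,  # stop after 3 evaluations without improvement
    },
)
```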

Background Jobs#

Fine-tuning can take a while depending on your dataset size and configuration. Use background=True to get a Job handle immediately and continue working:

job = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_model",
    memoryset,
    background=True,
)

# Check status
print(job.status)

# Block until complete and get the result
finetuned_model = job.result()

Managing Finetuned Models#

Open an Existing Model#

from orca_sdk import FinetunedEmbeddingModel

model = FinetunedEmbeddingModel.open("my_finetuned_model")
FinetunedEmbeddingModel({
    name: my_finetuned_model,
    embedding_dim: 768,
    max_seq_length: 40960,
    base_model: PretrainedEmbeddingModel.F2LLM_160M
})

List All Models#

FinetunedEmbeddingModel.all()

Delete a Model#

FinetunedEmbeddingModel.drop("my_finetuned_model", if_not_exists="ignore") # (1)!

  1. You cannot delete a model that is currently in use by a memoryset. The if_not_exists="ignore" option prevents an error if the model has already been deleted.

Tips and Best Practices#

Start simple

Run fine-tuning with all defaults first to establish a baseline. Then iterate on loss function, hyperparameters, or sweep configuration based on results.

Use if_exists="open" in notebooks

This prevents errors when re-running cells and avoids accidentally re-training a model.

Contrastive loss for retrieval

If your primary goal is to improve nearest-neighbor retrieval quality (e.g. for a classification model backed by a memoryset), try loss="contrastive". It directly optimizes embedding similarity rather than training a prediction head.

Provide separate eval data for small datasets

When your training set is small, the automatic eval split may not be representative. Pass an explicit eval_datasource for more reliable evaluation.

Early stopping prevents overfitting

For longer training runs, set "early_stopping": True in hyperparams. This automatically stops training when the eval metric plateaus, saving time and avoiding overfitting.