Fine-Tuning Embedding Models#
Fine-tuning adapts a pretrained embedding model to your specific data, producing embeddings that better capture the distinctions that matter for your task. OrcaCloud handles the training infrastructure – you provide a name, training data, and optionally the parameters you want to control.
For background on how embedding models work in OrcaCloud and which pretrained models are available, see the Embedding Models guide.
Prerequisites#
Before you start, make sure you have:
- OrcaSDK installed and an API key configured (see the Quick Start if you haven’t set this up yet)
- Training data uploaded as a `Datasource` or stored in a `LabeledMemoryset`/`ScoredMemoryset`
Quick Start#
The simplest fine-tuning call only needs a name and training data. Everything else – loss function, learning rate, batch size, number of epochs – is set to sensible defaults automatically.
- You can fine-tune any pretrained model. `F2LLM_160M` is a good general-purpose starting point.
- Every finetuned model needs a unique name.
- When you pass a `LabeledMemoryset`, the task type (classification) and target column are inferred automatically. You can also pass a `Datasource` with an explicit `label_column` or `score_column`.
- `if_exists="open"` returns the existing model if one with this name already exists. This is especially useful in Jupyter notebooks where cells may be re-executed.
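Putting the pieces above together, a minimal call might look like the following sketch. The class and attribute names (`LabeledMemoryset.open`, `PretrainedEmbeddingModel.F2LLM_160M`) are assumed from names used elsewhere in these docs – check the SDK reference for the exact API:

```python
# Sketch only – OrcaSDK class and method names here are assumed, not verified
memoryset = LabeledMemoryset.open("my-labeled-memoryset")

model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my-finetuned-model",  # every finetuned model needs a unique name
    memoryset,             # task type and target column are inferred
    if_exists="open",      # reuse an existing model with this name
)
```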
With defaults, this runs a single training epoch using the "prediction" loss (a linear classification head on top of the embeddings). An eval split is automatically held out from the training data.
Once training completes, use the finetuned model by cloning your memoryset with it:
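The exact clone signature is not shown in this guide, so treat this as an assumed shape:

```python
# Assumed API shape – verify the clone signature in the SDK reference
finetuned_memoryset = memoryset.clone(
    "my-memoryset-finetuned",
    embedding_model=model,
)
```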
Training Data#
The finetune method accepts three types of training data:
Pass a LabeledMemoryset directly. The task type is set to classification and the label column is inferred from the memoryset.
Pass a ScoredMemoryset directly. The task type is set to regression and the score column is inferred from the memoryset.
Pass a Datasource with an explicit target column. Use label_column for classification or score_column for regression. If your text column is not called "value", specify value_column as well.
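The three options might be invoked as follows. These calls are illustrative sketches – the argument names follow the descriptions above, and `labeled_memoryset`, `scored_memoryset`, and `datasource` stand in for objects you have already created:

```python
# Illustrative sketches – argument names follow the descriptions above
base_model.finetune("clf-model", labeled_memoryset)   # classification, label column inferred
base_model.finetune("reg-model", scored_memoryset)    # regression, score column inferred
base_model.finetune(
    "from-datasource",
    datasource,
    label_column="label",  # or score_column=... for regression
    value_column="text",   # only needed if the text column is not "value"
)
```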
Evaluation data
When you omit eval_datasource, a split is automatically held out from the training data for evaluation. For more reliable evaluation – especially with small datasets – provide a separate Datasource via eval_datasource.
Customizing Training#
Loss Functions#
The loss parameter controls how the model learns from your data. Each loss function shapes the embedding space differently:
| Loss | Description | Task types |
|---|---|---|
| `"prediction"` | Trains a linear head on top of the embeddings. Simple and fast – a good starting point. | Classification, Regression |
| `"contrastive"` | In-batch contrastive loss. Trains embeddings directly for similarity. Often produces better embeddings for retrieval tasks and scales well to large batches. | Classification, Regression |
| `"triplet"` | Batch-hard triplet loss. Pulls same-class embeddings together. Simpler than contrastive but requires the full batch to fit on one GPU. | Classification |
| `"proxy"` | Proxy-anchor loss. Learns class proxies in embedding space. Particularly useful for class-imbalanced datasets. | Classification |
Choosing a loss function
Start with "prediction" (the default) for your first run. It is the fastest to train and gives you a baseline to compare against.
Try "contrastive" when you want to optimize the embedding space directly for similarity-based retrieval. It tends to produce better embeddings than prediction loss for nearest-neighbor lookups, especially with larger batch sizes.
Try "triplet" as a simpler alternative to contrastive. It works well when you have well-separated classes and moderate dataset sizes.
Try "proxy" when your dataset has significant class imbalance. The learned proxies help the model focus on underrepresented classes.
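To make the metric losses less abstract, here is a minimal, framework-free sketch of batch-hard triplet mining – the mechanic behind `"triplet"`. This is an illustration of the general technique, not OrcaCloud's implementation:

```python
import math

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Toy batch-hard triplet loss: for each anchor, compare its farthest
    same-class sample against its closest other-class sample."""
    losses = []
    n = len(labels)
    for i in range(n):
        pos = [math.dist(embeddings[i], embeddings[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [math.dist(embeddings[i], embeddings[j])
               for j in range(n) if labels[j] != labels[i]]
        if pos and neg:  # anchor needs at least one positive and one negative
            losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Well-separated classes incur no loss; overlapping classes do
print(batch_hard_triplet_loss([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1]))  # 0.0
```

Because every anchor needs its positives and negatives in the same batch, the whole batch has to sit on one device – which is why the table above notes that `"triplet"` cannot be split across GPUs the way `"contrastive"` can.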
Hyperparameters#
Pass a hyperparams dict to override any training parameter. Only specify what you want to change – everything else keeps its loss-specific default.
The most commonly tuned parameters are:
| Parameter | Description | Default |
|---|---|---|
| `epochs` | Number of full passes over the training data | 1 (single run), 2 (sweep), 3 (early stopping) |
| `learning_rate` | Peak learning rate after warmup | 5e-5 |
| `batch_size` | Effective batch size – the number of samples the loss sees per optimizer step. OrcaCloud automatically splits this across GPUs and gradient accumulation steps, so you only pick the batch size that is best for learning, not what fits on a single device. See Sequence Length, Batch Size, and Memory. | 64 (prediction/triplet), 128 (contrastive/proxy) |
| `max_seq_length` | Maximum token length for input text, or a percentile string (`"p90"`, `"p95"`, `"p99"`, `"max"`). Longer sequences preserve more text but use quadratically more memory and training time. See Sequence Length, Batch Size, and Memory. | `"p99"` (covers 99% of your training samples) |
| `truncation_side` | Which end to drop when an input exceeds `max_seq_length`. `"right"` keeps the beginning of the text; `"left"` keeps the ending. Flip to `"left"` when the task-relevant signal is at the end (e.g. a verdict at the end of a customer-support chat). | `"right"` |
| `warmup` | Learning rate warmup: int = steps, float = fraction of total steps | Varies by loss |
| `learning_rate_scheduler` | How the learning rate changes after warmup: `"linear"`, `"cosine"`, or `"constant"` | `"linear"` |
| `early_stopping` | Stop training when the eval metric plateaus: `True` = patience 2, int = custom patience | Auto-enabled for fast eval methods |
| `loss_scale` | Inverse temperature for contrastive and proxy losses | 20.0 (contrastive), 30.0 (proxy) |
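For example, a run that overrides only the loss function and a few hyperparameters might look like this sketch (call shape assumed from the parameters described above):

```python
# Sketch – override only what you want to change; the rest keeps its default
model = base_model.finetune(
    "tuned-model",
    memoryset,
    loss="contrastive",
    hyperparams={
        "epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": 256,
        "max_seq_length": "p95",
    },
)
```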
Additional hyperparameters
| Parameter | Description |
|---|---|
| `max_steps` | Maximum training steps. Overrides `epochs` when set. |
| `weight_decay` | L2 regularization strength (typical range: 0.0 to 0.1) |
| `normalize_embeddings` | L2-normalize embeddings before the head. Only applies to `"prediction"` loss. |
| `early_stopping_threshold` | Minimum improvement to count as progress for early stopping |
| `eval_method` | How to measure model quality during training: `"head"`, `"neighbor"`, or `"loss"`. See Evaluation During Training. |
| `eval_steps` | How often to evaluate: int = every N steps, `"epoch"`, `"end"`, or `"off"` |
See the EmbeddingFinetuneHyperparams reference for the full list.
Sequence Length, Batch Size, and Memory#
Three parameters – max_seq_length, batch_size, and the model itself – together determine how much GPU memory a training step needs. Understanding how they interact makes it much easier to dial in a fast, stable run.
- `max_seq_length` has the biggest effect. Transformer memory and compute scale roughly quadratically with sequence length, so halving it can more than halve training time. OrcaCloud's default of `"p99"` picks the sequence length that fits 99% of your samples exactly, with the remaining 1% truncated from the `truncation_side`. For most text classification workloads this is dramatically cheaper than the model's raw `max_seq_length` with almost no quality loss. Use `"p95"` to trim more aggressively, or pass an integer (e.g. `max_seq_length=256`) when you know the right cutoff for your data.
- `batch_size` is the effective batch size – the number of samples the loss function sees per optimizer step. Larger batches give more stable gradients, and contrastive and triplet losses specifically benefit from more in-batch negatives. You don't need to tune a separate "micro-batch" or gradient-accumulation value: OrcaCloud auto-detects how many samples actually fit on each GPU and splits your effective batch into per-device micro-batches with gradient accumulation as needed. So `batch_size=128` always trains as if 128 samples produced one gradient step, regardless of which GPU you land on.
- Mixed precision (`bf16`) and gradient checkpointing are also auto-enabled when they help. On supported GPUs (A100/H100 class) OrcaCloud trains in bfloat16 to roughly halve memory, and it turns on gradient checkpointing automatically when a configuration would otherwise not fit (e.g. a large contrastive batch) – at the cost of some extra compute.
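The two mechanisms can be made concrete with a rough, standalone sketch: one helper resolves a percentile spec to a token length, the other splits an effective batch into per-device micro-batches plus gradient-accumulation steps. The service's actual logic may differ; this only illustrates the arithmetic:

```python
import math

def resolve_seq_length(token_lengths, spec="p99"):
    """Resolve "p90"/"p95"/"p99"/"max" to a concrete max_seq_length
    using the nearest-rank percentile of observed token lengths."""
    if spec == "max":
        return max(token_lengths)
    pct = int(spec.lstrip("p"))  # "p99" -> 99
    ordered = sorted(token_lengths)
    idx = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[max(idx, 0)]

def split_effective_batch(batch_size, per_device_capacity, num_gpus=1):
    """Split an effective batch so that one optimizer step still sees
    `batch_size` samples: micro-batch per device x GPUs x accumulation."""
    per_step = per_device_capacity * num_gpus
    accumulation_steps = math.ceil(batch_size / per_step)
    micro_batch = math.ceil(batch_size / (accumulation_steps * num_gpus))
    return micro_batch, accumulation_steps

lengths = list(range(1, 101))                       # samples of 1..100 tokens
print(resolve_seq_length(lengths, "p95"))           # 95
print(split_effective_batch(128, 16, num_gpus=2))   # (16, 4)
```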
If a run fails with out-of-memory
The most effective levers, in order, are:
- Lower `max_seq_length` (e.g. from `"p99"` to `"p95"`, or from 512 to 256). Because cost is quadratic in sequence length, this often recovers far more memory than reducing batch size.
- Lower `batch_size`. Since OrcaCloud already grad-accumulates to fit a large effective batch on small devices, you usually only need to reduce `batch_size` when the resulting micro-batch for a single loss step (e.g. in contrastive or triplet losses, which can't split arbitrarily) won't fit.
- Pick a smaller base model. For instance, use `F2LLM_80M` instead of `F2LLM_160M` or `F2LLM_330M`. Smaller models trade a bit of quality for a lot of headroom.
Instructions When Fine-Tuning#
Instruction-tuned base models such as F2LLM_160M, QWEN_600M, and the HARRIER_* family perform best when you pair them with a task-specific instruction. The instruction is attached to the memoryset (not the finetune call) so that it is applied consistently every time text is embedded – see Using Instructions in the embeddings guide for the full walkthrough.
When you fine-tune from a memoryset and you do not pass an instruction in hyperparams, OrcaCloud copies the memoryset’s instruction into the training config if the memoryset has one and the base model supports instructions. That keeps training aligned with how memories in that memoryset were embedded. If you omit instruction on both the memoryset and hyperparams, training uses the base model’s default prompt.
When you fine-tune from a datasource, there is no memoryset instruction to copy; set hyperparams={"instruction": "..."} when you want a custom training instruction on an instruction-capable base model.
Passing instruction explicitly in hyperparams always wins over the memoryset default for that run.
Hyperparameter Sweeps#
When you are not sure which hyperparameter values work best, you can run an automated sweep. OrcaCloud uses Optuna to search the hyperparameter space efficiently.
Sweeps are driven by trial_count. Any value greater than 1 activates a sweep, and you have two ways to control what is searched:
- Auto sweep – only set `trial_count`. OrcaCloud automatically picks which parameters to sweep and injects sensible, loss-specific search ranges for them. This is the easiest way to start.
- Custom sweep – set `trial_count` and pass explicit ranges in `hyperparams`. Any parameter whose value is a tuple or list is swept; any parameter set to a scalar is held fixed for every trial. Parameters you don't mention keep their defaults and are excluded from the search.
In other words: the system always auto-selects which parameters to sweep – passing tuples/lists in hyperparams simply overrides the default selection for those specific parameters.
Auto Sweep#
The easiest way to sweep is to just set trial_count. OrcaCloud picks the highest-impact parameters for your loss function (typically learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) and injects loss-specific ranges:
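An auto sweep is a one-line change to the sketch used earlier – only `trial_count` is added (call shape assumed):

```python
# Auto sweep sketch – only trial_count is set; parameters and ranges are chosen by the service
model = base_model.finetune("auto-swept-model", memoryset, trial_count=9)
```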
A good rule of thumb: budget roughly 2n + 1 trials for n search parameters. With 9 trials, the system can effectively explore up to 4 parameters.
Custom Sweep#
For finer control, add explicit search ranges to hyperparams. The syntax per-parameter determines what happens:
- Tuples `(min, max)` define a continuous range (log-uniform for `learning_rate`, uniform otherwise).
- Lists `[a, b, c]` define categorical choices – Optuna picks one per trial.
- Scalars fix a parameter to a single value (excluded from the search, even if it would normally be auto-swept).
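A custom sweep combining all three forms might look like this sketch (call shape assumed; the tuple/list/scalar semantics follow the rules above):

```python
# Custom sweep sketch – tuples are ranges, lists are choices, scalars are fixed
model = base_model.finetune(
    "custom-swept-model",
    memoryset,
    trial_count=9,
    hyperparams={
        "learning_rate": (1e-5, 1e-4),  # continuous range, log-uniform
        "batch_size": [64, 128, 256],   # categorical choices
        "epochs": 3,                    # held fixed for every trial
    },
)
```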
Start with auto sweep
If you are new to sweeping, start by only setting trial_count and letting the system choose what to search. You can always narrow down the search space later once you see which parameters matter most.
Inspecting Trials#
After a sweep completes, inspect the individual trials to understand which configurations performed best:
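A sketch of how this might look – the `trials` attribute name is an assumption based on the output shown below, so check the SDK reference:

```python
# Sketch – the attribute name `trials` is assumed, not verified
for trial in model.trials:
    print(trial)
```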
```
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 5.2e-05, 'batch_size': 64, 'epochs': 3, 'warmup': 142},
    metrics: {'f1_score': 0.91, 'accuracy': 0.91}
})
FinetunedEmbeddingModelTrial({
    hyperparameters: {'learning_rate': 1.8e-05, 'batch_size': 128, 'epochs': 5, 'warmup': 87},
    metrics: {'f1_score': 0.89, 'accuracy': 0.89}
})
```
The best trial is automatically selected as the final model.
Evaluation During Training#
OrcaCloud evaluates your model during training to track progress and enable early stopping. The eval_method parameter controls how evaluation is performed:
| Method | Description | Default for |
|---|---|---|
| `"head"` | Uses the trained prediction head for fast evaluation. Auto-enables early stopping. | Single-run prediction loss |
| `"neighbor"` | Nearest-neighbor evaluation (FAISS KNN). More representative of retrieval performance but slower. | Sweeps and metric losses |
| `"loss"` | Monitors the training loss on the eval set. Lightweight but less informative. | – |
You rarely need to set eval_method explicitly – the defaults are chosen based on your loss function and whether you are running a sweep.
Early stopping
For single runs with prediction loss, early stopping is auto-enabled with a patience of 2 and defaults to 3 epochs. You can configure it explicitly with "early_stopping": True (patience 2) or "early_stopping": 5 (patience 5) in hyperparams.
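The patience mechanic can be sketched in a few lines. This is illustrative only – the service tracks its own eval metric and schedule:

```python
def should_stop(eval_scores, patience=2, threshold=0.0):
    """Return True once the eval metric fails to improve (by more than
    `threshold`) for `patience` consecutive evaluations."""
    best = float("-inf")
    evals_since_improvement = 0
    for score in eval_scores:
        if score > best + threshold:
            best = score
            evals_since_improvement = 0
        else:
            evals_since_improvement += 1
            if evals_since_improvement >= patience:
                return True
    return False

print(should_stop([0.60, 0.70, 0.69, 0.68]))  # True  (two evals without improvement)
print(should_stop([0.60, 0.70, 0.69, 0.71]))  # False (recovered before patience ran out)
```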
Background Jobs#
Fine-tuning can take a while depending on your dataset size and configuration. Use background=True to get a Job handle immediately and continue working:
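A sketch of the background pattern – the `Job` method for retrieving the result is an assumption, so check the SDK reference:

```python
# Sketch – Job handle usage; the result-retrieval method name is assumed
job = base_model.finetune("bg-model", memoryset, background=True)
# ...do other work while training runs...
model = job.result()  # assumed to block until the job finishes
```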
Managing Finetuned Models#
Open an Existing Model#
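A sketch, assuming a `FinetunedEmbeddingModel` class named consistently with `FinetunedEmbeddingModelTrial` and `EmbeddingFinetuneHyperparams` elsewhere in these docs:

```python
# Sketch – class name assumed from names used elsewhere in these docs
model = FinetunedEmbeddingModel.open("my-finetuned-model")
```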
List All Models#
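A sketch – the listing method name is an assumption:

```python
# Sketch – method name assumed, not verified
for model in FinetunedEmbeddingModel.all():
    print(model.name)
```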
Delete a Model#
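A sketch – the delete method name is an assumption, while `if_not_exists="ignore"` comes from the notes below:

```python
# Sketch – method name assumed; if_not_exists matches the note below
FinetunedEmbeddingModel.drop("my-finetuned-model", if_not_exists="ignore")
```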
- You cannot delete a model that is currently in use by a memoryset.
- The `if_not_exists="ignore"` option prevents an error if the model has already been deleted.
Tips and Best Practices#
Start simple
Run fine-tuning with all defaults first to establish a baseline. Then iterate on loss function, hyperparameters, or sweep configuration based on results.
Use `if_exists="open"` in notebooks
This prevents errors when re-running cells and avoids accidentally re-training a model.
Contrastive loss for retrieval
If your primary goal is to improve nearest-neighbor retrieval quality (e.g. for a classification model backed by a memoryset), try loss="contrastive". It directly optimizes embedding similarity rather than training a prediction head.
Provide separate eval data for small datasets
When your training set is small, the automatic eval split may not be representative. Pass an explicit eval_datasource for more reliable evaluation.
Early stopping prevents overfitting
For longer training runs, set "early_stopping": True in hyperparams. This automatically stops training when the eval metric plateaus, saving time and avoiding overfitting.