orca_sdk.embedding_model#
LearningRateScheduler
module-attribute
#
Learning rate scheduler for embedding model finetuning.
"linear": Linearly decays to zero after warmup (default)."cosine": Cosine annealing to zero."constant": Fixed learning rate (warmup is applied when configured).
FinetuningLoss
module-attribute
#
Loss function for embedding model finetuning.
"prediction": Linear prediction head. Works for both categorical labels and continuous scores."contrastive": In-batch contrastive loss. Often produces better embeddings than prediction; trains embeddings directly for similarity and scales well to large batches."triplet": Batch-hard triplet loss — pulls same-class embeddings together. Simpler than contrastive but requires all samples to fit on one GPU."proxy": Proxy-anchor loss — learns class proxies in embedding space. Particularly useful for class-imbalanced datasets.
EmbeddingFinetuneHyperparams
#
Bases: TypedDict
Training hyperparameters for embedding model finetuning.
All fields are optional — sensible loss-specific defaults are applied for anything you don’t set. You only need to override what you care about.
Sweep mode (trial_count > 1) runs an Optuna hyperparameter search. The easiest way to sweep is to just set trial_count — the system auto-injects loss-specific search ranges for the most impactful parameters (learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) up to what the trial budget can support.
For finer control, pass explicit ranges in this dict:
- Tuples (min, max) search a continuous range (log-uniform for learning_rate, uniform otherwise).
- Lists [a, b, c] search discrete categorical choices.
- Scalars fix a parameter to a single value (excluded from the search).
The sweepable parameters are: learning_rate, epochs, batch_size, warmup, weight_decay, loss_scale, normalize_embeddings, and learning_rate_scheduler. As a rule of thumb, Optuna needs roughly 2n + 1 trials for n search parameters.
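The tuple/list/scalar conventions can be illustrated with a small helper. This is a hypothetical sketch for illustration only; the SDK interprets these specs itself.

```python
def classify_sweep_spec(hyperparams):
    """Map each entry of an EmbeddingFinetuneHyperparams-style dict to how a
    sweep treats it: tuple -> continuous range, list -> choices, scalar -> fixed."""
    kinds = {}
    for key, value in hyperparams.items():
        if isinstance(value, tuple) and len(value) == 2:
            kinds[key] = "range"    # searched continuously (log-uniform for learning_rate)
        elif isinstance(value, list):
            kinds[key] = "choices"  # discrete categorical search
        else:
            kinds[key] = "fixed"    # pinned value, excluded from the search
    return kinds
```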
epochs
instance-attribute
#
Number of full passes over the training data. Defaults to 1 for single runs, 2 for sweeps, or 3 when early stopping is enabled.
warmup
instance-attribute
#
Learning rate warmup. int = steps, float = fraction of total steps (0-1).
weight_decay
instance-attribute
#
L2 regularization strength (typical range: 0.0 to 0.1).
learning_rate_scheduler
instance-attribute
#
How the learning rate changes after warmup.
normalize_embeddings
instance-attribute
#
L2-normalize embeddings before the classification/regression head.
max_seq_length
instance-attribute
#
Maximum token length for input text, or a percentile string.
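A percentile spec resolves against the token-length distribution of the training data, roughly like the sketch below. Note: the "p95"-style string format here is an assumption for illustration; check the SDK for the exact syntax it accepts.

```python
def resolve_max_seq_length(spec, token_lengths):
    """Resolve a max_seq_length spec to a token count.
    Hypothetical helper; the percentile string format is ASSUMED."""
    if isinstance(spec, int):
        return spec  # explicit token limit
    pct = float(spec.lstrip("p")) / 100.0  # e.g. "p95" -> 0.95 (assumed format)
    ordered = sorted(token_lengths)
    idx = min(len(ordered) - 1, int(pct * len(ordered)))
    return ordered[idx]
```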
truncation_side
instance-attribute
#
Which end to cut when text exceeds max_seq_length.
early_stopping
instance-attribute
#
Stop when eval metric plateaus. True = patience 2, int = custom patience. Auto-enabled for “head” and “loss” eval methods; must be set explicitly for “neighbor” eval.
early_stopping_threshold
instance-attribute
#
Minimum improvement to count as progress for early stopping.
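Together, patience and threshold behave roughly like this sketch: training stops once the eval metric fails to improve by at least the threshold for `patience` consecutive evaluations. This is a hypothetical re-implementation; the SDK's exact rule may differ.

```python
def should_stop(metric_history, patience=2, threshold=0.0, higher_is_better=True):
    """Illustrative early-stopping rule over a sequence of eval metrics."""
    best = None
    stale = 0
    for value in metric_history:
        improved = best is None or (
            value - best > threshold if higher_is_better else best - value > threshold
        )
        if improved:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:  # plateaued for `patience` evals in a row
                return True
    return False
```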
eval_method
instance-attribute
#
How to measure model quality during training. Defaults to “head” for single-run classification/regression (fast, auto-enables early stopping), “neighbor” for sweeps and metric losses (runs once at trial end by default).
eval_steps
instance-attribute
#
How often to evaluate (int = every N steps). Defaults to every 50 steps for “head”/“loss” eval, end-of-training for “neighbor” eval.
neighbor_eval_count
instance-attribute
#
Number of nearest neighbors for neighbor evaluation.
neighbor_eval_pool_subsample
instance-attribute
#
Reduce the neighbor search pool. int = sample count, float = fraction.
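The int/float convention for subsampling resolves roughly as follows (illustrative helper, not SDK code):

```python
def resolve_subsample(spec, pool_size):
    """int -> absolute sample count, float -> fraction of the pool,
    None -> use the full pool."""
    if spec is None:
        return pool_size
    if isinstance(spec, float):
        return max(1, int(spec * pool_size))  # fraction of the pool
    return min(spec, pool_size)               # absolute count, capped at pool size
```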
EmbeddingModelBase
#
Bases: ABC
embed
#
Generate embeddings for a value or list of values
Parameters:
-
value(str | list[str]) –The value or list of values to embed
-
max_seq_length(int | None, default:None) –The maximum sequence length to truncate the input to
-
instruction(str | None, default:None) –Optional instruction for instruction-tuned embedding models.
Returns:
evaluate
#
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> RegressionMetrics
evaluate(
datasource,
*,
value_column="value",
label_column=None,
score_column=None,
eval_datasource=None,
subsample=None,
neighbor_count=5,
batch_size=32,
weigh_memories=True,
background=False
)
Evaluate the embedding model
PretrainedEmbeddingModel
#
Bases: EmbeddingModelBase
A pretrained embedding model
Models:
OrcaCloud supports a curated set of small-to-medium-sized embedding models that perform well on the Hugging Face MTEB Leaderboard. These can be accessed as class attributes. We currently support:
CLIP_BASE: Multi-modal CLIP model from Hugging Face (sentence-transformers/clip-ViT-L-14)
DISTILBERT: DistilBERT embedding model from Hugging Face (distilbert-base-uncased)
E5_SMALL: Intfloat’s multilingual E5 Small model, a compact multilingual embedding model.
E5_BASE: Intfloat’s multilingual E5 Base model, a general-purpose multilingual embedding model.
E5_LARGE: E5-Large instruction-tuned embedding model from Hugging Face (intfloat/multilingual-e5-large-instruct)
F2LLM_80M: CodeFuse’s F2LLM-v2 80M model, an ultra-compact multilingual instruction-following embedding model (40k tokens).
F2LLM_160M: CodeFuse’s F2LLM-v2 160M model, a compact multilingual instruction-following embedding model (40k tokens).
F2LLM_330M: CodeFuse’s F2LLM-v2 330M model, a multilingual instruction-following embedding model (40k tokens).
F2LLM_600M: CodeFuse’s F2LLM-v2 0.6B model, a multilingual instruction-following embedding model (40k tokens).
GTE_SMALL: GTE-Small embedding model from Hugging Face (Supabase/gte-small)
GTE_BASE: Alibaba’s GTE model from Hugging Face (Alibaba-NLP/gte-base-en-v1.5)
GTE_BASE_MULTILINGUAL: Alibaba’s GTE Multilingual Base model, a general-purpose multilingual embedding model.
GTE_LARGE: Alibaba’s GTE-Large-EN-v1.5 model, a high-performance English embedding model.
QWEN_600M: Alibaba’s Qwen3-Embedding 0.6B model, a multilingual instruction-following embedding model (32k tokens).
HARRIER_270M: Microsoft’s Harrier 270M model, a compact long-context multilingual model (32k tokens) with instruction support.
HARRIER_600M: Microsoft’s Harrier 0.6B model, a long-context multilingual embedding model (32k tokens) with instruction support.
Instruction Support:
Some models support instruction-following for better task-specific embeddings. You can check if a model supports instructions
using the supports_instructions attribute.
Examples:
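A sketch of checking instruction support before embedding. The `instruction_kwargs` helper is hypothetical, and the SDK calls are shown commented because they require an OrcaCloud connection:

```python
def instruction_kwargs(supports_instructions, instruction):
    """Build embed() keyword arguments, passing an instruction only when the
    model supports it. (Hypothetical helper, for illustration.)"""
    return {"instruction": instruction} if supports_instructions else {}

# Usage against OrcaCloud (requires credentials, so it is shown commented):
# from orca_sdk import PretrainedEmbeddingModel
# model = PretrainedEmbeddingModel.E5_LARGE
# vec = model.embed(
#     "best hiking trails near Seattle",
#     **instruction_kwargs(model.supports_instructions,
#                          "Represent the query for retrieval"),
# )
```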
Attributes:
-
name(PretrainedEmbeddingModelName) –Name of the pretrained embedding model
-
embedding_dim(int) –Dimension of the embeddings that are generated by the model
-
max_seq_length(int) –Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during the embedding process
-
num_params(int) –Number of parameters in the model
-
supports_instructions(bool) –Whether this model supports instruction-following
embed
#
Generate embeddings for a value or list of values
Parameters:
-
value(str | list[str]) –The value or list of values to embed
-
max_seq_length(int | None, default:None) –The maximum sequence length to truncate the input to
-
instruction(str | None, default:None) –Optional instruction for instruction-tuned embedding models.
Returns:
evaluate
#
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> RegressionMetrics
evaluate(
datasource,
*,
value_column="value",
label_column=None,
score_column=None,
eval_datasource=None,
subsample=None,
neighbor_count=5,
batch_size=32,
weigh_memories=True,
background=False
)
Evaluate the pretrained embedding model
all
classmethod
#
List all pretrained embedding models in the OrcaCloud
Returns:
-
list[PretrainedEmbeddingModel]–A list of all pretrained embedding models available in the OrcaCloud
open
classmethod
#
Open an embedding model by name.
This is an alternative method to access models for environments where IDE autocomplete for model names is not available.
Parameters:
-
name(PretrainedEmbeddingModelName) –Name of the model to open (e.g., “GTE_BASE”, “CLIP_BASE”)
Returns:
-
PretrainedEmbeddingModel–The embedding model instance
Examples:
exists
classmethod
#
finetune
#
finetune(
name: str,
train_datasource: (
Datasource | LabeledMemoryset | ScoredMemoryset
),
*,
eval_datasource: Datasource | None = None,
label_column: str | None = None,
score_column: str | None = None,
value_column: str = "value",
loss: FinetuningLoss = "prediction",
trial_count: int | None = None,
hyperparams: EmbeddingFinetuneHyperparams | None = None,
if_exists: CreateMode = "error",
background: Literal[True],
seed: int | None = None
) -> Job[FinetunedEmbeddingModel]
finetune(
name: str,
train_datasource: (
Datasource | LabeledMemoryset | ScoredMemoryset
),
*,
eval_datasource: Datasource | None = None,
label_column: str | None = None,
score_column: str | None = None,
value_column: str = "value",
loss: FinetuningLoss = "prediction",
trial_count: int | None = None,
hyperparams: EmbeddingFinetuneHyperparams | None = None,
if_exists: CreateMode = "error",
background: Literal[False] = False,
seed: int | None = None
) -> FinetunedEmbeddingModel
finetune(
name,
train_datasource,
*,
eval_datasource=None,
label_column=None,
score_column=None,
value_column="value",
loss="prediction",
trial_count=None,
hyperparams=None,
if_exists="error",
background=False,
seed=None
)
Finetune an embedding model
Trains a new embedding model starting from this pretrained base. All hyperparameters have sensible loss-specific defaults — in the simplest case you only need a name and training data.
Parameters:
-
name(str) –Name of the finetuned embedding model
-
train_datasource(Datasource | LabeledMemoryset | ScoredMemoryset) –Data to train on
-
eval_datasource(Datasource | None, default:None) –Data to evaluate on. When omitted, a split is held out from the training data automatically.
-
label_column(str | None, default:None) –Column name for categorical labels in the datasource
-
score_column(str | None, default:None) –Column name for continuous scores in the datasource
-
value_column(str, default:'value') –Column name for the text to embed
-
loss(FinetuningLoss, default:'prediction') –Loss function for training
-
trial_count(int | None, default:None) –Number of hyperparameter configurations to try. When unset (or set to 1), a single training job runs. Values > 1 activate a hyperparameter sweep. The easiest way to sweep is to just set trial_count and let the system auto-select which parameters to search, with sensible loss-specific ranges. For finer control, pass explicit ranges via hyperparams.
-
hyperparams(EmbeddingFinetuneHyperparams | None, default:None) –Training hyperparameters to override. All parameters have loss-specific defaults so you only need to specify what you want to change. In sweep mode, use tuples for continuous ranges and lists for categorical choices.
-
if_exists(CreateMode, default:'error') –What to do if a finetuned embedding model with the same name already exists, defaults to “error”. Other option is “open” to open the existing finetuned embedding model.
-
background(bool, default:False) –Whether to run the operation in the background and return a job handle
-
seed(int | None, default:None) –Random seed for reproducibility
Returns:
-
FinetunedEmbeddingModel | Job[FinetunedEmbeddingModel]–The finetuned embedding model
Raises:
-
ValueError–If a finetuned embedding model with this name already exists and if_exists is “error”, or if if_exists is “open” but the existing model was trained from a different base model
-
ValueError–If train_datasource is a plain Datasource and neither label_column nor score_column is provided
Examples:
Minimal single run with default hyperparameters:
Single run with custom hyperparameters:
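A single run overriding a few hyperparameters might look like the sketch below. The hyperparams dict is real; the finetune call itself needs an OrcaCloud connection, so it is shown commented, and the datasource and model names are placeholders.

```python
# Scalar values pin each hyperparameter for a single training run.
hyperparams = {
    "learning_rate": 5e-5,
    "epochs": 3,
    "batch_size": 32,
    "warmup": 0.1,  # float -> fraction of total steps
}
# from orca_sdk import PretrainedEmbeddingModel
# model = PretrainedEmbeddingModel.GTE_BASE.finetune(
#     "my-finetuned-model", train_datasource,
#     label_column="label", hyperparams=hyperparams,
# )
```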
Default sweep, just set trial_count and the system picks what to search:
Custom sweep with explicit ranges and choices:
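A custom sweep spec mixes the three value kinds described above. The hyperparams dict is real; the finetune call requires an OrcaCloud connection, so it is shown commented, and the names and trial budget are placeholders.

```python
# Tuples search a continuous range, lists search discrete choices,
# scalars stay fixed and are excluded from the search.
hyperparams = {
    "learning_rate": (1e-5, 1e-3),  # log-uniform continuous range
    "batch_size": [16, 32, 64],     # categorical choices
    "epochs": 2,                    # fixed
}
# from orca_sdk import PretrainedEmbeddingModel
# model = PretrainedEmbeddingModel.GTE_BASE.finetune(
#     "my-finetuned-model", train_datasource,
#     label_column="label", loss="contrastive",
#     trial_count=8, hyperparams=hyperparams,
# )
```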
FinetunedEmbeddingModelTrial
#
A trial for a finetuned embedding model
Attributes:
-
status(TrialStatus) –The status of the trial
-
hyperparameters(dict[str, Any]) –The hyperparameters used for the trial
-
metrics(dict[str, float]) –The metrics for the trial
-
started_at(datetime) –The start time of the trial
-
completed_at(datetime | None) –When the trial finished, if known
FinetunedEmbeddingModel
#
Bases: EmbeddingModelBase
A finetuned embedding model in the OrcaCloud
Attributes:
-
name(str) –Name of the finetuned embedding model
-
embedding_dim(int) –Dimension of the embeddings that are generated by the model
-
max_seq_length(int) –Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during the embedding process
-
id(str) –Unique identifier of the finetuned embedding model
-
base_model(PretrainedEmbeddingModel | None) –Base model the finetuned embedding model was trained on (None for uploaded models)
-
created_at(datetime) –When the model was finetuned
-
description(str | None) –Optional description of the embedding model
Note
For uploaded models (created via _upload), base_model is None,
num_params may be extracted from the model if possible (otherwise None/0),
and supports_instructions is False since this property cannot be determined
from the model config alone.
trials
property
#
List the trials for the finetuned embedding model
Returns:
-
list[FinetunedEmbeddingModelTrial]–A list of finetuned embedding model trials
embed
#
Generate embeddings for a value or list of values
Parameters:
-
value(str | list[str]) –The value or list of values to embed
-
max_seq_length(int | None, default:None) –The maximum sequence length to truncate the input to
-
instruction(str | None, default:None) –Optional instruction for instruction-tuned embedding models.
Returns:
evaluate
#
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: str,
score_column: None = None,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
datasource: Datasource,
*,
value_column: str = "value",
label_column: None = None,
score_column: str,
eval_datasource: Datasource | None = None,
subsample: int | float | None = None,
neighbor_count: int = 5,
batch_size: int = 32,
weigh_memories: bool = True,
background: Literal[False] = False
) -> RegressionMetrics
evaluate(
datasource,
*,
value_column="value",
label_column=None,
score_column=None,
eval_datasource=None,
subsample=None,
neighbor_count=5,
batch_size=32,
weigh_memories=True,
background=False
)
Evaluate the finetuned embedding model
all
classmethod
#
List all finetuned embedding model handles in the OrcaCloud
Returns:
-
list[FinetunedEmbeddingModel]–A list of all finetuned embedding model handles in the OrcaCloud
open
classmethod
#
Get a handle to a finetuned embedding model in the OrcaCloud
Parameters:
-
name(str) –The name or unique identifier of a finetuned embedding model
Returns:
-
FinetunedEmbeddingModel–A handle to the finetuned embedding model in the OrcaCloud
Raises:
-
LookupError–If the finetuned embedding model does not exist
exists
classmethod
#
drop
classmethod
#
Delete the finetuned embedding model from the OrcaCloud
Parameters:
-
name_or_id(str) –The name or id of the finetuned embedding model
-
if_not_exists(DropMode, default:'error') –What to do if the finetuned embedding model does not exist, defaults to "error". Other option is "ignore" to do nothing if the model does not exist.
-
cascade(bool, default:False) –If True, also delete all associated memorysets and their predictive models. Defaults to False.
Raises:
-
LookupError–If the finetuned embedding model does not exist and if_not_exists is "error"
-
RuntimeError–If the model has associated memorysets and cascade is False