
orca_sdk.embedding_model#

LearningRateScheduler module-attribute #

LearningRateScheduler = Literal[
    "linear", "cosine", "constant"
]

Learning rate scheduler for embedding model finetuning.

  • "linear": Linearly decays to zero after warmup (default).
  • "cosine": Cosine annealing to zero.
  • "constant": Fixed learning rate (warmup is applied when configured).

FinetuningLoss module-attribute #

FinetuningLoss = Literal[
    "prediction", "triplet", "contrastive", "proxy"
]

Loss function for embedding model finetuning.

  • "prediction": Linear prediction head. Works for both categorical labels and continuous scores.
  • "contrastive": In-batch contrastive loss. Often produces better embeddings than prediction; trains embeddings directly for similarity and scales well to large batches.
  • "triplet": Batch-hard triplet loss — pulls same-class embeddings together. Simpler than contrastive but requires all samples to fit on one GPU.
  • "proxy": Proxy-anchor loss — learns class proxies in embedding space. Particularly useful for class-imbalanced datasets.
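As a rough illustration of these rules of thumb, a hypothetical helper (not part of orca_sdk) might choose a loss like this:

```python
# Hypothetical helper sketching the loss guidance above; not part of orca_sdk.
def suggest_loss(has_categorical_labels: bool, class_imbalanced: bool = False) -> str:
    """Return a FinetuningLoss value following the rules of thumb above."""
    if not has_categorical_labels:
        # Continuous scores: the linear prediction head handles regression targets.
        return "prediction"
    if class_imbalanced:
        # Proxy-anchor loss is particularly useful for class-imbalanced datasets.
        return "proxy"
    # For balanced categorical labels, contrastive often produces better embeddings.
    return "contrastive"
```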

EmbeddingFinetuneHyperparams #

Bases: TypedDict

Training hyperparameters for embedding model finetuning.

All fields are optional — sensible loss-specific defaults are applied for anything you don’t set. You only need to override what you care about.

Sweep mode (trial_count > 1) runs an Optuna hyperparameter search. The easiest way to sweep is to just set trial_count — the system auto-injects loss-specific search ranges for the most impactful parameters (learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) up to what the trial budget can support.

For finer control, pass explicit ranges in this dict:

  • Tuples (min, max) search a continuous range (log-uniform for learning_rate, uniform otherwise).
  • Lists [a, b, c] search discrete categorical choices.
  • Scalars fix a parameter to a single value (excluded from the search).

The sweepable parameters are: learning_rate, epochs, batch_size, warmup, weight_decay, loss_scale, normalize_embeddings, and learning_rate_scheduler. As a rule of thumb, Optuna needs roughly 2n + 1 trials for n search parameters.
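A sketch of a custom sweep configuration mixing all three forms (the specific values are illustrative, not recommendations):

```python
# Illustrative sweep configuration for EmbeddingFinetuneHyperparams (a TypedDict,
# so a plain dict literal works).
trial_count = 9  # > 1 activates an Optuna sweep

hyperparams = {
    "learning_rate": (1e-5, 1e-3),  # tuple: continuous range (log-uniform for learning_rate)
    "batch_size": [32, 64, 128],    # list: discrete categorical choices
    "epochs": 4,                    # scalar: fixed value, excluded from the search
}

# Rule of thumb from above: Optuna needs roughly 2n + 1 trials for n searched parameters.
searched = sum(1 for v in hyperparams.values() if isinstance(v, (tuple, list)))
assert trial_count >= 2 * searched + 1  # here: 9 >= 5
```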

epochs instance-attribute #

epochs

Number of full passes over the training data. Defaults to 1 for single runs, 2 for sweeps, or 3 when early stopping is enabled.

max_steps instance-attribute #

max_steps

Maximum training steps. Overrides epochs when set.

learning_rate instance-attribute #

learning_rate

Peak learning rate after warmup.

batch_size instance-attribute #

batch_size

Total samples per training step.

warmup instance-attribute #

warmup

Learning rate warmup. int = steps, float = fraction of total steps (0-1).

weight_decay instance-attribute #

weight_decay

L2 regularization strength (typical range: 0.0 to 0.1).

learning_rate_scheduler instance-attribute #

learning_rate_scheduler

How the learning rate changes after warmup.

loss_scale instance-attribute #

loss_scale

Inverse temperature for contrastive and proxy losses.

normalize_embeddings instance-attribute #

normalize_embeddings

L2-normalize embeddings before the classification/regression head.

max_seq_length instance-attribute #

max_seq_length

Maximum token length for input text, or a percentile string.

truncation_side instance-attribute #

truncation_side

Which end to cut when text exceeds max_seq_length.

early_stopping instance-attribute #

early_stopping

Stop when eval metric plateaus. True = patience 2, int = custom patience. Auto-enabled for “head” and “loss” eval methods; must be set explicitly for “neighbor” eval.

early_stopping_threshold instance-attribute #

early_stopping_threshold

Minimum improvement to count as progress for early stopping.

eval_method instance-attribute #

eval_method

How to measure model quality during training. Defaults to “head” for single-run classification/regression (fast, auto-enables early stopping), “neighbor” for sweeps and metric losses (runs once at trial end by default).

eval_steps instance-attribute #

eval_steps

How often to evaluate (int = every N steps). Defaults to every 50 steps for “head”/”loss” eval, end-of-training for “neighbor” eval.

neighbor_eval_count instance-attribute #

neighbor_eval_count

Number of nearest neighbors for neighbor evaluation.

neighbor_eval_pool_subsample instance-attribute #

neighbor_eval_pool_subsample

Reduce the neighbor search pool. int = sample count, float = fraction.

EmbeddingModelBase #

Bases: ABC

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the embedding model on a labeled or scored datasource

PretrainedEmbeddingModel #

Bases: EmbeddingModelBase

A pretrained embedding model

Models:

OrcaCloud supports a select number of small to medium-sized embedding models that perform well on the Hugging Face MTEB Leaderboard. These can be accessed as class attributes (for example, PretrainedEmbeddingModel.GTE_BASE).

Instruction Support:

Some models support instruction-following for better task-specific embeddings. You can check if a model supports instructions using the supports_instructions attribute.

Examples:

>>> PretrainedEmbeddingModel.GTE_BASE
PretrainedEmbeddingModel({name: GTE_BASE, embedding_dim: 768, max_seq_length: 8192})
>>> # Using instruction with an instruction-supporting model
>>> model = PretrainedEmbeddingModel.E5_LARGE
>>> embeddings = model.embed("Hello world", instruction="Represent this sentence for retrieval")

Attributes:

  • name (PretrainedEmbeddingModelName) –

    Name of the pretrained embedding model

  • embedding_dim (int) –

    Dimension of the embeddings that are generated by the model

  • max_seq_length (int) –

    Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding

  • num_params (int) –

    Number of parameters in the model

  • supports_instructions (bool) –

    Whether this model supports instruction-following

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the embedding model on a labeled or scored datasource

all classmethod #

all()

List all pretrained embedding models in the OrcaCloud

Returns:

open classmethod #

open(name)

Open an embedding model by name.

This is an alternative method to access models for environments where IDE autocomplete for model names is not available.

Parameters:

  • name (PretrainedEmbeddingModelName) –

    Name of the model to open (e.g., “GTE_BASE”, “CLIP_BASE”)

Returns:

Examples:

>>> model = PretrainedEmbeddingModel.open("GTE_BASE")

exists classmethod #

exists(name)

Check if a pretrained embedding model exists by name

Parameters:

  • name (str) –

    The name of the pretrained embedding model

Returns:

  • bool

    True if the pretrained embedding model exists, False otherwise

finetune #

finetune(
    name: str,
    train_datasource: (
        Datasource | LabeledMemoryset | ScoredMemoryset
    ),
    *,
    eval_datasource: Datasource | None = None,
    label_column: str | None = None,
    score_column: str | None = None,
    value_column: str = "value",
    loss: FinetuningLoss = "prediction",
    trial_count: int | None = None,
    hyperparams: EmbeddingFinetuneHyperparams | None = None,
    if_exists: CreateMode = "error",
    background: Literal[True],
    seed: int | None = None
) -> Job[FinetunedEmbeddingModel]
finetune(
    name: str,
    train_datasource: (
        Datasource | LabeledMemoryset | ScoredMemoryset
    ),
    *,
    eval_datasource: Datasource | None = None,
    label_column: str | None = None,
    score_column: str | None = None,
    value_column: str = "value",
    loss: FinetuningLoss = "prediction",
    trial_count: int | None = None,
    hyperparams: EmbeddingFinetuneHyperparams | None = None,
    if_exists: CreateMode = "error",
    background: Literal[False] = False,
    seed: int | None = None
) -> FinetunedEmbeddingModel
finetune(
    name,
    train_datasource,
    *,
    eval_datasource=None,
    label_column=None,
    score_column=None,
    value_column="value",
    loss="prediction",
    trial_count=None,
    hyperparams=None,
    if_exists="error",
    background=False,
    seed=None
)

Finetune an embedding model

Trains a new embedding model starting from this pretrained base. All hyperparameters have sensible loss-specific defaults — in the simplest case you only need a name and training data.

Parameters:

  • name (str) –

    Name of the finetuned embedding model

  • train_datasource (Datasource | LabeledMemoryset | ScoredMemoryset) –

    Data to train on

  • eval_datasource (Datasource | None, default: None ) –

    Data to evaluate on. When omitted, a split is held out from the training data automatically.

  • label_column (str | None, default: None ) –

    Column name for categorical labels in the datasource

  • score_column (str | None, default: None ) –

    Column name for continuous scores in the datasource

  • value_column (str, default: 'value' ) –

    Column name for the text to embed

  • loss (FinetuningLoss, default: 'prediction' ) –

    Loss function for training

  • trial_count (int | None, default: None ) –

    Number of hyperparameter configurations to try. 1 (default) runs a single training job. Values > 1 activate a hyperparameter sweep. The easiest way to sweep is to just set trial_count and let the system auto-select which parameters to search with sensible loss-specific ranges. For finer control, pass explicit ranges via hyperparams.

  • hyperparams (EmbeddingFinetuneHyperparams | None, default: None ) –

    Training hyperparameters to override. All parameters have loss-specific defaults so you only need to specify what you want to change. In sweep mode, use tuples for continuous ranges and lists for categorical choices.

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a finetuned embedding model with the same name already exists, defaults to “error”. Other option is “open” to open the existing finetuned embedding model.

  • background (bool, default: False ) –

    Whether to run the operation in the background and return a job handle

  • seed (int | None, default: None ) –

    Random seed for reproducibility

Returns:

Raises:

  • ValueError

    If the finetuned embedding model already exists and if_exists is "error", or if if_exists is "open" but the existing model was finetuned from a different base model

  • ValueError

    If train_datasource is a plain Datasource and neither label_column nor score_column is provided

Examples:

Minimal single run with default hyperparameters:

>>> model = PretrainedEmbeddingModel.GTE_BASE
>>> memoryset = LabeledMemoryset.open("my_memoryset")
>>> model.finetune("my_model", memoryset)

Single run with custom hyperparameters:

>>> datasource = Datasource.open("my_datasource")
>>> model.finetune("my_model", datasource, label_column="label", loss="contrastive", hyperparams={
...     "epochs": 5, "learning_rate": 1e-4, "batch_size": 64,
... })

Default sweep, just set trial_count and the system picks what to search:

>>> model.finetune("my_model", memoryset, trial_count=9)

Custom sweep with explicit ranges and choices:

>>> model.finetune("my_model", memoryset, trial_count=15, hyperparams={
...     "learning_rate": (1e-5, 1e-3),
...     "batch_size": [32, 64, 128],
...     "epochs": 4,
... })

FinetunedEmbeddingModelTrial #

A trial for a finetuned embedding model

Attributes:

  • status (TrialStatus) –

    The status of the trial

  • hyperparameters (dict[str, Any]) –

    The hyperparameters used for the trial

  • metrics (dict[str, float]) –

    The metrics for the trial

  • started_at (datetime) –

    The start time of the trial

  • completed_at (datetime | None) –

    When the trial finished, if known
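A sketch of selecting the best trial after a sweep, with trials represented as plain dicts and `f1_score` assumed as the metric name for illustration:

```python
# Hypothetical sketch: trial objects are stand-in dicts; the metric key
# "f1_score" is an assumption for illustration only.
trials = [
    {"status": "completed", "metrics": {"f1_score": 0.82}, "hyperparameters": {"learning_rate": 1e-4}},
    {"status": "completed", "metrics": {"f1_score": 0.88}, "hyperparameters": {"learning_rate": 3e-4}},
    {"status": "failed", "metrics": {}, "hyperparameters": {"learning_rate": 1e-3}},
]

# Skip failed trials, then pick the highest-scoring completed one.
best = max(
    (t for t in trials if t["status"] == "completed"),
    key=lambda t: t["metrics"]["f1_score"],
)
```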

FinetunedEmbeddingModel #

Bases: EmbeddingModelBase

A finetuned embedding model in the OrcaCloud

Attributes:

  • name (str) –

    Name of the finetuned embedding model

  • embedding_dim (int) –

    Dimension of the embeddings that are generated by the model

  • max_seq_length (int) –

    Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding

  • id (str) –

    Unique identifier of the finetuned embedding model

  • base_model (PretrainedEmbeddingModel | None) –

    Base model the finetuned embedding model was trained on (None for uploaded models)

  • created_at (datetime) –

    When the model was finetuned

  • description (str | None) –

    Optional description of the embedding model

Note

For uploaded models (created via _upload), base_model is None, num_params may be extracted from the model if possible (otherwise None/0), and supports_instructions is False since this property cannot be determined from the model config alone.

trials property #

trials

List the trials for the finetuned embedding model

Returns:

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the finetuned embedding model on a labeled or scored datasource

all classmethod #

all()

List all finetuned embedding model handles in the OrcaCloud

Returns:

open classmethod #

open(name)

Get a handle to a finetuned embedding model in the OrcaCloud

Parameters:

  • name (str) –

    The name or unique identifier of a finetuned embedding model

Returns:

Raises:

  • LookupError

    If the finetuned embedding model does not exist

exists classmethod #

exists(name_or_id)

Check if a finetuned embedding model with the given name or id exists.

Parameters:

  • name_or_id (str) –

    The name or id of the finetuned embedding model

Returns:

  • bool

    True if the finetuned embedding model exists, False otherwise

drop classmethod #

drop(name_or_id, *, if_not_exists='error', cascade=False)

Delete the finetuned embedding model from the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the finetuned embedding model

  • if_not_exists (DropMode, default: 'error' ) –

    What to do if the finetuned embedding model does not exist, defaults to "error". Other option is "ignore" to do nothing if the model does not exist.

  • cascade (bool, default: False ) –

    If True, also delete all associated memorysets and their predictive models. Defaults to False.

Raises:

  • LookupError

    If the finetuned embedding model does not exist and if_not_exists is "error"

  • RuntimeError

    If the model has associated memorysets and cascade is False