
orca_sdk.embedding_model#

LearningRateScheduler module-attribute #

LearningRateScheduler = Literal[
    "linear", "cosine", "constant"
]

Learning rate scheduler for embedding model finetuning.

  • "linear": Linearly decays to zero after warmup (default).
  • "cosine": Cosine annealing to zero.
  • "constant": Fixed learning rate (warmup is applied when configured).

FinetuningLoss module-attribute #

FinetuningLoss = Literal[
    "prediction", "triplet", "contrastive", "proxy"
]

Loss function for embedding model finetuning.

  • "prediction": Linear prediction head. Works for both categorical labels and continuous scores.
  • "contrastive": In-batch contrastive loss. Often produces better embeddings than prediction; trains embeddings directly for similarity and scales well to large batches.
  • "triplet": Batch-hard triplet loss — pulls same-class embeddings together. Simpler than contrastive but requires all samples to fit on one GPU.
  • "proxy": Proxy-anchor loss — learns class proxies in embedding space. Particularly useful for class-imbalanced datasets.
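As a rough illustration of these rules of thumb, a hypothetical helper (not part of orca_sdk) might choose a loss like this:

```python
# Hypothetical helper sketching the loss guidance above; not part of orca_sdk.
def suggest_loss(has_categorical_labels: bool, class_imbalanced: bool = False) -> str:
    """Return a FinetuningLoss value following the rules of thumb above."""
    if not has_categorical_labels:
        # Continuous scores: the linear prediction head handles regression targets.
        return "prediction"
    if class_imbalanced:
        # Proxy-anchor loss is particularly useful for class-imbalanced datasets.
        return "proxy"
    # For balanced categorical labels, contrastive often produces better embeddings.
    return "contrastive"
```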

EmbeddingFinetuneHyperparams #

Bases: TypedDict

Training hyperparameters for embedding model finetuning.

All fields are optional — sensible loss-specific defaults are applied for anything you don’t set. You only need to override what you care about.

Sweep mode (trial_count > 1) runs an Optuna hyperparameter search. The easiest way to sweep is to just set trial_count — the system auto-injects loss-specific search ranges for the most impactful parameters (learning_rate, batch_size, epochs, warmup, and loss_scale where applicable) up to what the trial budget can support.

For finer control, pass explicit ranges in this dict:

  • Tuples (min, max) search a continuous range (log-uniform for learning_rate, uniform otherwise).
  • Lists [a, b, c] search discrete categorical choices.
  • Scalars fix a parameter to a single value (excluded from the search).

The sweepable parameters are: learning_rate, epochs, batch_size, warmup, weight_decay, loss_scale, normalize_embeddings, and learning_rate_scheduler. As a rule of thumb, Optuna needs roughly 2n + 1 trials for n search parameters.
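A sketch of a custom sweep configuration mixing all three forms (the specific values are illustrative, not recommendations):

```python
# Illustrative sweep configuration for EmbeddingFinetuneHyperparams (a TypedDict,
# so a plain dict literal works).
trial_count = 9  # > 1 activates an Optuna sweep

hyperparams = {
    "learning_rate": (1e-5, 1e-3),  # tuple: continuous range (log-uniform for learning_rate)
    "batch_size": [32, 64, 128],    # list: discrete categorical choices
    "epochs": 4,                    # scalar: fixed value, excluded from the search
}

# Rule of thumb from above: Optuna needs roughly 2n + 1 trials for n searched parameters.
searched = sum(1 for v in hyperparams.values() if isinstance(v, (tuple, list)))
assert trial_count >= 2 * searched + 1  # here: 9 >= 5
```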

epochs instance-attribute #

epochs

Number of full passes over the training data. Defaults to 1 for single runs, 2 for sweeps, or 3 when early stopping is enabled.

max_steps instance-attribute #

max_steps

Maximum training steps. Overrides epochs when set.

learning_rate instance-attribute #

learning_rate

Peak learning rate after warmup.

batch_size instance-attribute #

batch_size

Total samples per training step.

warmup instance-attribute #

warmup

Learning rate warmup. int = steps, float = fraction of total steps (0-1).

weight_decay instance-attribute #

weight_decay

L2 regularization strength (typical range: 0.0 to 0.1).

learning_rate_scheduler instance-attribute #

learning_rate_scheduler

How the learning rate changes after warmup.

loss_scale instance-attribute #

loss_scale

Inverse temperature for contrastive and proxy losses.

normalize_embeddings instance-attribute #

normalize_embeddings

L2-normalize embeddings before the classification/regression head.

max_seq_length instance-attribute #

max_seq_length

Maximum token length for input text, or a percentile string.

truncation_side instance-attribute #

truncation_side

Which end to cut when text exceeds max_seq_length.

early_stopping instance-attribute #

early_stopping

Stop when eval metric plateaus. True = patience 2, int = custom patience. Auto-enabled for “head” and “loss” eval methods; must be set explicitly for “neighbor” eval.

early_stopping_threshold instance-attribute #

early_stopping_threshold

Minimum improvement to count as progress for early stopping.

eval_method instance-attribute #

eval_method

How to measure model quality during training. Defaults to “head” for single-run classification/regression (fast, auto-enables early stopping), “neighbor” for sweeps and metric losses (runs once at trial end by default).

eval_steps instance-attribute #

eval_steps

How often to evaluate (int = every N steps). Defaults to every 50 steps for “head”/”loss” eval, end-of-training for “neighbor” eval.

neighbor_eval_count instance-attribute #

neighbor_eval_count

Number of nearest neighbors for neighbor evaluation.

neighbor_eval_pool_subsample instance-attribute #

neighbor_eval_pool_subsample

Reduce the neighbor search pool. int = sample count, float = fraction.

EmbeddingModelBase #

Bases: ABC

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the embedding model on a labeled or scored datasource

PretrainedEmbeddingModel #

Bases: EmbeddingModelBase

A pretrained embedding model

Models:

OrcaCloud supports a select number of small to medium-sized embedding models that perform well on the Hugging Face MTEB Leaderboard. These can be accessed as class attributes (for example, PretrainedEmbeddingModel.GTE_BASE).

Instruction Support:

Some models support instruction-following for better task-specific embeddings. You can check if a model supports instructions using the supports_instructions attribute.

Examples:

>>> PretrainedEmbeddingModel.GTE_BASE
PretrainedEmbeddingModel({name: GTE_BASE, embedding_dim: 768, max_seq_length: 8192})
>>> # Using instruction with an instruction-supporting model
>>> model = PretrainedEmbeddingModel.E5_LARGE
>>> embeddings = model.embed("Hello world", instruction="Represent this sentence for retrieval")

Attributes:

  • name (PretrainedEmbeddingModelName) –

    Name of the pretrained embedding model

  • embedding_dim (int) –

    Dimension of the embeddings that are generated by the model

  • max_seq_length (int) –

    Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding

  • num_params (int) –

    Number of parameters in the model

  • supports_instructions (bool) –

    Whether this model supports instruction-following

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the embedding model on a labeled or scored datasource

all classmethod #

all()

List all pretrained embedding models in the OrcaCloud

Returns:

open classmethod #

open(name)

Open an embedding model by name.

This is an alternative method to access models for environments where IDE autocomplete for model names is not available.

Parameters:

  • name (PretrainedEmbeddingModelName) –

    Name of the model to open (e.g., “GTE_BASE”, “CLIP_BASE”)

Returns:

Examples:

>>> model = PretrainedEmbeddingModel.open("GTE_BASE")

exists classmethod #

exists(name)

Check if a pretrained embedding model exists by name

Parameters:

  • name (str) –

    The name of the pretrained embedding model

Returns:

  • bool

    True if the pretrained embedding model exists, False otherwise

finetune #

finetune(
    name: str,
    train_datasource: (
        Datasource | LabeledMemoryset | ScoredMemoryset
    ),
    *,
    eval_datasource: Datasource | None = None,
    label_column: str | None = None,
    score_column: str | None = None,
    value_column: str = "value",
    loss: FinetuningLoss = "prediction",
    trial_count: int | None = None,
    hyperparams: EmbeddingFinetuneHyperparams | None = None,
    if_exists: CreateMode = "error",
    background: Literal[True],
    seed: int | None = None
) -> Job[FinetunedEmbeddingModel]
finetune(
    name: str,
    train_datasource: (
        Datasource | LabeledMemoryset | ScoredMemoryset
    ),
    *,
    eval_datasource: Datasource | None = None,
    label_column: str | None = None,
    score_column: str | None = None,
    value_column: str = "value",
    loss: FinetuningLoss = "prediction",
    trial_count: int | None = None,
    hyperparams: EmbeddingFinetuneHyperparams | None = None,
    if_exists: CreateMode = "error",
    background: Literal[False] = False,
    seed: int | None = None
) -> FinetunedEmbeddingModel
finetune(
    name,
    train_datasource,
    *,
    eval_datasource=None,
    label_column=None,
    score_column=None,
    value_column="value",
    loss="prediction",
    trial_count=None,
    hyperparams=None,
    if_exists="error",
    background=False,
    seed=None
)

Finetune an embedding model

Trains a new embedding model starting from this pretrained base. All hyperparameters have sensible loss-specific defaults — in the simplest case you only need a name and training data.

Parameters:

  • name (str) –

    Name of the finetuned embedding model

  • train_datasource (Datasource | LabeledMemoryset | ScoredMemoryset) –

    Data to train on

  • eval_datasource (Datasource | None, default: None ) –

    Data to evaluate on. When omitted, a split is held out from the training data automatically.

  • label_column (str | None, default: None ) –

    Column name for categorical labels in the datasource

  • score_column (str | None, default: None ) –

    Column name for continuous scores in the datasource

  • value_column (str, default: 'value' ) –

    Column name for the text to embed

  • loss (FinetuningLoss, default: 'prediction' ) –

    Loss function for training

  • trial_count (int | None, default: None ) –

    Number of hyperparameter configurations to try. 1 (default) runs a single training job. Values > 1 activate a hyperparameter sweep. The easiest way to sweep is to just set trial_count and let the system auto-select which parameters to search with sensible loss-specific ranges. For finer control, pass explicit ranges via hyperparams.

  • hyperparams (EmbeddingFinetuneHyperparams | None, default: None ) –

    Training hyperparameters to override. All parameters have loss-specific defaults so you only need to specify what you want to change. In sweep mode, use tuples for continuous ranges and lists for categorical choices.

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a finetuned embedding model with the same name already exists, defaults to “error”. Other option is “open” to open the existing finetuned embedding model.

  • background (bool, default: False ) –

    Whether to run the operation in the background and return a job handle

  • seed (int | None, default: None ) –

    Random seed for reproducibility

Returns:

Raises:

  • ValueError

    If the finetuned embedding model already exists and if_exists is "error", or if if_exists is "open" but the existing model was finetuned from a different base model

  • ValueError

    If train_datasource is a plain Datasource and neither label_column nor score_column is provided

Examples:

Minimal single run with default hyperparameters:

>>> model = PretrainedEmbeddingModel.GTE_BASE
>>> memoryset = LabeledMemoryset.open("my_memoryset")
>>> model.finetune("my_model", memoryset)

Single run with custom hyperparameters:

>>> datasource = Datasource.open("my_datasource")
>>> model.finetune("my_model", datasource, label_column="label", loss="contrastive", hyperparams={
...     "epochs": 5, "learning_rate": 1e-4, "batch_size": 64,
... })

Default sweep, just set trial_count and the system picks what to search:

>>> model.finetune("my_model", memoryset, trial_count=9)

Custom sweep with explicit ranges and choices:

>>> model.finetune("my_model", memoryset, trial_count=15, hyperparams={
...     "learning_rate": (1e-5, 1e-3),
...     "batch_size": [32, 64, 128],
...     "epochs": 4,
... })

FinetunedEmbeddingModelTrial #

A trial for a finetuned embedding model

Attributes:

  • status (TrialStatus) –

    The status of the trial

  • hyperparameters (dict[str, Any]) –

    The hyperparameters used for the trial

  • metrics (dict[str, float]) –

    The metrics for the trial

  • started_at (datetime) –

    The start time of the trial

  • completed_at (datetime | None) –

    When the trial finished, if known
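A sketch of selecting the best trial after a sweep, with trials represented as plain dicts and `f1_score` assumed as the metric name for illustration:

```python
# Hypothetical sketch: trial objects are stand-in dicts; the metric key
# "f1_score" is an assumption for illustration only.
trials = [
    {"status": "completed", "metrics": {"f1_score": 0.82}, "hyperparameters": {"learning_rate": 1e-4}},
    {"status": "completed", "metrics": {"f1_score": 0.88}, "hyperparameters": {"learning_rate": 3e-4}},
    {"status": "failed", "metrics": {}, "hyperparameters": {"learning_rate": 1e-3}},
]

# Skip failed trials, then pick the highest-scoring completed one.
best = max(
    (t for t in trials if t["status"] == "completed"),
    key=lambda t: t["metrics"]["f1_score"],
)
```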

FinetunedEmbeddingModel #

Bases: EmbeddingModelBase

A finetuned embedding model in the OrcaCloud

Attributes:

  • name (str) –

    Name of the finetuned embedding model

  • embedding_dim (int) –

    Dimension of the embeddings that are generated by the model

  • max_seq_length (int) –

    Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding

  • id (str) –

    Unique identifier of the finetuned embedding model

  • base_model (PretrainedEmbeddingModel | None) –

    Base model the finetuned embedding model was trained on (None for uploaded models)

  • created_at (datetime) –

    When the model was finetuned

  • description (str | None) –

    Optional description of the embedding model

Note

For uploaded models (created via _upload), base_model is None, num_params may be extracted from the model if possible (otherwise None/0), and supports_instructions is False since this property cannot be determined from the model config alone.

trials property #

trials

List the trials for the finetuned embedding model

Returns:

embed #

embed(
    value: str,
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[float]
embed(
    value: list[str],
    max_seq_length: int | None = None,
    instruction: str | None = None,
) -> list[list[float]]
embed(value, max_seq_length=None, instruction=None)

Generate embeddings for a value or list of values

Parameters:

  • value (str | list[str]) –

    The value or list of values to embed

  • max_seq_length (int | None, default: None ) –

    The maximum sequence length to truncate the input to

  • instruction (str | None, default: None ) –

    Optional instruction for instruction-tuned embedding models.

Returns:

  • list[float] | list[list[float]]

    The embedding as a list of floats if the input is a single value, or a list of embeddings (one per input) if the input is a list of values

evaluate #

evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[ClassificationMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: str,
    score_column: None = None,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> ClassificationMetrics
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[True]
) -> Job[RegressionMetrics]
evaluate(
    datasource: Datasource,
    *,
    value_column: str = "value",
    label_column: None = None,
    score_column: str,
    eval_datasource: Datasource | None = None,
    subsample: int | float | None = None,
    neighbor_count: int = 5,
    batch_size: int = 32,
    weigh_memories: bool = True,
    background: Literal[False] = False
) -> RegressionMetrics
evaluate(
    datasource,
    *,
    value_column="value",
    label_column=None,
    score_column=None,
    eval_datasource=None,
    subsample=None,
    neighbor_count=5,
    batch_size=32,
    weigh_memories=True,
    background=False
)

Evaluate the finetuned embedding model on a labeled or scored datasource

all classmethod #

all()

List all finetuned embedding model handles in the OrcaCloud

Returns:

open classmethod #

open(name)

Get a handle to a finetuned embedding model in the OrcaCloud

Parameters:

  • name (str) –

    The name or unique identifier of a finetuned embedding model

Returns:

Raises:

  • LookupError

    If the finetuned embedding model does not exist

exists classmethod #

exists(name_or_id)

Check if a finetuned embedding model with the given name or id exists.

Parameters:

  • name_or_id (str) –

    The name or id of the finetuned embedding model

Returns:

  • bool

    True if the finetuned embedding model exists, False otherwise

drop classmethod #

drop(name_or_id, *, if_not_exists='error', cascade=False)

Delete the finetuned embedding model from the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the finetuned embedding model

  • if_not_exists (DropMode, default: 'error' ) –

    What to do if the finetuned embedding model does not exist, defaults to "error". Other option is "ignore" to do nothing if the model does not exist.

  • cascade (bool, default: False ) –

    If True, also delete all associated memorysets and their predictive models. Defaults to False.

Raises:

  • LookupError

    If the finetuned embedding model does not exist and if_not_exists is "error"

  • RuntimeError

    If the model has associated memorysets and cascade is False