orca_sdk.embedding_model#

PretrainedEmbeddingModel #

Bases: _EmbeddingModel

A pretrained embedding model

Models:

OrcaCloud supports a curated set of small to medium-sized embedding models that perform well on the Hugging Face MTEB Leaderboard. Each model is available as a class attribute. Currently supported:

CDE_SMALL: Context-aware CDE small model from Hugging Face (jxm/cde-small-v1)
CLIP_BASE: Multi-modal CLIP model from Hugging Face (sentence-transformers/clip-ViT-L-14)
GTE_BASE: Alibaba’s GTE model from Hugging Face (Alibaba-NLP/gte-base-en-v1.5)
DISTILBERT: DistilBERT embedding model from Hugging Face (distilbert-base-uncased)
GTE_SMALL: GTE-Small embedding model from Hugging Face (Supabase/gte-small)
E5_LARGE: E5-Large instruction-tuned embedding model from Hugging Face (intfloat/multilingual-e5-large-instruct)
GIST_LARGE: GIST-Large embedding model from Hugging Face (avsolatorio/GIST-large-Embedding-v0)
MXBAI_LARGE: Mixedbread’s large embedding model from Hugging Face (mixedbread-ai/mxbai-embed-large-v1)
QWEN2_1_5B: Alibaba’s Qwen2-1.5B instruction-tuned embedding model from Hugging Face (Alibaba-NLP/gte-Qwen2-1.5B-instruct)

Examples:

>>> PretrainedEmbeddingModel.CDE_SMALL
PretrainedEmbeddingModel({name: CDE_SMALL, embedding_dim: 768, max_seq_length: 512})

Attributes:

name –

Name of the pretrained embedding model
embedding_dim –

Dimension of the embeddings that are generated by the model
max_seq_length –

Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding
uses_context –

Whether the model uses the memoryset to contextualize embeddings (similar in effect to inverse document frequency in TF-IDF features)

embed #

embed(
    value: str, max_seq_length: int | None = None
) -> list[float]

embed(
    value: list[str], max_seq_length: int | None = None
) -> list[list[float]]

embed(value, max_seq_length=None)

Generate embeddings for a value or list of values

Parameters:

value (str | list[str]) –

The value or list of values to embed
max_seq_length (int | None, default: None ) –

The maximum sequence length to truncate the input to

Returns:

list[float] | list[list[float]] –

A list of floats (the embedding) if the input is a single value, or a matrix of floats with one embedding per value if the input is a list of values
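The two return shapes can be put to work in a small similarity ranking. This is a minimal sketch, not part of the SDK: `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and `model` is assumed to be any object whose `embed` method behaves as documented above (for example `PretrainedEmbeddingModel.GTE_BASE`).

```python
import math
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Any object with an `embed` method shaped like the one documented above."""

    def embed(self, value, max_seq_length=None): ...


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(model: Embedder, query: str, candidates: list) -> list:
    """Return (candidate, score) pairs sorted from most to least similar to the query."""
    query_vec = model.embed(query)            # str in -> list[float] out
    candidate_vecs = model.embed(candidates)  # list[str] in -> list[list[float]] out
    scored = [
        (text, cosine_similarity(query_vec, vec))
        for text, vec in zip(candidates, candidate_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Embedding all candidates in one `embed` call keeps the code simple; whether batching is also faster depends on the service and is not something this sketch assumes.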

all `classmethod` #

all()

List all pretrained embedding models in the OrcaCloud

Returns:

list[PretrainedEmbeddingModel] –

A list of all pretrained embedding models available in the OrcaCloud
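Because every model exposes `embedding_dim` and `max_seq_length`, the list returned by `all()` can be filtered before choosing a model. The helper below is a hypothetical sketch (`pick_models` is not an SDK function) that only reads those two documented attributes.

```python
from typing import Iterable, Optional


def pick_models(
    models: Iterable,
    *,
    max_embedding_dim: Optional[int] = None,
    min_seq_length: Optional[int] = None,
) -> list:
    """Return the models within the given limits, smallest embedding first."""
    selected = []
    for model in models:
        if max_embedding_dim is not None and model.embedding_dim > max_embedding_dim:
            continue  # embedding is wider than the caller's budget
        if min_seq_length is not None and model.max_seq_length < min_seq_length:
            continue  # inputs of the required length would be truncated
        selected.append(model)
    return sorted(selected, key=lambda m: m.embedding_dim)


# Intended usage, assuming the SDK is installed and configured:
#   candidates = pick_models(PretrainedEmbeddingModel.all(), min_seq_length=512)
```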

open `classmethod` #

open(name)

Open an embedding model by name.

This is an alternative method to access models for environments where IDE autocomplete for model names is not available.

Parameters:

name (str) –

Name of the model to open (e.g., “GTE_BASE”, “CLIP_BASE”)

Returns:

PretrainedEmbeddingModel –

The embedding model instance

Examples:

>>> model = PretrainedEmbeddingModel.open("GTE_BASE")

exists `classmethod` #

exists(name)

Check if a pretrained embedding model exists by name

Parameters:

name (str) –

The name of the pretrained embedding model

Returns:

bool –

True if the pretrained embedding model exists, False otherwise
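When the model name comes from configuration rather than code, `exists` and `open` combine naturally. A minimal sketch, assuming `model_cls` is `PretrainedEmbeddingModel` (or any class with the same two classmethods); `open_or_fallback` is a hypothetical helper, not part of the SDK.

```python
def open_or_fallback(model_cls, name: str, fallback: str = "GTE_BASE"):
    """Open `name` if the class reports that it exists, otherwise open `fallback`."""
    chosen = name if model_cls.exists(name) else fallback
    return model_cls.open(chosen)


# Intended usage, assuming the SDK is installed and configured:
#   model = open_or_fallback(PretrainedEmbeddingModel, configured_name)
```

The fallback name is passed through `open` unchecked, so a misspelled fallback still surfaces as an error rather than being silently ignored.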

finetune #

finetune(
    name: str,
    train_datasource: Datasource | LabeledMemoryset,
    *,
    eval_datasource: Datasource | None = None,
    label_column: str = "label",
    value_column: str = "value",
    training_method: (
        EmbeddingFinetuningMethod | str
    ) = EmbeddingFinetuningMethod.CLASSIFICATION,
    training_args: dict | None = None,
    if_exists: CreateMode = "error",
    background: Literal[True]
) -> Job[FinetunedEmbeddingModel]

finetune(
    name: str,
    train_datasource: Datasource | LabeledMemoryset,
    *,
    eval_datasource: Datasource | None = None,
    label_column: str = "label",
    value_column: str = "value",
    training_method: (
        EmbeddingFinetuningMethod | str
    ) = EmbeddingFinetuningMethod.CLASSIFICATION,
    training_args: dict | None = None,
    if_exists: CreateMode = "error",
    background: Literal[False] = False
) -> FinetunedEmbeddingModel

finetune(
    name,
    train_datasource,
    *,
    eval_datasource=None,
    label_column="label",
    value_column="value",
    training_method=EmbeddingFinetuningMethod.CLASSIFICATION,
    training_args=None,
    if_exists="error",
    background=False
)

Finetune an embedding model

Parameters:

name (str) –

Name of the finetuned embedding model
train_datasource (Datasource | LabeledMemoryset) –

Data to train on
eval_datasource (Datasource | None, default: None ) –

Optionally provide data to evaluate on
label_column (str, default: 'label' ) –

Column name of the label
value_column (str, default: 'value' ) –

Column name of the value
training_method (EmbeddingFinetuningMethod | str, default: CLASSIFICATION ) –

Training method to use
training_args (dict | None, default: None ) –

Optional override for Hugging Face TrainingArguments. If not provided, reasonable training arguments will be used for the specified training method
if_exists (CreateMode, default: 'error' ) –

What to do if a finetuned embedding model with the same name already exists. Defaults to "error"; the other option, "open", opens the existing finetuned embedding model.
background (bool, default: False ) –

Whether to run the operation in the background and return a job handle

Returns:

FinetunedEmbeddingModel | Job[FinetunedEmbeddingModel] –

The finetuned embedding model, or a job handle for the finetuning run if background is True

Raises:

ValueError –

If the finetuned embedding model already exists and if_exists is "error", or if if_exists is "open" but the base model does not match the existing model

Examples:

>>> datasource = Datasource.open("my_datasource")
>>> model = PretrainedEmbeddingModel.CLIP_BASE
>>> model.finetune("my_finetuned_model", datasource)
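The keyword-only options can be combined to make finetuning idempotent. The sketch below wraps `finetune` in a hypothetical helper, `finetune_or_reuse`, that passes `if_exists="open"` and a `training_args` override. The keys `num_train_epochs` and `learning_rate` are Hugging Face `TrainingArguments` field names; that the service accepts exactly these keys is an assumption, as are the default values chosen here.

```python
def finetune_or_reuse(
    base_model,
    name: str,
    train_datasource,
    *,
    eval_datasource=None,
    epochs: int = 1,
    learning_rate: float = 2e-5,
):
    """Finetune `base_model` under `name`, or open the existing model with that name."""
    # Assumed override keys, mirroring Hugging Face TrainingArguments fields.
    training_args = {
        "num_train_epochs": epochs,
        "learning_rate": learning_rate,
    }
    return base_model.finetune(
        name,
        train_datasource,
        eval_datasource=eval_datasource,
        training_args=training_args,
        if_exists="open",  # reuse an existing model instead of raising
    )
```

With `if_exists="open"`, the call still raises `ValueError` when the existing model was finetuned from a different base model, so the helper never silently returns a model built on the wrong base.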

FinetunedEmbeddingModel #

Bases: _EmbeddingModel

A finetuned embedding model in the OrcaCloud

Attributes:

name –

Name of the finetuned embedding model
embedding_dim –

Dimension of the embeddings that are generated by the model
max_seq_length –

Maximum input length (in tokens, not characters) that this model can process. Longer inputs are truncated during embedding
uses_context –

Whether the model uses the memoryset to contextualize embeddings (similar in effect to inverse document frequency in TF-IDF features)
id (str) –

Unique identifier of the finetuned embedding model
base_model (PretrainedEmbeddingModel) –

Base model the finetuned embedding model was trained on
created_at (datetime) –

When the model was finetuned

base_model `property` #

base_model

Pretrained model the finetuned embedding model was based on

embed #

embed(
    value: str, max_seq_length: int | None = None
) -> list[float]

embed(
    value: list[str], max_seq_length: int | None = None
) -> list[list[float]]

embed(value, max_seq_length=None)

Generate embeddings for a value or list of values

Parameters:

value (str | list[str]) –

The value or list of values to embed
max_seq_length (int | None, default: None ) –

The maximum sequence length to truncate the input to

Returns:

list[float] | list[list[float]] –

A list of floats (the embedding) if the input is a single value, or a matrix of floats with one embedding per value if the input is a list of values

all `classmethod` #

all()

List all finetuned embedding model handles in the OrcaCloud

Returns:

list[FinetunedEmbeddingModel] –

A list of all finetuned embedding model handles in the OrcaCloud

open `classmethod` #

open(name)

Get a handle to a finetuned embedding model in the OrcaCloud

Parameters:

name (str) –

The name or unique identifier of a finetuned embedding model

Returns:

FinetunedEmbeddingModel –

A handle to the finetuned embedding model in the OrcaCloud

Raises:

LookupError –

If the finetuned embedding model does not exist
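Callers that treat a missing model as a normal case can catch the documented `LookupError`. A minimal sketch, where `try_open` is a hypothetical helper and `model_cls` is assumed to be `FinetunedEmbeddingModel` or a class with the same `open` classmethod:

```python
def try_open(model_cls, name_or_id: str):
    """Return a handle to the model, or None if no such model exists."""
    try:
        return model_cls.open(name_or_id)
    except LookupError:
        return None
```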

exists `classmethod` #

exists(name_or_id)

Check if a finetuned embedding model with the given name or id exists.

Parameters:

name_or_id (str) –

The name or id of the finetuned embedding model

Returns:

bool –

True if the finetuned embedding model exists, False otherwise

drop `classmethod` #

drop(name_or_id, *, if_not_exists='error')

Delete the finetuned embedding model from the OrcaCloud

Parameters:

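name_or_id (str) –

The name or id of the finetuned embedding model
if_not_exists (default: 'error' ) –

What to do if the finetuned embedding model does not exist. With "error", a LookupError is raised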

Raises:

LookupError –

If the finetuned embedding model does not exist and if_not_exists is "error"
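For cleanup scripts that may run more than once, `exists` can guard the call to `drop`. This is a hedged sketch: `drop_if_present` is a hypothetical helper, it relies only on the two classmethods documented above, and it uses the default `if_not_exists="error"` behavior.

```python
def drop_if_present(model_cls, name_or_id: str) -> bool:
    """Drop the model if it exists; return True if a drop was issued."""
    if not model_cls.exists(name_or_id):
        return False
    # If another client deletes the model between the check and this call,
    # drop still raises LookupError under the default if_not_exists="error".
    model_cls.drop(name_or_id)
    return True
```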
