Embedding Models#

This guide explains what embedding models are and how they work in OrcaCloud. You will learn about the available pretrained embedding models, how to generate embeddings manually, and how to finetune embedding models to improve performance for your specific use case.

What are Embedding Models?#

Embedding models are at the heart of retrieval-augmented systems. They convert text (or other data types) into dense vector representations called embeddings. These embeddings capture semantic meaning in a way that allows for efficient similarity comparisons. In OrcaCloud, embedding models serve two critical functions:

  1. Memory Indexing: When you add memories to a memoryset, the embedding model converts each memory’s value into a vector that is stored in the OrcaCloud. This enables fast semantic search for similar memories.

  2. Query Embedding: During inference, the same embedding model converts the input query into a vector, which is then used to find the most similar memories in the memoryset.

The quality of these embeddings directly impacts the performance of your retrieval-augmented models. Better embeddings lead to more relevant memory lookups, which in turn lead to more accurate predictions.
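The similarity comparison that powers memory lookup can be illustrated with a small sketch. This is not OrcaCloud code – just plain Python over toy vectors – showing how cosine similarity ranks dense vectors by semantic closeness:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the
    # vector magnitudes. Ranges from -1 (opposite) to 1 (identical direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce 768 dimensions).
query = [0.9, 0.1, 0.0]
memory_close = [0.8, 0.2, 0.1]  # stands in for a semantically similar memory
memory_far = [0.0, 0.1, 0.9]    # stands in for an unrelated memory

# The similar memory scores higher, so it would be retrieved first.
assert cosine_similarity(query, memory_close) > cosine_similarity(query, memory_far)
```

A memoryset lookup is conceptually this comparison run against every stored memory vector, with an index making it fast at scale.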

Pretrained Embedding Models#

OrcaCloud provides several pretrained embedding models that perform well on the Hugging Face MTEB Leaderboard. These models can be accessed as class attributes of the PretrainedEmbeddingModel class:

  • F2LLM_160M: CodeFuse’s F2LLM-v2 160M model is a compact multilingual instruction-following embedding model. It has a 768-dimensional embedding space and can handle sequences up to 40,960 tokens in length, making it a strong default for most retrieval tasks.
  • CDE_SMALL: The context-aware CDE small model generates embeddings that take into account both the document and its neighboring context, rather than encoding documents in isolation. This contextual awareness helps it better capture relationships between documents and achieve stronger performance, especially on out-of-domain tasks, with state-of-the-art results on the MTEB benchmark. It has a 768-dimensional embedding space and can handle sequences up to 512 tokens in length.

Using a Pretrained Model#

To use a pretrained model, you can simply access it as a class attribute:

PretrainedEmbeddingModel.F2LLM_160M
PretrainedEmbeddingModel({name: F2LLM_160M, embedding_dim: 768, max_seq_length: 40960})

List All Pretrained Models#

You can list all pretrained embedding models that are currently available in OrcaCloud using the PretrainedEmbeddingModel.all class method:

PretrainedEmbeddingModel.all()
[PretrainedEmbeddingModel({name: CDE_SMALL, embedding_dim: 768, max_seq_length: 512}),
 PretrainedEmbeddingModel({name: F2LLM_160M, embedding_dim: 768, max_seq_length: 40960})]

We are always adding new models to OrcaCloud, so check back regularly for the latest additions, and contact us if there is a specific model you’d like to try with Orca.

Generate Embeddings#

While memorysets and models handle embedding generation automatically, you can also generate embeddings manually using the embed method of an embedding model. This can be useful for debugging, visualization, or custom similarity calculations.

embedding = PretrainedEmbeddingModel.F2LLM_160M.embed(
    "I love this movie",
    max_seq_length=10, # (1)!
)

  1. You can optionally specify a maximum sequence length to improve performance if you know your inputs will be shorter than the model’s default maximum. This value must be less than or equal to the model’s max_seq_length and is specified in tokens, not characters.
[0.023, -0.015, 0.042, 0.018, -0.031, ...]

To embed multiple texts at once, pass a list of strings to the embed method:

embeddings = PretrainedEmbeddingModel.F2LLM_160M.embed([
    "I love this movie",
    "This movie is terrible"
])
[Embedding([0.023, -0.015, 0.042, 0.018, -0.031, ...]),
 Embedding([0.015, -0.032, 0.028, -0.045, 0.012, ...])]

Using Instructions#

Some embedding models – including F2LLM_160M and the rest of the F2LLM_* family, QWEN_600M, HARRIER_270M, and HARRIER_600M – are instruction-tuned. Passing a short, task-specific instruction typically yields better, more discriminative embeddings because it tells the model what kind of similarity you care about (“is this the same topic?” vs. “is this the same sentiment?”). Models that don’t support instructions silently ignore the argument.

embedding = PretrainedEmbeddingModel.F2LLM_160M.embed(
    "I love this movie",
    instruction="Classify the sentiment of this movie review", # (1)!
)

  1. The model formats this into its expected prompt (e.g. Instruct: Classify the sentiment of this movie review\nQuery: I love this movie). Keep instructions short and action-oriented – a sentence or less is usually ideal.
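A minimal sketch of this formatting in plain Python, assuming the Instruct:/Query: template shown above (the exact template is model-specific and applied for you by the SDK):

```python
def format_instructed_input(instruction: str, text: str) -> str:
    # Combine the task instruction and the input text into the prompt
    # template the model was instruction-tuned on.
    return f"Instruct: {instruction}\nQuery: {text}"

prompt = format_instructed_input(
    "Classify the sentiment of this movie review",
    "I love this movie",
)
# prompt == "Instruct: Classify the sentiment of this movie review\nQuery: I love this movie"
```

Because the instruction becomes part of the embedded text, the same input embedded under two different instructions produces two different vectors – which is exactly why insertion and retrieval must use the same instruction.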

You can check whether a given model supports instructions via the supports_instructions attribute:

PretrainedEmbeddingModel.F2LLM_160M.supports_instructions
True

When you use instruction-tuned models inside a memoryset, you typically set the instruction once at memoryset creation time via the instruction argument on LabeledMemoryset.create / ScoredMemoryset.create. The memoryset then applies it consistently every time text is embedded – both when memories are inserted and when queries are looked up – so insertion and retrieval stay aligned. Omitting instruction falls back to the model’s built-in default prompt.

Fine-Tuning#

While pretrained embedding models work well for many applications, you can often achieve better performance by fine-tuning a model on your specific data. Consider fine-tuning when your data has domain-specific terminology, when you need to distinguish between subtle differences that general models consider similar, or when you want to optimize for a specific task like classification or clustering.

Fine-tuning in OrcaCloud is a single method call on any pretrained model. In the simplest case you only need a name and training data:

from orca_sdk import PretrainedEmbeddingModel, LabeledMemoryset

memoryset = LabeledMemoryset.open("my_memoryset")
finetuned_model = PretrainedEmbeddingModel.F2LLM_160M.finetune(
    "my_finetuned_model",
    memoryset,
    if_exists="open",
)

For full coverage of loss functions, hyperparameter configuration, automated sweeps, and best practices, see the Fine-Tuning guide.

Managing Finetuned Models#

To open an existing finetuned model, use the FinetunedEmbeddingModel.open method:

from orca_sdk import FinetunedEmbeddingModel

FinetunedEmbeddingModel.open("my_finetuned_model")
FinetunedEmbeddingModel({
    name: my_finetuned_model,
    embedding_dim: 768,
    max_seq_length: 40960,
    base_model: PretrainedEmbeddingModel.F2LLM_160M
})

You can list all your finetuned models with FinetunedEmbeddingModel.all:

FinetunedEmbeddingModel.all()

To delete a finetuned model when you no longer need it, use FinetunedEmbeddingModel.drop:

FinetunedEmbeddingModel.drop("my_finetuned_model", if_not_exists="ignore") # (1)!

  1. You cannot delete embedding models that are currently in use by a memoryset. The if_not_exists="ignore" option prevents an error if the model has already been deleted.

Choosing an Embedding Model#

Selecting the right embedding model for your use case is crucial for optimal performance. Here are some considerations:

  • Domain: If your data is from a specialized domain (e.g., medical, legal), consider using a context-aware model like CDE_SMALL or finetuning a model on domain-specific data.
  • Sequence Length: Choose a model whose maximum sequence length accommodates your data. For example, F2LLM_160M can handle up to 40,960 tokens, while CDE_SMALL is limited to 512 tokens. Longer sequences cost quadratically more memory and compute to embed – even if the model supports them, you often don’t need to use the full window. When fine-tuning, OrcaCloud’s default max_seq_length="p99" automatically picks the shortest cutoff that preserves 99% of your training samples (see Sequence length, batch size, and memory in the fine-tuning guide).
  • Instruction support: If you have a well-defined task (“classify sentiment”, “find near-duplicates”, “match questions to answers”), prefer an instruction-tuned model like F2LLM_160M or QWEN_600M and set the instruction on your memoryset. Task conditioning usually beats a non-instruction model of the same size.
  • Performance vs. Speed: Larger models generally provide better embeddings but may be slower and use more memory. Consider your latency and throughput requirements; a fine-tuned smaller model often outperforms a larger pretrained one on a specific task at a fraction of the cost.
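One simple way to act on these trade-offs is to filter candidates by the constraint that matters most, usually sequence length. The sketch below uses plain Python over illustrative metadata for the two models listed earlier in this guide, rather than live SDK calls; in practice you would iterate over the results of PretrainedEmbeddingModel.all():

```python
# Illustrative metadata mirroring the pretrained models listed above.
models = [
    {"name": "CDE_SMALL", "embedding_dim": 768, "max_seq_length": 512},
    {"name": "F2LLM_160M", "embedding_dim": 768, "max_seq_length": 40960},
]

def candidates(models, min_seq_length):
    # Keep only models whose context window fits the longest
    # inputs you expect to embed.
    return [m["name"] for m in models if m["max_seq_length"] >= min_seq_length]

candidates(models, min_seq_length=2048)  # ["F2LLM_160M"]
```

From the shortlist, evaluate each remaining candidate on a held-out sample of your own data before committing to one.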

The best approach is often to experiment with different models and evaluate their performance on your specific task. Contact our team of ML experts if you need help choosing the right model for your use case.