Embeddings#

This guide is about the what and why of embeddings, the background you need before you pick an embedding model, write instructions for one, or finetune one in OrcaCloud. For the how-to, see the Embedding Models guide.

What Is an Embedding?#

An embedding is a fixed-size vector of floating-point numbers that represents a piece of text (or, occasionally, an image), produced by a neural network we call an embedding model. The embedding is engineered so that two inputs with similar meanings land near each other in that vector space, and two inputs with unrelated meanings land far apart. “Near” is usually measured by cosine similarity or Euclidean distance; the two rank neighbors identically once vectors are L2-normalized.

That’s the whole contract. The embedding itself is not human-readable, and no individual dimension corresponds to a human-meaningful concept. What matters is that the geometry of the space encodes semantics: distances and directions in it line up with similarities and contrasts in the text.

Embeddings are useful exactly because of this geometry. If you can turn a query into an embedding and each of your documents into an embedding, you can replace “find documents matching this query” with “find the nearest vectors”, a single operation that a vector database can do in sub-millisecond time even over millions of vectors. In OrcaCloud the same operation powers both memoryset search and the retrieval step inside a retrieval-augmented classification or regression model.
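The replacement can be sketched in a few lines. This is a toy illustration with random vectors standing in for real embeddings (the corpus size, dimension, and variable names are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real embeddings: 1,000 documents and one query,
# each a 512-dimensional vector (a real model would produce these).
docs = rng.normal(size=(1000, 512))
query = rng.normal(size=512)

# L2-normalize so cosine similarity reduces to a dot product
# (and ranks neighbors identically to Euclidean distance).
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# "Find documents matching this query" becomes "find the nearest vectors":
scores = docs @ query           # cosine similarity against every document
top5 = np.argsort(-scores)[:5]  # indices of the 5 nearest documents
```

A vector database performs the same lookup with an approximate-nearest-neighbor index instead of the brute-force matrix product, which is what keeps it fast at millions of vectors.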

Where Embeddings Come From#

Almost every modern embedding model starts life as a pretrained language model and is then retrained to produce embeddings that behave the way we want. The lineage matters: an embedding’s dimension, context length, and broad strengths and weaknesses are largely inherited from the language model the embedding model was built on.

The retraining stage, sometimes called contrastive pretraining, teaches the model similarity rather than language modeling. The recipe, in broad strokes, is:

  1. Collect a very large corpus of text pairs and triples that are labeled as similar or dissimilar. Typical sources are question/answer pairs, paraphrase datasets, titles and abstracts, and (in recent years) synthetic pairs generated by LLMs.
  2. Pass each text through the model to produce an embedding.
  3. Update the model’s weights so that similar pairs end up with a higher cosine similarity than dissimilar pairs. The standard loss for this is InfoNCE, which pushes paired embeddings together while simultaneously pushing them away from every other embedding in the same batch.
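The steps above can be sketched numerically. This is a minimal NumPy version of the InfoNCE computation, with illustrative batch size, dimension, and temperature; a real trainer would backpropagate through this rather than just evaluate it:

```python
import numpy as np

def info_nce_loss(a, b, temperature=0.05):
    """InfoNCE over a batch of paired embeddings.

    a[i] and b[i] embed a "similar" pair; every b[j] with j != i
    serves as an in-batch negative for a[i].
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # scaled cosine similarities

    # Cross-entropy with the matching pair (the diagonal) as the target,
    # with the usual max-subtraction for numerical stability.
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 64))
matched = info_nce_loss(a, a)                      # perfect pairs: loss near zero
mismatched = info_nce_loss(a, rng.normal(size=(8, 64)))  # random pairs: high loss
```

Minimizing this loss is what pushes paired embeddings together while pushing them away from the other embeddings in the batch.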

Conditioning Embeddings on Instructions#

Most embedding models embed a given piece of text the same way regardless of why you’re embedding it. They have one fixed opinion about what “similar” means, averaged across whatever mix of training data they saw. Two movie reviews can be “similar” because they share sentiment, because they discuss the same movie, or because they use the same vocabulary, and a fixed embedding has to pick some blend of these.

Instruction-tuned embedding models add a small ingredient on top. During training, each input is prefixed with a short task description, an instruction like “Given a customer support ticket, identify the user intent.” The model learns to shift its embeddings slightly based on this prefix, so feeding a different instruction at inference time can nudge the same text toward a different similarity notion.

In practice the effect of the instruction is usually modest. A well-chosen instruction tends to add a small but real lift over a generic one on a task the model wasn’t specifically pretrained for, but it doesn’t fundamentally change what the model understands. Instruction tuning lets you pick among similarity notions the model has already learned; it won’t make the model understand a domain it has never seen. For that, you need to update the weights themselves via finetuning.
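The prefixing itself is just string formatting. The template below is one common convention, not a universal one; the exact layout is model-specific, so treat the `Instruct:`/`Query:` format and the helper name as assumptions and check your model's documentation:

```python
def with_instruction(instruction: str, text: str) -> str:
    # One common template for instruction-tuned embedding models;
    # the exact format varies by model (check the model card).
    return f"Instruct: {instruction}\nQuery: {text}"

ticket = "My card was charged twice for the same order."

intent_view = with_instruction(
    "Given a customer support ticket, identify the user intent", ticket)
topic_view = with_instruction(
    "Given a customer support ticket, identify the product involved", ticket)

# Feeding each view to the same model yields two different embeddings of
# the same underlying text, nudged toward different similarity notions.
```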

What Finetuning Changes#

Finetuning an embedding model means continuing its training on your labeled data so that its similarity function becomes specifically shaped around the distinctions that matter for your task. Two support tickets that share a vocabulary but differ in intent (something a pretrained model routinely confuses) can be pulled apart in the embedding space. Two tickets that use totally different words but mean the same thing can be pulled together.

This is a weight-level change: a finetuned model has a genuinely different similarity function than the one it started from. That’s why finetuning is typically a much bigger lever than either picking a larger pretrained model or writing a better instruction. A finetuned small model often outperforms a much larger pretrained one on the specific task it was trained for, at a small fraction of the latency and serving cost.

The training recipe for finetuning mirrors the contrastive pretraining recipe above, but with your labeled memories as the similarity signal rather than a generic web-scale dataset. OrcaCloud supports several loss flavors (contrastive, triplet, proxy-anchor, and a simple prediction-head loss), each appropriate for different data regimes. See the Finetuning guide for the practical details.
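As one concrete flavor, a triplet loss can be sketched directly. This is an illustrative NumPy version, not OrcaCloud's implementation, and the margin value is arbitrary:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor's cosine similarity to its positive above its
    similarity to the negative by at least `margin`."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)

anchor   = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])  # nearly the same direction as the anchor
negative = np.array([0.0, 1.0])  # orthogonal to the anchor

separated = triplet_loss(anchor, positive, negative)  # already apart: 0.0
violated  = triplet_loss(anchor, negative, positive)  # roles swapped: large loss
```

During finetuning, anchors and positives come from memories that share a label, and negatives from memories that don't, so minimizing the loss pulls same-label memories together and pushes different-label memories apart.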

The Trade-Off Space#

Every embedding model sits somewhere in a small space defined by a handful of numbers: embedding dimension, maximum context length, parameter count, and (for instruction-tuned models) whether it was trained to condition on an instruction. None of these are free.

When we talk about the quality axis of that space, we lean on MTEB, the standard public benchmark for embedding models. It averages a model’s performance over dozens of retrieval and classification tasks and is the closest thing the field has to a single “quality” number. It’s a useful starting point for comparing pretrained models, but it is not the last word on your task, which is why finetuning regularly reshuffles the leaderboard once you’re training on your own data.

Embedding dimension (the length of the vector) sets how much information the model can pack into each embedding, and therefore both the storage cost per memory (a memoryset with 1024-dimensional embeddings requires twice the space for vector storage as the same memoryset with 512-dimensional embeddings) and the speed of approximate-nearest-neighbor lookup. Higher dimension usually correlates with higher quality, but only weakly, and the correlation flattens well before the largest dimensions in the wild.
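The storage arithmetic behind that parenthetical is simple enough to write down. Assuming float32 embeddings (4 bytes per dimension) and ignoring index overhead:

```python
def vector_storage_bytes(n_memories: int, dim: int, bytes_per_float: int = 4) -> int:
    # float32 embeddings: dim * 4 bytes per memory, times the memory count.
    return n_memories * dim * bytes_per_float

# One million memories:
small = vector_storage_bytes(1_000_000, 512)   # 2_048_000_000 bytes ≈ 2.0 GB
large = vector_storage_bytes(1_000_000, 1024)  # 4_096_000_000 bytes ≈ 4.1 GB
```

Doubling the dimension exactly doubles the raw vector storage; real vector indexes add some overhead on top of this baseline.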

Maximum context length (the longest input the model can process in one pass) is determined by the underlying language model’s architecture and position-encoding scheme. Older encoder-style bases top out at a few hundred tokens; modern LLM-style bases reach tens of thousands. Longer context is paid for at inference time: transformer attention is quadratic in the context length, so embedding a 4k-token document is roughly 64x more expensive than embedding a 512-token one. The same scaling shows up as GPU memory pressure during finetuning, where most of the memory goes into per-token activations rather than the model weights. That means doubling the context length hurts you much more than doubling the parameter count.
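The 64x figure follows directly from the quadratic scaling. A one-line helper (the function name is invented for illustration) makes the arithmetic explicit:

```python
def relative_attention_cost(ctx_tokens: int, base_tokens: int = 512) -> float:
    # Self-attention work grows with the square of the sequence length,
    # so cost relative to a baseline is the ratio of lengths, squared.
    return (ctx_tokens / base_tokens) ** 2

cost_4k = relative_attention_cost(4096)  # (4096 / 512) ** 2 = 64.0
```

The same square shows up in activation memory during finetuning, which is why context length dominates the GPU budget long before parameter count does.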

Parameter count (how big the model is) sets both quality and latency. Quality scales with size sub-linearly: doubling parameters typically adds a small number of MTEB points. Latency scales roughly linearly. The practical sweet spot for most classification and regression workloads is well below the largest models on the leaderboard.

Instruction support (whether the model was trained to condition on an instruction) expands the range of similarity notions a single model can express, with typically modest per-task gains. In OrcaCloud, instruction-tuned models let you set the instruction once on the memoryset so everything downstream uses it consistently.

These trade-offs interact, which is why there is no single “best” embedding model. Picking one is a decision about where on the frontier you want to sit for your specific data, latency budget, and corpus size. That’s the subject of the Embedding Models how-to guide.