
Retrieval Augmentation#

This guide will walk you through the benefits of retrieval augmentation and when it is a good fit for your use case.

What is Retrieval Augmentation?#

You might have heard of retrieval-augmented generation (RAG) in the context of LLMs. Orca takes a similar approach but applies it to other types of models, such as classification, regression, and recommendation models.

Retrieval augmentation is a technique that allows machine learning models to adapt to new data without retraining. Instead of relying solely on what was learned during training, the model looks up relevant data from a database of memories that is stored separately from the model's logic. Because these memories can be updated or edited in real time, the model can adapt to new information without retraining and maintain its accuracy even as the data changes.
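To make this concrete, here is a minimal sketch of a retrieval-augmented classifier: the input is embedded, the most similar memories are looked up by cosine similarity, and their labels are combined by a vote. All names and data are illustrative, not Orca's actual API.

```python
import numpy as np

# Toy memory store: embeddings plus labels, kept separate from the
# model's logic. Illustrative only, not Orca's actual API.
memories = {
    "embeddings": np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
    "labels": ["spam", "spam", "ham"],
}

def predict(x, k=2):
    """Classify x by looking up the k most similar memories and voting."""
    emb = memories["embeddings"]
    # Cosine similarity between the input and every memory embedding.
    sims = emb @ x / (np.linalg.norm(emb, axis=1) * np.linalg.norm(x))
    votes = [memories["labels"][i] for i in np.argsort(sims)[-k:]]
    return max(set(votes), key=votes.count)

print(predict(np.array([0.95, 0.05])))  # → spam
```

Updating `memories` (adding a row, editing a label) changes predictions immediately; no weights are retrained.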

Benefits of Retrieval Augmentation#

Traditional machine learning models like classifiers and recommenders are trained to make predictions based on the distribution of the data they see during training. Once deployed, these predictions do not change without retraining. In static environments, this type of rigidity is not a problem. However, in most business applications, such models suffer performance deterioration when:

  • Data is highly dynamic and drifts faster than you can retrain your model.
  • An existing, proven model is used for a new client with similar, but not fully aligned, preferences.
  • Significant new data becomes available that must be incorporated into the training set to improve performance, but retraining the model adds an unwanted delay.
  • There is an inadvertent bias in the training data driven by limited available data or skews in synthetic training data.

Retrieval-augmented models built with Orca are trained to look up external data, referred to as memories, that is stored independently of the model's logic. As a result, these memories can be updated or edited in real time, allowing the model to adapt to new information without retraining and to maintain its performance even as the data changes.

Retrieval augmentation allows your models to meet your specific business KPIs over time. By separating the model's logic from its memories, your model is no longer a black box: you can inspect why it made a specific prediction. Coupled with Orca's analysis and telemetry tools, you can automatically identify problems and get suggested changes to your memories that improve model performance online.

How Are These Models Trained?#

Orca models consist of two to three components:

  1. An embedding model that generates representations for computing semantic similarity between your model inputs and memories.
  2. An optional reranking model that can be used to refine the results of the semantic search.
  3. A model head that combines the memory labels or scores based on the input and memory embeddings to make a final prediction.
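The three components above can be sketched as a toy pipeline. The `embed`, `rerank`, and `head` functions below are illustrative stand-ins for the pre-trained models, not Orca's API:

```python
import hashlib

import numpy as np

def embed(text):
    """1) Embedding model: maps inputs and memories into a shared vector
    space. A deterministic toy stand-in for a pre-trained encoder."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=8)
    return v / np.linalg.norm(v)

def rerank(query_emb, candidates, top_n=3):
    """2) Optional reranker: refines the candidates returned by semantic
    search. Here it simply re-scores by exact dot product and re-sorts."""
    scored = [(float(query_emb @ emb), label) for emb, label in candidates]
    return sorted(scored, reverse=True)[:top_n]

def head(reranked):
    """3) Model head: combines memory labels into a final prediction,
    weighting each label by its (non-negative) similarity score."""
    weights = {}
    for sim, label in reranked:
        weights[label] = weights.get(label, 0.0) + max(sim, 0.0)
    return max(weights, key=weights.get)
```

A query flows through `embed`, then a nearest-neighbor search over memory embeddings, then `rerank`, and finally `head`.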

Each of these components is bootstrapped from a pre-trained model, so no training is needed to get started. Each can also be fine-tuned for your specific use case to improve performance. We have generally found that optimizing the embedding model yields the largest performance improvements, and we provide tools that make it easy to find the best pre-trained embedding model to start with and fine-tune it for your specific use case.
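As an illustration of what comparing pre-trained embedding models can look like, the sketch below scores candidates by leave-one-out nearest-neighbor accuracy on a small labeled set. The model names and embeddings are made up for the example; this is not Orca's selection tooling.

```python
import numpy as np

# Hypothetical precomputed embeddings from two candidate pre-trained
# embedding models, for the same three labeled examples.
candidates = {
    "model_a": np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]),
    "model_b": np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]),
}
labels = ["pos", "pos", "neg"]

def knn_accuracy(embs, labels):
    """Leave-one-out 1-NN accuracy: does each example's nearest
    neighbor (excluding itself) share its label?"""
    sims = embs @ embs.T
    np.fill_diagonal(sims, -np.inf)  # an example may not match itself
    nearest = sims.argmax(axis=1)
    return float(np.mean([labels[i] == labels[j] for i, j in enumerate(nearest)]))

best = max(candidates, key=lambda name: knn_accuracy(candidates[name], labels))
print(best)  # → model_a
```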

The structure of the memories typically reflects the structure of your training data. Orca provides tools to help you optimize the content of your memories based on analysis of the distribution of the memories in the embedding space.

Orca’s team of machine learning engineers will partner with you to find the best solution to your specific needs.

What Happens at Inference?#

Orca’s retrieval-augmented models look up and use memories to guide their predictions. Thus, your model’s behavior can be changed at any time, without retraining or redeploying it, simply by updating the memories.
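A toy illustration of this property: relabeling a single memory immediately changes the model's prediction, with no retraining step. The data and names are illustrative.

```python
import numpy as np

# Toy memory table; editing it changes model behavior with no retraining.
# All names and data are illustrative.
memories = [
    (np.array([1.0, 0.0]), "approve"),
    (np.array([0.0, 1.0]), "reject"),
]

def predict(x):
    """Return the label of the most similar memory."""
    sims = [float(x @ emb) for emb, _ in memories]
    return memories[int(np.argmax(sims))][1]

x = np.array([0.9, 0.4])
print(predict(x))  # → approve

# Policy change: relabel the first memory. The model follows immediately.
memories[0] = (memories[0][0], "review")
print(predict(x))  # → review
```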

Memory data is stored in the OrcaCloud, which is optimized for generating embeddings for your memories and accessing them efficiently. Orca also tracks memory lookups for each model prediction and allows you to record feedback. Furthermore, we provide tooling that enables you to continuously tune your model's memories based on this telemetry, by proactively identifying unhelpful memories and surfacing areas where you don't have enough memories to make high-confidence inferences. These capabilities enable you to adapt your model to rapidly changing conditions.

Optimize Model Performance#

Before you release your retrieval-augmented model to production, you can leverage Orca’s tools to optimize model performance by analyzing your memory data and tuning it based on our recommendations.

Once you deploy your retrieval-augmented model to production, you can leverage Orca's built-in telemetry tools to record feedback and monitor memory usage. You can use this data to drive ongoing model performance improvements by:

  • Assessing which memories contribute to accurate and inaccurate answers, so you can make surgical updates to problematic memories.
  • Alerting you to categories in which your model has low confidence because of missing memory data, and helping you fill in the gaps based on the real-world data your model sees at inference.
  • Determining which memories contributed to a specific output when required for compliance, safety, or trust reasons.
  • Replacing synthetic information with actual data collected from users' interactions with your model.
  • Deriving suggestions for improving your memory data and embedding models based on the feedback you recorded.
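For example, identifying unhelpful memories from lookup telemetry can be as simple as flagging memories that frequently appear in incorrect predictions. The log schema and thresholds below are hypothetical, not Orca's actual format:

```python
from collections import defaultdict

# Toy lookup/feedback log: (memory_id, prediction_was_correct).
# The schema is hypothetical; Orca records this telemetry for you.
lookups = [
    ("mem_1", True), ("mem_1", True), ("mem_2", False),
    ("mem_2", False), ("mem_2", True), ("mem_3", False),
]

def flag_unhelpful(log, min_lookups=2, max_error_rate=0.5):
    """Flag memories that appear often in incorrect predictions."""
    stats = defaultdict(lambda: [0, 0])  # memory_id -> [errors, total]
    for mem_id, correct in log:
        stats[mem_id][1] += 1
        if not correct:
            stats[mem_id][0] += 1
    return [m for m, (err, tot) in stats.items()
            if tot >= min_lookups and err / tot > max_error_rate]

print(flag_unhelpful(lookups))  # → ['mem_2']
```

Flagged memories become candidates for the surgical updates described above; memories with too few lookups are left alone until more evidence accumulates.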