Retrieval Augmentation#
This guide will walk you through the benefits of retrieval augmentation and when it is a good fit for your use case.
What is Retrieval Augmentation?#
You might have heard of retrieval-augmented generation (RAG) in the context of LLMs. Orca takes a similar approach but applies it to arbitrary neural networks built with PyTorch.
Retrieval augmentation is a technique that allows machine learning models to adapt to new data without retraining. The model looks up relevant data from a database of memories that is kept separate from its logic. Because these memories can be updated or edited in real time, the model can incorporate new information without retraining and maintain its accuracy even as the data changes.
Benefits of Retrieval Augmentation#
Traditional machine learning models like classifiers and recommenders are trained to make predictions based on the distribution of the data they see during training. Once deployed, their predictions do not change without retraining. In static environments, this rigidity is not a problem. In most business applications, however, such models suffer performance deterioration when:
- Data is highly dynamic and drifts faster than you can retrain your model.
- An existing, proven model is used for a new client whose preferences are similar but not fully aligned.
- Significant new data becomes available that must be included in the training set to improve performance, but retraining the model adds an unwanted delay.
- The training data carries an inadvertent bias, driven by limited available data or skews in synthetic training data.
Retrieval-augmented models built with Orca are trained to look up external data, referred to as memories, that is stored independently of the model’s logic. As a result, these memories can be updated or edited in real time, allowing the model to adapt to new information without retraining and to maintain its performance even as the data changes.
Retrieval augmentation allows your models to meet your specific business KPIs over time. In some cases, it may yield improvements on benchmarks compared to similarly sized models, but models that already perform well against benchmarks typically see similar evaluation metrics. The difference is what happens after deployment: where traditional models degrade between training runs, retrieval-augmented models maintain performance simply through additions or adjustments to the data stored in the memories. You can also continue to improve your base model’s performance with new training data or other enhancements even after adding retrieval augmentation.
How Are These Models Trained?#
With Orca, you can add retrieval-augmentation to any neural network built with PyTorch.
Orca enables you to augment your model by injecting relevant memories that are looked up based on the inputs to your model(1). Your model is then trained to use this memory data effectively, supplementing the knowledge innately stored in the model during training.
1. Orca generates embeddings for your memories and leverages approximate nearest neighbor search and fine-tuned reranking models to find the most relevant memories for a given input.
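While Orca manages this retrieval pipeline for you, its general shape is familiar: embed the input, gather candidates with approximate nearest neighbor search, and rerank them. Below is a minimal sketch of that pattern in PyTorch; the names (`embed`, `ann_candidates`, `rerank`, `MEMORY_BANK`) are hypothetical stand-ins rather than Orca’s API, and brute-force cosine top-k stands in for a real ANN index:

```python
import torch
import torch.nn.functional as F

# Illustrative only: Orca handles embedding, ANN search, and reranking
# internally. All names here are hypothetical stand-ins, not Orca's API.

EMBED_DIM = 64
MEMORY_BANK = F.normalize(torch.randn(10_000, EMBED_DIM), dim=-1)  # unit rows

def embed(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a learned embedding model; returns a unit vector."""
    return F.normalize(x, dim=-1)

def ann_candidates(query: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Brute-force cosine top-k as a stand-in for approximate NN search."""
    scores = MEMORY_BANK @ query  # cosine similarity, since rows are unit norm
    return scores.topk(k).indices

def rerank(query: torch.Tensor, candidates: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Stand-in for a fine-tuned reranker: rescore candidates, keep top k."""
    scores = MEMORY_BANK[candidates] @ query
    return candidates[scores.topk(k).indices]

query = embed(torch.randn(EMBED_DIM))
memory_ids = rerank(query, ann_candidates(query))  # ids of memories to inject
```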
The structure of the memories typically reflects the structure of your training data. How memories are integrated into your model architecture will vary, and Orca’s team of machine learning engineers will partner with you to find the best solution for your specific needs. We also provide tools and guidance on how to optimize the content of your memories based on your existing training data.
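To make the injection step concrete, here is one common integration pattern sketched in PyTorch: the encoded input attends over the embeddings of its retrieved memories before the classification head. This is an illustrative example, not a prescribed architecture; the class and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MemoryAugmentedClassifier(nn.Module):
    """Hypothetical integration: the input attends over its retrieved memories."""

    def __init__(self, input_dim: int, memory_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.Linear(input_dim, memory_dim)
        self.attention = nn.MultiheadAttention(memory_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(memory_dim, num_classes)

    def forward(self, x: torch.Tensor, memories: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_dim); memories: (batch, k, memory_dim), looked up per input
        query = self.encoder(x).unsqueeze(1)           # (batch, 1, memory_dim)
        fused, _ = self.attention(query, memories, memories)
        return self.head(fused.squeeze(1))             # (batch, num_classes)

model = MemoryAugmentedClassifier(input_dim=32, memory_dim=64, num_classes=3)
x = torch.randn(8, 32)
memories = torch.randn(8, 5, 64)   # k=5 retrieved memory embeddings per example
logits = model(x, memories)
```

Because retrieval runs during training as well, the model learns end to end how much weight to give the looked-up memories versus its own parameters.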
What Happens at Inference?#
Once your retrieval-augmented model is set up and trained, it will intentionally and reliably look up and use memories at inference. Thus, your model’s behavior can be changed at any time, without retraining or redeploying it, simply by updating the memories.
To store your memories, Orca provides a custom database solution that is optimized for generating embeddings for your memories and accessing them efficiently. Orca also tracks memory lookups for each model run and allows you to record feedback. Furthermore, we provide tooling that enables you to continuously tune the memories for your model based on this telemetry by proactively identifying unhelpful memories and surfacing areas in which you don’t have enough memories to make high-confidence inferences. These capabilities enable you to adapt your model to rapidly changing conditions.
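As an illustration of this workflow, the sketch below uses a toy in-memory store with hypothetical names (`MemoryStore`, `add`, `lookup`); it is not OrcaDB’s API, but it shows the two ideas at play: memories can be added or edited live, and every lookup can be logged as telemetry for later tuning.

```python
import torch
import torch.nn.functional as F

# Hypothetical in-memory store, to illustrate the workflow only; in practice
# OrcaDB manages embeddings, lookups, and feedback telemetry for you.
class MemoryStore:
    def __init__(self, dim: int):
        self.embeddings = torch.empty(0, dim)
        self.payloads: list[str] = []
        self.lookup_log: list[tuple[str, int]] = []   # (run id, memory index)

    def add(self, embedding: torch.Tensor, payload: str) -> None:
        """Adding or editing memories changes model behavior without retraining."""
        self.embeddings = torch.cat([self.embeddings, embedding.unsqueeze(0)])
        self.payloads.append(payload)

    def lookup(self, query: torch.Tensor, run_id: str) -> str:
        scores = F.cosine_similarity(self.embeddings, query, dim=-1)
        best = int(scores.argmax())
        self.lookup_log.append((run_id, best))        # telemetry for later tuning
        return self.payloads[best]

store = MemoryStore(dim=64)
store.add(torch.randn(64), "returns accepted within 30 days")
store.add(torch.randn(64), "returns accepted within 60 days")  # live policy update
print(store.lookup(torch.randn(64), run_id="run-001"))
```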
Optimize Model Performance#
Once you deploy a retrieval-augmented model to production, you can leverage Orca to optimize its performance. When you instrument your model with OrcaLib, Orca actively records memory lookups, analyzes memory relevance, and allows you to record feedback for all model runs. You can leverage this data to drive ongoing performance improvements (see the sketch after this list) by:
- Assessing which memories typically contribute to accurate and inaccurate answers, then making surgical updates to problematic memories.
- Auditing underperforming categories that could benefit from additional memories and providing new (real or synthetic) memories via OrcaDB.
- Determining which memories contributed to a specific output when required for compliance, safety, or trust reasons.
- Replacing synthetic information with real data collected from users as they interact with your model.
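As a sketch of what this telemetry-driven tuning can look like, the snippet below aggregates lookup records and feedback into a per-memory accuracy report; the record format is invented for illustration and does not reflect OrcaLib’s actual output.

```python
from collections import defaultdict

# Hypothetical telemetry: which memories each model run looked up, plus the
# feedback you recorded on whether the prediction was correct.
runs = [
    {"memory_ids": [3, 7], "correct": True},
    {"memory_ids": [7, 12], "correct": False},
    {"memory_ids": [7], "correct": False},
    {"memory_ids": [3], "correct": True},
]

stats = defaultdict(lambda: {"hits": 0, "total": 0})
for run in runs:
    for memory_id in run["memory_ids"]:
        stats[memory_id]["total"] += 1
        stats[memory_id]["hits"] += int(run["correct"])

# Memories that mostly co-occur with wrong answers are candidates for edits
# or removal; memories with few lookups may signal thin coverage.
for memory_id, s in sorted(stats.items()):
    print(f"memory {memory_id}: {s['hits']}/{s['total']} correct lookups")
```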