Memory Curation#

This guide shows how to use Orca’s memoryset analysis tools to curate memories for your retrieval-augmented models.

What is Memory Curation?#

The quality of memories in your memoryset directly impacts the performance of your retrieval-augmented models. Memory curation is the process of analyzing and refining your memories to ensure your model makes accurate predictions.

Orca provides tools to help you identify and address common issues in your memorysets, such as removing duplicate memories, identifying potentially mislabeled examples, and optimizing the distribution of memories across the embedding space.

Memory curation with the OrcaSDK generally consists of two steps:

  1. Run an analysis on the memoryset that stores results in the metrics attribute of each memory.
  2. Use the results to identify and address issues in your memoryset.

Using the Analyze Method#

You can run multiple analyses on your memoryset in a single call using the analyze method. The following analyses are supported:

  • "duplicate": Find exact and potential duplicate memories in the memoryset
  • "label": Analyze labels to find potential mislabelings
  • "neighbor": Analyze neighbors to find outliers
  • "cluster": Cluster to understand the structure of the memoryspace
  • "projection": Project memory embeddings into 2D for visualization

The method takes a list of analyses to run as arguments:

memoryset.analyze(
    "neighbor", "cluster", "projection", "label", # (1)!
    {"name": "duplicate", "potential_duplicate_threshold": 0.99}, # (2)!
    lookup_count=15, # (3)!
    clear_metrics=True, # (4)!
)

  1. The method takes a list of analysis names as arguments.
  2. To customize the configuration for an individual analysis, pass a dictionary instead of a string: the name key specifies the analysis to run and the other keys are its configuration options.
  3. Global configuration options like the lookup count can be set as keyword arguments.
  4. If clear_metrics is True, the metrics of all memories will be deleted before running the analysis.

The analysis stores results in the metrics attribute of each memory and returns aggregated metrics for the memoryset (like the number of clusters found).

Finding Duplicate Memories#

Duplicate memories in your memoryset waste storage space, slow down retrieval, and potentially bias your model by giving more weight to repeated examples. The duplicate analysis helps you identify and manage duplicate memories:

memoryset.analyze(
    { # (1)!
        "name": "duplicate",
        "potential_duplicate_threshold": 0.95 # (2)!
    },
    lookup_count=3 # (3)!
)

  1. Instead of a string we can pass a dictionary with the name of the analysis to run and any available configuration options for the analysis.
  2. Here we set the threshold for a memory to be considered a duplicate.
  3. We can control the number of neighbors used for the analysis, setting this to 3 will speed up the analysis at the cost of not being able to find more than 3 potential duplicates.

The call returns aggregated metrics for the analysis, e.g.:

{ "duplicate": { "num_duplicates": 10 } }

This method will find both exact and potential duplicates and set appropriate metrics attributes on each memory.

Removing Exact Duplicates#

Exact duplicates are memories with the same value. They will have the metrics["is_duplicate"] field set to True and the metrics["duplicate_memory_ids"] field will contain the IDs of other memories with the same value. The last memory of a set of exact duplicates will not have the is_duplicate field set, so you can combine query and delete to easily remove all but one of a set of exact duplicates.

memoryset.delete(
    m.memory_id for m in memoryset.query(
        filters=[("metrics.is_duplicate", "==", True)]
    )
)
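The marking scheme can be sketched in plain Python with hypothetical in-memory records (not the SDK API): group memories by value and flag every copy except the last, mirroring how is_duplicate and duplicate_memory_ids are populated.

```python
from collections import defaultdict

# Hypothetical records standing in for memories; real objects come from the SDK.
memories = [
    {"memory_id": "a", "value": "great product", "metrics": {}},
    {"memory_id": "b", "value": "great product", "metrics": {}},
    {"memory_id": "c", "value": "terrible", "metrics": {}},
]

# Group memory IDs by value to find exact duplicates.
groups = defaultdict(list)
for mem in memories:
    groups[mem["value"]].append(mem["memory_id"])

# Flag all but the last memory of each group, as the duplicate analysis does.
for mem in memories:
    ids = groups[mem["value"]]
    if len(ids) > 1 and mem["memory_id"] != ids[-1]:
        mem["metrics"]["is_duplicate"] = True
        mem["metrics"]["duplicate_memory_ids"] = [i for i in ids if i != mem["memory_id"]]

to_delete = [m["memory_id"] for m in memories if m["metrics"].get("is_duplicate")]
# Deleting the flagged memories keeps exactly one copy per value.
```

Because only the non-last copies carry the flag, the query-and-delete call above removes every redundant copy while leaving one representative untouched.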

Removing Potential Duplicates#

Potential duplicates are memories with a similarity score higher than the potential_duplicate_threshold we set in the analysis config. Their metrics["has_potential_duplicates"] field will be True and the metrics["potential_duplicate_memory_ids"] field will contain the IDs of the similar memories. Unlike with exact duplicates, all memories that have a potential duplicate are marked with has_potential_duplicates, so you have to devise your own strategy for deciding which of a set of potential duplicates to keep and which to delete. For example, you could construct a graph of potential duplicates and remove all but one memory from each connected component of the graph:

from networkx import Graph, connected_components # (1)!

graph = Graph()
for mem in memoryset.query(filters=[("metrics.has_potential_duplicates", "==", True)]):
    graph.add_node(mem.memory_id)
    for duplicate_id in mem.metrics["potential_duplicate_memory_ids"]:
        graph.add_edge(mem.memory_id, duplicate_id) # (2)!

memory_ids_to_delete = set()
for component in connected_components(graph):
    keep = next(iter(component)) # (3)!
    memory_ids_to_delete.update(component - {keep}) # (4)!

memoryset.delete(memory_ids_to_delete)

  1. For simplicity we use the networkx library to construct a graph of potential duplicates.
  2. Add an edge between a memory and each of its potential duplicates.
  3. We keep one memory from each connected component; change this to use a different strategy for selecting which memory to keep.
  4. Mark all other memories in the component for removal.
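If you prefer to avoid the networkx dependency, the same all-but-one-per-component selection can be done with a small union-find sketch (the ID pairs below are hypothetical; in practice they come from each memory's metrics["potential_duplicate_memory_ids"]):

```python
from collections import defaultdict

# Hypothetical potential-duplicate pairs.
pairs = [("a", "b"), ("b", "c"), ("d", "e")]

parent = {}

def find(x):
    # Path-compressing find: follow parents up to the root representative.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b in pairs:
    union(a, b)

# Group members by their root; keep one member per component, delete the rest.
components = defaultdict(set)
for node in parent:
    components[find(node)].add(node)

memory_ids_to_delete = set()
for component in components.values():
    keep = min(component)  # deterministic choice; swap in your own strategy
    memory_ids_to_delete.update(component - {keep})
```

Using min(component) makes the kept memory deterministic across runs, which the set-iteration order in the networkx version does not guarantee.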

Analyzing Label Consistency#

Mislabeled memories can significantly impact model performance. The label analysis helps you identify potentially mislabeled memories by analyzing the labels of semantically similar memories:

memoryset.analyze("label")
{
  "label": {
    "label_metrics": [
      {
        "label": 0,
        "label_name": "negative",
        "average_lookup_score": 0.95,
        "memory_count": 100,
      }, {
        "label": 1,
        "label_name": "positive",
        "average_lookup_score": 0.90,
        "memory_count": 100,
      }
    ],
    "neighbor_prediction_accuracy": 0.95,
    "mean_neighbor_label_confidence": 0.95,
    "mean_neighbor_label_entropy": 0.95,
    "mean_neighbor_predicted_label_ambiguity": 0.95,
  }
}

The method returns aggregate metrics for each label class, but the most important information is stored in the metrics attribute of each memory. The following metrics are computed for each memory:

  • neighbor_predicted_label: The label that would be predicted based on neighboring memories
  • neighbor_predicted_label_confidence: Confidence score for the predicted label
  • neighbor_predicted_label_ambiguity: Difference between the confidence of the top two predicted labels
  • current_label_neighbor_confidence: Confidence score for the memory’s current label
  • normalized_neighbor_label_entropy: Entropy of the label distribution (higher values indicate more uncertainty)
  • neighbor_predicted_label_matches_current_label: Whether the predicted label matches the current label

These metrics can help you identify memories that might be mislabeled or are in ambiguous regions of the embedding space.
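To build intuition for these metrics, here is an illustrative computation from a memory's neighbor labels (assumed formulas, not necessarily the SDK's exact ones): confidence is the vote share of the winning label, ambiguity the gap between the top two shares, and entropy is normalized by the log of the class count.

```python
import math
from collections import Counter

def neighbor_label_metrics(neighbor_labels, current_label, num_classes):
    """Illustrative versions of the per-memory label metrics (assumed formulas)."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    ranked = counts.most_common()
    predicted, top_count = ranked[0]
    confidence = top_count / total
    runner_up = ranked[1][1] / total if len(ranked) > 1 else 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return {
        "neighbor_predicted_label": predicted,
        "neighbor_predicted_label_confidence": confidence,
        "neighbor_predicted_label_ambiguity": confidence - runner_up,
        "current_label_neighbor_confidence": counts[current_label] / total,
        "normalized_neighbor_label_entropy": entropy / math.log(num_classes),
        "neighbor_predicted_label_matches_current_label": predicted == current_label,
    }

# Three of four neighbors say label 1, so the current label 0 looks suspect.
metrics = neighbor_label_metrics([1, 1, 1, 0], current_label=0, num_classes=2)
```

A memory like this one, whose neighbors strongly disagree with its current label, is exactly what the mislabel queries in the next section surface.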

Fixing Potential Mislabels#

To fix potentially mislabeled memories, you can use the display_label_analysis method to open a UI that allows you to review and update them.

memoryset.display_label_analysis()

You can also just fix all likely mislabeled memories by calling the update method in conjunction with the query method:

memoryset.update(
    {
        "memory_id": memory.memory_id,
        "label": memory.metrics["neighbor_predicted_label"], # (1)!
    }
    for memory in memoryset.query(
        filters=[
            ("metrics.neighbor_predicted_label_matches_current_label", "==", False), # (2)!
            ("metrics.neighbor_predicted_label_confidence", ">", 0.8), # (3)!
        ]
    )
)

  1. Update the label to the label that the model would predict based on the neighbors.
  2. Filter for memories where the analysis found a predicted label that does not match the current label.
  3. Filter for memories where the model is highly confident in the predicted label. You can adjust this threshold to control how aggressive the relabeling is. Higher thresholds will result in fewer but more confident relabeling suggestions.
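Before committing to a threshold, it can help to count how many memories each setting would relabel. A dependency-free sketch over hypothetical per-memory metric dictionaries:

```python
# Hypothetical per-memory metrics as produced by the label analysis.
memories = [
    {"neighbor_predicted_label_matches_current_label": False,
     "neighbor_predicted_label_confidence": 0.95},
    {"neighbor_predicted_label_matches_current_label": False,
     "neighbor_predicted_label_confidence": 0.70},
    {"neighbor_predicted_label_matches_current_label": True,
     "neighbor_predicted_label_confidence": 0.99},
]

def relabel_count(metrics_list, threshold):
    # Count memories whose predicted label disagrees with the current label
    # and whose prediction confidence clears the threshold.
    return sum(
        1
        for m in metrics_list
        if not m["neighbor_predicted_label_matches_current_label"]
        and m["neighbor_predicted_label_confidence"] > threshold
    )

for threshold in (0.6, 0.8, 0.9):
    print(threshold, relabel_count(memories, threshold))
```

Sweeping the threshold this way shows how quickly the relabeled set shrinks as you demand more confidence, which makes picking a value less of a guess.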

Optimizing Memory Distribution#

After fixing any potentially mislabeled memories, you can use the metrics from the label analysis to identify areas where:

  • You need more memories (low confidence regions) (1)
  • You have conflicting memories (high entropy regions) (2)
  • You have redundant memories (high density regions with consistent labels) (3)
  1. Sample query:
    lonely_memories = memoryset.query(
        filters=[("metrics.neighbor_predicted_label_confidence", "<", 0.3)]
    )
    
  2. Sample query:
    ambiguous_memories = memoryset.query(
        filters=[("metrics.normalized_neighbor_label_entropy", ">", 0.6)]
    )
    
  3. Sample query:
    redundant_memories = memoryset.query(
        filters=[
            ("metrics.neighbor_predicted_label_matches_current_label", "==", True),
            ("metrics.neighbor_predicted_label_confidence", ">", 0.9),
        ]
    )
    

This will help you create an effective memoryset that has a good distribution of memories across the embedding space with clear decision boundaries between different labels.
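The three queries above can be mirrored locally to triage memories into buckets. The thresholds and metric names below match the sample queries but are only illustrative; tune them for your data.

```python
def triage(metrics):
    """Assign a memory's metrics dict to a distribution bucket (illustrative thresholds)."""
    if metrics["neighbor_predicted_label_confidence"] < 0.3:
        return "lonely"     # low-confidence region: add more memories here
    if metrics["normalized_neighbor_label_entropy"] > 0.6:
        return "ambiguous"  # conflicting labels nearby: review these
    if (
        metrics["neighbor_predicted_label_matches_current_label"]
        and metrics["neighbor_predicted_label_confidence"] > 0.9
    ):
        return "redundant"  # dense, consistent region: candidate for pruning
    return "ok"

sample = {
    "neighbor_predicted_label_confidence": 0.95,
    "normalized_neighbor_label_entropy": 0.1,
    "neighbor_predicted_label_matches_current_label": True,
}
print(triage(sample))  # → redundant
```

Counting memories per bucket after each curation pass gives a quick signal of whether the distribution is improving.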

Upcoming Enhancements#

Orca is continuously improving its memory curation capabilities. Future enhancements will include:

  • Automated memory generation to fill gaps in your memoryset
  • Active learning workflows to prioritize which memories to label
  • Drift detection to identify when your production data is diverging from your memoryset