
Memory Curation#

This guide will show how to use Orca’s memoryset analysis tools to curate memories for your retrieval-augmented models.

What is Memory Curation?#

The quality of memories in your memoryset directly impacts the performance of your retrieval-augmented models. Memory curation is the process of analyzing and refining your memories to ensure your model makes accurate predictions.

Orca provides tools to help you identify and address common issues in your memorysets, such as removing duplicate memories, identifying potentially mislabeled examples, and optimizing the distribution of memories across the embedding space.

Memory curation with the Orca SDK generally consists of two steps:

  1. Run an analysis on the memoryset that stores results in the metrics attribute of each memory.
  2. Use the results to identify and address issues in your memoryset.

Removing Duplicate Memories#

Duplicate memories in your memoryset waste storage space, slow down retrieval, and potentially bias your model by giving more weight to repeated examples. The find_duplicates method helps you identify and manage duplicate memories:

memoryset.find_duplicates()
{ "num_duplicates": 10 }

This method will mark memories as duplicates based on exact value matches and store this information in the metrics attribute of each memory. For each duplicate, the duplicate_memory_ids field contains the IDs of other memories with the same value.
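To make the stored result concrete, here is a minimal client-side sketch of exact-value duplicate grouping with toy records. This is an illustration of the concept, not Orca's implementation — in particular, whether the real method flags every copy or all but one canonical copy is not shown here:

```python
from collections import defaultdict

# Toy stand-ins for memoryset rows: (memory_id, value) pairs
memories = [
    ("m1", "great product"),
    ("m2", "terrible service"),
    ("m3", "great product"),  # exact duplicate of m1
]

# Group memory IDs by exact value match
ids_by_value = defaultdict(list)
for memory_id, value in memories:
    ids_by_value[value].append(memory_id)

# For each memory, record the other memories that share its value,
# mirroring the is_duplicate / duplicate_memory_ids metrics fields
metrics = {}
for ids in ids_by_value.values():
    for memory_id in ids:
        others = [other for other in ids if other != memory_id]
        metrics[memory_id] = {
            "is_duplicate": bool(others),
            "duplicate_memory_ids": others,
        }
```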

To delete the duplicate memories, we can combine the delete method with the query method:

memoryset.delete(
    m.memory_id for m in memoryset.query(
        filters=[("metrics.is_duplicate", "==", True)]
    )
)

Analyzing Label Consistency#

Mislabeled memories can significantly impact model performance. The analyze_labels method helps you identify potentially mislabeled memories by analyzing the labels of semantically similar memories:

memoryset.analyze_labels(neighbor_count=10) # (1)!

  1. The neighbor_count parameter controls the number of neighbors to use for the analysis.
{"label_metrics": [
    {
        "label": 0,
        "label_name": "negative",
        "average_lookup_score": 0.95,
        "memory_count": 100,
    }, {
        "label": 1,
        "label_name": "positive",
        "average_lookup_score": 0.90,
        "memory_count": 100,
    }
]}

The method returns some aggregate metrics for each label class, but the most important information is stored in the metrics attribute of each memory. The following metrics are computed for each memory:

  • neighbor_predicted_label: The label that would be predicted based on neighboring memories
  • neighbor_predicted_label_confidence: Confidence score for the predicted label
  • neighbor_predicted_label_ambiguity: Difference between the confidence of the top two predicted labels
  • current_label_neighbor_confidence: Confidence score for the memory’s current label
  • normalized_neighbor_label_entropy: Entropy of the label distribution (higher values indicate more uncertainty)
  • neighbor_predicted_label_matches_current_label: Whether the predicted label matches the current label

These metrics can help you identify memories that might be mislabeled or are in ambiguous regions of the embedding space.
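The metrics above can be understood as simple statistics over the labels of a memory's nearest neighbors. The following sketch derives them from a list of neighbor labels using majority voting and normalized Shannon entropy; Orca computes these server-side, and its exact formulas are not documented here, so treat the definitions below as an assumption:

```python
import math
from collections import Counter


def neighbor_label_metrics(neighbor_labels, current_label):
    """Assumed reconstruction of per-memory label metrics from neighbor labels."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    ranked = counts.most_common()
    predicted, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    # Entropy of the neighbor label distribution, normalized to [0, 1]
    probs = [count / total for count in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(counts)) if len(counts) > 1 else 1.0

    return {
        "neighbor_predicted_label": predicted,
        "neighbor_predicted_label_confidence": top_count / total,
        "neighbor_predicted_label_ambiguity": (top_count - runner_up_count) / total,
        "current_label_neighbor_confidence": counts[current_label] / total,
        "normalized_neighbor_label_entropy": entropy / max_entropy,
        "neighbor_predicted_label_matches_current_label": predicted == current_label,
    }
```

For example, a memory labeled 0 whose five nearest neighbors are labeled `[1, 1, 1, 0, 1]` would get a predicted label of 1 with confidence 0.8 and a mismatch flag, making it a relabeling candidate.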

Fixing Potential Mislabels#

To fix potentially mislabeled memories, you can use the display_label_analysis method to open a UI that allows you to review and update potentially mislabeled memories.

memoryset.display_label_analysis()

You can also fix all likely mislabeled memories in bulk by calling the update method in conjunction with the query method:

memoryset.update(
    {
        "memory_id": memory.memory_id,
        "label": memory.metrics.neighbor_predicted_label, # (1)!
    }
    for memory in memoryset.query(
        filters=[
            ("metrics.neighbor_predicted_label_matches_current_label", "==", False), # (2)!
            ("metrics.neighbor_predicted_label_confidence", ">", 0.8), # (3)!
        ]
    )
)

  1. Update the label to the label that the model would predict based on the neighbors.
  2. Filter for memories where the analysis found a predicted label that does not match the current label.
  3. Filter for memories where the model is highly confident in the predicted label. You can adjust this threshold to control how aggressive the relabeling is. Higher thresholds will result in fewer but more confident relabeling suggestions.
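Before running a bulk update, it can be worth previewing which memories the filters would select. The sketch below applies the same two conditions client-side to toy records (the records are stand-ins, not real query results):

```python
# Toy records standing in for memoryset.query results
memories = [
    {"memory_id": "m1", "metrics": {
        "neighbor_predicted_label": 1,
        "neighbor_predicted_label_matches_current_label": False,
        "neighbor_predicted_label_confidence": 0.92,
    }},
    {"memory_id": "m2", "metrics": {
        "neighbor_predicted_label": 0,
        "neighbor_predicted_label_matches_current_label": False,
        "neighbor_predicted_label_confidence": 0.55,  # below threshold, left as-is
    }},
    {"memory_id": "m3", "metrics": {
        "neighbor_predicted_label": 1,
        "neighbor_predicted_label_matches_current_label": True,  # already consistent
        "neighbor_predicted_label_confidence": 0.97,
    }},
]

# Same conditions as the query filters above, applied client-side
updates = [
    {"memory_id": m["memory_id"], "label": m["metrics"]["neighbor_predicted_label"]}
    for m in memories
    if not m["metrics"]["neighbor_predicted_label_matches_current_label"]
    and m["metrics"]["neighbor_predicted_label_confidence"] > 0.8
]
```

Only `m1` passes both filters, so only its label would be rewritten.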

Optimizing Memory Distribution#

After fixing any potentially mislabeled memories, you can use the metrics from the label analysis to identify areas where:

  • You need more memories (low confidence regions) (1)
  • You have conflicting memories (high entropy regions) (2)
  • You have redundant memories (high density regions with consistent labels) (3)
  1. Sample query:
    lonely_memories = memoryset.query(
        filters=[("metrics.neighbor_predicted_label_confidence", "<", 0.3)]
    )
    
  2. Sample query:
    ambiguous_memories = memoryset.query(
        filters=[("metrics.normalized_neighbor_label_entropy", ">", 0.6)]
    )
    
  3. Sample query:
    redundant_memories = memoryset.query(
        filters=[
            ("metrics.neighbor_predicted_label_matches_current_label", "==", True),
            ("metrics.neighbor_predicted_label_confidence", ">", 0.9),
        ]
    )
    
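The three sample queries can also be mirrored client-side to sort memories into curation buckets. A sketch using the same metrics, with illustrative thresholds (the cutoffs are examples, not values prescribed by Orca):

```python
def curation_bucket(metrics):
    """Classify a memory's metrics into a curation bucket (thresholds are illustrative)."""
    if metrics["neighbor_predicted_label_confidence"] < 0.3:
        return "lonely"  # low-confidence region: add more memories here
    if metrics["normalized_neighbor_label_entropy"] > 0.6:
        return "ambiguous"  # conflicting neighbors: review these labels
    if (metrics["neighbor_predicted_label_matches_current_label"]
            and metrics["neighbor_predicted_label_confidence"] > 0.9):
        return "redundant"  # dense, consistent region: candidates for pruning
    return "ok"
```

The checks are ordered so that sparse, low-confidence regions are surfaced before pruning decisions, since a memory in a lonely region should never be removed as redundant.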

Addressing these three cases helps you build a memoryset with good coverage of the embedding space and clear decision boundaries between labels.

Upcoming Enhancements#

Orca is continuously improving its memory curation capabilities. Future enhancements will include:

  • Automated memory generation to fill gaps in your memoryset
  • Clustering analysis to identify distinct groups within your data
  • Outlier detection to find anomalous memories
  • Active learning workflows to prioritize which memories to label
  • Drift detection to identify when your production data is diverging from your memoryset