Memories and Memorysets#

This guide dives into the details of how to work with memories and memorysets in OrcaCloud. You will learn what memories are, how to create a memoryset, how to lookup memories that are similar to a given query, and how to update or delete memories.

What are Memories?#

In the context of Orca, memories are additional data that your model uses to guide its predictions. Your model looks up relevant memories based on the input it receives and uses them to inform its output. Memories are stored in OrcaCloud and can thus be updated at any time, which lets you change the model's behavior without retraining or redeploying it. For more information about memories, check out our memories concept guide.

To interact with memories in Orca, you will use memorysets, which provide a high-level interface for storing, looking up, updating, and deleting memories. You can think of a memoryset as a table in the vector database where each row is a memory.

The memorysets store memories with the following properties:

  • value: value of the memory. (1)
  • embedding: embedding of the value of the memory for semantic search, automatically generated by the embedding model of the memoryset.
  • source_id: optional unique identifier of the memory in your system of reference (has to be a string).
  • metrics: metrics about the memory, generated when running an analysis on the memoryset.
  • memory_id: unique identifier for the memory, automatically generated on insert.
  • memory_version: version of the memory, automatically updated when the label or value changes.
  • ...: The memoryset can also contain additional properties, which are stored in a metadata dictionary but can also be accessed as individual attributes on the instance.
  1. The SDK currently only supports working with text memories. We have experimental support for images as well in the OrcaCloud. Please contact us if you have a use case for this.

Different types of memorysets store additional properties. For example, the LabeledMemoryset used throughout this guide also stores:

  • label: label of the memory
  • label_name: human-readable name of the label, automatically populated from the label names of the memoryset.
Label Names

label_names is a list of human-readable names for the labels in the memoryset. Its length must match the number of distinct labels in the datasource, and the index of each name must match its label value. If the datasource is created from a Hugging Face Dataset with a ClassLabel feature for labels, the label names are inferred from that. Otherwise, the label names must be provided manually.
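The index convention can be sketched in plain Python (an illustration, not SDK code; `label_name_for` is a name invented here):

```python
# Illustrative sketch (not SDK code) of the label_names convention:
# the index of each name is the integer label value it names.
label_names = ["neg", "pos"]

def label_name_for(label: int) -> str:
    if not 0 <= label < len(label_names):
        raise ValueError(f"label {label} has no entry in label_names")
    return label_names[label]

name = label_name_for(0)  # "neg"
```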

memoryset = LabeledMemoryset.from_dict(
    "my_memoryset",
    {
        "text": ["I love this movie", "This movie is bad"],
        "label": [1, 0],
    },
    label_names=["neg", "pos"],
    value_column="text",
)
memoryset[0:2]
[LabeledMemory({ label: <pos: 1>, value: 'I love this movie' }),
 LabeledMemory({ label: <neg: 0>, value: 'This movie is bad' })]

Create a Memoryset#

In this guide we will use the LabeledMemoryset, which is a memoryset that stores labels for classification tasks, as an example. The memoryset will automatically generate embeddings for your memories using the embedding model you specify.

from orca_sdk import Datasource, LabeledMemoryset, PretrainedEmbeddingModel

memoryset = LabeledMemoryset.create(
    "imdb_reviews", # (1)!
    Datasource.from_dict({ # (2)!
        "text": ["I love this movie", "This movie is bad"],
        "label": [1, 0],
        "external_id": ["123", "456"]
    }),
    embedding_model=PretrainedEmbeddingModel.GTE_BASE,  # (3)!
    value_column="text",  # (4)!
    label_column="label",  # (5)!
    source_id_column="external_id", # (6)!
    label_names=["neg", "pos"], # (7)!
    max_seq_length_override=75, # (8)!
    if_exists="open", # (9)!
)
  1. Name of the memoryset in the OrcaCloud that will store the memories.
  2. Datasource that contains the memories to store in the memoryset.
  3. Embedding model that will be used to embed the memories for semantic search.
  4. Name of the column in the datasource that contains the memory values (e.g. text). Will default to "value" if not specified.
  5. Name of the column in the datasource that contains the associated labels. Will default to "label" if not specified.
  6. Optional name of the column in the datasource that contains the external source IDs.
  7. List of human-readable names for the labels in the memoryset, must match the number of labels in the datasource in which the index of the label name matches the label value. If the datasource contains a ClassLabel feature for labels, the label names will be inferred from that.
  8. Maximum sequence length for the embedding model.
  9. What to do if a memoryset with the same name already exists; defaults to "error". Pass "open" to open the existing memoryset instead.

Above we create a LabeledMemoryset from a Datasource. Additionally, the Orca SDK provides a number of convenience methods so you can create one directly from a Hugging Face or PyTorch Dataset, list, column dictionary, pandas DataFrame, pyarrow Table, or local file. All of these methods create the Datasource under the hood and then create the LabeledMemoryset from it.

from datasets import ClassLabel, Dataset

dataset = Dataset.from_dict(
    {
        "text": ["I love this movie", "This movie is bad"],
        "label": [1, 0]
    }
).cast_column("label", ClassLabel(num_classes=2, names=["neg", "pos"]))

memoryset = LabeledMemoryset.from_hf_dataset(
    "imdb_reviews_from_hf_dataset",
    dataset,
    embedding_model=PretrainedEmbeddingModel.GTE_BASE,
    value_column="text",
)

Since this dataset contains a ClassLabel feature for labels, the label names will be inferred from the dataset.

from torch.utils.data import Dataset

class PytorchTupleDataset(Dataset):
    def __init__(self):
        self.data = [
            {"text": "i love this movie", "label": 1},
            {"text": "this movie is bad", "label": 0},
        ]

    def __getitem__(self, i):
        return self.data[i]["text"], self.data[i]["label"]

    def __len__(self):
        return len(self.data)

dataset = PytorchTupleDataset()

memoryset = LabeledMemoryset.from_pytorch(
    "imdb_reviews_from_pytorch_dataset",
    dataset, # (1)!
    embedding_model=PretrainedEmbeddingModel.CLIP_BASE,
    column_names=["value", "label"], # (2)!
    label_names=["neg", "pos"],
)
  1. This also supports DataLoader objects.
  2. If the provided dataset or data loader returns unnamed tuples, this argument must be provided to specify the names of the columns.
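The column_names pairing above can be sketched in plain Python (an illustration, not the SDK's internals):

```python
# Illustrative sketch (not the SDK's internals) of how column_names could be
# paired with the unnamed tuples yielded by a PyTorch-style dataset.
column_names = ["value", "label"]
rows = [("i love this movie", 1), ("this movie is bad", 0)]

# Zip each tuple with the column names to build named records.
records = [dict(zip(column_names, row)) for row in rows]
```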
data = [
    {"text": "i love this movie", "label": 1},
    {"text": "this movie is bad", "label": 0},
]

memoryset = LabeledMemoryset.from_list(
    "imdb_reviews_from_list",
    data,
    embedding_model=PretrainedEmbeddingModel.CDE_SMALL,
    label_names=["neg", "pos"],
    value_column="text",
)
data = {
    "value": ["i love this movie", "this movie is bad"],
    "label": [1, 0],
    "external_id": ["123", "456"],
}

memoryset = LabeledMemoryset.from_dict(
    "imdb_reviews_from_dict",
    data,
    embedding_model=PretrainedEmbeddingModel.CDE_SMALL,
    label_names=["neg", "pos"],
    source_id_column="external_id",
)
from pandas import DataFrame

df = DataFrame({
    "value": ["i love this movie", "this movie is bad"],
    "label": [1, 0],
})

memoryset = LabeledMemoryset.from_pandas(
    "imdb_reviews_from_pandas",
    df,
    embedding_model=PretrainedEmbeddingModel.GTE_BASE,
    label_names=["neg", "pos"],
)
from pyarrow import Table

table = Table.from_arrays(
    [["i love this movie", "this movie is bad"], [1, 0]],
    names=["value", "label"],
)

memoryset = LabeledMemoryset.from_arrow(
    "imdb_reviews_from_pyarrow",
    table,
    embedding_model=PretrainedEmbeddingModel.CLIP_BASE,
    label_names=["neg", "pos"],
    if_exists="open",
)
memoryset = LabeledMemoryset.from_disk(
    "imdb_reviews_from_csv",
    "imdb_reviews.csv", # (1)!
    embedding_model=PretrainedEmbeddingModel.CLIP_BASE,
)
  1. Path to the local file to create the memoryset from. The file type will be inferred from the file extension. We support:

    • Pickle files (.pkl)
    • JSON and JSON Lines files (.json, .jsonl)
    • CSV files (.csv)
    • Parquet files (.parquet)
    • a directory containing a saved Hugging Face Dataset
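The extension-based inference described above can be sketched in plain Python (an illustration, not the SDK's actual logic; `infer_format` is a name invented here):

```python
from pathlib import Path

# Illustrative sketch (not the SDK's actual logic) of inferring the file
# format from the extension, as from_disk is described to do above.
SUPPORTED = {
    ".pkl": "pickle",
    ".json": "json",
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
}

def infer_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported file type: {suffix}")
    return SUPPORTED[suffix]
```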

Open an Existing Memoryset#

If you already have a memoryset in the OrcaCloud, you can open it by using the LabeledMemoryset.open method:

LabeledMemoryset.open("imdb_reviews")
LabeledMemoryset({
    name: 'imdb_reviews',
    length: 10,
    label_names: ['neg', 'pos'],
    embedding_model: PretrainedEmbeddingModel({name: GTE_BASE, embedding_dim: 768, max_seq_length: 8192}),
})

This will give you a handle to an existing memoryset that you can use to interact with the memoryset and the memories in the memoryset.

List all Memorysets#

You can list all memorysets in your OrcaCloud by using the LabeledMemoryset.all method:

LabeledMemoryset.all()

This will return a list of handles to all memorysets in your OrcaCloud.

Delete a Memoryset#

You can delete a memoryset by using the LabeledMemoryset.drop method:

LabeledMemoryset.drop(
    "imdb_reviews_from_csv",  # (1)!
    if_not_exists="error" # (2)!
)
  1. The name or ID of the memoryset to drop.
  2. What to do if the memoryset does not exist, defaults to "error". Other option is "ignore" to do nothing if the memoryset does not exist.

This will delete the memoryset from the OrcaCloud. If the memoryset does not exist, it will raise a LookupError. You can also specify the if_not_exists parameter as "ignore" if you do not wish to raise an error.

Clone a Memoryset#

You can clone a memoryset by using the clone method and optionally change the embedding model used to embed the memories:

memoryset.clone(
    "memoryset_cde_small",
    embedding_model=PretrainedEmbeddingModel.CDE_SMALL,
)
LabeledMemoryset({
    name: 'memoryset_cde_small',
    length: 10,
    label_names: ['neg', 'pos'],
    embedding_model: PretrainedEmbeddingModel({name: CDE_SMALL, embedding_dim: 768, max_seq_length: 512}),
})

This will create a new memoryset with the same memories as the original one but a different embedding model, and return a handle to the new memoryset.

Filter Memories#

You can filter memories by label, value, source_id, or metadata columns using the query method:

memoryset.query(
    offset=0,
    limit=2,
    filters=[("label", "==", 0)]
)
[LabeledMemory({ label: <neg: 0>, value: 'This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good...' }),
 LabeledMemory({ label: <neg: 0>, value: 'This movie spends most of its time preaching that it is the script that makes the movie, but apparen...' })]

The filters parameter takes a list of tuples. Each tuple contains a column name, a comparison operator, and a value. We support filtering on the value, label, source_id, and custom metadata columns. The comparison operator can be one of the following: ==, !=, >, >=, <, <=, in, not in, like. Please see the FilterItemTuple documentation for more details.

Some examples of valid filters:

("label", "==", 0),
("value", "like", "good movie"),
("source_id", "!=", "123"),
("random_metadata_column", "in", ["value1", "value2"]),
("another_metadata_column", ">", 10),
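The filter-tuple semantics can be sketched against plain Python dictionaries (an illustration, not the server-side implementation; in particular, "like" is assumed here to mean a substring match):

```python
import operator

# Illustrative sketch of the (column, operator, value) filter-tuple semantics.
# Not the server-side implementation; "like" is assumed to be substring match.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
    "like": lambda a, b: b in a,
}

def matches(row: dict, filters: list[tuple]) -> bool:
    # A row matches when every (column, operator, value) tuple holds.
    return all(OPS[op](row[col], value) for col, op, value in filters)

rows = [
    {"label": 0, "value": "a good movie"},
    {"label": 1, "value": "a bad movie"},
]
matched = [row for row in rows if matches(row, [("label", "==", 0)])]
```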

Look up Relevant Memories#

The main purpose of a memoryset is to enable efficiently looking up memories that are similar to a given query (typically an input to a model). You can use the search method for this:

memoryset.search("Is this a good movie?", count=1)
[LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.58, value: 'I was truly and wonderfully surprised at "O\' Brother, Where Art Thou?" The video store was out of al...' })]

The search method takes a single query or a list of queries and is automatically batched for efficiency.

The result is a list of LabeledMemoryLookup objects that contain the memory properties and an additional lookup_score property with a score between 0 and 1 that indicates the similarity between the query and the memory.
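One way to build intuition for such a score is cosine similarity between the query embedding and the memory embedding. This is only an illustration, not necessarily the exact formula OrcaCloud uses for lookup_score:

```python
import math

# Illustrative only: a lookup score can be thought of as a similarity between
# the query embedding and the memory embedding. This sketch uses cosine
# similarity; note that for arbitrary vectors it lies in [-1, 1], while the
# non-negative example vectors below yield a value in [0, 1].
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_embedding = [0.1, 0.8, 0.3]
memory_embedding = [0.2, 0.7, 0.1]
score = cosine_similarity(query_embedding, memory_embedding)
```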

Get Memories#

If you already have the memory_ids of the memories you want to retrieve, you can use the get method:

# Get a single memory
memoryset.get("5fb9521a-d3c2-430f-b43a-f51ff92643de")
LabeledMemory({ label: <neg: 0>, value: 'This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good...' })
# Get multiple memories
memoryset.get(["5fb9521a-d3c2-430f-b43a-f51ff92643de", "01954998-a3fe-7a36-a017-9561706c8310"])
[LabeledMemory({ label: <neg: 0>, value: 'This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good...' }),
 LabeledMemory({ label: <neg: 0>, value: 'This movie spends most of its time preaching that it is the script that makes the movie, but apparen...' })]

This will return a single LabeledMemory or a list of LabeledMemory objects that match the provided memory_id(s) depending on the input type.
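The single-vs-list convention can be sketched in plain Python (an illustration against an in-memory dict, not the SDK's internals):

```python
# Illustrative sketch (not SDK internals) of the single-vs-list convention:
# a single ID returns one record, a list of IDs returns a list of records.
def get(store: dict, ids):
    if isinstance(ids, str):
        return store[ids]
    return [store[i] for i in ids]

store = {"id-1": "memory one", "id-2": "memory two"}
single = get(store, "id-1")
several = get(store, ["id-1", "id-2"])
```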

You can also get a memory by index or slice:

memoryset[0]
LabeledMemory({ label: <neg: 0>, value: 'This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good...' })
memoryset[0:2]
[LabeledMemory({ label: <neg: 0>, value: 'This movie was so frustrating. Everything seemed energetic and I was totally prepared to have a good...' }),
 LabeledMemory({ label: <neg: 0>, value: 'This movie spends most of its time preaching that it is the script that makes the movie, but apparen...' })]

Insert Memories#

You can insert additional memories into an existing memoryset by using the insert method:

memoryset.insert([
    {
        "value": "I love this movie",
        "label": 1,
        "source_id": "tt0109830",
        "title": "Forrest Gump"
    },
]) # (1)!
  1. This method takes a list of dictionaries with value, label, and optionally source_id keys. Any other key/value pairs in the dictionaries will be stored as metadata.

This will insert the memories into the memoryset and refresh the memoryset handle.
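The split between core fields and free-form metadata described above can be sketched in plain Python (an illustration, not the SDK's internals; `CORE_FIELDS` and `split_record` are names invented here):

```python
# Illustrative sketch (not SDK internals): any keys beyond value, label, and
# source_id are stored as metadata, as described above.
CORE_FIELDS = {"value", "label", "source_id"}

def split_record(record: dict) -> tuple[dict, dict]:
    core = {k: v for k, v in record.items() if k in CORE_FIELDS}
    metadata = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    return core, metadata

core, metadata = split_record({
    "value": "I love this movie",
    "label": 1,
    "source_id": "tt0109830",
    "title": "Forrest Gump",
})
```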

Update#

You can update a memory in the memoryset by using the update method. You have to provide the memory_id of the memory you want to update and any keys you want to update.

memoryset.update([{
    "memory_id": "5fb9521a-d3c2-430f-b43a-f51ff92643de",
    "title": "Forrest Gump (1994)" # (1)!
}])
  1. Update the title of the memory.

This will update the memory in the memoryset and return the updated memory. You can also update multiple memories at once by providing a list of dictionaries with the memory_id and the keys you want to update.

memoryset.update([
    {
        "memory_id": "01954998-a3fe-7a36-a017-9561706c8310",
        "label": 0
    }, # (1)!
    {
        "memory_id": "01954998-cec4-7096-8de1-f8dbba5ced48",
        "value": "Not a great movie",
        "label": 0
    }, # (2)!
])
  1. Update the label of the first memory.
  2. Update the value and label of the second memory.

This will update the memory or memories in the memoryset and return the updated memory or memories.

If you have an instance of a LabeledMemory, you can also update it by using the update method:

memoryset[0].update(value="I love this movie", label=1, source_id="tt0111161")
LabeledMemory({ label: <pos: 1>, value: 'I love this movie' })

Delete Memories#

You can delete a memory in the memoryset by using the delete method:

memoryset.delete("5fb9521a-d3c2-430f-b43a-f51ff92643de")

This will delete the memory or memories in the memoryset and refresh the memoryset handle. You can also delete multiple memories at once by providing any iterable of memory_ids:

memoryset.delete(
    m.memory_id for m in memoryset.query(
        filters=[("metrics.is_duplicate", "==", True)]
    )
)