orcalib.memoryset.memoryset#

LabeledMemoryset #

LabeledMemoryset(
    uri=None,
    api_key=None,
    secret_key=None,
    database=None,
    table=None,
    embedding_model=EmbeddingModel.GTE_BASE,
    reranker=None,
)

Collection of memories with labels that are stored in an OrcaDB table and can be queried using embedding similarity search.

Note

This creates the database, and the table in it, if they do not exist yet.

Parameters:

  • uri (str | None, default: None ) –

    URL of the database that should store the memories table, or the name of the memories table. Either a file URL or the URL of a hosted OrcaDB instance is accepted. If omitted, the ORCADB_URL environment variable is used instead. If a string that is not a URL is provided, it is interpreted as the name of the table to create in the database specified by the ORCADB_URL environment variable.

  • api_key (str | None, default: None ) –

    API key for the OrcaDB instance. If not provided, the ORCADB_API_KEY environment variable or the credentials encoded in the uri are used.

  • secret_key (str | None, default: None ) –

    Secret key for the OrcaDB instance. If not provided, the ORCADB_SECRET_KEY environment variable or the credentials encoded in the uri are used.

  • database (str | None, default: None ) –

    Name of the database. Do not provide this if it is already encoded in the uri.

  • table (str | None, default: None ) –

    Name of the table. Do not provide this if it is already encoded in the uri.

  • embedding_model (EmbeddingModel, default: GTE_BASE ) –

    Embedding model to use for semantic similarity search.

  • reranker (Reranker | None, default: None ) –

    Optional reranking model to use during lookup.

Examples:

Infer connection details from the ORCADB_URL, ORCADB_API_KEY, and ORCADB_SECRET_KEY environment variables:

>>> import os
>>> os.environ["ORCADB_URL"] = "https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db"
>>> LabeledMemoryset()
LabeledMemoryset(table="memories", database="my-db")
>>> LabeledMemoryset("my_memories_table")
LabeledMemoryset(table="my_memories_table", database="my-db")

All connection details can be fully encoded in the uri:

>>> LabeledMemoryset("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db/my-memories-table")
LabeledMemoryset(table="my-memories-table", database="my-db")

Or they can be provided explicitly:

>>> LabeledMemoryset(
...    "https://instance.orcadb.cloud",
...    api_key="my-api-key",
...    secret_key="my-secret-key",
...    database="my-db",
...    table="my-memories-table"
... )
LabeledMemoryset(table="my-memories-table", database="my-db")
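
To illustrate how a fully encoded uri like the one above carries all connection details, here is a minimal sketch using only the standard library. The helper `split_connection_uri` is hypothetical and not part of orcalib; it assumes the `https://<api-key>:<secret-key>@<host>/<database>/<table>` layout shown in the example and does not reflect how the library actually parses URLs.

```python
from urllib.parse import urlsplit


def split_connection_uri(uri):
    """Hypothetical helper: split a hosted-instance URL into its parts."""
    parts = urlsplit(uri)
    # Path segments after the host: assumed to be database, then optional table
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "host": parts.hostname,
        "api_key": parts.username,      # None when no credentials are encoded
        "secret_key": parts.password,   # None when no credentials are encoded
        "database": segments[0] if len(segments) > 0 else None,
        "table": segments[1] if len(segments) > 1 else None,
    }


details = split_connection_uri(
    "https://my-api-key:my-secret-key@instance.orcadb.cloud/my-db/my-memories-table"
)
bare = split_connection_uri("https://instance.orcadb.cloud")
```

Any part that comes back as `None` in this sketch corresponds to a value that would have to be supplied through the matching constructor argument or environment variable.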

insert #

insert(dataset, log=True)

Inserts a dataset into the LabeledMemoryset database.

For dict-like or list of dict-like datasets, there must be a label key and one of the following keys: text, image, or value. If there are only two keys and one is label, the other will be inferred to be value.

For list-like datasets, the first element of each tuple must be the value and the second must be the label.
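
The two accepted record shapes can be sketched as a small normalization step. The helper `normalize_record` below is hypothetical and only mirrors the rules described above; the order in which the text, image, and value keys are checked is an assumption.

```python
def normalize_record(record):
    """Hypothetical helper: turn one dataset item into a (value, label) pair."""
    if isinstance(record, dict):
        if "label" not in record:
            raise ValueError("dict-like records need a 'label' key")
        # Check the documented value keys (the order here is an assumption)
        for key in ("text", "image", "value"):
            if key in record:
                return record[key], record["label"]
        # Exactly two keys, one of them 'label': treat the other as the value
        if len(record) == 2:
            (other_key,) = [k for k in record if k != "label"]
            return record[other_key], record["label"]
        raise ValueError("could not infer which key holds the value")
    # List-like record: value first, label second
    value, label = record
    return value, label


pairs = [
    normalize_record(item)
    for item in [
        {"text": "text 1", "label": 0},
        {"review": "text 2", "label": 1},  # only two keys, so 'review' is the value
        ("text 3", 0),
    ]
]
```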

Parameters:

  • dataset (DatasetLike) –

    Data to insert into the memoryset.

  • log (bool, default: True ) –

    Whether to show a progress bar and log messages.

Examples:

Example 1: Inserting a dictionary-like dataset#

>>> dataset = [{
...    "text": "text 1",
...    "label": 0
... }]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 2: Inserting a list-like dataset#

>>> dataset = [
...    ("text 1", 0),
...    ("text 2", 1)
... ]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 3: Inserting a Hugging Face Dataset#

>>> from datasets import load_dataset
>>> dataset = load_dataset("frgfm/imagenette", "320px")
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

lookup #

lookup(
    query,
    *,
    column_oriented=False,
    k=1,
    batch_size=32,
    run_ids=None,
    rerank=None,
    log=False
)

Retrieves the memories most similar to the query from the memoryset.

Parameters:

  • query (InputType | list[InputType] | ndarray) –

    The query to retrieve memories for. Can be a single value, a list of values, or a numpy array with value embeddings.

  • k (int, default: 1 ) –

    The number of memories to retrieve.

  • batch_size (int, default: 32 ) –

    The number of queries to process at a time.

  • run_ids (list[int] | None, default: None ) –

    A list of run IDs to track with the lookup.

  • rerank (bool | None, default: None ) –

    Whether to rerank the results. If None (default), results will be reranked if a reranker is attached to the Memoryset.

  • log (bool, default: False ) –

    Whether to log the lookup process and show progress bars.

Returns:

Examples:

Example 1: Retrieving the most similar memory to a single example#

>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> query = "Apple"
>>> memoryset.lookup(query, k=1)
[
    [
        LabeledMemoryLookup(
            value='Orange',
            memory_id=12,
            memory_version=1,
            label=0,
            label_name='fruit',
            embedding=array([...], dtype=float32),
            metadata=None,
            lookup_score=0.98,
            reranker_score=None,
            reranker_embedding=None
        )
    ]
]
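
As a rough illustration of what an embedding similarity lookup does, the following sketch ranks a few toy memories by cosine similarity and keeps the top k per query. The function names, the toy embeddings, and the choice of cosine similarity are all assumptions made for illustration; the metric and index actually used by the memoryset are not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding, memories, k=1):
    """Return the k best (score, value, label) triples for one query embedding."""
    scored = [
        (cosine_similarity(query_embedding, embedding), value, label)
        for value, label, embedding in memories
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy memories as (value, label, embedding); real embeddings come from a model
memories = [
    ("Orange", 0, [0.9, 0.1]),
    ("Hammer", 1, [0.1, 0.9]),
]

# One result list per query, matching the nested list shape in the example above
query_embeddings = [[1.0, 0.0]]
results = [top_k(query, memories, k=1) for query in query_embeddings]
```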

to_list #

to_list(limit=None)

Get a list of all the memories in the memoryset.

Returns:

to_pandas #

to_pandas(limit=None)

Get a DataFrame representation of the memoryset.

Returns:

  • DataFrame

    DataFrame containing the memories

update_embedding_model #

update_embedding_model(embedding_model, destination=None)

Updates the embedding model for the memoryset and re-embeds all memories, either in the current memoryset or in a new destination memoryset if one is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • embedding_model (EmbeddingModel) –

    New embedding model to use.

  • destination (LabeledMemoryset | str | None, default: None ) –

    Destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset will be updated.

Returns:

Examples:

Replace the embedding model for the current memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE)

Create a new memoryset with a new embedding model:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> new_memoryset = memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE, "my_new_memoryset")

clone #

clone(destination)

Clone the current memoryset into a new memoryset.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • destination (LabeledMemoryset | str) –

    The destination memoryset to clone this memoryset into. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

  • LabeledMemoryset

    The destination memoryset that the memories were cloned into.

Examples:

Clone a local memoryset into a hosted database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-database#my_memoryset")

Clone a local memoryset into a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("my_new_memoryset")

map #

map(fn, destination=None)

Apply a function to all the memories in the memoryset and store them in the current memoryset or a new destination memoryset if it is provided.

Note

If the function returns a value for a column that already exists, the existing value is overwritten.

Parameters:

  • fn (Callable[[LabeledMemory], dict[str, Any] | LabeledMemory]) –

    Function that takes in the memory and returns a new memory or a dictionary containing the values to update in the memory.

  • destination (LabeledMemoryset | str | None, default: None ) –

    The destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset is updated.

Returns:

Examples:

Add new metadata to all memories in the memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.map(lambda m: dict(metadata=dict(**(m.metadata or {}), new_key="new_value")))

Create a new memoryset with swapped labels in a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> swapped_memoryset = memoryset.map(
...     lambda m: dict(label=1 if m.label == 0 else 0),
...     "my_swapped_memoryset"
... )
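
The update semantics of map can be sketched in plain Python: a returned dictionary is merged into the memory and overwrites existing fields, while a returned memory replaces the original. The `Memory` dataclass and the `apply_map` helper are hypothetical stand-ins, not orcalib types.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Memory:
    """Hypothetical stand-in for a labeled memory."""
    value: str
    label: int
    metadata: dict = field(default_factory=dict)


def apply_map(memories, fn):
    """Apply fn to each memory and return the updated memories."""
    updated = []
    for memory in memories:
        result = fn(memory)
        if isinstance(result, dict):
            # Dictionary result: overwrite only the returned fields
            memory = replace(memory, **result)
        else:
            # Memory result: use it as the new memory
            memory = result
        updated.append(memory)
    return updated


memories = [Memory("text 1", 0), Memory("text 2", 1)]
swapped = apply_map(memories, lambda m: dict(label=1 if m.label == 0 else 0))
tagged = apply_map(
    memories, lambda m: dict(metadata={**m.metadata, "new_key": "new_value"})
)
```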

filter #

filter(fn, destination=None)

Filters the current memoryset using the given function and stores the result in the current memoryset or a new destination memoryset if it is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • fn (Callable[[LabeledMemory], bool]) –

    Function that takes in the memory and returns a boolean indicating whether the memory should be included or not.

  • destination (LabeledMemoryset | str | None, default: None ) –

    The destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset is updated.

Returns:

Examples:

Filter out memories with a label of 0:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.filter(lambda m: m.label != 0)

Create a new memoryset, in a new table in the same database, that contains only the memories with a specific metadata value:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> filtered_memoryset = memoryset.filter(
...     lambda m: m.metadata["key"] == "filter_value",
...     "my_filtered_memoryset"
... )
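
The role of the destination argument, including the reset noted above, can be sketched with tables modeled as a plain dictionary. The `filter_into` helper and the dictionary-based storage are assumptions for illustration only and say nothing about how OrcaDB stores tables.

```python
def filter_into(tables, source, fn, destination=None):
    """Keep the memories of tables[source] for which fn is true.

    tables maps a table name to a list of memory dicts. The result is written
    to the destination table, or back to the source table when none is given.
    """
    kept = [memory for memory in tables[source] if fn(memory)]
    target = destination if destination is not None else source
    # Assignment replaces any previous contents, mirroring the reset note
    tables[target] = kept
    return target


tables = {
    "my_memoryset": [
        {"value": "text 1", "label": 0},
        {"value": "text 2", "label": 1},
    ],
    "my_filtered_memoryset": [{"value": "stale", "label": 1}],
}

# Write to a separate table: the source is untouched, old destination rows are gone
filter_into(tables, "my_memoryset", lambda m: m["label"] != 0, "my_filtered_memoryset")
source_after_copy = list(tables["my_memoryset"])

# Filter in place: the source table itself is overwritten
filter_into(tables, "my_memoryset", lambda m: m["label"] != 0)
```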

drop_table #

drop_table()

Drop the table associated with this Memoryset.

reset #

reset()

Drop all data from the table associated with this Memoryset.