
orcalib.memoryset#

EmbeddingModel #

EmbeddingModel(
    name, version=0, embedding_dim=None, tokenizer=None
)

Embedding models for use with memorysets

Warning

Only the models available as class properties, such as EmbeddingModel.CLIP_BASE, and fine-tuned versions of them are guaranteed to work.

Parameters:

  • name (str) –

    the name of the model to use; this can be a HuggingFace model name or the path to a locally saved model. Only models available as class properties, such as EmbeddingModel.CLIP_BASE, and fine-tuned versions of them are guaranteed to work

  • version (int, default: 0 ) –

    optional version number of the model to use; this is only used for default models

  • embedding_dim (int | None, default: None ) –

    optional override for the embedding dimension in case it is not correctly specified in the config

  • tokenizer (str | None, default: None ) –

    optional name of a tokenizer model to use; if not given, name is used

embed #

embed(data, show_progress_bar=False, batch_size=32)

Generate embeddings for the given input

Parameters:

  • data (InputType | list[InputType]) –

    the data to encode; a scalar is converted to a list

  • show_progress_bar (bool, default: False ) –

    whether to show a progress bar

  • batch_size (int, default: 32 ) –

    the size of the batches to use

Returns:

  • ndarray

    matrix of embeddings with shape len(data) x embedding_dim
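
The scalar-to-list conversion and batching described for embed can be sketched without the library. This is a minimal illustration only: `embed_in_batches` and `toy_encoder` are hypothetical names that are not part of orcalib, the toy encoder stands in for a real model, and the result is a plain list of rows rather than an ndarray.

```python
from __future__ import annotations

from typing import Callable, Sequence


def toy_encoder(batch: Sequence[str]) -> list[list[float]]:
    """Hypothetical stand-in for a real model: maps each text to a 2-d vector."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


def embed_in_batches(
    data: str | list[str],
    encode: Callable[[Sequence[str]], list[list[float]]] = toy_encoder,
    batch_size: int = 32,
) -> list[list[float]]:
    """Return one embedding row per input, i.e. len(data) rows of embedding_dim values."""
    # A scalar input is wrapped in a list, as described for the `data` parameter.
    items = [data] if isinstance(data, str) else list(data)
    rows: list[list[float]] = []
    # Encode fixed-size slices so only `batch_size` inputs are processed at once.
    for start in range(0, len(items), batch_size):
        rows.extend(encode(items[start : start + batch_size]))
    return rows
```

Whatever the batch size, the number of rows equals the number of inputs; the batch size only bounds how many inputs are encoded per call.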

LabeledMemory dataclass #

LabeledMemory(
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    label,
    label_name=None,
)

Bases: _LabeledMemoryFields, Memory

A labeled memory is a single item that can be stored in the database and has a label.

value instance-attribute #

value

The value used to generate the embedding for looking up this memory.

embedding instance-attribute #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id instance-attribute #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version instance-attribute #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata instance-attribute #

metadata

Metadata associated with the memory that is not used in the model.

label instance-attribute #

label

The label of the memory.

label_name class-attribute instance-attribute #

label_name = None

The human-readable name of the label.

LabeledMemoryLookup dataclass #

LabeledMemoryLookup(
    lookup_score,
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    label,
    label_name=None,
    reranker_score=None,
    reranker_embedding=None,
    attention_weight=None,
)

Bases: _OptionalLookupProperties, _LabeledMemoryFields, Memory, _RequiredLookupProperties

Single labeled memory lookup result.
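
The order of the bases explains the order of the constructor arguments: Python dataclasses collect fields from base classes in reverse method-resolution order, so the required lookup_score comes first and the optional lookup properties come last. The sketch below reproduces this with stand-in classes. The class bodies and field types are assumptions reconstructed from the signature above (the private base classes are not documented here), and the leading underscores are dropped to make clear these are not the library's classes.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class RequiredLookupProperties:  # assumed content of _RequiredLookupProperties
    lookup_score: float


@dataclass
class Memory:  # field types are assumptions
    value: Any
    embedding: Any
    memory_id: int
    memory_version: int
    metadata: dict | None


@dataclass
class LabeledMemoryFields:  # assumed content of _LabeledMemoryFields
    label: int
    label_name: str | None = None


@dataclass
class OptionalLookupProperties:  # assumed content of _OptionalLookupProperties
    reranker_score: float | None = None
    reranker_embedding: Any = None
    attention_weight: float | None = None


@dataclass
class LabeledMemoryLookup(
    OptionalLookupProperties, LabeledMemoryFields, Memory, RequiredLookupProperties
):
    """Stand-in that only combines the fields of its bases."""


# Fields are gathered from the last base to the first, so every field
# without a default precedes every field with one.
field_order = [f.name for f in fields(LabeledMemoryLookup)]
```

Listing the bases in the opposite order would place defaulted fields before required ones, which the dataclass decorator rejects with a TypeError.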

lookup_score instance-attribute #

lookup_score

The similarity score between the query and the memory.

value instance-attribute #

value

The value used to generate the embedding for looking up this memory.

embedding instance-attribute #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id instance-attribute #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version instance-attribute #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata instance-attribute #

metadata

Metadata associated with the memory that is not used in the model.

label instance-attribute #

label

The label of the memory.

label_name class-attribute instance-attribute #

label_name = None

The human-readable name of the label.

reranker_score class-attribute instance-attribute #

reranker_score = None

The similarity score assigned by the reranker.

Note

This will be automatically generated if a reranker is attached to the memoryset.

reranker_embedding class-attribute instance-attribute #

reranker_embedding = None

The reranker embedding for this memory value.

Note

This will be automatically generated if a reranker is attached to the memoryset.

attention_weight class-attribute instance-attribute #

attention_weight = None

The attention the model gave to this memory lookup.

Note

This is not provided during lookup but can instead be optionally added by the model during its forward pass to store for later analysis.

Memory dataclass #

Memory(
    value, embedding, memory_id, memory_version, metadata
)

The base class for all memories, including labeled memories. It contains the fields that are always required.

value instance-attribute #

value

The value used to generate the embedding for looking up this memory.

embedding instance-attribute #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id instance-attribute #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version instance-attribute #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata instance-attribute #

metadata

Metadata associated with the memory that is not used in the model.

MemoryLookup dataclass #

MemoryLookup(
    lookup_score,
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    reranker_score=None,
    reranker_embedding=None,
    attention_weight=None,
)

Bases: _OptionalLookupProperties, Memory, _RequiredLookupProperties

Single memory lookup result.

lookup_score instance-attribute #

lookup_score

The similarity score between the query and the memory.

value instance-attribute #

value

The value used to generate the embedding for looking up this memory.

embedding instance-attribute #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id instance-attribute #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version instance-attribute #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata instance-attribute #

metadata

Metadata associated with the memory that is not used in the model.

reranker_score class-attribute instance-attribute #

reranker_score = None

The similarity score assigned by the reranker.

Note

This will be automatically generated if a reranker is attached to the memoryset.

reranker_embedding class-attribute instance-attribute #

reranker_embedding = None

The reranker embedding for this memory value.

Note

This will be automatically generated if a reranker is attached to the memoryset.

attention_weight class-attribute instance-attribute #

attention_weight = None

The attention the model gave to this memory lookup.

Note

This is not provided during lookup but can instead be optionally added by the model during its forward pass to store for later analysis.

LabeledMemoryset #

LabeledMemoryset(
    uri=None,
    api_key=None,
    secret_key=None,
    database=None,
    table=None,
    embedding_model=EmbeddingModel.GTE_BASE,
    reranker=None,
)

Collection of memories with labels that are stored in an OrcaDB table and can be queried using embedding similarity search.

Note

This will create the database, if it doesn’t exist yet, and a table in it.

Parameters:

  • uri (str | None, default: None ) –

    URL of the database that should store the memories table, or the name of the table for the memories. Either a file URL or the URL of a hosted OrcaDB instance is accepted. If empty, the ORCADB_URL environment variable is used instead. If a string that is not a URL is provided, it is interpreted as the name of the table to create in the database specified by the ORCADB_URL environment variable.

  • api_key (str | None, default: None ) –

    API key for the OrcaDB instance. If not provided, the ORCADB_API_KEY environment variable or the credentials encoded in the uri are used.

  • secret_key (str | None, default: None ) –

    Secret key for the OrcaDB instance. If not provided, the ORCADB_SECRET_KEY environment variable or the credentials encoded in the uri are used.

  • database (str | None, default: None ) –

    Name of the database. Do not provide this if it is already encoded in the uri.

  • table (str | None, default: None ) –

    Name of the table. Do not provide this if it is already encoded in the uri.

  • embedding_model (EmbeddingModel, default: GTE_BASE ) –

    Embedding model to use for semantic similarity search.

  • reranker (Reranker | None, default: None ) –

    optional reranking model to use during lookup.

Examples:

Infer connection details from the ORCADB_URL, ORCADB_API_KEY, and ORCADB_SECRET_KEY environment variables:

>>> import os
>>> os.environ["ORCADB_URL"] = "https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db"
>>> LabeledMemoryset()
LabeledMemoryset(table="memories", database="my-db")
>>> LabeledMemoryset("my_memories_table")
LabeledMemoryset(table="my_memories_table", database="my-db")

All connection details can be fully encoded in the uri:

>>> LabeledMemoryset("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db/my-memories-table")
LabeledMemoryset(table="my-memories-table", database="my-db")

Or they can be provided explicitly:

>>> LabeledMemoryset(
...    "https://instance.orcadb.cloud",
...    api_key="my-api-key",
...    secret_key="my-secret-key",
...    database="my-db",
...    table="my-memories-table"
... )
LabeledMemoryset(table="my-memories-table", database="my-db")
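
To make the URL layout in these examples concrete, the sketch below splits a connection string into its parts with the standard library. It is an illustration under stated assumptions, not the library's parser: `split_connection` is a hypothetical helper, it treats any string without a scheme as a bare table name, and it only handles the `https://key:secret@host/database/table` layout shown above (file URLs are not covered).

```python
from __future__ import annotations

from urllib.parse import urlsplit


def split_connection(uri: str) -> dict[str, str | None]:
    """Hypothetical helper: break a connection string into its components."""
    parts = urlsplit(uri)
    if not parts.scheme:
        # No scheme, so the string is assumed to be a table name; the
        # remaining details would come from the ORCADB_* environment variables.
        return {"host": None, "api_key": None, "secret_key": None,
                "database": None, "table": uri}
    # Path segments after the host: the database first, then an optional table.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.hostname,
        "api_key": parts.username,
        "secret_key": parts.password,
        "database": segments[0] if len(segments) > 0 else None,
        "table": segments[1] if len(segments) > 1 else None,
    }
```

Under these assumptions, a component that comes back as None is one that would have to be supplied through a keyword argument or an environment variable instead.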

insert #

insert(dataset, log=True)

Inserts a dataset into the LabeledMemoryset database.

For dict-like or list of dict-like datasets, there must be a label key and one of the following keys: text, image, or value. If there are only two keys and one is label, the other will be inferred to be value.

For list-like datasets, the first element of each tuple must be the value and the second must be the label.
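
The key rules above can be sketched as a small normalization step that turns each row into a (value, label) pair. This is a hedged illustration rather than the library's implementation: `to_value_and_label` is a hypothetical helper, and the precedence among text, image, and value when several are present is an assumption.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any

VALUE_KEYS = ("text", "image", "value")  # checked in this (assumed) order


def to_value_and_label(row: Mapping[str, Any] | tuple[Any, Any]) -> tuple[Any, Any]:
    """Hypothetical helper: normalize one dataset row to (value, label)."""
    if isinstance(row, Mapping):
        if "label" not in row:
            raise ValueError("dict-like rows need a 'label' key")
        for key in VALUE_KEYS:
            if key in row:
                return row[key], row["label"]
        # Exactly two keys, one of them 'label': the other is taken as the value.
        others = [key for key in row if key != "label"]
        if len(row) == 2 and len(others) == 1:
            return row[others[0]], row["label"]
        raise ValueError("cannot infer which key holds the value")
    # Tuple-like rows: first element is the value, second is the label.
    value, label = row
    return value, label
```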

Parameters:

  • dataset (DatasetLike) –

    data to insert into the memoryset

  • log (bool, default: True ) –

    whether to show a progress bar and log messages

Examples:

Example 1: Inserting a dictionary-like dataset#

>>> dataset = [{
...    "text": "text 1",
...    "label": 0
... }]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 2: Inserting a list-like dataset#

>>> dataset = [
...    ("text 1", 0),
...    ("text 2", 1)
... ]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 3: Inserting a Hugging Face Dataset#

>>> from datasets import load_dataset
>>> dataset = load_dataset("frgfm/imagenette", "320px")
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

lookup #

lookup(
    query,
    *,
    column_oriented=False,
    k=1,
    batch_size=32,
    run_ids=None,
    rerank=None,
    log=False
)

Retrieves the most similar memories to the query from the memoryset.

Parameters:

  • query (InputType | list[InputType] | ndarray) –

    The query to retrieve memories for. Can be a single value, a list of values, or a numpy array with value embeddings.

  • k (int, default: 1 ) –

    The number of memories to retrieve.

  • batch_size (int, default: 32 ) –

    The number of queries to process at a time.

  • run_ids (list[int] | None, default: None ) –

    A list of run IDs to track with the lookup.

  • rerank (bool | None, default: None ) –

    Whether to rerank the results. If None (default), results will be reranked if a reranker is attached to the Memoryset.

  • log (bool, default: False ) –

    Whether to log the lookup process and show progress bars.

Returns:

Examples:

Example 1: Retrieving the most similar memory to a single example#

>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> query = "Apple"
>>> memoryset.lookup(query, k=1)
[
    [
        LabeledMemoryLookup(
            value='Orange',
            memory_id=12,
            memory_version=1,
            label=0,
            label_name='fruit',
            embedding=array([...], dtype=float32),
            metadata=None,
            lookup_score=.98,
            reranker_score=None,
            reranker_embedding=None
        )
    ]
]
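
The nested result shape (one inner list per query, each holding up to k results) can be illustrated with a brute-force similarity search. This is a sketch under assumptions: the documentation only speaks of a "similarity score", so the use of cosine similarity here is a guess, and `cosine` and `top_k` are hypothetical helpers that do not reflect how OrcaDB performs the search.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    queries: list[list[float]],
    memories: list[tuple[str, list[float]]],
    k: int = 1,
) -> list[list[tuple[float, str]]]:
    """Return, for each query embedding, the k best (score, value) pairs."""
    results = []
    for query in queries:
        # Score every stored memory against the query, best match first.
        scored = [(cosine(query, embedding), value) for value, embedding in memories]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results.append(scored[:k])
    return results
```

Even for a single query the outer list is kept, which matches the doubly nested output in the example above.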

to_list #

to_list(limit=None)

Get a list of all the memories in the memoryset.

Returns:

to_pandas #

to_pandas(limit=None)

Get a DataFrame representation of the memoryset.

Returns:

  • DataFrame

    DataFrame containing the memories

update_embedding_model #

update_embedding_model(embedding_model, destination=None)

Updates the embedding model for the memoryset and re-embeds all memories in the current memoryset or a new destination memoryset if it is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • embedding_model (EmbeddingModel) –

    new embedding model to use.

  • destination (LabeledMemoryset | str | None, default: None ) –

    destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset will be updated.

Returns:

Examples:

Replace the embedding model for the current memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE)

Create a new memoryset with a new embedding model:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> new_memoryset = memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE, "my_new_memoryset")

clone #

clone(destination)

Clone the current memoryset into a new memoryset.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • destination (LabeledMemoryset | str) –

    The destination memoryset to clone this memoryset into. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

  • LabeledMemoryset

    The destination memoryset that the memories were cloned into.

Examples:

Clone a local memoryset into a hosted database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-database#my_memoryset")

Clone a local memoryset into a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("my_new_memoryset")

map #

map(fn, destination=None)

Apply a function to all the memories in the memoryset and store them in the current memoryset or a new destination memoryset if it is provided.

Note

If your function returns a column that already exists, then it overwrites it.

Parameters:

  • fn (Callable[[LabeledMemory], dict[str, Any] | LabeledMemory]) –

    Function that takes in the memory and returns a new memory or a dictionary containing the values to update in the memory.

  • destination (LabeledMemoryset | str | None, default: None ) –

    The destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

Examples:

Add new metadata to all memories in the memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.map(lambda m: dict(metadata=dict(**m.metadata, new_key="new_value")))

Create a new memoryset with swapped labels in a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> swapped_memoryset = memoryset.map(
...     lambda m: dict(label=1 if m.label == 0 else 0),
...     "my_swapped_memoryset"
... )
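
The two return types accepted from fn (a full memory or a dictionary of updates) can be sketched as follows. This is a minimal in-memory illustration with hypothetical names (`ToyMemory`, `apply_map`); it does not touch a database and is not the library's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from typing import Any, Callable


@dataclass
class ToyMemory:
    """Hypothetical, reduced stand-in for LabeledMemory."""
    value: str
    label: int
    metadata: dict[str, Any] = field(default_factory=dict)


def apply_map(
    memories: list[ToyMemory],
    fn: Callable[[ToyMemory], dict[str, Any] | ToyMemory],
) -> list[ToyMemory]:
    """Apply fn to every memory and return the updated memories."""
    updated = []
    for memory in memories:
        result = fn(memory)
        if isinstance(result, dict):
            # A dict names only the fields to change; any field it mentions
            # overwrites the existing one, the rest are copied unchanged.
            result = replace(memory, **result)
        updated.append(result)
    return updated
```

Returning a dictionary keeps the function short when only one field changes, as in the label-swapping example above.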

filter #

filter(fn, destination=None)

Filters the current memoryset using the given function and stores the result in the current memoryset or a new destination memoryset if it is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

  • fn (Callable[[LabeledMemory], bool]) –

    Function that takes in the memory and returns a boolean indicating whether the memory should be included or not.

  • destination (LabeledMemoryset | str | None, default: None ) –

    The destination memoryset to store the results in. This can be a memoryset instance, the URL of a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

Examples:

Filter out memories with a label of 0:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.filter(lambda m: m.label != 0)

Create a new memoryset, in a new table in the same database, that contains only the memories with a given metadata value:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> filtered_memoryset = memoryset.filter(
...     lambda m: m.metadata["key"] == "filter_value",
...     "my_filtered_memoryset"
... )

drop_table #

drop_table()

Drop the table associated with this Memoryset.

reset #

reset()

Drop all data from the table associated with this Memoryset.