orcalib.memoryset#

EmbeddingModel #

EmbeddingModel(
    name, version=0, embedding_dim=None, tokenizer=None
)

Embedding models for use with memorysets

Warning

Only the models that are available as class properties like EmbeddingModel.CLIP_BASE as well as fine-tuned versions of them are guaranteed to work.

Parameters:

name (str) –

the name of the model to use. This can be a HuggingFace model name or the path to a locally saved model. Only models that are available as class properties, like EmbeddingModel.CLIP_BASE, and fine-tuned versions of them are guaranteed to work.
version (int, default: 0 ) –

optional version number of the model to use; this is only used for default models
embedding_dim (int | None, default: None ) –

optional override for the embedding dimension in case it is not correctly specified in the config
tokenizer (str | None, default: None ) –

optional name of a tokenizer model to use; if not given, it will be the same as name

embed #

embed(data, show_progress_bar=False, batch_size=32)

Generate embeddings for the given input

Parameters:

data (InputType | list[InputType]) –

the data to encode; a single value will be converted to a one-element list
show_progress_bar (bool, default: False ) –

whether to show a progress bar
batch_size (int, default: 32 ) –

the size of the batches to use

Returns:

ndarray –

matrix with embeddings of shape len_data x embedding_dim
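
The batching and the len_data x embedding_dim output shape described above can be illustrated without the library. This is a minimal sketch, not orcalib's implementation: `toy_embed` and `embed_in_batches` are hypothetical names, and the vectors produced are meaningless placeholders for real model output.

```python
from typing import Callable, List, Sequence, Union


def toy_embed(batch: Sequence[str], dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in for a model forward pass: one vector per input."""
    return [[float((len(text) + i) % 7) for i in range(dim)] for text in batch]


def embed_in_batches(
    data: Union[str, List[str]],
    embed_fn: Callable[[Sequence[str]], List[List[float]]] = toy_embed,
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed `data` in chunks of `batch_size`, returning one row per input."""
    # A scalar input is wrapped in a list, as the parameter description states.
    items = [data] if isinstance(data, str) else list(data)
    rows: List[List[float]] = []
    for start in range(0, len(items), batch_size):
        rows.extend(embed_fn(items[start:start + batch_size]))
    return rows


matrix = embed_in_batches(["a", "bb", "ccc"], batch_size=2)
print(len(matrix), len(matrix[0]))  # number of inputs, embedding dimension
```

The number of rows always equals the number of inputs, regardless of how the inputs are split into batches.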

LabeledMemory `dataclass` #

LabeledMemory(
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    label,
    label_name=None,
)

Bases: _LabeledMemoryFields, Memory

A labeled memory is a single item that can be stored in the database and has a label.

value `instance-attribute` #

value

The value used to generate the embedding for looking up this memory.

embedding `instance-attribute` #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id `instance-attribute` #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version `instance-attribute` #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata `instance-attribute` #

metadata

Metadata associated with the memory that is not used in the model.

label `instance-attribute` #

label

The label of the memory.

label_name `class-attribute` `instance-attribute` #

label_name = None

The human-readable name of the label.

LabeledMemoryLookup `dataclass` #

LabeledMemoryLookup(
    lookup_score,
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    label,
    label_name=None,
    reranker_score=None,
    reranker_embedding=None,
    attention_weight=None,
)

Bases: _OptionalLookupProperties, _LabeledMemoryFields, Memory, _RequiredLookupProperties

Single labeled memory lookup result.

lookup_score `instance-attribute` #

lookup_score

The similarity score between the query and the memory.

value `instance-attribute` #

value

The value used to generate the embedding for looking up this memory.

embedding `instance-attribute` #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id `instance-attribute` #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version `instance-attribute` #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata `instance-attribute` #

metadata

Metadata associated with the memory that is not used in the model.

label `instance-attribute` #

label

The label of the memory.

label_name `class-attribute` `instance-attribute` #

label_name = None

The human-readable name of the label.

reranker_score `class-attribute` `instance-attribute` #

reranker_score = None

The similarity score assigned by the reranker.

Note

This will be automatically generated if a reranker is attached to the memoryset.

reranker_embedding `class-attribute` `instance-attribute` #

reranker_embedding = None

The reranker embedding for this memory value.

Note

This will be automatically generated if a reranker is attached to the memoryset.

attention_weight `class-attribute` `instance-attribute` #

attention_weight = None

The attention the model gave to this memory lookup.

Note

This is not provided during lookup but can instead be optionally added by the model during its forward pass to store for later analysis.

Memory `dataclass` #

Memory(
    value, embedding, memory_id, memory_version, metadata
)

The base class for all memories, labeled or not. It includes the fields that are always required.

value `instance-attribute` #

value

The value used to generate the embedding for looking up this memory.

embedding `instance-attribute` #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id `instance-attribute` #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version `instance-attribute` #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata `instance-attribute` #

metadata

Metadata associated with the memory that is not used in the model.

MemoryLookup `dataclass` #

MemoryLookup(
    lookup_score,
    value,
    embedding,
    memory_id,
    memory_version,
    metadata,
    reranker_score=None,
    reranker_embedding=None,
    attention_weight=None,
)

Bases: _OptionalLookupProperties, Memory, _RequiredLookupProperties

Single memory lookup result.

lookup_score `instance-attribute` #

lookup_score

The similarity score between the query and the memory.

value `instance-attribute` #

value

The value used to generate the embedding for looking up this memory.

embedding `instance-attribute` #

embedding

The embedding of the memory value, automatically generated by the Memoryset model.

memory_id `instance-attribute` #

memory_id

The ID of the memory in the table, automatically generated by the Memoryset.

memory_version `instance-attribute` #

memory_version

The version of the memory, automatically maintained by the Memoryset.

metadata `instance-attribute` #

metadata

Metadata associated with the memory that is not used in the model.

reranker_score `class-attribute` `instance-attribute` #

reranker_score = None

The similarity score assigned by the reranker.

Note

This will be automatically generated if a reranker is attached to the memoryset.

reranker_embedding `class-attribute` `instance-attribute` #

reranker_embedding = None

The reranker embedding for this memory value.

Note

This will be automatically generated if a reranker is attached to the memoryset.

attention_weight `class-attribute` `instance-attribute` #

attention_weight = None

The attention the model gave to this memory lookup.

Note

This is not provided during lookup but can instead be optionally added by the model during its forward pass to store for later analysis.

LabeledMemoryset #

LabeledMemoryset(
    uri=None,
    api_key=None,
    secret_key=None,
    database=None,
    table=None,
    embedding_model=EmbeddingModel.GTE_BASE,
    reranker=None,
)

Collection of memories with labels that are stored in an OrcaDB table and can be queried using embedding similarity search.

Note

This will create a database if it doesn’t exist yet and a table in it.

Parameters:

uri (str | None, default: None ) –

URL of the database that should store the memories table, or the name of the table for the memories. Either a file URL or the URL of a hosted OrcaDB instance is accepted. If empty, the ORCADB_URL environment variable is used instead. If a string that is not a URL is provided, it is interpreted as the name of the table to create in the database specified by the ORCADB_URL environment variable.
api_key (str | None, default: None ) –

API key for the OrcaDB instance. If not provided, the ORCADB_API_KEY environment variable or the credentials encoded in the uri are used.
secret_key (str | None, default: None ) –

Secret key for the OrcaDB instance. If not provided, the ORCADB_SECRET_KEY environment variable or the credentials encoded in the uri are used.
database (str | None, default: None ) –

Name of the database. Do not provide this if it is already encoded in the uri.
table (str | None, default: None ) –

Name of the table. Do not provide this if it is already encoded in the uri.
embedding_model (EmbeddingModel, default: GTE_BASE ) –

Embedding model to use for semantic similarity search.
reranker (Reranker | None, default: None ) –

optional reranking model to use during lookup.

Examples:

Infer connection details from the ORCADB_URL, ORCADB_API_KEY, and ORCADB_SECRET_KEY environment variables:

>>> import os
>>> os.environ["ORCADB_URL"] = "https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db"
>>> LabeledMemoryset()
LabeledMemoryset(table="memories", database="my-db")
>>> LabeledMemoryset("my_memories_table")
LabeledMemoryset(table="my_memories_table", database="my-db")

All connection details can be fully encoded in the uri:

>>> LabeledMemoryset("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db/my-memories-table")
LabeledMemoryset(table="my-memories-table", database="my-db")

Or they can be provided explicitly:

>>> LabeledMemoryset(
...    "https://instance.orcadb.cloud",
...    api_key="my-api-key",
...    secret_key="my-secret-key",
...    database="my-db",
...    table="my-memories-table"
... )
LabeledMemoryset(table="my-memories-table", database="my-db")
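
To make the URL layout in the examples above concrete, the following sketch splits such a URL into credentials, database, and table using only the standard library. `split_memoryset_url` is a hypothetical helper, and the `/<database>/<table>` path layout is an assumption read off the example URLs; orcalib's own parsing may differ.

```python
from urllib.parse import urlsplit


def split_memoryset_url(url: str) -> dict:
    """Split a URL shaped like the examples above into its parts.

    Hypothetical helper for illustration only; not part of orcalib.
    """
    parts = urlsplit(url)
    # Assumed path layout: /<database>[/<table>]
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "host": parts.hostname,
        "api_key": parts.username,  # None when no credentials are encoded
        "secret_key": parts.password,
        "database": segments[0] if segments else None,
        "table": segments[1] if len(segments) > 1 else None,
    }


details = split_memoryset_url(
    "https://my-api-key:my-secret-key@instance.orcadb.cloud/my-db/my-memories-table"
)
print(details["database"], details["table"])
```

Any part that is missing from the URL comes back as None, which corresponds to the cases where the value has to be supplied through a keyword argument or an environment variable.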

insert #

insert(dataset, log=True)

Inserts a dataset into the LabeledMemoryset database.

For dict-like or list of dict-like datasets, there must be a label key and one of the following keys: text, image, or value. If there are only two keys and one is label, the other will be inferred to be value.

For list-like datasets, the first element of each tuple must be the value and the second must be the label.

Parameters:

dataset (DatasetLike) –

data to insert into the memoryset
log (bool, default: True ) –

whether to show a progress bar and log messages

Examples:

Example 1: Inserting a dictionary-like dataset#

>>> dataset = [{
...    "text": "text 1",
...    "label": 0
... }]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 2: Inserting a list-like dataset#

>>> dataset = [
...    ("text 1", 0),
...    ("text 2", 1)
... ]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)

Example 3: Inserting a Hugging Face Dataset#

>>> from datasets import load_dataset
>>> dataset = load_dataset("frgfm/imagenette", "320px")
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)
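
The row shapes accepted by insert can be summarized as a normalization into (value, label) pairs. The sketch below restates the documented rules in plain Python. `to_value_label` is a hypothetical helper, not orcalib code, and the precedence among text, image, and value when several are present is an assumption.

```python
from collections.abc import Mapping
from typing import Any, Tuple

# Checked in this order; the order itself is an assumption.
VALUE_KEYS = ("text", "image", "value")


def to_value_label(row: Any) -> Tuple[Any, Any]:
    """Normalize one dataset row into a (value, label) pair."""
    if isinstance(row, Mapping):
        if "label" not in row:
            raise ValueError("dict-like rows need a 'label' key")
        for key in VALUE_KEYS:
            if key in row:
                return row[key], row["label"]
        # With exactly two keys, the non-label key is inferred to be the value.
        others = [key for key in row if key != "label"]
        if len(row) == 2 and len(others) == 1:
            return row[others[0]], row["label"]
        raise ValueError("no 'text', 'image', or 'value' key found")
    # Tuple-like rows: value first, label second.
    value, label = row
    return value, label


rows = [{"text": "text 1", "label": 0}, {"review": "text 2", "label": 1}, ("text 3", 0)]
print([to_value_label(row) for row in rows])
```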

lookup #

lookup(
    query,
    *,
    column_oriented=False,
    k=1,
    batch_size=32,
    run_ids=None,
    rerank=None,
    log=False
)

Retrieves the most similar memories to the query from the memoryset.

Parameters:

query (InputType | list[InputType] | ndarray) –

The query to retrieve memories for. Can be a single value, a list of values, or a numpy array with value embeddings.
k (int, default: 1 ) –

The number of memories to retrieve.
batch_size (int, default: 32 ) –

The number of queries to process at a time.
run_ids (list[int] | None, default: None ) –

A list of run IDs to track with the lookup.
rerank (bool | None, default: None ) –

Whether to rerank the results. If None (default), results will be reranked if a reranker is attached to the Memoryset.
log (bool, default: False ) –

Whether to log the lookup process and show progress bars.

Returns:

list[list[LabeledMemoryLookup]] | list[MemoryLookupResults] –

A list of lists of LabeledMemoryLookups, where each inner list contains the k most similar memories to the corresponding query.

Examples:

Example 1: Retrieving the most similar memory to a single example#

>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> query = "Apple"
>>> memoryset.lookup(query, k=1)
[
    [
        LabeledMemoryLookup(
            value='Orange',
            memory_id=12,
            memory_version=1,
            label=0,
            label_name='fruit',
            embedding=array([...], dtype=float32),
            metadata=None,
            lookup_score=.98,
            reranker_score=None,
            reranker_embedding=None
        )
    ]
]
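
The return shape shown above, one inner list per query holding its k best matches, can be sketched in plain Python. This is only an illustration under stated assumptions: cosine similarity over in-memory vectors stands in for whatever metric and index OrcaDB actually uses, and `cosine` and `top_k` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    queries: List[Sequence[float]],
    memories: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[List[Tuple[str, float]]]:
    """Return, per query, the k (value, score) pairs with the highest score."""
    results = []
    for query in queries:
        scored = [(value, cosine(query, emb)) for value, emb in memories]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:k])
    return results


memories = [("Orange", [1.0, 0.1]), ("Hammer", [0.0, 1.0]), ("Pear", [0.9, 0.3])]
hits = top_k([[1.0, 0.0], [0.1, 1.0]], memories, k=2)
print([[value for value, _ in row] for row in hits])
```

Two queries with k=2 yield two inner lists of two entries each, ordered from most to least similar.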

to_list #

to_list(limit=None)

Get a list of all the memories in the memoryset.

Returns:

list[LabeledMemory] –

list containing the memories

to_pandas #

to_pandas(limit=None)

Get a DataFrame representation of the memoryset.

Returns:

DataFrame –

DataFrame containing the memories

update_embedding_model #

update_embedding_model(embedding_model, destination=None)

Updates the embedding model for the memoryset and re-embeds all memories in the current memoryset or a new destination memoryset if it is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

embedding_model (EmbeddingModel) –

new embedding model to use.
destination (LabeledMemoryset | str | None, default: None ) –

destination memoryset to store the results in. This can be a memoryset instance, the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset will be updated.

Returns:

LabeledMemoryset –

The destination memoryset with the updated embeddings.

Examples:

Replace the embedding model for the current memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE)

Create a new memoryset with a new embedding model:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> new_memoryset = memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE, "my_new_memoryset")

clone #

clone(destination)

Clone the current memoryset into a new memoryset.

Note

This will reset the destination memoryset if it already exists.

Parameters:

destination (LabeledMemoryset | str) –

The destination memoryset to clone this memoryset into, this can either be a memoryset instance, or the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

LabeledMemoryset –

The destination memoryset that the memories were cloned into.

Examples:

Clone a local memoryset into a hosted database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-database#my_memoryset")

Clone a local memoryset into a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("my_new_memoryset")

map #

map(fn, destination=None)

Apply a function to all the memories in the memoryset and store them in the current memoryset or a new destination memoryset if it is provided.

Note

If your function returns a column that already exists, then it overwrites it.

Parameters:

fn (Callable[[LabeledMemory], dict[str, Any] | LabeledMemory]) –

Function that takes in the memory and returns a new memory or a dictionary containing the values to update in the memory.
destination (LabeledMemoryset | str | None, default: None ) –

The destination memoryset to store the results in, this can either be a memoryset instance, or the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

LabeledMemoryset –

The destination memoryset with the updated memories.

Examples:

Add new metadata to all memories in the memoryset:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.map(lambda m: dict(metadata=dict(**m.metadata, new_key="new_value")))

Create a new memoryset with swapped labels in a new table in the same database:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> swapped_memoryset = memoryset.map(
...     lambda m: dict(label=1 if m.label == 0 else 0),
...     "my_swapped_memoryset"
... )
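
The two return types accepted from fn, a dictionary of updates or a full memory, can be sketched with a stand-in dataclass. `ToyMemory` and `apply_map` are hypothetical and only mirror the documented behavior; they are not orcalib's implementation and do not touch a database.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict, List, Union


@dataclass
class ToyMemory:
    """Hypothetical stand-in with a subset of the LabeledMemory fields."""
    value: str
    label: int
    metadata: Dict[str, Any] = field(default_factory=dict)


def apply_map(
    memories: List[ToyMemory],
    fn: Callable[[ToyMemory], Union[Dict[str, Any], ToyMemory]],
) -> List[ToyMemory]:
    """Apply fn to each memory; dict results overwrite only the named fields."""
    mapped = []
    for memory in memories:
        result = fn(memory)
        if isinstance(result, dict):
            # Fields not named in the dict keep their existing values.
            result = replace(memory, **result)
        mapped.append(result)
    return mapped


source = [ToyMemory("text 1", 0), ToyMemory("text 2", 1)]
swapped = apply_map(source, lambda m: dict(label=1 if m.label == 0 else 0))
print([(m.value, m.label) for m in swapped])
```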

filter #

filter(fn, destination=None)

Filters the current memoryset using the given function and stores the result in the current memoryset or a new destination memoryset if it is provided.

Note

This will reset the destination memoryset if it already exists.

Parameters:

fn (Callable[[LabeledMemory], bool]) –

Function that takes in the memory and returns a boolean indicating whether the memory should be included or not.
destination (LabeledMemoryset | str | None, default: None ) –

The destination memoryset to store the results in, this can either be a memoryset instance, or the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.

Returns:

LabeledMemoryset –

The destination memoryset with the filtered memories.

Examples:

Filter out memories with a label of 0:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.filter(lambda m: m.label != 0)

Create a new memoryset, in a new table in the same database, that contains only the memories with a given metadata value:

>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> filtered_memoryset = memoryset.filter(
...     lambda m: m.metadata["key"] == "filter_value",
...     "my_filtered_memoryset"
... )
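
Two points about filter are easy to misread: fn returns True for memories to keep, and an existing destination is reset before the results are written. The sketch below models tables as a plain dictionary to show both. `filter_rows` and the `tables` dictionary are hypothetical and do not reflect how orcalib stores data.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (value, label); a simplified stand-in for LabeledMemory


def filter_rows(
    tables: Dict[str, List[Row]],
    source: str,
    fn: Callable[[Row], bool],
    destination: Optional[str] = None,
) -> str:
    """Keep rows for which fn returns True and write them to the target table."""
    kept = [row for row in tables[source] if fn(row)]
    target = destination or source
    # Assignment replaces any previous contents, so the destination is reset.
    tables[target] = kept
    return target


tables = {"my_memoryset": [("a", 0), ("b", 1)], "old": [("stale", 1)]}
filter_rows(tables, "my_memoryset", lambda row: row[1] != 0, destination="old")
print(tables["old"])
```

When a destination is given, the source table is left unchanged; without one, the source itself is overwritten with the kept rows.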

drop_table #

drop_table()

Drop the table associated with this Memoryset.

reset #

reset()

Drop all data from the table associated with this Memoryset.
