orcalib.memoryset
EmbeddingModel
Embedding models for use with memorysets.
Warning
Only the models that are available as class properties, such as EmbeddingModel.CLIP_BASE, and fine-tuned versions of them are guaranteed to work.
Parameters:
- name (str) – the name of the model to use. This can be a HuggingFace model name or the path to a locally saved model. Only models that are available as class properties, such as EmbeddingModel.CLIP_BASE, and fine-tuned versions of them are guaranteed to work.
- version (int, default: 0) – optional version number of the model to use. This is only used for default models.
- embedding_dim (int | None, default: None) – optional override for the embedding dimension in case it is not correctly specified in the config.
- tokenizer (str | None, default: None) – optional name of a tokenizer model to use. If not given, it will be the same as name.
embed
Generate embeddings for the given input.
Parameters:
- data (InputType | list[InputType]) – the data to encode, will be converted to a list if a scalar is given.
- show_progress_bar (bool, default: False) – whether to show a progress bar.
- batch_size (int, default: 32) – the size of the batches to use.
Returns:
- ndarray – matrix with embeddings of shape len_data x embedding_dim.
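The input handling and output shape described for embed can be illustrated with a minimal sketch. The functions `toy_embed_one` and `toy_embed` below are hypothetical stand-ins, not the orcalib implementation: they bucket character codes into a small vector so the example runs without any model, and they return nested lists where the real method returns an ndarray. The dimension of 4 is an arbitrary assumption.

```python
from typing import Union

EMBEDDING_DIM = 4  # assumed dimension, for illustration only


def toy_embed_one(text: str) -> list[float]:
    """Hypothetical stand-in for a model: bucket character codes into a vector."""
    vector = [0.0] * EMBEDDING_DIM
    for index, char in enumerate(text):
        vector[index % EMBEDDING_DIM] += float(ord(char))
    return vector


def toy_embed(data: Union[str, list[str]], batch_size: int = 32) -> list[list[float]]:
    """Mimic the documented contract: a scalar becomes a list, inputs are
    processed in batches, and one row is returned per input."""
    items = [data] if isinstance(data, str) else list(data)
    rows: list[list[float]] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        rows.extend(toy_embed_one(text) for text in batch)
    return rows


single = toy_embed("hello")
many = toy_embed(["a", "b", "c"], batch_size=2)
print(len(single), len(single[0]))  # rows x columns for a scalar input
print(len(many), len(many[0]))      # rows x columns for a list input
```

Whatever the batch size, the number of rows equals the number of inputs and the number of columns equals the embedding dimension.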
LabeledMemory
dataclass
Bases: _LabeledMemoryFields, Memory
A labeled memory is a single item that can be stored in the database and has a label.
embedding
instance-attribute
The embedding of the memory value, automatically generated by the Memoryset model.
memory_id
instance-attribute
The ID of the memory in the table, automatically generated by the Memoryset.
memory_version
instance-attribute
The version of the memory, automatically maintained by the Memoryset.
LabeledMemoryLookup
dataclass
Bases: _OptionalLookupProperties, _LabeledMemoryFields, Memory, _RequiredLookupProperties
Single labeled memory lookup result.
embedding
instance-attribute
The embedding of the memory value, automatically generated by the Memoryset model.
memory_id
instance-attribute
The ID of the memory in the table, automatically generated by the Memoryset.
memory_version
instance-attribute
The version of the memory, automatically maintained by the Memoryset.
reranker_score
class-attribute
instance-attribute
The similarity score assigned by the reranker.
Note
This will be automatically generated if a reranker is attached to the memoryset.
reranker_embedding
class-attribute
instance-attribute
The reranker embedding for this memory value.
Note
This will be automatically generated if a reranker is attached to the memoryset.
Memory
dataclass
The base class for a labeled memory. This includes fields that are ALWAYS required.
embedding
instance-attribute
The embedding of the memory value, automatically generated by the Memoryset model.
memory_id
instance-attribute
The ID of the memory in the table, automatically generated by the Memoryset.
memory_version
instance-attribute
The version of the memory, automatically maintained by the Memoryset.
MemoryLookup
dataclass
Bases: _OptionalLookupProperties, Memory, _RequiredLookupProperties
Single memory lookup result.
embedding
instance-attribute
The embedding of the memory value, automatically generated by the Memoryset model.
memory_id
instance-attribute
The ID of the memory in the table, automatically generated by the Memoryset.
memory_version
instance-attribute
The version of the memory, automatically maintained by the Memoryset.
reranker_score
class-attribute
instance-attribute
The similarity score assigned by the reranker.
Note
This will be automatically generated if a reranker is attached to the memoryset.
reranker_embedding
class-attribute
instance-attribute
The reranker embedding for this memory value.
Note
This will be automatically generated if a reranker is attached to the memoryset.
LabeledMemoryset
Collection of memories with labels that are stored in an OrcaDB table and can be queried using embedding similarity search.
Note
This will create a database if it doesn’t exist yet and a table in it.
Parameters:
- uri (str | None, default: None) – URL of the database that should store the memories table, or the name of the table for the memories. Either a file URL or the URL to a hosted OrcaDB instance is accepted. If empty, the ORCADB_URL environment variable is used instead. If a plain string that is not a URL is provided, it is interpreted as the name of the table to create in the database specified by the ORCADB_URL environment variable.
- api_key (str | None, default: None) – API key for the OrcaDB instance. If not provided, the ORCADB_API_KEY environment variable or the credentials encoded in the uri are used.
- secret_key (str | None, default: None) – Secret key for the OrcaDB instance. If not provided, the ORCADB_SECRET_KEY environment variable or the credentials encoded in the uri are used.
- database (str | None, default: None) – Name of the database. Do not provide this if it is already encoded in the uri.
- table (str | None, default: None) – Name of the table. Do not provide this if it is already encoded in the uri.
- embedding_model (EmbeddingModel, default: GTE_BASE) – Embedding model to use for semantic similarity search.
- reranker (Reranker | None, default: None) – Optional reranking model to use during lookup.
Examples:
Infer connection details from the ORCADB_URL, ORCADB_API_KEY, and ORCADB_SECRET_KEY environment variables:
>>> import os
>>> os.environ["ORCADB_URL"] = "https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db"
>>> LabeledMemoryset()
LabeledMemoryset(table="memories", database="my-db")
>>> LabeledMemoryset("my_memories_table")
LabeledMemoryset(table="my_memories_table", database="my-db")
All connection details can be fully encoded in the uri:
>>> LabeledMemoryset("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-db/my-memories-table")
LabeledMemoryset(table="my-memories-table", database="my-db")
Or they can be provided explicitly:
>>> LabeledMemoryset(
...     "https://instance.orcadb.cloud",
...     api_key="my-api-key",
...     secret_key="my-secret-key",
...     database="my-db",
...     table="my-memories-table"
... )
LabeledMemoryset(table="my-memories-table", database="my-db")
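To make the fully encoded form concrete, the sketch below pulls the pieces out of such a URL with Python's standard `urllib.parse`. It only illustrates which part of the URL carries which connection detail. How orcalib parses the uri internally is not shown in this reference, the helper name `split_connection_url` is hypothetical, and only the path-based form (`.../database/table`) from the examples above is covered.

```python
from urllib.parse import urlsplit


def split_connection_url(url: str) -> dict:
    """Hypothetical helper: split a URL of the form
    scheme://api-key:secret-key@host/database/table into its parts."""
    parts = urlsplit(url)
    # Path segments after the host: first the database, then (optionally) the table.
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "host": parts.hostname,
        "api_key": parts.username,
        "secret_key": parts.password,
        "database": segments[0] if len(segments) > 0 else None,
        "table": segments[1] if len(segments) > 1 else None,
    }


details = split_connection_url(
    "https://my-api-key:my-secret-key@instance.orcadb.cloud/my-db/my-memories-table"
)
print(details)
```

Any part that is missing from the URL comes back as None, which corresponds to the cases where the value would have to come from an explicit argument or an environment variable instead.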
insert
Inserts a dataset into the LabeledMemoryset database.
For dict-like or list of dict-like datasets, there must be a label key and one of the following keys: text, image, or value. If there are only two keys and one is label, the other will be inferred to be value.
For list-like datasets, the first element of each tuple must be the value and the second must be the label.
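The row rules above can be sketched as a small normalization function. This is an illustration of the documented rules only, not the orcalib implementation: the name `normalize_row` is hypothetical, and checking text, image, and value in that order when several are present is an assumption made for the sketch.

```python
from typing import Any

VALUE_KEYS = ("text", "image", "value")


def normalize_row(row: Any) -> tuple[Any, Any]:
    """Return (value, label) for one dataset row, following the documented rules."""
    if isinstance(row, dict):
        if "label" not in row:
            raise ValueError("dict-like rows need a 'label' key")
        for key in VALUE_KEYS:
            if key in row:
                return row[key], row["label"]
        # Only two keys and one is 'label': the other is inferred to be the value.
        others = [key for key in row if key != "label"]
        if len(row) == 2 and len(others) == 1:
            return row[others[0]], row["label"]
        raise ValueError("could not find a text, image, or value key")
    # List-like rows: first element is the value, second is the label.
    value, label = row
    return value, label


rows = [
    {"text": "text 1", "label": 0},
    {"review": "great", "label": 1},  # two keys, so 'review' is taken as the value
    ("text 2", 1),
]
normalized = [normalize_row(row) for row in rows]
print(normalized)
```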
Parameters:
- dataset (DatasetLike) – data to insert into the memoryset.
- log (bool, default: True) – whether to show a progress bar and log messages.
Examples:
Example 1: Inserting a dictionary-like dataset
>>> dataset = [{
...     "text": "text 1",
...     "label": 0
... }]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)
Example 2: Inserting a list-like dataset
>>> dataset = [
...     ("text 1", 0),
...     ("text 2", 1)
... ]
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> memoryset.insert(dataset)
Example 3: Inserting a Hugging Face Dataset
>>> from datasets import Dataset
lookup
Retrieves the most similar memories to the query from the memoryset.
Parameters:
- query (InputType | list[InputType] | ndarray) – The query to retrieve memories for. Can be a single value, a list of values, or a numpy array with value embeddings.
- k (int, default: 1) – The number of memories to retrieve.
- batch_size (int, default: 32) – The number of queries to process at a time.
- run_ids (list[int] | None, default: None) – A list of run IDs to track with the lookup.
- rerank (bool | None, default: None) – Whether to rerank the results. If None (default), results will be reranked if a reranker is attached to the Memoryset.
- log (bool, default: False) – Whether to log the lookup process and show progress bars.
Returns:
- list[list[LabeledMemoryLookup]] | list[MemoryLookupResults] – A list of lists of LabeledMemoryLookups, where each inner list contains the k most similar memories to the corresponding query.
Examples:
Example 1: Retrieving the most similar memory to a single example
>>> memoryset = LabeledMemoryset("file:///path/to/memoryset")
>>> query = "Apple"
>>> memories = memoryset.lookup(query, k=1)
>>> memories
[
    [
        LabeledMemoryLookup(
            value='Orange',
            memory_id=12,
            memory_version=1,
            label=0,
            label_name='fruit',
            embedding=array([...], dtype=float32),
            metadata=None,
            lookup_score=0.98,
            reranker_score=None,
            reranker_embedding=None
        )
    ]
]
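The nested return shape (one inner list per query, each holding the k best matches) can be illustrated with a toy nearest-neighbour search. This is a sketch under stated assumptions, not how orcalib performs the lookup: cosine similarity is assumed as the metric, the two-dimensional embeddings are made up, and `toy_lookup` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (assumed metric for this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_lookup(queries, memories, k=1):
    """For each query embedding, return the k most similar memories."""
    results = []
    for query in queries:
        scored = [
            {
                "value": memory["value"],
                "label": memory["label"],
                "lookup_score": cosine(query, memory["embedding"]),
            }
            for memory in memories
        ]
        scored.sort(key=lambda item: item["lookup_score"], reverse=True)
        results.append(scored[:k])
    return results


# Made-up memories with made-up embeddings, for illustration only.
memories = [
    {"value": "Orange", "label": 0, "embedding": [0.9, 0.1]},
    {"value": "Hammer", "label": 1, "embedding": [0.1, 0.9]},
]
results = toy_lookup([[1.0, 0.0], [0.0, 1.0]], memories, k=1)
print([[hit["value"] for hit in hits] for hits in results])
```

Passing two queries yields two inner lists, and raising k lengthens each inner list rather than adding more of them.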
to_list
Get a list of all the memories in the memoryset.
Returns:
- list[LabeledMemory] – list containing the memories.
to_pandas
update_embedding_model
Updates the embedding model for the memoryset and re-embeds all memories in the current memoryset or a new destination memoryset if it is provided.
Note
This will reset the destination memoryset if it already exists.
Parameters:
- embedding_model (EmbeddingModel) – New embedding model to use.
- destination (LabeledMemoryset | str | None, default: None) – Destination memoryset to store the results in. This can be a memoryset instance, the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist. If this is None, the current memoryset will be updated.
Returns:
- LabeledMemoryset – The destination memoryset with the updated embeddings.
Examples:
Replace the embedding model for the current memoryset:
>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.update_embedding_model(EmbeddingModel.CLIP_BASE)
Create a new memoryset with a new embedding model:
clone
Clone the current memoryset into a new memoryset.
Note
This will reset the destination memoryset if it already exists.
Parameters:
- destination (LabeledMemoryset | str) – The destination memoryset to clone this memoryset into. This can be a memoryset instance, the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.
Returns:
- LabeledMemoryset – The destination memoryset that the memories were cloned into.
Examples:
Clone a local memoryset into a hosted database:
>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.clone("https://<my-api-key>:<my-secret-key>@instance.orcadb.cloud/my-database#my_memoryset")
Clone a local memoryset into a new table in the same database:
map
Apply a function to all the memories in the memoryset and store them in the current memoryset or a new destination memoryset if it is provided.
Note
If your function returns a column that already exists, then it overwrites it.
Parameters:
- fn (Callable[[LabeledMemory], dict[str, Any] | LabeledMemory]) – Function that takes in the memory and returns a new memory or a dictionary containing the values to update in the memory.
- destination (LabeledMemoryset | str | None, default: None) – The destination memoryset to store the results in. This can be a memoryset instance, the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.
Returns:
- LabeledMemoryset – The destination memoryset with the updated memories.
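The two return styles accepted for fn (a dictionary of values to update, or a whole new memory) can be sketched with a plain dataclass. `ToyMemory` and `toy_map` are hypothetical stand-ins used only to show the overwrite behaviour. They are not orcalib classes and do not touch a database.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Union


@dataclass
class ToyMemory:
    """Hypothetical stand-in for a memory; not the orcalib LabeledMemory class."""
    value: str
    label: int
    metadata: dict = field(default_factory=dict)


def toy_map(
    memories: list[ToyMemory],
    fn: Callable[[ToyMemory], Union[dict[str, Any], ToyMemory]],
) -> list[ToyMemory]:
    """Apply fn to each memory; a dict result updates only the named fields."""
    mapped = []
    for memory in memories:
        result = fn(memory)
        if isinstance(result, dict):
            # Returned keys overwrite existing fields, other fields are kept.
            result = replace(memory, **result)
        mapped.append(result)
    return mapped


memories = [ToyMemory("text 1", 0, {"source": "a"})]
# Style 1: return a dict with only the fields to change.
updated = toy_map(memories, lambda m: dict(metadata=dict(**m.metadata, new_key="new_value")))
# Style 2: return a whole new memory (here with the binary label swapped).
swapped = toy_map(memories, lambda m: ToyMemory(m.value, 1 - m.label, m.metadata))
print(updated[0].metadata, swapped[0].label)
```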
Examples:
Add new metadata to all memories in the memoryset:
>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.map(lambda m: dict(metadata=dict(**m.metadata, new_key="new_value")))
Create a new memoryset with swapped labels in a new table in the same database:
filter
Filters the current memoryset using the given function and stores the result in the current memoryset or a new destination memoryset if it is provided.
Note
This will reset the destination memoryset if it already exists.
Parameters:
- fn (Callable[[LabeledMemory], bool]) – Function that takes in the memory and returns a boolean indicating whether the memory should be included or not.
- destination (LabeledMemoryset | str | None, default: None) – The destination memoryset to store the results in. This can be a memoryset instance, the URL to a new memoryset, or the name of a table in the same database. A table for the destination will be created if it does not already exist.
Returns:
- LabeledMemoryset – The destination memoryset with the filtered memories.
Examples:
Filter out memories with a label of 0:
>>> memoryset = LabeledMemoryset("file:./orca.db#my_memoryset")
>>> memoryset.filter(lambda m: m.label != 0)
Create a new memoryset with some metadata in a new table in the same database: