
orcalib.lookup_cache_builder#

OrcaLookupCacheBuilder #

OrcaLookupCacheBuilder(
    db,
    index_name,
    num_memories,
    embedding_col_name,
    memory_column_aliases,
    drop_exact_match=False,
    exact_match_threshold=EXACT_MATCH_THRESHOLD,
    batch_size=1000,
)

This class allows you to precache your lookup results into a HuggingFace Dataset for faster training. It also allows you to inject the pre-cached lookup results into your model during training and inference.

In the future, we may extend this class to support additional dataset types and frameworks.

Examples:

First, configure the lookup cache builder with the necessary information about your lookups.

lookup_cache_builder = OrcaLookupCacheBuilder(
    db=OrcaDatabase(DB_NAME),
    index_name=INDEX_NAME,
    num_memories=10,
    embedding_col_name="embeddings",
    memory_column_aliases={"$score": "memory_scores", "label": "memory_labels"},
)

train_data = load_dataset(DATASET_NAME)

Next, add the lookup results to the HuggingFace Dataset.

lookup_cache_builder.add_lookups_to_hf_dataset(train_data, "text")

Finally, inject the lookup results into your model during training and inference.

class MyModel(OrcaLookupModule):
    def __init__(self, cache_builder: OrcaLookupCacheBuilder):
        ...
        # Orca modules that use lookup layers don't need to know about the lookup cache builder.
        self.my_orca_layer = OrcaLookupLayer(...)
        self.cache_builder = cache_builder

    def forward(self, x, memory_scores, memory_labels):
        # Finally, inject the lookup results into the model. Downstream lookup layers will use these results.
        self.cache_builder.inject_lookup_results(self, memory_scores=memory_scores, memory_labels=memory_labels)
        ...

Parameters:

  • db (OrcaDatabase) –

    The OrcaDatabase instance that contains the memory table we’ll use for lookups.

  • index_name (str) –

    The name of the memory index we’ll use for lookups.

  • num_memories (int) –

    The number of memories to fetch for each element in the data set.

  • embedding_col_name (str | None) –

The name of the feature that will be created to store the embedding of the query column. If None, the embedding will not be stored (useful when the embeddings have already been computed).

  • memory_column_aliases (Dict[str, str]) –

Maps each lookup column name to the feature name that will be added to the Dataset. For example, {"$score": "memory_scores", "label": "memory_labels"} will add a memory_scores column and a memory_labels column to the Dataset containing the scores and labels of the memories, respectively. It’s a good idea to match the aliases to the inputs of your model’s forward() method, e.g., forward(x, memory_scores, memory_labels).

  • drop_exact_match (bool, default: False ) –

    Whether to drop the highest match that’s above the exact_match_threshold.

  • exact_match_threshold (float, default: EXACT_MATCH_THRESHOLD ) –

    The similarity threshold for considering a memory an exact match.

  • batch_size (int, default: 1000 ) –

    The batch size to use for the vector scan. Defaults to 1000.

memory_column_aliases instance-attribute #

memory_column_aliases = memory_column_aliases

A mapping of lookup column names to the feature names that will be added to the dataset.

from_model classmethod #

from_model(
    model,
    embedding_col_name,
    memory_column_aliases=None,
    memory_feature_prefix="memory_",
)

This is a convenient way to create an OrcaLookupCacheBuilder that is configured with the same lookup settings as a model or module. It will create memory column aliases for each lookup column name in the model’s settings by prepending the memory_feature_prefix to the column name. For example, if the model has a lookup column named label, the memory column alias will be memory_label, because memory_feature_prefix defaults to "memory_".

Parameters:

  • model (OrcaLookupModule) –

    The model whose settings will be used to create the OrcaLookupCacheBuilder.

  • embedding_col_name (str) –

    The name of the feature that will be created to store the embedding of the query column.

  • memory_column_aliases (Dict[str, str], default: None ) –

A mapping of the lookup column names to the feature names that will be added to the Dataset. This can help when there are conflicts between lookup column names and special columns. For example, if you’re looking up both "score" and "$score", both would map to "memory_score" by default; providing an explicit alias for "$score" avoids the conflict. Defaults to None.

  • memory_feature_prefix (str, default: 'memory_' ) –

    The prefix to prepend to create the names of the features that will be added to the Dataset to hold the lookup results. Defaults to "memory_". For example, if the model has a lookup column named label, the memory column alias will be memory_label. For special columns, e.g., $score, the $ will be removed: memory_score.

Returns:

  • OrcaLookupCacheBuilder

    An OrcaLookupCacheBuilder instance that is configured with the same settings as the model.
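For example, a builder can be derived from a model that is already configured with lookup settings. In this sketch, MyModel and its constructor arguments are placeholders, and the model is assumed to look up the $score and label columns:

# Hypothetical model whose lookup layers already define the index name,
# number of memories, and lookup columns.
model = MyModel(...)

# Mirror the model's lookup configuration. With the default prefix, the
# lookup columns "$score" and "label" become the dataset features
# "memory_score" and "memory_label".
lookup_cache_builder = OrcaLookupCacheBuilder.from_model(
    model,
    embedding_col_name="embeddings",
)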

inject_lookup_results #

inject_lookup_results(model, **features)

Sets (or clears) the lookup result override for an OrcaLookupModule, e.g., OrcaLookupLayer, OrcaModel. All downstream lookup layers will use these results instead of performing a lookup by contacting the database. When the feature values are None, the lookup result override will be cleared instead.

Parameters:

  • model (OrcaLookupModule) –

    The OrcaLookupModule to inject the lookup results into.

  • features (Dict[str, list[list[Any]] | Tensor], default: {} ) –

    A mapping of the memory column aliases to their values. These values will be converted to the correct type to be used as the lookup results. Important: The feature values should all be None or all be non-None. If the feature values are None, the lookup result override will be cleared.

Example
class MyModel(OrcaLookupModule):
    ...

    def forward(self, x, memory_scores, memory_labels):
        self.cache_builder.inject_lookup_results(self, memory_scores=memory_scores, memory_labels=memory_labels)
        ...

get_lookup_result #

get_lookup_result(**features)

Returns a BatchedScanResult containing lookup results built from the provided features.

Parameters:

  • features (Dict[str, list[list[Any]] | Tensor], default: {} ) –

    A mapping of the memory column aliases to their values. These values will be converted based on the column/artifact type to be used as the lookup results.

Returns:

  • BatchedScanResult

    The lookup results that contain the provided features for each memory.
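As a sketch, reusing the memory_scores and memory_labels aliases from the constructor example above, with batch standing in for a batch of cached dataset columns:

# The cached features have shape (batch_size, num_memories) and are converted
# back into a lookup result that downstream lookup layers can consume.
lookup_result = lookup_cache_builder.get_lookup_result(
    memory_scores=batch["memory_scores"],
    memory_labels=batch["memory_labels"],
)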

add_lookups_to_hf_dataset #

add_lookups_to_hf_dataset(
    ds, query_column_name, map_cache_file_name=None
)

Adds the lookup results as columns (i.e., features) to a HuggingFace Dataset. This function performs a vector scan on the memory index to fetch the lookup results for each example in the dataset. The feature names for the memories will be the same as the memory_column_aliases provided in the constructor, and the embedding of the query column will be stored in the feature named by the embedding_col_name provided in the constructor.

Parameters:

  • ds (Dataset) –

    The HuggingFace dataset to add the lookup results to.

  • query_column_name (str) –

The name of the column that contains the query text to look up in the memory index.

  • map_cache_file_name (str | None, default: None ) –

The name of the cache file to use when mapping over the dataset to add the lookup results. Defaults to None.

Returns:

  • Dataset

    The HuggingFace dataset with the lookup results added as features.

Examples:

First, configure the lookup cache builder with the necessary information about your lookups.

lookup_cache_builder = OrcaLookupCacheBuilder(
    db=OrcaDatabase(DB_NAME),
    index_name=INDEX_NAME,
    num_memories=10,
    embedding_col_name="embeddings",
    memory_column_aliases={"$score": "memory_scores", "label": "memory_labels"},
)

Now, load the HuggingFace dataset and add the lookup results to it.

train_data = load_dataset(DATASET_NAME) # Load the HuggingFace dataset
lookup_cache_builder.add_lookups_to_hf_dataset(train_data, "text")