orcalib.lookup_cache_builder
OrcaLookupCacheBuilder
This class lets you pre-cache your lookup results in a HuggingFace Dataset for faster training. It also lets you inject the pre-cached lookup results into your model during training and inference.
In the future, we may extend this class to support additional dataset types and frameworks.
Examples:
First, configure the lookup cache builder with the necessary information about your lookups.
Next, add the lookup results to the HuggingFace Dataset.
lookup_cache_builder.add_lookups_to_hf_dataset(train_data, "text")
Finally, inject the lookup results into your model during training and inference.
Parameters:
- db (OrcaDatabase) – The OrcaDatabase instance that contains the memory table we’ll use for lookups.
- index_name (str) – The name of the memory index we’ll use for lookups.
- num_memories (int) – The number of memories to fetch for each element in the dataset.
- embedding_col_name (str | None) – The name of the feature that will be created to store the embedding of the query column. If None, the embedding will not be stored (useful when embeddings have already been processed).
- memory_column_aliases (Dict[str, str]) – Maps the lookup column names to the feature names that will be added to the Dataset. For example, {"$score": "memory_scores", "label": "memory_labels"} will add a memory_scores column and a memory_labels column to the Dataset that contain the scores and labels of the memories, respectively. It’s a good idea to align the aliases with the inputs to your model’s forward() method, e.g., forward(x, memory_scores, memory_labels).
- drop_exact_match (bool, default: False) – Whether to drop the highest match that’s above the exact_match_threshold.
- exact_match_threshold (float, default: EXACT_MATCH_THRESHOLD) – The similarity threshold for considering a memory an exact match.
- batch_size (int, default: 1000) – The batch size to use for the vector scan.
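As a rough illustration of what drop_exact_match and exact_match_threshold describe, the sketch below drops the top-scoring memory when its similarity is above a threshold. The helper name drop_top_exact_match, the (similarity, value) tuple layout, and the 0.99 threshold are assumptions made for this example; they are not orcalib's implementation, and 0.99 is not the actual value of EXACT_MATCH_THRESHOLD.

```python
from typing import Any, List, Tuple


def drop_top_exact_match(
    memories: List[Tuple[float, Any]],
    threshold: float,
) -> List[Tuple[float, Any]]:
    """Drop the best-scoring memory if its score is above `threshold`.

    Hypothetical helper: `memories` is a list of (similarity, value) pairs.
    """
    if not memories:
        return []
    # Sort from most to least similar so the candidate exact match comes first.
    ranked = sorted(memories, key=lambda m: m[0], reverse=True)
    # Only the single highest match is ever dropped, never several.
    if ranked[0][0] > threshold:
        return ranked[1:]
    return ranked


hits = [(0.71, "b"), (1.0, "a"), (0.42, "c")]
filtered = drop_top_exact_match(hits, threshold=0.99)  # 0.99 is illustrative only
unchanged = drop_top_exact_match([(0.71, "b"), (0.42, "c")], threshold=0.99)
```

In this sketch, a result set without a near-identical match passes through unchanged apart from being sorted by similarity.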
memory_column_aliases
instance-attribute
A mapping of the lookup column names to the feature names that will be added to the dataset.
from_model
classmethod
This is a convenient way to create an OrcaLookupCacheBuilder that is configured with the same lookup settings as a model or module. It creates a memory column alias for each lookup column name in the model’s settings by prepending the memory_feature_prefix to the column name. For example, if the model has a lookup column named label, the memory column alias will be memory_label, because memory_feature_prefix defaults to "memory_".
Parameters:
- model (OrcaLookupModule) – The model whose settings will be used to create the OrcaLookupCacheBuilder.
- embedding_col_name (str) – The name of the feature that will be created to store the embedding of the query column.
- memory_column_aliases (Dict[str, str], default: None) – A mapping of the lookup column names to the feature names that will be added to the Dataset. This helps when lookup column names conflict with special columns. For example, if you’re looking up both "score" and "$score", both would otherwise map to "memory_score"; providing an alias for "$score" avoids the conflict.
- memory_feature_prefix (str, default: 'memory_') – The prefix to prepend to create the names of the features that will be added to the Dataset to hold the lookup results. For example, if the model has a lookup column named label, the memory column alias will be memory_label. For special columns, e.g., $score, the $ will be removed: memory_score.
Returns:
- OrcaLookupCacheBuilder – An OrcaLookupCacheBuilder instance that is configured with the same settings as the model.
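The naming rule described for from_model (prepend the prefix, strip the $ from special columns, let explicit aliases win) can be sketched in plain Python. The function build_aliases, its overrides parameter, the "memory_similarity" alias, and the decision to raise on a conflict are all assumptions for illustration; this is not orcalib code, and the library may handle conflicts differently.

```python
from typing import Dict, Iterable, Optional


def build_aliases(
    lookup_columns: Iterable[str],
    prefix: str = "memory_",
    overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each lookup column to a feature name (hypothetical helper)."""
    overrides = overrides or {}
    aliases: Dict[str, str] = {}
    for column in lookup_columns:
        if column in overrides:
            # An explicit alias takes precedence over the derived name.
            aliases[column] = overrides[column]
        else:
            # Special columns such as "$score" lose the leading "$".
            aliases[column] = prefix + column.lstrip("$")
    # Two columns sharing one feature name would overwrite each other,
    # so this sketch rejects that case (an assumption, not library behavior).
    if len(set(aliases.values())) != len(aliases):
        raise ValueError(f"conflicting aliases: {aliases}")
    return aliases


default_aliases = build_aliases(["label", "$score"])
resolved = build_aliases(
    ["score", "$score"],
    overrides={"$score": "memory_similarity"},  # illustrative alias name
)
```

Without the override, "score" and "$score" would both become "memory_score", which is the conflict the memory_column_aliases parameter is meant to resolve.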
inject_lookup_results
Sets (or clears) the lookup result override for an OrcaLookupModule, e.g., OrcaLookupLayer, OrcaModel. All downstream lookup layers will use these results instead of performing a lookup by contacting the database. When the feature values are None, the lookup result override will be cleared instead.
Parameters:
- model (OrcaLookupModule) – The OrcaLookupModule to inject the lookup results into.
- features (Dict[str, list[list[Any]] | Tensor], default: {}) – A mapping of the memory column aliases to their values. These values will be converted to the correct type to be used as the lookup results. Important: The feature values should either all be None or all be non-None. If the feature values are None, the lookup result override will be cleared.
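The "all None or all non-None" rule for features can be made concrete with a small check. The sketch below is a hypothetical helper, not part of orcalib: the name should_clear_override, the ValueError on mixed input, and the treatment of an empty mapping as "clear" are assumptions. The nested-list layout (one inner list per example, one entry per memory) is inferred from the list[list[Any]] type hint.

```python
from typing import Any, Dict, Optional


def should_clear_override(features: Dict[str, Optional[Any]]) -> bool:
    """Return True if every value is None, False if none is; raise if mixed.

    Hypothetical helper illustrating the documented rule.
    """
    none_flags = [value is None for value in features.values()]
    if all(none_flags):
        # Also true for an empty mapping; treating that as "clear" is an assumption.
        return True
    if any(none_flags):
        raise ValueError("feature values must be all None or all non-None")
    return False


# Two examples in the batch, three memories each (assumed layout).
batch = {
    "memory_scores": [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]],
    "memory_labels": [[1, 0, 1], [0, 0, 1]],
}
inject = should_clear_override(batch)
clear = should_clear_override({"memory_scores": None, "memory_labels": None})
```

Here inject is False (the values would be used as the override) and clear is True (the override would be removed).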
get_lookup_result
Returns a BatchedScanResult whose lookup results contain the provided features.
Parameters:
- features (Dict[str, list[list[Any]] | Tensor], default: {}) – A mapping of the memory column aliases to their values. These values will be converted based on the column/artifact type to be used as the lookup results.
Returns:
- BatchedScanResult – The lookup results that contain the provided features for each memory.
add_lookups_to_hf_dataset
Adds the lookup results as columns (i.e., features) to a HuggingFace Dataset. This function performs a vector scan on the memory index to fetch the lookup results for each example in the dataset. The feature names for the memories will be the same as the memory_column_aliases provided in the constructor. The embedding of the query column will be stored in the feature named by embedding_col_name, if one was provided.
Parameters:
- ds (Dataset) – The HuggingFace dataset to add the lookup results to.
- query_column_name (str) – The name of the column that contains the query text to look up in the memory index.
- map_cache_file_name (str | None, default: None) – The name of the cache file to use when mapping the dataset.
Returns:
- Dataset – The HuggingFace dataset with the lookup results added as features.
Examples:
First, configure the lookup cache builder with the necessary information about your lookups.
Now, load the HuggingFace dataset and add the lookup results to it.