orcalib.torch_layers
SentenceEmbeddingGenerator
Bases: Module
Model for creating embeddings for sentences using a pre-trained model.
Note
To fine-tune the embedding layer, initialize it with frozen=False.
Parameters:
-
base_model
(str
) –Name or path of the pre-trained model to use as the base encoder.
-
frozen
(bool
, default:True
) –If True, freezes the base model parameters, preventing updates during training.
-
pooling
(Literal['cls', 'mean']
, default:'cls'
) –Pooling strategy for creating sentence embeddings. “cls” uses the [CLS] token embedding, while “mean” averages all token embeddings.
-
normalize
(bool
, default:False
) –If True, normalizes the output embeddings to unit length.
-
max_sequence_length
(int | None
, default:None
) –Maximum number of tokens to process. If None, uses the model’s default max length.
Note
The embedding dimension is automatically set based on the hidden size of the base model.
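A minimal construction sketch based on the parameters above; the checkpoint name is an illustrative assumption rather than anything this library prescribes:
>>> from orcalib.torch_layers import SentenceEmbeddingGenerator
>>> embedder = SentenceEmbeddingGenerator(
...     base_model="distilbert-base-uncased",  # assumed example checkpoint
...     frozen=True,      # keep the encoder weights fixed during training
...     pooling="mean",   # average all token embeddings instead of using [CLS]
...     normalize=True,   # return unit-length embeddings
... )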
get_max_sequence_length
Get the maximum sequence length of the given texts to be used for tokenization.
Parameters:
Returns:
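The parameters and return value are not documented here; a hedged sketch, assuming the method accepts the texts to be tokenized and returns an int:
>>> max_len = embedder.get_max_sequence_length(["Hello, world!", "Hello, universe!"])  # assumed call pattern
>>> # max_len can then be passed as sequence_length to tokenize (documented below)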
tokenize
Tokenize the input text.
Parameters:
- text (str | list[str] | list[list[str]]) – The text to tokenize; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories).
- name (str | None, default: None) – Optional name to use for the output keys, e.g. when set to “memories”, the output will have keys “memories_ids” and “memories_mask”.
- return_tensors (bool, default: False) – If True, return the tokenized text as a tuple of tensors; otherwise return a BatchEncoding.
- sequence_length (int | None, default: None) – Length of the output sequences. If None, pads to the longest sequence in the batch.
Returns: The tokenized text (ids and attention mask), nested to match the input (a single encoding, a list, or a list of lists), or a tuple of tensors (ids and attention mask) if return_tensors is True.
Examples:
>>> embedder.tokenize("Hello, world!")
{"input_ids": [101, 7592, 2088, 2003, 2074, 102], "attention_mask": [1, 1, 1, 1, 1, 1]}
>>> embedder.tokenize(["Hello, world!", "Hello, universe!"], name="input")
{
"input_ids": [
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
"input_mask": [
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
}
>>> embedder.tokenize([
... ["Hello, world!", "Hello, universe!"],
... ["Hello, world!", "Hello, universe!"]
... ], name="memories")
{
"memories_ids": [
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
]
],
"memories_mask": [
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
],
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
]
}
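The examples above return BatchEncoding-style mappings; a sketch combining the documented return_tensors and sequence_length options to get fixed-shape tensors instead:
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"],
...     return_tensors=True,  # tuple of tensors instead of a BatchEncoding
...     sequence_length=16,   # pad each sequence to 16 tokens
... )
>>> # input_ids and attention_mask are tensors of shape batch_size x sequence_length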
forward
Compute embeddings for the input tokens.
Parameters:
- input_ids (Tensor) – Input token ids, long tensor of shape batch_size (x num_memories) x max_token_length.
- attention_mask (Tensor) – Input mask, float tensor of shape batch_size (x num_memories) x max_token_length.
Returns:
- Tensor – Embeddings for the input tokens, float tensor of shape batch_size (x num_memories) x embedding_dim.
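A sketch linking tokenize and forward; since the layer is an nn.Module, calling the embedder instance (from the earlier construction sketch) invokes forward:
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"], return_tensors=True
... )
>>> embeddings = embedder(input_ids, attention_mask)  # differentiable, unlike encode below
>>> # embeddings: float tensor of shape batch_size x embedding_dim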
encode
Encode the input text into embeddings.
Note
This method is not differentiable and should only be used for inference.
Parameters:
- text (str | list[str] | list[list[str]]) – The text to encode; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories).
Returns:
- Tensor – Embeddings for the input text, float tensor of shape (batch_size x) (num_memories x) embedding_dim, where the optional dimensions are present only for batched inputs.
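A sketch of inference-time usage, assuming (as the shape notation above suggests) that the optional dimensions are dropped for un-batched input:
>>> vector = embedder.encode("Hello, world!")
>>> # vector: float tensor of shape embedding_dim
>>> batch = embedder.encode(["Hello, world!", "Hello, universe!"])
>>> # batch: float tensor of shape batch_size x embedding_dim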
CosineSimilarity
Bases: EmbeddingSimilarity
A shallow wrapper around torch.nn.CosineSimilarity that supports scoring multiple memories at once.
EmbeddingSimilarity
Abstract class for computing similarity between input and memory embeddings.
forward
abstractmethod
Compute similarity scores between the given input and memory embeddings.
Parameters:
- input_embedding (Tensor) – Input embeddings, float tensor of shape batch_size x embedding_dim.
- memories_embedding (Tensor) – Memory embeddings, float tensor of shape batch_size (x num_memories) x embedding_dim.
Returns:
- Tensor – Similarity scores between 0 and 1 for each memory in each batch, float tensor of shape batch_size (x num_memories).
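A sketch of this interface using the CosineSimilarity head documented above (assuming it can be constructed without arguments); the tensors are random placeholders:
>>> import torch
>>> from orcalib.torch_layers import CosineSimilarity
>>> similarity = CosineSimilarity()
>>> input_embedding = torch.randn(4, 128)         # batch_size x embedding_dim
>>> memories_embedding = torch.randn(4, 10, 128)  # batch_size x num_memories x embedding_dim
>>> scores = similarity(input_embedding, memories_embedding)
>>> # scores: float tensor of shape batch_size x num_memories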
FeedForwardSimilarity
Bases: EmbeddingSimilarity
Module to compute the similarity between input and memory embeddings, using a feedforward network with two hidden layers and an output layer that returns sigmoid-activated scores between 0 and 1.
Warning
Unlike other similarity heads, this layer has trainable parameters and will not output meaningful similarity scores unless trained on a dataset of input-memory pairs.
InnerProductSimilarity
Bases: EmbeddingSimilarity
Module to compute the inner product between input and memory embeddings.
Note
Inner product is equivalent to cosine similarity when the input embeddings are normalized. It will only output scores between 0 and 1 when the input embeddings are normalized.
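A short plain-PyTorch check of the note above, without using the library's own classes: once embeddings are unit-normalized, the inner product equals the cosine similarity:
>>> import torch
>>> import torch.nn.functional as F
>>> a = F.normalize(torch.randn(4, 128), dim=-1)
>>> b = F.normalize(torch.randn(4, 128), dim=-1)
>>> inner = (a * b).sum(dim=-1)
>>> cosine = F.cosine_similarity(a, b, dim=-1)
>>> torch.allclose(inner, cosine, atol=1e-6)
True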
GatherTopK
Bases: Module
Module to select the top k elements (and any associated properties) based on their weights.
Parameters:
- k (int) – Number of top elements to select.
forward
Select the top memories based on the weights and return their properties.
Parameters:
- weights (Tensor) – Weights to sort the selection by, float tensor of shape batch_size x num_total.
- other_props (Tensor, default: ()) – Other properties to select, each of shape batch_size x num_total (x optional_dim).
Returns:
- tuple[Tensor, ...] – Tuple of properties with the top elements selected, always including the weights as the first element, each of shape batch_size x num_top (x optional_dim).
Examples:
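A hypothetical usage sketch based only on the signature documented above; whether the selected elements come back sorted by descending weight is an assumption:
>>> import torch
>>> from orcalib.torch_layers import GatherTopK
>>> gather = GatherTopK(k=2)
>>> weights = torch.tensor([[0.1, 0.9, 0.5], [0.7, 0.2, 0.4]])  # batch_size x num_total
>>> labels = torch.tensor([[10, 11, 12], [20, 21, 22]])         # another property aligned with the weights
>>> top_weights, top_labels = gather(weights, labels)
>>> # top_weights: [[0.9, 0.5], [0.7, 0.4]]  (assuming descending order by weight)
>>> # top_labels:  [[11, 12], [20, 22]]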