orcalib.torch_layers.embedding_generation

SentenceEmbeddingGenerator

Bases: Module

Model for creating embeddings for sentences using a pre-trained model.

Note
To fine-tune the embedding layer, initialize it with frozen=False.
Parameters:

- base_model (str) – Name or path of the pre-trained model to use as the base encoder.
- frozen (bool, default: True) – If True, freezes the base model parameters, preventing updates during training.
- pooling (Literal['cls', 'mean'], default: 'cls') – Pooling strategy for creating sentence embeddings. "cls" uses the [CLS] token embedding, while "mean" averages all token embeddings.
- normalize (bool, default: False) – If True, normalizes the output embeddings to unit length.
- max_sequence_length (int | None, default: None) – Maximum number of tokens to process. If None, uses the model's default max length.

Note
The embedding dimension is automatically set based on the hidden size of the base model.
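Example (a minimal usage sketch; the keyword arguments mirror the parameters above, and "sentence-transformers/all-MiniLM-L6-v2" is only an illustrative Hugging Face checkpoint, not one required by this class):
>>> embedder = SentenceEmbeddingGenerator(
...     base_model="sentence-transformers/all-MiniLM-L6-v2",
...     frozen=True,
...     pooling="mean",
...     normalize=True,
... )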
get_max_sequence_length

Get the maximum sequence length of the given texts to be used for tokenization.
Parameters:
Returns:
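The parameter and return details for this method are not listed above. As a hypothetical illustration only (passing the texts to be tokenized and receiving an int back is an assumption, not a documented signature):
>>> # Hypothetical call; the exact signature is not documented here.
>>> embedder.get_max_sequence_length(["Hello, world!", "A longer sentence."])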
tokenize

Tokenize the input text.

Parameters:

- text (str | list[str] | list[list[str]]) – The text to tokenize; either a single string, a batch of strings, or a batch of lists of strings (for batches of memories).
- name (str | None, default: None) – Optional name to use for the output keys. For example, when set to "memories", the output will have keys "memories_ids" and "memories_mask".
- return_tensors (bool, default: False) – If True, return the tokenized text as a tuple of tensors; otherwise return a BatchEncoding.
- sequence_length (int | None, default: None) – Length of the output sequence. If None, pads to the longest sequence in the batch.

Returns: the tokenized text (ids and attention mask), nested to match the input (a single encoding, a list, or a list of lists), or a tuple of tensors (ids and attention mask) if return_tensors is True.
Examples:
>>> embedder.tokenize("Hello, world!")
{"input_ids": [101, 7592, 2088, 2003, 2074, 102], "attention_mask": [1, 1, 1, 1, 1, 1]}
>>> embedder.tokenize(["Hello, world!", "Hello, universe!"], name="input")
{
"input_ids": [
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
"input_mask": [
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
}
>>> embedder.tokenize([
... ["Hello, world!", "Hello, universe!"],
... ["Hello, world!", "Hello, universe!"]
... ], name="memories")
{
"memories_ids": [
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
]
],
"memories_mask": [
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
],
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
]
}
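A further sketch of the return_tensors and sequence_length parameters (the tuple unpacking and the (batch_size, sequence_length) shapes shown are assumptions based on the parameter descriptions above):
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"],
...     return_tensors=True,
...     sequence_length=8,
... )
>>> input_ids.shape, attention_mask.shape
(torch.Size([2, 8]), torch.Size([2, 8]))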
forward

Compute embeddings for the input tokens.

Parameters:

- input_ids (Tensor) – Input token ids; long tensor of shape batch_size (x num_memories) x max_token_length.
- attention_mask (Tensor) – Input mask; float tensor of shape batch_size (x num_memories) x max_token_length.

Returns:

- Tensor – Embeddings for the input tokens; float tensor of shape batch_size (x num_memories) x embedding_dim.
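A hedged end-to-end sketch combining tokenize(..., return_tensors=True) with forward (invoked by calling the module directly, as with any torch Module); the embedding dimension of 384 is only illustrative and depends on the base model:
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"], return_tensors=True
... )
>>> embeddings = embedder(input_ids, attention_mask)  # calls forward
>>> embeddings.shape  # batch_size x embedding_dim
torch.Size([2, 384])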
encode

Encode the input text into embeddings.

Note
This method is not differentiable and should only be used for inference.

Parameters:

- text (str | list[str] | list[list[str]]) – The text to encode; either a single string, a batch of strings, or a batch of lists of strings (for batches of memories).

Returns:

- Tensor – Embeddings for the input text; float tensor of shape (batch_size x) (num_memories x) embedding_dim, where the batch and memory dimensions are present only for batched inputs.
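A short inference sketch for encode; the embedding dimension shown is illustrative and depends on the base model:
>>> embedding = embedder.encode("Hello, world!")
>>> embedding.shape  # a single string yields a single embedding vector
torch.Size([384])
>>> batch = embedder.encode(["Hello, world!", "Hello, universe!"])
>>> batch.shape  # batch_size x embedding_dim
torch.Size([2, 384])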