orcalib.torch_layers
SentenceEmbeddingGenerator
Bases: Module
Model for creating embeddings for sentences using a pre-trained model.
Note
To fine-tune the embedding layer, initialize it with frozen=False.
Parameters:
-
base_model
(str
) –Name or path of the pre-trained model to use as the base encoder.
-
frozen
(bool
, default:True
) –If True, freezes the base model parameters, preventing updates during training.
-
pooling
(Literal['cls', 'mean']
, default:'cls'
) –Pooling strategy for creating sentence embeddings. “cls” uses the [CLS] token embedding, while “mean” averages all token embeddings.
-
normalize
(bool
, default:False
) –If True, normalizes the output embeddings to unit length.
-
max_sequence_length
(int | None
, default:None
) –Maximum number of tokens to process. If None, uses the model’s default max length.
Note
The embedding dimension is automatically set based on the hidden size of the base model.
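A minimal construction sketch based on the parameters above; the checkpoint name is an illustrative assumption rather than anything this library prescribes:
>>> from orcalib.torch_layers import SentenceEmbeddingGenerator
>>> embedder = SentenceEmbeddingGenerator(
...     base_model="distilbert-base-uncased",  # assumed example checkpoint
...     frozen=True,      # keep the encoder weights fixed during training
...     pooling="mean",   # average all token embeddings instead of using [CLS]
...     normalize=True,   # return unit-length embeddings
... )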
get_max_sequence_length
Get the maximum sequence length of the given texts to be used for tokenization.
Parameters:
Returns:
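The parameters and return value are not documented here; a hedged sketch, assuming the method accepts the texts to be tokenized and returns an int:
>>> max_len = embedder.get_max_sequence_length(["Hello, world!", "Hello, universe!"])  # assumed call pattern
>>> # max_len can then be passed as sequence_length to tokenize (documented below)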
tokenize
Tokenize the input text.
Parameters:
- text (str | list[str] | list[list[str]]) – The text to tokenize; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories).
- name (str | None, default: None) – Optional name to use for the output keys, e.g. when set to “memories”, the output will have keys “memories_ids” and “memories_mask”.
- return_tensors (bool, default: False) – If True, return the tokenized text as a tuple of tensors; otherwise return a BatchEncoding.
- sequence_length (int | None, default: None) – Length of the output sequences. If None, pads to the longest sequence in the batch.
Returns: The tokenized text (ids and attention mask), nested to match the input (a single encoding, a list, or a list of lists), or a tuple of tensors (ids and attention mask) if return_tensors is True.
Examples:
>>> embedder.tokenize("Hello, world!")
{"input_ids": [101, 7592, 2088, 2003, 2074, 102], "attention_mask": [1, 1, 1, 1, 1, 1]}
>>> embedder.tokenize(["Hello, world!", "Hello, universe!"], name="input")
{
"input_ids": [
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
"input_mask": [
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
}
>>> embedder.tokenize([
... ["Hello, world!", "Hello, universe!"],
... ["Hello, world!", "Hello, universe!"]
... ], name="memories")
{
"memories_ids": [
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
],
[
[101, 7592, 2088, 2003, 2074, 102],
[101, 7592, 2088, 2003, 2074, 102]
]
],
"memories_mask": [
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
],
[
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]
]
]
}
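The examples above return BatchEncoding-style mappings; a sketch combining the documented return_tensors and sequence_length options to get fixed-shape tensors instead:
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"],
...     return_tensors=True,  # tuple of tensors instead of a BatchEncoding
...     sequence_length=16,   # pad each sequence to 16 tokens
... )
>>> # input_ids and attention_mask are tensors of shape batch_size x sequence_length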
forward
Compute embeddings for the input tokens.
Parameters:
- input_ids (Tensor) – Input token ids, long tensor of shape batch_size (x num_memories) x max_token_length.
- attention_mask (Tensor) – Input mask, float tensor of shape batch_size (x num_memories) x max_token_length.
Returns:
- Tensor – Embeddings for the input tokens, float tensor of shape batch_size (x num_memories) x embedding_dim.
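A sketch linking tokenize and forward; since the layer is an nn.Module, calling the embedder instance (from the earlier construction sketch) invokes forward:
>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"], return_tensors=True
... )
>>> embeddings = embedder(input_ids, attention_mask)  # differentiable, unlike encode below
>>> # embeddings: float tensor of shape batch_size x embedding_dim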
encode
Encode the input text into embeddings.
Note
This method is not differentiable and should only be used for inference.
Parameters:
- text (str | list[str] | list[list[str]]) – The text to encode; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories).
Returns:
- Tensor – Embeddings for the input text, float tensor of shape (batch_size x) (num_memories x) embedding_dim, where the optional dimensions are present only for batched inputs.
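A sketch of inference-time usage, assuming (as the shape notation above suggests) that the optional dimensions are dropped for un-batched input:
>>> vector = embedder.encode("Hello, world!")
>>> # vector: float tensor of shape embedding_dim
>>> batch = embedder.encode(["Hello, world!", "Hello, universe!"])
>>> # batch: float tensor of shape batch_size x embedding_dim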
CosineSimilarity
Bases: EmbeddingSimilarity
A shallow wrapper around torch.nn.CosineSimilarity that supports scoring multiple memories at once.
EmbeddingSimilarity
Abstract class for computing similarity between input and memory embeddings.
forward
abstractmethod
Compute similarity scores between the given input and memory embeddings.
Parameters:
- input_embedding (Tensor) – Input embeddings, float tensor of shape batch_size x embedding_dim.
- memories_embedding (Tensor) – Memory embeddings, float tensor of shape batch_size (x num_memories) x embedding_dim.
Returns:
- Tensor – Similarity scores between 0 and 1 for each memory in each batch, float tensor of shape batch_size (x num_memories).
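A sketch of this interface using the CosineSimilarity head documented above (assuming it can be constructed without arguments); the tensors are random placeholders:
>>> import torch
>>> from orcalib.torch_layers import CosineSimilarity
>>> similarity = CosineSimilarity()
>>> input_embedding = torch.randn(4, 128)         # batch_size x embedding_dim
>>> memories_embedding = torch.randn(4, 10, 128)  # batch_size x num_memories x embedding_dim
>>> scores = similarity(input_embedding, memories_embedding)
>>> # scores: float tensor of shape batch_size x num_memories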
FeedForwardSimilarity
Bases: EmbeddingSimilarity
Module to compute the similarity between input and memory embeddings, using a feedforward network with two hidden layers and an output layer that returns sigmoid-activated scores between 0 and 1.
Warning
Unlike other similarity heads, this layer has trainable parameters and will not output meaningful similarity scores unless trained on a dataset of input-memory pairs.
InnerProductSimilarity
Bases: EmbeddingSimilarity
Module to compute the inner product between input and memory embeddings.
Note
Inner product is equivalent to cosine similarity when the input embeddings are normalized. It will only output scores between 0 and 1 when the input embeddings are normalized.
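A short plain-PyTorch check of the note above, without using the library's own classes: once embeddings are unit-normalized, the inner product equals the cosine similarity:
>>> import torch
>>> import torch.nn.functional as F
>>> a = F.normalize(torch.randn(4, 128), dim=-1)
>>> b = F.normalize(torch.randn(4, 128), dim=-1)
>>> inner = (a * b).sum(dim=-1)
>>> cosine = F.cosine_similarity(a, b, dim=-1)
>>> torch.allclose(inner, cosine, atol=1e-6)
True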
GatherTopK
Bases: Module
Module to select the top k elements (and any associated properties) based on their weights.
Parameters:
- k (int) – Number of top elements to select.
forward
Select the top memories based on the weights and return their properties.
Parameters:
- weights (Tensor) – Weights to sort the selection by, float tensor of shape batch_size x num_total.
- other_props (Tensor, default: ()) – Other properties to select, each of shape batch_size x num_total (x optional_dim).
Returns:
- tuple[Tensor, ...] – Tuple of properties with the top elements selected, always including the weights as the first element, each of shape batch_size x num_top (x optional_dim).
Examples:
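A hypothetical usage sketch based only on the signature documented above; whether the selected elements come back sorted by descending weight is an assumption:
>>> import torch
>>> from orcalib.torch_layers import GatherTopK
>>> gather = GatherTopK(k=2)
>>> weights = torch.tensor([[0.1, 0.9, 0.5], [0.7, 0.2, 0.4]])  # batch_size x num_total
>>> labels = torch.tensor([[10, 11, 12], [20, 21, 22]])         # another property aligned with the weights
>>> top_weights, top_labels = gather(weights, labels)
>>> # top_weights: [[0.9, 0.5], [0.7, 0.4]]  (assuming descending order by weight)
>>> # top_labels:  [[11, 12], [20, 22]]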