orcalib.torch_layers.embedding_generation#

SentenceEmbeddingGenerator #

SentenceEmbeddingGenerator(
    base_model,
    tokenizer_model=None,
    frozen=True,
    pooling="cls",
    normalize=False,
    max_sequence_length=None,
)

Bases: Module

Model for creating embeddings for sentences using a pre-trained model.

Note

To fine-tune the embedding layer, initialize it with frozen=False.

Parameters:

  • base_model (str) –

    Name or path of the pre-trained model to use as the base encoder.

  • frozen (bool, default: True ) –

    If True, freezes the base model parameters, preventing updates during training.

  • pooling (Literal['cls', 'mean'], default: 'cls' ) –

    Pooling strategy for creating sentence embeddings. “cls” uses the [CLS] token embedding, while “mean” averages all token embeddings.

  • normalize (bool, default: False ) –

    If True, normalizes the output embeddings to unit length.

  • max_sequence_length (int | None, default: None ) –

    Maximum number of tokens to process. If None, uses the model’s default max length.

Note

The embedding dimension is automatically set based on the hidden size of the base model.
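
A minimal usage sketch (the model name "bert-base-uncased" and the 768-dimensional hidden size in the comments are illustrative assumptions, not part of this API):

>>> embedder = SentenceEmbeddingGenerator(
...     base_model="bert-base-uncased",  # assumed example encoder name
...     pooling="mean",
...     normalize=True,
... )
>>> embedding = embedder.encode("Hello, world!")
>>> embedding.shape  # embedding_dim follows the base model's hidden size, e.g. torch.Size([768])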

device property #

device

Current device of the encoder

get_max_sequence_length #

get_max_sequence_length(text)

Get the maximum sequence length of the given texts to be used for tokenization.

Parameters:

  • text (list[str]) –

    the texts to get the maximum sequence length for

Returns:

  • int

    the maximum sequence length of the given texts, or the maximum sequence length supported by the model if the texts are longer than the model supports.
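
For instance (a sketch reusing the embedder from the example above; the returned value depends on the tokenizer of the chosen base model):

>>> length = embedder.get_max_sequence_length(
...     ["Hello, world!", "A somewhat longer example sentence."]
... )
>>> # length is the token count of the longest input, capped at the model's supported maximum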

tokenize #

tokenize(
    text,
    *,
    name=None,
    return_tensors=False,
    sequence_length=None
)

Tokenize the input text

Parameters:

  • text (str | list[str] | list[list[str]]) –

    the text to tokenize; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories)

  • name (str | None, default: None ) –

    optional name of the parameter to use in the output, e.g. when set to “memories”, the output will have keys “memories_ids” and “memories_mask”

  • return_tensors (bool, default: False ) –

    if True, return the tokenized text as a tuple of tensors; otherwise, return a BatchEncoding

  • sequence_length (int | None, default: None ) –

    length of the output sequences. If None, pads to the longest sequence in the batch.

Returns: the tokenized text (ids and attention mask), a list of tokenized texts, or a list of lists of tokenized texts, matching the shape of the input; if return_tensors is True, a tuple of tensors (ids and attention mask) instead

Examples:

>>> embedder.tokenize("Hello, world!")
{"input_ids": [101, 7592, 2088, 2003, 2074, 102], "attention_mask": [1, 1, 1, 1, 1, 1]}
>>> embedder.tokenize(["Hello, world!", "Hello, universe!"], name="input")
{
    "input_ids": [
        [101, 7592, 2088, 2003, 2074, 102],
        [101, 7592, 2088, 2003, 2074, 102]
    ],
    "input_mask": [
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1]
    ]
}
>>> embedder.tokenize([
...     ["Hello, world!", "Hello, universe!"],
...     ["Hello, world!", "Hello, universe!"]
... ], name="memories")
{
    "memories_ids": [
        [
            [101, 7592, 2088, 2003, 2074, 102],
            [101, 7592, 2088, 2003, 2074, 102]
        ],
        [
            [101, 7592, 2088, 2003, 2074, 102],
            [101, 7592, 2088, 2003, 2074, 102]
        ]
    ],
    "memories_mask": [
        [
            [1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1]
        ],
        [
            [1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1]
        ]
    ]
}

forward #

forward(input_ids, attention_mask)

Compute embeddings for the input tokens

Parameters:

  • input_ids (Tensor) –

    input token ids, long tensor of shape batch_size (x num_memories) x max_token_length

  • attention_mask (Tensor) –

    input mask, float tensor of shape batch_size (x num_memories) x max_token_length

Returns:

  • Tensor

    embeddings for the input tokens, float tensor of shape batch_size (x num_memories) x embedding_dim
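
A sketch of computing embeddings from pre-tokenized input (this assumes, per the tokenize documentation above, that return_tensors=True yields an (ids, attention mask) tensor pair):

>>> input_ids, attention_mask = embedder.tokenize(
...     ["Hello, world!", "Hello, universe!"],
...     return_tensors=True,
... )
>>> embeddings = embedder(input_ids, attention_mask)  # nn.Module call dispatches to forward
>>> embeddings.shape  # batch_size x embedding_dim, e.g. torch.Size([2, 768]) for a 768-dim model

Since encode is marked as non-differentiable below, forward is presumably the path to use inside a training loop (e.g. with frozen=False).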

encode #

encode(text)

Encode the input text into embeddings

Note

This method is not differentiable and should only be used for inference.

Parameters:

  • text (str | list[str] | list[list[str]]) –

    the text to encode; can be a single string, a batch of strings, or a batch of lists of strings (for batches of memories)

Returns:

  • Tensor

    embeddings for the input text, float tensor of shape (batch_size x) (num_memories x) embedding_dim, where the batch and memory dimensions are present only when the input is a batch of strings or a batch of lists of strings, respectively
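
A sketch of encoding a batch of memories (a list of lists of strings); the shape comment assumes a 768-dimensional base model:

>>> memory_embeddings = embedder.encode([
...     ["Hello, world!", "Hello, universe!"],
...     ["Goodbye, world!", "Goodbye, universe!"],
... ])
>>> memory_embeddings.shape  # batch_size x num_memories x embedding_dim, e.g. torch.Size([2, 2, 768])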