
orcalib.batched_scan_result#

BatchedScanResult #

BatchedScanResult(
    column_dict,
    data,
    batch_slice=None,
    memory_slice=None,
    column_slice=None,
)

A batched scan result, containing batches of memory results. Each batch contains a list of memories. Each memory contains a list of values that were selected in the query.

This class acts as a view on the underlying data, allowing you to slice it by batch, memory, and column. The slicing is lazy, so it doesn’t copy any of the underlying data.

Parameters:

  • column_dict (dict[ColumnName, OrcaTypeHandle]) –

    A dictionary of column name to column type. These are the columns that were requested in the query.

  • data (list[list[tuple[Any, ...]]]) –

    The underlying data. This is a list of batches, where each batch is a list of memories, where each memory is a tuple of values.

  • batch_slice (slice | int | None, default: None ) –

    Used internally to maintain a “view” of the data based on a subset of the batches. You shouldn’t need to set this manually.

  • memory_slice (slice | int | None, default: None ) –

    Used internally to maintain a “view” of the data based on a subset of the memories. You shouldn’t need to set this manually.

  • column_slice (ColumnSlice | None, default: None ) –

    Used internally to maintain a “view” of the data based on a subset of the columns. You shouldn’t need to set this manually.
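The nested layout these parameters describe can be sketched in plain Python; the column names, types, and values below are hypothetical:

```python
# Hypothetical illustration of the layout BatchedScanResult wraps:
# a list of batches -> a list of memories -> a tuple of column values.
column_dict = {"$embedding": "f32vector", "label": "text"}  # placeholder types

data = [
    # batch 0: two memories, each an ($embedding, label) tuple
    [([0.1, 0.2], "cat"), ([0.3, 0.4], "dog")],
    # batch 1: two memories
    [([0.5, 0.6], "bird"), ([0.7, 0.8], "fish")],
]

# A "view" only records its slices; slicing never copies the data.
first_batch = data[0]
assert first_batch[1] == ([0.3, 0.4], "dog")
```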

shuffle #

shuffle()

Shuffles the memories within each batch.
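Conceptually this reorders the memories independently within each batch; a plain-Python sketch of that behavior (not the library's implementation):

```python
import random

data = [
    [("a", 1), ("b", 2), ("c", 3)],  # batch 0
    [("d", 4), ("e", 5)],            # batch 1
]

for batch in data:
    random.shuffle(batch)  # reorder the memories of this batch in place

# Every batch still contains exactly the same memories, just reordered.
assert sorted(data[0]) == [("a", 1), ("b", 2), ("c", 3)]
assert sorted(data[1]) == [("d", 4), ("e", 5)]
```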

item #

item()

Return the single value of the result. This is only valid when the current view has been narrowed down to exactly one value (i.e., the result is not a list).

__getitem__ #

__getitem__(key)

Slice the data based on the given batch, memory, and column slices.

Parameters:

  • key (tuple[int, ...] | int) –

    Key for indexing into the current BatchedScanResult.

Returns:

  • BatchedScanResult

    A new BatchedScanResult that is a view on the underlying data.

Note
  • If we haven’t sliced the data at all, then the key must be one of batch_slice, (batch_slice), (batch_slice, memory_slice), or (batch_slice, memory_slice, column_slice).
  • If batch_slice is already set, then the key must be one of memory_slice, (memory_slice), or (memory_slice, column_slice)
  • If batch_slice and memory_slice are already set, then the key must be a column_slice.
  • A batch_slice can be a single batch index or a slice of batch indices.
  • A memory_slice can be a single memory index or a slice of memory indices.
  • A column_slice can be a single column name, a list of column names or indices, or a slice of column indices.

When batch_slice and memory_slice are ints, this function doesn’t return a BatchedScanResult. Instead, if one column is selected, it returns a single value. If multiple columns are selected, it returns a list of values.

Examples:

>>> # Slice the data by batch, memory, and column
>>> first_batch = result[0] # Get the first batch
>>> first_batch_last_memory = first_batch[-1:] # Get the last memory of the first batch
>>> first_batch_last_memory_vector = first_batch_last_memory["$embedding"] # Get the vector of the last memory of the first batch
>>> first_batch[-1:, "$embedding"] # Equivalent to the above
>>> result[0, -1:, "$embedding"] # Equivalent to the above
>>> result[0, -1:, ["$embedding", "col1"]] # Get the vector and col1 of the last memory of the first batch

to_tensor #

to_tensor(column=None, dtype=None, device=None)

Convert the selected values from a vector column of the batched scan results into a PyTorch tensor. This method is useful for preparing the scan results for machine learning models and other tensor-based computations.

This method assumes that the selected data can be appropriately converted into a tensor. It works best when the data is numeric and consistently shaped across batches and memories. Non-numeric data or inconsistent shapes may lead to errors or unexpected results.

Parameters:

  • column (ColumnName | int | None, default: None ) –

    Specifies the column from which to extract the values. If None, the method uses the current column slice. If a column has been singularly selected by previous slicing, this parameter is optional.

  • dtype (dtype | None, default: None ) –

    The desired data type of the resulting tensor. If not provided, the default is inferred based on the data types of the input values.

  • device (device | None, default: None ) –

    The device on which the resulting tensor will be allocated. Use this to specify if the tensor should be on CPU, GPU, or another device. If not provided, the default is the current device setting in PyTorch.

Returns:

  • Tensor

    A tensor representation of the selected data. The shape of the tensor is typically (batch_size, num_memories, embedding_dim), but can vary based on the current slicing of the BatchedScanResult object.
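Since the conversion requires numeric, consistently shaped data, the expected (batch_size, num_memories, embedding_dim) shape can be sanity-checked on the raw nested lists; a sketch with hypothetical data:

```python
# Two batches, each with three memories of 4-dimensional embeddings.
data = [
    [[0.0, 0.1, 0.2, 0.3] for _ in range(3)],
    [[1.0, 1.1, 1.2, 1.3] for _ in range(3)],
]

# Every batch and memory must agree on these dimensions for a clean tensor.
shape = (len(data), len(data[0]), len(data[0][0]))
assert shape == (2, 3, 4)
# torch.tensor(data).shape would then be torch.Size([2, 3, 4])
```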

Examples:

>>> result = my_index.vector_scan(...)
>>> # Convert the '$embedding' column into a tensor
>>> embedding_tensor = result.to_tensor(column='$embedding')
>>> # Convert and specify data type and device
>>> embedding_tensor = result[0:2, :, 'features'].to_tensor(dtype=torch.float32, device=torch.device('cuda:0'))

__len__ #

__len__()

Based on the current slices, return the number of batches, memories, or values in a vector column.

Returns:

  • int

    The return type depends on the current slices:

    • When batch_slice is an int (but memory_slice and column_slice are None), this returns the number of memories in that batch.
    • When batch_slice and memory_slice are both ints (but column_slice is None), this returns the number of values in that memory.
    • Otherwise, this returns the number of batches with the specified subset of selected memories/columns.
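The three cases can be illustrated directly against the raw nested data (a sketch of the semantics, not the library's code):

```python
data = [
    [(1.0, "x"), (2.0, "y"), (3.0, "z")],  # batch 0: three memories
    [(4.0, "w")],                          # batch 1: one memory
]

# No slicing: len() is the number of batches.
assert len(data) == 2
# batch_slice=0: len() is the number of memories in that batch.
assert len(data[0]) == 3
# batch_slice=0, memory_slice=1: len() is the number of values in that memory.
assert len(data[0][1]) == 2
```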

__iter__ #

__iter__()

Iterate over the batches of memories.

Returns:

  • Iterator

    The return type depends on the current slices:

    • When batch_slice is an int (but memory_slice and column_slice are None), this yields each memory from that batch.
    • When batch_slice and memory_slice are both ints (but column_slice is None), this yields each value from that memory.
    • Otherwise, this yields each batch with the specified subset of selected memories/columns.
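The same three cases apply to iteration; sketched against raw nested data:

```python
data = [
    [("a", 1), ("b", 2)],  # batch 0
    [("c", 3)],            # batch 1
]

# No slicing: iteration yields batches.
assert [b for b in data] == [[("a", 1), ("b", 2)], [("c", 3)]]
# batch_slice=0: iteration yields the memories of that batch.
assert [m for m in data[0]] == [("a", 1), ("b", 2)]
# batch_slice=0, memory_slice=1: iteration yields that memory's values.
assert [v for v in data[0][1]] == ["b", 2]
```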

map_values #

map_values(func)

Apply a function to the column values for each memory in the current view of the data.

Note

This will make a copy of the underlying data.

Parameters:

  • func (Callable[[Sequence[Any]], list[Any]]) –

    A function that takes a sequence of column values and returns a list of new values. Note that the length of the returned list must match the length of the input list. The column values will be in the same order as the column names in the column_dict.

Examples:

>>> def add_one(values):
...     return [val + 1 for val in values]
>>> result.map_values(add_one)
>>> result[0, ::2].map_values(add_one) # Only updates the values of the even memories in the first batch
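The copy-and-transform behavior can be sketched in plain Python (map_memories is a hypothetical helper, not the library's implementation):

```python
def map_memories(data, func):
    # Build a transformed copy: func maps each memory's values to new values
    # of the same length; the original nested structure is left untouched.
    return [[tuple(func(list(memory))) for memory in batch] for batch in data]

data = [[(1, 10), (2, 20)]]
out = map_memories(data, lambda values: [v + 1 for v in values])
assert out == [[(2, 11), (3, 21)]]
assert data == [[(1, 10), (2, 20)]]  # the input is not mutated
```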

to_list #

to_list()

Convert the values of a vector column to a list of lists of tuples

Returns:

  • list[Any]

    A list of lists of values. The outer list represents the batches, the inner list represents the memories, and the innermost tuple represents the values of the vector

Examples:

>>> bsr[0].to_list() # returns the list of memories in the first batch
>>> bsr[0, 0].to_list() # returns a list of the column values in the first memory of the first batch.
>>> bsr[0, 0, "col1"].to_list() # returns the value of "col1" for the first memory of the first batch
>>> bsr[0, 0, ["col1", "col2"]].to_list() # returns [value of col1, value of col2] for the first memory of the first batch
>>> bsr[1:3, -2:, ["col1", "col2"]].to_list() # returns a list of lists of [value of col1, value of col2] for the last two memories of the second and third batches

df #

df(limit=None, explode=False)

Convert the current view of your results into a pandas DataFrame, enabling easy manipulation and analysis of the data.

This method restructures the nested data into a tabular format, while respecting the current slicing of the BatchedScanResult object. If the object has been sliced to select certain batches, memories, or columns, only the selected data will be included in the DataFrame.

Special columns _batch and _memory are added to the DataFrame if the batch or memory, respectively, has not been singularly selected. These columns track the batch and memory indices of each row in the DataFrame.
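The flattening into rows with _batch and _memory index columns can be sketched without pandas (the column names here are hypothetical); the resulting dicts are exactly what pandas.DataFrame(rows) would accept:

```python
column_names = ["score", "label"]
data = [
    [(0.9, "cat"), (0.4, "dog")],  # batch 0
    [(0.7, "bird")],               # batch 1
]

rows = []
for b, batch in enumerate(data):
    for m, memory in enumerate(batch):
        row = {"_batch": b, "_memory": m}  # track the indices of this row
        row.update(zip(column_names, memory))
        rows.append(row)

assert rows[2] == {"_batch": 1, "_memory": 0, "score": 0.7, "label": "bird"}
```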

Parameters:

  • limit (int | None, default: None ) –

    If provided, limits the number of rows in the resulting DataFrame to the specified value. This can be useful for large datasets where you only need a sample of the data for quick analysis or visualization.

  • explode (bool, default: False ) –

    If True, any list-like columns in the DataFrame will be ‘exploded’ into separate rows, each containing one element from the list. This parameter is not yet implemented; its value currently has no effect on the method’s behavior.

Returns:

  • DataFrame

    A DataFrame representing the selected portions of the batched scan data. The exact shape and content of the DataFrame depend on the current state of the object, including any applied batch, memory, and column slices.

Examples:

>>> result = BatchedScanResult(...)
>>> # Convert entire data to DataFrame
>>> df = result.df()
>>> # Convert first 10 rows to DataFrame
>>> df_limited = result.df(limit=10)
>>> # Convert and 'explode' list-like columns (if implemented)
>>> df_exploded = result.df(explode=True)

BatchedScanResultBuilder #

BatchedScanResultBuilder()

A helper class to build a BatchedScanResult object incrementally. This class is useful when you want to build a BatchedScanResult object in a loop or by iterating over a large dataset.

add_feature #

add_feature(name, feature_type, values)

Add a feature column to the BatchedScanResultBuilder object. The feature values should be a list of values, one per memory; the length of the list should equal the number of memories in each batch.

build #

build()

Build the BatchedScanResult object from the added features.

Returns:

  • BatchedScanResult

    A BatchedScanResult object with the added features.
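A plain-Python sketch of the column-wise accumulation this builder implies; the transposition of per-column lists into per-memory tuples is an assumption for illustration, not the library's actual code:

```python
features = {}  # feature name -> list of per-memory values

def add_feature(name, values):
    # Each column must supply one value per memory, in the same order.
    features[name] = values

add_feature("score", [0.9, 0.4, 0.7])
add_feature("label", ["cat", "dog", "bird"])

# Building would need per-memory tuples: transpose the column-wise lists.
memories = list(zip(*features.values()))
assert memories == [(0.9, "cat"), (0.4, "dog"), (0.7, "bird")]
```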