orcalib.batched_scan_result#
BatchedScanResult#
A batched scan result, containing batches of memory results. Each batch contains a list of memories. Each memory contains a list of values that were selected in the query.
This class acts as a view on the underlying data, allowing you to slice it by batch, memory, and column. The slicing is lazy, so it doesn’t copy any of the underlying data.
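As a rough mental model, the lazy-view idea can be sketched with a toy class. This is a minimal standalone sketch of the concept, not orcalib's implementation: the view holds a reference to the shared nested data (batches → memories → value tuples) and slicing never copies it.

```python
# Toy sketch of the lazy-view idea behind BatchedScanResult (NOT orcalib code).
# The underlying data is batches -> memories -> value tuples; slicing returns a
# new view over the SAME list, copying nothing.

class LazyView:
    def __init__(self, data, batch=slice(None)):
        self.data = data    # list[list[tuple]]; shared, never copied
        self.batch = batch  # which batches this view exposes

    def __getitem__(self, batch_slice):
        # Return a new view over the same underlying list.
        return LazyView(self.data, batch_slice)

    def materialize(self):
        # Only here is anything actually read out of the data.
        sel = self.data[self.batch]
        return sel if isinstance(self.batch, slice) else [sel]

data = [[("a", 1), ("b", 2)], [("c", 3)]]
view = LazyView(data)[0:1]
```

Note that `view.data is data` holds: the slice produced a new view object, not a copy of the batches.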
Parameters:

- column_dict (dict[ColumnName, OrcaTypeHandle]) – A dictionary mapping column names to column types. These are the columns that were requested in the query.
- data (list[list[tuple[Any, ...]]]) – The underlying data: a list of batches, where each batch is a list of memories, and each memory is a tuple of values.
- batch_slice (slice | int | None, default: None) – Used internally to maintain a "view" of the data based on a subset of the batches. You shouldn't need to set this manually.
- memory_slice (slice | int | None, default: None) – Used internally to maintain a "view" of the data based on a subset of the memories. You shouldn't need to set this manually.
- column_slice (ColumnSlice | None, default: None) – Used internally to maintain a "view" of the data based on a subset of the columns. You shouldn't need to set this manually.
__getitem__#
Slice the data based on the given batch, memory, and column slices.
Parameters:

- key – The batch, memory, and column slices to apply; see the note below for the accepted forms.

Returns:

- BatchedScanResult – A new BatchedScanResult that is a view on the underlying data.
Note

- If we haven't sliced the data at all, the key must be one of: batch_slice, (batch_slice), (batch_slice, memory_slice), or (batch_slice, memory_slice, column_slice).
- If batch_slice is already set, the key must be one of: memory_slice, (memory_slice), or (memory_slice, column_slice).
- If batch_slice and memory_slice are already set, the key must be a column_slice.
- A batch_slice can be a single batch index or a slice of batch indices.
- A memory_slice can be a single memory index or a slice of memory indices.
- A column_slice can be a single column name, a list of column names or indices, or a slice of column indices.

When batch_slice and memory_slice are ints, this function doesn't return a BatchedScanResult. Instead, if one column is selected, it returns a single value; if multiple columns are selected, it returns a list of values.
Examples:
>>> # Slice the data by batch, memory, and column
>>> first_batch = result[0] # Get the first batch
>>> first_batch_last_memory = first_batch[-1:] # Get the last memory of the first batch
>>> first_batch_last_memory_vector = first_batch_last_memory["$embedding"] # Get the vector of the last memory of the first batch
>>> first_batch[-1:, "$embedding"] # Equivalent to the above
>>> result[0, -1:, "$embedding"] # Equivalent to the above
>>> result[0, -1:, ["$embedding", "col1"]] # Get the vector and col1 of the last memory of the first batch
to_tensor#
Convert the selected values from a vector column of the batched scan results into a PyTorch tensor. This method is useful for preparing the scan results for machine learning models and other tensor-based computations.
This method assumes that the selected data can be appropriately converted into a tensor. It works best when the data is numeric and consistently shaped across batches and memories. Non-numeric data or inconsistent shapes may lead to errors or unexpected results.
Parameters:

- column (ColumnName | int | None, default: None) – Specifies the column from which to extract the values. If None, the method uses the current column slice. If a single column has already been selected by previous slicing, this parameter is optional.
- dtype (dtype | None, default: None) – The desired data type of the resulting tensor. If not provided, the dtype is inferred from the data types of the input values.
- device (device | None, default: None) – The device on which the resulting tensor will be allocated. Use this to specify whether the tensor should be on CPU, GPU, or another device. If not provided, the default is PyTorch's current device setting.
Returns:

- Tensor – A tensor representation of the selected data. The shape of the tensor is typically (batch_size, num_memories, embedding_dim), but can vary based on the current slicing of the BatchedScanResult object.
Examples:
>>> result = my_index.vector_scan(...)
>>> # Convert the '$embedding' column into a tensor
>>> embedding_tensor = result.to_tensor(column='$embedding')
>>> # Convert and specify data type and device
>>> embedding_tensor = result[0:2, :, 'features'].to_tensor(dtype=torch.float32, device=torch.device('cuda:0'))
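The (batch_size, num_memories, embedding_dim) shape convention can be illustrated without PyTorch. The sketch below is an assumption-level illustration (plain Python, not orcalib code): pulling one vector column out of the nested batches → memories → values structure yields exactly the nesting that a tensor constructor would stack.

```python
# Sketch of the shape convention to_tensor follows (plain Python, no PyTorch).
# Extracting one vector column from nested scan data yields a
# (batch_size, num_memories, embedding_dim) structure.

def column_to_nested(data, col_index):
    """Pull one column out of batches -> memories -> value tuples."""
    return [[memory[col_index] for memory in batch] for batch in data]

# Two batches, two memories each; column 0 holds a 3-dim embedding.
data = [
    [([0.1, 0.2, 0.3], "x"), ([0.4, 0.5, 0.6], "y")],
    [([0.7, 0.8, 0.9], "z"), ([1.0, 1.1, 1.2], "w")],
]
nested = column_to_nested(data, 0)
# torch.tensor(nested) would have shape (2, 2, 3).
```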
__len__#
Based on the current slices, return the number of batches, memories, or values in a vector column.
Returns:

- int – The return value depends on the current slices:
  - When batch_slice is an int (but memory_slice and column_slice are None), this returns the number of memories in that batch.
  - When batch_slice and memory_slice are both ints (but column_slice is None), this returns the number of values in that memory.
  - Otherwise, this returns the number of batches with the specified subset of selected memories/columns.
__iter__#
Iterate over the batches of memories.

Returns:

- Iterator – The yielded values depend on the current slices:
  - When batch_slice is an int (but memory_slice and column_slice are None), this yields each memory from that batch.
  - When batch_slice and memory_slice are both ints (but column_slice is None), this yields each value from that memory.
  - Otherwise, this yields each batch with the specified subset of selected memories/columns.
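The length/iteration contract above can be demonstrated with plain nested lists. This is a toy illustration of the three cases, not orcalib code:

```python
# Toy illustration of the __len__/__iter__ contract on nested scan data
# (batches -> memories -> value tuples). NOT orcalib code.

data = [
    [(1, "a"), (2, "b"), (3, "c")],  # batch 0: three memories
    [(4, "d")],                      # batch 1: one memory
]

# No batch selected: length/iteration is over batches.
assert len(data) == 2

# batch_slice is an int: length/iteration is over that batch's memories.
batch0 = data[0]
assert len(batch0) == 3

# batch_slice and memory_slice are both ints: length/iteration is over values.
memory = data[0][1]
assert len(memory) == 2
assert list(memory) == [2, "b"]
```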
map_values#
Apply a function to the column values for each memory in the current view of the data.
Note
This will make a copy of the underlying data.
Parameters:

- func (Callable[[Sequence[Any]], list[Any]]) – A function that takes a sequence of column values and returns a list of new values. Note that the length of the returned list must match the length of the input list. The column values will be in the same order as the column names in the column_dict.

Returns:

- BatchedScanResult – A new BatchedScanResult object with the modified values.
Examples:
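As an illustration of the contract above, here is a standalone sketch (not orcalib code) of applying a function to each memory's value tuple while preserving length and order, and copying rather than mutating the data:

```python
# Standalone sketch of the map_values contract (NOT orcalib code):
# apply func to each memory's column values; the result must have the
# same length as the input, and the original data is left untouched.

def map_values(data, func):
    """data: batches -> memories -> value tuples; func: Sequence -> list."""
    out = []
    for batch in data:
        new_batch = []
        for memory in batch:
            new_values = func(memory)
            assert len(new_values) == len(memory), "func must preserve length"
            new_batch.append(tuple(new_values))
        out.append(new_batch)
    return out  # a copy of the data, matching the Note above

data = [[(1, 2), (3, 4)], [(5, 6)]]
doubled = map_values(data, lambda values: [v * 2 for v in values])
```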
to_list#
Convert the values of a vector column to a list of lists of tuples.

Returns:

- list[Any] – A list of lists of values. The outer list represents the batches, the inner lists represent the memories, and the innermost tuples represent the values of the vector.
Examples:
>>> bsr[0, 0].to_list() # returns a list of the column values in the first memory of the first batch.
>>> bsr[0, 0, "col1"].to_list() # returns the value of "col1" for the first memory of the first batch
>>> bsr[0, 0, ["col1", "col2"]].to_list() # returns [value of col1, value of col2] for the first memory of the first batch
>>> bsr[1:3, -2:, ["col1", "col2"]].to_list() # returns a list of lists of [value of col1, value of col2] for the last two memories of the second and third batches
df#
Convert the current view of your results into a pandas DataFrame, enabling easy manipulation and analysis of the data.
This method restructures the nested data into a tabular format, while respecting the current slicing of the BatchedScanResult object. If the object has been sliced to select certain batches, memories, or columns, only the selected data will be included in the DataFrame.
Special columns _batch and _memory are added to the DataFrame if the batch or memory, respectively, has not been singularly selected. These columns track the batch and memory indices of each row in the DataFrame.
Parameters:

- limit (int | None, default: None) – If provided, limits the number of rows in the resulting DataFrame to the specified value. This can be useful for large datasets where you only need a sample of the data for quick analysis or visualization.
- explode (bool, default: False) – If True, any list-like columns in the DataFrame will be 'exploded' into separate rows, each containing one element from the list. This parameter is currently not implemented; its value does not change the behavior of the method.
Returns:

- DataFrame – A DataFrame representing the selected portions of the batched scan data. The exact shape and content of the DataFrame depend on the current state of the object, including any applied batch, memory, and column slices.
Examples:
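The flattening that df() performs can be sketched in plain Python. This is an assumption-level sketch of the restructuring, not orcalib's implementation: each (batch, memory) pair becomes one row, with _batch and _memory recording its origin.

```python
# Sketch of how df() flattens nested scan data into rows (plain Python; the
# real method returns a pandas DataFrame). _batch and _memory track each
# row's batch and memory indices.

def to_rows(data, column_names):
    rows = []
    for b, batch in enumerate(data):
        for m, memory in enumerate(batch):
            row = {"_batch": b, "_memory": m}
            row.update(zip(column_names, memory))
            rows.append(row)
    return rows  # pandas.DataFrame(rows) would give the tabular form

data = [[("a", 1), ("b", 2)], [("c", 3)]]
rows = to_rows(data, ["col1", "col2"])
```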
BatchedScanResultBuilder#
A helper class to build a BatchedScanResult object incrementally. This class is useful when you want to build a BatchedScanResult object in a loop or by iterating over a large dataset.
add_feature#
Add a feature to the BatchedScanResultBuilder object. The feature values should be a list of values, where each value corresponds to a memory. The length of the list should be equal to the number of memories in each batch.
Parameters: