Directly Querying OrcaDB#
The easiest way to interact with OrcaDB is via memorysets, see details in the memoryset guide. But sometimes you need to interact with the database tables in a more direct way. This guide will show you how to do that. Walking you through the process of defining a table schema, ingesting data, querying the data, and defining a vector index that we’ll use to query similar items.
LOCAL FILE-BASED DATABASES DO NOT SUPPORT DIRECT QUERYING
Creating a Database Handle#
To get a handle to your database, you can use the OrcaDatabase
class and pass it the URI of the database you want to connect to.
- This creates a handle to the OrcaDB you are connected to. You may also omit passing the
uri
argument and it will default to the value of theORCADB_URL
environment variable.
Define a Table Schema#
To define a table schema, you can use the create_table
method and pass it the table name and column names with their data types. Let’s define the schema for a simple table to store some FAQ items that we might want to use as context for a model in the future.
- The
if_table_exists
argument is optional and determines what to do if a table with the same name already exists. In this case, we are replacing the current table and deleting all it’s data if it exists. This is a convenient way to update table structures during prototyping or in dynamic environments. Other options areTableCreateMode.ERROR_IF_TABLE_EXISTS
(the default, which will raise an exception if the table exists) andTableCreateMode.RETURN_CURR_TABLE
(which will return the existing table without making changes).
Let’s break down the syntax of the table schema definition:
- The first argument to the method is the name of the table we want to create.
- Columns are defined by providing keyword arguments that map column names to data types. In this case, we are defining a few columns of various types with various constraints.
The IntT
, TextT
, and FloatT
types are all pretty self-explanatory. We will go over the enum and vector types below.
Enum Columns#
Enum columns are useful for categorical data that needs both a human-readable name and a numeric value for use in your models. To define an enum column, just define a standard python Enum
and pass it as a param to the EnumT
data type.
Enum columns can also be defined from lists, (EnumT["neg", "pos"]
), dictionaries (EnumT[{"pos": 1, "neg": 0}]
), and tuples (EnumT[("neg", 0), ("pos", 1)]
).
Vector Columns#
To store embeddings or other vectors, you can create vector columns using either VectorT
or any of the numeric column types and parameterizing it with the dimension of the vectors that will be stored in that column.
- 768-dimensional vector of 32-bit signed integers.
- 128-dimensional vector of 32-bit signed integers.
- 64-dimensional vector of 16-bit floating point values.
We also provide column types for specific numerical data types. For unsigned integers, we provide UInt8T
, UInt16T
, UInt32T
, and UInt64T
. For signed integers, we provide Int8T
, Int16T
, Int32T
, and Int64T
. For floating point numbers, we provide Float16T
, Float32T
, Float64T
, and BFloat16T
.
Column Constraints#
When defining your column types, you can specify constraints by chaining .unique
and/or .notnull
on your base column type. These constraints are essential for maintaining data quality and enforcing rules at the database level.
- Creates a text column that must contain unique values and cannot contain
None
values. - Creates a text column that cannot contain
None
values.
Get an Existing Table#
The create_table
method returns a TableHandle
object that we can use to interact with the table. If you are not creating a new table, you can get a handle to an existing table with the get_table
method.
Modify the Schema#
You can easily modify the schema of a table by dropping columns with the drop_column
method or adding columns with the add_column
method.
- To add columns, you pass them as keyword arguments in the same format as you do for
create_table
. You can add as many columns as you need to.
Drop a Table#
To drop a table, you can use the drop_table
method.
error_if_not_exists
is an optional second method argument that defaults toTrue
and determines whether or not to raise anError
if the table does not exist.
Interact with the Table#
Now that we have a TableHandle
, we can interact with the table by inserting, querying, updating, and deleting data.
Insert Data#
To insert data into the table, we use the insert
method, which takes a list of dictionaries whose keys are the column names and whose values are the data to insert.
Select Data#
Let’s query the data we just inserted.
There’s a lot going on here. Let’s break it down:
- The
select
method is used to start a query. We pass it the columns we want to return. In this case, it’s just theanswer
column. - We reference columns in the query with the table handle and the column name, like
table.question
. - The
select
method is chainable, so we can call additional methods to build up the query. - The
where
method is used to filter the rows returned by the query. Only rows that match its conditions will be returned. - Finally, the
fetch
method is used to execute the query and return the results, which will be a list of dictionaries where the keys are the selected column names and the values are the data. The example above will return:
We can also directly get a pandas DataFrame
by calling the df
property instead of the fetch
method, in all the same places. So an easy way to get an idea what’s in the table is to call:
- This returns a
DataFrame
with the first 10 rows of the table.
Update and Delete Data#
To update data in the table, we use the update
method, which takes a dictionary of column names and values to update and a condition that indicates which rows to update—much like the where
method.
Finally, to delete data from the table, we use the delete
method. It accepts a condition that indicates which rows to delete.
Using the Orca App#
You can also browse the data in your table using the Orca App by navigating to the Browse Tab and selecting the table you want to view.
Use Vector Indices#
Ok, this is all well and good, but it’s probably not why you are interested in OrcaDB. Let’s define a vector index on the question
column of the faq_items
table and use it to find similar items to a given question.
Define a Vector Index#
OrcaDB can automatically generate embeddings for text data. Let’s define a text index on the column we want to search using the create_text_index
method:
OrcaDB provides several flexible ways to define indices and the ability to bring your own embeddings.
Much like with tables, we can get a handle to an existing index with the get_index
method and drop an index with the drop_index
method.
Check Index Status#
Indices are updated asynchronously whenever the corresponding column changes, so you can keep inserting, updating, and deleting data while the values are processed into embeddings and indexed. To check the status of our index, we can use the get_status
method:
This should returned a CatchupStatus
of 'COMPLETED'
, since we only have a few rows in our table that should have been indexed quickly.
OrcaDB will automatically wait for the index to finish processing before it performs any queries on it, guaranteeing that the results are up to date.
Find Similar Items#
Now that we have our index, we can use it to find similar items to a given question. To do this, we use the scan
method:
This will return the answer to the most similar question in the table to the question we provided. Let’s break down the query:
- The
scan
method is similar to thewhere
method, but instead of filtering rows, it finds items that are similar to the provided query. - The
select
method works just like the one on the table but it also allows us to select pseudo columns like$score
and$embedding
that are generated by the index. In this case, we are selecting theanswer
column and the$score
pseudo column which indicates how similar the item is to the query (the closer the score is to1
, the more similar the item is). We are also selecting the$row_id
pseudo column which is the auto-incrementing row id of the item in the table. - When scanning the index for similar items, we have to specify how many items we want to retrieve,so we pass
1
to thefetch
method which returns the results of the query as a list of dictionaries. The example above will return:
Get Embeddings#
When building models, you sometimes want to generate the same kind of embeddings that the OrcaDB index generates. To do this, you can use the embed
method:
This will return a float Tensor
of shape (2, 768)
(1) for use in your model, where each row is the embedding of the corresponding question.
- The embedding dimension depends on the model used to generate the embeddings. The default model for text indices is Sentence Transformers, which uses a BERT model with
768
dimensions.