Directly Querying OrcaDB#

The easiest way to interact with OrcaDB is via memorysets, see details in the memoryset guide. But sometimes you need to interact with the database tables in a more direct way. This guide will show you how to do that. Walking you through the process of defining a table schema, ingesting data, querying the data, and defining a vector index that we’ll use to query similar items.

LOCAL FILE-BASED DATABASES DO NOT SUPPORT DIRECT QUERYING

Creating a Database Handle#

To get a handle to your database, you can use the OrcaDatabase class and pass it the URI of the database you want to connect to.

import os
from orcalib import OrcaDatabase

db = OrcaDatabase(os.getenv("ORCADB_URL")) # (1)!

This creates a handle to the OrcaDB you are connected to. You may also omit passing the uri argument and it will default to the value of the ORCADB_URL environment variable.

Define a Table Schema#

To define a table schema, you can use the create_table method and pass it the table name and column names with their data types. Let’s define the schema for a simple table to store some FAQ items that we might want to use as context for a model in the future.

from orcalib import TableCreateMode, TextT, IntT, DocumentT, EnumT, FloatT, VectorT
from enum import Enum

class Sentiment(Enum):
    pos = 1
    neg = 0

table = db.create_table(
    "memories",
    if_table_exists=TableCreateMode.REPLACE_CURR_TABLE, # (1)!
    # column definitions:
    source_uuid=TextT.unique.notnull,
    version=IntT.notnull,
    value=TextT.notnull,
    embedding=VectorT[768].notnull,
    label=EnumT[Sentiment].notnull,
    boost=FloatT,
)

The if_table_exists argument is optional and determines what to do if a table with the same name already exists. In this case, we are replacing the current table and deleting all it’s data if it exists. This is a convenient way to update table structures during prototyping or in dynamic environments. Other options are TableCreateMode.ERROR_IF_TABLE_EXISTS (the default, which will raise an exception if the table exists) and TableCreateMode.RETURN_CURR_TABLE (which will return the existing table without making changes).

Let’s break down the syntax of the table schema definition:

The first argument to the method is the name of the table we want to create.
Columns are defined by providing keyword arguments that map column names to data types. In this case, we are defining a few columns of various types with various constraints.

The IntT, TextT, and FloatT types are all pretty self-explanatory. We will go over the enum and vector types below.

Enum Columns#

Enum columns are useful for categorical data that needs both a human-readable name and a numeric value for use in your models. To define an enum column, just define a standard python Enum and pass it as a param to the EnumT data type.

from orcalib import EnumT
from enum import Enum

class Sentiment(Enum):
    pos = 1
    neg = 0

enum_type = EnumT[Sentiment]

Enum columns can also be defined from lists, (EnumT["neg", "pos"]), dictionaries (EnumT[{"pos": 1, "neg": 0}]), and tuples (EnumT[("neg", 0), ("pos", 1)]).

Vector Columns#

To store embeddings or other vectors, you can create vector columns using either VectorT or any of the numeric column types and parameterizing it with the dimension of the vectors that will be stored in that column.

from orcalib import FloatT, IntT, VectorT

embedding_type = VectorT[768] # (1)!
int_vector_type = IntT[128] # (2)!
float_vector_type = FloatT[64] # (3)!

768-dimensional vector of 32-bit signed integers.
128-dimensional vector of 32-bit signed integers.
64-dimensional vector of 16-bit floating point values.

We also provide column types for specific numerical data types. For unsigned integers, we provide UInt8T, UInt16T, UInt32T, and UInt64T. For signed integers, we provide Int8T, Int16T, Int32T, and Int64T. For floating point numbers, we provide Float16T, Float32T, Float64T, and BFloat16T.

Column Constraints#

When defining your column types, you can specify constraints by chaining .unique and/or .notnull on your base column type. These constraints are essential for maintaining data quality and enforcing rules at the database level.

from orcalib import IntT, TextT

unique_required_id_type = IntT.unique.notnull # (1)!
required_text_type = TextT.notnull # (2)!

Creates a text column that must contain unique values and cannot contain None values.
Creates a text column that cannot contain None values.

Get an Existing Table#

The create_table method returns a TableHandle object that we can use to interact with the table. If you are not creating a new table, you can get a handle to an existing table with the get_table method.

table = db.get_table("articles")

Modify the Schema#

You can easily modify the schema of a table by dropping columns with the drop_column method or adding columns with the add_column method.

from orcalib import ImageT, TextT

table.add_column(answer=TextT, question=TextT.notnull) # (1)!

table.drop_column("value", "label", "embedding", "version")

To add columns, you pass them as keyword arguments in the same format as you do for create_table. You can add as many columns as you need to.

Drop a Table#

To drop a table, you can use the drop_table method.

db.drop_table(
    "articles",
    error_if_not_exists=False # (1)!
)

error_if_not_exists is an optional second method argument that defaults to True and determines whether or not to raise an Error if the table does not exist.

Interact with the Table#

Now that we have a TableHandle, we can interact with the table by inserting, querying, updating, and deleting data.

Insert Data#

To insert data into the table, we use the insert method, which takes a list of dictionaries whose keys are the column names and whose values are the data to insert.

table.insert(
    [
        {
            "source_uuid": "e43eff9e-6f33-4005-9901-91faa4797869",
            "question": "What is OrcaDB?",
            "answer": "OrcaDB is a memory-augmented database that allows you to build memory-augmented PyTorch models.",
        },
        {
            "source_uuid": "2b7a0374-eaf3-4df3-83f3-6ed1f9aa538d",
            "question": "What is OrcaLib?",
            "answer": "OrcaLib is a Python library that allows you to interact with OrcaDB.",
        },
    ]
)

Select Data#

Let’s query the data we just inserted.

table.select(table.answer)\
     .where(table.question == "What is OrcaDB?")\
     .fetch()

There’s a lot going on here. Let’s break it down:

The select method is used to start a query. We pass it the columns we want to return. In this case, it’s just the answer column.
We reference columns in the query with the table handle and the column name, like table.question.
The select method is chainable, so we can call additional methods to build up the query.
The where method is used to filter the rows returned by the query. Only rows that match its conditions will be returned.
Finally, the fetch method is used to execute the query and return the results, which will be a list of dictionaries where the keys are the selected column names and the values are the data. The example above will return:
```
[
    {
        'answer': 'OrcaDB is a memory-augmented database that allows you to build memory-augmented PyTorch models.',
    }
]
```

We can also directly get a pandas DataFrame by calling the df property instead of the fetch method, in all the same places. So an easy way to get an idea what’s in the table is to call:

table.df(10) # (1)!

This returns a DataFrame with the first 10 rows of the table.

Update and Delete Data#

To update data in the table, we use the update method, which takes a dictionary of column names and values to update and a condition that indicates which rows to update—much like the where method.

table.update(
    {
        "answer": "OrcaLib is a Python library that allows you to seamlessly interact with OrcaDB.",
    },
    table.question == "What is OrcaLib?"
)

Finally, to delete data from the table, we use the delete method. It accepts a condition that indicates which rows to delete.

table.delete(table.question == "What is OrcaLib?")

Using the Orca App#

You can also browse the data in your table using the Orca App by navigating to the Browse Tab and selecting the table you want to view.

Browse Table Data

Use Vector Indices#

Ok, this is all well and good, but it’s probably not why you are interested in OrcaDB. Let’s define a vector index on the question column of the faq_items table and use it to find similar items to a given question.

Define a Vector Index#

OrcaDB can automatically generate embeddings for text data. Let’s define a text index on the column we want to search using the create_text_index method:

index = db.create_text_index(
    "faq_index",
    table_name="articles",
    column="question"
)

OrcaDB provides several flexible ways to define indices and the ability to bring your own embeddings.

Much like with tables, we can get a handle to an existing index with the get_index method and drop an index with the drop_index method.

Check Index Status#

Indices are updated asynchronously whenever the corresponding column changes, so you can keep inserting, updating, and deleting data while the values are processed into embeddings and indexed. To check the status of our index, we can use the get_status method:

index.get_status()

This should returned a CatchupStatus of 'COMPLETED', since we only have a few rows in our table that should have been indexed quickly.

OrcaDB will automatically wait for the index to finish processing before it performs any queries on it, guaranteeing that the results are up to date.

Find Similar Items#

Now that we have our index, we can use it to find similar items to a given question. To do this, we use the scan method:

index.scan("Why is OrcaDB useful?")\
    .select(table.answer, "$score", "$row_id")\
    .fetch(1)

This will return the answer to the most similar question in the table to the question we provided. Let’s break down the query:

The scan method is similar to the where method, but instead of filtering rows, it finds items that are similar to the provided query.
The select method works just like the one on the table but it also allows us to select pseudo columns like $score and $embedding that are generated by the index. In this case, we are selecting the answer column and the $score pseudo column which indicates how similar the item is to the query (the closer the score is to 1, the more similar the item is). We are also selecting the $row_id pseudo column which is the auto-incrementing row id of the item in the table.
When scanning the index for similar items, we have to specify how many items we want to retrieve,so we pass 1 to the fetch method which returns the results of the query as a list of dictionaries. The example above will return:
```
[
    {
        'answer': 'OrcaDB is a memory-augmented database that allows you to build memory-augmented PyTorch models.',
        '$score': 0.8411197066307068
    }
]
```

Get Embeddings#

When building models, you sometimes want to generate the same kind of embeddings that the OrcaDB index generates. To do this, you can use the embed method:

index.embed(["Why is OrcaDB useful?", "What do I need OrcaLib for?"])

This will return a float Tensor of shape (2, 768) (1) for use in your model, where each row is the embedding of the corresponding question.

The embedding dimension depends on the model used to generate the embeddings. The default model for text indices is Sentence Transformers, which uses a BERT model with 768 dimensions.