# Orca Quick Start
This tutorial will give you a quick walkthrough of how to install OrcaLib, store memories in an OrcaDB, and implement a retrieval-augmented classification (RAC) model that uses those memories to guide its predictions.
## Install OrcaLib
First we need to install OrcaLib, our Python library for interacting with OrcaDB instances and building retrieval-augmented models. OrcaLib is compatible with Python 3.10 or higher and is available on PyPI. You can install it with your favorite python package manager alongside a few other standard dependencies we will need for this tutorial.
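Assuming OrcaLib is published on PyPI under the name `orcalib` (and using the Hugging Face `datasets` library for the dataset we load later), installation might look like this; adjust the package names to your environment:

```shell
# Install OrcaLib plus the dataset library used later in this tutorial.
# Package names here are assumptions based on the text above.
pip install orcalib datasets
```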
## Connect to OrcaDB
OrcaDB makes it easy to store memories that your model will look up to guide its predictions. Because we are building a classification model, we will store memories that contain labeled examples from our dataset. OrcaLib contains the `LabeledMemoryset` class, which provides a convenient way to store memories and look up similar memories for a given input. (1)
- OrcaDB generates embeddings for each memory and uses an approximate nearest neighbor (ANN) index to look up similar memories for the embedding of a given input.
For training and experimenting with your model, it is usually most convenient to use an immutable local file-based OrcaDB instance. Once you are ready to deploy your model to production, you can clone your memoryset into a hosted OrcaDB instance, which will be more performant and allow you to dynamically update memories and collect telemetry data.
To create a local database in a file called `local.db` and store memories in a table called `airline_sentiment`, we can create an immutable `LabeledMemoryset` with a file URL like this:
- We use a file URL whose path specifies the location of the local database and whose fragment specifies the table name.
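A minimal sketch of this step, assuming OrcaLib exposes `LabeledMemoryset` at the package top level (the import path and exact URL scheme are assumptions based on the description above):

```python
# Sketch: create an immutable, file-backed memoryset. The fragment after "#"
# names the table; the path before it names the local database file.
from orcalib import LabeledMemoryset  # assumed import path

memoryset = LabeledMemoryset("file:local.db#airline_sentiment")
```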
OrcaDB is currently in early access, so you will need to contact us to get an invite so you can create an account and deploy an instance on AWS or GCP (1).
- If you need to deploy to a different cloud provider or on-premises, please get in touch to discuss your requirements. Orca is designed to be cloud-agnostic.
Once you have gotten an invite, follow these steps to get access to your OrcaDB instance:
1. Head to app.orcadb.ai and log in with your work email account. You will be prompted to join your organization.
2. Once you have joined your organization, you can find a list of deployed instances on the Cloud Tab.
3. If you already have an instance deployed, proceed to step 4. Otherwise, click the “Deploy your first Instance” button; for more information, check out our guide on how to deploy a new instance. Then proceed to step 5.
4. Select the instance you want to connect to from the list and open the instance details screen by clicking on the arrow button on the right side of the list.
5. On the instance details screen, you will find information about the configuration of your instance. To connect to the instance in the next step, you will need the instance credentials. You can access those by clicking on the key icon next to the URL (see screenshot below), which will open a popup from where you can copy the endpoint, API key, and secret key.
OrcaLib automatically picks up the credentials needed to connect to your OrcaDB from environment variables. You can use a tool like dotenv to load them into your environment from a `.env` file that should contain the following values:
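The variable names below are placeholders; fill in the endpoint, API key, and secret key you copied from the instance details screen, and check the OrcaLib documentation for the exact names it expects:

```
# Hypothetical .env contents; variable names are illustrative.
ORCADB_URL=<your instance endpoint>
ORCADB_API_KEY=<your API key>
ORCADB_SECRET_KEY=<your secret key>
```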
Now we can create a `LabeledMemoryset` for storing our memories in a table called `airline_sentiment` in the OrcaDB instance we just connected to. Unlike the local file-based memoryset, this one lets us update the memories stored there.
- The first argument to the `LabeledMemoryset` constructor is the name of the table to store the memories in (or the full database URL with the table name in the fragment).
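A sketch of this step, assuming the credentials are loaded from the `.env` file via python-dotenv and that OrcaLib reads them from the environment (import paths are assumptions):

```python
# Sketch: connect to the hosted instance using credentials picked up from
# environment variables, then create a mutable memoryset in it.
from dotenv import load_dotenv
from orcalib import LabeledMemoryset  # assumed import path

load_dotenv()  # loads the .env file into the environment
memoryset = LabeledMemoryset("airline_sentiment")  # table name only; endpoint comes from env
```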
## Prepare a Memoryset
With that out of the way, let’s go ahead and load our dataset. We will use the IMDB Dataset to train our model. This dataset contains the text of 50,000 movie reviews and a label indicating whether each review has a positive or negative sentiment. The labels are evenly distributed, and the dataset is split evenly into 25,000 training and 25,000 test samples.
- The IMDB dataset is ordered by label by default, so we shuffle it to randomize the order of the samples.
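Assuming the Hugging Face `datasets` library, loading and shuffling might look like this:

```python
# Load the IMDB reviews dataset and shuffle both splits so the
# label-ordered samples are randomized.
from datasets import load_dataset

dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42)  # fixed seed for reproducibility
test_data = dataset["test"].shuffle(seed=42)
```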
To implement a retrieval-augmented model, we need to store memories that contain examples of inputs and outputs that are similar to the inputs we want the model to classify. To get started, we will simply use the training dataset as the memories. So let’s go ahead and `insert` them into our previously created memoryset.
- This will take a while, since embeddings are generated for all 25,000 training samples. But because the result is saved to an OrcaDB table, you will not have to run this again unless you `reset` the memoryset.
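Continuing from the snippets above (a memoryset and a shuffled training split), the insert step might be sketched as follows; the exact signature of `insert` is an assumption based on the tutorial text:

```python
# Sketch: store the labeled training samples as memories. OrcaDB generates
# an embedding for each memory as it is inserted.
memoryset.insert(train_data)
```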
To get a feel for how memorysets work, we can manually retrieve some sample reviews using the semantic search capabilities of OrcaDB with the `lookup` method.
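A hedged sketch of a lookup call; the method name comes from the text above, but the query string and the keyword controlling the number of results are assumptions:

```python
# Sketch: retrieve memories semantically similar to a free-text query.
memories = memoryset.lookup("A movie with a great sense of humor", count=3)
print(memories)
```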
```python
[
    LabeledMemoryLookup(
        value="I absolutely loved this movie. It met all expectations and went beyond that. I loved the humor and the way the movie wasn't just randomly silly. It also had a message. Jim Carrey makes me happy. :)",
        label=<pos: 1>,
        embedding=<array.float64(768,)>,
        memory_id='52bcf742-dc9e-4640-adaf-af956c760cd2',
        memory_version=1,
        lookup_score=0.6890271713892993, # (1)!
    ),
    ...
]
```
- The lookup score is the cosine similarity between the memory embedding and the input embedding.
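To make the score concrete, here is a small conceptual sketch (plain Python, not OrcaLib code) of cosine similarity between two embedding vectors:

```python
# Conceptual sketch: the lookup score is the cosine similarity between the
# input embedding and a memory embedding -- the dot product of the vectors
# divided by the product of their lengths.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Identical embeddings score 1.0, unrelated ones score near 0, so higher scores mean semantically closer memories.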
As you can see, we get back a list of `LabeledMemoryLookup` objects that contain the text and label of the memories we inserted, alongside some extra information like the similarity score and the embedding that was used for the lookup.
## Build a RAC Model
Time to build our first retrieval-augmented model, which uses memories similar to the input to guide its predictions. Orca makes this really easy with the `RACModel` class, which automates the retrieval of memories and provides convenient methods for training, evaluating, and using the model.
- This tells the model how many classes we have in our dataset.
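A sketch of model creation, assuming `RACModel` is importable from the package top level and accepts the memoryset plus a class count (argument names are assumptions; `num_classes=2` matches the two sentiment labels):

```python
# Sketch: create a RAC model backed by the memoryset we prepared above.
from orcalib import RACModel  # assumed import path

model = RACModel(num_classes=2, memoryset=memoryset)
```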
By default, the RAC model uses an MMOE head with a cross-attention mechanism that blends the labels of the retrieved memories based on weights derived from the input and memory embeddings. It is initialized to return decent predictions based on the memory labels without any training. But to learn the weights for the attention mechanism, we can use the `finetune` method to train the model on our labeled dataset.
- This will take a while to train. You can `save` the model to disk after it is trained so you can `load` it back up later without having to train it again.
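Sketched out, the training and persistence steps might look like this; the method names come from the text above, but the signatures and the save path are assumptions:

```python
# Sketch: learn the attention weights on the labeled training split,
# then persist the trained model to disk.
model.finetune(train_data)
model.save("./rac_model")  # hypothetical path
# later: model = RACModel.load("./rac_model")  # load shape is an assumption
```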
Now let’s see how well our model performs on the test dataset by using the `evaluate` method, which calculates typical classification metrics.
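A minimal sketch of the evaluation call, continuing from the snippets above (the signature is an assumption):

```python
# Sketch: compute classification metrics on the held-out test split.
result = model.evaluate(test_data)
print(result)  # expect accuracy and related metrics
```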
We see that the model has an accuracy of around 93%, which is quite good for a first try. There are many ways to improve the model further: tuning our training hyperparameters, fine-tuning the embedding model, adding reranking to the lookups, or curating the contents of our memoryset.
Lastly, let’s check out the retrieval augmentation in action by calling the `predict` method on a sample input.
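A sketch of a single prediction; the method name is from the text above, and the sample input is illustrative:

```python
# Sketch: classify one review; the result includes the memories that
# were retrieved to guide the prediction.
prediction = model.predict(
    "Both the acting and the story were amazing, best film of the year!"
)
print(prediction)
```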
```python
PredictionResult(
    label=<pos: 1>,
    logits=[0.0001646280289, 0.9998353719711304],
    memories=[ # (1)!
        LabeledMemoryLookup(
            value="I absolutely loved this movie. It met all expectations and went beyond that. I loved the humor and the way the movie wasn't just randomly silly. It also had a message. Jim Carrey makes me happy. :)",
            label=<pos: 1>,
            embedding=<array.float64(768,)>,
            memory_id='52bcf742-dc9e-4640-adaf-af956c760cd2',
            memory_version=1,
            lookup_score=0.6890271713892993,
            attention_weight=0.6890273094177246, # (2)!
        ),
        ...
    ]
)
```
- The memories attribute contains a list of all the memories that were looked up to guide the prediction.
- The attention weight is the weight the model assigned to each memory to combine them to form the final prediction.
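As a conceptual sketch (plain Python, not OrcaLib code), here is one way attention weights over retrieved memories can be combined into class logits; the function and its shape are illustrative, not the model's actual head:

```python
# Conceptual sketch: normalize the attention weights and sum each memory's
# weight into the logit slot of its label.
def blend_labels(labels: list[int], weights: list[float], num_classes: int) -> list[float]:
    total = sum(weights)
    logits = [0.0] * num_classes
    for label, weight in zip(labels, weights):
        logits[label] += weight / total
    return logits

# Three retrieved memories: two positive (label 1), one negative (label 0).
print(blend_labels([1, 1, 0], [0.5, 0.25, 0.25], num_classes=2))  # -> [0.25, 0.75]
```

Memories the model attends to more strongly pull the prediction toward their label, which is why the result below reports both the lookup score and the attention weight per memory.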
As you can see, the returned `PredictionResult` object contains not only the predicted label and logits, but also the memories that were looked up to guide the prediction, including their lookup similarity scores and the attention weight the model assigned to each memory when combining them.
## Up Next
That was a lot. We have seen how to install OrcaLib, create a memoryset that stores easily retrievable memories in an OrcaDB instance, and build our first retrieval-augmented classification (RAC) model that uses those memories to guide its predictions.
To dive deeper into how to build your own retrieval-augmented models, check out our more detailed tutorials for different model types: