Orca Quick Start#
This tutorial gives you a quick walkthrough of how to install OrcaSDK, create a labeled memoryset, and configure a retrieval-augmented classification model that uses those memories to guide its predictions.
Install OrcaSDK#
First, we need to install OrcaSDK, our Python library for interacting with OrcaCloud. OrcaSDK is compatible with Python 3.11 or higher and is available on PyPI. You can install it with your favorite Python package manager alongside the other dependencies we will need for this tutorial.
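For example, with pip (we assume here that the SDK is published as `orcalib`; `datasets` and `python-dotenv` are used later in this tutorial):

```bash
pip install orcalib datasets python-dotenv
```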
Setup API Key#
Make sure you have an OrcaCloud API Key and add it to your `.env` file. If you don’t have one yet, see the Accounts guide for more information on how to create one.
Make sure to load your API key into your environment variables first thing in your script. You can call the `is_authenticated` method to verify that your API key is working.
- We recommend loading the API key from a `.env` file. Alternatively, you can also set the `ORCA_API_KEY` environment variable in your operating system or use the `OrcaCredentials.set_api_key` method to set the API key manually.
Create a Memoryset#
OrcaCloud makes it easy to store memories that your model will use to guide its predictions. Because we are building a classification model, we will store memories that contain labeled examples. OrcaSDK contains the `LabeledMemoryset` class, which provides a convenient way to create a memoryset, insert memories, and search for memories similar to a given input. (1)
- OrcaCloud generates embeddings for each memory on insertion and uses an approximate nearest neighbor (ANN) index to enable fast semantic search for similar memories.
There are many ways to create a `LabeledMemoryset` (1); for this tutorial, we will create a memoryset from a Hugging Face dataset. We will use the IMDB dataset, which contains the text of 50,000 movie reviews and a label indicating whether each review has a positive or negative sentiment. The labels are evenly distributed, and the dataset is split half-and-half into training and test sets. The notes below explain the key pieces of the sketch that follows them.
- See the memoryset reference guide for more details.
- The IMDB dataset is ordered by label by default, so we shuffle it to randomize the order of the samples and use a seed to ensure the same order every time we run the code.
- The first argument to all `LabeledMemoryset` create methods is the name of the memoryset to store the memories in.
- To get started, we will use a small subset of the training dataset as the memories.
- We will use Alibaba’s GTE model to embed the memories on memoryset creation. Generating the embeddings for all samples will take a while, but since the result is saved to a memoryset in the OrcaCloud, you will not have to run this again.
- The `value_column` is the column in the dataset that contains the value of the memory. By default, the `LabeledMemoryset` will assume the column name is "value". You can also specify the `label_column` and `source_id_column`.
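A sketch of what this can look like; the `from_hf_dataset` constructor name, the keyword names, and the embedding model identifier are our assumptions for illustration:

```python
from datasets import load_dataset
from orcalib import LabeledMemoryset  # assumed import path

# Shuffle with a fixed seed so the sample order is reproducible across runs
imdb = load_dataset("imdb").shuffle(seed=42)

memoryset = LabeledMemoryset.from_hf_dataset(
    "imdb_memories",                    # unique name for the memoryset (assumed)
    imdb["train"].select(range(1000)),  # small subset of the training split
    value_column="text",                # the IMDB dataset stores review text in "text"
    label_column="label",
    embedding_model="Alibaba-NLP/gte-base-en-v1.5",  # Alibaba's GTE model (identifier assumed)
)
```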
To get a feel for how memorysets work, let’s retrieve some sample reviews using the semantic search capabilities of a memoryset by calling the `search` method.
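For example (the query string is arbitrary, and the `count` parameter name is an assumption):

```python
# Look up the two memories most semantically similar to the query
memories = memoryset.search("an indie movie that I loved", count=2)
print(memories)
```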
```
[LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.67, value: 'As a producer of indie movies and a harsh critic of such, I have to say I loved this movie. It is fu...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.64, value: 'The movie is great and I like the story. I prefer this movie than other movie such The cell ( sick m...' })]
```
As you can see, we get back a list of `LabeledMemoryLookup` objects that contain the value and label of each memory as well as some other information, like the lookup score and an automatically generated unique identifier for the memory. (See the `LabeledMemoryLookup` docs for a full list of attributes.)
Create a Classifier#
Time to build our first retrieval-augmented classification model from the memoryset we just created. The model will use similar memories to guide predictions. Orca makes it really easy to do this with the `ClassificationModel`.
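A minimal sketch; the `create` method signature is our assumption, and the model name matches the endpoint shown further below:

```python
from orcalib import ClassificationModel  # assumed import path

# Create a model with a unique name and attach the memoryset it should retrieve from
model = ClassificationModel.create("imdb_model", memoryset)
```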
That is all you need to create a retrieval-augmented classification model. Check the `ClassificationModel` docs for more details on the different options.
The model is now ready to make predictions. Let’s make a prediction using the `predict` method.
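For example (the input text is arbitrary; the `label` and `confidence` attribute names are assumed from the description below):

```python
prediction = model.predict("I was on the edge of my seat the whole time. What a ride!")
print(prediction.label, prediction.confidence)
```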
As you can see, the returned `LabelPrediction` object contains the predicted label and its confidence. The model uses the memories to guide its prediction by looking up similar memories and combining them with the input to produce a final prediction. The following diagram shows the data flow through the model and attached memoryset:
The model also keeps track of all predictions and the memories that were used. You can inspect the memories that were looked up to guide this prediction by looking at the `memory_lookups` attribute of the `LabelPrediction` object, as sketched after the note below. (1)
- Alternatively, you can also call the `inspect` method of the `LabelPrediction` object to open a small UI to inspect the looked-up memories and even update their values and labels.
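Both options use only attributes and methods named above:

```python
# Memories that were retrieved to guide this prediction
print(prediction.memory_lookups)

# Or open the interactive inspection UI mentioned in the note above
# prediction.inspect()
```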
```
[LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.67, attention_weight: 0.67, value: 'As a producer of indie movies and a harsh critic of such, I have to say I loved this movie. It is fu...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.64, attention_weight: 0.64, value: 'The movie is great and I like the story. I prefer this movie than other movie such The cell ( sick m...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.64, attention_weight: 0.64, value: 'When I saw the trailers I just HAD to see the film. And when I had, I kinda had a feeling that felt ...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.63, attention_weight: 0.63, value: 'Last November, I had a chance to see this film at the Reno Film Festival. I have to say that it was ...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.62, attention_weight: 0.62, value: 'I thought this was a wonderful movie. It touches every fiber of a human being. The love in the film ...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.62, attention_weight: 0.62, value: 'Maybe I loved this movie so much in part because I've been feeling down in the dumps and it's such a...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.62, attention_weight: 0.62, value: 'This has to be one of my favourite movies of all time. The dialogue, with the constant use of puns i...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.62, attention_weight: 0.62, value: 'This is just a short comment but I stumbled onto this movie by chance and I loved it. The acting is ...' }),
LabeledMemoryLookup({ label: <pos: 1>, lookup_score: 0.61, attention_weight: 0.61, value: 'This is absolutely one of the best movies I've ever seen. It takes me on a a roller-coaster of emoti...' })]
```
Model Endpoint
Each model has a unique HTTP endpoint that can be used to make predictions. For our model, the endpoint is `POST https://api.orcadb.ai/classification_model/imdb_model/predict`. Check out the API docs for more details.
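A rough sketch of calling the endpoint from Python; the request body shape and bearer-token authentication are our assumptions, so check the API docs for the exact scheme:

```python
import os
import requests

response = requests.post(
    "https://api.orcadb.ai/classification_model/imdb_model/predict",
    headers={"Authorization": f"Bearer {os.environ['ORCA_API_KEY']}"},  # auth scheme assumed
    json={"value": "What a fantastic film!"},                           # body shape assumed
)
print(response.json())
```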
Evaluate the Model#
Lastly, let’s evaluate the model on the test set to see how it performs. We will need to create a `Datasource` from the test set and pass it to the `evaluate` method, as sketched after the notes below.
- Datasources can be created from Hugging Face or PyTorch datasets, lists, column dictionaries, pandas DataFrames, PyArrow Tables, or local files. For a more detailed explanation of datasources, see the Datasource documentation.
- The first argument to all `Datasource` create methods is the unique name to identify the datasource by.
- For now, we will just run the evaluation against a small subset of the test set.
- The `value_column` is the column in the dataset that contains the input values. By default, the `Datasource` will assume the column name is `"value"`. You can also specify the `label_column`, which defaults to `"label"`.
We see that the model has an accuracy of around 93% which is quite good for a first try. However, we can improve the model even more by fine-tuning the embedding model or curating the contents of our memoryset.
Up Next#
We have seen how to install OrcaSDK, create a memoryset to store data in the OrcaCloud that can be easily retrieved, and build our first retrieval-augmented classification model that uses those memories to guide its predictions.
Next, dive into the How-to guides to learn more about how to optimize your memorysets and models with Orca.