orca_sdk.datasource#

Datasource #

A handle to a datasource in the OrcaCloud

A Datasource is a collection of data saved to the OrcaCloud that can be used to create a Memoryset. It can be created from a Hugging Face Dataset, a PyTorch DataLoader or Dataset, a list of dictionaries, a dictionary of columns, a pandas DataFrame, a pyarrow Table, or a local file.

Attributes:

id (str) –

Unique identifier for the datasource
name (str) –

Unique name of the datasource
description (str | None) –

Optional description of the datasource
length (int) –

Number of rows in the datasource
created_at (datetime) –

When the datasource was created
columns (dict[str, str]) –

Dictionary of column names and types
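
As a rough illustration of how these attributes fit together, the sketch below builds a one-line summary from a handle. The `summarize` helper is hypothetical and not part of orca_sdk; it relies only on the attributes listed above and makes no assumption about what the column type strings look like.

```python
def summarize(handle):
    """Build a one-line summary from a datasource handle's documented attributes."""
    # columns is documented as a dict of column name -> column type
    column_list = ", ".join(
        f"{name}: {dtype}" for name, dtype in handle.columns.items()
    )
    # description is optional, so fall back to a placeholder when it is None
    description = handle.description or "no description"
    return (
        f"{handle.name} ({handle.id}): {handle.length} rows "
        f"[{column_list}] - {description}"
    )
```

With a real handle this could be called as `summarize(Datasource.open("my_datasource"))`, after `from orca_sdk.datasource import Datasource`.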

download #

download(output_path)

Download the datasource as a ZIP archive and extract its contents to the specified path.

Parameters:

output_path (str | PathLike) –

The local file path or directory where the downloaded files will be saved.

Returns:

None –

None

Raises:

RuntimeError –

If the download fails.
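
A minimal sketch of calling `download` defensively. The `download_to_directory` helper is hypothetical; it assumes that `output_path` may be a directory and that the extracted files end up directly inside it, neither of which is confirmed above.

```python
from pathlib import Path


def download_to_directory(datasource, output_dir):
    """Download a datasource into output_dir and list the files written there.

    `datasource` is any object exposing the documented download(output_path)
    method. Returns None if the download fails.
    """
    target = Path(output_dir)
    # Assumption: creating the directory up front is harmless to download()
    target.mkdir(parents=True, exist_ok=True)
    try:
        datasource.download(target)
    except RuntimeError as exc:
        # download() is documented to raise RuntimeError on failure
        print(f"Download failed: {exc}")
        return None
    return sorted(path.name for path in target.iterdir())
```

With a real handle this could be called as `download_to_directory(Datasource.open("my_datasource"), "./exports/my_datasource")`.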

from_hf_dataset `classmethod` #

from_hf_dataset(
    name, dataset, if_exists="error", description=None
)

Create a new datasource from a Hugging Face Dataset

Parameters:

name (str) –

Required name for the new datasource (must be unique)
dataset (Dataset) –

The Hugging Face Dataset to create the datasource from
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"

from_pytorch `classmethod` #

from_pytorch(
    name,
    torch_data,
    column_names=None,
    if_exists="error",
    description=None,
)

Create a new datasource from a PyTorch DataLoader or Dataset

Parameters:

name (str) –

Required name for the new datasource (must be unique)
torch_data (DataLoader | Dataset) –

The PyTorch DataLoader or Dataset to create the datasource from
column_names (list[str] | None, default: None ) –

If the provided dataset or data loader returns unnamed tuples, this argument must be provided to specify the names of the columns.
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"
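
To make the role of `column_names` concrete, the sketch below shows the idea of pairing unnamed tuple samples with column names. It is a conceptual illustration in plain Python, not the SDK's implementation; `name_tuple_samples` is a hypothetical helper and the sample data is made up.

```python
def name_tuple_samples(samples, column_names):
    """Pair each unnamed tuple sample with the given column names."""
    rows = []
    for sample in samples:
        if len(sample) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} values per sample, got {len(sample)}"
            )
        # zip pairs the first name with the first value, and so on
        rows.append(dict(zip(column_names, sample)))
    return rows


# A dataset that yields (text, label) tuples carries no column names of its own
samples = [("Hello, world!", 1), ("Goodbye", 0)]
rows = name_tuple_samples(samples, ["text", "label"])
```

In the same spirit, a PyTorch dataset yielding `(text, label)` tuples would be passed as `Datasource.from_pytorch("my_datasource", torch_data, column_names=["text", "label"])`.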

from_list `classmethod` #

from_list(name, data, if_exists='error', description=None)

Create a new datasource from a list of dictionaries

Parameters:

name (str) –

Required name for the new datasource (must be unique)
data (list[dict]) –

The list of dictionaries to create the datasource from
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_list("my_datasource", [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}])
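
Passing `if_exists="open"` reuses an existing datasource but does not tell the caller whether anything was created. The sketch below is one way to recover that information with the default `"error"` mode. The `create_or_open` helper is hypothetical, and it assumes a `ValueError` from `from_list` always means the name is already taken.

```python
def create_or_open(datasource_cls, name, rows):
    """Return (handle, created) for the datasource called `name`.

    `datasource_cls` is expected to be the Datasource class; it is passed in
    so the helper can be exercised without a connection to the OrcaCloud.
    """
    try:
        # if_exists="error" raises ValueError when the name is already in use
        handle = datasource_cls.from_list(name, rows, if_exists="error")
        return handle, True
    except ValueError:
        # Fall back to the existing datasource
        return datasource_cls.open(name), False
```

A call might look like `handle, created = create_or_open(Datasource, "my_datasource", rows)`, after `from orca_sdk.datasource import Datasource`.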

from_dict `classmethod` #

from_dict(name, data, if_exists='error', description=None)

Create a new datasource from a dictionary of columns

Parameters:

name (str) –

Required name for the new datasource (must be unique)
data (dict) –

The dictionary of columns to create the datasource from
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_dict("my_datasource", {"text": ["Hello, world!", "Goodbye"], "label": [1, 0]})
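
The `from_list` and `from_dict` examples describe the same two rows in different layouts: one dictionary per row versus one list per column. The sketch below converts between them in plain Python. `rows_to_columns` is a hypothetical helper, not part of orca_sdk.

```python
def rows_to_columns(rows):
    """Convert a list of row dictionaries into a dictionary of column lists."""
    if not rows:
        return {}
    keys = list(rows[0].keys())
    for row in rows:
        # Every row must provide exactly the same columns
        if set(row.keys()) != set(keys):
            raise ValueError("all rows must have the same keys")
    return {key: [row[key] for row in rows] for key in keys}


rows = [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}]
columns = rows_to_columns(rows)
# columns == {"text": ["Hello, world!", "Goodbye"], "label": [1, 0]}
```

The resulting `columns` value has the shape that `from_dict` expects, while `rows` has the shape that `from_list` expects.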

from_pandas `classmethod` #

from_pandas(
    name, dataframe, if_exists="error", description=None
)

Create a new datasource from a pandas DataFrame

Parameters:

name (str) –

Required name for the new datasource (must be unique)
dataframe (DataFrame) –

The pandas DataFrame to create the datasource from
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"

from_arrow `classmethod` #

from_arrow(
    name, pyarrow_table, if_exists="error", description=None
)

Create a new datasource from a pyarrow Table

Parameters:

name (str) –

Required name for the new datasource (must be unique)
pyarrow_table (Table) –

The pyarrow Table to create the datasource from
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"

from_disk `classmethod` #

from_disk(
    name, file_path, if_exists="error", description=None
)

Create a new datasource from a local file

Parameters:

name (str) –

Required name for the new datasource (must be unique)
file_path (str | PathLike) –

Path to the file on disk to create the datasource from. The file type is inferred from the file extension. The following file types are supported:
- .pkl: Pickle files containing lists of dictionaries or dictionaries of columns
- .json/.jsonl: JSON and JSON Lines files
- .csv: CSV files
- .parquet: Parquet files
- dataset directory: Directory containing a saved Hugging Face Dataset
if_exists (CreateMode, default: 'error' ) –

What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.
description (str | None, default: None ) –

Optional description for the datasource

Returns:

Datasource –

A handle to the new datasource in the OrcaCloud

Raises:

ValueError –

If the datasource already exists and if_exists is "error"
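
Because the file type is inferred from the extension, the file name matters as much as the contents. The sketch below writes rows to a `.jsonl` file with the standard library and hands the path to `from_disk`. Both helpers are hypothetical, and the one-JSON-object-per-line layout is an assumption about what the SDK accepts for JSON Lines input.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line and return the path that was written."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def upload_rows_as_jsonl(datasource_cls, name, rows, directory, if_exists="error"):
    """Save rows as <directory>/<name>.jsonl and create a datasource from the file.

    `datasource_cls` is expected to be the Datasource class.
    """
    # The .jsonl suffix is what lets from_disk infer the file type
    path = write_jsonl(rows, Path(directory) / f"{name}.jsonl")
    return datasource_cls.from_disk(name, path, if_exists=if_exists)
```

A call might look like `upload_rows_as_jsonl(Datasource, "my_datasource", rows, "./data")`, after `from orca_sdk.datasource import Datasource`.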

open `classmethod` #

open(name)

Get a handle to a datasource by name or id in the OrcaCloud

Parameters:

name (str) –

The name or unique identifier of the datasource to get

Returns:

Datasource –

A handle to the existing datasource in the OrcaCloud

Raises:

LookupError –

If the datasource does not exist

exists `classmethod` #

exists(name_or_id)

Check if a datasource exists in the OrcaCloud

Parameters:

name_or_id (str) –

The name or id of the datasource to check

Returns:

bool –

True if the datasource exists, False otherwise
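
Checking `exists` and then calling `open` takes two calls, and the datasource could be dropped in between. A possible alternative, sketched below, is to call `open` directly and treat the documented `LookupError` as "not found". The `open_if_present` helper is hypothetical.

```python
def open_if_present(datasource_cls, name_or_id):
    """Return a handle to the datasource, or None if it does not exist.

    `datasource_cls` is expected to be the Datasource class.
    """
    try:
        return datasource_cls.open(name_or_id)
    except LookupError:
        # open() is documented to raise LookupError for a missing datasource
        return None
```

A call might look like `handle = open_if_present(Datasource, "my_datasource")`, followed by a check for `None`.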

all `classmethod` #

all()

List all datasource handles in the OrcaCloud

Returns:

list[Datasource] –

A list of all datasource handles in the OrcaCloud

drop `classmethod` #

drop(name_or_id, if_not_exists='error')

Delete a datasource from the OrcaCloud

Parameters:

name_or_id (str) –

The name or id of the datasource to delete
if_not_exists (DropMode, default: 'error' ) –

What to do if the datasource does not exist. Defaults to "error"; the other option is "ignore", which does nothing.

Raises:

LookupError –

If the datasource does not exist and if_not_exists is "error"
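
As an illustration of combining `all` and `drop`, the sketch below removes every datasource whose name starts with a given prefix, for example to clean up after tests. The `drop_with_prefix` helper and the prefix convention are hypothetical; `if_not_exists="ignore"` keeps the loop from failing if a datasource disappears between listing and dropping.

```python
def drop_with_prefix(datasource_cls, prefix):
    """Drop all datasources whose name starts with `prefix`; return their names.

    `datasource_cls` is expected to be the Datasource class.
    """
    dropped = []
    for handle in datasource_cls.all():
        if handle.name.startswith(prefix):
            # "ignore" makes the call a no-op if the datasource is already gone
            datasource_cls.drop(handle.name, if_not_exists="ignore")
            dropped.append(handle.name)
    return dropped
```

A call might look like `drop_with_prefix(Datasource, "tmp_")`. Deletion is presumably irreversible, so the prefix should be chosen with care.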
