orca_sdk.datasource#

Datasource #

A handle to a datasource in the OrcaCloud

A Datasource is a collection of data saved to the OrcaCloud that can be used to create a Memoryset. It can be created from a Hugging Face Dataset, a PyTorch DataLoader or Dataset, a list of dictionaries, a dictionary of columns, a pandas DataFrame, a pyarrow Table, or a local file.

Attributes:

  • id (str) –

    Unique identifier for the datasource

  • name (str) –

    Unique name of the datasource

  • length (int) –

    Number of rows in the datasource

  • created_at (datetime) –

    When the datasource was created

  • columns (dict[str, str]) –

    Dictionary of column names and types

from_hf_dataset classmethod #

from_hf_dataset(name, dataset, if_exists='error')

Create a new datasource from a Hugging Face Dataset

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • dataset (Dataset) –

    The Hugging Face Dataset to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_pytorch classmethod #

from_pytorch(
    name, torch_data, column_names=None, if_exists="error"
)

Create a new datasource from a PyTorch DataLoader or Dataset

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • torch_data (DataLoader | Dataset) –

    The PyTorch DataLoader or Dataset to create the datasource from

  • column_names (list[str] | None, default: None ) –

    If the provided dataset or data loader returns unnamed tuples, this argument must be provided to specify the names of the columns.

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"
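
To make the role of column_names concrete, the sketch below pairs the positional fields of unnamed tuples with column names in plain Python. It only illustrates the naming idea and is not how the SDK processes PyTorch data; name_tuple_columns is a hypothetical helper and the sample values are made up.

```python
def name_tuple_columns(samples, column_names):
    """Pair each positional field of an unnamed tuple with a column name.

    Hypothetical helper for illustration only; not part of orca_sdk.
    """
    rows = []
    for index, sample in enumerate(samples):
        if len(sample) != len(column_names):
            raise ValueError(
                f"sample {index} has {len(sample)} fields, "
                f"expected {len(column_names)}"
            )
        rows.append(dict(zip(column_names, sample)))
    return rows


# Samples shaped like the unnamed tuples a PyTorch Dataset might return.
samples = [("Hello, world!", 1), ("Goodbye", 0)]
rows = name_tuple_columns(samples, ["text", "label"])
print(rows)

# With the SDK, the matching call would pass the same names (not run here):
# Datasource.from_pytorch("my_datasource", torch_data, column_names=["text", "label"])
```

The names are given in the same order as the tuple fields, so a mismatch in length is worth catching before uploading.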

from_list classmethod #

from_list(name, data, if_exists='error')

Create a new datasource from a list of dictionaries

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • data (list[dict]) –

    The list of dictionaries to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_list("my_datasource", [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}])

from_dict classmethod #

from_dict(name, data, if_exists='error')

Create a new datasource from a dictionary of columns

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • data (dict) –

    The dictionary of columns to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_dict("my_datasource", {"text": ["Hello, world!", "Goodbye"], "label": [1, 0]})
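
The from_list and from_dict examples describe the same two rows in different shapes. Below is a minimal plain-Python sketch of converting between the two, assuming every row has the same keys and every column has the same length. rows_to_columns and columns_to_rows are hypothetical helpers, not part of the SDK.

```python
def rows_to_columns(rows):
    """Turn a list of dictionaries into a dictionary of columns."""
    if not rows:
        return {}
    keys = list(rows[0])
    for index, row in enumerate(rows):
        if set(row) != set(keys):
            raise ValueError(f"row {index} does not have the keys {keys}")
    return {key: [row[key] for row in rows] for key in keys}


def columns_to_rows(columns):
    """Turn a dictionary of columns into a list of dictionaries."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


rows = [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}]
columns = rows_to_columns(rows)   # shape accepted by from_dict
round_trip = columns_to_rows(columns)  # shape accepted by from_list
print(columns)
```

Which shape to use mostly depends on how the data is already held in memory; row-oriented records suit from_list, column-oriented data suits from_dict.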

from_pandas classmethod #

from_pandas(name, dataframe, if_exists='error')

Create a new datasource from a pandas DataFrame

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • dataframe (DataFrame) –

    The pandas DataFrame to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_arrow classmethod #

from_arrow(name, pyarrow_table, if_exists='error')

Create a new datasource from a pyarrow Table

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • pyarrow_table (Table) –

    The pyarrow Table to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_disk classmethod #

from_disk(name, file_path, if_exists='error')

Create a new datasource from a local file

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • file_path (str | PathLike) –

    Path to the file on disk to create the datasource from. The file type will be inferred from the file extension. The following file types are supported:

    • .pkl: Pickle files containing lists of dictionaries or dictionaries of columns
    • .json/.jsonl: JSON and JSON Lines files
    • .csv: CSV files
    • .parquet: Parquet files
    • dataset directory: Directory containing a saved Hugging Face Dataset
  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"
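
Because the file type is inferred from the extension, it can help to check a path locally before uploading. The sketch below is a hypothetical client-side check built only from the supported-types list above. classify_source and SUPPORTED_SUFFIXES are illustrative names, matching is assumed to be exact (case-sensitive), and the SDK's own inference logic may differ.

```python
from pathlib import Path

# Suffixes copied from the supported file types listed for from_disk.
SUPPORTED_SUFFIXES = {
    ".pkl": "pickle",
    ".json": "json",
    ".jsonl": "json lines",
    ".csv": "csv",
    ".parquet": "parquet",
}


def classify_source(file_path):
    """Return a label for the file type, or raise ValueError if unsupported.

    Hypothetical pre-flight check; not part of orca_sdk.
    """
    path = Path(file_path)
    if path.is_dir():
        # Assumed to hold a saved Hugging Face Dataset; contents are not checked.
        return "dataset directory"
    kind = SUPPORTED_SUFFIXES.get(path.suffix)
    if kind is None:
        raise ValueError(
            f"unsupported file type: {path.suffix or '(no extension)'}"
        )
    return kind


print(classify_source("reviews.jsonl"))

# After the check passes, the upload itself would be (not run here):
# Datasource.from_disk("my_datasource", "reviews.jsonl")
```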

open classmethod #

open(name)

Get a handle to a datasource by name or id in the OrcaCloud

Parameters:

  • name (str) –

    The name or unique identifier of the datasource to get

Returns:

  • Datasource

    A handle to the existing datasource in the OrcaCloud

Raises:

exists classmethod #

exists(name_or_id)

Check if a datasource exists in the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the datasource to check

Returns:

  • bool

    True if the datasource exists, False otherwise

all classmethod #

all()

List all datasource handles in the OrcaCloud

Returns:

  • list[Datasource]

    A list of all datasource handles in the OrcaCloud

drop classmethod #

drop(name_or_id, if_not_exists='error')

Delete a datasource from the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the datasource to delete

  • if_not_exists (DropMode, default: 'error' ) –

    What to do if the datasource does not exist. Defaults to "error"; the other option is "ignore", which does nothing.

Raises:

  • LookupError

    If the datasource does not exist and if_not_exists is "error"
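
The if_exists and if_not_exists modes can be tried out without an OrcaCloud connection using a small in-memory stand-in. The class below is hypothetical and only mimics the behavior documented on this page (ValueError on a duplicate name, LookupError on a missing datasource). It tracks names only, not ids, and in "open" mode it ignores the newly passed data, which is an assumption of this sketch rather than a statement about the SDK.

```python
class InMemoryDatasource:
    """Hypothetical stand-in mimicking the documented modes; not the SDK."""

    _registry = {}  # maps datasource name -> instance

    def __init__(self, name, rows):
        self.name = name
        self.length = len(rows)

    @classmethod
    def from_list(cls, name, data, if_exists="error"):
        if name in cls._registry:
            if if_exists == "open":
                # Return the existing handle instead of raising.
                return cls._registry[name]
            raise ValueError(f"datasource {name!r} already exists")
        cls._registry[name] = cls(name, data)
        return cls._registry[name]

    @classmethod
    def exists(cls, name_or_id):
        return name_or_id in cls._registry

    @classmethod
    def drop(cls, name_or_id, if_not_exists="error"):
        if name_or_id not in cls._registry:
            if if_not_exists == "ignore":
                return  # nothing to delete, and that is fine
            raise LookupError(f"datasource {name_or_id!r} does not exist")
        del cls._registry[name_or_id]


rows = [{"text": "Hello, world!", "label": 1}]
first = InMemoryDatasource.from_list("demo", rows)
# A second create with if_exists="open" returns the existing datasource.
again = InMemoryDatasource.from_list("demo", rows, if_exists="open")
InMemoryDatasource.drop("demo")
# Dropping again is safe only with if_not_exists="ignore".
InMemoryDatasource.drop("demo", if_not_exists="ignore")
```

Passing if_exists="open" on creation and if_not_exists="ignore" on deletion makes setup and teardown scripts safe to re-run.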