orca_sdk.datasource#

Datasource #

A handle to a datasource in the OrcaCloud

A Datasource is a collection of data saved to the OrcaCloud that can be used to create a Memoryset. It can be created from a Hugging Face Dataset, a PyTorch DataLoader or Dataset, a list of dictionaries, a dictionary of columns, a pandas DataFrame, a pyarrow Table, or a local file.

Attributes:

  • id (str) –

    Unique identifier for the datasource

  • name (str) –

    Unique name of the datasource

  • description (str | None) –

    Optional description of the datasource

  • length (int) –

    Number of rows in the datasource

  • created_at (datetime) –

    When the datasource was created

  • columns (dict[str, str]) –

    Dictionary of column names and types
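
As a rough illustration of how these attributes might be read, the sketch below formats a one-line summary from any object that exposes them. `summarize` is a hypothetical helper, not part of the SDK, and the stand-in object, its id, and its column type strings (`"string"`, `"int64"`) are made-up values used only so the example runs without an OrcaCloud connection.

```python
from datetime import datetime
from types import SimpleNamespace


def summarize(datasource) -> str:
    """Format a short summary from the documented Datasource attributes.

    Hypothetical helper: it only reads `name`, `id`, `length`, `created_at`,
    and `columns`, and never contacts the OrcaCloud.
    """
    column_list = ", ".join(
        f"{column}: {type_name}" for column, type_name in datasource.columns.items()
    )
    return (
        f"{datasource.name} ({datasource.id}): {datasource.length} rows, "
        f"created {datasource.created_at:%Y-%m-%d}, columns [{column_list}]"
    )


# Stand-in object with made-up values; a real handle would come from the SDK.
example = SimpleNamespace(
    id="abc123",
    name="my_datasource",
    description=None,
    length=2,
    created_at=datetime(2024, 1, 15),
    columns={"text": "string", "label": "int64"},
)
print(summarize(example))
```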

from_hf_dataset classmethod #

from_hf_dataset(
    name, dataset, if_exists="error", description=None
)

Create a new datasource from a Hugging Face Dataset

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • dataset (Dataset) –

    The Hugging Face Dataset to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_hf_dataset_dict classmethod #

from_hf_dataset_dict(
    name, dataset_dict, if_exists="error", description=None
)

Create datasources from a Hugging Face DatasetDict

Parameters:

  • name (str) –

    Name prefix for the new datasources; each datasource name is this prefix suffixed with the dataset name

  • dataset_dict (DatasetDict) –

    The Hugging Face DatasetDict to create the datasources from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (dict[str, str | None] | str | None, default: None ) –

    Optional description for the datasources: either a single string or a dictionary mapping dataset names to descriptions

Returns:

  • dict[str, Datasource]

    A dictionary of datasource handles, keyed by the dataset name

Raises:

  • ValueError

    If a datasource already exists and if_exists is "error"
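
The two accepted shapes of `description` can be pictured with the sketch below, which expands either shape into one description per dataset name. `descriptions_per_split` is a hypothetical helper written only for illustration. That a single string is shared by every datasource, and that names missing from the dictionary get no description, are assumptions; the SDK's own handling may differ.

```python
def descriptions_per_split(description, split_names):
    """Expand a `description` argument into one description per dataset name.

    Hypothetical helper mirroring the documented parameter type
    (dict[str, str | None] | str | None); not the SDK's implementation.
    """
    if isinstance(description, dict):
        # Assumption: dataset names missing from the dict get no description.
        return {split: description.get(split) for split in split_names}
    # Assumption: a plain string (or None) is shared by every dataset.
    return {split: description for split in split_names}


shared = descriptions_per_split("Movie reviews", ["train", "test"])
per_split = descriptions_per_split({"train": "Training rows"}, ["train", "test"])
print(shared)
print(per_split)
```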

from_pytorch classmethod #

from_pytorch(
    name,
    torch_data,
    column_names=None,
    if_exists="error",
    description=None,
)

Create a new datasource from a PyTorch DataLoader or Dataset

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • torch_data (DataLoader | Dataset) –

    The PyTorch DataLoader or Dataset to create the datasource from

  • column_names (list[str] | None, default: None ) –

    If the provided dataset or data loader returns unnamed tuples, this argument must be provided to specify the names of the columns.

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"
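
The role of `column_names` can be sketched in plain Python: each unnamed tuple is paired positionally with the given names to form a row. `name_columns` is a hypothetical helper that only illustrates the idea, and the length check is an assumption; it is not how the SDK processes PyTorch data internally.

```python
def name_columns(samples, column_names):
    """Pair each unnamed tuple with `column_names`, position by position."""
    rows = []
    for sample in samples:
        if len(sample) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} values per sample, got {len(sample)}"
            )
        rows.append(dict(zip(column_names, sample)))
    return rows


# Tuples like these are what an unnamed-tuple dataset would yield per item.
unnamed_samples = [("Hello, world!", 1), ("Goodbye", 0)]
named_rows = name_columns(unnamed_samples, ["text", "label"])
print(named_rows)
```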

from_list classmethod #

from_list(name, data, if_exists='error', description=None)

Create a new datasource from a list of dictionaries

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • data (list[dict]) –

    The list of dictionaries to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_list("my_datasource", [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}])

from_dict classmethod #

from_dict(name, data, if_exists='error', description=None)

Create a new datasource from a dictionary of columns

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • data (dict) –

    The dictionary of columns to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

Examples:

>>> Datasource.from_dict("my_datasource", {"text": ["Hello, world!", "Goodbye"], "label": [1, 0]})
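
The dictionary-of-columns layout holds the same data as the list-of-dictionaries layout accepted by from_list, only transposed. The sketch below converts one into the other; `rows_to_columns` is a hypothetical helper, and requiring every row to share the same keys is an assumption of this sketch.

```python
def rows_to_columns(rows):
    """Transpose a list of row dictionaries into a dictionary of columns."""
    if not rows:
        return {}
    keys = list(rows[0])
    for row in rows:
        if set(row) != set(keys):
            raise ValueError("every row must have the same keys")
    return {key: [row[key] for row in rows] for key in keys}


rows = [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}]
columns = rows_to_columns(rows)
print(columns)
```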

from_pandas classmethod #

from_pandas(
    name, dataframe, if_exists="error", description=None
)

Create a new datasource from a pandas DataFrame

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • dataframe (DataFrame) –

    The pandas DataFrame to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_arrow classmethod #

from_arrow(
    name, pyarrow_table, if_exists="error", description=None
)

Create a new datasource from a pyarrow Table

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • pyarrow_table (Table) –

    The pyarrow Table to create the datasource from

  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"

from_disk classmethod #

from_disk(
    name, file_path, if_exists="error", description=None
)

Create a new datasource from a local file

Parameters:

  • name (str) –

    Required name for the new datasource (must be unique)

  • file_path (str | PathLike) –

    Path to the file on disk to create the datasource from. The file type will be inferred from the file extension. The following file types are supported:

    • .pkl: Pickle files containing lists of dictionaries or dictionaries of columns
    • .json/.jsonl: JSON and JSON Lines files
    • .csv: CSV files
    • .parquet: Parquet files
    • dataset directory: Directory containing a saved Hugging Face Dataset
  • if_exists (CreateMode, default: 'error' ) –

    What to do if a datasource with the same name already exists. Defaults to "error"; the other option is "open", which opens the existing datasource.

  • description (str | None, default: None ) –

    Optional description for the datasource

Returns:

  • Datasource

    A handle to the new datasource in the OrcaCloud

Raises:

  • ValueError

    If the datasource already exists and if_exists is "error"
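
Because the file type is inferred from the extension, the file has to be written with a matching suffix. The sketch below uses only the standard library to write the same rows as a JSON Lines file and as a CSV file. `write_rows`, the file names, and the temporary directory are hypothetical; the final call to from_disk is left as a comment because it needs the SDK and an OrcaCloud connection.

```python
import csv
import json
import tempfile
from pathlib import Path

rows = [{"text": "Hello, world!", "label": 1}, {"text": "Goodbye", "label": 0}]


def write_rows(rows, file_path):
    """Write rows as .jsonl or .csv, chosen by the file extension."""
    path = Path(file_path)
    if path.suffix == ".jsonl":
        # JSON Lines: one JSON object per line.
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    elif path.suffix == ".csv":
        # CSV: header row taken from the keys of the first row.
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"suffix not handled by this sketch: {path.suffix}")
    return path


tmp_dir = Path(tempfile.mkdtemp())
jsonl_path = write_rows(rows, tmp_dir / "data.jsonl")
csv_path = write_rows(rows, tmp_dir / "data.csv")

# With the SDK installed and configured, either path could then be passed on:
# Datasource.from_disk("my_datasource", jsonl_path)
```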

open classmethod #

open(name_or_id)

Get a handle to a datasource by name or id in the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or unique identifier of the datasource to get

Returns:

  • Datasource

    A handle to the existing datasource in the OrcaCloud

Raises:

exists classmethod #

exists(name_or_id)

Check if a datasource exists in the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the datasource to check

Returns:

  • bool

    True if the datasource exists, False otherwise

all classmethod #

all()

List all datasource handles in the OrcaCloud

Returns:

  • list[Datasource]

    A list of all datasource handles in the OrcaCloud

drop classmethod #

drop(name_or_id, if_not_exists='error')

Delete a datasource from the OrcaCloud

Parameters:

  • name_or_id (str) –

    The name or id of the datasource to delete

  • if_not_exists (DropMode, default: 'error' ) –

    What to do if the datasource does not exist. Defaults to "error"; the other option is "ignore", which does nothing.

Raises:

  • LookupError

    If the datasource does not exist and if_not_exists is "error"
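
One way the "ignore" mode might be used is to rebuild a datasource from scratch regardless of whether it already exists. The sketch below relies only on the drop and from_list signatures documented on this page. `replace_datasource` is a hypothetical helper, and the class is passed in as an argument so the function can be exercised with a test double.

```python
def replace_datasource(datasource_cls, name, rows, description=None):
    """Drop any datasource called `name`, then recreate it from `rows`.

    `datasource_cls` is expected to be the Datasource class documented here
    (or a test double with the same classmethods). Hypothetical helper.
    """
    # "ignore" makes the drop a no-op when nothing with this name exists yet.
    datasource_cls.drop(name, if_not_exists="ignore")
    return datasource_cls.from_list(name, rows, description=description)


# With the SDK installed and configured, usage might look like:
# from orca_sdk.datasource import Datasource
# datasource = replace_datasource(Datasource, "my_datasource", [{"text": "Hi", "label": 1}])
```

Dropping first is destructive, so this pattern only fits cases where the old rows are meant to be discarded; passing if_exists="open" to a from_* method keeps the existing datasource instead.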

download #

download(output_dir, file_type='hf_dataset')

Download the datasource to the specified directory in the specified file format

Parameters:

  • output_dir (str | PathLike) –

    The local directory where the downloaded file will be saved.

  • file_type (Literal['hf_dataset', 'json', 'csv'], default: 'hf_dataset' ) –

    The type of file to download.

Returns:

  • None

    None

to_list #

to_list()

Convert the datasource to a list of dictionaries.

Returns:

  • list[dict]

    The datasource represented as a list of dictionaries.
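
Since the result is a plain list of dictionaries, ordinary Python tooling applies to it. As a minimal sketch, the code below tallies label values from rows shaped like the from_list example; `label_counts` and the `"label"` column name are assumptions, and the literal rows stand in for what `to_list()` would return.

```python
from collections import Counter


def label_counts(rows, label_column="label"):
    """Count how often each value of `label_column` occurs in the rows."""
    return Counter(row[label_column] for row in rows)


# Stand-in for the return value of to_list() on a small datasource.
rows = [
    {"text": "Hello, world!", "label": 1},
    {"text": "Goodbye", "label": 0},
    {"text": "Hi again", "label": 1},
]
counts = label_counts(rows)
print(counts)
```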