orcalib.file_ingestor#

FileIngestorBase #

FileIngestorBase(
    db,
    table_name,
    dataset,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: ABC

Base class for file ingestors

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset (list[dict[Hashable, Any]]) –

    The dataset to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column
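
A subclass typically loads its file format into the list-of-dicts shape this constructor accepts and then delegates to it. The sketch below is hypothetical: the YAMLIngestor name, the yaml dependency, and the delegation pattern are illustrative assumptions, not part of orcalib.

import yaml

from orcalib.file_ingestor import FileIngestorBase


class YAMLIngestor(FileIngestorBase):  # hypothetical subclass, for illustration only
    def __init__(self, db, table_name, dataset_path, auto_table=False, replace=False, max_text_col_len=220):
        # Load the file into the list[dict[Hashable, Any]] shape the base constructor expects
        with open(dataset_path) as f:
            records = yaml.safe_load(f)
        super().__init__(
            db,
            table_name,
            records,
            auto_table=auto_table,
            replace=replace,
            max_text_col_len=max_text_col_len,
        )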

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created
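
These flags allow splitting ingestion into a schema pass and a data pass. A minimal sketch using the CSVIngestor subclass documented below; splitting the work across two calls is an assumed workflow, not a requirement.

ingestor = CSVIngestor(db, "my_table", "data.csv", auto_table=True)

# First pass: create the table without inserting any rows
table = ingestor.run(only_create_table=True)

# Second pass: insert the rows into the existing table, skipping table creation
table = ingestor.run(skip_create_table=True)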

PickleIngestor #

PickleIngestor(
    db,
    table_name,
    dataset_path,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: FileIngestorBase

Ingestor for Pickle files

Examples:

>>> ingestor = PickleIngestor(db, "my_table", "data.pkl", auto_table=True)
>>> table = ingestor.run()
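
The pickle file is assumed to hold the same list-of-records shape that FileIngestorBase accepts; a sketch of producing such a file, with illustrative column names, is shown below.

import pickle

records = [
    {"text": "A short review", "label": 1},
    {"text": "Another review", "label": 0},
]
with open("data.pkl", "wb") as f:
    pickle.dump(records, f)  # data.pkl can then be passed as dataset_path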

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset_path (str) –

    The path to the Pickle file to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created

JSONIngestor #

JSONIngestor(
    db,
    table_name,
    dataset_path,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: FileIngestorBase

Ingestor for JSON files

Examples:

>>> ingestor = JSONIngestor(db, "my_table", "data.json", auto_table=True)
>>> table = ingestor.run()
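
Assuming the JSON file contains an array of records (one object per row), a file like data.json could be produced as follows; the column names are illustrative.

import json

records = [
    {"text": "A short review", "label": 1},
    {"text": "Another review", "label": 0},
]
with open("data.json", "w") as f:
    json.dump(records, f)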

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset_path (str) –

    The path to the JSON file to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created

JSONLIngestor #

JSONLIngestor(
    db,
    table_name,
    dataset_path,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: FileIngestorBase

Ingestor for JSONL files

Examples:

>>> ingestor = JSONLIngestor(db, "my_table", "data.jsonl", auto_table=True)
>>> table = ingestor.run()
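
JSONL stores one JSON object per line. A sketch of writing such a file, with illustrative column names:

import json

records = [
    {"text": "A short review", "label": 1},
    {"text": "Another review", "label": 0},
]
with open("data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")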

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset_path (str) –

    The path to the JSONL file to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created

CSVIngestor #

CSVIngestor(
    db,
    table_name,
    dataset_path,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: FileIngestorBase

Ingestor for CSV files

Examples:

>>> ingestor = CSVIngestor(db, "my_table", "data.csv", auto_table=True)
>>> table = ingestor.run()
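
A CSV with a header row can be written with the standard library; the column names here are illustrative.

import csv

records = [
    {"text": "A short review", "label": 1},
    {"text": "Another review", "label": 0},
]
with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(records)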

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset_path (str) –

    The path to the CSV file to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created

ParquetIngestor #

ParquetIngestor(
    db,
    table_name,
    dataset_path,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
)

Bases: FileIngestorBase

Ingestor for Parquet files

Examples:

>>> ingestor = ParquetIngestor(db, "my_table", "data.parquet", auto_table=True)
>>> table = ingestor.run()
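
Parquet files can be written with pandas (an external dependency assumed here, not required by this ingestor); the column names are illustrative.

import pandas as pd

df = pd.DataFrame(
    [
        {"text": "A short review", "label": 1},
        {"text": "Another review", "label": 0},
    ]
)
df.to_parquet("data.parquet")  # requires a parquet engine such as pyarrow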

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset_path (str) –

    The path to the Parquet file to ingest

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created

HFDatasetIngestor #

HFDatasetIngestor(
    db,
    table_name,
    dataset,
    auto_table=False,
    replace=False,
    max_text_col_len=220,
    split=None,
    cache_dir=None,
)

Bases: FileIngestorBase

HuggingFace Dataset Ingestor

Examples:

>>> ingestor = HFDatasetIngestor(db, "my_table", "imdb", split="train")
>>> table = ingestor.run()
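
Because dataset accepts either a dataset name or a pre-loaded Dataset object, the dataset can also be loaded with the datasets library first and then handed to the ingestor; the cache directory below is illustrative.

from datasets import load_dataset

dataset = load_dataset("imdb", split="train", cache_dir="/tmp/hf_cache")
ingestor = HFDatasetIngestor(db, "my_table", dataset, auto_table=True)
table = ingestor.run()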

Parameters:

  • db (OrcaDatabase) –

    The database to ingest into

  • table_name (str) –

    The name of the table to ingest the data into

  • dataset (Dataset | str) –

    The dataset to ingest, either a pre-loaded Dataset object or the name of a dataset to load from the HuggingFace Hub

  • auto_table (bool, default: False ) –

    Whether to automatically create the table if it doesn’t exist

  • replace (bool, default: False ) –

    Whether to replace the table if it already exists

  • max_text_col_len (int, default: 220 ) –

    If a column has a median length greater than this, it will be parsed as a document column

  • split (str | None, default: None ) –

    The split of the dataset to ingest

  • cache_dir (str | None, default: None ) –

    The directory to cache the dataset in

run #

run(only_create_table=False, skip_create_table=False)

Ingest the data into the database table

Parameters:

  • only_create_table (bool, default: False ) –

    Whether to only create the table and not ingest the data

  • skip_create_table (bool, default: False ) –

    Whether to skip creating the table

Returns:

  • TableHandle

    A handle to the table that was created