orcalib.file_ingestor
FileIngestorBase
Bases: ABC
Base class for file ingestors
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset (list[dict[Hashable, Any]]) – The dataset to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
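The max_text_col_len threshold can be pictured with a small standalone sketch. The helper below is hypothetical and is not orcalib's implementation; it assumes that "length" means the character count of string values, and it only shows how a median-based cutoff separates short text columns from document-like ones.

```python
from collections.abc import Hashable
from statistics import median
from typing import Any

MAX_TEXT_COL_LEN = 220  # mirrors the documented default of max_text_col_len


def guess_document_columns(
    dataset: list[dict[Hashable, Any]],
    max_text_col_len: int = MAX_TEXT_COL_LEN,
) -> set[Hashable]:
    """Return columns whose string values have a median length above the threshold.

    Hypothetical helper for illustration only; not part of orcalib.
    """
    lengths: dict[Hashable, list[int]] = {}
    for row in dataset:
        for column, value in row.items():
            if isinstance(value, str):
                # Collect the character count of every string value per column.
                lengths.setdefault(column, []).append(len(value))
    return {
        column
        for column, column_lengths in lengths.items()
        if median(column_lengths) > max_text_col_len
    }


rows = [
    {"title": "Short title", "body": "x" * 500},
    {"title": "Another title", "body": "y" * 300},
    {"title": "Third", "body": "z" * 10},
]
document_columns = guess_document_columns(rows)
```

With the default of 220, `body` (median length 300) is flagged while `title` is not. Because the comparison is strict in this sketch, raising the threshold to 300 would leave no column flagged.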
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
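How the two run flags might combine in a two-step workflow (create the table first, load rows later) is sketched below with a toy in-memory stand-in. `InMemoryIngestor` and its dictionary-backed tables are hypothetical, and the behaviour shown is an assumption drawn from the parameter descriptions, not orcalib's actual code.

```python
from collections.abc import Hashable
from typing import Any

Row = dict[Hashable, Any]


class InMemoryIngestor:
    """Toy stand-in for an ingestor; hypothetical, not orcalib code."""

    def __init__(self, tables: dict[str, list[Row]], table_name: str, dataset: list[Row]):
        self.tables = tables
        self.table_name = table_name
        self.dataset = dataset

    def run(self, only_create_table: bool = False, skip_create_table: bool = False) -> list[Row]:
        if not skip_create_table:
            # Assumed behaviour: make sure the table exists before ingesting.
            self.tables.setdefault(self.table_name, [])
        if not only_create_table:
            # Assumed behaviour: append every row of the dataset to the table.
            self.tables[self.table_name].extend(self.dataset)
        return self.tables[self.table_name]


tables: dict[str, list[Row]] = {}
ingestor = InMemoryIngestor(tables, "my_table", [{"id": 1}, {"id": 2}])

created = ingestor.run(only_create_table=True)   # table exists but holds no rows yet
rows_after_create = len(created)
ingested = ingestor.run(skip_create_table=True)  # rows are added to the existing table
```

In this toy model, calling `run(skip_create_table=True)` before the table exists raises a `KeyError`; how the real ingestors behave in that case is not covered by this reference.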
PickleIngestor
Bases: FileIngestorBase
Ingestor for Pickle files
Examples:
>>> ingestor = PickleIngestor(db, "my_table", "data.pkl", auto_table=True)
>>> table = ingestor.run()
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset_path (str) – The path to the Pickle file to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
JSONIngestor
Bases: FileIngestorBase
Ingestor for JSON files
Examples:
>>> ingestor = JSONIngestor(db, "my_table", "data.json", auto_table=True)
>>> table = ingestor.run()
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset_path (str) – The path to the JSON file to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
JSONLIngestor
Bases: FileIngestorBase
Ingestor for JSONL files
Examples:
>>> ingestor = JSONLIngestor(db, "my_table", "data.jsonl", auto_table=True)
>>> table = ingestor.run()
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset_path (str) – The path to the JSONL file to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
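The difference between the input formats for JSONIngestor and JSONLIngestor can be illustrated with the standard library alone. The sketch below assumes the JSON file holds one top-level array of objects, which this reference does not confirm, and that the JSONL file holds one object per line. Both decode to the list-of-dicts shape that FileIngestorBase documents for its dataset parameter.

```python
import json
import tempfile
from pathlib import Path

records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    jsonl_path = Path(tmp) / "data.jsonl"

    # JSON: assumed here to be a single top-level array of objects.
    json_path.write_text(json.dumps(records), encoding="utf-8")

    # JSONL: one JSON object per line.
    jsonl_path.write_text(
        "\n".join(json.dumps(record) for record in records) + "\n",
        encoding="utf-8",
    )

    # Reading each format back yields the same list of dicts.
    from_json = json.loads(json_path.read_text(encoding="utf-8"))
    from_jsonl = [
        json.loads(line)
        for line in jsonl_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
```

Either file could then be passed by path as dataset_path to the matching ingestor.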
CSVIngestor
Bases: FileIngestorBase
Ingestor for CSV files
Examples:
>>> ingestor = CSVIngestor(db, "my_table", "data.csv", auto_table=True)
>>> table = ingestor.run()
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset_path (str) – The path to the CSV file to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
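For orientation, a CSV file with a header row maps naturally onto the list-of-dicts shape that FileIngestorBase documents for dataset. The sketch below uses only Python's csv module and is not how CSVIngestor is implemented; whether the ingestor expects a header row or infers column types is not stated in this reference.

```python
import csv
import io

# A small CSV document with a header row, held in memory for the example.
csv_text = "id,text\n1,first\n2,second\n"

# DictReader uses the header row as keys and yields one dict per data row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that csv.DictReader returns every value as a string, so `id` comes back as `"1"` rather than `1`.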
ParquetIngestor
Bases: FileIngestorBase
Ingestor for Parquet files
Examples:
>>> ingestor = ParquetIngestor(db, "my_table", "data.parquet", auto_table=True)
>>> table = ingestor.run()
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset_path (str) – The path to the Parquet file to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created
HFDatasetIngestor
Bases: FileIngestorBase
HuggingFace Dataset Ingestor
Examples:
Parameters:
- db (OrcaDatabase) – The database to ingest into
- table_name (str) – The name of the table to ingest the data into
- dataset (Dataset | str) – The dataset to ingest
- auto_table (bool, default: False) – Whether to automatically create the table if it doesn’t exist
- replace (bool, default: False) – Whether to replace the table if it already exists
- max_text_col_len (int, default: 220) – If a column has a median length greater than this, it will be parsed as a document column
- split (str | None, default: None) – The split of the dataset to ingest
- cache_dir (str | None, default: None) – The directory to cache the dataset in
run
Ingest the data into the database table
Parameters:
- only_create_table (bool, default: False) – Whether to only create the table and not ingest the data
- skip_create_table (bool, default: False) – Whether to skip creating the table
Returns:
- TableHandle – A handle to the table that was created