deepretro.data.loader
Dataset loading pipeline for reaction-step data, built on DeepChem’s
DataLoader base class.
ReactionDataLoader reads a CSV with product SMILES, reactant SMILES,
and a binary label column, featurizes each row with a reaction featurizer
(by default ReactionStepFeaturizer), and
writes the result to a DiskDataset with automatic
sharding for memory-efficient handling of large files.
A convenience function stratified_split wraps DeepChem’s
SingletaskStratifiedSplitter to split any Dataset into
train / valid / test sets while preserving class balance.
Usage
from deepretro.data import ReactionDataLoader, stratified_split
# Load and featurize a reaction CSV into a DiskDataset
loader = ReactionDataLoader(
    product_col="product",
    reactants_col="reactants",
    label_col="label",
)
dataset = loader.create_dataset("data/hallucination_dataset.csv", shard_size=1000)
# Stratified train/valid/test split (70/15/15)
train, valid, test = stratified_split(dataset)
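For reference, a minimal input file in the three-column layout described above could be produced as follows. This is only a sketch: the file name, the example reactions, the dot-separated reactant notation, and the meaning attached to each label value are illustrative assumptions rather than anything specified by deepretro. Only the column names match the defaults shown in the usage example.

```python
import csv
import os
import tempfile

# Hypothetical example rows: (product SMILES, reactants SMILES, binary label).
# Dot-separated reactants and the label meanings are assumptions for illustration.
ROWS = [
    ("CC(=O)OCC", "CC(=O)O.CCO", 1),
    ("CC(=O)Nc1ccccc1", "CC(=O)Cl.Nc1ccccc1", 1),
    ("c1ccccc1", "CCO", 0),
]


def write_reaction_csv(path, rows):
    """Write rows under the default column names product, reactants, label."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["product", "reactants", "label"])
        writer.writerows(rows)


# Write to a throwaway directory so the sketch leaves the working tree untouched.
csv_path = os.path.join(tempfile.mkdtemp(), "toy_reactions.csv")
write_reaction_csv(csv_path, ROWS)
```

The resulting path could then be passed to create_dataset in place of the file used above.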
API
Dataset loading pipeline using DeepChem data structures.
Provides ReactionDataLoader, a subclass of DeepChem’s
DataLoader that converts a reaction CSV
(product, reactants, label) into a DiskDataset with automatic
sharding, plus a convenience stratified_split function for
stratified train/valid/test splitting.
- class deepretro.data.loader.ReactionDataLoader(*args, **kwargs)[source]
Load a reaction CSV (paired product and reactant SMILES columns plus a label column) into a DeepChem DiskDataset.
Inherits from DataLoader and overrides _get_shards() and _featurize_shard() to handle paired SMILES columns (product + reactants). The parent’s create_dataset method is overridden to add automatic dropping of NaN rows (invalid SMILES), with a summary warning.
- Parameters:
featurizer (ReactionStepFeaturizer or None, optional) – Pre-configured featurizer. When None, a default one (radius=2, size=2048, domain features on) is created.
product_col (str, optional) – Column name for product SMILES. Default "product".
reactants_col (str, optional) – Column name for reactant SMILES. Default "reactants".
label_col (str, optional) – Column name for binary labels. Default "label".
id_field (str or None, optional) – Column name to use as sample identifiers. When None (default), sequential integer IDs are generated per shard.
log_every_n (int, optional) – Log a progress message every n shards. Default 1000.
Examples
>>> from deepretro.data import ReactionDataLoader
>>> loader = ReactionDataLoader()
>>> ds = loader.create_dataset("data/dataset.csv")
>>> len(ds)
808
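The row-dropping behaviour described for this class can be pictured with a small, library-free sketch. It assumes a hypothetical stand-in featurizer that returns None for a row it cannot handle (the real loader marks failures as NaN rows). The function names and the warning text are illustrative and are not deepretro's implementation.

```python
import warnings


def toy_featurize(row):
    """Hypothetical stand-in featurizer: fails (returns None) on empty SMILES."""
    product, reactants = row
    if not product or not reactants:
        return None
    return [len(product), len(reactants)]  # placeholder features, not real fingerprints


def featurize_and_drop(rows, featurize_row):
    """Featurize each row, skip failures, and emit a single summary warning."""
    kept, n_failed = [], 0
    for row in rows:
        features = featurize_row(row)
        if features is None:
            n_failed += 1
            continue
        kept.append(features)
    if n_failed:
        warnings.warn(
            f"Dropped {n_failed} of {len(rows)} rows that failed featurization"
        )
    return kept


rows = [("CC(=O)OCC", "CC(=O)O.CCO"), ("", "CCO"), ("c1ccccc1", "")]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    features = featurize_and_drop(rows, toy_featurize)
```

Counting failures and warning once at the end, rather than once per row, keeps the log readable when a large file contains many unparseable entries.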
- __init__(featurizer=None, product_col='product', reactants_col='reactants', label_col='label', id_field=None, log_every_n=1000)[source]
- Parameters:
featurizer (ReactionStepFeaturizer | None)
product_col (str)
reactants_col (str)
label_col (str)
id_field (str | None)
log_every_n (int)
- Return type:
None
- create_dataset(inputs, data_dir=None, shard_size=1000)[source]
Read, featurize, and write a reaction CSV to a DiskDataset.
Follows the same contract as the parent DataLoader.create_dataset() but adds automatic dropping of rows where featurization fails (invalid SMILES), with a summary warning at the end.
- Parameters:
inputs (str or list of str) – Path(s) to CSV file(s).
data_dir (str or None, optional) – Directory to write shards into. When None, a temporary directory is used (cleaned up when the DiskDataset object is garbage-collected).
shard_size (int or None, optional) – Number of rows per shard. Default 1000.
- Returns:
dataset – Disk-backed DeepChem dataset ready for splitting / training.
- Return type:
DiskDataset
Examples
>>> loader = ReactionDataLoader()
>>> ds = loader.create_dataset("data/dataset.csv")
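To see why a shard size keeps memory use bounded, the following sketch walks an iterable in fixed-size chunks so that only one chunk is materialised at a time. It illustrates the idea behind shard_size only. iter_shards is a hypothetical helper and is not part of deepretro or DeepChem.

```python
from itertools import islice


def iter_shards(rows, shard_size):
    """Yield successive lists of at most shard_size rows from any iterable."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    iterator = iter(rows)
    while True:
        shard = list(islice(iterator, shard_size))
        if not shard:
            return
        yield shard


# 2500 rows with a shard size of 1000 give two full shards and one partial shard.
shard_lengths = [len(shard) for shard in iter_shards(range(2500), shard_size=1000)]
print(shard_lengths)  # [1000, 1000, 500]
```

A smaller shard size lowers peak memory at the cost of more shards on disk.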
- deepretro.data.loader.stratified_split(dataset, frac_train=0.7, frac_valid=0.15, frac_test=0.15, seed=42)[source]
Stratified split into train / valid / test sets.
Uses DeepChem’s SingletaskStratifiedSplitter to preserve class balance across all three splits. Works with any DeepChem Dataset subclass (DiskDataset, NumpyDataset, etc.).
- Parameters:
dataset (Dataset) – Full dataset to split (DiskDataset or NumpyDataset).
frac_train (float, optional) – Training fraction. Default 0.7.
frac_valid (float, optional) – Validation fraction. Default 0.15.
frac_test (float, optional) – Test fraction. Default 0.15.
seed (int, optional) – Random seed. Default 42.
- Returns:
train_ds (Dataset)
valid_ds (Dataset)
test_ds (Dataset)
- Return type:
tuple[deepchem.data.Dataset, deepchem.data.Dataset, deepchem.data.Dataset]
Examples
>>> import numpy as np
>>> from deepchem.data import NumpyDataset
>>> from deepretro.data import stratified_split
>>> ds = NumpyDataset(X=np.random.rand(100, 10), y=np.array([0]*50 + [1]*50).reshape(-1,1))
>>> train, valid, test = stratified_split(ds)
>>> len(train) + len(valid) + len(test) == 100
True
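What "preserving class balance" means can be shown with a standard-library sketch that splits each class separately and then merges the pieces. This is a conceptual illustration only: toy_stratified_split is a hypothetical helper, and its per-class shuffle is not claimed to match the algorithm used by SingletaskStratifiedSplitter.

```python
import random
from collections import defaultdict


def toy_stratified_split(labels, frac_train=0.7, frac_valid=0.15, seed=42):
    """Return (train, valid, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(frac_train * len(indices))
        n_valid = int(frac_valid * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test split
    return train, valid, test


labels = [0] * 50 + [1] * 50
train_idx, valid_idx, test_idx = toy_stratified_split(labels)
```

Because both classes here have the same size, every split ends up with equal numbers of each label, which a plain random split would not guarantee.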