deepretro.data.loader

Dataset loading pipeline for reaction-step data, built on DeepChem’s DataLoader base class.

ReactionDataLoader reads a CSV file containing product SMILES, reactant SMILES, and a binary label column, and featurizes each row with a reaction featurizer (ReactionStepFeaturizer by default). The result is written to a DiskDataset, sharded automatically so that large files can be processed without loading everything into memory.
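To illustrate what sharding means here, the following is a minimal standard-library sketch of reading a reaction CSV in fixed-size chunks. It is not the ReactionDataLoader implementation: the helper name iter_shards and the sample rows are hypothetical, and featurization is omitted entirely. Only the column names and the shard_size idea come from the usage example in this page.

```python
import csv
import io
from typing import Iterator, List, TextIO, Tuple

# One parsed row: (product SMILES, reactant SMILES, binary label)
Row = Tuple[str, str, int]


def iter_shards(
    csv_file: TextIO,
    shard_size: int,
    product_col: str = "product",
    reactants_col: str = "reactants",
    label_col: str = "label",
) -> Iterator[List[Row]]:
    """Yield lists of at most `shard_size` rows (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    shard: List[Row] = []
    for record in reader:
        shard.append(
            (record[product_col], record[reactants_col], int(record[label_col]))
        )
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # Emit the final, possibly smaller, shard
        yield shard


# Tiny in-memory CSV standing in for a file on disk (made-up rows)
SAMPLE_CSV = """product,reactants,label
CCO,C=C.O,1
CC(=O)O,CCO,1
CCN,CC.N,0
CCC,CC.C,0
CCCl,CC.Cl,1
"""

shards = list(iter_shards(io.StringIO(SAMPLE_CSV), shard_size=2))
shard_lengths = [len(s) for s in shards]
```

Because each shard is produced and consumed independently, only shard_size rows need to be held in memory at a time, which is the property the loader relies on for large files.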

The convenience function stratified_split wraps DeepChem’s SingletaskStratifiedSplitter to split any Dataset into train, validation, and test sets while preserving class balance.
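The idea of a class-balance-preserving split can be sketched in plain Python as follows. This is an illustration only and does not reproduce how DeepChem’s splitter or stratified_split works internally: the function name stratified_indices, its parameters, and the toy labels are assumptions made for the example. The 70/15/15 fractions mirror the split described in the usage section.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_indices(
    labels: Sequence[int],
    frac_train: float = 0.7,
    frac_valid: float = 0.15,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split row indices so each class keeps roughly the same proportion
    in the train, validation, and test sets (hypothetical helper)."""
    # Group row indices by class label
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(frac_train * len(indices))
        n_valid = round(frac_valid * len(indices))
        # Slice each class separately; the remainder goes to the test set
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Toy imbalanced labels: 20 positives, 80 negatives
labels = [1] * 20 + [0] * 80
train_idx, valid_idx, test_idx = stratified_indices(labels)
train_positive_fraction = sum(labels[i] for i in train_idx) / len(train_idx)
```

Splitting each class separately is what keeps the positive fraction of the training set close to that of the full dataset; a purely random split of an imbalanced dataset gives no such guarantee.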

Usage

from deepretro.data import ReactionDataLoader, stratified_split

# Load and featurize a reaction CSV into a DiskDataset
loader = ReactionDataLoader(
    product_col="product",
    reactants_col="reactants",
    label_col="label",
)
dataset = loader.create_dataset("data/hallucination_dataset.csv", shard_size=1000)

# Stratified train/valid/test split (70/15/15)
train, valid, test = stratified_split(dataset)

API