deepretro.data.loader
Dataset loading pipeline for reaction-step data, built on DeepChem’s
DataLoader base class.
ReactionDataLoader reads a CSV with product SMILES, reactant SMILES,
and a binary label column, featurizes each row with a reaction featurizer
(by default ReactionStepFeaturizer), and
writes the result to a DiskDataset with automatic
sharding for memory-efficient handling of large files.
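The sharding idea can be illustrated without DeepChem: the CSV is consumed in fixed-size chunks, so only one shard's rows need to be in memory at a time. The following is a minimal standard-library sketch of that idea, not ReactionDataLoader's actual implementation. The helper name iter_shards and the sample rows are hypothetical; only the three column roles (product, reactants, label) come from the description above.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_shards(handle: TextIO, shard_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `shard_size` CSV rows (hypothetical helper)."""
    reader = csv.DictReader(handle)
    while True:
        # Pull the next chunk of rows; earlier chunks can be garbage-collected.
        shard = list(islice(reader, shard_size))
        if not shard:
            return
        yield shard


# Tiny in-memory CSV with made-up rows, standing in for a large file on disk.
SAMPLE_CSV = """product,reactants,label
CCO,C=C.O,1
CC(=O)O,CCO,1
CCN,CC.N,0
CCOC(C)=O,CC(=O)O.CCO,1
c1ccccc1O,c1ccccc1,0
"""

shards = list(iter_shards(io.StringIO(SAMPLE_CSV), shard_size=2))
shard_sizes = [len(shard) for shard in shards]
```

In a real pipeline each shard would be featurized and written to disk before the next one is read; here the shards are collected into a list only to make the chunk sizes easy to inspect.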
A convenience function stratified_split wraps DeepChem’s
SingletaskStratifiedSplitter to split any Dataset into
train / valid / test sets while preserving class balance.
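To make "preserving class balance" concrete, here is a minimal sketch of a stratified split written with the standard library only. It is an illustration of the general technique, not necessarily how SingletaskStratifiedSplitter or stratified_split is implemented. The function name stratified_indices, its parameters, and the toy labels are hypothetical; the 70/15/15 fractions mirror the defaults mentioned in the usage comment.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_indices(
    labels: Sequence[int],
    frac_train: float = 0.7,
    frac_valid: float = 0.15,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split row indices so each subset keeps the overall label ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        # Shuffle within the class, then cut that class at the same fractions.
        rng.shuffle(indices)
        n_train = round(frac_train * len(indices))
        n_valid = round(frac_valid * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Toy imbalanced labels: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
train_idx, valid_idx, test_idx = stratified_indices(labels)
```

Because every class is cut at the same fractions, each of the three subsets in this toy example keeps the original 20% positive rate, which a plain random split does not guarantee on small or imbalanced data.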
Usage
from deepretro.data import ReactionDataLoader, stratified_split
# Load and featurize a reaction CSV into a DiskDataset
loader = ReactionDataLoader(
    product_col="product",
    reactants_col="reactants",
    label_col="label",
)
dataset = loader.create_dataset("data/hallucination_dataset.csv", shard_size=1000)
# Stratified train/valid/test split (70/15/15)
train, valid, test = stratified_split(dataset)