rxn.onmt_models.augmentation.augment_translation_dataset

rxn.onmt_models.augmentation.augment_translation_dataset(*, src_in, src_out, tgt_in, tgt_out, n_augmentations, keep_original=False, augmenter=None)

Augment an RXN translation dataset with randomization and shuffling of the source.

The target will not be modified, but its elements will be duplicated if more than one augmentation is required. The bare * at the start of the signature makes every parameter keyword-only (Python Cookbook 7.2).
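The behaviour described above can be illustrated with a small in-memory sketch. This is not the library's implementation: augment_pairs and shuffle_fragments are hypothetical names, the toy augmenter only shuffles the "."-separated fragments of each source line, and placing the original sample first when keep_original is set is an assumption.

```python
import random
from typing import List, Tuple


def shuffle_fragments(smiles: str, rng: random.Random) -> str:
    """Toy stand-in for an augmenter: shuffle the '.'-separated fragments."""
    fragments = smiles.split(".")
    rng.shuffle(fragments)
    return ".".join(fragments)


def augment_pairs(
    *,  # everything after the bare * must be passed by keyword
    src_lines: List[str],
    tgt_lines: List[str],
    n_augmentations: int,
    keep_original: bool = False,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Augment each source line and duplicate its target to stay aligned."""
    rng = random.Random(seed)
    src_out: List[str] = []
    tgt_out: List[str] = []
    for src, tgt in zip(src_lines, tgt_lines):
        variants = [shuffle_fragments(src, rng) for _ in range(n_augmentations)]
        if keep_original:
            variants.insert(0, src)  # assumption: the original comes first
        src_out.extend(variants)
        # The target is never modified, only repeated once per source variant.
        tgt_out.extend([tgt] * len(variants))
    return src_out, tgt_out


src, tgt = augment_pairs(
    src_lines=["CCO.CC(=O)O", "c1ccccc1.Br"],
    tgt_lines=["CCOC(C)=O", "Brc1ccccc1"],
    n_augmentations=2,
    keep_original=True,
)
```

In this sketch each input pair yields three output pairs (the original plus two variants), and calling augment_pairs with positional arguments raises a TypeError because of the bare *.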

Notes (Alain, Nov 2022):

In the future, we may want to convert this functionality to a class.

Parameters

src_in (Path) – source file to augment, in TXT format (tokenized or not).
src_out (Path) – augmented source file.
tgt_in (Path) – target file to augment (i.e. duplicate its samples).
tgt_out (Path) – augmented target file.
n_augmentations (int) – number of augmentations per input line.
keep_original (bool, default: False) – whether to keep the original sample in the output.
augmenter (Optional[SmilesAugmenter], default: None) – augmenter instance. Defaults to a rotated augmentation of the compound SMILES.
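Because the target samples are only duplicated, the files written to src_out and tgt_out are expected to pair up line by line. A minimal sanity check along those lines might look as follows; check_aligned and count_lines are hypothetical helpers, not part of the package, and the two files written here are made-up stand-ins for real augmentation output.

```python
import tempfile
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count the lines of a text file without loading it entirely."""
    with path.open("rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_aligned(*, src_out: Path, tgt_out: Path) -> int:
    """Return the number of pairs, or raise if the line counts differ."""
    n_src = count_lines(src_out)
    n_tgt = count_lines(tgt_out)
    if n_src != n_tgt:
        raise ValueError(f"{src_out} has {n_src} lines but {tgt_out} has {n_tgt}")
    return n_src


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files playing the role of augmented source and target output.
    src_path = Path(tmp) / "src-augmented.txt"
    tgt_path = Path(tmp) / "tgt-augmented.txt"
    src_path.write_text("CCO.CC(=O)O\nCC(=O)O.CCO\n", encoding="utf-8")
    tgt_path.write_text("CCOC(C)=O\nCCOC(C)=O\n", encoding="utf-8")
    n_pairs = check_aligned(src_out=src_path, tgt_out=tgt_path)
```

A check like this only compares line counts; it does not verify that each source line is a valid variant of its original.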

Return type

None