RXN chemistry utilities package

This repository contains various chemistry-related Python utilities used in the RXN universe. For general utilities not related to chemistry, see our other repository rxn-utilities.

Links:

System Requirements

This package is supported on all operating systems. It has been tested on the following systems:

macOS: Big Sur (11.1)
Linux: Ubuntu 18.04.4

A Python version of 3.7 or greater is recommended.

Installation guide

The package can be installed from Pypi:

pip install rxn-chem-utils

For local development, the package can be installed with:

pip install -e .[dev]

The RDKit dependency is not installed automatically and can be installed via Conda or Pypi:

# Install RDKit from Conda
conda install -c conda-forge rdkit

# Install RDKit from Pypi
pip install rdkit

Package highlights

Convert between compound representations

There are functions to convert between SMILES, RDKit.Mol, MDL, InChI, etc. All of them work in a similar way:

>>> from rxn.chemutils.conversion import smiles_to_mol, mol_to_smiles
>>> mol = smiles_to_mol("CO(C)")
>>> mol_to_smiles(mol)
'COC'

The functions raise exceptions when failing, and allow to be used without sanitization.

>>> mol = smiles_to_mol("CFC")
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "CFC" is not a valid SMILES string

>>> mol = smiles_to_mol("CFC", sanitize=False)
>>> mol_to_smiles(mol)
'CFC'

Reaction SMILES

The package supports different kinds of reaction SMILES, which, internally, are stored as ReactionEquations.

To convert to and from ReactionEquation, a few functions are provided:

parse_reaction_smiles and to_reaction_smiles, if you know already the format.
parse_any_reaction_smiles, if you don’t know the format or want to be flexible.

Examples:

>>> from rxn.chemutils.reaction_smiles import ReactionFormat, determine_format, parse_reaction_smiles, to_reaction_smiles, parse_any_reaction_smiles
>>> rxn_smiles = "CC.O.[Na+]~[Cl-]>>CCO"
>>> determine_format(rxn_smiles)
<ReactionFormat.STANDARD_WITH_TILDE: 3>
>>> parse_reaction_smiles(rxn_smiles, ReactionFormat.STANDARD_WITH_TILDE)
ReactionEquation(reactants=['CC', 'O', '[Na+].[Cl-]'], agents=[], products=['CCO'])
>>> parse_any_reaction_smiles(rxn_smiles)
ReactionEquation(reactants=['CC', 'O', '[Na+].[Cl-]'], agents=[], products=['CCO'])
>>> to_reaction_smiles(parse_any_reaction_smiles(rxn_smiles), ReactionFormat.EXTENDED)
'CC.O.[Na+].[Cl-]>>CCO |f:2.3|'

Multicomponent SMILES

Sometimes, it is necessary to represent multiple compounds as one single SMILES string. For fragments / ions, it becomes necessary to distinguish between what parts belong together as one compound, and what are differentt compounds. In such “multitcomponent SMILES”, we typically use tildes, ~, to indicate that different SMILES fragments belong to the same compound.

>>> from rxn.chemutils.multicomponent_smiles import multicomponent_smiles_to_list, list_to_multicomponent_smiles
>>> list_to_multicomponent_smiles(["CC", "[Na+].[Cl-]"], fragment_bond="~")
'CC.[Na+]~[Cl-]'
>>> multicomponent_smiles_to_list('CC.[Na+]~[Cl-]', fragment_bond="~")
['CC', '[Na+].[Cl-]']

Canonicalization

Canonicalization of compounds, with the possibility to remove the valence check:

>>> from rxn.chemutils.conversion import canonicalize_smiles
>>> canonicalize_smiles("CC(O)")
'CCO'
>>> canonicalize_smiles("ABCD")  # Invalid SMILES
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "ABCD" is not a valid SMILES string
>>> canonicalize_smiles("CF(C)")  # Invalid valence, fails by default
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "CFC" is not a valid SMILES string
>>> canonicalize_smiles("CF(C)", check_valence=False)  # Invalid valence, does not fail
'CFC'

Canonicalization of any kind of SMILES (components, multicomponent SMILES, reaction SMILES, etc.), again with the possibility to disable the valence check. Note that the resulting string is in the same format.

>>> from rxn.chemutils.miscellaneous import canonicalize_any
>>> canonicalize_any("[Na+].[Cl-]")
'[Cl-].[Na+]'
>>> canonicalize_any("OC.C(O)~CF(C)", check_valence=False)
'CO.CFC~CO'
>>> canonicalize_any("CC(C)>C(O)>C(O)")
'CCC>CO>CO'
>>> canonicalize_any("CO.O.C>>C(O) |f:1.2|")
'CO.C.O>>CO |f:1.2|'

The executable rxn-canonicalize (installed with the package), which works either on files or on stdin

rxn-canonicalize --help

Augmentation

See smiles_randomization.py and smiles_augmenter.py for the augmentation of compound SMILES and reaction SMILES strings.

Others

Without going into details, the package also does the following:

Tokenization and detokenization of SMILES strings in tokenization.py, and the executables rxn-tokenize and rxn-detokenize.
Easy combination of precursor SMILES and product SMILES into a reaction SMILES with the ReactionCombiner, and the executable rxn-combine-reaction.
Parsing of RDFs into reaction SMILES: different modules, and the executable rxn-rdf-to-smiles.
… and many others.