RXN chemistry utilities package
This repository contains various chemistry-related Python utilities used in the RXN universe.
For general utilities not related to chemistry, see our other repository rxn-utilities.
System Requirements
This package is supported on all operating systems. It has been tested on the following systems:
macOS: Big Sur (11.1)
Linux: Ubuntu 18.04.4
A Python version of 3.7 or greater is recommended.
Installation guide
The package can be installed from PyPI:
pip install rxn-chem-utils
For local development, the package can be installed with:
pip install -e .[dev]
The RDKit dependency is not installed automatically and can be installed via Conda or PyPI:
# Install RDKit from Conda
conda install -c conda-forge rdkit
# Install RDKit from PyPI
pip install rdkit
Package highlights
Convert between compound representations
There are functions to convert between SMILES, RDKit.Mol, MDL, InChI, etc.
All of them work in a similar way:
>>> from rxn.chemutils.conversion import smiles_to_mol, mol_to_smiles
>>> mol = smiles_to_mol("CO(C)")
>>> mol_to_smiles(mol)
'COC'
The functions raise exceptions on failure, and can optionally be used without sanitization:
>>> mol = smiles_to_mol("CFC")
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "CFC" is not a valid SMILES string
>>> mol = smiles_to_mol("CFC", sanitize=False)
>>> mol_to_smiles(mol)
'CFC'
Reaction SMILES
The package supports different kinds of reaction SMILES, which are stored internally as ReactionEquations.
To convert to and from ReactionEquation, a few functions are provided:
parse_reaction_smiles and to_reaction_smiles, if you already know the format.
parse_any_reaction_smiles, if you don’t know the format or want to be flexible.
Examples:
>>> from rxn.chemutils.reaction_smiles import ReactionFormat, determine_format, parse_reaction_smiles, to_reaction_smiles, parse_any_reaction_smiles
>>> rxn_smiles = "CC.O.[Na+]~[Cl-]>>CCO"
>>> determine_format(rxn_smiles)
<ReactionFormat.STANDARD_WITH_TILDE: 3>
>>> parse_reaction_smiles(rxn_smiles, ReactionFormat.STANDARD_WITH_TILDE)
ReactionEquation(reactants=['CC', 'O', '[Na+].[Cl-]'], agents=[], products=['CCO'])
>>> parse_any_reaction_smiles(rxn_smiles)
ReactionEquation(reactants=['CC', 'O', '[Na+].[Cl-]'], agents=[], products=['CCO'])
>>> to_reaction_smiles(parse_any_reaction_smiles(rxn_smiles), ReactionFormat.EXTENDED)
'CC.O.[Na+].[Cl-]>>CCO |f:2.3|'
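To make the relationship between the tilde format and the extended format more concrete, here is a minimal pure-Python sketch that reproduces the conversion shown above using only string operations. The helper names split_tilde_reaction and to_extended are hypothetical and are not part of the package; the real functions are more robust and should be preferred in practice.

```python
from typing import List, Tuple


def split_tilde_reaction(rxn_smiles: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>agents>products'; '~' inside a compound becomes '.'."""
    groups = rxn_smiles.split(">")
    if len(groups) != 3:
        raise ValueError(f"Expected exactly two '>' in {rxn_smiles!r}")
    parsed = [
        [compound.replace("~", ".") for compound in group.split(".")] if group else []
        for group in groups
    ]
    return parsed[0], parsed[1], parsed[2]


def to_extended(reactants: List[str], agents: List[str], products: List[str]) -> str:
    """Write all fragments dot-separated; list grouped fragment indices in |f:...|."""
    fragment_index = 0  # running index over every fragment in the reaction
    fragment_groups = []
    sides = []
    for compounds in (reactants, agents, products):
        for compound in compounds:
            n_fragments = compound.count(".") + 1
            if n_fragments > 1:
                indices = range(fragment_index, fragment_index + n_fragments)
                fragment_groups.append(".".join(str(i) for i in indices))
            fragment_index += n_fragments
        sides.append(".".join(compounds))
    rxn = ">".join(sides)
    if fragment_groups:
        rxn += " |f:" + ",".join(fragment_groups) + "|"
    return rxn


reactants, agents, products = split_tilde_reaction("CC.O.[Na+]~[Cl-]>>CCO")
print(reactants, agents, products)
print(to_extended(reactants, agents, products))
```

The fragment indices in the |f:...| suffix count every dot-separated fragment from left to right, which is why the sodium chloride pair ends up as 2.3 in this example.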
Multicomponent SMILES
Sometimes, it is necessary to represent multiple compounds as one single SMILES string.
For fragments / ions, it becomes necessary to distinguish which parts belong together as one compound and which are different compounds.
In such “multicomponent SMILES”, we typically use tildes, ~, to indicate that different SMILES fragments belong to the same compound.
>>> from rxn.chemutils.multicomponent_smiles import multicomponent_smiles_to_list, list_to_multicomponent_smiles
>>> list_to_multicomponent_smiles(["CC", "[Na+].[Cl-]"], fragment_bond="~")
'CC.[Na+]~[Cl-]'
>>> multicomponent_smiles_to_list('CC.[Na+]~[Cl-]', fragment_bond="~")
['CC', '[Na+].[Cl-]']
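The idea behind the fragment bond can be sketched in a few lines of plain Python. This is only an illustration of the string convention, assuming that dots separate compounds and the fragment bond separates fragments within one compound; the function names below are hypothetical and the package functions should be used in real code.

```python
from typing import List


def to_multicomponent(compounds: List[str], fragment_bond: str = "~") -> str:
    """Join compounds with '.', marking fragments of one compound with fragment_bond."""
    return ".".join(compound.replace(".", fragment_bond) for compound in compounds)


def from_multicomponent(smiles: str, fragment_bond: str = "~") -> List[str]:
    """Split on '.', then restore '.' between fragments of the same compound."""
    if not smiles:
        return []
    return [part.replace(fragment_bond, ".") for part in smiles.split(".")]


compounds = ["CC", "[Na+].[Cl-]"]
joined = to_multicomponent(compounds)
print(joined)
print(from_multicomponent(joined))
```

Because the two helpers are inverses of each other, a round trip returns the original list of compounds.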
Canonicalization
Canonicalization of compounds, with the possibility to remove the valence check:
>>> from rxn.chemutils.conversion import canonicalize_smiles
>>> canonicalize_smiles("CC(O)")
'CCO'
>>> canonicalize_smiles("ABCD") # Invalid SMILES
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "ABCD" is not a valid SMILES string
>>> canonicalize_smiles("CF(C)") # Invalid valence, fails by default
Traceback (most recent call last):
[...]
rxn.chemutils.exceptions.InvalidSmiles: "CFC" is not a valid SMILES string
>>> canonicalize_smiles("CF(C)", check_valence=False) # Invalid valence, does not fail
'CFC'
Canonicalization of any kind of SMILES (components, multicomponent SMILES, reaction SMILES, etc.), again with the possibility to disable the valence check. Note that the resulting string is in the same format.
>>> from rxn.chemutils.miscellaneous import canonicalize_any
>>> canonicalize_any("[Na+].[Cl-]")
'[Cl-].[Na+]'
>>> canonicalize_any("OC.C(O)~CF(C)", check_valence=False)
'CO.CFC~CO'
>>> canonicalize_any("CC(C)>C(O)>C(O)")
'CCC>CO>CO'
>>> canonicalize_any("CO.O.C>>C(O) |f:1.2|")
'CO.C.O>>CO |f:1.2|'
The executable rxn-canonicalize (installed with the package) works either on files or on stdin:
rxn-canonicalize --help
Augmentation
See smiles_randomization.py and smiles_augmenter.py for the augmentation of compound SMILES and reaction SMILES strings.
Others
Without going into details, the package also does the following:
Tokenization and detokenization of SMILES strings in tokenization.py, and the executables rxn-tokenize and rxn-detokenize.
Easy combination of precursor SMILES and product SMILES into a reaction SMILES with the ReactionCombiner, and the executable rxn-combine-reaction.
Parsing of RDFs into reaction SMILES: different modules, and the executable rxn-rdf-to-smiles.
… and many others.
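As a rough illustration of what SMILES tokenization involves, the sketch below splits a SMILES string into space-separated tokens with a regular expression of the kind commonly used for this purpose. The pattern, the function names, and the space-separated output are assumptions made for this example; they are not taken from tokenization.py, whose behavior may differ.

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, organic-subset
# atoms, bonds, branches, ring closures, and reaction arrows.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> str:
    """Return the tokens separated by spaces; fail if any character is left over."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Cannot tokenize {smiles!r}")
    return " ".join(tokens)


def detokenize(tokenized: str) -> str:
    """Undo tokenize() by removing the separating spaces."""
    return tokenized.replace(" ", "")


tokenized = tokenize("CC(=O)Cl")
print(tokenized)
print(detokenize(tokenized))
```

Checking that the joined tokens reproduce the input is a cheap way to catch characters the pattern does not cover.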