Tokenization and SMILES utils

Reaction SMILES tokenizer

We use the same tokenizer as the one used by http://rxnmapper.ai.

It is an instance of the Hugging Face Transformers BertTokenizer.

get_default_tokenizer[source]

get_default_tokenizer()

class SmilesTokenizer[source]

SmilesTokenizer(vocab_file='', **kwargs) :: BertTokenizer

Constructs a SmilesTokenizer. Mostly copied from https://github.com/huggingface/transformers.

Args:
    vocab_file: Path to a vocabulary file with one SMILES token per line.
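A minimal construction sketch, assuming the rxnfp.tokenization import path used throughout this page and a placeholder vocabulary path (in practice the packaged vocabulary is loaded via get_default_tokenizer below):

from rxnfp.tokenization import SmilesTokenizer

# 'path/to/vocab.txt' is a placeholder; the file must contain one token per line.
tokenizer = SmilesTokenizer(vocab_file='path/to/vocab.txt')
print(tokenizer.tokenize('CC(C)S.Fc1cccnc1F>>CC(C)Sc1ncccc1F'))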

class BasicSmilesTokenizer[source]

BasicSmilesTokenizer(regex_pattern='(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])')

Run basic SMILES tokenization
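A small usage sketch: the basic tokenizer only applies the regex above, without any vocabulary lookup.

from rxnfp.tokenization import BasicSmilesTokenizer

basic_tokenizer = BasicSmilesTokenizer()
print(basic_tokenizer.tokenize('O=C([O-])[O-].[K+].[K+]'))
# ['O', '=', 'C', '(', '[O-]', ')', '[O-]', '.', '[K+]', '.', '[K+]']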

load_vocab[source]

load_vocab(vocab_file)

Loads a vocabulary file into a dictionary.
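A minimal sketch, assuming a tiny vocabulary file written on the fly; tokens are mapped to their line index, as in standard BERT vocabulary loading.

from rxnfp.tokenization import load_vocab

# Toy vocabulary: one token per line.
with open('toy_vocab.txt', 'w') as f:
    f.write('[PAD]\n[CLS]\n[SEP]\nC\nO\n.\n>>\n')

vocab = load_vocab('toy_vocab.txt')
assert vocab['C'] == 3 and vocab['>>'] == 6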

convert_reaction_to_valid_features[source]

convert_reaction_to_valid_features(reaction:str, tokenizer:SmilesTokenizer, max_seq_length:int=512)

Convert reaction SMILES into input features.
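A short usage sketch; the comment about the returned fields reflects the usual BERT input convention (token ids, attention mask, type ids) and is an assumption about this helper, not a guarantee of its exact attribute names.

from rxnfp.tokenization import get_default_tokenizer, convert_reaction_to_valid_features

tokenizer = get_default_tokenizer()
rxn = 'CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F'
# Features follow the BERT input convention, truncated/padded to max_seq_length.
features = convert_reaction_to_valid_features(rxn, tokenizer, max_seq_length=512)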

convert_reaction_to_valid_features_batch[source]

convert_reaction_to_valid_features_batch(reaction_list:List[str], tokenizer:SmilesTokenizer)

Convert list of reaction SMILES into batch of input features.
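The batch variant applies the same conversion to a list of reaction SMILES; a sketch under the same assumptions (the second reaction is an arbitrary esterification example):

from rxnfp.tokenization import get_default_tokenizer, convert_reaction_to_valid_features_batch

tokenizer = get_default_tokenizer()
reactions = [
    'CC(C)S.Fc1cccnc1F>>CC(C)Sc1ncccc1F',
    'CCO.CC(=O)O>>CCOC(C)=O',
]
batch_features = convert_reaction_to_valid_features_batch(reactions, tokenizer)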

Smiles Utils

class NotCanonicalizableSmilesException[source]

NotCanonicalizableSmilesException() :: ValueError

Raised when a SMILES string cannot be canonicalized (subclass of ValueError).

canonicalize_smi[source]

canonicalize_smi(smi, remove_atom_mapping=False)

Canonicalize SMILES
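A short sketch: canonicalization via RDKit, optionally stripping atom maps; the assumption (suggested by the exception defined above) is that invalid SMILES raise NotCanonicalizableSmilesException.

from rxnfp.tokenization import canonicalize_smi, NotCanonicalizableSmilesException

assert canonicalize_smi('[CH3:1][CH:2]([CH3:3])[SH:4]', remove_atom_mapping=True) == 'CC(C)S'

try:
    canonicalize_smi('not_a_smiles')
except NotCanonicalizableSmilesException:
    print('could not canonicalize')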

process_reaction[source]

process_reaction(rxn)

Process and canonicalize reaction SMILES

Examples

process_reaction removes the atom-mapping, canonicalizes the individual molecules, and merges reactants and reagents on the precursor side of the reaction SMILES.

from rdkit.Chem import AllChem
from rxnfp.tokenization import process_reaction

atom_mapped_rxn = 'F[c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11].[CH3:1][CH:2]([CH3:3])[SH:4]>CN(C)C=O.O=C([O-])[O-].[K+].[K+]>[CH3:1][CH:2]([CH3:3])[S:4][c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11]'
canonical_rxn = 'CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F'
tokenized_rxn = 'C C ( C ) S . C N ( C ) C = O . F c 1 c c c n c 1 F . O = C ( [O-] ) [O-] . [K+] . [K+] >> C C ( C ) S c 1 n c c c c 1 F'

# Renders the atom-mapped reaction when run in a notebook.
AllChem.ReactionFromSmarts(atom_mapped_rxn, useSmiles=True)
assert canonical_rxn == process_reaction(atom_mapped_rxn)
# Renders the canonicalized reaction.
AllChem.ReactionFromSmarts(canonical_rxn, useSmiles=True)

SmilesTokenizer (through the underlying BasicSmilesTokenizer) splits the reaction SMILES into tokens.

from rxnfp.tokenization import get_default_tokenizer, SmilesTokenizer, BasicSmilesTokenizer

tokenizer = get_default_tokenizer()
assert isinstance(tokenizer, SmilesTokenizer)

basic_tokenizer = BasicSmilesTokenizer()
assert tokenized_rxn == ' '.join(basic_tokenizer.tokenize(canonical_rxn))
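Because SmilesTokenizer is a BertTokenizer, the usual encoding methods are available as well; a sketch continuing from the snippet above (the exact ids depend on the packaged vocabulary):

ids = tokenizer.encode(canonical_rxn)  # adds [CLS]/[SEP] like any BertTokenizer
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens[0], tokens[-1])  # '[CLS]' '[SEP]'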