Tokenization and SMILES utils
Reaction SMILES tokenizer
We use the same tokenizer as the one used by http://rxnmapper.ai.
It is an instance of the Hugging Face Transformers `BertTokenizer`.
Examples
process_reaction
removes the atom mapping, canonicalizes the molecules, and combines the reactants and reagents into the precursors
```python
from rdkit.Chem import AllChem

atom_mapped_rxn = 'F[c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11].[CH3:1][CH:2]([CH3:3])[SH:4]>CN(C)C=O.O=C([O-])[O-].[K+].[K+]>[CH3:1][CH:2]([CH3:3])[S:4][c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11]'
canonical_rxn = "CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F"
tokenized_rxn = 'C C ( C ) S . C N ( C ) C = O . F c 1 c c c n c 1 F . O = C ( [O-] ) [O-] . [K+] . [K+] >> C C ( C ) S c 1 n c c c c 1 F'

# Display the atom-mapped reaction (e.g. in a notebook)
AllChem.ReactionFromSmarts(atom_mapped_rxn, useSmiles=True)

assert canonical_rxn == process_reaction(atom_mapped_rxn)

# Display the processed, canonical reaction
AllChem.ReactionFromSmarts(canonical_rxn, useSmiles=True)
```
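This processing can be approximated with RDKit alone. The sketch below is an illustration, not the library's implementation: it assumes that reagents are merged into the precursors and that fragments are sorted alphabetically, and the names `strip_atom_maps` and `process_reaction_sketch` are hypothetical.

```python
from rdkit import Chem


def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers from a molecule SMILES and canonicalize it."""
    mol = Chem.MolFromSmiles(smiles)
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)  # 0 means "no atom map"
    return Chem.MolToSmiles(mol)


def process_reaction_sketch(rxn_smiles: str) -> str:
    """Approximation of process_reaction: drop the atom mapping,
    canonicalize every molecule, merge the reagents into the precursors,
    and sort the fragments alphabetically."""
    reactants, reagents, products = rxn_smiles.split(">")
    precursors = [
        strip_atom_maps(s)
        for part in (reactants, reagents) if part
        for s in part.split(".")
    ]
    prods = [strip_atom_maps(s) for s in products.split(".")]
    return "{}>>{}".format(".".join(sorted(precursors)), ".".join(sorted(prods)))
```

On the atom-mapped reaction above, this sketch reproduces `canonical_rxn`.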
SmilesTokenizer
splits the reaction SMILES into tokens.
```python
tokenizer = get_default_tokenizer()
assert isinstance(tokenizer, SmilesTokenizer)

basic_tokenizer = RegexTokenizer()
assert tokenized_rxn == ' '.join(basic_tokenizer.tokenize(canonical_rxn))

assert len(tokenizer.vocab_list) == 591
```
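The splitting performed by `RegexTokenizer` can be reproduced with a single regular expression. The pattern below is the widely used SMILES token regex (assumed here to behave like the library's), and `tokenize` is an illustrative helper, not the package API:

```python
import re

# Widely used SMILES token regex: bracket atoms, two-letter halogens (Cl, Br),
# organic-subset and aromatic atoms, bonds, ring closures, and the '>>' arrow.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:"
    r"|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> list:
    """Split a (reaction) SMILES string into individual tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


rxn = "CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F"
print(" ".join(tokenize(rxn)))
# C C ( C ) S . C N ( C ) C = O . F c 1 c c c n c 1 F . O = C ( [O-] ) [O-] . [K+] . [K+] >> C C ( C ) S c 1 n c c c c 1 F
```

Note the alternation order: `Br?` and `Cl?` come before the single-letter atoms so that two-letter halogens are matched greedily, and bracket atoms such as `[O-]` or `[K+]` are kept as single tokens.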