Tokenization and SMILES utils

Reaction SMILES tokenizer

We use the same tokenizer as the one used by the RXNMapper (http://rxnmapper.ai).

It is an instance of the Hugging Face Transformers BertTokenizer.

get_default_tokenizer[source]

get_default_tokenizer()
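
A minimal usage sketch (names as defined in this module; tokenize and convert_tokens_to_ids are inherited from BertTokenizer):

tokenizer = get_default_tokenizer()
# tokenize() splits a SMILES string into chemically meaningful tokens
tokens = tokenizer.tokenize('CC(C)S')  # ['C', 'C', '(', 'C', ')', 'S']
# convert_tokens_to_ids() maps the tokens into the model vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)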

class SmilesTokenizer[source]

SmilesTokenizer(vocab_file:str, unk_token:str='[UNK]', sep_token:str='[SEP]', pad_token:str='[PAD]', cls_token:str='[CLS]', mask_token:str='[MASK]', do_lower_case=False, **kwargs) :: BertTokenizer

Constructs a SmilesTokenizer. Adapted from https://github.com/huggingface/transformers and https://github.com/rxn4chemistry/rxnfp.

Args: vocab_file: path to a vocabulary file with one token per line.
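
If you have trained a model with a different vocabulary, a tokenizer can be constructed directly from that file. A minimal sketch, where 'my_vocab.txt' is a hypothetical vocabulary file:

# 'my_vocab.txt' is a hypothetical file with one token per line,
# including the special tokens ([PAD], [UNK], [CLS], [SEP], [MASK])
custom_tokenizer = SmilesTokenizer('my_vocab.txt')
tokens = custom_tokenizer.tokenize('CC(C)S.Fc1cccnc1F>>CC(C)Sc1ncccc1F')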

class RegexTokenizer[source]

RegexTokenizer(regex_pattern:str=`'(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])'`)

Run regex tokenization
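
For example, the default pattern keeps bracket atoms, two-letter halogens, and the reaction arrow together as single tokens:

basic_tokenizer = RegexTokenizer()
# bracket atoms ([CH3:1]), Cl/Br and '>>' each stay a single token
assert basic_tokenizer.tokenize('[CH3:1]Cl>>[CH3:1]Br') == ['[CH3:1]', 'Cl', '>>', '[CH3:1]', 'Br']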

Smiles Utils

class NotCanonicalizableSmilesException[source]

NotCanonicalizableSmilesException() :: ValueError

Raised when a SMILES string cannot be canonicalized.

canonicalize_smi[source]

canonicalize_smi(smi, remove_atom_mapping=False)

Canonicalize SMILES
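
A short sketch of both behaviors, assuming (as the exception above suggests) that an invalid SMILES raises NotCanonicalizableSmilesException:

# with remove_atom_mapping=True, the atom-map numbers are stripped first
assert canonicalize_smi('[CH3:1][OH:2]', remove_atom_mapping=True) == 'CO'

# a string RDKit cannot parse is not canonicalizable
try:
    canonicalize_smi('not-a-smiles')
except NotCanonicalizableSmilesException:
    pass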

process_reaction[source]

process_reaction(rxn)

Process and canonicalize reaction SMILES

Examples

process_reaction removes the atom mapping, canonicalizes each molecule, and merges reactants and reagents into the precursor side of the reaction SMILES.

from rdkit.Chem import AllChem

# atom-mapped reaction SMILES (reactants>reagents>products)
atom_mapped_rxn = 'F[c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11].[CH3:1][CH:2]([CH3:3])[SH:4]>CN(C)C=O.O=C([O-])[O-].[K+].[K+]>[CH3:1][CH:2]([CH3:3])[S:4][c:5]1[n:6][cH:7][cH:8][cH:9][c:10]1[F:11]'
canonical_rxn = "CC(C)S.CN(C)C=O.Fc1cccnc1F.O=C([O-])[O-].[K+].[K+]>>CC(C)Sc1ncccc1F"
tokenized_rxn = 'C C ( C ) S . C N ( C ) C = O . F c 1 c c c n c 1 F . O = C ( [O-] ) [O-] . [K+] . [K+] >> C C ( C ) S c 1 n c c c c 1 F'

AllChem.ReactionFromSmarts(atom_mapped_rxn, useSmiles=True)  # display the mapped reaction
assert canonical_rxn == process_reaction(atom_mapped_rxn)
AllChem.ReactionFromSmarts(canonical_rxn, useSmiles=True)  # display the processed reaction

SmilesTokenizer splits the reaction SMILES into tokens.

tokenizer = get_default_tokenizer()
assert isinstance(tokenizer, SmilesTokenizer)

# the underlying regex tokenization matches the standalone RegexTokenizer
basic_tokenizer = RegexTokenizer()
assert tokenized_rxn == ' '.join(basic_tokenizer.tokenize(canonical_rxn))

# the default vocabulary contains 591 tokens
assert len(tokenizer.vocab_list) == 591
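
Since SmilesTokenizer subclasses BertTokenizer, the usual Hugging Face encoding API is available as well (a sketch; the exact ids depend on the vocabulary file):

# encode() wraps the token ids in [CLS] ... [SEP]
ids = tokenizer.encode('CC(C)S')
assert tokenizer.decode(ids).startswith('[CLS]')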