Generate RXN fingerprints for the 50k Schneider et al. data set
import pandas as pd
import numpy as np
from rxnfp.transformer_fingerprints import (
    RXNBERTFingerprintGenerator, get_default_model_and_tokenizer, generate_fingerprints
)

Load data

df = pd.read_csv('../data/schneider50k.tsv', sep='\t')
df.head()
   Unnamed: 0  original_rxn                                       rxn_class  source      rxn                                                split
0  0           [CH3:17][S:14](=[O:15])(=[O:16])[N:11]1[CH2:10...  6.1.5      US06887874  C1CCCCC1.CCO.CS(=O)(=O)N1CCN(Cc2ccccc2)CC1.[OH...  test
1  1           O.O.[Na+].[CH3:1][c:2]1[cH:7][c:6]([N+:8](=O)[...  7.1.1      US07056926  CCOC(C)=O.Cc1cc([N+](=O)[O-])ccc1NC(=O)c1ccccc...  test
2  2           [CH3:1][O:2][c:3]1[cH:4][cH:5][c:6](-[c:9]2[cH...  1.8.5      US08492378  COc1ccc(-c2coc3ccc(-c4nnc(S)o4)cc23)cc1.COc1cc...  test
3  3           Cl.[CH3:43][CH2:42][S:44](=[O:45])(=[O:46])Cl....  2.2.3      US08592454  CCS(=O)(=O)Cl.CN(C(=O)N(C)[C@@H]1CN(C(=O)C2CCN...  train
4  4           [CH3:25][O:24][c:21]1[cH:22][cH:23][c:17]([O:1...  1.3.7      US06716851  COc1ccc(OC)c(N)c1.Cc1cc(Cl)nc(-c2ccccn2)n1>>CO...  test

ft_10k model

This model was fine-tuned on 10k reactions from the Schneider data set.

Generate and save the fingerprints.

model, tokenizer = get_default_model_and_tokenizer('bert_ft_10k_25s')
ft_10k_rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)

fps_ft_10k = generate_fingerprints(df.rxn.values.tolist(), ft_10k_rxnfp_generator, batch_size=8)
np.savez_compressed('../data/fps_ft_10k', fps=fps_ft_10k)
fps_ft_10k.shape
100%|██████████| 6250/6250 [02:52<00:00, 36.31it/s]
(50000, 256)
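
The progress bar above counts batches, not reactions: 50,000 reactions at batch_size=8 give 6,250 batches. A minimal sketch of that arithmetic follows. The `batched` helper is hypothetical and only illustrates fixed-size chunking; rxnfp's internal batching may be implemented differently.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements.

    Hypothetical helper for illustration only; not part of rxnfp.
    """
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_reactions = 50_000
batch_size = 8

# Number of batches needed to cover every reaction
n_batches = math.ceil(n_reactions / batch_size)

# Placeholder strings stand in for the reaction SMILES in df.rxn
placeholder_rxns = [f"rxn_{i}" for i in range(n_reactions)]
batches = list(batched(placeholder_rxns, batch_size))
```

A larger batch_size means fewer iterations but more memory per step, so the best value likely depends on the available hardware.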

How to load the fingerprints:

fps_ft_10k = np.load('../data/fps_ft_10k.npz')['fps']
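
Note that np.savez_compressed appends the .npz extension when the file name does not already end in it, which is why the array is saved to '../data/fps_ft_10k' but loaded from '../data/fps_ft_10k.npz'. A minimal round-trip sketch, using a small random array and a temporary directory as placeholders for the real fingerprints and data path:

```python
import os
import tempfile

import numpy as np

# Small random stand-in for a fingerprint array (4 reactions x 256 dimensions)
rng = np.random.default_rng(0)
fake_fps = rng.random((4, 256)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = os.path.join(tmp_dir, "fps_demo")  # no extension given
    np.savez_compressed(base_path, fps=fake_fps)

    saved_path = base_path + ".npz"  # NumPy added the extension
    file_exists = os.path.exists(saved_path)

    # The keyword used when saving ('fps') is the key used when loading
    with np.load(saved_path) as archive:
        loaded_fps = archive["fps"]
```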

pretrained model

This model was pretrained on reaction data only, with no fine-tuning.

model, tokenizer = get_default_model_and_tokenizer('bert_pretrained')
pretrained_rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)
fps_pretrained = generate_fingerprints(df.rxn.values.tolist(), pretrained_rxnfp_generator, batch_size=8)
np.savez_compressed('../data/fps_pretrained', fps=fps_pretrained)
fps_pretrained.shape
100%|██████████| 6250/6250 [02:53<00:00, 35.93it/s]
(50000, 256)

ft model

model, tokenizer = get_default_model_and_tokenizer('bert_ft')
ft_rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)
fps_ft = generate_fingerprints(df.rxn.values.tolist(), ft_rxnfp_generator, batch_size=8)
np.savez_compressed('../data/fps_ft', fps=fps_ft)
fps_ft.shape
100%|██████████| 6250/6250 [00:56<00:00, 111.34it/s]
(50000, 256)
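
Each array holds one 256-dimensional row per reaction. Assuming generate_fingerprints preserves the order of its input list, row i of an array corresponds to row i of df, so the split column can be used to select train and test fingerprints. A sketch with toy stand-ins: on the real data, the labels would be df.split.values and the array would be one of the fps_* arrays.

```python
import numpy as np

# Toy labels mirroring the five rows shown by df.head()
split_labels = np.array(["test", "test", "test", "train", "test"])

# Toy fingerprint matrix: 5 reactions x 256 dimensions
toy_fps = np.arange(5 * 256, dtype=np.float32).reshape(5, 256)

# A boolean mask picks the rows that belong to each split
train_mask = split_labels == "train"
train_fps = toy_fps[train_mask]
test_fps = toy_fps[~train_mask]
```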