Getting the data files

The USPTO yield data files are too large to be uploaded to this GitHub repo; download them from this Box folder and place them in ../data/uspto. The data was extracted from the data set published by Lowe, "Chemical reactions from US patents (1976-Sep2016)", which has catalyzed research on ML for chemical reactions in recent years.
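Before running the cells below, it can help to confirm that the downloaded files are where the notebook expects them. The following is a minimal sketch: the helper name `missing_data_files` is hypothetical, and the file names are simply the ones read later in this notebook.

```python
from pathlib import Path

# File names as read later in this notebook.
EXPECTED_FILES = [
    "yields_above_200622.tsv",
    "yields_below_200622.tsv",
    "yields_above_smooth_200622.tsv",
    "yields_below_smooth_200622.tsv",
]


def missing_data_files(data_dir="../data/uspto", names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


missing = missing_data_files()
if missing:
    print("Missing files in ../data/uspto:", ", ".join(missing))
```

An empty list means all four files were found; anything else lists what still has to be downloaded.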

USPTO data - mass scales

We observed that milligram-scale and gram-scale reactions have different yield distributions. Gram-scale reactions are often optimized, whereas milligram-scale reactions usually are not, so their yields tend to be lower. In academia, predicting the non-optimized yield is more relevant. Therefore, we split the USPTO reactions that contained yield and product mass information into two categories (gram, milligram) at a threshold of 1 gram.
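The split itself is not part of this notebook, since the files loaded below are already separated. As a rough illustration of the idea only, the sketch below splits a toy table at 1 gram. The column name `product_mass_g` and the helper `split_by_mass` are hypothetical, and placing reactions of exactly 1 gram in the gram set is an assumption.

```python
import pandas as pd


def split_by_mass(df, mass_col="product_mass_g", threshold_g=1.0):
    """Split reactions into gram-scale and milligram-scale subsets.

    mass_col is a hypothetical column holding the product mass in grams.
    Rows at or above the threshold go to the gram set (an assumption).
    """
    # Keep only reactions that report a product mass.
    df = df.dropna(subset=[mass_col])
    is_gram = df[mass_col] >= threshold_g
    return df[is_gram], df[~is_gram]


# Toy data, not taken from USPTO.
toy = pd.DataFrame(
    {
        "product_mass_g": [0.05, 0.8, 1.0, 12.5],
        "calc_yield": [35.0, 60.0, 78.0, 91.0],
    }
)
gram_toy, milligram_toy = split_by_mass(toy)
print(len(gram_toy), len(milligram_toy))
```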

import pandas as pd
import matplotlib.pyplot as plt

# Gram-scale reactions (the "above" file)
gram_df = pd.read_csv('../data/uspto/yields_above_200622.tsv', sep='\t', index_col=0)
gram_df['calc_yield'] = gram_df.calc_yield_above

# Milligram-scale reactions (the "below" file)
milligram_df = pd.read_csv('../data/uspto/yields_below_200622.tsv', sep='\t', index_col=0)
milligram_df['calc_yield'] = milligram_df.calc_yield_below

# title_fontsize, label_fontsize and default_plot_kwargs are not defined in this section;
# they are assumed to come from an earlier setup cell.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
fig.suptitle('Yield distributions - USPTO', fontsize=title_fontsize)
ax1.set_ylabel('Count', fontsize=label_fontsize)
ax1.set_xlabel('Yields [%] - gram scale', fontsize=label_fontsize)
gram_df.calc_yield.hist(bins=20, ax=ax1, **default_plot_kwargs)
ax2.set_xlabel('Yields [%] - milligram scale', fontsize=label_fontsize)
milligram_df.calc_yield.hist(bins=20, ax=ax2, **default_plot_kwargs)

pd.concat([gram_df['calc_yield'].describe(),milligram_df['calc_yield'].describe()], axis=1)

Prepare the data

milligram_df['rxn'] = milligram_df.tokenized_reactants_inputs.str.replace(' ', '') + '>>' + milligram_df.tokenized_products_inputs.str.replace(' ', '')
milligram_df['scaled_yield'] = milligram_df.calc_yield / 100
milligram_train_random_split_df = milligram_df[milligram_df.random_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
milligram_test_random_split_df = milligram_df[milligram_df.random_split=='test'][['rxn', 'scaled_yield']]
milligram_train_time_split_df = milligram_df[milligram_df.time_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
milligram_test_time_split_df = milligram_df[milligram_df.time_split=='test'][['rxn', 'scaled_yield']]

Gram scale

gram_df['rxn'] = gram_df.tokenized_reactants_inputs.str.replace(' ', '') + '>>' + gram_df.tokenized_products_inputs.str.replace(' ', '')
gram_df['scaled_yield'] = gram_df.calc_yield / 100
gram_train_random_split_df = gram_df[gram_df.random_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
gram_test_random_split_df = gram_df[gram_df.random_split=='test'][['rxn', 'scaled_yield']]
gram_train_time_split_df = gram_df[gram_df.time_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
gram_test_time_split_df = gram_df[gram_df.time_split=='test'][['rxn', 'scaled_yield']]
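The milligram and gram cells above repeat the same steps: build the reaction SMILES from the tokenized inputs, scale the yield to [0, 1], then select the train or test rows and shuffle the training set. A minimal sketch of how this could be factored into helpers follows. The function names `prepare` and `make_split` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd


def prepare(df):
    """Add the 'rxn' and 'scaled_yield' columns used above (returns a copy)."""
    df = df.copy()
    reactants = df["tokenized_reactants_inputs"].str.replace(" ", "", regex=False)
    products = df["tokenized_products_inputs"].str.replace(" ", "", regex=False)
    df["rxn"] = reactants + ">>" + products
    df["scaled_yield"] = df["calc_yield"] / 100
    return df


def make_split(df, split_col, subset, shuffle=False, seed=42):
    """Return the 'rxn' and 'scaled_yield' columns for one subset of a split."""
    out = df[df[split_col] == subset][["rxn", "scaled_yield"]]
    if shuffle:
        # Shuffle with a fixed seed, as done for the training sets above.
        out = out.sample(frac=1.0, random_state=seed)
    return out


# Toy rows, not taken from USPTO.
toy = pd.DataFrame(
    {
        "tokenized_reactants_inputs": ["C C O . C C ( = O ) O", "C O"],
        "tokenized_products_inputs": ["C C O C ( C ) = O", "C O C"],
        "calc_yield": [80.0, 45.0],
        "random_split": ["train", "test"],
    }
)
toy = prepare(toy)
train = make_split(toy, "random_split", "train", shuffle=True)
test = make_split(toy, "random_split", "test")
```

With these helpers, the same two calls would serve both `random_split` and `time_split` on either data frame.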

USPTO - smoothed

# Reload the raw tables and overwrite calc_yield with the smoothed yields (column '4NN-2').
# Assigning .values copies by position, so both files must list the reactions in the same order.
gram_df = pd.read_csv('../data/uspto/yields_above_200622.tsv', sep='\t', index_col=0)
gram_smoothed_df = pd.read_csv('../data/uspto/yields_above_smooth_200622.tsv', sep='\t', index_col=0)
gram_df['calc_yield'] = gram_smoothed_df['4NN-2'].values

milligram_df = pd.read_csv('../data/uspto/yields_below_200622.tsv', sep='\t', index_col=0)
milligram_smoothed_df = pd.read_csv('../data/uspto/yields_below_smooth_200622.tsv', sep='\t', index_col=0)
milligram_df['calc_yield'] = milligram_smoothed_df['4NN-2'].values
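Because the smoothed yields are copied with `.values`, pandas does not align the rows by index; it copies them by position. A small guard like the one sketched below could catch a mismatch between the raw and smoothed files. The helper name `assign_smoothed_yield` is hypothetical, and the check assumes both files were written with the same index in the same order, which the source does not confirm.

```python
import pandas as pd


def assign_smoothed_yield(df, smoothed_df, column="4NN-2"):
    """Copy a smoothed yield column into df['calc_yield'] after checking alignment."""
    # Assumption: raw and smoothed tables share the same index, in the same order.
    if not df.index.equals(smoothed_df.index):
        raise ValueError("raw and smoothed tables do not share the same index order")
    out = df.copy()
    out["calc_yield"] = smoothed_df[column].values
    return out


# Toy tables, not taken from USPTO.
raw = pd.DataFrame({"calc_yield": [10.0, 90.0]}, index=[0, 1])
smooth = pd.DataFrame({"4NN-2": [30.0, 70.0]}, index=[0, 1])
result = assign_smoothed_yield(raw, smooth)
```

The function returns a copy, so the raw yields remain available for comparison with the smoothed ones.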

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
fig.suptitle('Yield distributions - USPTO smoothed 4NN-2', fontsize=title_fontsize)
ax1.set_ylabel('Count', fontsize=label_fontsize)
ax1.set_xlabel('Yields [%] - gram scale', fontsize=label_fontsize)
gram_df.calc_yield.hist(bins=20, ax=ax1, **default_plot_kwargs)
ax2.set_xlabel('Yields [%] - milligram scale', fontsize=label_fontsize)
milligram_df.calc_yield.hist(bins=20, ax=ax2, **default_plot_kwargs)

       calc_yield (gram)  calc_yield (milligram)
count      197619.000000           302040.000000
mean           73.232436               56.841559
std            20.942848               26.642160
min             0.000000                0.000000
25%            60.200000               35.500000
50%            78.000000               58.900000
75%            90.300000               79.500000
max           100.000000              100.000000

USPTO data Exploration
