Getting the data files
The USPTO yield data files are too large to upload to this GitHub repo. Download them from this Box folder and place them in ../data/uspto. The data was extracted from the data set published by Lowe, Chemical reactions from US patents (1976-Sep2016), which has catalyzed research on ML for chemical reactions in recent years.
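Before running the cells below, it can help to confirm that the download landed in the right place. The following is a minimal sketch, not part of the original notebook. The file names are the ones read later in this notebook, and the helper `missing_data_files` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

# File names as they are read later in this notebook.
EXPECTED_FILES = [
    "yields_above_200622.tsv",
    "yields_below_200622.tsv",
    "yields_above_smooth_200622.tsv",
    "yields_below_smooth_200622.tsv",
]


def missing_data_files(data_dir="../data/uspto"):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


# Hypothetical helper: report anything that still needs to be downloaded.
missing = missing_data_files()
if missing:
    print("Missing data files:", ", ".join(missing))
```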
USPTO data - mass scales
We observed that milligram-scale and gram-scale reactions have different yield distributions. Gram-scale reactions are often optimized, whereas milligram-scale reactions typically are not, so their yields tend to be lower. In academia, predicting the non-optimized yield is more relevant. Therefore, we split the USPTO reactions that contained yield and product mass information into two categories (gram, milligram) at a threshold of 1 gram.
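The split itself was done upstream of this notebook, and the files loaded below already contain its result. As an illustration only, a threshold split of this kind might look like the sketch below. The column name `product_mass_g` is hypothetical, and assigning reactions at exactly 1 gram to the gram category is an assumption, since the text does not say which side the boundary falls on.

```python
import pandas as pd


def split_by_product_mass(df, mass_col="product_mass_g", threshold_g=1.0):
    """Split reactions into (gram, milligram) frames at threshold_g.

    mass_col is a hypothetical column holding the product mass in grams.
    Assumption: reactions at exactly the threshold count as gram scale.
    Rows without a product mass are dropped from both frames.
    """
    has_mass = df[mass_col].notna()
    is_gram = df[mass_col] >= threshold_g
    return df[has_mass & is_gram], df[has_mass & ~is_gram]


# Toy data, not taken from the USPTO files.
toy_df = pd.DataFrame({
    "product_mass_g": [0.05, 2.3, 1.0, None],
    "calc_yield": [42.0, 88.0, 75.0, 60.0],
})
gram_part, milligram_part = split_by_product_mass(toy_df)
```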
gram_df = pd.read_csv('../data/uspto/yields_above_200622.tsv', sep='\t', index_col=0)
gram_df['calc_yield'] = gram_df.calc_yield_above
milligram_df = pd.read_csv('../data/uspto/yields_below_200622.tsv', sep='\t', index_col=0)
milligram_df['calc_yield'] = milligram_df.calc_yield_below
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
fig.suptitle('Yield distributions - USPTO', fontsize=title_fontsize)
ax1.set_ylabel('Count', fontsize=label_fontsize)
ax1.set_xlabel('Yields [%] - gram scale', fontsize=label_fontsize)
gram_df.calc_yield.hist(bins=20, ax=ax1, **default_plot_kwargs)
ax2.set_xlabel('Yields [%] - milligram scale', fontsize=label_fontsize)
milligram_df.calc_yield.hist(bins=20, ax=ax2, **default_plot_kwargs)
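# Label the two summary columns; otherwise both would be named 'calc_yield'
pd.concat([gram_df['calc_yield'].describe(), milligram_df['calc_yield'].describe()], axis=1, keys=['gram', 'milligram'])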
milligram_df['rxn'] = milligram_df.tokenized_reactants_inputs.str.replace(' ', '') + '>>' + milligram_df.tokenized_products_inputs.str.replace(' ', '')
milligram_df['scaled_yield'] = milligram_df.calc_yield / 100
milligram_train_random_split_df = milligram_df[milligram_df.random_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
milligram_test_random_split_df = milligram_df[milligram_df.random_split=='test'][['rxn', 'scaled_yield']]
milligram_train_time_split_df = milligram_df[milligram_df.time_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
milligram_test_time_split_df = milligram_df[milligram_df.time_split=='test'][['rxn', 'scaled_yield']]
gram_df['rxn'] = gram_df.tokenized_reactants_inputs.str.replace(' ', '') + '>>' + gram_df.tokenized_products_inputs.str.replace(' ', '')
gram_df['scaled_yield'] = gram_df.calc_yield / 100
gram_train_random_split_df = gram_df[gram_df.random_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
gram_test_random_split_df = gram_df[gram_df.random_split=='test'][['rxn', 'scaled_yield']]
gram_train_time_split_df = gram_df[gram_df.time_split=='train'][['rxn', 'scaled_yield']].sample(frac=1., random_state=42)
gram_test_time_split_df = gram_df[gram_df.time_split=='test'][['rxn', 'scaled_yield']]
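The eight split frames above all follow one pattern: select the rows for a split, keep `rxn` and `scaled_yield`, and shuffle only the training rows with a fixed seed. A minimal sketch of a helper that captures this pattern is shown below. `make_split` is a hypothetical name, and the toy frame merely stands in for `gram_df` or `milligram_df`.

```python
import pandas as pd


def make_split(df, split_col, seed=42):
    """Return (train, test) frames with 'rxn' and 'scaled_yield' columns.

    The training rows are shuffled reproducibly; the test rows keep their order.
    """
    cols = ["rxn", "scaled_yield"]
    train = df.loc[df[split_col] == "train", cols].sample(frac=1.0, random_state=seed)
    test = df.loc[df[split_col] == "test", cols]
    return train, test


# Toy frame standing in for gram_df / milligram_df (values are made up).
toy_df = pd.DataFrame({
    "rxn": ["CC>>CO", "CO>>C=O", "C=O>>OC=O", "CN>>CNC"],
    "scaled_yield": [0.5, 0.8, 0.3, 0.9],
    "random_split": ["train", "train", "test", "train"],
    "time_split": ["train", "test", "test", "train"],
})
train_random, test_random = make_split(toy_df, "random_split")
train_time, test_time = make_split(toy_df, "time_split")
```

With such a helper, each pair of frames above could be produced by one call, for example `make_split(gram_df, 'time_split')`.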
gram_df = pd.read_csv('../data/uspto/yields_above_200622.tsv', sep='\t', index_col=0)
gram_smoothed_df = pd.read_csv('../data/uspto/yields_above_smooth_200622.tsv', sep='\t', index_col=0)
gram_df['calc_yield'] = gram_smoothed_df['4NN-2'].values
milligram_df = pd.read_csv('../data/uspto/yields_below_200622.tsv', sep='\t', index_col=0)
milligram_smoothed_df = pd.read_csv('../data/uspto/yields_below_smooth_200622.tsv', sep='\t', index_col=0)
milligram_df['calc_yield'] = milligram_smoothed_df['4NN-2'].values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))
fig.suptitle('Yield distributions - USPTO smoothed 4NN-2', fontsize=title_fontsize)
ax1.set_ylabel('Count', fontsize=label_fontsize)
ax1.set_xlabel('Yields [%] - gram scale', fontsize=label_fontsize)
gram_df.calc_yield.hist(bins=20, ax=ax1, **default_plot_kwargs)
ax2.set_xlabel('Yields [%] - milligram scale', fontsize=label_fontsize)
milligram_df.calc_yield.hist(bins=20, ax=ax2, **default_plot_kwargs)