Available tools
BERT and related transformer models have revolutionised Natural Language Processing. Implementations of such models are conveniently available through the Huggingface Transformers library. We already based our previous work on reaction fingerprints / classification and atom-mapping on this library. To train the yield regression models in this work, we used SimpleTransformers.ai, which contains everything needed to add fine-tuning heads on top of transformers, and to run trainings and evaluations.
SmilesTokenizer
One key difference between human languages and chemistry lies in the tokens and tokenizers. In this work, we use the tokenizer introduced in our previous rxnfp work, which uses the same regex as the Molecular Transformer.
from rxnfp.tokenization import get_default_tokenizer, SmilesTokenizer
from rdkit.Chem import rdChemReactions
smiles_tokenizer = get_default_tokenizer()
reaction_smiles = 'CC(C)[C@@H](C)CCBr.[Na]C#N>>CC([C@@H](C)CCC#N)C'
rxn = rdChemReactions.ReactionFromSmarts(reaction_smiles,useSmiles=True)
print(smiles_tokenizer.tokenize(reaction_smiles))
rxn
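For illustration, the splitting performed by the tokenizer essentially comes down to a single regular expression. The sketch below quotes the pattern from the Molecular Transformer; the SmilesTokenizer in rxnfp builds on the same pattern but additionally handles the vocabulary and special tokens.
import re
# Token pattern from the Molecular Transformer: bracketed atoms, two-letter halogens,
# bonds, branches, ring closures and the reaction arrow each become single tokens.
SMILES_REGEX = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
def split_smiles(smiles):
    """Split a (reaction) SMILES string into tokens using the regex above."""
    return re.findall(SMILES_REGEX, smiles)
print(split_smiles(reaction_smiles))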
As the tokenizer is normally hard-coded in the SimpleTransformers library, we need to change it. We therefore create a SmilesClassificationModel class, as seen in the core module.
MODEL_CLASSES = {
"bert": (BertConfig, BertForSequenceClassification, SmilesTokenizer),
}
Once this is done, the SimpleTransformers library can be used as usual.
Pretrained reaction BERT models
There are currently two reaction BERT models in the rxnfp library - pretrained (trained on a reaction MLM task) and ft (additionally trained on a reaction classification task). For this example, we will use the pretrained model as the starting point for the training of our Yield-BERT. On the Buchwald-Hartwig reactions, both base models performed similarly.
import pkg_resources
import torch
from rxnfp.models import SmilesClassificationModel
model_path = pkg_resources.resource_filename(
"rxnfp",
f"models/transformers/bert_pretrained" # change pretrained to ft to start from the other base model
)
yield_bert = SmilesClassificationModel('bert', model_path, use_cuda=torch.cuda.is_available())
import pandas as pd
from rxn_yields.data import generate_buchwald_hartwig_rxns
df = pd.read_excel('../data/Buchwald-Hartwig/Dreher_and_Doyle_input_data.xlsx', sheet_name='FullCV_01')
df['rxn'] = generate_buchwald_hartwig_rxns(df)  # construct reaction SMILES from the individual components
train_df = df.iloc[:2767][['rxn', 'Output']]  # first 2767 reactions for training
test_df = df.iloc[2767:][['rxn', 'Output']]  # remaining reactions held out for testing
train_df.columns = ['text', 'labels']
test_df.columns = ['text', 'labels']
# Standardise the yields using the training set statistics only;
# the same mean/std are used later to convert predictions back to yields.
mean = train_df.labels.mean()
std = train_df.labels.std()
train_df['labels'] = (train_df['labels'] - mean) / std
test_df['labels'] = (test_df['labels'] - mean) / std
train_df.head()
Hyperparameter tuning
Most of the hyperparameters are already fixed by the base model. Here we decided to tune only the dropout probability and the learning rate. SimpleTransformers has wandb nicely integrated. An example of how to set up a hyperparameter sweep can be found in the training scripts, where the wandb parameters are read using dotenv; a sketch of the idea is shown below.
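As a rough illustration (a sketch only: the project name, the metric name and the parameter ranges below are placeholders, and the actual sweep configuration lives in the training scripts), such a sweep could be registered like this:
import wandb
from dotenv import load_dotenv
load_dotenv()  # reads WANDB_API_KEY and related settings from a local .env file
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'r2', 'goal': 'maximize'},  # placeholder metric name
    'parameters': {
        'learning_rate': {'min': 1e-6, 'max': 1e-3},
        'hidden_dropout_prob': {'min': 0.05, 'max': 0.8},
    },
}
sweep_id = wandb.sweep(sweep_config, project='yield-bert-sweeps')  # hypothetical project name
# wandb.agent(sweep_id, function=train_fn)  # train_fn would build and train a SmilesClassificationModel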
Training
As can also be seen from the training scripts, once the data is in the right shape a training run can be started within a few lines of code. For this example we will go with the following parameters, {dropout=0.7987, learning_rate=0.00009659}, and launch a training. We first have to reinitialise the BERT model with the correct parameters.
model_args = {
'num_train_epochs': 15, 'overwrite_output_dir': True,
'learning_rate': 0.00009659, 'gradient_accumulation_steps': 1,
'regression': True, "num_labels":1, "fp16": False,
"evaluate_during_training": False, 'manual_seed': 42,
"max_seq_length": 300, "train_batch_size": 16,"warmup_ratio": 0.00,
"config" : { 'hidden_dropout_prob': 0.7987 }
}
model_path = pkg_resources.resource_filename(
"rxnfp",
f"models/transformers/bert_pretrained" # change pretrained to ft to start from the other base model
)
yield_bert = SmilesClassificationModel("bert", model_path, num_labels=1,
args=model_args, use_cuda=torch.cuda.is_available())
yield_bert.train_model(train_df, output_dir=f"outputs_buchwald_hartwig_test_project", eval_df=test_df)
After training, a model can be loaded from its checkpoint and used to predict yields on the test set.
model_path = '../trained_models/buchwald_hartwig/FullCV_01_split_2768/checkpoint-2595-epoch-15'
trained_yield_bert = SmilesClassificationModel('bert', model_path,
num_labels=1, args={
"regression": True
}, use_cuda=torch.cuda.is_available())
yield_predicted = trained_yield_bert.predict(test_df.head(10).text.values)[0]
yield_predicted = yield_predicted * std + mean  # undo the standardisation to recover the yield
yield_true = test_df.head(10).labels.values
yield_true = yield_true * std + mean
for rxn, pred, true in zip(test_df.head(10).text.values, yield_predicted, yield_true):
print(rxn)
print(f"predicted {pred:.1f} | {true:.1f} true yield")
print()
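Beyond inspecting the first ten reactions, the whole test set can be scored, for example with the coefficient of determination (a sketch assuming scikit-learn is available; the metric choice here is only illustrative):
from sklearn.metrics import r2_score, mean_absolute_error
# Predict on the complete held-out set and undo the label standardisation.
test_preds = trained_yield_bert.predict(test_df.text.values)[0]
test_preds = test_preds * std + mean
test_true = test_df.labels.values * std + mean
print(f"R2:  {r2_score(test_true, test_preds):.2f}")
print(f"MAE: {mean_absolute_error(test_true, test_preds):.1f}")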