DeepChem: modeling molecules with Python in the cloud

Chemometrics: the democratization
The late ’90s in Turin, Italy, at the Faculty of Chemistry. Smartphones were not even conceivable, and only a few geeks had heard of things like “Netscape Navigator running on a Palm Pilot”. Undergraduate students were taught to program in Fortran, which ran on Win95 and crashed on Win98. A few could use Matlab, installed on the computer cluster in the Faculty of Mathematics, and only geeks had heard of Python, an exotic programming language. Chemometrics was a course for which all of us had to take notes on paper, and it was passed with an oral exam. No undergraduate student was allowed to touch a PC in the Theoretical Chemistry department. For most people the term “drug discovery” was unheard of, let alone “pandemic”.
Fast forward 20 years: everybody talks about pandemics. People who until a few months earlier were working as mechanics, shopkeepers, accountants or politicians, and were often complaining about pensions, suddenly discovered themselves to be epidemiologists with a strong background in Statistics. Drug discovery is THE skill to have in order to produce an effective vaccine against Covid-19. And Theoretical Chemistry has become much more accessible.
No, I did not start looking for a vaccine from scratch by coding molecules in Python. But a general interest in the new trends in Data Science, together with my Chemistry background, encouraged me to learn more about the latest tools in this field. When, months ago, I first read about the Python library DeepChem (version 2.4.0) and tried to install it on my PC, I managed only by using Docker. It was a rather frustrating experience, similar to the one I had had weeks earlier with TensorFlow: missing dependencies, errors and so on. As with TensorFlow, all these problems vanished when I decided to run the same package in the cloud, using Google Colab.

Meaning: you can use any PC connected to the internet to run a Theoretical Chemistry simulation. You don’t need an expensive machine, and there is no need to install any software. You can even work on a basic project for free using the computers at the city library.
DeepChem: let’s take a ride
Now what: do I have to learn SMILES from scratch, i.e. the notation for writing down a molecule so that Python can understand it? I could, but molecular datasets are already available [1]. Using the data from “MolNet”, for example, you can import a dataset, say the one for toxicity, create training and test sets for the ML algorithm, and try to classify a specific molecule that you are interested in, just by entering it as a string
example_smiles = ['CCCO', 'COCCC']
and using those as samples, without getting into the details of ab initio simulations or wondering how to perform the synthesis at lab scale. But where to learn all this?
There are a few tutorials online about DeepChem [2]; I started with this one, which is already integrated with Google Colab [3]. That is a start; you can then refer to the book “Deep Learning for the Life Sciences” [4] to master the basic concepts and apply them to the specific field you are interested in. The advantage of deep learning compared to the traditional simulation approach? Time saving. Massive. Finding the right molecule through a simulation, or being able to model an interaction, helps to focus on relatively few experiments. While lab work will be more and more automated [5], it is still costly in terms of time and money, and requires people, equipment and space. A hint in the right direction can save months in the lab and huge costs.
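To make the workflow described above more concrete, here is a minimal sketch of what it might look like in a Colab notebook. It is not taken from the tutorial I followed: it assumes DeepChem (with its deep learning backend and RDKit) is already installed, for example with pip, and the choice of the Tox21 dataset, the ECFP fingerprints and the small multitask classifier are my own illustrative assumptions. The helper names train_tox21_model and predict_toxicity are hypothetical.

```python
# Sketch only, under the assumptions stated above: DeepChem is installed
# (e.g. "pip install deepchem" in a Colab cell) and the data can be downloaded.

example_smiles = ['CCCO', 'COCCC']  # the two molecules entered as strings


def train_tox21_model(n_epochs=10):
    """Load the Tox21 toxicity data from MolNet and fit a small classifier."""
    import deepchem as dc  # imported here because it is a heavy dependency

    # MolNet returns the task names, a (train, valid, test) split and the
    # transformers that were applied to the data.
    tasks, (train, valid, test), transformers = dc.molnet.load_tox21(
        featurizer='ECFP')  # circular fingerprints, 1024 bits per molecule

    model = dc.models.MultitaskClassifier(
        n_tasks=len(tasks), n_features=1024, layer_sizes=[1000])
    model.fit(train, nb_epoch=n_epochs)

    # ROC AUC on the held-out test set, as a rough quality check
    metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
    scores = model.evaluate(test, [metric], transformers)
    return model, tasks, scores


def predict_toxicity(model, smiles_list):
    """Featurize SMILES strings like the training data and predict."""
    import deepchem as dc

    featurizer = dc.feat.CircularFingerprint(size=1024)
    features = featurizer.featurize(smiles_list)
    # For each molecule: class probabilities for every toxicity task
    return model.predict_on_batch(features)
```

In a notebook you would call model, tasks, scores = train_tox21_model() once, and then predict_toxicity(model, example_smiles) for the molecules you care about. The predictions are only as good as the model, so treat them as a hint for further work and not as a result.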
A final note
After starting to code in Python and play with DeepChem, I realized that while the basics of Chemistry are not going to change, in the near future Artificial Intelligence will have to be integrated into the university curriculum, so that students have a chance to keep up with a world that is already changing unpredictably fast.
References
1. https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html
2. https://forum.deepchem.io/t/summer-of-code-with-deepchem/55
3. https://colab.research.google.com/drive/12e8On0ntp8iteqEUt6SQmAw_9tsnsg16?usp=sharing
4. Ramsundar, Bharath, et al. Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More. O’Reilly Media, Inc., 2019.