Machine learning and Chemistry, an example from soil data
The question I get every time I tell someone about my academic background is: “Why did you completely change career, from working in a lab as a chemist to working at a PC as a data scientist?” The complete answer would take more than a blog post, so I try to keep it concise, something like: “I have been interested in open source software since I was an undergraduate. Later on, while working as an R&D chemist in the Netherlands, I discovered how powerful a tool Python could be for the physical sciences. Then, back in France, I decided to follow my passion and invest some time in Data Science training.”
Infrared spectroscopy data processed with a convolutional neural network
Indeed, a bit more than a year after I made that decision, I was working on applying a neural network to extract patterns from infrared spectra as the final project for my M.Sc.
Data were sourced from https://registry.opendata.aws/afsis/
Spectra could be loaded into a Pandas dataframe using an open-source Python library from Bruker instruments. Well, a few of the spectra. During the initial exploratory data analysis (EDA) I noticed that a big chunk of the data could not be opened in Python. Many of the remaining spectra were redundant, since samples from the same soil had been measured with different instruments in three distinct labs. On top of that, the academic literature showed that only a few of them, those corresponding to mid-infrared wavelengths, were really useful to quantify organic matter using Fourier Transform Infrared (FTIR) spectroscopy (Xu et al., 2019). In the end I extracted 30 MB of useful data out of the initial 5 GB.
At this point the classical lab routine is to look at each spectrum and work out which chemical groups correspond to which peaks. This manual selection would take a very long time for a large number of spectra, and this is where ML comes in handy. If each wavelength is taken as a variable, the spectra can be fed to an ML algorithm that automatically selects the wavelengths that correlate with a specific composition parameter such as phosphorus concentration, acidity or amount of sand (Figure 1).
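To make the idea of "each wavelength as a variable" concrete, here is a minimal sketch using NumPy only. It assumes the spectra are stored as a matrix with one row per sample and one column per wavenumber; the data are synthetic, and the band limits (600 to 4000 cm-1) and function names are illustrative choices, not those of the original project.

```python
import numpy as np

def select_band(wavenumbers, spectra, low, high):
    """Keep only the columns whose wavenumber lies in [low, high]."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    return wavenumbers[mask], spectra[:, mask]

def correlation_per_wavenumber(spectra, target):
    """Pearson correlation between each spectral column and the target."""
    x = spectra - spectra.mean(axis=0)
    y = target - target.mean()
    denom = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    return (x * y[:, None]).sum(axis=0) / denom

# Synthetic stand-in for real spectra: 50 samples, 200 wavenumbers.
rng = np.random.default_rng(0)
wavenumbers = np.linspace(7500.0, 600.0, 200)
target = rng.uniform(0.0, 1.0, size=50)          # e.g. a concentration
spectra = rng.normal(0.0, 0.05, size=(50, 200))  # pure noise baseline
peak = np.argmin(np.abs(wavenumbers - 1500.0))
spectra[:, peak] += target                       # one band tracks the target

# Restrict to the (assumed) mid-infrared band, then rank the variables.
mir_wn, mir_spectra = select_band(wavenumbers, spectra, 600.0, 4000.0)
corr = correlation_per_wavenumber(mir_spectra, target)
best = mir_wn[np.argmax(np.abs(corr))]
print(f"Most correlated wavenumber: {best:.0f} cm-1")
```

A regression model does something richer than this column-by-column ranking, but the data layout (samples as rows, wavenumbers as columns) is the same.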

The early approach was Partial Least Squares (PLS) regression (Du and Zhou, 2009), but a more recent study showed that convolutional neural networks (CNNs) can give even better results (Ng et al., 2019). Good source code for this task is available at https://github.com/EBjerrum/Deep-Chemometrics
I used it and adapted the CNN to the African soil data. The results, though, were not as good as expected.

Model failed, why did that happen?
At first I tweaked the CNN parameters. Changing the activation functions and similar settings did not help. A different algorithm, the Random Forest Regressor, did not perform much better either. At that point I decided to verify whether the data really met the basic assumption, which rests on a physical mechanism: the Beer-Lambert law. The assumption is that the intensity of some spectral peaks is directly proportional to the concentration of a given chemical in the solution. I previously explained and demonstrated this phenomenon in a video. To verify whether this held true for H+ concentration, i.e. soil acidity, I plotted the spectra of the samples with the most extreme pH values (Figure 3a).
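For reference, the Beer-Lambert law mentioned above is usually written as follows. The symbols are the conventional textbook ones and are not taken from the original analysis:

```latex
A = \varepsilon \, \ell \, c
```

Here A is the absorbance at a given wavelength, ε the molar absorption coefficient of the absorbing species, ℓ the optical path length and c the concentration of that species. With ε and ℓ fixed, a plot of A against c should be a straight line through the origin.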

The spectra corresponding to low pH clearly differ from those with high pH in the region 2500–3000 cm-1. I then selected a wavenumber in this range, 2700 cm-1, and plotted the absorbance of all samples at this wavenumber against their respective pH values. The Beer-Lambert law would imply that those data points are narrowly distributed around a straight line. Figure 3b shows that this is not the case. It is not surprising, then, that no ML algorithm could find a correlation between the spectral data and pH.
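This kind of sanity check can also be expressed numerically. The sketch below fits a straight line to absorbance at one wavenumber against pH and reports R², using NumPy and synthetic data. The function name, the wavenumber grid and both toy datasets are hypothetical; following the post, absorbance is compared with pH directly.

```python
import numpy as np

def linearity_check(wavenumbers, spectra, ph, target_wn):
    """Fit absorbance at the column nearest target_wn against pH.

    Returns (slope, intercept, r_squared) of the least-squares line.
    """
    col = np.argmin(np.abs(wavenumbers - target_wn))
    absorbance = spectra[:, col]
    slope, intercept = np.polyfit(ph, absorbance, deg=1)
    residuals = absorbance - (slope * ph + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((absorbance - absorbance.mean()) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
wavenumbers = np.linspace(4000.0, 600.0, 341)  # hypothetical 10 cm-1 grid
ph = rng.uniform(4.0, 9.0, size=200)

# Case 1: absorbance at 2700 cm-1 follows pH linearly, with small noise.
linear = rng.normal(1.0, 0.01, size=(200, 341))
col = np.argmin(np.abs(wavenumbers - 2700.0))
linear[:, col] = 0.2 * ph + rng.normal(0.0, 0.02, size=200)

# Case 2: absorbance is unrelated to pH.
unrelated = rng.normal(1.0, 0.3, size=(200, 341))

_, _, r2_linear = linearity_check(wavenumbers, linear, ph, 2700.0)
_, _, r2_unrelated = linearity_check(wavenumbers, unrelated, ph, 2700.0)
print(f"R^2 when the relation holds: {r2_linear:.3f}")
print(f"R^2 when it does not:        {r2_unrelated:.3f}")
```

An R² near zero at the most promising wavenumber is the numerical counterpart of a scatter plot with no visible trend.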
Bottom line
Machine Learning is a great tool, very effective at dealing with complex and massive data. Still, it is not a magic box, and it cannot replace an understanding of the physical principles that explain the experimental results.
The related Python code is available at https://github.com/opsabarsec/African-soil-chemistry
References
Du, C. and Zhou, J. (2009) ‘Evaluation of soil fertility using infrared spectroscopy: A review’, Environmental Chemistry Letters, pp. 97–113. doi: 10.1007/s10311-008-0166-x.
Ng, W. et al. (2019) ‘Convolutional neural network for simultaneous prediction of several soil properties using visible/near-infrared, mid-infrared, and their combined spectra’, Geoderma, 352, pp. 251–267. doi: 10.1016/j.geoderma.2019.06.016.
Xu, X. et al. (2019) ‘Detection of soil organic matter from laser-induced breakdown spectroscopy (LIBS) and mid-infrared spectroscopy (FTIR-ATR) coupled with multivariate techniques’, Geoderma, 355. doi: 10.1016/j.geoderma.2019.113905.