It was a moment of great relief and satisfaction when I published my book on Leanpub. I was finally able to accomplish what I had set out to do a couple of years ago. Juggling a regular job, household chores, and family responsibilities while dodging the Covid virus was, as expected, not easy, but the end goal was clear: ‘Write a book that my younger self would have enjoyed reading and found useful’. But what exactly did my younger self want? Why did I decide to write this book? This blog post is a brief account of the circumstances that led me down the path of authoring a book.
Circa 2017, my team at Praxair (now Linde) was rechristened as Praxair Digital. Praxair’s air-separation and hydrogen plants generate a lot of process data, and one of the mandates of the newly formed team was to explore, develop, and deploy ML-based solutions to derive ‘untapped value from data’. Historically, my team had developed several data-based solutions for modeling our plants, monitoring critical equipment, and allocating resources optimally. Nevertheless, we were not ‘data scientists’, as per our preconceived notion of who a data scientist is. With the desire to quickly acquaint ourselves with the modern ML techniques, we decided to form a self-directed learning group. For guidance, we shortlisted two popular books on data science. We would finish our lunch quickly and gather in a conference room to use the rest of the lunch break to study these books and learn ‘data science’ (some of us even joined by phone; such was the spirit of group learning!). [Of course, my co-author, Jesus Flores-Cerrillo, who was the Associate Director of our team, didn’t join us in these meet-ups.]
These books are well written, and I would recommend them to any data science enthusiast. They gave us a very good understanding of the methods that constitute a data scientist’s toolkit. However, by the time we finished these books, it was apparent to us that something was missing. There was a disconnect between what we had learnt and what we wanted to learn as process data scientists (PDSs). This disconnect is common across the currently available data science books. Below I list some of the major sources of the disconnect:
- Some obvious “omissions”: Partial least squares (PLS) is a very powerful method for process systems modeling due to its ability to efficiently handle collinear and noisy signals. Principal component analysis (PCA) is usually the default choice for building process monitoring applications. These two are must-have tools in any PDS’s toolkit. Unfortunately, while PLS is not mentioned at all in these books, the treatment of PCA is limited to showing its dimensionality reduction (DR) capability. Canonical variate analysis (CVA) and independent component analysis (ICA) are another set of important tools for DR-based process monitoring, but many data science books don’t discuss them, let alone showcase their application to process monitoring.
- Data utilized for illustrations: A PDS is likely to absorb the subject matter better if process systems data is employed for illustrations. However, in the majority of cases, datasets are chosen to appeal to a generic data scientist. For example, the application of PCA is shown on the Iris flower dataset. Such a simple treatment fails to convey to readers the true capabilities of PCA.
- Absence of specific guidelines for PDSs: Deep neural networks have provided amazing results in areas such as image recognition and speech processing, but do we really need deep networks for modeling process systems? Based on my experience and a survey of the literature, I can say that two to three hidden layers are usually enough. A beginner PDS would certainly appreciate having such specific guidelines upfront.
I hope that you get the message by now. I wanted a book written “of the PS (process systems), by the PDS, for the PDS”. A book that took into account the various characteristics of process industry data and taught techniques that have proven useful for analyzing process data. A book that talked about the different applications a method could be employed for in my industry; for example, PLS as a soft-sensing tool as well as a process monitoring tool. This, in essence, is the backstory behind my decision to write this book.
I look forward to hearing your thoughts. Let me know whether you agree or disagree with me on the lack of beginner-level ML books with a focus on process systems.