Too long; didn’t read

  • If you are looking to build a career as a data scientist in the process industry, this blog may give you an idea of the skills you should look to acquire.
  • The infographic below provides a quick overview of the intermediate-level data-science skills expected of a modern process data scientist.
  • Our recent book ‘ML in Python for Process Systems Engineering’ is a comprehensive guide for developing these skills.

In the 21st century, process data scientists (PDSs) serve the key role of unlocking untapped potential in process data. Using process data, they help an organization achieve higher efficiency and productivity by

  • tracking and controlling key process indicators in real time
  • detecting underperforming process units early
  • enabling predictive maintenance of process equipment
  • running process-wide, round-the-clock (24x7) automated monitoring to detect and diagnose abnormal operational states
  • optimizing process operations in real time

Consequently, a modern PDS possesses several skills which come in handy for developing digital solutions to meet the aforementioned objectives. The infographic below provides an overview of these skills.

Areas of Expertise and Data-Science Techniques

A modern PDS should have sufficient expertise in the listed functional areas. Most of the data-science projects you work on will fall under these categories. The infographic therefore emphasizes acquiring expertise in different categories of tasks rather than in just a few data-science techniques or algorithms. This requirement translates into gaining working-level knowledge of a broad spectrum of data-science techniques.

The infographic also cautions against a pitfall that some data-science enthusiasts fall into: overemphasizing ‘modern’ and complex ML techniques such as artificial neural networks (ANNs) while ignoring classical methods such as partial least squares (PLS) and principal component analysis (PCA). The adage “If the only tool you have is a hammer, you tend to see every problem as a nail” also applies to data scientists. Neural nets are all the rage nowadays and have delivered remarkable successes; however, data science is not just neural networks. PLS models are still the most popular choice for soft sensor development in the process industry because of their simplicity and their ability to handle noisy and correlated data. Therefore, as a skilled PDS, you should know the different techniques that are frequently applied to the varied tasks listed above.
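To make the point about correlated data concrete, here is a minimal sketch of a one-latent-variable PLS soft sensor written in plain Python. Everything in it is an assumption made for illustration: the synthetic data (two nearly redundant sensors driven by one hidden process variable), the function names `fit_pls1_one_component` and `predict`, and the restriction to a single component. In practice you would more likely use a library implementation such as scikit-learn's `PLSRegression`.

```python
import random


def fit_pls1_one_component(X, y):
    """Fit a single-component PLS1 model (illustrative sketch only).

    X is a list of rows (samples), y a list of target values.
    Returns (x_means, y_mean, coef) for use with predict().
    """
    n, p = len(X), len(X[0])

    # Mean-center inputs and target.
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector: the input direction with maximum covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]

    # Scores: projection of each sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regress the centered target on the single score.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    # Regression coefficients expressed in the original input space.
    coef = [q * wj for wj in w]
    return x_means, y_mean, coef


def predict(model, row):
    """Predict the target for one sample using a fitted model."""
    x_means, y_mean, coef = model
    return y_mean + sum(c * (x - m) for c, x, m in zip(coef, row, x_means))


# Synthetic data (assumed): two noisy sensors that measure the same hidden
# variable z, and a quality variable that depends on z.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    z = rng.uniform(-1.0, 1.0)
    X.append([z + rng.gauss(0, 0.05), z + rng.gauss(0, 0.05)])
    y.append(3.0 * z + rng.gauss(0, 0.1))

model = fit_pls1_one_component(X, y)
x_means, y_mean, coef = model
prediction = predict(model, [0.5, 0.5])
print("coefficients:", coef)
print("prediction at [0.5, 0.5]:", prediction)
```

Because the two inputs are almost copies of each other, ordinary least squares would have to invert a nearly singular matrix and could return large, unstable coefficients. The single PLS component instead spreads the effect roughly evenly across both sensors (about 1.5 each for this synthetic setup), which is the kind of robustness to correlated inputs that the paragraph above refers to.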

Domain Knowledge

The importance of domain knowledge in developing process ML solutions cannot be overemphasized. Knowledge of process systems is what distinguishes a PDS from a ‘generic’ data scientist. A PDS is expected to be familiar with typical process control systems such as PID control loops, model predictive controllers (MPCs) and real-time optimization (RTO), and to be able to read process documents such as process flow diagrams (PFDs) and piping and instrumentation diagrams (P&IDs). Familiarity with process systems helps a PDS make judicious decisions when developing a process model. For example, during feature engineering, an accomplished PDS can often make well-informed decisions about additional features (logarithm of composition for distillation columns, exponential of temperature for reactors, etc.) that can or should be generated for a given process dataset.
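The feature-engineering idea above can be sketched in a few lines of Python. The sketch is hedged throughout: the variable names (`impurity_mole_frac`, `reactor_temp_K`), the helper `add_engineered_features`, and the constant `E_OVER_R_K` are hypothetical, and the Arrhenius-style term exp(-E/(R·T)) is just one way to interpret "exponential of temperature" for a reactor.

```python
import math

# Hypothetical activation energy divided by the gas constant, in kelvin.
# A real value would come from reaction kinetics data or be tuned.
E_OVER_R_K = 8000.0


def add_engineered_features(sample):
    """Return a copy of `sample` with domain-inspired features added.

    `sample` is a dict of raw measurements (assumed keys shown below).
    """
    features = dict(sample)

    # Distillation: composition often relates more linearly to tray
    # temperatures on a log scale, so add the log of the impurity.
    features["log_impurity"] = math.log(sample["impurity_mole_frac"])

    # Reactor: reaction rates typically scale with an Arrhenius-type
    # exponential of inverse absolute temperature.
    features["arrhenius_term"] = math.exp(-E_OVER_R_K / sample["reactor_temp_K"])

    return features


# Example raw measurements (made-up values for illustration).
raw = {"impurity_mole_frac": 0.01, "reactor_temp_K": 400.0}
engineered = add_engineered_features(raw)
print(engineered)
```

The transformed features can then be fed to a linear model such as PLS, which lets a simple model capture behavior that is nonlinear in the raw measurements.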

Scripting and ML-Ops

It is becoming increasingly rare for a data scientist’s job to consist solely of developing machine learning models. A modern PDS may have to shoulder several additional responsibilities involved in delivering models to production. Consequently, a modern PDS should be well versed in the best practices for sustainable model maintenance, seamless data connectivity, scalable deployment, uninterrupted model execution, etc. These best practices aim to ensure reliable and efficient performance of data-science solutions and are collectively termed ML-Ops. Several tools and platforms are available for implementing ML-Ops practices. Cloud platforms (Microsoft’s Azure and Amazon’s AWS being the popular ones) may be employed for scalable and reliable model execution, DevOps, etc. Docker and Kubernetes can be used to package and deploy solutions quickly and at scale. The takeaway is that a modern PDS’s knowledge sphere extends beyond classical scripting, and a skilled PDS must strive to stay abreast of the latest ML-Ops technologies.

Soft Skills

Until now we have focused on the intermediate-level technical skills of a modern PDS. There is another set of skills that is equally important for a successful PDS career: soft skills, an umbrella term covering storytelling, communication, presentation, interpersonal skills, team building, leadership, work ethic, etc. You may have the most accurate ML model, but your project won’t fly if you can’t present a compelling business case to your management. You may have the most detailed and beautiful-looking user interface, but your tool won’t gain user traction and confidence if you can’t convey succinctly to the plant operators how it can help in their day-to-day activities of running a plant. Such skills are developed with practice, but a keen focus can help you pick them up more quickly.

We hope that budding PDSs will find this infographic and the accompanying suggestions useful. We will continue to add resources to this website focused on the skills mentioned in the infographic above. Our recent book ‘ML in Python for Process Systems Engineering’ contains detailed step-by-step guidance on developing several of these skills.