Data science has exploded in popularity (and sometimes, hype) in recent years. This has led to an increased interest in learning the subject. With so many possible directions, it can be hard to know where to start.
Data science is better considered a loose collection of topics than a single coherent field. It encompasses topics in many different areas, including:
- Data storage (databases, data storage technologies)
- Data engineering (infrastructure and techniques for transforming data at scale)
- Statistics and machine learning
- Data visualization and communication
- And many more
For a beginner, it is crucial to get a flavor of the subject before diving deep and specializing. For anyone curious about learning data science, here is some informal guidance on ways to get started.
Programming languages for machine learning
In terms of programming languages, Python is used heavily in industry and in academic computer science departments, where a lot of machine learning (ML) research is carried out. The statistical language R is used heavily by groups doing classical statistics, such as those in medicine, clinical research, and psychology. R has a very rich set of libraries in this arena, while Python still lags behind, although packages like statsmodels implement classical methods.
That said, Python dominates when it comes to ML, especially deep learning or reinforcement learning. In these settings, Python is almost always used as a prototyping language; all the core functionality is implemented in a lower-level language, like C, C++, or numerical routines in Fortran. A practitioner might write a neural network using PyTorch or NumPy, which, in turn, calls parallelized single instruction, multiple data (SIMD) operations implemented in lower-level languages. Julia is an interesting alternative language, but learning Python is highly recommended for those starting out.
Which aspects of ML would you like to try?
The day-to-day work of data scientists can differ vastly. The situation is analogous to having the title "software engineer." Person A might be a low-level kernel hacker, while Person B might be writing front-end code in React. Both hold the same job title and both can write code, but their skill sets are very different.
Here are a few things you could try to get a flavor of data science.
The classic introduction has long been Andrew Ng's machine learning course on Coursera, and it is still a great starting point. The course gives a high-level survey of the core techniques in ML. More importantly, it conveys the kind of mathematical and algorithmic thinking needed for designing and analyzing ML algorithms. While most practitioners will not need to design new algorithms, understanding the principles is crucial to applying and extending ML techniques. A great way to build this understanding is to implement not only the assignments in the course but also each algorithm from scratch in Python. Two recent books by Kevin Murphy are highly recommended for those who want to dive deeper into ML.
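As one illustration of what "from scratch" can look like, here is a minimal sketch of linear regression fitted by batch gradient descent, using only NumPy. The function name, learning rate, step count, and toy data are all assumptions chosen for this example; they are not taken from any particular course.

```python
import numpy as np


def fit_linear_regression(X, y, learning_rate=0.1, n_steps=2000):
    """Fit y ~ X @ w + b by batch gradient descent on mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        residual = X @ w + b - y                    # prediction error per sample
        grad_w = 2.0 * X.T @ residual / n_samples   # d(MSE)/dw
        grad_b = 2.0 * residual.mean()              # d(MSE)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data, invented for illustration: y = 3*x0 - 2*x1 + 1 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + 0.01 * rng.normal(size=200)

w, b = fit_linear_regression(X, y)
print("weights:", w, "bias:", b)
```

Because the data is generated from known coefficients, it is easy to check whether the implementation recovers them, and to see what happens when the learning rate is made too large or too small.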
A huge part of data science work is getting the data into the right structure and format, as well as exploring the dataset and checking ideas against it. R has amazing libraries for this (such as dplyr), and in the Python world, pandas offers much of the same functionality. Similarly, getting comfortable with any plotting library (for example, matplotlib, seaborn, or plotly) will be very useful. NumPy is another essential Python package: a general principle is to replace as many loops in your code as possible with the corresponding highly optimized NumPy functions. The best way to learn these skills is to pick a dataset from your job or from Kaggle (see below) and start using pandas and matplotlib to explore patterns and hypotheses.
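The principle of replacing loops with NumPy functions can be sketched as follows. The example standardizes a list of numbers (subtract the mean, divide by the standard deviation) twice: once with explicit Python loops and once with whole-array operations. The function names and the sample data are hypothetical, chosen only for illustration.

```python
import numpy as np


def standardize_loop(values):
    """Z-score each value with explicit Python loops (slow for large inputs)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    std = var ** 0.5
    return [(v - mean) / std for v in values]


def standardize_numpy(values):
    """Same computation, expressed as whole-array NumPy operations."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()            # std() defaults to ddof=0


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
looped = standardize_loop(data)
vectorized = standardize_numpy(data)
print(vectorized)
```

Both versions give the same result; the NumPy version is shorter and pushes the per-element work into compiled code, which usually matters once the arrays get large.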
Kaggle is a platform for ML competitions and is a great source for well-defined problems on clean datasets. A good way to apply ML/modeling skills is to pick a Kaggle problem. Pick one with a tabular dataset—rather than images or text—for your first modeling exercise. Building models for a specific task that can be scored is a great way to learn new modeling techniques. A downside of Kaggle is that most real-world problems are not that well-defined and don't require getting an extra 0.001% accuracy. Models on Kaggle tend to be too complicated (an ensemble of 50 models, for example), but even with these caveats, Kaggle is a great way to learn practical modeling.
The workflow of an ML project generally consists of data going through a sequence of transformations. Pipelining infrastructure makes it easy to implement these transformations on distributed hardware in a scalable and reliable way. While data engineers are generally responsible for implementing these pipelines, it's very useful for data scientists to become conversant with pipelining tools too. A popular open source platform for pipelines is Kubeflow Pipelines. The project Operate First currently hosts a service providing Kubeflow Pipelines, which can be used for experimentation.
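The idea of data flowing through a sequence of transformations can be sketched in plain Python. This is only a toy, single-machine illustration: it does not use Kubeflow Pipelines or any of its APIs, and the step names, record fields, and sensor readings are all invented for the example.

```python
from functools import reduce


def drop_missing(rows):
    """Remove records that have no measurement."""
    return [r for r in rows if r["value"] is not None]


def to_celsius(rows):
    """Convert the hypothetical 'value' field from Fahrenheit to Celsius."""
    return [{**r, "value": (r["value"] - 32) * 5 / 9} for r in rows]


def mean_by_sensor(rows):
    """Aggregate to one mean value per sensor."""
    totals = {}
    for r in rows:
        total, count = totals.get(r["sensor"], (0.0, 0))
        totals[r["sensor"]] = (total + r["value"], count + 1)
    return {sensor: total / count for sensor, (total, count) in totals.items()}


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)


raw = [
    {"sensor": "a", "value": 32.0},
    {"sensor": "a", "value": 212.0},
    {"sensor": "b", "value": None},
    {"sensor": "b", "value": 50.0},
]
result = run_pipeline(raw, [drop_missing, to_celsius, mean_by_sensor])
print(result)
```

Each step takes the previous step's output and returns new data without side effects, which is what makes steps easy to test in isolation. Pipelining platforms build on the same shape and add scheduling, distribution, and retries.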
Domain expertise pays off in data analysis
In almost every scientific field, the role of the data scientist is actually played by a physicist, chemist, psychologist, mathematician (for numerical experiments), or some other domain expert. They have a deep understanding of their field and pick up the necessary techniques to analyze their data. They have a set of questions they want to ask and know how to interpret the results of their models and experiments.
With the increasing popularity of industrial data science and the rise of dedicated data science educational programs, a typical data scientist's education now lacks domain-specific training. This lack of domain understanding limits a data scientist's ability to ask meaningful questions of the data or generate new experiments and hypotheses. The only solutions are either to work with a domain expert or, even better, to start learning the field you are interested in. The latter approach does take a long time but pays rich dividends.
In many cases, there's also the option of going deep into the techniques. A big caveat is that some of these are very specialized and generally need a lot of dedicated time. The list below is woefully incomplete and is meant to give a sense of what a few subspecialties of ML involve. Most data scientists will probably never encounter these specializations in their work.
Beyond learning the basics of neural networks and their architectures, deep learning includes learning to devise new ones and understanding the tradeoffs in their design. Diving into deep learning also requires getting comfortable with the tools (such as PyTorch, GPU kernels, and possibly some C or Julia code) that let you carry out diverse experiments and scale them. There's also a lot of reading: Papers with Code is a great resource. Note that there are specialized subfields, like computer vision, that involve a lot more than throwing a convolutional neural network at an image.
Reinforcement learning is even more specialized than deep learning, but it's a fast-growing, intellectually rich field. Again, this involves reading and understanding (and implementing) lots of papers, identifying subthreads that you find interesting, then applying or extending them. Reinforcement learning is generally more mathematical than deep learning. A (non-exhaustive) list of books and resources is:
- Reinforcement Learning: An Introduction by Sutton and Barto
- A great online course by Sergey Levine at Berkeley
- A collection of papers by Pieter Abbeel, also at Berkeley
- NeurIPS 2021 Workshop
Another interesting subfield is that of probabilistic graphical models. Some resources here are:
- Statistical Rethinking: This is a great (R-based) book
- Pyro: A probabilistic programming library built on PyTorch, which can be used to express graphical models
The subfield of optimal statistical decision-making (related to reinforcement learning) provides a sense of how specialized things can get. To learn more, see:
- Optimal Statistical Decisions by DeGroot
- Bandit Algorithms by Lattimore and Szepesvári
- Reinforcement Learning: Theory and Algorithms by Agarwal, Jiang, Kakade, and Sun
Choose an approach
Lastly, a philosophical point: There are two opposing approaches to learning and using ML techniques. One is to know which tool to use, pick up a pre-implemented version online, and apply it to a problem. This is a very reasonable approach for most practical problems. The other is to deeply understand how and why something works. This approach takes much more time, but it lets you modify or extend the tool to make it more powerful.
The problem with the first approach is that when you don't understand the internals, it's easy to give up if something doesn't work. The problem with the second approach is that it is generally much more time-consuming (maybe that's not really a problem) and must be accompanied by an application to problems (practical or not) to avoid having just a superficial level of understanding.
My very opinionated advice is to do both. Always apply the techniques to problems. The problems can be artificial, using a synthetically generated dataset, or they can be real. See where they fail and where they succeed. But don't ignore the math and the fundamentals. The goal is also to understand and not just use, and understanding almost always has some mathematical elements.
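A synthetically generated dataset makes it easy to see where a technique succeeds and where it fails, because the true answer is known in advance. The sketch below fits a straight line by least squares to two invented datasets: one that really is linear and one that is quadratic. The coefficients, noise level, and helper name `fit_line` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3.0, 3.0, size=500)
design = np.column_stack([x, np.ones_like(x)])   # columns: slope, intercept


def fit_line(y):
    """Least-squares line fit; returns (coefficients, R^2)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    r2 = 1.0 - residual.var() / y.var()          # fraction of variance explained
    return coef, r2


# Case 1: the data really is linear, so the fit should recover slope 2, intercept -1.
y_linear = 2.0 * x - 1.0 + 0.1 * rng.normal(size=x.size)
coef_linear, r2_linear = fit_line(y_linear)

# Case 2: the data is quadratic, so a straight line should explain very little.
y_quadratic = x**2 + 0.1 * rng.normal(size=x.size)
coef_quadratic, r2_quadratic = fit_line(y_quadratic)

print("linear data:    coef =", coef_linear, " R^2 =", r2_linear)
print("quadratic data: coef =", coef_quadratic, " R^2 =", r2_quadratic)
```

The same code runs without complaint in both cases; only the R^2 value and some knowledge of how the data was generated reveal that the model is wrong for the second dataset. That gap is where understanding the fundamentals starts to pay off.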
Initially, the math might seem like a foreign language, but eventually, it allows you to generate new ideas and see connections that are hard to see otherwise. Sometimes the mathematics in ML papers can seem gratuitous. Still, even then, it provides a post-hoc justification of observed results and can be used to suggest new extensions of the techniques and new experiments to verify whether the mathematical understanding is correct.
This article originally appeared on Red Hat Research and is republished with permission.