Unifying teams and tools on the Red Hat OpenShift Data Science platform

November 22, 2021 | Karl Eklund | 6-minute read

Red Hat OpenShift Data Science removes barriers between data engineers, data scientists, and application developers so organizations can realize the benefits of Artificial Intelligence and Machine Learning. Using this cloud service, organizations can experiment with open source and integrated technology partner software across the entire ML life cycle.

Come together by shifting your mindset

You have data, and a lot of it. You know it's valuable, but are you really doing enough to extract the critical insights your data holds? Data professionals need to ask the right questions of data and put useful models into production. Most organizations, however, don't have a cohesive data and technology strategy to support this type of collaboration.

What happens when a cohesive data and technology strategy is missing?

Silos happen. Organizations have siloed data, siloed people, siloed processes, and siloed tools. Teams spend their time attempting to "get tools working" instead of focusing on what they can do with their data.

Siloed people and processes often result from an organization's structure, and they are difficult, if not impossible, to avoid. The real problem, however, is the reinforcement of these barriers when individuals select and maintain an exclusive set of tools and data. Organizations need to focus on the bigger picture, an end-to-end Machine Learning pipeline aligned to their business goals, instead of local optimizations that hinder the process.

An integrated environment like Red Hat OpenShift Data Science can help by making it easier for data engineers, data scientists, and application developers to work together on the full Machine Learning life cycle.

Red Hat OpenShift Data Science, a cloud service providing a fully featured Data Science environment, unites data, teams, processes, and tools on a single platform. Unification gives data professionals a chance to finally engage directly on the broader goal. They focus on data and select the tools they need to be successful together.

At its core, Red Hat OpenShift Data Science provides Jupyter notebooks as a service, with key Data Science packages like TensorFlow, PyTorch, and Pandas built directly into container images. Red Hat OpenShift Data Science really shines in its ability to add on third-party components from our best-of-breed partners.

The self-service catalog contains community open source projects and commercial partner offerings so data professionals can tailor tools to the organization's AI/ML needs and successfully navigate the Machine Learning life cycle.

Figure 1: The Machine Learning Life cycle

Partner choice along the way

Red Hat OpenShift Data Science helps streamline the process. Let’s look at how our partner solutions specialize in each stage of the Machine Learning life cycle.

Gathering and Preparing Data

The first challenge we face is simply getting access to data. Why? Silos.

Starburst, a company based on open source Trino, the technology formerly known as PrestoSQL, may have finally cracked the data silo code. They built an analytics engine to access your data where it lives without the need to move data all over your organization or across different environments. Starburst lets you join disparate data sets as if they were already consolidated.

Starburst layers in enterprise-grade features on top of Trino. They provide data management, data security features, role-based access control, and the distributed computing features every modern enterprise needs. Adopting a similar unification approach to Red Hat OpenShift Data Science, their single point of access supports a data mesh strategy and gives data scientists easy access to the data they need to be successful.
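To make the idea of joining disparate data sets in place more concrete, here is a minimal sketch. It does not use Starburst or Trino. As a stand-in, it attaches two independent in-memory SQLite databases to one connection and joins across them in a single query. The table names, the `billing` alias, and the sample rows are all hypothetical.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent database to stand in for another data silo.
    conn.execute("ATTACH DATABASE ':memory:' AS billing")
    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],  # hypothetical sample rows
    )
    conn.executemany(
        "INSERT INTO billing.invoices VALUES (?, ?)",
        [(1, 120.0), (1, 80.0), (2, 50.0)],  # hypothetical sample rows
    )
    return conn


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Join tables that live in different databases without copying either one."""
    query = """
        SELECT c.name, SUM(i.amount) AS total
        FROM main.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


conn = build_demo_connection()
print(revenue_by_customer(conn))
```

The query addresses each source by a qualified name, which is the same shape a federated engine exposes when each silo is mounted as its own catalog or schema.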

Developing Models

The next step is the model development process. To be successful long-term, data scientists must make their models reproducible while still making sense of the data as quickly as possible.

Open source data science software changes quickly, and while this is a good thing for innovation, it can easily introduce discrepancies into Machine Learning models. Anaconda, a company providing open source packages and libraries along with dependency management, helps ensure compute environments are reproduced every time data scientists need them.

But Anaconda is more than just open source distribution and package management. Anaconda’s premium repository includes Common Vulnerabilities and Exposures (CVE) metadata and artifact signature verification to help ensure provenance and security of open source packages. As a result, data scientists have consistent, repeatable workflows designed with security in mind. It's no wonder that so many organizations have been relying on Anaconda for almost a decade.
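One way to picture what a reproducible compute environment means in practice is to reduce a pinned package set to a single fingerprint and compare it across machines. The sketch below is an illustration only and is not how Anaconda implements dependency management or signature verification. The function names and the package versions are hypothetical.

```python
import hashlib
from typing import Mapping


def environment_fingerprint(packages: Mapping[str, str]) -> str:
    """Return a SHA-256 digest of a pinned package set, independent of ordering."""
    # Normalize names and sort so the same pins always produce the same text.
    lines = sorted(f"{name.lower()}=={version}" for name, version in packages.items())
    canonical = "\n".join(lines)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_environment(a: Mapping[str, str], b: Mapping[str, str]) -> bool:
    """Two environments match only if every package pin is identical."""
    return environment_fingerprint(a) == environment_fingerprint(b)


# Hypothetical pins, for illustration only.
laptop = {"pandas": "1.3.4", "numpy": "1.21.4"}
cluster = {"numpy": "1.21.4", "pandas": "1.3.4"}
drifted = {"numpy": "1.21.4", "pandas": "1.3.5"}

print(same_environment(laptop, cluster))  # same pins, different order
print(same_environment(laptop, drifted))  # one version differs
```

A mismatch in the fingerprint is a cheap early signal that a model may not reproduce, before any training time is spent.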

IBM Watson Studio and Watson Machine Learning are part of IBM's Cloud Pak for Data offering, but for now, we'll focus on AutoAI, an automation tool within IBM Watson Studio that targets both business users and experienced data scientists. AutoAI takes a subset of your data and iteratively ranks the best combinations of automatically generated features, selected models, and hyperparameter settings.

By automating tasks within the pipeline, AutoAI expands the reach of Artificial Intelligence to business users and democratizes AI. The same automation that puts complicated workflows into the hands of citizen data scientists also benefits experienced data scientists, letting them focus on the data and its meaning and reach model outcomes quickly. By rapidly experimenting with candidate models, data scientists can focus on asking the right questions of their data, discover appropriate features, and ultimately, get results.
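The general pattern of ranking combinations of features and hyperparameter settings on a subset of the data can be sketched in a few lines. This is not AutoAI's algorithm or API. It is a toy leaderboard, assuming a one-feature ridge regression as the only model family, with hypothetical function names and synthetic data.

```python
from itertools import product


def fit_ridge_1d(xs, ys, penalty):
    """Fit y = w * x with an L2 penalty using the closed-form solution."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mean_squared_error(xs, ys, weight):
    """Average squared gap between the targets and the predictions w * x."""
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def rank_candidates(rows, target, features, penalties, train_fraction=0.5):
    """Score every (feature, penalty) pair on held-out rows, best first."""
    split = int(len(rows) * train_fraction)
    train, holdout = rows[:split], rows[split:]
    leaderboard = []
    for feature, penalty in product(features, penalties):
        weight = fit_ridge_1d(
            [r[feature] for r in train], [r[target] for r in train], penalty
        )
        error = mean_squared_error(
            [r[feature] for r in holdout], [r[target] for r in holdout], weight
        )
        leaderboard.append((error, feature, penalty))
    return sorted(leaderboard)  # lowest held-out error first


# Synthetic data: y depends only on x1, while x2 is an uninformative feature.
rows = [{"x1": float(i), "x2": float((i * 7) % 5), "y": 2.0 * i} for i in range(1, 11)]
leaderboard = rank_candidates(rows, "y", ["x1", "x2"], [0.0, 10.0])
for error, feature, penalty in leaderboard:
    print(f"{feature} penalty={penalty} error={error:.3f}")
```

The informative feature with no penalty lands at the top of the leaderboard, which is the kind of result a data scientist would then inspect and question rather than accept blindly.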

Lastly, data scientists can save time by exporting results to their Jupyter notebooks. Transparency between Cloud Pak for Data and Red Hat OpenShift Data Science removes tedious, low-level tasks within a data scientist’s workflow.

A second offering that shortens time to insight is the Intel oneAPI AI Analytics Toolkit. Intel’s AI Kit provides a series of tools and frameworks optimized for maximum performance on Intel-based CPUs.

Intel includes drop-in accelerations for TensorFlow, PyTorch, XGBoost, Scikit-learn, and more thanks to their oneAPI libraries, such as the oneAPI Deep Neural Network Library and oneAPI Data Analytics Library. Data scientists can use these optimized libraries directly within a Jupyter Notebook to increase compute performance in the data processing, model development, training, and inference stages.

Model Deployment and Monitoring

After reflecting on our data and candidate models, we can focus on the critical task of getting our first model into production. Thankfully, Red Hat OpenShift Data Science has two initial partner offerings to select from.

Intel OpenVINO Pro for Enterprise provides a fully integrated model development environment and optimizes your model for inference on Intel hardware. Optimization allows us to tailor model behavior to the intended deployment environment. We simply balance our desire for performance with accuracy thresholds depending on the deployment location.

For example, on the edge, we may optimize for performance if networking and hardware choices are constrained. But with more powerful hardware within an OpenShift cluster, we receive our performance gains through horizontal scaling and optimize our model for accuracy.
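The trade-off described above, balancing performance against an accuracy threshold per deployment location, can be sketched as a simple selection rule. This is not the OpenVINO API. The variant names, accuracy figures, and latencies are hypothetical placeholders chosen only to show the decision.

```python
from typing import NamedTuple, Optional, Sequence


class ModelVariant(NamedTuple):
    name: str
    accuracy: float    # fraction of correct predictions on a validation set
    latency_ms: float  # average time per inference request


def pick_variant(
    variants: Sequence[ModelVariant], min_accuracy: float
) -> Optional[ModelVariant]:
    """Return the fastest variant that still meets the accuracy floor."""
    eligible = [v for v in variants if v.accuracy >= min_accuracy]
    if not eligible:
        return None  # nothing satisfies the threshold for this location
    return min(eligible, key=lambda v: v.latency_ms)


# Hypothetical measurements for three precisions of the same model.
variants = [
    ModelVariant("fp32", 0.950, 40.0),
    ModelVariant("fp16", 0.948, 22.0),
    ModelVariant("int8", 0.931, 9.0),
]

edge_choice = pick_variant(variants, min_accuracy=0.93)     # constrained hardware
cluster_choice = pick_variant(variants, min_accuracy=0.95)  # accuracy comes first
print(edge_choice.name, cluster_choice.name)
```

Lowering the accuracy floor for the edge admits the faster variant, while the stricter floor in the cluster keeps the most accurate one and leaves throughput to horizontal scaling.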

The second partner offering, Seldon Deploy, focuses on model deployment, management, monitoring, and explainability. Seldon bridges the gap between data science and DevOps teams to bring models to market faster and unlock business value, delivering repeatable deployments for businesses of all sizes.

Seldon Deploy allows teams to serve models and manage their deployments through an intuitive user interface with pre-built dashboards and custom visualizations. Advanced deployment techniques like canaries and A/B testing help enterprises minimize risk and optimize performance.
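For readers new to canary releases, the core mechanic is a weighted traffic split: a small, stable share of requests goes to the new model while the rest stay on the current one. The sketch below shows that mechanic only. It is not Seldon Deploy's implementation or configuration format, and the function names and request ids are hypothetical.

```python
import hashlib


def route_request(request_id: str, canary_percent: int) -> str:
    """Deterministically send roughly canary_percent% of ids to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the id to one of 100 buckets; the same id always gets the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


def canary_share(request_ids, canary_percent: int) -> float:
    """Fraction of the given requests that would reach the canary model."""
    routes = [route_request(r, canary_percent) for r in request_ids]
    return routes.count("canary") / len(routes)


ids = [f"user-{n}" for n in range(10_000)]  # hypothetical request ids
print(f"canary share at 10%: {canary_share(ids, 10):.3f}")
```

Hashing the request id, rather than drawing a random number per call, keeps each caller on the same model version for the duration of the rollout, which makes A/B comparisons cleaner.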

Seldon’s monitoring capabilities allow users to verify if a deployed model has the desired prediction characteristics while helping businesses understand how outputs are affected when data changes through drift and outlier detection.
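As a rough illustration of what drift and outlier checks look for, the sketch below compares live inputs against a reference sample using z-scores. Production detectors are typically more sophisticated, and nothing here reflects Seldon's actual detectors. The thresholds, function names, and sample values are assumptions made for the example.

```python
from statistics import mean, stdev


def z_scores(reference, values):
    """Distance of each value from the reference mean, in reference std devs."""
    mu, sigma = mean(reference), stdev(reference)
    return [(v - mu) / sigma for v in values]


def find_outliers(reference, values, threshold=3.0):
    """Values that sit more than `threshold` std devs from the reference mean."""
    return [v for v, z in zip(values, z_scores(reference, values)) if abs(z) > threshold]


def mean_has_drifted(reference, live, threshold=3.0):
    """Flag drift when the live mean is far from the reference mean.

    The gap is measured in standard errors of the live sample mean.
    """
    standard_error = stdev(reference) / len(live) ** 0.5
    return abs(mean(live) - mean(reference)) / standard_error > threshold


# Hypothetical feature values seen at training time and in production.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
print(find_outliers(reference, [10.1, 14.0, 9.7]))
print(mean_has_drifted(reference, [12.0, 12.5, 11.8, 12.2]))
print(mean_has_drifted(reference, [10.1, 9.9, 10.0, 10.2]))
```

Outlier detection asks whether a single request looks unusual, while drift detection asks whether the whole stream of requests has moved away from what the model was trained on.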

Finally, Seldon Deploy’s explainability features let users see inside the ‘black box’ of decision making by attaching custom explainers to provide insight into data sets and help mitigate bias.

Continuous improvement

Just as models are continuously monitored and improved over time, Red Hat OpenShift Data Science is engaged with partners to provide solutions with the latest technologies to our customers. With Red Hat's partners, we are exploring offerings that touch on data engineering, automation, and discrete compute accelerators.

One of Red Hat's partners, NVIDIA, is a household name among data scientists for deep learning and inference. NVIDIA’s discrete graphics processing units (GPUs) accelerate computationally expensive neural networks by parallelizing an otherwise serial process. With minimal code changes, data scientists can save a lot of time training models.

Red Hat OpenShift Data Science’s planned support of NVIDIA GPUs is designed to help data scientists scale their neural networks to large, complex architectures without sacrificing their productivity in the process.

Build an environment, strengthen your team

Red Hat OpenShift Data Science eliminates data and technology silos and encourages collaboration simply by bringing data, teams, processes, and tools to a single location. You receive the benefits of a managed cloud service and those provided through OpenShift. And because of the platform approach, it is easy to assemble a technology stack aligned to the needs of your entire team.

Trial options are available to introduce you to the Red Hat OpenShift Data Science offering. Our Developer Sandbox highlights core features and technology within the platform and only takes a few minutes to spin up. From there, the 60-day trial adds the ability to test our broader partner ecosystem to determine which solutions align best to your team's data needs.

About the author

Karl Eklund

Principal Architect

Karl Eklund is a Principal Architect aligning customer goals to solutions provided by the open source community and commercial vendors within the Red Hat OpenShift Data Science platform. Prior to joining Red Hat, Karl advised technology leaders on enterprise data and technology strategies and built machine learning models across multiple academic disciplines.
