Automating ingest data processing with data pipelines

April 22, 2021 | Uday Boppana | 5-minute read

With the amount of data today growing ever faster—from sources ranging from device edge to offsite facilities and public and private clouds—organizations must somehow keep pace with that growth as they complete their digital-transformation journeys.

One challenge is having the right quantity and quality of data at the right time. With fresh, relevant data, businesses can learn quickly and adapt to changing customer behavior. However, ingesting vast amounts of data and preparing it for analytics and AI/ML as fast as possible, preferably in real time, is extremely challenging for today’s data engineers.

Plotting the course

A data pipeline that automates the workflow of data ingestion, preparation, and management and shares data securely with other entities makes the onslaught of data manageable. With the Red Hat product portfolio, companies can build data pipelines for hybrid cloud deployments that automate data processing on ingest.

The combination of Red Hat OpenShift Data Foundation (formerly Red Hat OpenShift Container Storage), Red Hat Ceph Storage, Red Hat OpenShift (incorporating OpenShift Serverless), and Red Hat AMQ delivers a powerful foundation on which to build data pipelines that can scale to meet data-ingest needs and automatically process incoming data based on organizational needs.

Following a blueprint that leverages user-defined functions to perform operations like data anonymization, tagging, and metadata enrichment, this data-ingest pipeline can serve multiple industries and verticals.
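
As a rough illustration of what such user-defined functions might look like, the sketch below chains three small steps over a single record. The field names (`patient_id`, `size_bytes`), the salt, and the size threshold are hypothetical and are not part of any Red Hat product. Note that salted hashing is strictly pseudonymization; a production anonymization step would need to meet the organization's own compliance rules.

```python
import hashlib
from typing import Callable, Dict, List

Record = Dict[str, object]
Transform = Callable[[Record], Record]

# Hypothetical salt; a real deployment would load this from a secret store.
DEMO_SALT = "demo-salt"


def anonymize(record: Record) -> Record:
    """Replace the (hypothetical) patient_id field with a salted digest."""
    out = dict(record)
    raw = f"{DEMO_SALT}:{out.pop('patient_id')}".encode()
    out["patient_ref"] = hashlib.sha256(raw).hexdigest()[:16]
    return out


def tag(record: Record) -> Record:
    """Add an 'ingested' tag, keeping any tags already present."""
    out = dict(record)
    out["tags"] = sorted(set(out.get("tags", [])) | {"ingested"})
    return out


def enrich(record: Record) -> Record:
    """Attach a coarse size class as extra metadata (threshold is arbitrary)."""
    out = dict(record)
    out["size_class"] = "large" if out.get("size_bytes", 0) > 1_000_000 else "small"
    return out


def run_pipeline(record: Record, steps: List[Transform]) -> Record:
    """Apply each step in order; every step returns a new dict."""
    for step in steps:
        record = step(record)
    return record


if __name__ == "__main__":
    sample = {"patient_id": "P-1001", "modality": "MRI", "size_bytes": 2_500_000}
    print(run_pipeline(sample, [anonymize, tag, enrich]))
```

Because each step copies the record instead of mutating it, the steps can be reordered or reused across pipelines without side effects on the original input.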

A healthcare facility, for example, may want to automatically process images, anonymize data, and provide timely material so researchers can improve processes or even accelerate cures. Banking institutions may want to accelerate payments or use fraud detection to better serve their customers. Insurance providers, for their part, can automate their workflows to accelerate claims adjustments. Events can be automated from sensor data to alert teams to conduct preventive maintenance. The list goes on!

Solidifying the plan

To achieve goals like these, Red Hat uses a combination of automated S3 object bucket notifications, a feature of the Ceph RADOS gateway (RGW) in OpenShift Data Foundation and Red Hat Ceph Storage, data-streaming services available in Red Hat AMQ, and serverless capabilities in Red Hat OpenShift.

As soon as data is ingested, the RGW sends a bucket notification to Red Hat AMQ, which creates an Apache Kafka topic that in turn delivers the notification to OpenShift Serverless. Next, OpenShift Serverless invokes the assigned function to process the incoming data and apply any required transformations.
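
To make the flow above more concrete, here is a minimal sketch of the first thing a serverless function might do: decode the notification and pull out the bucket and object key. It assumes the notification body follows the S3-style event layout (a `Records` list with `eventName`, `s3.bucket.name`, and `s3.object.key`); the sample payload and its values are invented for illustration. How the payload reaches the function (for example, as an HTTP request body) is outside the scope of this sketch.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ObjectEvent:
    event_name: str
    bucket: str
    key: str
    size: int


# Invented sample payload, shaped like an S3-style bucket notification.
SAMPLE = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "ingest"},
            "object": {"key": "scans/img-001.dcm", "size": 52428},
        },
    }]
})


def parse_notification(payload: str) -> List[ObjectEvent]:
    """Turn a JSON notification into a list of typed events."""
    doc = json.loads(payload)
    events = []
    for rec in doc.get("Records", []):
        s3 = rec.get("s3", {})
        obj = s3.get("object", {})
        events.append(ObjectEvent(
            event_name=rec.get("eventName", ""),
            bucket=s3.get("bucket", {}).get("name", ""),
            key=obj.get("key", ""),
            size=int(obj.get("size", 0)),
        ))
    return events


def is_creation(event: ObjectEvent) -> bool:
    """True for events that signal a newly written object."""
    return "ObjectCreated" in event.event_name


if __name__ == "__main__":
    for ev in parse_notification(SAMPLE):
        if is_creation(ev):
            print(f"process {ev.bucket}/{ev.key} ({ev.size} bytes)")
```

Missing fields fall back to empty values rather than raising, so one malformed record does not abort the whole batch; a stricter function might instead reject such records.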

This process can be scaled to operate on multiple incoming data streams, with each stream invoking a different serverless function. Once the data is processed, it’s stored in a data lake where data engineers and data scientists can access it.
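
The mapping from stream to function can be pictured as a routing table. In an actual deployment each stream would typically have its own serverless service and trigger; the in-process registry below is only an assumption-laden sketch of the idea, and the bucket names and handler functions are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]

# Maps a source bucket (one per incoming stream) to its processing function.
ROUTES: Dict[str, Handler] = {}


def route(bucket: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one bucket."""
    def register(fn: Handler) -> Handler:
        ROUTES[bucket] = fn
        return fn
    return register


@route("medical-images")
def process_image(key: str) -> str:
    # Placeholder for an image anonymization step.
    return f"anonymized:{key}"


@route("sensor-data")
def process_sensor(key: str) -> str:
    # Placeholder for a sensor aggregation step.
    return f"aggregated:{key}"


def dispatch(bucket: str, key: str) -> str:
    """Invoke the handler registered for the bucket, or fail loudly."""
    handler = ROUTES.get(bucket)
    if handler is None:
        raise LookupError(f"no function registered for bucket {bucket!r}")
    return handler(key)


if __name__ == "__main__":
    print(dispatch("medical-images", "scans/img-001.dcm"))
    print(dispatch("sensor-data", "line-3/2021-04-22.csv"))
```

Adding a new stream then means registering one more handler, without touching the existing ones.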

[Figure: data pipelines]

Reaping the benefits

Foundation for tracking data lifecycle

The ability to process data at ingest, in real time, is akin to placing a virtual GPS tracker on data at its point of entry into the storage and data-management system. You can embed tags about the data’s source, and any source- and time-specific information, in real time. This yields valuable information you can use later to enhance data management and such ML processes as data classification, feature engineering, and cataloging. It can also improve visibility into data lineage and data provenance as data moves through its lifecycle.
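
A minimal sketch of such tagging is shown below. It only builds the tag dictionary; writing the tags back to the object store is left out. The tag names (`ingest-source`, `ingest-site`, and so on) and the example values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_ingest_tags(
    source: str,
    site: str,
    received_at: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build source- and time-specific tags for a newly ingested object."""
    ts = received_at or datetime.now(timezone.utc)
    if ts.tzinfo is None:
        # Naive timestamps are ambiguous across sites, so reject them.
        raise ValueError("received_at must be timezone-aware")
    ts = ts.astimezone(timezone.utc)
    return {
        "ingest-source": source,
        "ingest-site": site,
        "ingest-time": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "ingest-date": ts.strftime("%Y-%m-%d"),
    }


if __name__ == "__main__":
    when = datetime(2021, 4, 22, 9, 30, 0, tzinfo=timezone.utc)
    print(build_ingest_tags("mri-scanner-7", "clinic-east", when))
```

Normalizing every timestamp to UTC at the point of entry keeps later lineage queries comparable across edge sites in different time zones.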

Cloud native and integrated

The cloud-native way of creating data pipelines involves integrating components—like OpenShift Data Foundation, Red Hat OpenShift Serverless, and Red Hat AMQ—to each perform specific operations.

The resulting architecture automates data processing on ingest, providing standardized tools for operational workflows like container lifecycle management, log management, and troubleshooting. In so doing, the cloud-native approach relieves adopters from having to create disjointed solutions and workflow processes that are complex and time-consuming to manage.

Scalable and flexible

Each component of the data pipelines architecture can be customized and independently scaled depending on administrative and user needs. Knative services, for example, can be customized to suit the data set and organizational processes, requirements, and goals. The storage configuration for source and destination data can be similarly defined. The resulting framework is a highly scalable, customizable solution that can be adapted to individual and organizational needs and policies.

Fast and in real time

As mentioned, once an ingest data pipeline is set up, data can be processed in real time as it’s ingested. This significantly speeds up the process of making data sets available to data scientists, providing up-to-date data from which to train models. The end products are intelligent models that are current with the trends being observed in the new data sets.

Extensible

Automating data pipelines in this fashion can be extended to other areas of data lifecycle management, such as data cataloging and audit logging, using the same building blocks of object bucket notifications, OpenShift Serverless, and Red Hat AMQ. The extensibility of this solution architecture helps organizations add functionality and automation to their data-lifecycle processes without having to re-architect or rewrite their existing solutions.

Real-time data processing for edge-to-core data mobility

Massive parallel data ingestion from edge devices further challenges the ability to manage and use all raw data in a timely manner. As data moves from network edge to the enterprise datacenter, opportunities abound to act on that data.

This data pipeline can help businesses to create scalable and automated solutions that apply data transformations and manipulation as close to data creation as possible, as well as at different points when moving data to the core datacenter. With this approach, companies can streamline notifications, apply custom transformations such as data anonymization, and remove or mask sensitive information before moving their data to a central data store and making it available to data scientists and data engineers for analysis.
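
As one example of masking sensitive information before data leaves the edge, the sketch below redacts email addresses and masks all but the last four digits of card-like numbers in free text. The regular expressions are deliberately simple and are an assumption for illustration; they are not a complete detector for personal data.

```python
import re

# Simple illustrative patterns; real detectors need far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")  # 13 to 16 digits, optional separators


def mask_card(match: "re.Match[str]") -> str:
    """Keep the last four digits and replace the rest with asterisks."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def mask_sensitive(text: str) -> str:
    """Redact emails, then mask card-like numbers."""
    text = EMAIL.sub("[email removed]", text)
    return CARD.sub(mask_card, text)


if __name__ == "__main__":
    print(mask_sensitive("Contact jane@example.com, card 4111 1111 1111 1111."))
    # prints: Contact [email removed], card ************1111.
```

Running a step like this at the edge means the central data store only ever receives the masked form.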

Once set up, the solution is easier to maintain, because it’s fully automated and scales as the amount of ingested data, data sources, and business needs grow.

Increased efficiency

The ingest-data pipeline automation described here enables data engineers to codify and automatically perform many of the daily operations needed to prepare data for use in ML. Once the pipeline is set up, it can scale to accommodate large and varied amounts of incoming data, which frees data engineers and data scientists—who would otherwise have to manually manage, analyze, and process incoming data—to instead focus on other, higher-value activities in the organization.

Conclusion

As organizations amass more and more data from edge devices and end-user premises to the enterprise edge and hybrid and multicloud environments, a mounting challenge is how to ingest and prepare all that data in a continuous stream so that it’s both useful and timely.

A data pipeline that ingests, prepares, and manages data on an automated workflow from its inception, transforming it where needed and sharing it more securely with other entities, does more than make the constant onslaught of data manageable: it makes that data usable in near real time.

With application and data services provided by such tools as Red Hat Ceph Storage, Red Hat AMQ, and Red Hat OpenShift, organizations can harness the value of their data with near real-time data processing and act on data at its inception.

Automated data pipelines decouple event triggers from processes so those processes can evolve with requirements. In addition, because events spawn application processes as needed, there’s no need to predict workloads. Processing is done on demand, and resources scale automatically.

Read more about automating data pipelines in this Toolbox article, check out this informative interview at RTInsights, and find out more about Red Hat solutions in this technical overview.

About the author

Uday Boppana

Senior Principal Product Manager

Uday Boppana is a Senior Principal Product Manager at Red Hat, responsible for Big Data and AI/ML data services. He has experience working in AI/ML, hybrid cloud, datacenter, data services, and storage solutions in different roles and with a variety of technologies. In prior roles, he worked in product management, technical marketing, solutions architecture, and in leadership and technical positions in engineering. He is a regular speaker at industry conferences and events related to AI/ML and hybrid cloud data solutions.
