It isn't easy to provide an exact definition of big data. In general, big data consists of assembling, compiling, processing, and delivering insightful outputs from large datasets. Big data plays a significant role in enterprise solutions, requiring the computing power of multiple systems and "outside the box" technologies and tactics.
This article offers a foundational understanding of big data, what it is, how it works, and why we need it.
What big data consists of
In general, big data consists of two things:
- Large datasets
- Computing programs and technologies to process those large datasets
A "large dataset" is content that exceeds the ability of a single computing system or conventional tool to manipulate. Datasets vary in size from one organization to the next.
How big data processing differs from other data processing
The size, processing speed, and characteristics of the data vary from one dataset to the next. The one constant is that a single computing system cannot complete the work on its own. Big data systems pool multiple machines to analyze the data collectively, so the data can be stored and accessed in ways that traditional methods cannot support.
With that in mind, there are "three Vs" to remember for big data: volume, variety, and velocity.
Volume is the amount of information that big data systems process. The sheer scale of the content being collected and analyzed demands more care, more computing power, and a clear understanding of how the data is processed. Completing requests requires larger computer systems that can allocate the appropriate resources, which is why algorithms and clusters are crucial here.
Variety covers the many kinds of data that big data systems collect, which brings its own challenges. These systems accumulate and merge various information types, from social media feeds to system logs, into a single integrated system. The media types and formats, such as images or video files, also vary.
Big data systems differ from traditional data processing because they accept the material closer to its raw state and apply changes to the raw data in memory during processing.
Velocity is the speed at which data moves through the system. Although the data ends up in a single system, it arrives from many different sources. Because of this, the speed at which the data is processed plays an essential role in gaining real-time insights and understanding.
New and valuable information emerges as data continues to be collected, processed, and analyzed. Keeping pace with that flow requires a more robust engine to deliver the desired analysis, and the system must be designed carefully to prevent failures.
The big data lifecycle
What does the big data lifecycle look like? Below are several common steps in big data system processing:
- Collect the data into the system.
- Push the data into storage.
- Calculate and organize the data.
- Conceptualize the output.
1. Collect the data into the system
Data ingestion is simply the input of raw data into the system. Completing this with high accuracy and high efficiency depends heavily on the quality of the data sources, the data format, and what state the data is in before processing.
Dedicated data ingestion tools exist to aid in the process. Technologies like Apache Flume can aggregate and import server and application logs. Apache Sqoop can import data from relational databases into big data systems. In addition, the Apache Gobblin framework helps normalize the output of these tools near the end of the pipeline.
The ingestion process is sometimes called extract, transform, and load (ETL). Some layers in this process resemble traditional processing platforms, including modifying the raw data received, organizing and cataloging the data, and filtering out bad or unneeded data that does not meet certain requirements. The rawer the data you input, the more flexible you can be with that data further down the ingestion pipeline.
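The filtering and normalizing steps described above can be sketched in a few lines of Python. This is only an illustration of the extract, transform, and load idea, not how Flume, Sqoop, or Gobblin work internally. The record fields (`host`, `status`, `bytes`), the sample lines, and the function names are all hypothetical.

```python
import json

# Hypothetical raw input: one JSON document per line, as a log shipper might emit.
RAW_LINES = [
    '{"host": "web01", "status": "200", "bytes": "512"}',
    '{"host": "web02", "status": "500", "bytes": "0"}',
    'not valid json',
    '{"host": "web01", "status": "200"}',
]

# Fields a record must contain to pass the filter (an assumption for this sketch).
REQUIRED_FIELDS = ("host", "status", "bytes")


def extract(lines):
    """Parse each raw line, skipping lines that are not valid JSON objects."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def transform(records):
    """Drop records missing required fields and convert numeric strings to integers."""
    for record in records:
        if not all(field in record for field in REQUIRED_FIELDS):
            continue
        yield {
            "host": record["host"],
            "status": int(record["status"]),
            "bytes": int(record["bytes"]),
        }


def load(records):
    """Stand-in for writing to storage: collect the cleaned records in a list."""
    return list(records)


cleaned = load(transform(extract(RAW_LINES)))
print(cleaned)
```

Because each stage is a generator, records flow through one at a time rather than being held in memory all at once, which is a common design choice when the input is large.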
2. Push the data into storage
After ingestion, the data moves to storage, allowing it to be persisted to disk reliably. This task requires more complex storage systems due to the volume of data and the velocity at which it enters.
One common solution is Apache Hadoop's HDFS (Hadoop Distributed File System), which stores large quantities of raw data across multiple machines in a cluster. This approach allows the data to be accessed with available and coordinated resources and can handle failures gracefully. You may also select other technologies, such as Ceph.
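To see why spreading data across a cluster helps with failures, consider this toy simulation. It is a sketch of the general idea of splitting a file into blocks and keeping more than one copy of each, not HDFS's or Ceph's actual placement algorithm. The node names, the tiny block size, and the round-robin placement are assumptions chosen to keep the example short.

```python
BLOCK_SIZE = 4      # bytes per block; tiny on purpose for illustration
REPLICATION = 2     # number of copies kept of each block
NODES = ["node-a", "node-b", "node-c"]  # hypothetical storage nodes


def store_file(data, nodes=NODES, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into blocks and copy each block to `replication` distinct nodes."""
    cluster = {name: {} for name in nodes}
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        for offset in range(replication):
            # Simple round-robin placement (an assumption, not a real policy).
            node = nodes[(index + offset) % len(nodes)]
            cluster[node][index] = block
    return cluster, len(blocks)


def read_file(cluster, block_count, failed=()):
    """Reassemble the file, reading each block from any node that is still up."""
    parts = []
    for index in range(block_count):
        for name, held_blocks in cluster.items():
            if name not in failed and index in held_blocks:
                parts.append(held_blocks[index])
                break
        else:
            raise OSError(f"block {index} has no surviving replica")
    return b"".join(parts)


DATA = b"big data example"
cluster, block_count = store_file(DATA)

# The file is still readable after losing one node.
print(read_file(cluster, block_count, failed=("node-a",)))
```

With two copies of every block, any single node can fail without data loss. Losing two nodes at once can make a block unreadable, which is why production systems let you tune the replication factor.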
Storing data in other distributed systems, such as NoSQL databases, is also possible. NoSQL databases are designed to handle assorted data and offer the same broad fault tolerance as other distributed systems. Several solutions are available, depending on the desired output, organization, and presentation of the data.
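As a rough illustration of what "assorted data" means in practice, the sketch below stores records of different shapes side by side and queries them by field. The `DocumentStore` class is hypothetical and lives in a single process; it does not represent any particular NoSQL product's API and offers none of the distribution or fault tolerance a real system provides.

```python
class DocumentStore:
    """A toy in-memory document store keyed by document ID (hypothetical)."""

    def __init__(self):
        self._documents = {}

    def put(self, doc_id, document):
        """Save a document; no fixed schema is enforced."""
        self._documents[doc_id] = dict(document)

    def find(self, **criteria):
        """Return the IDs of documents whose fields match every criterion."""
        return [
            doc_id
            for doc_id, doc in self._documents.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


docs = DocumentStore()
# Each record carries different fields, which a fixed relational table would not allow.
docs.put("log-1", {"type": "syslog", "host": "web01", "message": "disk at 91%"})
docs.put("post-7", {"type": "social", "user": "alice", "text": "hello", "likes": 12})
docs.put("img-3", {"type": "image", "host": "web01", "width": 800, "height": 600})

web01_ids = docs.find(host="web01")
print(web01_ids)
```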
3. Calculate and organize the data
When the data is ready, data computations begin. This stage differs the most from one project to the next because the best way to proceed depends on the desired outcomes: What do you want the final product to look like? What part of the data is important to focus on? And why? The data is usually processed several times, utilizing either one tool or many, until the final output produces the desired result.
One standard method to compute a large dataset is batch processing. This method breaks the work into sections and assigns each section to a separate machine. The intermediate results are then redistributed and, finally, assessed and assembled into the desired final product. The batch processing method is most helpful with large amounts of data that need heavy processing and complex computations. Apache Hadoop's MapReduce utilizes this strategy.
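The split, redistribute, and combine pattern can be shown with a word count, the customary first example for this style of processing. The sketch below runs in a single Python process and only mimics the phases; it does not use Hadoop, and the function names and sample lines are hypothetical.

```python
from collections import defaultdict


def split_work(lines, workers):
    """Break the input into one section per worker."""
    return [lines[i::workers] for i in range(workers)]


def map_phase(chunk):
    """Each worker emits a (word, 1) pair for every word in its section."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Redistribute the pairs so that all values for a given word end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the grouped values into the final count for each word."""
    return {key: sum(values) for key, values in grouped.items()}


LINES = ["big data is big", "data moves fast", "big clusters process data"]

chunks = split_work(LINES, workers=2)
mapped = [map_phase(chunk) for chunk in chunks]  # would run in parallel on a cluster
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster, each call to `map_phase` would run on a different machine, and the shuffle step would move data over the network. That overhead is part of why this approach suits large jobs that are not time-sensitive.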
Another method is real-time processing. This method requires the system to handle new information as it is fed in, consistently and frequently. Stream processing, which operates on a constant stream of individual data items, is one way of achieving this. Real-time processing also relies on in-memory computing, which uses the cluster's memory to avoid writing back to disk. This method is better for smaller groups of data. Examples of real-time processing tools include Apache Storm and Apache Spark.
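The following sketch shows the core idea of stream processing: each item is handled as it arrives, and only a small window of recent values is kept in memory. It uses plain Python generators rather than Storm or Spark, and the `sensor_readings` source and its values are made up for the example.

```python
from collections import deque


def sensor_readings():
    """Hypothetical data source that yields one reading at a time."""
    for value in [10, 20, 30, 40, 50]:
        yield value


def rolling_average(stream, window=3):
    """Yield the average of the most recent `window` items after each new item."""
    recent = deque(maxlen=window)  # older items fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


averages = list(rolling_average(sensor_readings(), window=3))
print(averages)
```

An updated result is available after every item, without waiting for the full dataset and without writing anything to disk.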
4. Conceptualize the output
Visualizing data makes it easier for people to see the bigger picture in the collected data and to spot trends. Real-time processing is recommended here because it makes the analyzed data easier to interpret as it changes. Large deltas in the metrics generally represent meaningful impacts. Tools like Prometheus help process data streams with strategies that allow the user to make educated decisions.
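One simple way to surface large deltas before charting a metric is to compare each sample with the one before it. The sketch below does this in plain Python; it is not Prometheus's query language or API, and the sample values and the threshold of 50 are arbitrary assumptions.

```python
def large_deltas(samples, threshold):
    """Return (timestamp, delta) for each sample that moved at least `threshold`."""
    flagged = []
    for (_, previous), (timestamp, current) in zip(samples, samples[1:]):
        delta = current - previous
        if abs(delta) >= threshold:
            flagged.append((timestamp, delta))
    return flagged


# Hypothetical (timestamp, value) pairs for a single metric.
SAMPLES = [(0, 100), (1, 104), (2, 98), (3, 180), (4, 175), (5, 90)]

flagged = large_deltas(SAMPLES, threshold=50)
print(flagged)
```

The flagged points are the ones worth highlighting on a dashboard, since the small fluctuations in between are more likely to be noise.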
Elastic Stack, formerly known as the ELK stack, is a common way to visualize data. Elastic Stack consists of several tools: Logstash (collecting the data), Elasticsearch (cataloging the data), and Kibana (conceptualizing the data). Another popular technology in data science is a data "notebook," which allows interactive work when handling data and formats it effectively for sharing, collaborating, and analyzing. Jupyter Notebook is one favored technology for this type of visualization.
A big data glossary
It is best practice to have a centralized list of concepts and their definitions. Below is an explanation of these concepts for future reference:
- Big data: A general term used for larger-than-normal datasets that traditional compute methods cannot handle due to velocity, variety, and volume. This term also references tools and technologies that work with this data.
- Batch processing: A computational strategy for processing large datasets in bulk. The process begins, and the system provides the final product after some time. Because of the volume involved, this is intended for data that is not time-sensitive.
- Cluster computing: The pooling of multiple computing systems to combine resources to complete a desired task. System clusters require a centralized management system to manage and coordinate tasks with individual computers.
- In-memory computing: A strategy involving moving the entire dataset to the cluster's combined memory. Computations are held in memory and not stored on the disk, which gives an edge over I/O-bound systems.
- Machine learning: The use and development of computing systems that can learn and adapt by using algorithms to draw inferences from patterns in the data without explicit instructions.
- NoSQL: An approach to database design that enables data storage and queries outside the conventional structures found in relational databases.
- Stream processing: A strategy of real-time analysis that computes over individual data items as they move through a computing system. Time-sensitive data operations warrant this approach.
More and more organizations are adopting big data systems. These systems provide invaluable insight and analysis through inferences and replace traditional business tools and technology.
Big data is not perfect for every organization, but it provides a service for specific workloads. As big data continues to evolve, it is uniquely suited for taking existing data and exploring depths that are impossible to see in raw form.