Big data ecosystems are like ogres: they have layers.
Big data components pile up in layers, building a stack. It’s not as simple as taking data and turning it into insights. Big data analytics tools impose a process that raw data must pass through before it finally produces information-driven action in a company.
Get our Big Data Requirements Template
Data must first be ingested from sources, translated and stored, then analyzed before final presentation in an understandable format. It’s a long, arduous process that can take months or even years to implement. But the rewards can be game changing: a solid big data workflow can be a huge differentiator for a business.
In this article, we’ll introduce each big data component, explain the big data ecosystem overall, explain big data infrastructure and describe some helpful tools to accomplish it all.
Ingestion and Storage
Extract, transform and load (ETL) is the process of preparing data for analysis. While the traditional ETL workflow is becoming outdated, it still serves as a general term for the data preparation layers of a big data ecosystem. Concepts like data wrangling and extract, load, transform are becoming more prominent, but all describe the pre-analysis prep work. Working with big data requires significantly more prep work than smaller forms of analytics.
With different data structures and formats, it’s essential to approach data analysis with a thorough plan that addresses all incoming data. Sometimes you’re taking in completely unstructured audio and video, other times it’s simply a lot of perfectly-structured, organized data, but all with differing schemas, requiring realignment.
Before you get down to the nitty-gritty of actually analyzing the data, you need a homogenous pool of uniformly organized data, staged in a central repository such as a data lake or warehouse.
The first two layers of a big data ecosystem, ingestion and storage, include ETL and are worth exploring together.
The ingestion layer is the very first step of pulling in raw data. It comes from internal sources, relational databases, nonrelational databases and more. It can even come from social media, emails, phone calls or somewhere else. There are two kinds of data ingestion:
- Batch, in which large groups of data are gathered and delivered together. Data collection can be triggered by conditions, launched on a schedule or ad hoc.
- Streaming, which is a continuous flow of data. This is necessary for real-time data analytics. It locates and pulls data as it’s generated. This requires more resources because it is constantly monitoring for changes in data pools.
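As a rough sketch, the two ingestion modes can be contrasted in a few lines of illustrative Python. The function names and the in-memory source here are hypothetical, just to make the difference concrete:

```python
from typing import Iterator, List

def batch_ingest(source: List[dict], batch_size: int = 100) -> Iterator[List[dict]]:
    """Batch mode: gather records into fixed-size groups and deliver each group together."""
    for i in range(0, len(source), batch_size):
        yield source[i:i + batch_size]

def stream_ingest(source: Iterator[dict]) -> Iterator[dict]:
    """Streaming mode: pull each record through as soon as it is generated."""
    for record in source:
        yield record  # downstream consumers see data in near real time

# Illustrative usage with a toy event source
events = [{"id": n} for n in range(250)]
batches = list(batch_ingest(events, batch_size=100))  # delivered as 3 groups
streamed = list(stream_ingest(iter(events)))          # delivered one record at a time
```

The streaming path looks simpler, but in practice it is the costlier one: something has to keep that iterator fed by constantly monitoring the source for new data.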
It’s all about just getting the data into the system. Parsing and organizing comes later. It’s like when a dam breaks; the valley below is inundated. It’s quick, it’s massive and it’s messy. This presents lots of challenges, some of which are:
- Maintaining security and compliance: With so much data flowing in, making sure that any single dataset isn’t introducing security vulnerabilities is a legitimate worry. Additionally, legal regulations don’t go away just because there is so much content to sift through. All data must be obtained ethically and within the bounds of the law, which can be difficult to manage and validate with such large quantities.
- Variable data speeds: Data sources have different infrastructures for transporting data. A particularly slow source with low export resources can bog down the entire process and even introduce errors if its speed lags too far behind the other sources.
- Ensuring data quality: Just because there is a large sum of data available doesn’t mean it’s all relevant and useful. Having too much irrelevant, tangential or even incorrect, corrupt and incomplete data can cause issues in analysis and processing down the line. The next step of ETL helps address this.
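That data quality challenge is often met with validation at the door. The sketch below is a minimal, hypothetical example of screening incoming records before they reach storage; the required fields are invented for illustration:

```python
REQUIRED_FIELDS = {"id", "timestamp", "value"}  # illustrative schema

def is_valid(record: dict) -> bool:
    """Reject incomplete or corrupt records before they reach storage."""
    if not REQUIRED_FIELDS.issubset(record):
        return False  # incomplete: a required field is missing
    if not isinstance(record["value"], (int, float)):
        return False  # corrupt: the value is not numeric
    return True

incoming = [
    {"id": 1, "timestamp": "2021-01-01T00:00:00", "value": 3.5},
    {"id": 2, "timestamp": "2021-01-01T00:01:00"},                  # incomplete
    {"id": 3, "timestamp": "2021-01-01T00:02:00", "value": "n/a"},  # corrupt
]
clean = [r for r in incoming if is_valid(r)]  # only the first record survives
```

Real pipelines layer on far more checks than this, but the principle is the same: irrelevant, corrupt and incomplete data gets filtered early so it can’t cause issues in analysis down the line.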
Data Massaging, Cleansing and Organizing
As the data comes in, it needs to be sorted and translated appropriately before it can be used for analysis. Because there is so much data that needs to be analyzed in big data, getting as close to uniform organization as possible is essential to process it all in a timely manner in the actual analysis stage. The components in the storage layer are responsible for making data readable, homogenous and efficient.
Data arrives in different formats and schemas. It’s up to this layer to unify the organization of all inbound data. This task will vary for each data project, whether the data is structured or unstructured. If it’s the latter, the process gets much more convoluted.
Depending on the form of unstructured data, different types of translation need to happen. For things like social media posts, emails, letters and anything in written language, natural language processing software needs to be utilized. Formats like videos and images rely on techniques such as computer vision and audio signal processing to break pixels and sound down into chunks that can be grouped for analysis.
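For written language, the first translation step is usually as simple as turning free text into countable tokens. Here’s a minimal sketch of that idea using only the standard library; real NLP software does far more, and the sample post is invented:

```python
import re
from collections import Counter

def tokenize(text: str) -> list:
    """Lowercase written language and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

post = "Loving the new release! The new dashboard is great."
tokens = tokenize(post)
word_counts = Counter(tokens)  # a frequency table a downstream model could consume
```

Even this tiny step turns unstructured text into something with a schema: every post becomes a uniform mapping of words to counts.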
Once all the data is converted into readable formats, it needs to be organized into a uniform schema.
A schema simply defines the characteristics of a dataset, much like the X and Y axes of a spreadsheet or a graph. It’s a roadmap to data points. For structured data, aligning schemas is all that is needed. For unstructured and semistructured data, semantics need to be added before it can be properly organized. Sometimes semantics come pre-loaded in semantic tags and metadata. For example, a photo taken on a smartphone will carry time stamps, geolocation and user/device information. The metadata can then be used to help sort the data or give it deeper insights in the actual analytics.
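Schema alignment itself often boils down to renaming fields so every source speaks the same language. The following is a minimal sketch under the assumption of two hypothetical sources; the field names and mappings are invented for illustration:

```python
# Hypothetical target schema and per-source field mappings
TARGET_SCHEMA = ("user_id", "event_time", "amount")
FIELD_MAPS = {
    "source_a": {"uid": "user_id", "ts": "event_time", "amt": "amount"},
    "source_b": {"customer": "user_id", "when": "event_time", "total": "amount"},
}

def align(record: dict, source: str) -> dict:
    """Rename a source's fields so every record shares the target schema."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = align({"uid": 7, "ts": "2021-06-01", "amt": 9.99}, "source_a")
b = align({"customer": 7, "when": "2021-06-01", "total": 9.99}, "source_b")
# a and b now have identical keys despite coming from differently-shaped sources
```

Each new data project needs its own mapping table like `FIELD_MAPS`, which is why this task varies so much from project to project.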
Once all the data is as similar as can be, it needs to be cleansed. This means getting rid of redundant and irrelevant information within the data.
When data comes from external sources, it’s very common for some of those sources to duplicate or replicate each other. Often they’re just aggregations of public information, meaning there are hard limits on the variety of information available in similar databases. Other times, the info contained in the database is just irrelevant and must be purged from the complete dataset that will be used for analysis.
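Deduplication against overlapping sources can be sketched in a few lines. This is an illustrative example, assuming records can be identified by a small set of key fields:

```python
def deduplicate(records: list, key_fields: tuple) -> list:
    """Keep the first occurrence of each record, dropping replicas from overlapping sources."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

merged = [
    {"company": "Acme", "year": 2020, "revenue": 100},
    {"company": "Acme", "year": 2020, "revenue": 100},  # replica from a second source
    {"company": "Beta", "year": 2020, "revenue": 250},
]
cleaned = deduplicate(merged, key_fields=("company", "year"))
```

The hard part in practice is choosing `key_fields`: near-duplicates with slightly different values for the same entity need fuzzier matching than this exact-key approach.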
After all the data is converted, organized and cleaned, it is ready for storage and staging for analysis.
The final step of ETL is the loading process. This is where the converted data is stored in a data lake or warehouse and eventually processed. It’s the actual embodiment of big data: a huge set of usable, homogenous data, as opposed to simply a large collection of random, incohesive data.
Many consider the data lake/warehouse the most essential component of a big data ecosystem. It needs to contain only thorough, relevant data to make insights as valuable as possible. It must be efficient with as little redundancy as possible to allow for quicker processing. It needs to be accessible with a large output bandwidth for the same reason.
Lakes differ from warehouses in that they preserve the original raw data, meaning little has been done in the transformation stage other than data quality assurance and redundancy reduction. Comparatively, data stored in a warehouse is much more focused on the specific task of analysis, and is consequently much less useful for other analysis efforts. Because of the focus, warehouses store much less data and typically produce quicker results.
The tradeoff for lakes is an ability to produce deeper, more robust insights on markets, industries and customers as a whole. Talend’s blog puts it well, saying data warehouses are for business professionals while lakes are for data scientists.
This also means that a lot more storage is required for a lake, along with more significant transforming efforts down the line. Modern capabilities and the rise of lakes have created a modification of extract, transform and load: extract, load and transform.
Cloud and other advanced technologies have made limits on data storage a secondary concern, and for many projects, the sentiment has become focused on storing as much accessible data as possible.
There are obvious perks to this: the more data you have, the more accurate any insights you develop will be, and the more confident you can be in them. But it’s also a change in methodology from traditional ETL.
Extract, load and transform (ELT) is the process used to create data lakes. The data is not transformed or dissected until the analysis stage. It preserves the initial integrity of the data, meaning no potential insights are lost in the transformation stage permanently.
Data lakes are preferred for recurring, different queries on the complete dataset for this reason. With a warehouse, you most likely can’t come back to the stored data to run a different analysis. With a lake, you can.
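The ELT pattern can be boiled down to a toy sketch: load raw records untouched, and transform only when a question is asked. The `data_lake` list here is a stand-in for raw object storage, and the function names are illustrative:

```python
import json

data_lake = []  # stands in for raw object storage

def load_raw(record: dict) -> None:
    """ELT: persist the record untouched, so no potential insight is lost up front."""
    data_lake.append(json.dumps(record))  # original integrity preserved

def transform_for_query(field: str) -> list:
    """Transform only when a specific analysis asks for it."""
    return [json.loads(r)[field] for r in data_lake if field in json.loads(r)]

load_raw({"region": "EU", "sales": 120, "notes": "promo week"})
load_raw({"region": "US", "sales": 340})

# The same raw pool answers different, later questions:
sales = transform_for_query("sales")
regions = transform_for_query("region")
```

A warehouse-style ETL pipeline would have decided up front which fields mattered; here, the untouched `notes` field is still sitting in the lake waiting for a future query.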
Advances in data storage, processing power and data delivery tech are changing not just how much data we can work with, but how we approach it as ELT and other data preprocessing techniques become more and more prominent.
Analysis is the big data component where all the dirty work happens.
You’ve done all the work to find, ingest and prepare the raw data. Now it’s time to crunch it all together. In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
There are four types of analytics on big data: diagnostic, descriptive, predictive and prescriptive.
- Diagnostic: Explains why a problem is happening. Big data allows analytics to take a deep dive into things like customer information, marketing metrics and key performance indicators to explain why certain actions didn’t produce the expected results. Projects are undertaken with an expectation of certain results based on certain estimations of markets, customers and other similar criteria. Diagnostic analytics digs into which assumed contributors didn’t meet their projected metrics.
- Descriptive: Describes the current state of a business through historical data. It summarizes past trends in things like sales rates, seasonal impacts and more. In big data, the use of far-reaching market data and customer insights helps contextualize internal metrics and sharpen a business’s understanding of its position amongst its competitors. In boiled-down terms, it answers “what” questions.
- Predictive: Projects future results based on historical data. By highlighting patterns and evaluating trajectories of relevant metrics, predictive analytics estimates future outcomes.
- Prescriptive: Takes predictive analytics a step further by projecting best future efforts. By tweaking inputs and changing actions, prescriptive analytics allows businesses to decide how to put their best foot forward. Different actions will yield different results, and prescriptive analytics helps decision makers try to decide the best way to proceed.
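The difference between the descriptive and predictive categories can be made concrete with a toy example. The sales figures below are invented, and the forecast is a deliberately naive trend projection, not a real predictive model:

```python
from statistics import mean

monthly_sales = [100, 110, 125, 135, 150, 160]  # toy historical data

# Descriptive: what happened?
average = mean(monthly_sales)
growth = monthly_sales[-1] - monthly_sales[0]

# Predictive: project the next month from the average month-over-month change
deltas = [b - a for a, b in zip(monthly_sales, monthly_sales[1:])]
forecast = monthly_sales[-1] + mean(deltas)
```

Diagnostic analytics would then ask why a month fell short of its projection, and prescriptive analytics would vary the inputs to find the action that maximizes the next forecast.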
Just as the ETL layer is evolving, so is the analysis layer. AI and machine learning are moving the goalposts for what analysis can do, especially in the predictive and prescriptive landscapes. We can now discover insights impossible to reach by human analysis.
The final big data component involves presenting the information in a format digestible to the end-user. This can materialize in the forms of tables, advanced visualizations and even single numbers if requested. This is what businesses use to pull the trigger on new processes.
The most important thing in this layer is making sure the intent and meaning of the output is understandable. Up until this point, every person actively involved in the process has been a data scientist, or at least literate in data science. But in the consumption layer, executives and decision-makers enter the picture. They need to be able to interpret what the data is saying.
There’s a robust category of distinct products for this stage, known as enterprise reporting. That’s how essential it is.
Visualizations come in the form of real-time dashboards, charts, graphs, graphics and maps, just to name a few. Many rely on mobile and cloud capabilities so that data is accessible from anywhere.
Apache is a market standard for big data, with open-source software offerings that address each layer. For lower-budget projects and companies that don’t want to purchase a bunch of machines to handle the processing requirements of big data, Apache’s line of products is often the go-to choice to mix and match across the ingestion, storage, analysis and consumption layers.
Almost all big data analytics projects utilize Hadoop, its framework for distributed storage and processing across clusters, or Spark, its general-purpose analytics engine. Airflow and Kafka can assist with the ingestion component, NiFi can handle ETL, Spark is used for analysis, and Superset is capable of producing visualizations for the consumption layer.
In this article, we discussed the components of big data: ingestion, transformation and storage, analysis and consumption. We outlined the importance and details of each step and detailed some of the tools and uses for each.
If you’re looking for a big data analytics solution, SelectHub’s expert analysis can help you along the way. The different components carry different weights for different companies and projects. Our custom leaderboard can help you prioritize vendors based on what’s important to you. If you’re just beginning to explore the world of big data, we have a library of articles just like this one to explain it all, including a crash course and “What Is Big Data?” explainer.
Which component do you think is the most important? What tools have you used for each layer? Let us know in the comments.