Data Mesh for Data Warehousing
Data mesh is a new paradigm that applies software engineering practices, such as microservice architecture, to data. But is that appropriate for data? Data mesh takes a product approach, in which data is continually assessed to ensure that it is relevant, valuable, and actually used.
This approach contrasts with a single monolithic system that houses all of the company's data. Data mesh decentralizes data management, putting control in the hands of the people who are most familiar with the data and its applications.
If we simply fall for this new technique and uproot our companies, we will have partly missed the point. As data grows in volume and complexity, we must organize it in a modular, Agile manner. This is difficult because data is intrinsically less Agile than software, for two reasons: (a) any small modification to a data model has a cascading effect on everything downstream, and (b) most firms have already built massive, complicated data warehouses.
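The cascading effect in (a) can be made concrete with a small lineage graph. The sketch below is illustrative only: the table names and the downstream_of helper are hypothetical, and a real warehouse would read this lineage from its own metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each table feeds the tables listed as its value.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fact_sales", "dim_customer"],
    "fact_sales": ["sales_dashboard", "revenue_forecast"],
    "dim_customer": ["sales_dashboard"],
    "sales_dashboard": [],
    "revenue_forecast": [],
}


def downstream_of(table, lineage):
    """Return every table affected by a change to `table`, breadth-first."""
    affected = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in affected:
            continue  # already reached through another path
        affected.append(current)
        queue.extend(lineage.get(current, []))
    return affected


# A change to one staging table touches four downstream objects.
print(downstream_of("stg_orders", LINEAGE))
```

Even in this toy graph, one edit to a staging table reaches every model and report below it, which is why data model changes are harder to ship in small increments than code changes.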
Creating smart data models that reflect both the raw data and a suitable level of granularity across the enterprise takes effort and teamwork.
What Is a Data Mesh?
A data mesh is domain-driven design (DDD) applied to data. In DDD, the domains of an organization dictate the structure of the data, so each domain drives how its data is organized and the logic applied to it.
DDD makes at least as much sense here as it does in software engineering, since data can be seen as entities and attributes, both of which are inherently domain-driven. Data mesh, as introduced by Zhamak Dehghani, brings product thinking to the data realm, where data products are exposed like APIs. To be "discoverable", data must be well-defined and documented.
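One way to picture a documented, discoverable data product is as a small descriptor registered in a catalog. This is a minimal sketch under stated assumptions: the DataProduct fields, the Catalog class, and the example product are all hypothetical, not part of any data mesh specification or tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes for one data product."""
    name: str
    domain: str
    owner: str
    description: str
    schema: dict                      # column name -> type name
    tags: list = field(default_factory=list)


class Catalog:
    """Toy in-memory catalog; a real mesh would use a shared catalog service."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Reject products without a description or schema, so everything
        # in the catalog is defined well enough to be found and used.
        if not product.description.strip():
            raise ValueError(f"{product.name}: description is required")
        if not product.schema:
            raise ValueError(f"{product.name}: schema is required")
        self._products[product.name] = product

    def search(self, keyword):
        """Return names of products whose description or tags mention keyword."""
        keyword = keyword.lower()
        return sorted(
            p.name
            for p in self._products.values()
            if keyword in p.description.lower()
            or keyword in (t.lower() for t in p.tags)
        )


catalog = Catalog()
catalog.register(DataProduct(
    name="sales.orders",
    domain="sales",
    owner="sales-data-team",
    description="One row per confirmed customer order.",
    schema={"order_id": "int", "customer_id": "int", "amount": "decimal"},
    tags=["orders", "revenue"],
))
print(catalog.search("revenue"))
```

The point of the design is that documentation is enforced at registration time: a product that cannot describe itself never becomes visible to consumers.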
Traditional data marts have a lot in common with the data mesh idea: they are aggregations of data in a data warehouse that are generally domain-driven and managed by a small team in a more Agile way. They are used to gain insights and solve particular strategic problems.
What Is a Data Warehouse?
A data warehouse is a consolidated collection of data that can be analyzed to aid decision-making. Data flows into the warehouse on a regular basis from transactional systems, relational databases, and other sources. Business analysts, data engineers, data scientists, and decision makers access the data through BI tools, SQL clients, and other analytics programs.
Should You Use a Data Warehouse?
Data warehouses provide us with a comprehensive perspective of our company's information. The Inmon Enterprise Data Warehouse (EDW) strategy and the Kimball dimensional modeling technique are the two main schools of thought in data warehousing.
Fast writes and the typical extract, transform, and load (ETL) pattern are strengths of EDWs. Dimensional modeling brought faster scans and a less normalized approach. New cloud data warehouse solutions, such as Redshift and Azure Synapse, make it simpler to employ an extract, load, and transform (ELT) strategy, in which the data warehouse performs the transformations. As a result, data warehouses, especially the newer ones on the market, still offer a lot of flexibility.
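The ELT pattern can be sketched in a few lines. The example below is an assumption-laden illustration: an in-memory SQLite database stands in for a cloud warehouse, and the raw_orders and orders_clean tables are hypothetical. What matters is the order of operations: raw data is loaded first, and the engine itself does the transformation in SQL.

```python
import sqlite3

# SQLite stands in for the warehouse here; this is an assumption for the demo.
conn = sqlite3.connect(":memory:")

# Load: land the extracted records as-is, with everything still text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [("1", "20.50", "de"), ("2", "5.00", "DE"), ("3", "12.50", "fr")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: the warehouse casts types and standardizes values in SQL.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
""")

totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders_clean "
    "GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

In an ETL pipeline, the casting and cleanup would instead happen in application code before the insert, so the warehouse would never hold the raw rows.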
On the other hand, if you've ever worked on a huge, monolithic data warehouse, you know how difficult it is to make changes to business logic, add new data, or fill data gaps. The work is slow and inconvenient, and it takes a long time for the company to see benefits.
EDWs were developed at a time when storage was expensive and compute was relatively limited, so they store heterogeneous data in a (comparatively) standardized way. Normalization ensures that there is no repetition and that writes are quick, but it comes at a cost: expensive joins and reads. Modern data storage approaches tend to stray from complete normalization. First or second normal form is frequently sufficient, and there is no need, for performance reasons, to force your data into sixth normal form (6NF) or domain-key normal form (DKNF).
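The trade-off can be seen side by side in a small example. This is a sketch, not a benchmark: it uses an in-memory SQLite database and hypothetical customer, orders, and orders_wide tables, and it counts rows touched rather than measuring time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer name is stored exactly once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount REAL
    );
    -- Denormalized: the customer name is repeated on every order row.
    CREATE TABLE orders_wide (
        order_id INTEGER PRIMARY KEY, customer_name TEXT, amount REAL
    );

    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 1, 25.0);
    INSERT INTO orders_wide VALUES
        (10, 'Acme', 100.0), (11, 'Acme', 50.0), (12, 'Acme', 25.0);
""")

# Write path: renaming the customer touches one row when normalized,
# but every order row when denormalized.
normalized_writes = conn.execute(
    "UPDATE customer SET name = 'Acme Corp' WHERE customer_id = 1"
).rowcount
denormalized_writes = conn.execute(
    "UPDATE orders_wide SET customer_name = 'Acme Corp' "
    "WHERE customer_name = 'Acme'"
).rowcount

# Read path: the normalized model needs a join; the wide table does not.
normalized_total = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM orders o "
    "JOIN customer c ON c.customer_id = o.customer_id GROUP BY c.name"
).fetchall()
denormalized_total = conn.execute(
    "SELECT customer_name, SUM(amount) FROM orders_wide GROUP BY customer_name"
).fetchall()

print(normalized_writes, denormalized_writes)
print(normalized_total, denormalized_total)
```

Both models return the same answer; they differ in where the cost lands, on the write for the wide table and on the read for the normalized one.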
Should You Use a Data Mesh?
When done correctly, a data mesh identifies who owns the data and, as a result, who can help add new features, explain anomalies, and collaborate with business and technical teams to close gaps.
Data is divided into domains that do not require thorough normalization. Fully normalized data is no longer required: storage is inexpensive, and normalization raises join complexity for BI and advanced analytics use cases. Instead, most teams I've worked on have adopted a "Starflake" schema, a hybrid of the snowflake and star schemas. As a result, we've been able to meet the needs of additional development teams as well as sophisticated analytics and reporting use cases.
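As a rough illustration of what such a hybrid can look like, the sketch below keeps one dimension flat (star style) and splits another into a sub-dimension (snowflake style). The table and column names are hypothetical, and this is only one possible reading of "Starflake", shown on an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star-style dimension: region is kept inline (denormalized).
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT
    );
    -- Snowflake-style dimension: category lives in its own table.
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY, category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER
    );

    INSERT INTO dim_customer VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES (1, 'Router', 1), (2, 'License', 2);
    INSERT INTO fact_sales VALUES (1, 1, 3), (1, 2, 5), (2, 1, 2);
""")

# Reporting by region needs a single join (the star side).
by_region = conn.execute("""
    SELECT c.region, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Reporting by category needs two joins (the snowflake side).
by_category = conn.execute("""
    SELECT g.category_name, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_category g ON g.category_key = p.category_key
    GROUP BY g.category_name ORDER BY g.category_name
""").fetchall()

print(by_region)
print(by_category)
```

A team would typically keep frequently queried attributes inline and normalize only the dimensions that are large, shared, or change often.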
Data Mesh Approach to Data Warehouse
As with any new architecture, there are still questions to be answered. They concern adoption, required skills, disruption, and how an architectural approach that demands such widespread acceptance interacts with existing systems, such as the current data warehouse or data lake.
The advantage of a data mesh is that data products can be built up iteratively, offering early value; there is no need to wait for a central warehouse that represents the whole organization to be developed. As a consequence, organizations with brownfield IT estates can start employing the data mesh as an architectural strategy alongside digital modernization. Each domain can then be extended to include analytical data products, building out the mesh over time and relieving the pressure on the data warehouse.
The data mesh also offers a path for the existing data warehouse: it simply becomes a node on the mesh, albeit a critical one, ingesting and disseminating data products.
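The idea of the warehouse as one node among others can be sketched as a small graph of producers and consumers. Everything here is hypothetical: the MeshNode class, the node names, and the product names are invented for illustration and do not come from any mesh platform.

```python
from dataclasses import dataclass, field


@dataclass
class MeshNode:
    """A domain (or the warehouse) that publishes and/or consumes products."""
    name: str
    publishes: list = field(default_factory=list)
    consumes: list = field(default_factory=list)


def mesh_edges(nodes):
    """Return sorted (producer, product, consumer) triples for the mesh."""
    producer_of = {p: n.name for n in nodes for p in n.publishes}
    return sorted(
        (producer_of[product], product, node.name)
        for node in nodes
        for product in node.consumes
        if product in producer_of
    )


nodes = [
    MeshNode("sales", publishes=["sales.orders"]),
    MeshNode("finance", publishes=["finance.invoices"]),
    # The existing warehouse ingests domain products and republishes a
    # consolidated one, like any other node.
    MeshNode(
        "warehouse",
        publishes=["warehouse.revenue_360"],
        consumes=["sales.orders", "finance.invoices"],
    ),
    MeshNode("marketing", consumes=["warehouse.revenue_360", "sales.orders"]),
]

for edge in mesh_edges(nodes):
    print(edge)

# Nodes that both ingest and disseminate data products.
both_ways = [n.name for n in nodes if n.publishes and n.consumes]
print(both_ways)
```

Note that marketing reads one product directly from sales and another through the warehouse, so the warehouse remains important without being the only route to data.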
Nothing, however, is that straightforward. A decentralized paradigm necessitates upfront work to create governance and shared standards that enable interoperability of data products.
This should include, for example, explicit product definitions and a statement of where each product sits between two extremes: a precisely defined product that meets one or two use cases, or something more general that users must tailor further to their particular needs. In the latter case, the product must explain how different consumers can modify the data further while keeping its interpretation consistent.
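A shared standard of this kind can be checked mechanically. The sketch below assumes a hypothetical convention in which each product definition is a dictionary with a few required fields, a scope of either "specific" or "general", and, for general products, a customization_guide; none of these names come from an established standard.

```python
# Hypothetical fields that every product definition must fill in.
REQUIRED_FIELDS = ("name", "owner", "version", "schema", "use_cases")


def violations(definition):
    """Return a list of ways the definition breaks the shared standard."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not definition.get(name)
    ]
    # A general-purpose product must explain how consumers may tailor it
    # without changing what the data means.
    if definition.get("scope") == "general" and not definition.get(
        "customization_guide"
    ):
        problems.append("general-purpose product needs a customization_guide")
    return problems


specific_product = {
    "name": "sales.monthly_churn",
    "owner": "sales-data-team",
    "version": "1.0",
    "schema": {"month": "date", "churn_rate": "float"},
    "use_cases": ["churn reporting"],
    "scope": "specific",
}
general_product = {
    "name": "sales.orders",
    "owner": "sales-data-team",
    "version": "2.1",
    "schema": {"order_id": "int", "amount": "decimal"},
    "use_cases": ["reporting", "forecasting"],
    "scope": "general",
}

print(violations(specific_product))
print(violations(general_product))
```

Running a check like this before a product is published is one way to do the upfront governance work without routing every decision through a central team.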
Summary
We operate in an imperfect environment where conflicting objectives, realities on the ground, and budget constraints readily create recurring roadblocks and interdependencies among teams.
BI Competency Centers and other centralized, cross-cutting services often have the capacity to resolve such concerns as part of their governance structure. As a result, the data mesh should not be viewed as a bottom-up solution, but rather as an investment that will enable future interoperability and seamless integration across the organization. That attention to detail needs to go above and beyond what centralized capabilities can provide.