'Data Mesh', referred to in this article as a 'data network', is a next-generation big data architecture concept that promises fast, efficient, single-point access to and querying of your data warehouse, data lake, or database clusters. No matter how many types of data storage environments or how many data sources you have, it lets you query and manage billions of rows or hundreds of terabytes of data efficiently. Several trends sit behind this definition: data warehouse and data lake environments have proliferated, IoT data and projects such as digital twins have grown in importance, and society produces more data every day while the value of that data keeps rising. As a result, the data landscape has expanded into a very wide framework that has become hard to control. By taking a top-down view of our data environments, a data network promises a flexible solution even when our big data reaches huge volumes.
Much as software engineering teams moved from monolithic applications to microservice architectures, the data network is in many ways the data platform version of microservices. Unlike traditional monolithic data infrastructures, which handle the consumption, storage, transformation, and output of data in a single central data lake, a data network supports distributed, domain-specific data consumption and treats data as a product (“data-as-a-product”).
The graph above shows the areas of technology in which startups plan to invest in 2022. Data Analytics, the category that includes 'Data Network' technology, stands out as the second most invested area after cybersecurity, with a 51% increase. One of the main reasons is data duplication: the ratio of unique data to replicated or copied data is expected to shift gradually from 1:9 in 2020 to roughly 1:10 in 2024. Meanwhile, the amount of data created, captured, copied, and consumed worldwide is expected to rise from about 59 zettabytes in 2020 to 149 zettabytes by 2024. This means that unique data made up about 5.9 zettabytes of the data world in 2020 and will reach 13.41 zettabytes by 2024, an increase of 7.51 zettabytes. Replicated or copied data, on the other hand, will grow from 53.1 zettabytes to 135.59 zettabytes. In other words, within 4 years we will hold about 2.5 times as much copied data, created for use in our analytical applications or moved to other media.
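The arithmetic behind these figures can be retraced with a short script. This is only a sketch: the totals come from the text, while the unique-data shares of 10% and 9% are an assumption inferred from the quoted numbers (5.9 of 59 and 13.41 of 149 zettabytes), and all names are illustrative.

```python
# Totals quoted in the text, in zettabytes (ZB).
TOTAL_ZB = {2020: 59.0, 2024: 149.0}

# Assumption: unique-data shares implied by the quoted figures
# (5.9 of 59 ZB is 10%, 13.41 of 149 ZB is 9%).
UNIQUE_SHARE = {2020: 0.10, 2024: 0.09}


def split_unique_and_copied(year):
    """Return (unique_zb, copied_zb) for a year, rounded to 2 decimals."""
    total = TOTAL_ZB[year]
    unique = round(total * UNIQUE_SHARE[year], 2)
    copied = round(total - unique, 2)
    return unique, copied


unique_2020, copied_2020 = split_unique_and_copied(2020)
unique_2024, copied_2024 = split_unique_and_copied(2024)

# Growth of unique data, and how many times larger the copied data becomes.
unique_growth = round(unique_2024 - unique_2020, 2)
copied_multiple = round(copied_2024 / copied_2020, 2)

print(f"2020: unique={unique_2020} ZB, copied={copied_2020} ZB")
print(f"2024: unique={unique_2024} ZB, copied={copied_2024} ZB")
print(f"unique growth={unique_growth} ZB, copied multiple={copied_multiple}x")
```

The copied-data multiple is the number that matters for the argument: the volume of duplicated data grows much faster in absolute terms than the unique data it is derived from.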
So why would we need such massive ETL processes, hardware investments, and extra effort from our infrastructure teams just to use this data? Why set out on such an adventure when we could save money by using our data where it already resides instead of moving or copying it?
We do not need any of that. Let's see what we can do with a data network and what benefits it can bring us:
Benefits of Using Data Mesh
- All of a company's structured data is secured in one place without data movement and can be queried at scale, over the long term, with a single SQL query.
- A data network is not a data lake. On the contrary, it builds on data lake technologies such as S3 or HDFS and lets you combine different data lake environments at a single point.
- It is scalable and secure. Scalable security is provided by role-based access control (RBAC) and attribute-based access control (ABAC). Users are not granted SELECT on individual tables; instead, they are assigned attributes, and the attributes are associated with columns. Column lineage allows these attributes to propagate automatically.
- Metadata is usually lost during data movement. Data network technology, however, takes advantage of metastore technologies such as the Apache Hive Metastore to prevent the loss of your metadata.
- It is queryable at scale. Table redirection and local caching provide high-performance querying without creating copies in the asset catalog.
- The SQL dialects of different databases follow different standards, which creates problems, and a lot of data cleaning is needed to work around SQL that is not portable between databases. A data network platform has a single common SQL language that follows the ANSI SQL standard, and all queries are written in that one language, eliminating inconsistencies in data types and query syntax between different databases.
- SQL is almost 50 years old, and every company has legacy databases. Placing a query fabric on top of the physical databases makes it much easier to gradually decommission old databases, since the SQL code written by users does not have to be migrated.
- It supports many concurrent users. Thanks to its distributed architecture, a data network offers asynchronous and parallel data processing, so that a large number of concurrent users can use the system efficiently.
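The attribute-based access idea from the list above can be illustrated with a minimal sketch. Everything here is hypothetical: the column names, the attribute tags, and the lineage map are made up for illustration and do not reflect any specific product's API. The sketch also assumes a deny-by-default rule for untagged columns.

```python
# Hypothetical attribute tags attached to source columns.
COLUMN_ATTRIBUTES = {
    "sales.orders.order_id": {"public"},
    "sales.orders.amount": {"finance"},
    "sales.orders.customer_email": {"pii"},
}

# Hypothetical lineage: derived column -> columns it is computed from.
# Assumed to be acyclic.
LINEAGE = {
    "reports.revenue.total_amount": ["sales.orders.amount"],
    "reports.contacts.email_domain": ["sales.orders.customer_email"],
}


def attributes_of(column):
    """Return a column's attributes, inheriting those of its lineage sources."""
    attrs = set(COLUMN_ATTRIBUTES.get(column, set()))
    for source in LINEAGE.get(column, []):
        attrs |= attributes_of(source)
    return attrs


def visible_columns(user_attributes, columns):
    """Keep columns whose attributes are all held by the user.

    Assumption: a column with no attributes at all is hidden (deny by default).
    """
    granted = set(user_attributes)
    result = []
    for column in columns:
        attrs = attributes_of(column)
        if attrs and attrs <= granted:
            result.append(column)
    return result


analyst = {"public", "finance"}
requested = [
    "sales.orders.order_id",
    "sales.orders.customer_email",
    "reports.revenue.total_amount",
    "reports.contacts.email_domain",
]
print(visible_columns(analyst, requested))
```

Because the derived columns inherit attributes through lineage, the analyst can see the revenue total but not the email domain, without anyone tagging the report columns by hand.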
How Do the Data Mesh Infrastructure and Its Functions Work?
Data network technology is based on a multi-node architecture with master and worker nodes, as is the case with platforms such as Hadoop or Cloudera. When users submit data or query requests, the master nodes distribute the workloads through the Starburst layer to the data lake, relational database, NoSQL database, or message queue services that hold the required data. The worker nodes then load the data they collect from these different sources and serve it to the user from memory, without any need to transport or copy it. During this process, the platform applies data encryption and data masking for the querying user, and the user is allowed to access or view only what they are authorized for, at the row, column, or data source level. When all this is done, query logs covering performance, resource consumption, and accessed data are recorded and stored in a database. The data network platform manages all of this end to end in a fully automated way, giving business intelligence, reporting, and other business teams an easy-to-use platform.
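The flow described above (a coordinator fans work out to workers, gathers results in memory, and records a query log) can be sketched in a few lines. This is a toy model, not how Starburst is implemented: the sources are in-memory lists, the workers are threads, and every name is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the real data sources.
SOURCES = {
    "data_lake": [{"region": "EU", "sales": 120}, {"region": "US", "sales": 200}],
    "relational_db": [{"region": "EU", "sales": 30}],
    "nosql_db": [{"region": "APAC", "sales": 75}],
}

# Stand-in for the database where query logs are kept.
AUDIT_LOG = []


def worker_scan(source_name, min_sales):
    """Worker task: read one source in place and filter its rows."""
    rows = [r for r in SOURCES[source_name] if r["sales"] >= min_sales]
    return source_name, rows


def coordinator_query(user, allowed_sources, min_sales):
    """Coordinator: scan only authorized sources in parallel, then merge."""
    targets = [name for name in SOURCES if name in allowed_sources]
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        parts = list(pool.map(lambda name: worker_scan(name, min_sales), targets))
    # Merge the partial results in memory; nothing is copied to a new store.
    rows = [dict(row, source=name) for name, part in parts for row in part]
    AUDIT_LOG.append({"user": user, "sources": targets, "rows_returned": len(rows)})
    return rows


result = coordinator_query("analyst", {"data_lake", "relational_db"}, min_sales=50)
print(result)
print(AUDIT_LOG[-1])
```

The unauthorized source is never scanned at all, which mirrors the idea that access rules are applied before data reaches the user rather than after.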
If you are wondering how we at Komtaş position 'Data Network' technology for your organization and what it brings you, you can learn more by watching the demo below. In this demo, two Postgres datasets (sales and marketing) are integrated into the system as relational databases, and an S3 object storage layer is integrated under the "global" heading. In addition, a cache cluster has been created on the data network, and data that changes rarely but is queried often is directed to it. You can easily enrich your network by adding other types of data sources in cloud environments, or large datasets such as Hadoop, to this data network. In the sample application, the global sales data on S3 and the data in the relational databases are combined in a single query that returns in seconds, and only the fields that the querying user is authorized to see are shown. Most importantly, this data virtualization involves no costly, burdensome operations such as copying data or moving it from one place to another.
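The single-query idea in the demo can be imitated locally. The sketch below uses three separate in-memory SQLite databases as stand-ins for the two Postgres sources and the S3 layer, attached to one connection so that a single SQL statement joins them. The table names, columns, and sample rows are all hypothetical, and SQLite's ATTACH is only an analogy for a federated query engine, not the mechanism the demo itself uses.

```python
import sqlite3

# Logical column name -> qualified column, used to build a safe SELECT list.
COLUMNS = {
    "customer_id": "o.customer_id",
    "amount": "o.amount",
    "campaign": "c.campaign",
    "region": "g.region",
}


def build_demo_connection():
    """Create three independent in-memory databases behind one connection."""
    conn = sqlite3.connect(":memory:")
    for schema in ("sales", "marketing", "global_sales"):
        conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE marketing.campaigns (customer_id INTEGER, campaign TEXT)")
    conn.execute("CREATE TABLE global_sales.regions (customer_id INTEGER, region TEXT)")
    conn.executemany("INSERT INTO sales.orders VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
    conn.executemany("INSERT INTO marketing.campaigns VALUES (?, ?)", [(1, "spring"), (2, "summer")])
    conn.executemany("INSERT INTO global_sales.regions VALUES (?, ?)", [(1, "EU"), (2, "US")])
    return conn


def federated_query(conn, allowed_columns):
    """Join all three sources, returning only columns the user may see."""
    selected = [name for name in COLUMNS if name in allowed_columns]
    if not selected:
        raise ValueError("user is not authorized for any column")
    select_list = ", ".join(f"{COLUMNS[name]} AS {name}" for name in selected)
    sql = f"""
        SELECT {select_list}
        FROM sales.orders AS o
        JOIN marketing.campaigns AS c ON c.customer_id = o.customer_id
        JOIN global_sales.regions AS g ON g.customer_id = o.customer_id
        ORDER BY o.customer_id
    """
    cursor = conn.execute(sql)
    names = [description[0] for description in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


demo = build_demo_connection()
print(federated_query(demo, {"customer_id", "region", "campaign"}))
```

The SELECT list is built from a fixed mapping rather than from user input, so restricting a user to fewer columns never requires rewriting the join itself.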
In short, teams that work with a large number of data sources and need to transform or access data quickly would be wise to consider a data network.
How to Calculate Your Organization's Data Mesh Score
I put together a simple calculation to determine if it makes sense for your organization to invest in a data network. Answer each question below with a number and add them all together to get your total data network score.
Number of data sources: How many data sources does your company have?
Size of your data team: How many data analysts, data engineers, and product managers do you have on your data team?
Number of data domains: How many functional teams (marketing, sales, operations, etc.) rely on your data sources to drive decision-making, how many products does your company have, and how many data-driven features are being built? Add these together.
Data engineering bottlenecks: On a scale of 1 to 10, where 1 is “never” and 10 is “always”, how often does the data engineering team become a bottleneck when implementing new data products?
Data governance: How important is data governance for your organization?
What Does Your Data Mesh Score Mean?
Here's how to break down your score:
1 to 15: Given the size and one-dimensionality of your data ecosystem, you may not need a data network.
15 to 30: Your organization is maturing rapidly and may even be at a crossroads in terms of truly being able to rely on data. It is highly recommended that you adopt some data network practices and concepts now so that a later transition is easier.
30 or above: Your data organization is an innovation driver for your company, and a data network will support any ongoing or future initiatives to democratize data and provide self-service analytics across the organization. At this stage, a data network has begun to become a must for you.
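The scoring above is a simple sum followed by a lookup, which can be sketched as follows. The function names and the short band labels are illustrative. The text places 15 and 30 in two bands each, so the sketch assumes that a score of exactly 15 falls in the middle band and a score of exactly 30 in the top band.

```python
def data_mesh_score(data_sources, team_size, data_domains, bottleneck, governance):
    """Add the five answers together, as the questionnaire describes."""
    answers = (data_sources, team_size, data_domains, bottleneck, governance)
    if any(answer < 0 for answer in answers):
        raise ValueError("answers must be non-negative numbers")
    if not 1 <= bottleneck <= 10:
        raise ValueError("the bottleneck question is answered on a scale of 1 to 10")
    return sum(answers)


def interpret_score(score):
    """Map a total score to one of the three bands.

    Assumption: boundary scores (15 and 30) go to the higher band.
    """
    if score >= 30:
        return "must-have"
    if score >= 15:
        return "adopt some concepts"
    return "may not be needed"


# Hypothetical organization: 4 sources, 3 team members, 5 domains,
# bottleneck rated 2, governance answered as 3.
score = data_mesh_score(4, 3, 5, 2, 3)
print(score, interpret_score(score))
```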
You can fill out the form to share your score with us and get more detailed information about our Data Network solutions. You can also learn about our Starburst technology, which specializes in the field of Data Mesh.