Building DataPipeline, a real-time data integration platform, on Kafka Connect

Introduction. With traditional ETL solutions buckling under the weight of enterprise data integration, a new real-time data integration platform built on Kafka Connect is called for.

At the fourth Kafka Beijing Meetup on April 21, Chen Su, CTO of DataPipeline, shared how DataPipeline built a real-time data integration platform on the Kafka Connect framework. The following text is compiled from the live recording, for your reference.

What is data integration? The simplest scenario involves a data source and a data destination; the destination might be a data warehouse, and synchronizing data from a relational database to that warehouse is data integration.

4 Challenges to Enterprise Data Integration

Let's look at a real-life example of data integration.

Company G is a typical DataPipeline customer, with nearly a thousand data sources, mainly Oracle, SQL Server, and MySQL. Depending on business needs and existing infrastructure, each of these data sources must be synchronized to a different destination, mainly MySQL, HDFS, and Kafka. Against this background, Company G's specific requirements are as follows.

1. Needs to synchronize about 5TB of new data per day, a volume expected to grow 5-10x this year.

2. Some of these data sources require real-time synchronization, while others can accept scheduled synchronization.

3. The team lacks strong O&M talent, and the existing data sources can bear only limited extra business load, so they are very sensitive to pressure and need rate limiting.

4. Synchronization from these sources to their destinations is currently implemented with Kettle scripts, which are confusing to manage; tasks need to be configured and managed centrally through a management platform.

5. Both the upstream data sources and the downstream destinations are unstable and can fail at any time, so a highly available platform is needed to reduce the impact of interrupted data transfers.

6. When data synchronization tasks are suspended or resumed at arbitrary times, data integrity must be guaranteed.

7. Data integrity must likewise be guaranteed in the event of random failures and overloads at the data source and destination.

8. When a data source's Schema changes, the destination-side policy must be flexibly configurable based on business requirements.

The case of Company G is just one typical application scenario of current enterprise data integration needs. In fact, both Internet and traditional businesses face four challenges when it comes to data integration.

1. Heterogeneity of data sources: in traditional ETL solutions, synchronization from source to destination is scripted, so heterogeneous data sources mean a large amount of adaptation work for the enterprise.

2. Dynamic nature of data sources: during data integration, the upstream data sources change frequently; for example, parts of a table's structure may be deleted, which can affect the results of downstream data analysis.

3. Scalability of tasks: with only a few data sources, system pressure is a minor problem. But when data integration faces hundreds or thousands of data sources, running many tasks in parallel requires rate limiting and buffering so that read and write speeds can match.

4. Fault tolerance of tasks: when something goes wrong during transmission, can the task resume from the point of failure without producing duplicate data?

These are also the 4 most critical issues that DataPipeline will address for the enterprise data integration process.

Why choose Kafka Connect as the underlying framework

Kafka Connect is a tool for scalable, reliable streaming of data between Kafka and other systems; it makes it quick and easy to define connectors that move large collections of data into and out of Kafka. Kafka Connect gives DataPipeline a relatively mature, stable base framework, along with a number of out-of-the-box tools that significantly reduce R&D investment and improve application quality.

Below, we look at the specific benefits of Kafka Connect.

First, Kafka Connect provides a business abstraction centered on the data pipeline. There are two core concepts in Kafka Connect: Source and Sink. A Source imports data into Kafka and a Sink exports data from Kafka; both are called Connectors. The Source Connector and Sink Connector provide a high-level business abstraction for reading and writing data, which simplifies a great deal of lifecycle management work.

Specifically, a Source Connector initializes Source Tasks and a Sink Connector initializes Sink Tasks; these are standard building blocks. On the data side, Source and Sink Records give a standardized abstraction of the structure of the data. In addition, when enterprise customers do data integration, many application scenarios require the data to follow a particular format, so within Kafka Connect the Schema Registry and Projector are used to handle data format validation and compatibility. When the data source changes, a new Schema version is generated, and the Projector applies the configured processing strategy to keep the data format compatible.
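The Connector/Task split described above can be sketched as a pair of minimal Python classes. This is a loose analogy to the Java SourceTask/SinkTask API, not the real Kafka Connect signatures; the class and method names here are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Loose analogue of a Source/Sink Record: payload plus a schema version."""
    value: dict
    schema_version: int = 1

class SourceTask:
    """A Connector splits work into Tasks; each task polls records from the source."""
    def __init__(self, table):
        self.table = table
        self.offset = 0  # read progress; the framework persists this

    def poll(self, rows):
        batch = [Record({"table": self.table, "row": r}) for r in rows[self.offset:]]
        self.offset = len(rows)
        return batch

class SinkTask:
    """Consumes records from Kafka and writes them to the destination."""
    def __init__(self):
        self.written = []

    def put(self, records):
        self.written.extend(r.value for r in records)

# Illustrative wiring: one source task for a table, one sink task downstream.
src = SourceTask("orders")
sink = SinkTask()
sink.put(src.poll([{"id": 1}, {"id": 2}]))
```

The value of the abstraction is exactly this separation: the framework owns the lifecycle and the record format, while each task only implements reading or writing.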

Second, Kafka Connect has good scalability and fault tolerance, features that are in the same vein as Kafka itself. Whether a pipeline is streaming or batch depends largely on how the Source side reads the data, and Kafka Connect naturally supports both transfer modes. Both single-node operation and horizontal cluster scaling are supported directly by the Kafka Connect framework. As for task recovery and state maintenance, the write progress of destination tasks is managed automatically by the framework, and a source task can store its read progress in Kafka as needed, which saves a lot of effort in managing task progress across restarts.
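The progress management described above can be illustrated with a small sketch, in which a plain dict stands in for the Kafka-backed offset storage (the names `offset_store` and `read_batch` are hypothetical, purely for illustration):

```python
# The framework persists each task's read position; after a restart the task
# resumes from the stored offset instead of re-reading from the beginning.
offset_store = {}  # stand-in for Kafka-backed offset storage

def read_batch(task_id, rows, batch_size=2):
    start = offset_store.get(task_id, 0)
    batch = rows[start:start + batch_size]
    offset_store[task_id] = start + len(batch)  # commit the new progress
    return batch

rows = ["r1", "r2", "r3", "r4", "r5"]
first = read_batch("task-0", rows)   # reads from the beginning
# ... imagine the worker restarts here; offset_store survives ...
second = read_batch("task-0", rows)  # resumes where the first batch ended
```

Because progress lives outside the worker process, a restarted task picks up exactly where it left off.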

Data integration is a generic application scenario in which nobody wants to reinvent the wheel. Currently there are 84 Connectors available in the Kafka Connect ecosystem, the vast majority of them open source. Some are provided officially by Kafka, others are certified by Confluent, and some come from third parties. With appropriate tailoring to one's requirements, each of these Connectors can be applied to one's own platform.

What core data integration challenges are addressed by DataPipeline

Based on the Kafka Connect framework, DataPipeline has done a lot of optimization and enhancement work that can well solve many of the core challenges facing enterprise data integration today.

1. Task independence and global coordination.

From the very beginning, Kafka's design has followed the principle of decoupling source from destination: there can be many Consumers downstream, and without this decoupling the consumer side would be hard to scale. But when enterprises run data integration tasks, they need the source and destination to cooperate: they ultimately want a controlled, bounded synchronization from source to destination, followed by continuous incremental synchronization. If the source and destination are fully independent in this process, their speeds will not match; one is fast and the other slow, leading to serious data pile-up. So after a business user creates a data task, we want to control the task's buffering to avoid data loss.
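The speed-mismatch problem can be made concrete with a bounded buffer between source and destination: when the slow side falls behind, the fast side blocks instead of piling up data. This is a minimal backpressure illustration, not DataPipeline's actual mechanism:

```python
import queue
import threading

buf = queue.Queue(maxsize=10)  # bounded buffer: put() blocks when it is full
result = []

def source(n):
    for i in range(n):
        buf.put(i)     # blocks if the sink is too slow -> backpressure, no pile-up
    buf.put(None)      # sentinel: end of stream

def sink():
    while True:
        item = buf.get()
        if item is None:
            break
        result.append(item)

t1 = threading.Thread(target=source, args=(100,))
t2 = threading.Thread(target=sink)
t1.start(); t2.start()
t1.join(); t2.join()
```

The key design point is the `maxsize`: the buffer bounds how far the fast side can run ahead, so throughput is automatically throttled to the slower end.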

2. A task parallelization approach.

Suppose an enterprise customer has 1,000 tables that need data integration tasks; consider how best to split the work. One approach is to partition the 1,000 tables among a handful of tasks. In that case it is hard to balance the load across Source Tasks, and although a Sink Task can consume multiple Topics, it still suffers from uneven load; deciding how many tables each task should carry is genuinely difficult to balance. Moreover, every added task triggers the Rebalance mechanism. At the other extreme, each table could initialize its own source and destination task via a Source Connector and Sink Connector, which, as you can imagine, would add significantly to the Rebalance overhead.
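The balancing trade-off can be sketched with a greedy assignment: keep the number of tasks fixed (avoiding a Rebalance per table) and give each table to the currently least-loaded task. This is an illustrative strategy, not necessarily the one DataPipeline uses:

```python
import heapq

def assign_tables(table_sizes, num_tasks):
    """Greedily assign each table to the currently least-loaded task."""
    heap = [(0, t) for t in range(num_tasks)]  # (load, task_id)
    heapq.heapify(heap)
    assignment = {t: [] for t in range(num_tasks)}
    # Place the biggest tables first so they anchor the balance.
    for table, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        load, task = heapq.heappop(heap)
        assignment[task].append(table)
        heapq.heappush(heap, (load + size, task))
    return assignment

# Hypothetical per-table row counts.
tables = {"t1": 900, "t2": 500, "t3": 400, "t4": 100}
plan = assign_tables(tables, 2)
```

With two tasks, the plan above lands `t1` and `t4` on one task (load 1000) and `t2` and `t3` on the other (load 900), far more even than a naive alphabetical split.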

3. Mapping of heterogeneous data.

When doing data integration for enterprise customers, there is roughly a 50% chance you will run into some dirty work: mapping between heterogeneous data sources. This mapping is not a serious matter for many Internet companies, because their databases are designed more "elegantly" (uniformly) in field naming and similar conventions. But in traditional enterprises, many business systems are outsourced, which leads to less standardized and less uniform database designs. When using Kafka Connect for data integration, heterogeneous data must be reproduced as accurately as possible, especially in the financial industry, where customers are more demanding. And when a genuine mismatch between data does occur, a reasonable mapping between the business data still has to be made.

In addition, a Source Record on the source side contains each column's basic data type (INT16, STRING, etc.) and optional meta information (e.g., "name"). When the destination side processes the Sink Record, it decides the mapping based on both the basic data type and the meta information.
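A sketch of such a mapping decision, keyed on the record's basic type plus optional meta information. The type names, the `logical_type` meta key, and the defaults table are all illustrative assumptions, not DataPipeline's actual rules:

```python
def map_column_type(basic_type, meta=None):
    """Map a source column's (basic type, meta) pair to a destination type.

    Meta information can override the default mapping, e.g. an INT64 that
    actually carries a logical decimal value.
    """
    meta = meta or {}
    if meta.get("logical_type") == "decimal":
        return "DECIMAL(%d,%d)" % (meta.get("precision", 38), meta.get("scale", 0))
    defaults = {
        "INT16": "SMALLINT",
        "INT32": "INTEGER",
        "INT64": "BIGINT",
        "STRING": "VARCHAR(255)",
        "BOOLEAN": "BOOLEAN",
    }
    return defaults.get(basic_type, "TEXT")  # fallback for unknown types

col = map_column_type("INT64", {"logical_type": "decimal", "precision": 10, "scale": 2})
```

The point is the precedence: meta information wins over the basic type, which is what lets the destination reproduce the source semantics accurately.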

4. Schema change handling strategy.

When doing data integration for enterprises, you need to apply appropriate processing strategies to changes in the data source's Schema. Based on the Kafka Connect framework, we provide the following strategies.

(1) Backward Compatibility: All data can be accessed consistently using the latest Schema, e.g. deleting columns, adding columns with default values.

(2) Forward Compatibility: All data can be accessed consistently using the oldest Schema, e.g. removing columns that have default values.

(3) Full Compatibility: All data can be accessed using either the old or the new Schema.

Kafka Connect recommends using Backward Compatibility, which is also the default for the Schema Registry. In addition, business users may require that a column deleted on the source side be ignored on the destination side, or that a column added on the source side with a default value be added on the destination side as well; all of this is configured and implemented per Task.
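The backward-compatible cases listed above (dropping a column, adding a column with a default) can be checked mechanically. A minimal sketch, treating a schema as a mapping from column name to a type/default spec; this is a simplification of the real Schema Registry rules:

```python
def is_backward_compatible(old_schema, new_schema):
    """Can a reader using new_schema still consume data written with old_schema?

    Allowed here: dropping a column, or adding a column that has a default.
    A schema is a dict: column name -> {"type": ..., "default": ... (optional)}.
    """
    for name, spec in new_schema.items():
        if name in old_schema:
            if old_schema[name]["type"] != spec["type"]:
                return False   # changed type: old data becomes unreadable
        elif "default" not in spec:
            return False       # new column without a default: old rows lack it
    return True

old = {"id": {"type": "INT"}, "name": {"type": "STRING"}}
ok1 = is_backward_compatible(old, {"id": {"type": "INT"}})                        # column dropped
ok2 = is_backward_compatible(old, {**old, "age": {"type": "INT", "default": 0}})  # added with default
bad = is_backward_compatible(old, {**old, "age": {"type": "INT"}})                # added, no default
```

In a real deployment this check runs when a new Schema version is registered, rejecting the change before any incompatible data enters the pipeline.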

What enhancements have been made to DataPipeline based on Kafka Connect

While continuing to meet the data integration needs of today's enterprise customers, DataPipeline has made a number of very important enhancements based on the Kafka Connect framework.

1. System architecture level.

DataPipeline introduces the concept of the DataPipeline Manager, mainly used to optimize the globalized lifecycle management of Sources and Sinks. It manages the destination and the global lifecycle when a task hits an exception; for example, it handles mismatched read rates between source and destination and coordinates states such as paused.

To enhance the robustness of the system, we save the parameters of the Connector task in ZooKeeper to make it easy to read the configuration information after the task restarts.

The DataPipeline Connector reports statistics to the Dashboard via a JMX Client. Inside the Connector, we add some technical encapsulation to capture general information, such as a Connector's historical read statistics and management-related information, and surface it in the Dashboard for the customer.

2. Task parallel mode.

DataPipeline has made several enhancements to task parallelism; we also meet this problem when serving specific clients who need dozens of tables synchronized. In the DataPipeline Connector, each Task can define and maintain its own thread pool, controlling the number of concurrent threads, and each Task can set row-level I/O rate control. For JDBC-type Tasks, we additionally allow configuring the connection pool size to reduce the resource overhead on the upstream and downstream systems.
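Row-level I/O control of this kind is commonly implemented as a token bucket: each written row consumes a token, and tokens refill at the configured rate. An illustrative sketch (with an injectable clock for determinism), not DataPipeline's actual implementation:

```python
import time

class RowRateLimiter:
    """Token bucket limiting rows per second for a single task."""
    def __init__(self, rows_per_sec, now=time.monotonic):
        self.rate = rows_per_sec
        self.tokens = float(rows_per_sec)  # start with a full bucket
        self.now = now
        self.last = now()

    def try_acquire(self, rows=1):
        t = self.now()
        # Refill proportionally to elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= rows:
            self.tokens -= rows
            return True
        return False  # caller should back off before retrying

# Fake clock so the example is deterministic.
clock = [0.0]
limiter = RowRateLimiter(1000, now=lambda: clock[0])
burst = sum(1 for _ in range(1500) if limiter.try_acquire())  # bucket holds 1000
clock[0] += 0.5                                               # half a second passes
refilled_ok = limiter.try_acquire(400)                        # ~500 tokens refilled
```

Each Task holding its own limiter keeps one hot table from starving the source database of I/O capacity.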

3. Rules engine.

DataPipeline's core positioning in building on Kafka Connect is data integration. The data integration process should not involve heavy computation on the data, but some field-level filtering is inevitable, so in the product we have been considering how to provide this capability in a controlled way.

Kafka Connect does provide a Transformation interface that works with the Source Connector and Sink Connector to perform basic transformations on the data. However, it operates at the granularity of a Connector, which enterprise customers must compile and deploy to all nodes of the cluster, and it lacks good support for a visual, dynamically compiled and debugged environment.

Based on this, the DataPipeline product provides two visual configuration environments: the Basic Code Engine and the Advanced Code Engine. The former provides functions including field filtering, field replacement, and field ignoring, while the latter is based on Groovy and allows more flexible processing of data and checking the Schema consistency of the results. For the Advanced Code Engine, DataPipeline also provides data sampling and dynamic debugging capabilities.
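The Basic Code Engine's operations (filter, replace, ignore) can be sketched as a small rule pipeline applied per record. The rule shapes and field names below are illustrative, not the product's actual configuration format:

```python
def apply_rules(record, rules):
    """Apply basic-engine style rules to one record (a dict of fields).

    Returns the transformed record, or None if a filter rule drops it.
    """
    out = dict(record)
    for rule in rules:
        if rule["op"] == "filter" and not rule["pred"](out):
            return None                       # drop the record entirely
        elif rule["op"] == "replace":
            out[rule["field"]] = rule["fn"](out[rule["field"]])
        elif rule["op"] == "ignore":
            out.pop(rule["field"], None)      # drop just this field
    return out

rules = [
    {"op": "filter", "pred": lambda r: r["status"] == "active"},
    {"op": "replace", "field": "email", "fn": str.lower},
    {"op": "ignore", "field": "password"},
]
kept = apply_rules({"status": "active", "email": "A@X.COM", "password": "s3cret"}, rules)
dropped = apply_rules({"status": "deleted", "email": "b@x.com", "password": "p"}, rules)
```

Configuring such rules visually, rather than compiling them into Connector code, is what avoids redeploying to every cluster node for each change.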

4. Error queue mechanism.

As we've seen in our work with enterprise clients, the data at the user's source is never "clean". Data that is not "clean" can come from several sources: "dirty records" in a file-type data source, unanticipated exceptions when the rules engine processes specific data, values that cannot be written because of a Schema mismatch on the destination side, and various other reasons.

Faced with these situations, enterprise customers either stop the task or set the data aside to be processed later. DataPipeline takes the second approach: the error-queue warning function in the product specifies the policy for the error queue, supporting configurable warning and interruption policies. For example, the task is suspended only when the error queue reaches a certain percentage; such a setting ensures the task is not interrupted by a small amount of abnormal data, while the fully recorded abnormal data can be easily tracked, troubleshot, and handled by an administrator. Enterprise customers report that this visual error-queue configuration greatly improves administrator productivity compared with sifting through logs to find abnormal data.
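The threshold behavior can be sketched as follows: failed records go to an error queue, and the task is suspended only when the error ratio crosses the configured percentage. This is a minimal illustration of the policy, not the product's code; the threshold values are hypothetical:

```python
class ErrorQueuePolicy:
    """Suspend the task when errors exceed max_ratio of processed records."""
    def __init__(self, max_ratio=0.05, min_records=100):
        self.max_ratio = max_ratio
        self.min_records = min_records  # don't judge on too small a sample
        self.processed = 0
        self.errors = []                # the error queue: records kept for later review

    def record(self, rec, ok):
        self.processed += 1
        if not ok:
            self.errors.append(rec)     # abnormal data is preserved, not lost

    def should_suspend(self):
        if self.processed < self.min_records:
            return False
        return len(self.errors) / self.processed > self.max_ratio

policy = ErrorQueuePolicy(max_ratio=0.05)
for i in range(200):
    policy.record({"row": i}, ok=(i % 10 != 0))  # simulate 10% of rows failing
```

Because the queued records are preserved in full, an administrator can inspect exactly which rows failed instead of reconstructing them from logs.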

In data integration, there really shouldn't be many transformations or calculations performed on the raw data itself. Traditional ETL schemes transform the data heavily and produce a more efficient output, but when the user's business needs change, a new data pipeline must be created to transfer the original data again. This approach does not suit the current needs of big data analytics.

With this in mind, DataPipeline advises clients to do only a small amount of cleaning first and keep the data as intact as possible. This is not to say that we don't value data quality. An important piece of future work is data quality management based on Kafka Streams: it will not be responsible for the final data output, but will analyze, from a business perspective, whether the data changed during the exchange. A sliding window is used to determine exactly where data problems occur, the condition being whether the record count deviates from the historical average by more than a certain percentage; once the condition is met, an alarm is triggered and the synchronization task is suspended.
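The sliding-window check described here can be sketched as comparing the current window's record count with the historical average and alarming when the deviation exceeds a configured percentage. Window size and threshold below are illustrative, not DataPipeline's planned values:

```python
from collections import deque

class WindowMonitor:
    """Flag a window whose record count deviates too far from recent history."""
    def __init__(self, window=5, max_deviation=0.5):
        self.history = deque(maxlen=window)  # sliding window of past counts
        self.max_deviation = max_deviation

    def observe(self, count):
        """Return True if this window's count looks anomalous."""
        if len(self.history) == self.history.maxlen:
            avg = sum(self.history) / len(self.history)
            if avg and abs(count - avg) / avg > self.max_deviation:
                return True                  # trigger an alarm / suspend the task
        self.history.append(count)
        return False

mon = WindowMonitor()
normal = [mon.observe(c) for c in [100, 98, 103, 101, 99, 102]]  # steady traffic
spike = mon.observe(10)                                          # sudden drop
```

Judging by deviation from a rolling average, rather than a fixed count, lets the same check work for both high-volume and low-volume pipelines.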

To sum up, through continuous effort DataPipeline has solved the heterogeneity, dynamism, scalability, and fault-tolerance problems of enterprise data integration. Building on the solid foundation of Kafka Connect, it has produced a mature enterprise data integration platform, and through secondary packaging and extensions it has addressed the challenges of applying Kafka Connect, including Schema mapping and evolution, task parallelization strategy, and globalized management. In the future, DataPipeline will further enhance data quality management based on streaming computation.

PS. Add the DataPipeline WeChat account datapipeline2018 to be added to the technical discussion group.

