The architecture evolution of a big data statistical analysis platform for apps with tens of millions of daily active users and a billion total users
Meitu has a billion users in total, and tens of millions of them use each of Meitu's products every day, which accumulates a huge amount of user data.
As the apps iterate continuously and the user base expands rapidly, product, operations, marketing and other teams rely more and more on data to optimize features, track operational results and analyze user behavior, and with this comes an ever-growing stream of data statistics and analysis requirements.
So how do you respond to and keep up with this ever-expanding demand for data statistics and analytics? And how does the continuous evolution of the business in turn drive the architecture to change?
This article introduces one product of the collision between big data business and technology: the architectural evolution of Meitu's big data statistical analysis platform. We hope this sharing offers some useful ideas on aligning data business with architecture.
Anyone who has done big data development knows that statistics work is a bit awkward. First, it is not considered very technical and does not obviously help an engineer grow; second, it tends to be repetitive, solving simple needs with the same kind of work over and over.
Meitu actually has many apps, and each app has its own product, operations, sales and data analysis colleagues, all of whom raise various statistical requirements such as data reports or analysis requests. So how are so many analytical and statistical needs handled at Meitu?
Today I mainly want to present Meitu's solution, which is divided into three main parts.
Statistical operations and technology collide
This part is basically my own experience of starting out alone on this line of business and the interesting points I ran into, roughly across three stages.
01
Initial project period
The characteristics of this stage are obvious. Taking Meipai as an example: the overall data volume was small at the beginning; statistical requirements were few, mostly basic indicators; and the product iterated very fast, so the data's statistical indicators had to keep up with the pace of product iteration.
Figure 1
The solution at this stage looks very rudimentary today. The business side runs multiple service nodes to ensure availability, the service on each node writes its logs to local disk, and the logs are then synchronized to a single data storage node via rsync.
On that node we wrote simple shell or PHP scripts to implement the statistics logic, configured crontab entries to trigger the statistics tasks periodically, and finally stored the results in MySQL for the presentation layer to read and render as reports.
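For the flavor of it, here is a minimal sketch of the kind of logic those crontab-triggered scripts implemented, written in Java for consistency with the later examples (the originals were shell/PHP). The log path, table name and credentials are made-up placeholders, and a MySQL JDBC driver is assumed to be on the classpath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.stream.Stream;

// Minimal sketch of an early-stage statistics job: count the day's log lines on
// the aggregation node and write the single metric into MySQL for the report page.
// Paths, table name, credentials and log format are illustrative assumptions.
public class DailyPvJob {
    public static void main(String[] args) throws Exception {
        String day = args.length > 0 ? args[0] : "2017-01-01";

        long pv;
        // Logs were rsynced here from the business nodes, one file per day.
        try (Stream<String> lines = Files.lines(Path.of("/data/logs/meipai/access-" + day + ".log"))) {
            pv = lines.count();
        }

        // Store the result so the presentation layer can render the report.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/stats", "stats", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO daily_pv (stat_date, pv) VALUES (?, ?)")) {
            ps.setString(1, day);
            ps.setLong(2, pv);
            ps.executeUpdate();
        }
    }
}
```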
02
Rapid Development Stage
After the user base suddenly exploded, the amount of data kept increasing, and so did the needs of product operations and data analysis.
At that point the first-stage solution became problematic, mainly in the following three respects.
Figure 2
So we made some adjustments.
03
An aspiring programmer
As the requirements kept swelling, an aspiring programmer would take stock: the amount of duplicated code was very high. Even though we had an upper layer such as Hive in which to write the query code, we still had to add a final layer of data filtering or aggregation ourselves.
In fact, for each statistical requirement we had to write a corresponding Java implementation, which was tedious and highly repetitive work.
Figure 3
An aspiring programmer is not resigned to doing repetitive work every day, and from day-to-day exposure to the business and its implementation it was clear that the flow of statistical business logic is basically the same.
So we considered abstracting this fairly common business process: the basic flow is to query the data out of the data source, apply some business-specific aggregation or filtering, and finally store the result in the DB.
We therefore added a layer of abstraction at the code level, extracting a statistics component made up of Query, Aggregator and DBStore, with a number of concrete implementations for the different Query and Store scenarios.
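In code, that abstraction boils down to roughly three interfaces plus a small driver that wires them together. The sketch below is illustrative only; the names and signatures are assumptions rather than the actual internal component.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the statistics component described above: a Query
// pulls rows from a source, an Aggregator applies the business filtering or
// aggregation, and a DBStore persists the result.
interface Query      { List<Map<String, Object>> run(String statDate); }
interface Aggregator { List<Map<String, Object>> apply(List<Map<String, Object>> rows); }
interface DBStore    { void save(List<Map<String, Object>> rows); }

class StatJob {
    private final Query query;
    private final Aggregator aggregator;
    private final DBStore store;

    StatJob(Query query, Aggregator aggregator, DBStore store) {
        this.query = query;
        this.aggregator = aggregator;
        this.store = store;
    }

    // Every statistical requirement follows the same flow:
    // query -> aggregate/filter -> store.
    void execute(String statDate) {
        store.save(aggregator.apply(query.run(statDate)));
    }

    public static void main(String[] args) {
        Query hive = date -> List.of(Map.of("uid", "u1", "event", "share"));
        Aggregator shareCount = rows -> List.of(Map.of("metric", "share_count", "value", rows.size()));
        DBStore mysql = rows -> rows.forEach(System.out::println);  // stand-in for a MySQL writer
        new StatJob(hive, shareCount, mysql).execute("2017-01-01");
    }
}
```

A new requirement then only needs a concrete Query/Aggregator/Store combination instead of a full, hand-written job.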
With this abstraction in place, productivity improved noticeably compared with the previous approach. Before, I could handle four or five statistical requirements a day on my own; afterwards I could handle about seven or eight a day, from understanding the requirement through to implementation, which was a decent improvement in overall efficiency.
Architectural implementation of the Meitu statistics platform
Even with the abstraction above, a number of pain points remained.
Figure 4
Based on these pain points, let's go over how we solved them. Our idea was to move to a platform model: the business side uses the platform on a self-service basis, and we provide the service behind it.
Figure 4 shows our general idea of platformization at that time. The business parties on the left have a great many reporting requirements, and may also have data requirements for app data scenarios, commercial advertising and so on.
We wanted to provide a platform where those who need data configure the metrics they want, and the platform takes care of computing and storing the data and ultimately serving it to the data applications.
Going further, in building this platform we needed to consider the following important points.
Based on these points, we needed several different modules to take care of the major functions above.
We designed roughly three modules.
The next sections describe the main functions of these three modules and how they are implemented.
01
JobManager Module
This module abstracts statistical tasks and provides unified configuration management of task metadata.
As shown in Figure 5, the main need is to provide a platform where the application side can configure the data it wants; the other point is that we integrated the data warehouse, mainly so that the business side can see the information in its own business tables.
Figure 5
The right-hand side mainly describes a statistics task's metadata, which covers a few big pieces: the source of the data, the statistics operator to apply, the storage medium, and, for special scenarios, data filters, dimension aggregation and the dependencies between tasks.
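As a rough illustration, the metadata for one task could be captured by a structure like the following; the field names and example values are assumptions rather than the platform's real schema.

```java
import java.util.List;

// Hypothetical shape of the metadata JobManager keeps for one statistics task;
// field names and example values are illustrative, not the platform's real schema.
record StatJobMeta(
        String jobId,
        String source,            // e.g. the Hive table the data is queried from
        String operator,          // the statistics operator / logic to run
        List<String> filters,     // row-level filters applied after the query
        List<String> dimensions,  // dimensions to aggregate on
        String sink,              // storage medium, e.g. MongoDB / MySQL / HDFS
        List<String> dependsOn,   // upstream tasks that must finish first
        String cron               // when the scheduler should trigger the task
) {
    public static void main(String[] args) {
        StatJobMeta meta = new StatJobMeta(
                "meipai_daily_share", "dw.meipai_events",
                "count(distinct uid)", List.of("event = 'share'"),
                List.of("stat_date", "os"), "mongodb",
                List.of("dw.meipai_events_ready"), "0 3 * * *");
        System.out.println(meta);
    }
}
```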
02
Task scheduling module Scheduler
The current implementation is relatively simple and runs as a single point. There are a few key points to how it is realized.
Figure 6
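To give a concrete sense of what such a single-point scheduler does, here is a hedged sketch that periodically scans the task metadata, skips tasks whose dependencies have not finished, and triggers the rest; the structure and names are assumptions, not the actual implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough sketch of a single-point scheduler: periodically scan the task list,
// skip tasks whose upstream dependencies are not done yet, and trigger the rest.
class SimpleScheduler {
    record Task(String id, List<String> dependsOn) {}

    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final Set<String> finished = ConcurrentHashMap.newKeySet();

    void start(List<Task> tasks) {
        timer.scheduleAtFixedRate(() -> {
            for (Task t : tasks) {
                if (!finished.contains(t.id()) && finished.containsAll(t.dependsOn())) {
                    run(t);
                }
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    private void run(Task t) {
        // In the real platform this would hand the task to JobExecutor;
        // here we just mark it as done.
        System.out.println("running " + t.id());
        finished.add(t.id());
    }

    public static void main(String[] args) {
        new SimpleScheduler().start(List.of(
                new Task("import_mysql", List.of()),
                new Task("daily_report", List.of("import_mysql"))));
    }
}
```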
03
Task Execution Module JobExecutor
As shown in Figure 6, the executor assembles a specific Query component from the plugin pool based on the task's source information, runs the query against the corresponding query layer (e.g. Hive), and then applies filtering and dimension aggregation to the data that comes back.
Figure 7
Finally, it assembles the storage-layer component according to the task's information and writes the statistical results to the storage layer.
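Putting the two steps together, the executor essentially looks up the right Query and Store plugins by the names in the task metadata and runs query -> filter/aggregate -> store. A hypothetical sketch:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of JobExecutor: pick plugins out of a registry by the
// names recorded in the task metadata, then run query -> filter/aggregate -> store.
class JobExecutor {
    interface QueryPlugin { List<Map<String, Object>> query(String statement); }
    interface StorePlugin { void write(String target, List<Map<String, Object>> rows); }

    private final Map<String, QueryPlugin> queryPool;
    private final Map<String, StorePlugin> storePool;

    JobExecutor(Map<String, QueryPlugin> queryPool, Map<String, StorePlugin> storePool) {
        this.queryPool = queryPool;
        this.storePool = storePool;
    }

    void execute(String sourceType, String statement, String sinkType, String target) {
        QueryPlugin query = queryPool.get(sourceType);   // e.g. "hive"
        StorePlugin store = storePool.get(sinkType);     // e.g. "mongodb"
        List<Map<String, Object>> rows = query.query(statement);
        // Dimension filtering/aggregation would happen here before storing.
        store.write(target, rows);
    }

    public static void main(String[] args) {
        JobExecutor executor = new JobExecutor(
                Map.of("hive", stmt -> List.of(Map.of("stat_date", "2017-01-01", "pv", 42))),
                Map.of("stdout", (target, rows) -> System.out.println(target + " <- " + rows)));
        executor.execute("hive",
                "SELECT stat_date, count(*) AS pv FROM dw.meipai_events GROUP BY stat_date",
                "stdout", "daily_pv");
    }
}
```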
Having covered the three modules, let's review the basic architecture of the statistics platform. The JobManager on the left manages the metadata, and statistical tasks follow a standard process driven by that metadata: query, filter, dimension aggregation, storage.
With this basic framework in place we could cover part of the basic statistics scenarios, but to support more statistical business scenarios we needed to extend its features further (Figure 8).
Figure 8
There are four broad directions of functional expansion here.
Ad-hoc data extraction scenarios
Not every requirement needs to run routinely; there are many ad-hoc scenarios, for example an analyst wanting a one-off look at the data of a particular app feature, or operations wanting the metrics of a one-off campaign. We run into such ad-hoc extraction needs quite often.
To address ad-hoc extraction we built two features. The first lets users with an SQL background fill in SQL directly to pull the data they need on demand.
For this we extend the ANTLR parser integrated in Hive to parse the HQL and check its legitimacy, mainly rejecting operations such as Insert and Delete, and limiting the time range a query may scan so that it cannot occupy cluster computing resources for too long.
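The real check is done on the parse tree produced by Hive's ANTLR grammar; as a deliberately simplified stand-in, the same two rules (no write statements, and a mandatory date restriction) could look like this, where the `stat_date` column is an assumed convention:

```java
import java.util.Set;
import java.util.regex.Pattern;

// Much-simplified stand-in for the ad-hoc SQL guard described above. The real
// platform parses the HQL with Hive's ANTLR grammar; this only illustrates the
// two checks: reject write statements, and require a date predicate so a query
// cannot scan unbounded history.
class AdHocSqlGuard {
    private static final Set<String> FORBIDDEN =
            Set.of("INSERT", "DELETE", "UPDATE", "DROP", "ALTER", "TRUNCATE");
    private static final Pattern DATE_FILTER =
            Pattern.compile("(?i)\\bstat_date\\s*(=|>=|BETWEEN)");  // assumed column name

    static void validate(String hql) {
        String upper = hql.toUpperCase();
        for (String keyword : FORBIDDEN) {
            if (upper.contains(keyword)) {
                throw new IllegalArgumentException("write operations are not allowed: " + keyword);
            }
        }
        if (!DATE_FILTER.matcher(hql).find()) {
            throw new IllegalArgumentException("query must be restricted to a date range");
        }
    }

    public static void main(String[] args) {
        validate("SELECT count(distinct uid) FROM dw.meipai_events WHERE stat_date = '2017-01-01'");
        System.out.println("query accepted");
    }
}
```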
Enriching data sources as much as possible
In general, we increasingly encounter the need to import MySQL data from the business side to do simple statistics or join calculations.
So we developed a plugin based on Sqoop to import business MySQL tables into Hadoop.
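Such a plugin ultimately just drives a `sqoop import` run. A hedged sketch of the invocation, with the connection string, table and target directory as placeholders:

```java
// Hedged sketch of a Sqoop-based import plugin: it shells out to `sqoop import`
// to copy one business MySQL table into HDFS. Connection details, table name
// and target path are placeholders.
class SqoopImportPlugin {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://biz-mysql:3306/meipai",
                "--username", "reader",
                "--password", "secret",
                "--table", "user_profile",
                "--target-dir", "/warehouse/import/meipai/user_profile",
                "-m", "1")                 // single mapper is enough for small tables
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}
```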
The third is Bitmap, a system developed in-house at Meitu, mainly to make multi-dimensional deduplication and the calculation of new-user and retention metrics easier; its principle is based on bitwise operations.
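The underlying idea can be shown with `java.util.BitSet`: give each user a dense integer index, set one bit per active user per day, and deduplicated counts and retention become bitwise operations. This is a toy illustration of the principle, not Meitu's actual Bitmap service.

```java
import java.util.BitSet;

// Toy illustration of the bitmap principle: one bit per user id per day, so
// de-duplicated counts and next-day retention reduce to bitwise operations.
class BitmapRetentionDemo {
    public static void main(String[] args) {
        BitSet newUsersDay1 = new BitSet();
        BitSet activeDay2 = new BitSet();

        // Users are identified by dense integer indexes.
        for (int uid : new int[]{1, 2, 3, 5, 8}) newUsersDay1.set(uid);
        for (int uid : new int[]{2, 3, 8, 13})   activeDay2.set(uid);

        // De-duplicated count is simply the number of set bits.
        System.out.println("new users on day 1: " + newUsersDay1.cardinality());

        // Next-day retention = (new on day 1) AND (active on day 2).
        BitSet retained = (BitSet) newUsersDay1.clone();
        retained.and(activeDay2);
        System.out.println("retained on day 2: " + retained.cardinality());
    }
}
```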
Multi-storage
Most of the data is currently stored in MongoDB, which sits between a traditional relational database and NoSQL; it covers most business query scenarios and gives us distributed data storage.
The second case is one-off exports of larger data sets: when the business side needs a large batch of data, we can write it to HDFS, and the business user then exports it from HDFS for their own application.
The third is plain-text output such as CSV. The fourth is MySQL, for which we currently support some table-splitting strategies. The last piece is enriching the statistical operators; so far we have implementations of operators such as deduplication, arrays and TopN.
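As an example of what one of these operators does, a TopN aggregation over queried rows might be sketched as follows; the row shape and field names are assumptions.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of a TopN statistics operator: keep the N rows with the
// highest value of a given metric field. Row shape and field names are assumptions.
class TopNOperator {
    static List<Map<String, Object>> topN(List<Map<String, Object>> rows, String field, int n) {
        return rows.stream()
                .sorted((a, b) -> Long.compare(
                        ((Number) b.get(field)).longValue(),   // descending by the metric
                        ((Number) a.get(field)).longValue()))
                .limit(n)
                .toList();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = List.of(
                Map.of("video_id", "a", "plays", 120),
                Map.of("video_id", "b", "plays", 900),
                Map.of("video_id", "c", "plays", 450));
        System.out.println(topN(rows, "plays", 2));   // b and c come out on top
    }
}
```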
Data Visualization
As shown in Figure 9, because the storage layer is diverse, we originally exposed the storage layer directly to each application's data backend, which queried and parsed the data from it.
Figure 9
The first problem is that if the storage layer is not made transparent, the display-layer developers have to learn HBase, MySQL, MongoDB and so on, which is a considerable learning cost.
The second is that data security and unified management of the data storage are hard to guarantee. So we built a unified common API with a custom, unified data protocol, which lets the display layer consume the data in a consistent way.
But this raises further questions, and we had to think about platform-level data security, as shown in Figure 10.
Figure 10
For example, the Meipai backend should normally only be able to access Meipai-related data; it is not allowed to access the data of other apps or of commercial advertising.
So we implemented a unified certification center (CA): the business backend first obtains an authorization token from the CA, and then carries that access token when calling the unified common API.
The common API, as the generic service provider, checks with the CA whether the token is legitimate; only if it is will the API query the corresponding data from the storage layer and return it to the application.
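This is a standard token handshake, and a minimal sketch of the check on the common API side might look like the following; the CA interface, token and scope names are assumptions for illustration.

```java
import java.util.Optional;

// Minimal sketch of the access-token check on the common API side. The CA
// interface, token format and scope names here are assumptions.
class CommonDataApi {
    interface CertificationCenter {
        // Returns the business scope a valid token was issued for, empty if invalid.
        Optional<String> verify(String accessToken);
    }

    private final CertificationCenter ca;

    CommonDataApi(CertificationCenter ca) { this.ca = ca; }

    String query(String accessToken, String scope, String metric) {
        String granted = ca.verify(accessToken)
                .orElseThrow(() -> new SecurityException("invalid access token"));
        if (!granted.equals(scope)) {
            // e.g. the Meipai backend must not read commercial-advertising data.
            throw new SecurityException("token not authorized for scope " + scope);
        }
        // Only a legitimate, authorized request reaches the storage layer (stubbed here).
        return "{\"metric\":\"" + metric + "\",\"value\":42}";
    }

    public static void main(String[] args) {
        CertificationCenter ca = token ->
                "meipai-token".equals(token) ? Optional.of("meipai") : Optional.empty();
        System.out.println(new CommonDataApi(ca).query("meipai-token", "meipai", "daily_active_users"));
    }
}
```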
The overall architecture of the Meitu statistics platform
A unified task scheduler dispatches all the statistics tasks, and the JobExecutor is responsible for finding the appropriate plugins in the plugin pool and performing the queries, filtering and so on.
Figure 11
The resulting data is stored in the DB; a unified API encapsulates the storage layer, with the CA covering the security considerations; and finally each data backend connects to the common API for data presentation.
What is in progress and what is planned for the next phase
Figure 12
Two pieces are currently being worked on and are not yet officially live.
01
Distributed scheduling
The first piece is a distributed scheduler that we are developing ourselves. It is intended as a general-purpose scheduling platform: not only statistics tasks, but all offline computation and offline statistics tasks will be able to run on it later.
All statistics tasks will then be migrated to this general distributed scheduling platform for unified scheduling, replacing the current simple, single-point dispatch center. It will also go on to support resource isolation and resource scheduling.
02
Data Visualization
The second piece is data visualization. As we just saw, every business data backend has to integrate with our unified common API again and again, which means a lot of duplicated work; another pain point shows up with the relatively large apps.
Meipai, for example, has a great many statistical reports, with hundreds or thousands of metrics in its backend, so someone who wants to see their own data finds it very hard to locate their metrics among hundreds or thousands of others.
The idea is that business parties no longer all have to integrate with the common API: on one platform they can pick the data sources they want, visualize their own reports and present their own personalized metrics, so we no longer need to connect our API to every application's data backend.
On top of that we will add graphical presentation. So the data visualization piece is mainly about providing customized, personalized statistical reports, somewhat similar to platforms like NetEase's bdp or Alibaba's DataV.
The other two pieces are what we plan to do over the next period of time.
Author: Lu Rongbin. Bio: Graduated from Xiamen University and joined Meitu in 2014; he leads the design and development of Meitu's big data platform and is responsible for Meitu's big data infrastructure, data service architecture and data statistical analysis. He has been through the building and architectural evolution of Meitu's big data platform from scratch and has accumulated years of experience in big data architecture and practice. Edited by Tao Jialong and Sun Shujuan. Source: reprinted with permission from the Go China WeChat account.
Reprinted from 51CTO Tech Stack