Architecture evolution of a big data statistical analysis platform for APPs with a billion users and tens of millions of daily actives

Meitu has a billion users, and tens of millions of them use each of Meitu's products every day, accumulating a large amount of user data.

With the continuous iteration of the APPs and the rapid growth of the user base, the product, operations, and marketing teams rely more and more on data to optimize product features, track operational results, and analyze user behavior, and with this come more and more requirements for data statistics and analysis.

So how do we respond to and meet this ever-expanding demand for data statistics and analytics? And how does the constant evolution of the business in turn drive the evolution of the architecture?

This article introduces one product of the collision between big data business and technology: the architectural evolution of Meitu's big data statistical analysis platform. We hope this sharing brings you some ideas for solving your own data business and architecture problems.

If you have done any big-data-related development, you know that statistics work can be a bit awkward. First, it is not especially technical and does little for a developer's growth. Second, it tends to be repetitive: the same kind of work is done over and over to solve simple needs.

In fact, Meitu has many APPs, and each APP basically has corresponding product, operations, sales, and data analysis colleagues who raise all kinds of data requirements, such as data reports or data analysis requests. How are so many analytical and statistical needs solved at Meitu?

Today I mainly want to present Meitu's solution, in three main sections.

  • How statistics business and technology collide.
  • Implementation of the Meitu statistics platform architecture.
  • What is being done now, and some plans for the future.

How statistics business and technology collide

This part is basically my own experience of starting out alone in this business and running into some interesting problems, in roughly three stages.

  • How we responded to the product's initial data needs in the early stage of the project.
  • How we iterated once the user base exploded and the business data volume surged.
  • How an ambitious engineer can pull out of routine business work and keep growing technically.


Initial project period

The characteristics of this stage are obvious. Take Meipai as an example: the overall data volume was small at the beginning; statistical requirements were few, mainly basic indicators; and product iteration was very fast, requiring the data indicators to keep up with the product's iteration speed.

Figure 1

The solution at this stage looks very rudimentary today: the business side may run multiple service nodes for availability, each business node writes its logs to local disk, and the logs are then synchronized to a single data storage node via rsync.

On that node we wrote some simple shell or PHP scripts to implement the statistics logic, configured crontab entries to trigger the statistics tasks at regular intervals, and finally stored the results in MySQL for the presentation layer to render reports.


Rapid Development Stage

After user volume suddenly exploded, data volume kept increasing, and the needs of product operations and data analysis grew with it.

The first-stage solution then became problematic, mainly in the following three ways.

  • The capacity of single-point storage was very limited.
  • Computation quickly hit bottlenecks; statistical reports were often delayed until the next day because of them.
  • The statistical logic was implemented in shell or PHP scripts, so ongoing maintenance costs were high, and adjusting a piece of statistical logic or adding a filter condition was inconvenient.

Figure 2

So we made some adjustments.

  • Implemented a data collection system responsible for collecting server-side logs and landing the data in HDFS for storage.
  • As mentioned, storage and computation were both single points, so we built our own Hadoop cluster to remove them.
  • Adopted Hive, so that most statistical logic could be expressed as queries rather than hand-written code.


An ambitious programmer

As the demand kept swelling, an ambitious programmer would notice that the amount of duplicated code was very high: even with Hive as a query layer to express the core logic, each requirement still needed its own final layer of data filtering or aggregation.

In fact, each statistical requirement needed its own corresponding Java implementation, which was very tedious, repetitive work.

Figure 3

An aspiring programmer, if you will, is not resigned to doing repetitive work every day. And from everyday contact with the business and its implementation, it was clear that the flow of statistical business logic is basically always the same.

So we considered abstracting this fairly common business flow. The basic process is: query the data from the source, apply some business-level aggregation or filtering, and finally store the data in a DB.

We then made a layer of abstraction at the code level: a statistics component consisting of Query, Aggregator, and DBStore, with a number of implementations for the different Query and Store scenarios.

With this abstraction in place, productivity improved considerably over the previous approach. Before, I could handle four or five statistical requirements a day on my own; afterward, about seven or eight, from understanding the requirement to implementing it, a good improvement in overall efficiency.
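The component split above can be sketched as three small interfaces plus a job class that wires them together. This is a minimal illustration of the idea, with hypothetical names, not Meitu's actual code:

```java
import java.util.*;

// A minimal sketch of the statistics component described above: a Query
// pulls rows from a source, an Aggregator folds them into metrics, and a
// DBStore persists the result. All names here are illustrative.
public class StatDemo {
    interface Query { List<Map<String, Object>> run(); }
    interface Aggregator { Map<String, Long> aggregate(List<Map<String, Object>> rows); }
    interface DBStore { void save(Map<String, Long> result); }

    // One statistical requirement = one wiring of the three components.
    static class StatJob {
        final Query query; final Aggregator agg; final DBStore store;
        StatJob(Query q, Aggregator a, DBStore s) { query = q; agg = a; store = s; }
        void execute() { store.save(agg.aggregate(query.run())); }
    }

    // Demo wiring: count events per action over fake "log rows" that stand
    // in for a Hive query result.
    static Map<String, Long> demo() {
        List<Map<String, Object>> rows = List.of(
            Map.of("uid", "u1", "action", "play"),
            Map.of("uid", "u2", "action", "play"),
            Map.of("uid", "u1", "action", "like"));
        Map<String, Long> sink = new HashMap<>();   // stands in for MySQL/Mongo
        new StatJob(
            () -> rows,
            rs -> {
                Map<String, Long> counts = new HashMap<>();
                for (Map<String, Object> r : rs)
                    counts.merge((String) r.get("action"), 1L, Long::sum);
                return counts;
            },
            sink::putAll
        ).execute();
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A new requirement then only swaps in a different Query, Aggregator, or DBStore implementation instead of rewriting the whole pipeline.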

Architectural implementation of the Meitu statistics platform

Even with the abstraction above, a number of pain points remained.

  • Business dependency: most of the time spent on a statistical requirement goes into understanding the data and business background behind it, i.e., what the product looks like or what campaigns operations have run, and the cost of communicating that background is very high.
  • Even with the abstraction there was still repetitive coding: for each statistics component we had to pick the right Query, write the business logic, and pick the DBStore for the storage layer.
  • O&M costs were high: putting a task online required building a package and modifying scripts such as shells.
  • In terms of personal growth, doing this kind of work for a long time puts a sizable ceiling on one's technical development.

Figure 4

Based on these pain points, let's go over how we addressed them. Our thinking was to move to a platform: the business side uses the platform, and we simply provide the service.

Figure 4 shows our general idea of platformization at the time. For example, the business parties on the left have many data requirements for reports, and may also have data requirements for in-APP data scenarios, commercial advertising, and so on.

We wanted to provide a platform where the business-side data consumers configure the data metrics they want, and the platform takes care of computing and storing them and finally serving the data to the data applications.

Going further, in building this platform we needed to consider the following important points.

  • We need a clear metadata description of each statistical task, describing what the computation looks like and which operators the task uses.
  • Where a statistical task's data source comes from, and which storage is most suitable for the business's queries.
  • A scheduling center is needed to schedule the execution of all statistical tasks centrally.
  • The correct final execution of each task must be guaranteed.

Based on these points, we considered a number of different modules to take care of the major functions mentioned above.

We designed roughly three modules.

  • The JobManager module, which mainly provides the platform itself: user-friendly configuration for the business side, plus management of task metadata and of data warehouse and APP information.
  • The Scheduler module, the scheduling center responsible for scheduling all statistical tasks.
  • The JobExecutor module, the task executor, responsible for everything from querying and aggregation to landing the final results in storage.

The next sections detail the main function points of these three modules and how they are implemented.


JobManager Module

This module focuses on abstracting statistical tasks and managing the tasks' metadata through unified configuration.

As in Figure 5, we mainly need to provide a platform where the application side can configure the data they want. The other point is that we need to integrate the data warehouse, mainly so that the business side can view the information in their own business tables.

Figure 5

The right-hand side describes the metadata of a statistical task. It contains these big pieces: the data source, the statistical operator, the storage medium, and special scenarios such as data filters, dimensional aggregation, and the dependencies between tasks.
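A task's metadata along those lines can be sketched as a simple value class. The field names here are assumptions for illustration, not the platform's real schema:

```java
import java.util.*;

// Illustrative sketch of the task metadata described above: source,
// operator, storage, filters, dimensions, and dependencies.
public class JobMeta {
    String jobId;
    String source;                                // e.g. a Hive table to read
    String operator;                              // count, distinct, topn, ...
    String storage;                               // mongodb, mysql, hdfs, csv, ...
    List<String> filters = new ArrayList<>();     // row-level filter expressions
    List<String> dimensions = new ArrayList<>();  // group-by dimensions
    Set<String> dependsOn = new HashSet<>();      // upstream job ids

    // A task is runnable once all of its upstream jobs have completed.
    boolean readyToRun(Set<String> completed) {
        return completed.containsAll(dependsOn);
    }

    // A hypothetical daily play-count job for Meipai, for illustration only.
    static JobMeta sample() {
        JobMeta m = new JobMeta();
        m.jobId = "meipai_daily_play";
        m.source = "hive://logs.meipai_play";
        m.operator = "count";
        m.storage = "mongodb";
        m.filters.add("event = 'play'");
        m.dimensions.addAll(List.of("dt", "os"));
        m.dependsOn.add("meipai_log_ready");
        return m;
    }

    public static void main(String[] args) {
        JobMeta m = sample();
        System.out.println(m.jobId + " ready=" + m.readyToRun(Set.of("meipai_log_ready")));
    }
}
```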


Task scheduling module Scheduler

The current implementation is relatively simple and runs as a single point. It currently realizes the following.

  • Scheduling tasks based on their priority.
  • Scheduling according to each task's timing policy.
  • Scheduling workflows, i.e., scheduling based on dependencies between tasks.
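A minimal single-point scheduler covering those three points might look like the following. The structure and names are illustrative, not Meitu's actual scheduler:

```java
import java.util.*;

// Sketch of a single-point scheduler: priority ordering, a due-time
// (timing policy), and dependency gating between tasks.
public class Scheduler {
    static class Task {
        final String id; final int priority; final long dueAt; final Set<String> deps;
        Task(String id, int priority, long dueAt, Set<String> deps) {
            this.id = id; this.priority = priority; this.dueAt = dueAt; this.deps = deps;
        }
    }

    // Lower number = higher priority; ties broken by earlier due time.
    private final PriorityQueue<Task> queue = new PriorityQueue<>(
        Comparator.comparingInt((Task t) -> t.priority).thenComparingLong(t -> t.dueAt));
    private final Set<String> done = new HashSet<>();

    void submit(Task t) { queue.add(t); }
    void markDone(Task t) { done.add(t.id); }

    // Return the best runnable task at time `now`, or null if none is ready.
    Task next(long now) {
        List<Task> notReady = new ArrayList<>();
        Task picked = null;
        while (!queue.isEmpty()) {
            Task t = queue.poll();
            if (t.dueAt <= now && done.containsAll(t.deps)) { picked = t; break; }
            notReady.add(t);      // due later, or blocked on a dependency
        }
        queue.addAll(notReady);   // put skipped tasks back
        return picked;
    }

    static String demo() {
        Scheduler s = new Scheduler();
        // "report" has higher priority but depends on "etl".
        s.submit(new Task("report", 1, 0, Set.of("etl")));
        s.submit(new Task("etl", 2, 0, Set.of()));
        Task first = s.next(0);
        s.markDone(first);
        Task second = s.next(0);
        return first.id + "," + second.id;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // etl,report
    }
}
```

Note how dependency gating overrides priority: "report" is higher priority but cannot run until "etl" is marked done.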

Figure 6


Task Execution Module JobExecutor

As in Figure 7, a specific Query component is assembled from the plugin pool based on the task's source information; the query then runs against a specific query layer (e.g., Hive), and the resulting data goes through some filtering and dimensional aggregation.

Figure 7

Finally, the storage-layer components are assembled based on the task's information, and the statistical results are written to the storage layer.

Having covered the three modules, let's review the basic architecture of the statistics platform. On the left is the JobManager, which manages the metadata; based on that metadata, the standard flow of a statistical task runs: query, filter, dimensional aggregation, storage.

With this basic framework in place, we could cover some basic statistics scenarios, but supporting more business statistics scenarios required expanding the feature set further (Figure 8).

Figure 8

There are four broad directions of functional expansion here.

Ad hoc data extraction scenarios

Not every statistic needs to run routinely; there are many ad hoc scenarios. For example, analysts may need a one-off look at the feature data of some APP, or operations may need the metrics of a one-off campaign; such ad hoc data extraction requests come up often.

To solve the ad hoc extraction requirement we built two functions. One lets users with a SQL background fill in SQL directly for ad hoc data extraction.

This extends the ANTLR parser integrated in Hive to parse the HQL. After parsing, we check the HQL's legitimacy: mainly excluding operations such as Insert and Delete, and limiting the time range the query runs over, to avoid occupying cluster compute resources for too long.
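The real implementation parses HQL with Hive's ANTLR grammar; as a greatly simplified sketch of the same safety idea, one can screen for write keywords and require a bounded date predicate. The `dt` partition column is an illustrative convention, not Meitu's actual schema:

```java
import java.util.regex.Pattern;

// Simplified stand-in for the HQL legitimacy check described above:
// reject write operations, and require a predicate on an assumed `dt`
// partition column so the query's time range stays bounded.
public class HqlGuard {
    private static final Pattern FORBIDDEN = Pattern.compile(
        "\\b(insert|delete|update|drop|alter|truncate)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern DATE_BOUND = Pattern.compile(
        "\\bdt\\s*(=|>=|between)", Pattern.CASE_INSENSITIVE);

    static boolean allowed(String hql) {
        return !FORBIDDEN.matcher(hql).find() && DATE_BOUND.matcher(hql).find();
    }

    public static void main(String[] args) {
        System.out.println(allowed("select count(*) from play_log where dt = '2017-01-01'"));
        System.out.println(allowed("insert into play_log values (1)"));  // write: rejected
        System.out.println(allowed("select * from play_log"));           // unbounded: rejected
    }
}
```

A keyword screen like this is far weaker than a true parse; it is only meant to convey what the parser-based check enforces.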

Enriching data sources as much as possible

Broadly, we increasingly ran into the need to import MySQL data from the business side for simple statistics or Join calculations.

So for this piece we developed a plugin based on Sqoop to support importing business MySQL tables into Hadoop.

The third is Bitmap, a system developed in-house at Meitu, mainly to make multi-dimensional deduplication, and the corresponding new-user and retention calculations, convenient. Its principle is based mainly on operations between bits.
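The general technique behind such a system can be shown with `java.util.BitSet`: map each user to a fixed bit index, keep one bitmap per day, and derive metrics with bitwise AND / ANDNOT. This illustrates only the idea; Meitu's Bitmap is a separate in-house service:

```java
import java.util.BitSet;

// Bitmap sketch for deduplication, new-user and retention counting:
// one bit per user per day, metrics derived from bitwise operations.
public class BitmapRetention {
    static int[] demo() {
        BitSet day1 = new BitSet();                 // users active on day 1
        BitSet day2 = new BitSet();                 // users active on day 2
        for (int u : new int[]{1, 2, 3, 5}) day1.set(u);
        for (int u : new int[]{2, 3, 7}) day2.set(u);

        BitSet retained = (BitSet) day1.clone();
        retained.and(day2);                         // active both days: day-1 retention
        BitSet fresh = (BitSet) day2.clone();
        fresh.andNot(day1);                         // first seen on day 2: "new" users

        return new int[]{retained.cardinality(), fresh.cardinality()};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("retained=" + r[0] + " new=" + r[1]); // retained=2 new=1
    }
}
```

Because set membership is a single bit, intersecting a month of daily bitmaps stays cheap even for tens of millions of users, which is what makes multi-dimensional deduplication practical.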


Enriching the storage layer

Most of the data is currently stored in MongoDB, which sits between a traditional relational database and NoSQL; it satisfies most business query scenarios while guaranteeing distributed data storage.

The second is the occasional large data export: when the business side needs a larger batch of data, we can land it on HDFS, from which the business user exports it for their various applications.

The third is support for plain-text formats such as CSV. The fourth is MySQL, for which some split-table strategies are currently supported. The last piece is enriching the statistical operators; there are currently implementations of deduplication, array, TopN, and other operators.
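A split-table strategy of the kind mentioned for MySQL can be as simple as routing each row to a physical table chosen by month, so no single table grows without bound. The naming convention here is illustrative, not the platform's actual scheme:

```java
// Tiny sketch of month-based table splitting for MySQL storage:
// a logical table plus a date picks the physical table to write to.
public class TableRouter {
    // e.g. ("meipai_play", "2017-06-03") -> "meipai_play_201706"
    static String route(String logicalTable, String date) {
        return logicalTable + "_" + date.substring(0, 7).replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(route("meipai_play", "2017-06-03"));
    }
}
```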

Data presentation and a unified API

As in Figure 9, because the storage layer is diverse, the original approach exposed our storage layer directly to each application's data backend, which queried and parsed data from the storage layer itself. This had two problems.

Figure 9

The first is that because the storage layer is not transparent, display-layer developers have to learn HBase, MySQL, MongoDB, and so on, which is a fairly large learning cost.

The second is that data security and unified management of data storage were handled relatively poorly. So we built a unified common API with a custom unified data protocol, making it easy for display layers to consume data in a uniform way.

But this raises further questions of platform-level data security that we had to consider, as in Figure 10.

Figure 10

For example, normally Meipai's backend may only access Meipai-related data, and is not allowed to access the data of other APPs or of commercial advertising.

So we implemented a unified authentication center (CA): a business backend first obtains an authorization token from the CA, and then carries that access token when requesting the unified common API.

The common API, as the generic service provider, asks the CA to verify whether the token is legitimate; only if it is does the API query the corresponding data in the storage layer and return it to the application.
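The flow of issue-token-then-verify-scope can be condensed into a few lines. This is a toy in-memory model of the idea only; the actual CA is a separate service, and all names here are assumptions:

```java
import java.util.*;

// Toy sketch of the CA flow: the backend obtains a token scoped to its own
// APP, and the common API checks token and scope before serving data.
public class AuthDemo {
    static final Map<String, String> tokenToApp = new HashMap<>(); // issued tokens

    // "CA": issue a token bound to one APP's data scope.
    static String issueToken(String app) {
        String token = UUID.randomUUID().toString();
        tokenToApp.put(token, app);
        return token;
    }

    // "Common API": serve data only if the token's scope matches the request.
    static boolean authorize(String token, String requestedApp) {
        return requestedApp.equals(tokenToApp.get(token));
    }

    public static void main(String[] args) {
        String t = issueToken("meipai");
        System.out.println(authorize(t, "meipai"));   // its own data: allowed
        System.out.println(authorize(t, "ads"));      // another APP's data: denied
    }
}
```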

The overall architecture of the Meitu statistics platform

A unified task scheduler dispatches all the statistics tasks, and the JobExecutor is responsible for finding the appropriate plugins from the plugin pool and doing the querying, filtering, and so on.

Figure 11

The final data is stored in the DB; a unified API encapsulates the storage layer; for security, access goes through the CA; and finally each data backend consumes the common API for data presentation.

What is being done, and what is planned for the next phase

Figure 12

Two pieces are currently in progress and not yet officially live or available.


Distributed scheduling

The first piece is a distributed scheduler developed in-house. It is intended as a general-purpose scheduling platform: not only statistics tasks, but all offline computation and offline statistics tasks will later be schedulable on it.

Next, all statistical tasks will be migrated to this general distributed scheduling platform for unified scheduling, replacing the current simple single-point scheduling center. It will also go on to support resource isolation and resource scheduling.


Data Visualization

The second piece is data visualization. As we just saw, every business data backend has to integrate with our unified common API over and over, which means a lot of duplicated work. Another pain point involves the relatively large APPs.

For example, Meipai has a great many statistical reports, basically hundreds or thousands of metrics in its backend, so it is very hard for a data consumer to locate their own metrics among hundreds or thousands of them.

We no longer want every business party to integrate with the common API. Instead, within one platform they can choose the data source they want, build their own report visualizations, and present their own personalized data metrics, without us interfacing our API with every application's data backend.

On top of that we add graphical presentation. So the data visualization piece mainly provides customized, personalized statistical reports, somewhat similar to platforms such as NetEase's BDP or Alibaba's DataV.

The other two pieces are what we plan to do over the next period of time.

  • The first is that data analysts often complain that there is no faster way to query the data, so we are considering building an OLAP service, for example one based on Kylin.
  • The second is real-time statistics. Everything mentioned so far is either routine periodic statistics or ad hoc statistics, with no real-time capability. So we are considering having the platform also support real-time statistics, so that real-time scenarios can be served more quickly.

Author: Lu Rongbin. Bio: graduated from Xiamen University and joined Meitu in 2014, where he leads the design and development of Meitu's big data platform and is responsible for Meitu's big data infrastructure, data service architecture, and data statistical analysis. He has been through the building and architectural evolution of Meitu's big data platform from scratch, and has accumulated years of experience in big data architecture and practice. Edited by Tao Jialong and Sun Shujuan. Source: reprinted with permission from the Go China WeChat account.

Reprinted from 51CTO Tech Stack

