Meitu Distributed Bitmap Practice: Naix

This article is from the 11th session of the Meitu Internet Technology Salon.

Big data technologies and applications now play a huge role in various industries, and a variety of open source technologies have brought great convenience to big data practitioners. Bitmap, a computing technique that arose from big data demands, has many advantages, such as fast computation, high information density, and support for large amounts of data.

With a huge amount of user data, Meitu runs a great many data computation tasks every day. Bitmap technology can significantly reduce computational overhead and save on data storage costs. Although many companies have tried Bitmap, there is still no relatively mature open source distributed Bitmap system, so Meitu developed its own for data computation tasks across its various scenarios.

/ Bitmap Introduction /

Bitmap, a technique widely used by various frameworks, is actually quite simple in principle.

A bit is a binary digit, and a Bitmap marks the value of an element by its bit (the two states 0 and 1); in short, a Bitmap is itself a bit array.

As a simple example, assume there are 10 users (IDs 1 to 10), and on a given day users 1, 3, 5, 7, 8, and 9 log into the system. How can we represent each user's login status simply? As shown in Figure 1, just find the bit corresponding to each user and set it to 1.

Figure 1

More often, if you need to know whether a user logged into the system that day, you only need to check whether the bit corresponding to that user's ID is 1. Likewise, by counting the number of 1s in the Bitmap, you get the total number of users who logged in. Bitmaps also support bitwise operations (such as AND, OR, ANDNOT, etc.), which make calculations such as dimensional cross-analysis much easier.
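The login example above can be sketched in a few lines of Python, using a plain int as the bit array. The user IDs match the Figure 1 example; the helper names are illustrative, not from Naix.

```python
# A minimal illustration of the Bitmap idea: bit i == 1 means the user with
# ID i logged in that day. A Python int serves as the bit array.

def set_bit(bm: int, user_id: int) -> int:
    """Set the bit for user_id to 1."""
    return bm | (1 << user_id)

def is_set(bm: int, user_id: int) -> bool:
    """Check whether user_id's bit is 1 (i.e. the user logged in)."""
    return (bm >> user_id) & 1 == 1

def count(bm: int) -> int:
    """Number of 1 bits == number of distinct users who logged in."""
    return bin(bm).count("1")

login = 0
for uid in (1, 3, 5, 7, 8, 9):   # users who logged in that day
    login = set_bit(login, uid)

print(is_set(login, 3))   # user 3 logged in -> True
print(is_set(login, 2))   # user 2 did not  -> False
print(count(login))       # 6 users logged in
```

Production bitmap libraries compress the bit array, but the set/test/count operations are exactly these.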

Two important features of Bitmap

High performance

Bitmap's computational performance on its home turf is quite impressive. At Meitu, early statistics were mainly based on Hive. For a simple retention calculation (i.e. counting the number of new users who are still active the next day) on the data of one Meitu app, Hive (using a left outer join) took about 139 seconds, while Bitmap (an intersection calculation) took only 226 ms; Hive took about 617 times longer than Bitmap. As shown in Figure 2, Hive ran on a 4-node Hadoop cluster, while Bitmap used only a single process on a single node.
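The retention calculation described above reduces to a single AND. A minimal sketch, with made-up user IDs standing in for the real data:

```python
# Next-day retention as a Bitmap intersection: |new_day1 AND active_day2|.
# Bitmaps are modeled as Python ints; the IDs below are illustrative.

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

new_day1    = to_bitmap([1, 2, 3, 4, 5])    # users newly added on day 1
active_day2 = to_bitmap([2, 4, 5, 9, 11])   # users active on day 2

retained = new_day1 & active_day2           # a single AND, no join needed
retention_count = bin(retained).count("1")

print(retention_count)                                 # 3 users retained
print(retention_count / bin(new_day1).count("1"))      # retention rate 0.6
```

This is why the Bitmap version avoids the join entirely: the intersection is one linear pass over compressed bits, not a shuffle over row data.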

Figure 2

Small storage space

Since a Bitmap marks state by bit, the data is highly compressed and takes up very little storage space. Assuming there are 1 billion active device IDs (numeric type), storing them in a regular Int array would take about 3.72G, while a Bitmap would take only about 110M. Naturally, if operations such as de-duplication and sorting are performed, the performance dividends from the saved storage space (e.g. lower memory consumption) can also be significant.
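The figures above can be checked with back-of-envelope arithmetic (one uncompressed bit per ID; the article's ~110M presumably also reflects compression):

```python
# Back-of-envelope check of the storage figures: 1 billion numeric IDs as a
# 4-byte Int array vs. one bit per ID in a raw (uncompressed) bit array.

n = 1_000_000_000
GiB = 1024 ** 3
MiB = 1024 ** 2

int_array_bytes = n * 4      # 4 bytes per Int
bitmap_bytes    = n / 8      # 1 bit per ID

print(round(int_array_bytes / GiB, 2))   # ~3.73 GiB for the Int array
print(round(bitmap_bytes / MiB))         # ~119 MiB for the raw bit array
```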

Bitmap applications at Meitu

Meitu has many apps, such as Meitu Xiuxiu, BeautyCam, Meipai, Chao Selfie, etc. The well-known Meitu Xiuxiu and BeautyCam both have daily active users at the ten-million level, with billions of accumulated users over their history.

Most of the major daily statistics are computed on Bitmap, such as new users, active users, retention, upgrades, return visits, etc. We also support multiple time granularities (days, weeks, months, and even years) and multiple dimensions such as app, channel, version, region, etc., as well as cross-calculations between dimensions.

Figure 3

The Bitmap principle is simple, but supporting massive amounts of data and demand with a Bitmap service alone is more complex than one might think. From 2015 to the present, from standalone to distributed, from a single app to access by many apps, and from a "small amount" of data with a few metrics to today's massive data and demand, we have met many challenges in our Bitmap practice, among which the more typical ones are:

• Hundreds of terabytes of Bitmap indexes. This is a volume that is hard to maintain on a single node and usually requires external storage or a self-developed distributed data store to address.
• Serialization and deserialization. Although Bitmaps take up little space and compute quickly, large Bitmap files used with external storage can still be several hundred megabytes each after compression, leaving a lot of room for optimization. Storing, querying, and deserializing such data is very time-consuming.
• How to perform multi-dimensional cross-counting well on distributed Bitmap storage, and how to respond quickly in highly concurrent query scenarios.

/ Meitu Distributed Bitmap: Naix /

Naix, the final form of Meitu's Bitmap service, is a general-purpose distributed Bitmap service developed by Meitu itself. To make Naix suitable for various scenarios, we designed its components and structure to be as generic as possible.

The name Naix comes from Dota; the Meitu data technology team names its projects after the "Dota series", such as Kunkka, Puck, Arachnia, etc. The reason for calling the distributed Bitmap Naix is simple: it sounds like "Next", as in Next Generation Bitmap.

Naix system design

As shown in Figure 4, the overall Naix system consists of three main layers: the external call layer, the core node layer, and the dependent external storage layer.

Figure 4

External call layer

The external call layer is subdivided into the generator and the tcp client. The generator is the tool responsible for producing Bitmaps: raw and regular data is usually stored in HDFS or other storage media, and the generator node converts the corresponding text or other data into Bitmap-related data and syncs it into the system. The tcp client is primarily responsible for the interaction between client applications and the distributed system.

Core node layer

The core node layer contains three main types of nodes:

• The Master node, the core of Naix, is primarily responsible for cluster management and maintenance, such as adding Bitmaps, node management, and other operations.
• Transport nodes are intermediate nodes for query operations; upon receiving query-related requests, Transport distributes them to the relevant nodes.
• Data Nodes, the core data storage nodes in Naix; we use PalDB as the underlying data storage for Bitmaps.

Dependent external storage layer

Naix has lightweight dependencies on external storage: MySQL is mainly used for managing metadata and maintaining intermediate scheduling state, data storage information, etc., and Redis serves mostly as a cache during computation.

Naix data structures

index group

Figure 5

As shown in Figure 5, the index group is the most basic data structure in Naix, similar to a database in a regular DBMS, and is mainly used to isolate different businesses. For example, in Meitu's business, Meitu Xiuxiu and Meipai are two different index groups. Each index group can contain multiple indexes; an index is similar to a table in a regular database, e.g. the new-user and active-user Bitmaps belong to two different indexes.

index

Figure 6

Each index has a fixed time attribute. Since Bitmap data may relate to different time periods, the data for different periods is placed into the same index by formatting the time. An index for a given period involves several dimensions, such as version, channel, region, etc., and each dimension has different dimension values (e.g. v1.0.0, v1.2.0, etc.); the Bitmap files we speak of correspond to specific dimension values.
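The addressing scheme above can be pictured as a key hierarchy. A hypothetical sketch, assuming a flat key layout (the names and schema here are illustrative, not Naix's real storage format):

```python
# Hypothetical sketch of the index-group / index / time / dimension hierarchy:
# each Bitmap is addressed by (index group, index, formatted time, dimension,
# dimension value). The store is a plain dict for illustration.

store = {}  # (index_group, index, day, dimension, dim_value) -> bitmap (int)

def put(index_group, index, day, dimension, dim_value, ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    store[(index_group, index, day, dimension, dim_value)] = bm

# "active" index of the "meitu_xiuxiu" group, day 20180101, version dimension
put("meitu_xiuxiu", "active", "20180101", "version", "v1.0.0", [1, 2, 3])
put("meitu_xiuxiu", "active", "20180101", "version", "v1.2.0", [4, 5])

key = ("meitu_xiuxiu", "active", "20180101", "version", "v1.0.0")
print(bin(store[key]).count("1"))   # 3 active users on v1.0.0
```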

Data information dictionary management

A Bitmap identifying the state of a user or element usually refers to IDs, but real business data often does not come as IDs. If you need statistics by imei or idfa, you must first map the device identifier to an ID through the data dictionary, and only then generate the Bitmap and complete the related statistics. Also, to facilitate maintenance and use of the data, we apply dictionary mapping management to dimensions and dimension values as well.
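The dictionary mapping idea can be sketched as follows. The append-only mapping scheme here is an assumption for illustration; a real dictionary service would need persistence and concurrency control:

```python
# Minimal sketch of a data dictionary: device identifiers (imei/idfa) are
# mapped to dense integer IDs before building Bitmaps, so the same device
# always gets the same bit across days and set operations remain valid.

id_dict = {}   # device identifier -> dense integer ID (illustrative scheme)

def to_id(device_id: str) -> int:
    if device_id not in id_dict:
        id_dict[device_id] = len(id_dict)   # assign the next free dense ID
    return id_dict[device_id]

def build_bitmap(device_ids):
    bm = 0
    for d in device_ids:
        bm |= 1 << to_id(d)
    return bm

day1 = build_bitmap(["imei-860111", "imei-860222", "idfa-AAA"])
day2 = build_bitmap(["imei-860222", "idfa-BBB"])

# the same identifier maps to the same bit on both days
print(bin(day1 & day2).count("1"))   # 1 device active on both days
```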

Naix generator

The raw data for a Bitmap usually means something like MySQL record data or HDFS text files, and the role of the Naix generator is to transform this raw data into Bitmap-related data and synchronize it into the Naix system. The generator supports Bitmap generation for various scenarios in the form of plugins, with business parties developing their own business logic on top of them.

The simple plugin is the easiest approach and the first plugin we used. At Meitu, most of the data is raw HDFS data; the relevant data is filtered to a processing server via the Hive client and then converted into Bitmap data by the plugin.

Due to the large data volume and the complexity of the business, at one point data generation at Meitu consumed nearly 3 hours per day. If a problem occurred midway and the job had to be re-run, other statistical operations would inevitably be affected, with serious consequences. That is why we developed the mapreduce plugin, expecting to speed up data generation by leveraging the strengths of distributed computing.

Practice has shown that the mapreduce plugin can compress a nearly 3-hour generation process to about 8 minutes (on a 4-node test cluster). Thanks to the characteristics of mapreduce, as the business and data volume keep growing we can also easily maintain a consistently fast generation speed through node scaling or by adjusting map and reduce counts.
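Why mapreduce-style generation parallelizes so naturally: mappers can emit (shard, offset) pairs and reducers build one Bitmap per shard independently. A toy single-process sketch (shard width and data are made up for illustration):

```python
# Toy model of mapreduce Bitmap generation: the map phase turns each ID into
# a (shard key, bit offset) pair; the reduce phase builds one Bitmap per
# shard. Each shard reduces independently, hence the easy parallelism.

from collections import defaultdict

SHARD_WIDTH = 16   # hypothetical fixed shard width

def map_phase(ids):
    for i in ids:
        yield i // SHARD_WIDTH, i % SHARD_WIDTH   # (shard key, bit offset)

def reduce_phase(pairs):
    shards = defaultdict(int)
    for shard, offset in pairs:
        shards[shard] |= 1 << offset
    return dict(shards)

shards = reduce_phase(map_phase([1, 3, 17, 18, 40]))
print(sorted(shards))              # shards touched: [0, 1, 2]
print(bin(shards[1]).count("1"))   # 2 IDs fall into shard 1
```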

The third plugin is the bitmap-to-bitmap plugin. For Bitmaps over various time periods, users can configure this plugin to periodically generate Bitmaps from existing Bitmaps inside the system. For Bitmaps like week, month, and year, the plugin can derive periodic Bitmaps from native ones (e.g. weeks from days, months from weeks, etc.); the user simply submits a generation schedule, and the Bitmap results are automatically generated in the system on a regular basis.
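The core of the bitmap-to-bitmap idea is that a weekly active Bitmap is just the OR of the daily ones, so periodic Bitmaps can be derived without rereading raw data. A minimal sketch with made-up daily data:

```python
# Deriving a periodic Bitmap from finer-grained ones: the weekly active set
# is the union (OR) of the daily active sets.

from functools import reduce

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

daily = [
    to_bitmap([1, 2]),   # Monday actives
    to_bitmap([2, 3]),   # Tuesday actives
    to_bitmap([5]),      # Wednesday actives
]                        # ... remaining days omitted for brevity

weekly = reduce(lambda a, b: a | b, daily, 0)
print(bin(weekly).count("1"))   # 4 distinct users active in the week
```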

Naix Storage

Sharding

How do we store massive amounts of data in a distributed system? Usually, conventional distributed Bitmaps depend on something like HBase, or split distributed storage by business; finding and copying data during computation then becomes a huge bottleneck. After various attempts, we finally took a sharding approach: all Bitmaps are split into shards of a fixed width; shards with the same index and the same replica serial number are stored on the same node, while different shards may be stored on the same or different nodes.

Figure 7

The sharding design offers a number of benefits:

• It solves the problem of distributed storage for hundreds of terabytes of data.
• Parallel computing: the Bitmap structure is special in that basic Bitmap operations can all be performed shard by shard in parallel and then aggregated. For huge Bitmap data, speed can be increased this way as well.
• Data copying: typically, most Bitmap practices separate data by business before sharding, but when data volume is large, a single business's data cannot fit on one node, so cross-business computation inevitably requires data copying. Sharding naturally distributes these computations to the nodes holding the shards, avoiding data copies.
• Serialization and deserialization: these problems usually occur with large Bitmaps, but after sharding all Bitmaps are basically of manageable size, so serialization and deserialization are no longer an issue.
• Crossing the int barrier: Bitmap implementations usually support only the int range, and with Meitu's growth its user count will soon exceed it. With the sharding approach, by offsetting IDs within each shard, shards can easily be stacked horizontally to support the full long range.
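The fixed-width addressing behind the last two points can be sketched as a simple division: a global (possibly larger-than-int) ID becomes a (shard index, offset) pair, and each per-shard Bitmap only ever sees small offsets. The shard width below is a made-up constant:

```python
# Sketch of fixed-width sharding: a global ID maps to (shard index, offset
# within shard), so each per-shard Bitmap stays small while the combined ID
# space can cover the full long range.

SHARD_WIDTH = 1 << 20   # hypothetical fixed shard width

def locate(global_id: int):
    """Global ID -> (shard index, bit offset inside that shard)."""
    return global_id // SHARD_WIDTH, global_id % SHARD_WIDTH

def global_id(shard: int, offset: int) -> int:
    """Inverse mapping: reassemble the global ID from shard + offset."""
    return shard * SHARD_WIDTH + offset

uid = 5_000_000_000   # beyond the 32-bit int range
shard, offset = locate(uid)
print(global_id(shard, offset) == uid)   # round-trips -> True
print(offset < SHARD_WIDTH)              # offset always fits a shard -> True
```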

replication

Replication is an extremely important feature of distributed systems, preventing data loss due to machine downtime, disk corruption, etc. In Naix, replication is supported at the index group level.

Figure 8

As shown in Figure 8, dark blue marks the primary shards, while light blue and blue-green mark the remaining replica shards. The figure shows two index groups configured with different replication counts, each containing two indexes. We can see that shards with the same replica number and the same subscript are stored on the same data node, while replicas of the same shard are always stored on different nodes.

Space and file fragmentation related optimizations

Optimizing space and file fragmentation is one of the most-attempted parts of Bitmap practice. Bitmap implementations based on Long arrays or other numeric arrays tend to be too dense or too sparse, leaving a lot of room for optimization. Most Bitmap compression algorithms are variants of run-length-style compression, saving space and reducing computation. In Meitu's Bitmaps, ewah (an RLE-like scheme) was used early on, and we subsequently switched to RoaringBitmap, currently the common Bitmap compression method used in Spark, Hive, Kylin, Druid, and other frameworks. A comparison of ewah and RoaringBitmap, tested in our real business scenario, shows a 67.3% space saving and a 58% saving in data generation time. Query times in real scenarios improved slightly, but not as much as the space and generation-time gains.

Figure 9

Earlier, Bitmaps were stored as files, and we applied mmap-like optimizations when reading them. However, daily business involves many Bitmaps with fine-grained dimension splits (e.g. few new users in a small channel), and sharding splits these small Bitmaps even smaller. Small files have very low block and inode utilization in the operating system, so we tried to solve the small-file fragmentation problem by optimizing the storage scheme. We mainly investigated the following options:

• Redis: Redis itself supports bitset operations, but its implementation falls short of expectations. Assuming a simple Bitmap kv store with 200T of data, 256G per machine, and a single backup copy, roughly 160 servers would be required, with costs growing as the business grows.
• HBase: HBase is used in many Meitu big data scenarios. But if HBase were used to store Bitmap data, there would be little room to optimize read and write performance, and the dependence on HBase would be too heavy to achieve the desired results.
• RocksDB: RocksDB is in common use in the industry. With compression enabled, RocksDB's CPU and disk usage is unstable due to compression; with compression disabled, its performance degrades severely as data volume rises, and using multiple DBs brings no improvement.
• PalDB: PalDB is a read-only kv store developed by LinkedIn; in official tests it performs about 8 times better than RocksDB and LevelDB. Once the data volume reaches a certain level, PalDB's performance is even better than Java's built-in HashSet and HashMap. PalDB's design has both advantages and disadvantages, and it limits the usage scenarios, but it largely satisfies Naix's needs. The PalDB source code is small, and we made simple adjustments for our specific usage. In tests, PalDB ultimately saved 13% of storage space and improved query time by more than 60% in real concurrent scenarios. The generator takes a little longer, but the effect is negligible thanks to the mr generation method.

Naix query

In the Naix system, all interaction is implemented on top of our self-developed rpc service (based on Netty, using the TCP protocol). ProtoBuf is used both in the underlying rpc layer and in the Naix server-side protocol.

Figure 10

Distributed computing involves nodes, shards, data storage, etc. For query scenarios, how do we find the relevant data, and how are the calculations performed?

Figure 11

In Naix, when a client initiates a query request to a transport node, the transport node selects the best nodes to distribute the request to based on the query conditions. Each node then determines the relevant shards according to the request conditions, finds them, and computes on them; the node results are aggregated and returned to the client, similar to a fork-join nested within a fork-join.

Naix supports queries in a generic way: it supports combined expressions with the operators ∩ ∪ - ( ); users assemble query expressions by selecting the desired query tuples and operators for their scenario, and of course we also provide a tool that encapsulates query expression assembly and transformation. Since the queries Naix supports are inherently business-independent, users can perform all kinds of query operations by assembling expressions.
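A minimal evaluator for this kind of set-expression query, with ∩ as AND, ∪ as OR, and difference as ANDNOT. The nested-tuple expression encoding is an assumption for illustration, not Naix's real wire format:

```python
# Minimal evaluator for combined set expressions over Bitmaps (ints):
# ("AND", a, b) -> a ∩ b, ("OR", a, b) -> a ∪ b, ("ANDNOT", a, b) -> a - b.

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def evaluate(expr):
    if isinstance(expr, int):        # leaf: an already-built Bitmap
        return expr
    op, left, right = expr
    l, r = evaluate(left), evaluate(right)
    if op == "AND":
        return l & r
    if op == "OR":
        return l | r
    if op == "ANDNOT":
        return l & ~r
    raise ValueError(f"unknown operator: {op}")

new    = to_bitmap([1, 2, 3, 4])
active = to_bitmap([2, 4, 6])
vip    = to_bitmap([4, 6, 8])

# (new ∩ active) - vip
result = evaluate(("ANDNOT", ("AND", new, active), vip))
print(bin(result).count("1"))   # only user 2 remains -> 1
```

Because expressions nest, any combination the operators allow can be evaluated by the same recursive walk, which is what makes the query layer business-independent.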

Figure 12

To give a few simple examples:

• The simplest device or user lookup, such as querying whether a particular user was a new or active user on a particular day.
• A multi-dimensional combined analysis, such as the next-day retention of users newly added on the vivo channel of Meipai on a given day.
• Multi-dimensional cross-analysis: for example, counting the active users of Meipai's v6.0 and v8.0 versions in the Baidu and vivo channels on a certain day involves 4 combined queries crossing two channels with two versions. This kind of operation is common in data analysis. Including the previous two kinds, the average response time of these simple query operations is only a few milliseconds.
• Full cross-counting of multiple dimensions: for example, crossing all channels and all versions of Meipai on a given day and outputting a very large set of results. The performance of such operations depends on the number of dimensions queried and the amount of data involved, with responses usually ranging from seconds to minutes.
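The channel x version cross-count from the examples above can be sketched as a product of dimension values, one intersection per combination (the channel and version Bitmaps below are made up for illustration):

```python
# Sketch of a dimensional cross-count: every (channel, version) combination
# gets its own intersection count; 2 channels x 2 versions -> 4 queries,
# matching the example in the text.

from itertools import product

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

channels = {"baidu": to_bitmap([1, 2, 3]), "vivo": to_bitmap([2, 4])}
versions = {"v6.0": to_bitmap([1, 2]),     "v8.0": to_bitmap([3, 4])}

cross = {
    (c, v): bin(channels[c] & versions[v]).count("1")
    for c, v in product(channels, versions)
}
print(len(cross))                  # 4 combinations
print(cross[("baidu", "v6.0")])    # users 1 and 2 -> 2
print(cross[("vivo", "v8.0")])     # user 4 -> 1
```

Full cross-counting is the same loop over every dimension value, which is why its cost grows with the number of dimensions and values involved.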

/ Future outlook /

To extend the Naix system to more of the company's businesses and even to external scenarios, we are still refining and optimizing it, and are currently working on the following:

• In the early days we concentrated more on system development to make sure we could meet demand. We are now also enriching ops tools to make Naix easier to maintain and manage.
• We are experimenting with various computational optimizations to bring query performance to the next level.
• SQL queries are also in our plans; SQL is a more widely accepted approach, and we hope it can reduce the learning cost for different kinds of users compared with the generic query expressions.

