Meitu Distributed Bitmap Practice: Naix

This article is from the 11th session of the Meitu Internet Technology Salon.

Big data technologies and applications now play a huge role in various industries, and a variety of open source technologies have brought great convenience to big data practitioners. Bitmap, a computing technique that arose from big data demands, has many advantages, such as fast computation, high information density, and support for large amounts of data.

With a huge amount of user data, Meitu runs a great many data computation tasks every day. Bitmap technology can significantly reduce computational overhead and save on data storage costs. Although many companies have tried Bitmap, there is still no relatively mature open source distributed Bitmap system, so Meitu developed its own for data computation tasks across its various scenarios.

/ Bitmap Introduction /

Bitmap, a technique widely used by various frameworks, is actually quite simple in principle.

A bit is a binary digit, and a Bitmap marks the value of an element by its bit (the two states 0 and 1); in short, a Bitmap is itself a bit array.

As a simple example, assume there are 10 users (IDs 1 to 10), and on a given day users 1, 3, 5, 7, 8, and 9 log into the system. How can we represent each user's login status simply? As shown in Figure 1, just find the bit corresponding to each user and set it to 1.

Figure 1

More often, if you need to know whether a user logged into the system that day, you only need to check whether the bit corresponding to that user's ID is 1. Likewise, by counting the number of 1s in the Bitmap, you get the total number of users who logged in. Bitmaps also support bitwise operations (such as AND, OR, ANDNOT, etc.), which make calculations such as dimensional cross-analysis much easier.
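The login example above can be sketched in a few lines of Python, using a plain int as the bit array. The user IDs match the Figure 1 example; the helper names are illustrative, not from Naix.

```python
# A minimal illustration of the Bitmap idea: bit i == 1 means the user with
# ID i logged in that day. A Python int serves as the bit array.

def set_bit(bm: int, user_id: int) -> int:
    """Set the bit for user_id to 1."""
    return bm | (1 << user_id)

def is_set(bm: int, user_id: int) -> bool:
    """Check whether user_id's bit is 1 (i.e. the user logged in)."""
    return (bm >> user_id) & 1 == 1

def count(bm: int) -> int:
    """Number of 1 bits == number of distinct users who logged in."""
    return bin(bm).count("1")

login = 0
for uid in (1, 3, 5, 7, 8, 9):   # users who logged in that day
    login = set_bit(login, uid)

print(is_set(login, 3))   # user 3 logged in -> True
print(is_set(login, 2))   # user 2 did not  -> False
print(count(login))       # 6 users logged in
```

Production bitmap libraries compress the bit array, but the set/test/count operations are exactly these.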

Two important features of Bitmap

High performance

Bitmap's computational performance on its home turf is quite impressive. At Meitu, early statistics were mainly based on Hive. For a simple retention calculation (i.e. counting the number of new users who are still active the next day) on the data of one Meitu app, Hive (using a left outer join) took about 139 seconds, while Bitmap (an intersection calculation) took only 226 ms; Hive took about 617 times longer than Bitmap. As shown in Figure 2, Hive ran on a 4-node Hadoop cluster, while Bitmap used only a single process on a single node.
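The retention calculation described above reduces to a single AND. A minimal sketch, with made-up user IDs standing in for the real data:

```python
# Next-day retention as a Bitmap intersection: |new_day1 AND active_day2|.
# Bitmaps are modeled as Python ints; the IDs below are illustrative.

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

new_day1    = to_bitmap([1, 2, 3, 4, 5])    # users newly added on day 1
active_day2 = to_bitmap([2, 4, 5, 9, 11])   # users active on day 2

retained = new_day1 & active_day2           # a single AND, no join needed
retention_count = bin(retained).count("1")

print(retention_count)                                 # 3 users retained
print(retention_count / bin(new_day1).count("1"))      # retention rate 0.6
```

This is why the Bitmap version avoids the join entirely: the intersection is one linear pass over compressed bits, not a shuffle over row data.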

Figure 2

Small storage space

Since a Bitmap marks state by bit, the data is highly compressed and takes up very little storage space. Assuming there are 1 billion active device IDs (numeric type), storing them in a regular Int array would take about 3.72G, while a Bitmap would take only about 110M. Naturally, if operations such as de-duplication and sorting are performed, the performance dividends from the saved storage space (e.g. lower memory consumption) can also be significant.
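The figures above can be checked with back-of-envelope arithmetic (one uncompressed bit per ID; the article's ~110M presumably also reflects compression):

```python
# Back-of-envelope check of the storage figures: 1 billion numeric IDs as a
# 4-byte Int array vs. one bit per ID in a raw (uncompressed) bit array.

n = 1_000_000_000
GiB = 1024 ** 3
MiB = 1024 ** 2

int_array_bytes = n * 4      # 4 bytes per Int
bitmap_bytes    = n / 8      # 1 bit per ID

print(round(int_array_bytes / GiB, 2))   # ~3.73 GiB for the Int array
print(round(bitmap_bytes / MiB))         # ~119 MiB for the raw bit array
```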

Bitmap applications at Meitu

Meitu has many apps, such as Meitu Xiuxiu, BeautyCam, Meipai, Chao Selfie, etc. The well-known Meitu Xiuxiu and BeautyCam both have daily active users at the ten-million level, with billions of accumulated users over their history.

Most of the major daily statistics are computed on Bitmap, such as new users, active users, retention, upgrades, return visits, etc. We also support multiple time granularities (days, weeks, months, and even years) and multiple dimensions such as app, channel, version, region, etc., as well as cross-calculations between dimensions.

Figure 3

The Bitmap principle is simple, but supporting massive amounts of data and demand with a Bitmap service alone is more complex than one might think. From 2015 to the present, from standalone to distributed, from a single app to access by many apps, and from a "small amount" of data with a few metrics to today's massive data and demand, we have met many challenges in our Bitmap practice, among which the more typical ones are:

• Hundreds of terabytes of Bitmap indexes. This is a volume that is hard to maintain on a single node and usually requires external storage or a self-developed distributed data store to address.
• Serialization and deserialization. Although Bitmaps take up little space and compute quickly, large Bitmap files used with external storage can still be several hundred megabytes each after compression, leaving a lot of room for optimization. Storing, querying, and deserializing such data is very time-consuming.
• How to perform multi-dimensional cross-counting well on distributed Bitmap storage, and how to respond quickly in highly concurrent query scenarios.

/ Meitu Distributed Bitmap: Naix /

Naix, the final form of Meitu's Bitmap service, is a general-purpose distributed Bitmap service developed by Meitu itself. To make Naix suitable for various scenarios, we designed its components and structure to be as generic as possible.

The name Naix comes from Dota; the Meitu data technology team names its projects after the "Dota series", such as Kunkka, Puck, Arachnia, etc. The reason for calling the distributed Bitmap Naix is simple: it sounds like "Next", as in Next Generation Bitmap.

Naix system design

As shown in Figure 4, the overall Naix system consists of three main layers: the external call layer, the core node layer, and the dependent external storage layer.

Figure 4

External call layer

The external call layer is subdivided into the generator and the tcp client. The generator is the tool responsible for producing Bitmaps: raw and regular data is usually stored in HDFS or other storage media, and the generator node converts the corresponding text or other data into Bitmap-related data and syncs it into the system. The tcp client is primarily responsible for the interaction between client applications and the distributed system.

Core node layer

The core node layer contains three main types of nodes:

• The Master node, the core of Naix, is primarily responsible for cluster management and maintenance, such as adding Bitmaps, node management, and other operations.
• Transport nodes are intermediate nodes for query operations; upon receiving query-related requests, Transport distributes them to the relevant nodes.
• Data Nodes, the core data storage nodes in Naix; we use PalDB as the underlying data storage for Bitmaps.

Dependent external storage layer

Naix has lightweight dependencies on external storage: MySQL is mainly used for managing metadata and maintaining intermediate scheduling state, data storage information, etc., and Redis serves mostly as a cache during computation.

Naix data structures

index group

Figure 5

As shown in Figure 5, the index group is the most basic data structure in Naix, similar to a database in a regular DBMS, and is mainly used to isolate different businesses. For example, in Meitu's business, Meitu Xiuxiu and Meipai are two different index groups. Each index group can contain multiple indexes; an index is similar to a table in a regular database, e.g. the new-user and active-user Bitmaps belong to two different indexes.

index

Figure 6

Each index has a fixed time attribute. Since Bitmap data may relate to different time periods, the data for different periods is placed into the same index by formatting the time. An index for a given period involves several dimensions, such as version, channel, region, etc., and each dimension has different dimension values (e.g. v1.0.0, v1.2.0, etc.); the Bitmap files we speak of correspond to specific dimension values.
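The addressing scheme above can be pictured as a key hierarchy. A hypothetical sketch, assuming a flat key layout (the names and schema here are illustrative, not Naix's real storage format):

```python
# Hypothetical sketch of the index-group / index / time / dimension hierarchy:
# each Bitmap is addressed by (index group, index, formatted time, dimension,
# dimension value). The store is a plain dict for illustration.

store = {}  # (index_group, index, day, dimension, dim_value) -> bitmap (int)

def put(index_group, index, day, dimension, dim_value, ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    store[(index_group, index, day, dimension, dim_value)] = bm

# "active" index of the "meitu_xiuxiu" group, day 20180101, version dimension
put("meitu_xiuxiu", "active", "20180101", "version", "v1.0.0", [1, 2, 3])
put("meitu_xiuxiu", "active", "20180101", "version", "v1.2.0", [4, 5])

key = ("meitu_xiuxiu", "active", "20180101", "version", "v1.0.0")
print(bin(store[key]).count("1"))   # 3 active users on v1.0.0
```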

Data information dictionary management

A Bitmap identifying the state of a user or element usually refers to IDs, but real business data often does not come as IDs. If you need statistics by imei or idfa, you must first map the device identifier to an ID through the data dictionary, and only then generate the Bitmap and complete the related statistics. Also, to facilitate maintenance and use of the data, we apply dictionary mapping management to dimensions and dimension values as well.
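The dictionary mapping idea can be sketched as follows. The append-only mapping scheme here is an assumption for illustration; a real dictionary service would need persistence and concurrency control:

```python
# Minimal sketch of a data dictionary: device identifiers (imei/idfa) are
# mapped to dense integer IDs before building Bitmaps, so the same device
# always gets the same bit across days and set operations remain valid.

id_dict = {}   # device identifier -> dense integer ID (illustrative scheme)

def to_id(device_id: str) -> int:
    if device_id not in id_dict:
        id_dict[device_id] = len(id_dict)   # assign the next free dense ID
    return id_dict[device_id]

def build_bitmap(device_ids):
    bm = 0
    for d in device_ids:
        bm |= 1 << to_id(d)
    return bm

day1 = build_bitmap(["imei-860111", "imei-860222", "idfa-AAA"])
day2 = build_bitmap(["imei-860222", "idfa-BBB"])

# the same identifier maps to the same bit on both days
print(bin(day1 & day2).count("1"))   # 1 device active on both days
```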

Naix generator

The raw data for a Bitmap usually means something like MySQL record data or HDFS text files, and the role of the Naix generator is to transform this raw data into Bitmap-related data and synchronize it into the Naix system. The generator supports Bitmap generation for various scenarios in the form of plugins, with business parties developing their own business logic on top of them.

The simple plugin is the easiest approach and the first plugin we used. At Meitu, most of the data is raw HDFS data; the relevant data is filtered to a processing server via the Hive client and then converted into Bitmap data by the plugin.

Due to the large data volume and the complexity of the business, at one point data generation at Meitu consumed nearly 3 hours per day. If a problem occurred midway and the job had to be re-run, other statistical operations would inevitably be affected, with serious consequences. That is why we developed the mapreduce plugin, expecting to speed up data generation by leveraging the strengths of distributed computing.

Practice has shown that the mapreduce plugin can compress a nearly 3-hour generation process to about 8 minutes (on a 4-node test cluster). Thanks to the characteristics of mapreduce, as the business and data volume keep growing we can also easily maintain a consistently fast generation speed through node scaling or by adjusting map and reduce counts.
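Why mapreduce-style generation parallelizes so naturally: mappers can emit (shard, offset) pairs and reducers build one Bitmap per shard independently. A toy single-process sketch (shard width and data are made up for illustration):

```python
# Toy model of mapreduce Bitmap generation: the map phase turns each ID into
# a (shard key, bit offset) pair; the reduce phase builds one Bitmap per
# shard. Each shard reduces independently, hence the easy parallelism.

from collections import defaultdict

SHARD_WIDTH = 16   # hypothetical fixed shard width

def map_phase(ids):
    for i in ids:
        yield i // SHARD_WIDTH, i % SHARD_WIDTH   # (shard key, bit offset)

def reduce_phase(pairs):
    shards = defaultdict(int)
    for shard, offset in pairs:
        shards[shard] |= 1 << offset
    return dict(shards)

shards = reduce_phase(map_phase([1, 3, 17, 18, 40]))
print(sorted(shards))              # shards touched: [0, 1, 2]
print(bin(shards[1]).count("1"))   # 2 IDs fall into shard 1
```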

The third plugin is the bitmap-to-bitmap plugin. For Bitmaps over various time periods, users can configure this plugin to periodically generate Bitmaps from existing Bitmaps inside the system. For Bitmaps like week, month, and year, the plugin can derive periodic Bitmaps from native ones (e.g. weeks from days, months from weeks, etc.); the user simply submits a generation schedule, and the Bitmap results are automatically generated in the system on a regular basis.
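The core of the bitmap-to-bitmap idea is that a weekly active Bitmap is just the OR of the daily ones, so periodic Bitmaps can be derived without rereading raw data. A minimal sketch with made-up daily data:

```python
# Deriving a periodic Bitmap from finer-grained ones: the weekly active set
# is the union (OR) of the daily active sets.

from functools import reduce

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

daily = [
    to_bitmap([1, 2]),   # Monday actives
    to_bitmap([2, 3]),   # Tuesday actives
    to_bitmap([5]),      # Wednesday actives
]                        # ... remaining days omitted for brevity

weekly = reduce(lambda a, b: a | b, daily, 0)
print(bin(weekly).count("1"))   # 4 distinct users active in the week
```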

Naix Storage

Sharding

How do we store massive amounts of data in a distributed system? Usually, conventional distributed Bitmaps depend on something like HBase, or split distributed storage by business; finding and copying data during computation then becomes a huge bottleneck. After various attempts, we finally took a sharding approach: all Bitmaps are split into shards of a fixed width; shards with the same index and the same replica serial number are stored on the same node, while different shards may be stored on the same or different nodes.

Figure 7

The sharding design offers a number of benefits:

• It solves the problem of distributed storage for hundreds of terabytes of data.
• Parallel computing: the Bitmap structure is special in that basic Bitmap operations can all be performed shard by shard in parallel and then aggregated. For huge Bitmap data, speed can be increased this way as well.
• Data copying: typically, most Bitmap practices separate data by business before sharding, but when data volume is large, a single business's data cannot fit on one node, so cross-business computation inevitably requires data copying. Sharding naturally distributes these computations to the nodes holding the shards, avoiding data copies.
• Serialization and deserialization: these problems usually occur with large Bitmaps, but after sharding all Bitmaps are basically of manageable size, so serialization and deserialization are no longer an issue.
• Crossing the int barrier: Bitmap implementations usually support only the int range, and with Meitu's growth its user count will soon exceed it. With the sharding approach, by offsetting IDs within each shard, shards can easily be stacked horizontally to support the full long range.
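The fixed-width addressing behind the last two points can be sketched as a simple division: a global (possibly larger-than-int) ID becomes a (shard index, offset) pair, and each per-shard Bitmap only ever sees small offsets. The shard width below is a made-up constant:

```python
# Sketch of fixed-width sharding: a global ID maps to (shard index, offset
# within shard), so each per-shard Bitmap stays small while the combined ID
# space can cover the full long range.

SHARD_WIDTH = 1 << 20   # hypothetical fixed shard width

def locate(global_id: int):
    """Global ID -> (shard index, bit offset inside that shard)."""
    return global_id // SHARD_WIDTH, global_id % SHARD_WIDTH

def global_id(shard: int, offset: int) -> int:
    """Inverse mapping: reassemble the global ID from shard + offset."""
    return shard * SHARD_WIDTH + offset

uid = 5_000_000_000   # beyond the 32-bit int range
shard, offset = locate(uid)
print(global_id(shard, offset) == uid)   # round-trips -> True
print(offset < SHARD_WIDTH)              # offset always fits a shard -> True
```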

replication

Replication is an extremely important feature of distributed systems, preventing data loss due to machine downtime, disk corruption, etc. In Naix, replication is supported at the index group level.

Figure 8

As shown in Figure 8, dark blue marks the primary shards, while light blue and blue-green mark the remaining replica shards. The figure shows two index groups configured with different replication counts, each containing two indexes. We can see that shards with the same replica number and the same subscript are stored on the same data node, while replicas of the same shard are always stored on different nodes.

Space and file fragmentation related optimizations

Optimizing space and file fragmentation is one of the most-attempted parts of Bitmap practice. Bitmap implementations based on Long arrays or other numeric arrays tend to be too dense or too sparse, leaving a lot of room for optimization. Most Bitmap compression algorithms are variants of run-length-style compression, saving space and reducing computation. In Meitu's Bitmaps, ewah (an RLE-like scheme) was used early on, and we subsequently switched to RoaringBitmap, currently the common Bitmap compression method used in Spark, Hive, Kylin, Druid, and other frameworks. A comparison of ewah and RoaringBitmap, tested in our real business scenario, shows a 67.3% space saving and a 58% saving in data generation time. Query times in real scenarios improved slightly, but not as much as the space and generation-time gains.

Figure 9

Earlier, Bitmaps were stored as files, and we applied mmap-like optimizations when reading them. However, daily business involves many Bitmaps with fine-grained dimension splits (e.g. few new users in a small channel), and sharding splits these small Bitmaps even smaller. Small files have very low block and inode utilization in the operating system, so we tried to solve the small-file fragmentation problem by optimizing the storage scheme. We mainly investigated the following options:

• Redis: Redis itself supports bitset operations, but its implementation falls short of expectations. Assuming a simple Bitmap kv store with 200T of data, 256G per machine, and a single backup copy, roughly 160 servers would be required, with costs growing as the business grows.
• HBase: HBase is used in many Meitu big data scenarios. But if HBase were used to store Bitmap data, there would be little room to optimize read and write performance, and the dependence on HBase would be too heavy to achieve the desired results.
• RocksDB: RocksDB is in common use in the industry. With compression enabled, RocksDB's CPU and disk usage is unstable due to compression; with compression disabled, its performance degrades severely as data volume rises, and using multiple DBs brings no improvement.
• PalDB: PalDB is a read-only kv store developed by LinkedIn; in official tests it performs about 8 times better than RocksDB and LevelDB. Once the data volume reaches a certain level, PalDB's performance is even better than Java's built-in HashSet and HashMap. PalDB's design has both advantages and disadvantages, and it limits the usage scenarios, but it largely satisfies Naix's needs. The PalDB source code is small, and we made simple adjustments for our specific usage. In tests, PalDB ultimately saved 13% of storage space and improved query time by more than 60% in real concurrent scenarios. The generator takes a little longer, but the effect is negligible thanks to the mr generation method.

Naix query

In the Naix system, all interaction is implemented on top of our self-developed rpc service (based on Netty, using the TCP protocol). ProtoBuf is used both in the underlying rpc layer and in the Naix server-side protocol.

Figure 10

Distributed computing involves nodes, shards, data storage, etc. For query scenarios, how do we find the relevant data, and how are the calculations performed?

Figure 11

In Naix, when a client initiates a query request to a transport node, the transport node selects the best nodes to distribute the request to based on the query conditions. Each node then determines the relevant shards according to the request conditions, finds them, and computes on them; the node results are aggregated and returned to the client, similar to a fork-join nested within a fork-join.

Naix supports queries in a generic way: it supports combined expressions with the operators ∩ ∪ - ( ); users assemble query expressions by selecting the desired query tuples and operators for their scenario, and of course we also provide a tool that encapsulates query expression assembly and transformation. Since the queries Naix supports are inherently business-independent, users can perform all kinds of query operations by assembling expressions.
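A minimal evaluator for this kind of set-expression query, with ∩ as AND, ∪ as OR, and difference as ANDNOT. The nested-tuple expression encoding is an assumption for illustration, not Naix's real wire format:

```python
# Minimal evaluator for combined set expressions over Bitmaps (ints):
# ("AND", a, b) -> a ∩ b, ("OR", a, b) -> a ∪ b, ("ANDNOT", a, b) -> a - b.

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def evaluate(expr):
    if isinstance(expr, int):        # leaf: an already-built Bitmap
        return expr
    op, left, right = expr
    l, r = evaluate(left), evaluate(right)
    if op == "AND":
        return l & r
    if op == "OR":
        return l | r
    if op == "ANDNOT":
        return l & ~r
    raise ValueError(f"unknown operator: {op}")

new    = to_bitmap([1, 2, 3, 4])
active = to_bitmap([2, 4, 6])
vip    = to_bitmap([4, 6, 8])

# (new ∩ active) - vip
result = evaluate(("ANDNOT", ("AND", new, active), vip))
print(bin(result).count("1"))   # only user 2 remains -> 1
```

Because expressions nest, any combination the operators allow can be evaluated by the same recursive walk, which is what makes the query layer business-independent.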

Figure 12

To give a few simple examples:

• The simplest device or user lookup, such as querying whether a particular user was a new or active user on a particular day.
• A multi-dimensional combined analysis, such as the next-day retention of users newly added on the vivo channel of Meipai on a given day.
• Multi-dimensional cross-analysis: for example, counting the active users of Meipai's v6.0 and v8.0 versions in the Baidu and vivo channels on a certain day involves 4 combined queries crossing two channels with two versions. This kind of operation is common in data analysis. Including the previous two kinds, the average response time of these simple query operations is only a few milliseconds.
• Full cross-counting of multiple dimensions: for example, crossing all channels and all versions of Meipai on a given day and outputting a very large set of results. The performance of such operations depends on the number of dimensions queried and the amount of data involved, with responses usually ranging from seconds to minutes.
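The channel x version cross-count from the examples above can be sketched as a product of dimension values, one intersection per combination (the channel and version Bitmaps below are made up for illustration):

```python
# Sketch of a dimensional cross-count: every (channel, version) combination
# gets its own intersection count; 2 channels x 2 versions -> 4 queries,
# matching the example in the text.

from itertools import product

def to_bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

channels = {"baidu": to_bitmap([1, 2, 3]), "vivo": to_bitmap([2, 4])}
versions = {"v6.0": to_bitmap([1, 2]),     "v8.0": to_bitmap([3, 4])}

cross = {
    (c, v): bin(channels[c] & versions[v]).count("1")
    for c, v in product(channels, versions)
}
print(len(cross))                  # 4 combinations
print(cross[("baidu", "v6.0")])    # users 1 and 2 -> 2
print(cross[("vivo", "v8.0")])     # user 4 -> 1
```

Full cross-counting is the same loop over every dimension value, which is why its cost grows with the number of dimensions and values involved.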

/ Future outlook /

To extend the Naix system to more of the company's businesses and even to external scenarios, we are still refining and optimizing it, and are currently working on the following:

• In the early days we concentrated more on system development to make sure we could meet demand. We are now also enriching ops tools to make Naix easier to maintain and manage.
• We are experimenting with various computational optimizations to bring query performance to the next level.
• SQL queries are also in our plans; SQL is a more widely accepted approach, and we hope it can reduce the learning cost for different kinds of users compared with the generic query expressions.

