Each index has a fixed time attribute. Because Bitmap data may cover different time periods, data from different periods is placed in the same index by formatting the time. Within a given time period, an index involves several dimensions, such as version, channel, and region. Each dimension has multiple dimension values (e.g. v1.0.0, v1.2.0), and a Bitmap file corresponds to one specific dimension value.
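To make the hierarchy concrete, here is a minimal sketch of how a Bitmap could be addressed by index, formatted time, dimension, and dimension value. The names bitmap_key and store, the index name "active_users", and the yyyymmdd time format are assumptions for illustration, not Naix's actual layout.

```python
from datetime import date


def bitmap_key(index, day, dimension, value):
    """Build a lookup key for one Bitmap (hypothetical layout).

    The time is formatted to a fixed pattern so that Bitmaps from
    different periods can live side by side in the same index.
    """
    return (index, day.strftime("%Y%m%d"), dimension, value)


# A plain dict stands in for the Bitmap storage; the Bitmap itself is
# held in a Python int where bit i set means "ID i is present".
store = {}
key = bitmap_key("active_users", date(2024, 1, 1), "version", "v1.0.0")
store[key] = 0b101  # IDs 0 and 2 used version v1.0.0 on that day
```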
Data information dictionary management
Bitmaps identify the state of a user or element by ID, but real business data often does not come in that form. To compute statistics over IMEI or IDFA values, the device identifier must first be converted to an ID through the data dictionary mapping; only then can the Bitmap be generated and the statistics computed. In addition, to make the data easier to maintain and use, we also manage dictionary mappings for dimensions and dimension values.
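The mapping step can be sketched as follows. This is a simplified in-memory illustration, assuming a dictionary that hands out dense integer IDs in order of first appearance; IdDictionary, build_bitmap, and the sample identifiers are hypothetical names, not part of Naix.

```python
class IdDictionary:
    """Maps arbitrary device identifiers (IMEI, IDFA, ...) to dense integer IDs."""

    def __init__(self):
        self._ids = {}

    def get_id(self, identifier):
        # Assign the next free integer the first time an identifier is seen.
        if identifier not in self._ids:
            self._ids[identifier] = len(self._ids)
        return self._ids[identifier]


def build_bitmap(identifiers, dictionary):
    """Return a Bitmap (held in a Python int) with one bit set per distinct device."""
    bitmap = 0
    for identifier in identifiers:
        bitmap |= 1 << dictionary.get_id(identifier)
    return bitmap


def cardinality(bitmap):
    """Count the set bits, i.e. the number of distinct devices."""
    return bin(bitmap).count("1")


dictionary = IdDictionary()
active = build_bitmap(["imei-a", "idfa-b", "imei-a"], dictionary)
distinct_devices = cardinality(active)
```

Because the repeated identifier maps to the same bit, duplicates are removed for free when counting.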
Raw data for Bitmaps usually means data such as MySQL records or HDFS text files. The role of the Naix generator is to transform this raw data into Bitmap data and synchronize it to the Naix system. The generator supports Bitmap generation for various scenarios in the form of plug-ins, and business parties develop their own business logic on top of these plug-ins.
The simple plugin is the simplest approach and the first plugin we used. At Meitu, most data is raw HDFS data: the relevant data is filtered through the Hive Client onto the processing server and then converted to Bitmap data by the plugin.
Because of the large data volume and the complexity of the business, data generation at Meitu at one point took nearly 3 hours per day. If a problem occurred partway through and the job had to be re-run, it would inevitably delay other statistical jobs, with serious consequences. That is why we developed the mapreduce plugin, hoping to speed up data generation by taking advantage of distributed processing.
In practice, the mapreduce plugin compressed a generation process of nearly 3 hours to about 8 minutes (on a 4-node test cluster). Given the characteristics of mapreduce, we can also easily keep generation fast as the business and data volume grow, by scaling out nodes or adjusting the number of map and reduce tasks.
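The general shape of Bitmap generation as a map and reduce job can be sketched in a single process as below. This only illustrates the idea and is not the plugin's real code: the record layout (ID, version) and the function names are assumptions, and a real job would run the map and reduce phases in parallel across cluster nodes.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: emit (dimension value, ID) for one raw record."""
    user_id, version = record
    yield version, user_id


def reduce_ids(user_ids):
    """Reduce phase: fold all IDs for one dimension value into a Bitmap."""
    bitmap = 0
    for user_id in user_ids:
        bitmap |= 1 << user_id
    return bitmap


def run_job(records):
    """Group map output by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for key, user_id in map_record(record):
            grouped[key].append(user_id)
    return {key: reduce_ids(ids) for key, ids in grouped.items()}


result = run_job([(0, "v1.0.0"), (2, "v1.0.0"), (1, "v1.2.0")])
```

Since each dimension value is reduced independently, adding nodes or reduce tasks spreads the work without changing the result.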
The third plugin is the bitmap to bitmap plugin. For Bitmap data covering various time periods, users can configure this plugin to periodically generate Bitmaps from existing Bitmaps in the system. For Bitmaps such as weekly, monthly, and yearly ones, the plugin generates periodic Bitmaps from native Bitmaps (e.g. weekly from daily, monthly from weekly). The user simply submits a generation schedule, and the resulting Bitmap data is then generated automatically in the system at regular intervals.
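Deriving a longer-period Bitmap from shorter-period ones can be sketched as a bitwise OR over the source Bitmaps, assuming the periodic Bitmap means "present in at least one of the source periods". The merge_bitmaps helper and the sample dates and values are made up for illustration.

```python
from functools import reduce
from operator import or_


def merge_bitmaps(bitmaps):
    """OR together Bitmaps (held in Python ints) into one periodic Bitmap."""
    return reduce(or_, bitmaps, 0)


# Hypothetical daily Bitmaps: bit i set means ID i was active that day.
daily = {
    "2024-01-01": 0b0011,
    "2024-01-02": 0b0110,
    "2024-01-03": 0b1000,
}

# An ID appears in the derived Bitmap if it appears on any source day.
weekly = merge_bitmaps(daily.values())
weekly_count = bin(weekly).count("1")
```

The same function applies one level up, for example when building a monthly Bitmap from weekly ones.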
How can massive amounts of data be stored in a distributed system? Conventional distributed Bitmap systems usually rely on storage such as HBase or distribute storage by business partition, and locating and copying data during computation becomes a huge bottleneck. After various attempts, we settled on sharding: every Bitmap is split into shards of a fixed width. Data with the same shard and the same replica number is stored on the same node, while data from different shards may be stored on the same node or on different nodes.
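Fixed-width sharding and node placement can be sketched as follows. The shard width, the node names, and the placement rule in node_for are all assumptions chosen for illustration; the text only states that the same shard and replica number always land on the same node, which any deterministic function of (shard, replica) satisfies.

```python
SHARD_WIDTH = 1 << 20  # assumed width: each shard covers 1,048,576 IDs


def shard_of(user_id):
    """Index of the shard that holds this ID."""
    return user_id // SHARD_WIDTH


def offset_in_shard(user_id):
    """Bit position of this ID inside its shard."""
    return user_id % SHARD_WIDTH


def node_for(shard, replica, nodes):
    """Pick a node deterministically from (shard, replica).

    Hypothetical rule: the same shard and replica number always map to
    the same node, while different shards may share a node or not.
    """
    return nodes[(shard + replica) % len(nodes)]


nodes = ["node-1", "node-2", "node-3"]
user_id = SHARD_WIDTH + 5
placement = node_for(shard_of(user_id), 0, nodes)
```

Because every Bitmap uses the same shard boundaries, an operation such as intersecting two Bitmaps can run shard by shard on the node that already holds both shards, which avoids copying data between nodes during computation.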