Anatomy of a Big Data Platform for Data Analytics


Neither capturing data nor storing it is the ultimate goal of a big data platform. Without data processing, even a gold mine of data is no more than a pile of scrap metal. Data processing is the core path of the big data industry; add the last mile, data visualization, and the whole chain is connected end to end.

Classification of data processing

As shown in the figure below, we can categorize data processing from three different perspectives: business, technology, and programming model.

The business classification depends on the specific business scenario, but it ultimately governs the choice of technology, especially for data storage. For full-text search in query retrieval, for example, ElasticSearch is usually a good choice. Statistical analysis, by contrast, often operates on a single column of data: summing sales means reading the entire sales column. In that case a columnar storage structure may be more appropriate.
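To illustrate why a column-wise aggregate favors columnar storage, here is a minimal sketch in plain Python. The records and field names (`order_id`, `region`, `sales`) are assumptions made up for the example, and the in-memory lists only imitate the two layouts; a real columnar format such as Parquet also adds encoding and compression.

```python
# Hypothetical sales records; the field names are assumptions for illustration.
rows = [
    {"order_id": 1, "region": "north", "sales": 120.0},
    {"order_id": 2, "region": "south", "sales": 80.5},
    {"order_id": 3, "region": "north", "sales": 42.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_layout(records, field):
    # Row layout: every record has to be visited to pick out one field.
    return sum(record[field] for record in records)


def sum_column_layout(columns, field):
    # Column layout: only the one column that is needed gets read.
    return sum(columns[field])


columns = to_columns(rows)
total_from_rows = sum_row_layout(rows, "sales")
total_from_columns = sum_column_layout(columns, "sales")
```

Both functions return the same total; the difference is how much data has to be touched to get it, which is what matters once the table no longer fits in memory.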

In the technical classification, SQL cannot strictly be treated as a separate category. It is better seen as a wrapper around an API: a DSL such as SQL wraps a specific processing technique, which lowers the cost of migrating existing data processing scripts. After all, before the Big Data era most in-house enterprise data processing systems accessed stored data through SQL. By and large, SQL here is a wrapper over an underlying execution engine, as in Hive (originally over MapReduce), Impala, or Spark SQL.
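The idea of SQL as a wrapper can be sketched by computing the same aggregate twice: once with an explicit fold in the style of a map/reduce API, and once with a SQL statement. This is only an illustration. The standard-library `sqlite3` module stands in for a SQL-on-big-data engine, and the `sales` table and its columns are assumptions, not something taken from Hive, Impala, or Spark SQL.

```python
import sqlite3
from functools import reduce

# Hypothetical (region, amount) records.
sales = [("north", 120.0), ("south", 80.5), ("north", 42.0)]


def total_by_region_api(records):
    """Aggregate with an explicit fold, as one would with a low-level API."""
    def fold(totals, pair):
        region, amount = pair
        totals[region] = totals.get(region, 0.0) + amount
        return totals
    return reduce(fold, records, {})


def total_by_region_sql(records):
    """Express the same aggregation declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )
    result = dict(cursor)  # each row is a (region, total) pair
    conn.close()
    return result


api_totals = total_by_region_api(sales)
sql_totals = total_by_region_sql(sales)
```

The SQL version hides how the grouping is executed, which is why existing SQL scripts can be carried over with relatively little rewriting.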

Stream processing receives a continuous flow of data from upstream in real time and processes it within small time windows. The upstream data can be a stream of bytes passed over the network, a stream of data read from HDFS, or a stream of messages from a message queue. It usually corresponds to the real-time programming model.
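As a rough sketch of processing within small time windows, the generator below groups an incoming stream into fixed (tumbling) windows and emits per-key counts as each window closes. It assumes events arrive in timestamp order as `(timestamp_seconds, key)` pairs; the event shape and the function name are hypothetical, and real engines also have to deal with late and out-of-order data.

```python
def tumbling_windows(stream, window_seconds):
    """Yield (window_start, counts) each time a fixed-size window closes.

    Assumes events arrive ordered by timestamp.
    """
    current_start = None
    counts = {}
    for timestamp, key in stream:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The event belongs to a later window, so emit the finished one.
            yield current_start, counts
            counts = {}
        current_start = start
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_start, counts


# Hypothetical events: (timestamp in seconds, key).
events = [(0, "a"), (3, "b"), (4, "a"), (5, "a"), (9, "b"), (12, "a")]
windows = list(tumbling_windows(iter(events), window_seconds=5))
```

Because it is a generator, results become available window by window instead of only after the whole input has been read.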

Machine learning and deep learning both fall under the category of deep analytics. With Google's AlphaGo and the open-sourcing of the TensorFlow framework, deep learning has become a prominent field. I am not well versed in it, so I will not go into detail here.

Machine learning differs somewhat from ordinary data analysis: it usually takes several stages, repeated over multiple iterations, to reach a satisfactory result. The following diagram shows the architecture of deep analysis.

Starting from the stored data, samples are collected and features are extracted; the sample data is then used for training, which produces a model. If testing shows that the model meets the requirements, it can be applied to the data analysis scenario. Otherwise the algorithm and model are adjusted and the next iteration begins.
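The train, test, adjust loop can be sketched with a deliberately tiny model. Everything here is an assumption made for illustration: the model is a one-parameter linear fit trained by gradient descent, the "adjustment" is simply doubling the number of training epochs, and the function names are hypothetical.

```python
def train(samples, epochs, learning_rate=0.01):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        gradient = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * gradient
    return w


def mean_squared_error(w, samples):
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


def fit_until_good_enough(train_set, test_set, target_mse, max_rounds=10):
    """Train, test, and adjust until the model meets the requirement."""
    epochs = 1
    for round_number in range(1, max_rounds + 1):
        w = train(train_set, epochs)
        if mean_squared_error(w, test_set) <= target_mse:
            return w, round_number
        epochs *= 2  # adjust the training setup and iterate again
    raise RuntimeError("no acceptable model within max_rounds")


# Hypothetical samples that follow y = 3x, split into training and test sets.
train_set = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
test_set = [(5.0, 15.0), (6.0, 18.0)]
weight, rounds = fit_until_good_enough(train_set, test_set, target_mse=0.05)
```

The first rounds fail the test on held-out data, so the loop adjusts and retrains, which mirrors the iteration described above.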

The offline programming model is represented by Hadoop's MapReduce, the in-memory programming model by Spark, and the real-time programming model mainly refers to stream processing. A Lambda architecture may also be used: it places a Serving Layer between the Batch Layer (the offline programming model) and the Speed Layer (the real-time programming model). The data to be handled by the offline model is pre-computed (aggregated) either during idle time with spare resources or as the data is written, and the result is a merged view stored in a database (e.g., HBase) for fast query or computation.
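A minimal sketch of how the three Lambda layers relate, using page-view counting as a made-up workload. The in-memory `Counter` objects stand in for the stored views (in practice a database such as HBase), and all names here are assumptions for illustration.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch Layer: pre-compute an aggregate over all historical events."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed Layer: keep an incremental view of events not yet batched."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def serve(batch_view, realtime_view, page):
    """Serving Layer: answer a query by merging the two views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


# Hypothetical events: history already covered by the last batch run.
history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = build_batch_view(history)

# Events that arrived after the last batch run.
speed = SpeedLayer()
speed.on_event({"page": "/home"})
speed.on_event({"page": "/docs"})

home_views = serve(batch_view, speed.realtime_view, "/home")
docs_views = serve(batch_view, speed.realtime_view, "/docs")
```

Queries stay cheap because the expensive aggregation was done ahead of time; the speed layer only covers the gap since the last batch run.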

Scenario-driven data processing

Different business scenarios call for different data processing techniques, and scenarios often occur together, so a single big data system may need a mixture of techniques (programming models).

Scenario 1: Public opinion analysis for a vendor

When we implemented public opinion analysis for a vendor, the data processing requirements from the client covered semantic analysis, full-text search, and statistical analysis. Data gathered by the web crawler is written to Kafka. On the consumer side, Spark Streaming de-duplicates and de-noises the data and then hands it to SAS's ECC server for semantic analysis of the text. The analyzed data is written to both HDFS (as Parquet files) and ElasticSearch. In addition, to guard against the de-noising algorithm mistakenly discarding useful data, a full copy of the data is stored in MongoDB. The figure below shows the flow.
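The de-duplication and de-noising step, together with the full raw copy kept as a safety net, might look roughly like the sketch below. It is plain Python with in-memory lists standing in for Kafka, the downstream stores, and MongoDB; the noise rule (fewer than three words) and all function names are assumptions, not the rules used in the actual project.

```python
import hashlib


def fingerprint(text):
    """Hash a whitespace- and case-normalized form of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_noise(text, min_words=3):
    # Hypothetical rule: very short messages are treated as noise.
    return len(text.split()) < min_words


def process(messages):
    raw_archive = []  # stands in for the full copy kept in MongoDB
    cleaned = []      # would continue to semantic analysis and storage
    seen = set()
    for text in messages:
        raw_archive.append(text)  # always keep the original message
        key = fingerprint(text)
        if key in seen or is_noise(text):
            continue
        seen.add(key)
        cleaned.append(text)
    return raw_archive, cleaned


# Hypothetical crawled messages.
messages = [
    "Great phone, love it",
    "great  phone, LOVE it",
    "ad",
    "Battery drains too fast",
]
raw_archive, cleaned = process(messages)
```

Keeping `raw_archive` untouched means a message wrongly dropped by the noise rule can still be recovered and reprocessed later.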

Scenario 2: Airbnb's Big Data Platform

Airbnb's Big Data platform also offers a variety of processing options based on business scenarios, and the architecture of the entire platform is shown in the following diagram.

Panoramix (since renamed Caravel) provides data exploration and result visualization for Airbnb, and Airpal is a web-based query execution tool. Both run their underlying queries against data in HDFS through Presto. The Spark cluster provides Airbnb's engineers and data scientists with a platform for machine learning and stream processing.

The overall structure of the Big Data platform

This brings the series on big data platforms nearly to a close. To finish, here is an overall structure diagram that combines the four components: data sources, data collection, data storage, and data processing.

The diagram is organized around four core scenarios (query retrieval, OLAP, statistical analysis, and deep analysis) and uses different colors to mark the different programming models. Read from left to right, it passes through four relatively complete phases (data source, data collection, data storage, and data processing) and can serve as an overall reference for a big data platform.

