The classification of business perspectives is tied to specific business scenarios, and it ultimately governs the choice of technology, especially for data storage. For example, for full-text search in query retrieval, ElasticSearch is the best choice; for statistical analysis, where an operation often targets an entire column of data (for instance, summing sales touches the whole sales column), a columnar storage structure may be more appropriate.
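To make the columnar point concrete, here is a minimal pure-Python sketch (with made-up sales records) contrasting a row layout with a column layout for summing the sales column. It only illustrates the access pattern; a real column store such as Parquet adds compression and encoding on top of this idea.

```python
# Hypothetical sales records. A row store keeps each whole record together,
# while a column store keeps each field contiguous.
rows = [
    {"order_id": 1, "region": "north", "sales": 120.0},
    {"order_id": 2, "region": "south", "sales": 80.5},
    {"order_id": 3, "region": "north", "sales": 99.5},
]

# Row-oriented: summing sales must touch every full record.
row_total = sum(r["sales"] for r in rows)

# Column-oriented: the sales column is one contiguous array, so the
# aggregation scans only the data it actually needs.
columns = {
    "order_id": [1, 2, 3],
    "region": ["north", "south", "north"],
    "sales": [120.0, 80.5, 99.5],
}
col_total = sum(columns["sales"])

print(row_total, col_total)  # 300.0 300.0
```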
In the classification from a technical point of view, the SQL approach cannot strictly be treated as a separate category; it is really a wrapper around the API, exposing specific processing techniques through a DSL such as SQL and thereby reducing the migration cost of data processing scripts. After all, before the Big Data era, most enterprise data processing systems accessed stored data through SQL. By and large, SQL here is a wrapper over MapReduce-style engines, as in Hive, Impala, or Spark SQL.
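As an illustration of how such a wrapper works, the following hedged pure-Python sketch expresses a query like `SELECT region, SUM(sales) FROM orders GROUP BY region` as the map, shuffle, and reduce phases an engine such as Hive would generate (the table and data are invented for the example):

```python
from collections import defaultdict

# Conceptually: SELECT region, SUM(sales) FROM orders GROUP BY region
orders = [("north", 120.0), ("south", 80.5), ("north", 99.5)]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(region, sales) for region, sales in orders]

# Shuffle phase: group all values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each group.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'north': 219.5, 'south': 80.5}
```

The point is that the SQL text never executes "as SQL": the engine compiles it into these lower-level processing steps, which is why migrating existing SQL scripts onto a big data stack is comparatively cheap.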
Stream processing receives a steady stream of data from upstream in real time and then processes it within some small time window. The upstream data consumed can be a stream of bytes passed over the network, a stream of data read from HDFS, or a stream of messages coming from a message queue. It usually corresponds to the real-time programming model among the programming models.
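The small-time-window idea can be sketched in a few lines of pure Python. This is a simplified tumbling-window aggregation over timestamped events (the events and window size are made up for illustration); real engines like Spark Streaming add scheduling, fault tolerance, and distribution on top of the same notion:

```python
from collections import defaultdict

def tumbling_window_sum(events, window_seconds):
    """Group (timestamp, value) events into fixed-size, non-overlapping
    time windows and sum the values inside each window."""
    windows = defaultdict(float)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        windows[window_start] += value
    return dict(windows)

# Events arriving as a stream: (epoch_seconds, value).
events = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 1.0), (11, 3.0)]
print(tumbling_window_sum(events, 5))
# {0: 3.0, 5: 5.0, 10: 3.0}
```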
Machine learning and deep learning both fall under the category of deep analytics. With Google's AlphaGo and the open-sourcing of the TensorFlow framework, deep learning has become a discipline in its own right. I don't know much about it, so I won't embarrass myself by going into it here.
Machine learning differs slightly from common data analysis: it usually requires multiple stages and multiple iterations to reach a satisfactory result. The following diagram shows the architecture of deep analytics.
For the stored data, samples need to be collected and features extracted; the sample data is then used for training to obtain a data model. If the model passes testing against the requirements, it can be applied to the data analysis scenario; otherwise, the algorithm and model need to be adjusted for the next iteration.
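The collect/train/test/adjust loop described above can be sketched as follows. This is a deliberately toy example (a single hand-written feature and a threshold "model", with invented samples), meant only to show the iteration structure, not a real learning algorithm:

```python
def extract_features(sample):
    # Hypothetical feature extraction: just the length of the text.
    return [len(sample)]

def train(threshold):
    # Toy "model": classify by comparing the feature to a threshold.
    return lambda s: extract_features(s)[0] > threshold

def evaluate(model, samples, labels):
    correct = sum(model(s) == y for s, y in zip(samples, labels))
    return correct / len(samples)

samples = ["spam spam spam offer", "hi", "free money now!!", "ok"]
labels = [True, False, True, False]

# Iterate: adjust the model parameter until accuracy meets the
# requirement, mirroring the train/test/adjust cycle in the text.
required_accuracy = 1.0
for threshold in range(1, 20):
    model = train(threshold)
    if evaluate(model, samples, labels) >= required_accuracy:
        break
print(threshold)  # 2
```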
The offline programming model is represented by Hadoop's MapReduce, the in-memory programming model by Spark, and the real-time programming model mainly refers to stream processing. Of course, a Lambda architecture may be used, creating a Serving Layer between the Batch Layer (i.e., the offline programming model) and the Speed Layer (the real-time programming model). This layer takes advantage of idle time and spare resources, or pre-computes (aggregates) the big data to be handled by the offline programming model while the data is being written, thus forming a merged view stored in a database (e.g., HBase) for fast query or computation.
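The merged-view idea of the Serving Layer can be sketched in a few lines. In this hedged illustration (page names and counts are invented), a query answer combines a precomputed batch view with the incremental real-time view, the way a Serving Layer would over a store such as HBase:

```python
# Batch layer: aggregates precomputed over historical data, e.g. by a
# nightly MapReduce job.
batch_view = {"page_a": 1000, "page_b": 250}

# Speed layer: incremental counts from the stream since the last batch run.
realtime_view = {"page_a": 12, "page_c": 3}

def serving_layer_query(page):
    """Merge the batch view and the real-time view into a single,
    up-to-date answer at query time."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(serving_layer_query("page_a"))  # 1012
print(serving_layer_query("page_c"))  # 3
```

The design trade-off is that neither layer alone is sufficient: the batch view is complete but stale, the real-time view is fresh but partial, and only their merge gives a fast, current answer.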
Scenario-driven data processing
Different business scenarios (business scenarios may appear mixed) require different data processing techniques and thus a mixture of techniques (programming models) may be required under a big data system.
Scenario 1: Opinion analysis of a manufacturer
When we implemented opinion analysis for a vendor, the data processing parts included semantic analysis, full-text search, and statistical analysis, according to the client's needs. Data crawled by the web crawler is written to Kafka, while the consumer side de-duplicates and de-noises the data through Spark Streaming, after which it is handed over to SAS's ECC server for semantic analysis of the text. The analyzed data is written both to HDFS (as text in Parquet format) and to ElasticSearch. At the same time, to guard against errors in the de-noising algorithm "killing" useful data, a full copy of the data is also stored in MongoDB. This is shown in the figure below.
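The de-duplication step on the consumer side can be sketched as follows. This is a simplified stand-in, not the project's actual Spark Streaming code: it hashes each record's content and drops records whose hash has already been seen (the sample records are invented):

```python
import hashlib

seen = set()

def deduplicate(records):
    """Drop records whose content hash has been seen before; a simplified
    stand-in for the de-duplication done in the streaming consumer."""
    unique = []
    for text in records:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

batch = ["good review", "spam!!!", "good review"]
unique_batch = deduplicate(batch)
print(unique_batch)  # ['good review', 'spam!!!']
```

Note that in a distributed setting the `seen` set would itself have to live in shared state (or be approximated, e.g. with a Bloom filter), which is exactly the kind of detail the real pipeline has to handle.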
Scenario 2: Airbnb's Big Data Platform
Airbnb's Big Data platform also offers a variety of processing options based on business scenarios, and the architecture of the entire platform is shown in the following diagram.
Panoramix (since renamed Caravel) provides Airbnb with data exploration and result visualization, and Airpal is a web-based query execution tool; both execute their underlying data queries against HDFS via Presto. The Spark cluster, in turn, provides Airbnb's engineers and data scientists with a platform for machine learning and stream processing.
The overall structure of the Big Data platform
At this point, the entire series on big data platforms is almost over. Finally, I give an overall structure diagram combining the four components of data source, data collection, data storage, and data processing, as follows.