Hadoop Development History
With the release of the stable Hadoop 2.7.2 in January, Hadoop has grown from the traditional triad of HDFS, MapReduce, and HBase into a massive ecosystem of more than 60 related components, more than 25 of which ship with the major distributions, spanning data storage, execution engines, programming and data access frameworks, and more.
Hadoop evolved from the three-tier structure of 1.0 to its current four-tier architecture after 2.0 separated resource management from MapReduce into a general-purpose framework.
The Hadoop framework
Bottom layer - storage: the HDFS distributed file system
Middle layer - resource and data management: YARN, Sentry, etc.
Upper layer - compute engines such as MapReduce, Impala, and Spark
Top layer - high-level abstractions and tools built on those compute engines, such as Hive, Pig, and Mahout
Why do you need Spark when you already have Hadoop?
Spark has clear advantages over Hadoop's MapReduce computation, in the following respects.
(1) Why is it efficient?
1. Compared with Hadoop's MapReduce computation, Spark supports DAG execution plans and can cache intermediate data in memory, reducing the number of times data is spilled to disk (see the short sketch after this list).
2. Tasks are launched as threads inside long-lived executors rather than as separate processes, so task startup is more lightweight and faster. In theory this gives a 10-100x speedup (in my own work, computational efficiency has been at least 3x better than Hadoop).
3. A highly abstract API means 2-5 times less code than MapReduce, or even more, giving much higher development efficiency.
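The DAG plus in-memory caching point can be illustrated with a small sketch (paths and names here are illustrative, assuming an existing SparkContext sc): an intermediate RDD that several downstream computations reuse is cached once instead of being recomputed or re-read from disk each time.

val logs = sc.textFile("hdfs:///tmp/app.log")              // assumed input path
val errors = logs.filter(_.contains("ERROR")).cache()      // keep the intermediate result in memory
val byDay = errors.map(line => (line.take(10), 1)).reduceByKey(_ + _)
println(errors.count())    // first action materializes and caches `errors`
println(byDay.count())     // reuses the cached data, no second pass over the input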
(2) Why a unified multi-framework stack?
Past big-data platform architectures combined Hadoop + Hive + Mahout + Storm to cover batch processing, SQL queries, real-time processing and machine learning. The biggest problems were that the frameworks used different languages, integration was complex, and maintenance costs were high.
With Spark, batch processing on Spark Core is complemented by Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, all built on the same core to address SQL queries, real-time computation, machine learning and graph computation. This makes it easy to combine the strengths of the different components while keeping costs low.
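As a small illustration of this unification (a sketch, assuming Spark 2.x and an illustrative input path), the same dataset can be processed with the core RDD API and queried with Spark SQL inside one application, instead of wiring together separate systems:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unified-demo").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Batch processing with the core RDD API
val events = sc.textFile("hdfs:///tmp/events.log")                  // assumed path
val errorCount = events.filter(_.contains("ERROR")).count()

// SQL over the same data with Spark SQL, no separate system needed
val df = events.toDF("line")
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) AS errors FROM events WHERE line LIKE '%ERROR%'").show()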
Spark vs. Hadoop
Traditional Hadoop MapReduce has the fatal disadvantage of high latency and cannot handle highly time-sensitive data. Hadoop's computational model dictates that every job must be expressed as Map, Shuffle, and Reduce phases; each phase reads from and writes to disk and shuffles data over the network, so a certain amount of latency is unavoidable. With the advent of Spark, Hadoop has had neither the time nor the need to refactor this model itself. Of course, within the Hadoop technology stack Spark mainly replaces the Map/Reduce component; Hadoop's HDFS is still used in combination with Spark.
Features of Spark
Cost of Spark
Spark and Hadoop MapReduce are both open source, but the expense of machines and labor is still inevitable.
Hardware differences between Spark and Hadoop
A Spark cluster's memory should be at least as large as the chunks of data it processes, because Spark only delivers its best performance when the data fits in memory. For truly huge datasets that cannot fit, Hadoop is the right choice - after all, the cost of hard drives is much lower than the cost of memory.
By Spark's performance standards, needing less hardware while finishing the same job faster should make it more cost effective, especially in the cloud, where you pay only for what you use.
What Spark does for Hadoop
More precisely, Spark is a computational framework, whereas Hadoop contains both the MapReduce computational framework and the HDFS distributed file system; in a broader sense Hadoop also includes other systems in its ecosystem, such as HBase and Hive. Spark is a replacement for MapReduce: it is compatible with distributed storage layers such as HDFS and Hive and can be integrated into the Hadoop ecosystem to make up for MapReduce's shortcomings.
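A minimal sketch of that compatibility (the NameNode address and paths below are assumptions): a Spark job reads from and writes back to HDFS directly, with Spark simply taking MapReduce's place as the compute engine.

val logs = sc.textFile("hdfs://namenode:8020/data/access.log")        // read input from HDFS
val perHost = logs.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)
perHost.saveAsTextFile("hdfs://namenode:8020/output/per_host_counts")  // write results back to HDFS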
Differences between Spark and Hadoop in intermediate data processing
The Spark architecture uses the Master-Slave model common in distributed computing. The Master is the node in the cluster running the Master process, and a Slave is a node running a Worker process. The Master acts as the controller of the whole cluster and is responsible for its normal operation; a Worker is a compute node that receives commands from the Master and reports status; Executors are responsible for executing tasks; the Client submits applications on the user's behalf, and the Driver controls the execution of an application.
Spark scheduling module
Once the Spark cluster is deployed, the Master process and Worker processes must be started on the master and slave nodes respectively to control the entire cluster. The Driver and the Workers play two important roles in the execution of a Spark application. The Driver program is the starting point of the application logic and is responsible for job scheduling, i.e. the distribution of Tasks, while the Workers manage the compute nodes and create Executors to process tasks in parallel. During execution, the Driver serializes each Task together with the files and jars it depends on and passes them to the corresponding Worker machine, and the Executor processes the Task for its data partition.
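These roles can be seen in a minimal Driver program (a sketch with illustrative names, master URL and paths): main() is where the application logic starts, and the SparkContext it creates connects to the Master, which in turn has Workers launch Executors for this application.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://master-host:7077")    // Standalone Master URL (assumed host)
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs:///tmp/input") // assumed input path
    println(data.count())                       // an action triggers Task distribution to Executors
    sc.stop()
  }
}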
The following details the basic components of Spark's architecture.
1. ClusterManager: in Standalone mode this is the Master node, which controls the entire cluster and monitors the Workers; in YARN mode it is the ResourceManager.
2. Worker: a slave node, responsible for controlling a compute node and starting Executors or the Driver; in YARN mode this role is played by the NodeManager.
3. Driver: runs the main() function of the Application and creates the SparkContext.
4. Executor: the component that executes tasks on a Worker node, starting a thread pool to run them. Each Application has its own independent set of Executors.
5. SparkContext: the context of the entire application, controlling the application's life cycle.
6. RDD: Spark's basic computational unit; a group of RDDs forms a directed acyclic execution graph (RDD Graph).
7. DAGScheduler: builds a Stage-based DAG from each Job and submits the Stages to the TaskScheduler.
8. TaskScheduler: dispatches Tasks to Executors for execution.
9. SparkEnv: a thread-level context that stores references to important runtime components; the following components are created and held inside SparkEnv.
10. MapOutputTracker: responsible for storing Shuffle meta-information.
11. BroadcastManager: responsible for controlling broadcast variables and storing their meta-information (see the small sketch after this list).
12. BlockManager: responsible for storage management, creating and locating blocks.
13. MetricsSystem: monitors runtime performance metrics.
14. SparkConf: responsible for storing configuration information.
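The broadcast variable sketch mentioned at item 11 (with made-up lookup data): the BroadcastManager ships the read-only value to each Executor once, instead of sending a copy with every Task.

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))        // hypothetical lookup table
val keys = sc.parallelize(Seq("a", "b", "a", "c"))
val values = keys.map(k => lookup.value.getOrElse(k, 0))  // tasks read the broadcast copy locally
println(values.collect().mkString(","))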
The overall flow of Spark is: the Client submits an application, the Master finds a Worker to start the Driver, and the Driver requests resources from the Master or the ResourceManager. The application is then transformed into an RDD Graph, the DAGScheduler turns the RDD Graph into a directed acyclic graph of Stages and submits them to the TaskScheduler, and the TaskScheduler submits Tasks to Executors for execution. While tasks run, the other components cooperate to ensure the whole application executes smoothly.
Spark job hierarchy
An Application is the overall code submitted by the user (e.g. via spark-submit). The action operators in the code divide the Application into multiple Jobs; each Job is divided into Stages at wide dependencies; each Stage is divided into many functionally identical Tasks (their number is determined by the number of partitions, since one Task computes the data of one partition). These Tasks are then submitted to Executors for execution, and the results are returned to the Driver for aggregation or storage.
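A minimal sketch of this hierarchy (paths and partition count are illustrative): one Application, two actions and therefore two Jobs; reduceByKey introduces a wide dependency, so each Job is split into Stages at that boundary, and each Stage runs one Task per partition.

val lines = sc.textFile("hdfs:///tmp/input", 4)                          // 4 partitions => 4 Tasks per Stage
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)  // wide dependency => Stage boundary
counts.count()                                      // action #1 -> Job 1
counts.saveAsTextFile("hdfs:///tmp/wordcount-out")  // action #2 -> Job 2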
Example: counting word frequencies in a dataset
Here is an entry-level "Hello World" Spark program, much easier than the equivalent Map/Reduce code in Hadoop.
// Count word frequencies in a text file and sort by count in descending order
val rdd = sc.textFile("/home/scipio/README.md")
val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _)
// Swap (word, count) to (count, word), sort descending by count, then swap back
val wcsort = wordcount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
wcsort.saveAsTextFile("/home/scipio/sort.txt")
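For reference, the same descending sort can be written more directly with sortBy instead of the double map plus sortByKey (a sketch against the same wordcount RDD, with an assumed output path):

val wcsort2 = wordcount.sortBy(_._2, ascending = false)   // sort by count, descending
wcsort2.saveAsTextFile("/home/scipio/sort_by_count.txt")   // assumed output path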