Spark scheduling module
Once the Spark cluster is deployed, the Master process is started on the master node and Worker processes are started on the slave nodes to control the entire cluster. The Driver and the Worker are two important roles in the execution of a Spark application: the Driver program is the starting point of the application logic and is responsible for job scheduling, i.e., distributing Tasks, while multiple Workers manage the compute nodes and create Executors that process Tasks in parallel. During execution, the Driver serializes each Task, together with the files and jars it depends on, and ships them to the corresponding Worker machine, where an Executor runs the Task against its data partition.
The following details the basic components of Spark's architecture.
1. ClusterManager: in Standalone mode this is the Master (master node), which controls the entire cluster and monitors the Workers. In YARN mode this role is played by the ResourceManager.
2. Worker: the slave node, responsible for controlling the compute node and starting Executors or the Driver. In YARN mode this is the NodeManager, which controls the compute node.
3. Driver: runs the main() function of the Application and creates the SparkContext.
4. Executor: the component that executes Tasks on a Worker node; it starts a thread pool to run Tasks. Each Application has its own separate set of Executors.
5. SparkContext: the context of the entire application, controlling the application's life cycle (see the sketch after this list).
6. RDD: Spark's basic computational unit; a set of RDDs forms an executable directed acyclic graph, the RDD Graph.
7. DAGScheduler: builds a Stage-based DAG from each Job and submits the Stages to the TaskScheduler.
8. TaskScheduler: dispatches Tasks to Executors for execution.
9. SparkEnv: a thread-level context that stores references to the important runtime components listed below.
10. MapOutputTracker: responsible for storing Shuffle meta-information.
11. BroadcastManager: responsible for controlling broadcast variables and storing their meta-information.
12. BlockManager: responsible for storage management, creating and locating blocks.
13. MetricsSystem: monitors runtime performance metrics.
14. SparkConf: responsible for storing configuration information.
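To make these components concrete, here is a minimal sketch of a standalone application in which the Driver's main() builds a SparkConf and a SparkContext and then runs an RDD computation. The application name, master URL, and input path are assumptions chosen for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object LogFilterApp {
  def main(args: Array[String]): Unit = {
    // SparkConf stores configuration; SparkContext is the application context created in the Driver
    val conf = new SparkConf().setAppName("LogFilterApp").setMaster("local[2]") // master URL is an assumption
    val sc = new SparkContext(conf)

    // RDDs are the basic computational units; transformations build up the RDD Graph lazily
    val lines = sc.textFile("application.log") // input path is an assumption
    val errors = lines.filter(_.contains("ERROR"))

    // The action triggers a Job, which the DAGScheduler and TaskScheduler turn into Stages and Tasks
    println(s"ERROR lines: ${errors.count()}")

    sc.stop() // ends the application's life cycle
  }
}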
The overall flow of Spark is as follows: the Client submits the application, the Master finds a Worker and starts the Driver on it, and the Driver requests resources from the Master or the ResourceManager. The application is then transformed into an RDD Graph, the DAGScheduler converts the RDD Graph into a directed acyclic graph of Stages and submits it to the TaskScheduler, and the TaskScheduler submits Tasks to Executors for execution. While the Tasks run, the other components work in concert to ensure the whole application executes smoothly.
Spark job hierarchy
An Application is the overall code submitted by the user. Each action operator in the code delimits a Job, so the Application is split into multiple Jobs. Jobs are divided into Stages at wide dependencies, and each Stage consists of many functionally identical Tasks, one per partition (the data of one partition is computed by one Task). These Tasks are submitted to Executors for execution, and the results are returned to the Driver to be aggregated or stored. The sketch below illustrates this hierarchy.
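The following sketch (assuming an existing SparkContext named sc) contains two actions and therefore produces two Jobs; the reduceByKey introduces a wide dependency, which marks a Stage boundary, and the four partitions give four Tasks per Stage.

// Two actions below => two Jobs in this Application (sc is an existing SparkContext)
val nums  = sc.parallelize(1 to 100, 4)   // 4 partitions => 4 Tasks per Stage
val pairs = nums.map(n => (n % 10, n))    // narrow dependency: stays in the same Stage
val sums  = pairs.reduceByKey(_ + _)      // wide dependency (shuffle): marks a Stage boundary

sums.count()                              // action 1 => Job 1 (shuffle map Stage + result Stage)
sums.collect().foreach(println)           // action 2 => Job 2 (may reuse the earlier shuffle output)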
4.1. Example: counting word frequencies in a data set
Below is an entry-level "Hello World" Spark program that solves this task; it is much simpler than writing the equivalent Map/Reduce code in Hadoop.
// Count the frequency of each word in the file
val rdd = sc.textFile("/home/scipio/README.md")
// Split lines into words, pair each word with 1, and sum the counts per word
val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _)
// Swap (word, count) to (count, word), sort by count descending, then swap back
val wcsort = wordcount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
// saveAsTextFile is an action: it triggers the job and writes the result
wcsort.saveAsTextFile("/home/scipio/sort.txt")
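If the goal is only to inspect the most frequent words in the spark-shell, a shorter variant (a sketch, not part of the original example) sorts directly with sortBy and pulls a few results back to the Driver instead of writing a file; the count of 10 is an arbitrary choice.

// Take the ten most frequent words back to the Driver (10 is an arbitrary choice)
val top10 = wordcount.sortBy(_._2, ascending = false).take(10)
top10.foreach(println)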