Want to play around with machine learning in industry? Let's learn Spark first.


The main article is 5,145 words with 16 pictures; estimated reading time is 13 minutes.

Why do machine learning practitioners need to learn Spark?

There is a well-known saying about big data:

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

If, as a student, you have never even heard of Spark as a computing framework, then I think it is better to stay in academia and keep playing with machine learning there; industry may not be for you just yet.

In academia, the data is usually a public dataset that someone else has already processed, and we simply practice academic algorithms on it; in industry, no one is going to hand you business data that is ready to use...

It is well known that machine learning and statistical techniques are key to turning big data into actionable knowledge. Moreover, machine learning practitioners often say that the amount of data you can control determines the upper limit of what your model can ultimately achieve, and that continually optimizing the model is simply a way of getting ever closer to that upper limit.

Data and algorithms are bound together: one is the blood, the other is the heart.

In the information age, most top Internet companies have accumulated massive amounts of data, and the amount of data you can control is one of the key factors determining how close your model can get to the best possible result. For machine learning algorithm engineers in industry, understanding the academic principles of machine learning and optimization theory and implementing all kinds of small single-machine demos is not enough; to really solve actual business problems, you must be able to process and use massive amounts of business data, and Spark is precisely the tool that gives us the ability to control big data.

The amount of data you have is not the same as the amount of data you can control. Learning Spark empowers you to control big data!


Take a look at the demand for related positions...

Industry needs Spark

The recommendation teams at the two companies I have been in contact with both work on a Scala (Python) + Spark + Hadoop platform, which shows that Spark is one of the most important skills for machine learning in industry!

In short, if you want to work in a machine-learning-related job in the future, start learning Spark!

What is Spark?

Spark and Big Data

Spark is the next-generation distributed in-memory computing engine after Hadoop. Born in 2009 at UC Berkeley's AMPLab and now mainly maintained by Databricks, it is the most active, popular, and efficient general-purpose big data computing platform in the field today.

Official definition: Spark is a general-purpose big data processing engine, which can be roughly understood as a distributed framework for processing big data.

Spark delivers better performance (speed) and broader extensibility (a richer technology stack) than the traditional first-generation big data ecosystem built on Hadoop.

Features of Spark

Predecessor: Hadoop

When the project started in 2006, the word "Hadoop" stood for only two components: HDFS and MapReduce. Now, ten years on, it stands for the core (i.e., the Core Hadoop project) and the growing ecosystem associated with it. This is very similar to Linux, which also consists of a core and an ecosystem.

Hadoop Development History

Hadoop, which released stable version 2.7.2 in January, has grown from the traditional triad of the HDFS, MapReduce, and HBase communities into a massive ecosystem of more than 60 related components, more than 25 of which are included in the major distributions, covering data storage, execution engines, programming and data-access frameworks, and more.

Hadoop evolved from the three-layer structure of 1.0 to its current four-layer architecture after 2.0 split resource management out of MapReduce into a general-purpose framework.

Framework for Hadoop

Bottom layer - storage layer: the HDFS file system

Middle layer - resource and data management layer: YARN, Sentry, etc.

Upper layer - computing engines such as MapReduce, Impala, and Spark

Top layer - high-level wrappers and tools built on computing engines such as MapReduce and Spark, for example Hive, Pig, and Mahout

Why do you need Spark when you already have Hadoop?

Spark certainly has advantages over Hadoop's MapReduce computation, mainly in the following respects.

(1) Why is it efficient?

1. Compared with Hadoop's MapReduce, Spark supports DAG execution and can cache intermediate data in memory, reducing the number of times data is written to disk (see the sketch after this list).

2. Tasks are started in threads, so task startup is lighter and faster. The theoretical speedup in computation is 10-100x (in my own work I have verified that it is at least 3x faster than Hadoop).

3. The highly abstract API requires 2-5 times less code than MapReduce, or even less, so development efficiency is high.
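
As a minimal sketch of point 1, assuming an existing SparkContext `sc` (for example in spark-shell) and a hypothetical input path, caching an intermediate RDD keeps it in memory so that the two actions below do not both recompute it from disk:

// Assumes a running spark-shell, so `sc` (SparkContext) already exists; the path is hypothetical.
val lines = sc.textFile("/home/scipio/README.md")
val words = lines.flatMap(_.split(' ')).cache()   // cache the intermediate RDD in memory
val total = words.count()                         // first action: computes and caches `words`
val distinct = words.distinct().count()           // second action: reuses the cached data
println(s"total words: $total, distinct words: $distinct")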

(2) Why multi-framework integration?

Compared with past big data platform architectures, which used Hadoop + Hive + Mahout + Storm to cover batch processing, SQL queries, real-time processing, and machine learning, the biggest problems are that the frameworks use different languages, integrating them is complicated, and more maintenance effort is required.

With Spark, by contrast, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX are built on top of Spark Core's batch processing to cover SQL queries, real-time computing, machine learning, and graph computing, making it easy to combine the features of the different components while keeping costs low.
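
As a rough sketch of this integration, assuming Spark 2.x with a SparkSession called `spark` and a hypothetical events.json file, the same engine can serve a Spark SQL query and core RDD batch processing in one program:

import org.apache.spark.sql.SparkSession

object StackDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StackDemo").getOrCreate()
    // Spark SQL: load a (hypothetical) JSON file and query it with SQL
    val events = spark.read.json("/data/events.json")
    events.createOrReplaceTempView("events")
    val clicks = spark.sql("SELECT userId FROM events WHERE action = 'click'")
    // Core RDD batch processing on the same data and the same engine
    // (assumes userId is stored as a string in this hypothetical file)
    val clicksPerUser = clicks.rdd.map(row => (row.getString(0), 1L)).reduceByKey(_ + _)
    clicksPerUser.take(10).foreach(println)
    spark.stop()
  }
}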

Spark vs. Hadoop

This is because traditional Hadoop MapReduce has a fatal disadvantage: high latency, which makes it unable to handle highly time-sensitive data. Hadoop's computation model dictates that every workload must be expressed in terms of the core Map, Shuffle, and Reduce phases; because every computation reads and writes data on disk and the model requires network transfers throughout, the latency is unavoidable. With the advent of Spark, Hadoop has had neither the time nor the need to refactor itself. Of course, Hadoop is a whole technology stack: Spark mainly replaces its Map/Reduce computation, while Hadoop's HDFS is still used in combination with Spark.

Features of Spark

Cost of Spark

Spark and Hadoop MapReduce are both open source, but spending on machines and people is still unavoidable.

Hardware differences between Spark and Hadoop

A Spark cluster's memory should be at least as large as the chunks of data to be processed, because only when the chunk size and the memory match can Spark deliver its best performance. So if you really need to process extremely large data, Hadoop is a reasonable choice: after all, disks cost far less than memory.

Given Spark's performance, it should be more cost-effective for the same task, since it needs less hardware while running faster, especially in the cloud, where you pay only for what you use.

What Spark does for Hadoop

More precisely, Spark is a computing framework, whereas Hadoop contains both the MapReduce computing framework and the HDFS distributed file system; Hadoop in the broader sense also includes other systems in its ecosystem, such as HBase and Hive. Spark is a replacement for MapReduce that is compatible with distributed storage layers such as HDFS and Hive, so it can be integrated into the Hadoop ecosystem to make up for MapReduce's shortcomings.
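
As a minimal sketch of that compatibility (the namenode address and file path are hypothetical), Spark can read data directly from HDFS with the same API it uses for local files:

// `sc` is an existing SparkContext; the namenode address and path are hypothetical.
val logs = sc.textFile("hdfs://namenode:9000/user/data/access.log")
println(s"lines in HDFS file: ${logs.count()}")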

How Spark and Hadoop differ in handling intermediate data

The Spark architecture uses the Master-Slave model of distributed computing. The Master is the node in the cluster that runs the Master process, and a Slave is a node that runs a Worker process. The Master acts as the controller of the whole cluster and is responsible for its normal operation; a Worker is a compute node that receives commands from the Master and reports its status; Executors are responsible for executing tasks; the Client submits applications on behalf of the user; and the Driver controls the execution of an application.

Spark scheduling module

Once a Spark cluster is deployed, the Master process and the Worker processes must be started on the master and slave nodes, respectively, to control the whole cluster. The Driver and the Workers play the two key roles in executing a Spark application. The Driver program is the starting point of the application logic and is responsible for job scheduling, i.e., distributing Tasks, while multiple Workers manage the compute nodes and create Executors that process tasks in parallel. During execution, the Driver serializes each Task, together with the files and JARs it depends on, and sends them to the corresponding Worker machine, where an Executor processes the Task on the corresponding data partition.

The following details the basic components of Spark's architecture.

1. ClusterManager: in Standalone mode this is the Master (master node), which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.

2. Worker: a slave node, responsible for controlling a compute node and starting an Executor or Driver; in YARN mode this is the NodeManager, which controls the compute node.

3. Driver: runs the Application's main() function and creates the SparkContext.

4. Executor: the component that executes tasks on a Worker node; it starts a thread pool to run Tasks. Each Application has its own independent set of Executors.

5. SparkContext: the context of the whole application, which controls the application's life cycle.

6. RDD: Spark's basic computational unit; a group of RDDs forms a directed acyclic graph of execution, the RDD Graph.

7. DAGScheduler: builds a Stage-based DAG from a Job and submits the Stages to the TaskScheduler.

8. TaskScheduler: distributes Tasks to Executors for execution.

9. SparkEnv: a thread-level context that stores references to important runtime components. The following components are created and held inside SparkEnv.

10. MapOutputTracker: responsible for storing Shuffle meta-information.

11. BroadcastManager: responsible for controlling broadcast variables and storing their meta-information.

12. BlockManager: responsible for storage management, creating and looking up blocks.

13. MetricsSystem: monitors runtime performance metrics.

14. SparkConf: responsible for storing configuration information.

The overall Spark flow is: the Client submits an application; the Master finds a Worker on which to start the Driver; the Driver requests resources from the Master or the resource manager and then turns the application into an RDD Graph; the DAGScheduler transforms the RDD Graph into a directed acyclic graph of Stages and submits them to the TaskScheduler; and the TaskScheduler submits Tasks to Executors for execution. While tasks are running, the other components cooperate to ensure the whole application executes smoothly.
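
As a minimal sketch of the Driver side of this flow (the master URL, application name, and input path are assumptions for illustration, not taken from the article), an application builds a SparkConf, creates the SparkContext, defines lazy RDD transformations, and triggers scheduling with an action:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // The master URL and paths are hypothetical; in practice they usually come from spark-submit.
    val conf = new SparkConf().setAppName("MinimalApp").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)            // the Driver creates the SparkContext
    val lines = sc.textFile("/data/input.txt") // defines an RDD; nothing runs yet
    val lengths = lines.map(_.length)          // transformation: still lazy
    val total = lengths.reduce(_ + _)          // action: the DAGScheduler/TaskScheduler send Tasks to Executors
    println(s"total characters: $total")
    sc.stop()
  }
}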

Spark job hierarchy

An Application is the code the user submits with submit. The code contains action operations, and each action splits the Application into a Job; Jobs are divided into Stages at wide dependencies; each Stage is divided into many functionally identical Tasks (the number is determined by the number of partitions, with each Task computing the data of one partition); these Tasks are then submitted to Executors to be computed and executed, and the results are returned to the Driver to be aggregated or stored.
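
As a minimal sketch of this hierarchy (assuming an existing SparkContext `sc` and made-up data), each action below produces a Job, the shuffle introduced by reduceByKey marks a Stage boundary, and the number of partitions determines how many Tasks each Stage contains:

val nums = sc.parallelize(1 to 1000, 4)   // 4 partitions => 4 tasks per stage
val pairs = nums.map(n => (n % 10, n))    // narrow dependency: stays in the same stage
val sums = pairs.reduceByKey(_ + _)       // wide dependency (shuffle): new stage
sums.count()                              // action => Job 1
sums.collect().foreach(println)           // action => Job 2 (reuses the shuffle output)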

4.1 Example: counting word frequencies in a dataset

Here is the "Hello World" of entry-level Spark programs; solving it with Spark is much easier than writing Map/Reduce code in Hadoop...

// Count word frequencies
val rdd = sc.textFile("/home/scipio/README.md")                                        // read the text file into an RDD
val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _)               // split into words and count each one
val wcsort = wordcount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))  // sort by count, descending
wcsort.saveAsTextFile("/home/scipio/sort.txt")                                         // write the sorted result

Spark execution process

The figure above shows the Spark word-count example. Based on the stage-division principle described above, this job is split into 2 stages, covering the three steps of reading the data, computing, and storing the result.

Looking at the code alone, you simply cannot tell that parallel computation is happening behind the data. From the figure it can be seen that the data is distributed across different partitions (on different machines in the cluster) and flows through the flatMap, map, and reduceByKey operators in the different partitions of the RDDs. (These operators are the functions that compute on RDDs, as described above.) When I have time later, I will share my own summary of the operators commonly used in Spark and the related Scala functions (https://www.jianshu.com/p/addc95d9ebb9).

Recommended introductory material: the Chinese translation of the official Spark documentation: http://spark.apachecn.org/docs/cn/2.2.0/sql-programming-guide.html

