Building Hadoop in CentOS

Note: If you are building Hadoop for the first time, follow the software environment and steps in this article exactly; different versions may cause problems.

Software environment.

Virtual Machine: VMware Pro14

Linux: CentOS 6.4 (Download Address; the DVD version is sufficient)

JDK: OpenJDK1.8.0 (it is strongly recommended not to use Oracle's Linux version of the JDK)

Hadoop: 2.6.5 (Download Address)

Installing the virtual machine and the Linux system is not covered here; online tutorials explain it and it rarely causes problems. Do remember the user password you set during installation, because it is needed again later, as shown in the figure below.

Set user password.png

User selection

After the system is installed in the virtual machine, you will see the login screen, as shown in the figure below.

Enter system.png

Select Other, type root in the Username box and press Enter, then type the password you set when creating the user into the Password box. The root user is a superuser created automatically when CentOS is installed, and its password is the same as that of the normal user you created during installation.

Normally, using the root user on CentOS is not recommended: it has the highest privileges on the whole system, and unless you know Linux well, a mistaken operation can have serious consequences. However, building the Hadoop platform as a normal user means many commands need sudo to obtain root privileges, which is tedious, so this tutorial simply uses the root user.


Both cluster mode and single-node mode require SSH login (similar to remote login: you log in to a particular Linux host and run commands on it).

First make sure your CentOS system can access the internet. Check the network icon in the top right corner of the desktop: a red cross means it is not connected, and you can click it to select an available network. You can also open Firefox from the top left corner of the desktop and visit a web address to verify that the connection works. If you still cannot get online, check the virtual machine's network settings and choose NAT mode, or search online for a solution.

Check network status.png

Once you are sure the network connection works, open a terminal by right-clicking on the CentOS desktop and selecting Open in Terminal, as shown in the figure below.

Open terminal.png

CentOS normally has the SSH client and SSH server installed by default. You can check by opening a terminal and running the following command.

rpm -qa | grep ssh

If the result returned as shown below contains SSH client and SSH server, no further installation is required.

Check if SSH is installed.png
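The same check can be scripted. A minimal sketch, assuming the packages are named openssh-clients and openssh-server as in the yum commands below; the function name has_ssh_packages is hypothetical:

```shell
# Reads a package list (one package per line, as printed by "rpm -qa") on stdin
# and succeeds only if both the SSH client and the SSH server package appear.
has_ssh_packages() {
  pkgs="$(cat)"
  printf '%s\n' "$pkgs" | grep -q '^openssh-clients' &&
    printf '%s\n' "$pkgs" | grep -q '^openssh-server'
}
```

Typical use would be: rpm -qa | has_ssh_packages && echo "SSH client and server are installed"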

If you need to install them, use yum, the package manager. (The installation will prompt [y/N]; enter y.)

Note: Run the commands one line at a time; do not paste both commands at once.

To paste in the terminal, right-click and choose Paste, or use the shortcut [Shift + Insert].

yum install openssh-clients
yum install openssh-server

After SSH is installed, test that it works by running ssh localhost. On the first login SSH shows a yes/no prompt; type yes, then enter the root user's password when prompted, and you are logged in to the local machine, as shown in the figure below.

First login to SSH.png

However, this asks for a password on every login, so it is more convenient to configure passwordless SSH login.

First type exit to leave the SSH session you just opened and return to the original terminal window, then use ssh-keygen to generate a key and add it to the authorized keys.

 exit # Leave the ssh localhost session opened just now
 cd ~/.ssh/ # If this directory does not exist, run ssh localhost once first
 ssh-keygen -t rsa # Press Enter at each prompt
 cat ./id_rsa.pub >> ./authorized_keys # Add the public key to the authorized keys
 chmod 600 ./authorized_keys # Modify file permissions
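Running the cat line again appends the same key a second time. A minimal sketch of an append that is safe to repeat, assuming the default public key file ~/.ssh/id_rsa.pub produced by ssh-keygen -t rsa; the function name authorize_key is hypothetical:

```shell
# Append a public key to an authorized_keys file only if that exact line
# is not already present, then restrict the file permissions.
authorize_key() {
  pub_key="$1"
  auth_file="$2"
  [ -f "$pub_key" ] || { echo "public key not found: $pub_key" >&2; return 1; }
  touch "$auth_file"
  # -x: match whole lines, -F: treat the key as a literal string
  grep -qxF -- "$(cat "$pub_key")" "$auth_file" || cat "$pub_key" >> "$auth_file"
  chmod 600 "$auth_file"
}
```

Typical use would be: authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys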

Now ssh localhost logs you in without asking for a password, as shown in the figure below.

Login to SSH again.png

Installing the Java Environment

For the Java environment you can choose Oracle's JDK or OpenJDK (which can be regarded as the open source version of the JDK). Most Linux distributions now install OpenJDK by default; here we install OpenJDK 1.8.0.

Some CentOS 6.4 installations come with OpenJDK 1.7 by default. You can check with the commands below, which are the same as on Windows, and also check the value of the JAVA_HOME environment variable.

 java -version # View the version of java
 javac -version # View the version of the compile command Javac
 echo $JAVA_HOME # View the value of the $JAVA_HOME environment variable

If OpenJDK is not installed on the system, install it with the yum package manager. (The installation will prompt [y/N]; enter y.)

yum install java-1.8.0-openjdk  java-1.8.0-openjdk-devel  # Install OpenJDK 1.8.0

The command above installs OpenJDK to /usr/lib/jvm/java-1.8.0 by default; use this location when configuring JAVA_HOME below.

Next, configure the JAVA_HOME environment variable. For convenience, set it directly in ~/.bashrc. This is comparable to a Windows user environment variable: it takes effect only for that user, and the .bashrc file is read every time the user opens a shell terminal.

To modify a file, you can either open it directly with the vim editor or use the gedit text editor, which is similar to Windows Notepad.

Choose either of the following commands.

 vim ~/.bashrc # Use the vim editor to open the .bashrc file in the terminal
 gedit ~/.bashrc # Use the gedit text editor to open the .bashrc file

Add the following line on its own at the end of the file (pointing to the JDK installation location), then save.

Configure JAVA_HOME environment variable.png
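Assuming the installation location mentioned above, the line to add would be export JAVA_HOME=/usr/lib/jvm/java-1.8.0. A minimal sketch that appends it to a bashrc-style file only if it is not already there; the function name set_java_home is hypothetical, and the path should be adjusted if your JDK lives elsewhere:

```shell
# Append "export JAVA_HOME=<jdk_dir>" to the given file unless the
# identical line already exists, so the step can be repeated safely.
set_java_home() {
  rc_file="$1"
  jdk_dir="$2"
  line="export JAVA_HOME=$jdk_dir"
  grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

Typical use would be: set_java_home ~/.bashrc /usr/lib/jvm/java-1.8.0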

Then you also need to make that environment variable take effect by executing the following command.

 source ~/.bashrc # Make the variable settings take effect

After setting it, check that it is correct, as shown in the figure below.

 echo $JAVA_HOME # Check the value of the variable
java -version
javac -version
 $JAVA_HOME/bin/java -version # Same as executing java -version directly

Check that the JAVA_HOME environment variable is configured correctly.png

This installs the Java runtime environment required for Hadoop.

Installing Hadoop

The download address for Hadoop 2.6.5 is given in the software environment above. You can open it directly in Firefox; by default the file is saved to the Downloads folder in the user's home directory, as shown in the figure below.

Download Hadoop.png

Once the download is complete, extract Hadoop into /usr/local/.

 tar -zxf ~/Downloads/hadoop-2.6.5.tar.gz -C /usr/local # Extract to the /usr/local directory
 cd /usr/local/ # Switch the current directory to /usr/local
 mv ./hadoop-2.6.5/ ./hadoop # Rename the folder to hadoop
 chown -R root:root ./hadoop # Change the owner of the files; root is the current username

Once Hadoop is unpacked and ready to use, enter the following command to check if Hadoop is available, and if it succeeds, Hadoop version information will be displayed.

 cd /usr/local/hadoop # Switch the current directory to /usr/local/hadoop directory
 ./bin/hadoop version # View the version information of Hadoop

If Hadoop's bin directory is on your PATH, you can also run hadoop version directly.

 hadoop version # View the version information of Hadoop

View Hadoop version information.png

There are three types of Hadoop installation, namely, standalone mode, pseudo-distributed mode, and distributed mode.

  • Standalone mode: Hadoop's default is non-distributed (local) mode, which runs without additional configuration. Non-distributed means a single Java process, which is convenient for debugging.
  • Pseudo-distributed mode: Hadoop runs on a single node, with each Hadoop daemon running as a separate Java process. The node acts as both NameNode and DataNode, and files are read from HDFS.
  • Distributed mode: multiple nodes form a cluster that runs Hadoop. This requires multiple hosts, which can also be virtual machines.

Hadoop Pseudo-distributed configuration

Now we can run some examples with Hadoop. Hadoop ships with many examples; from /usr/local/hadoop, run ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar to list them all.

Here we run the grep example. It takes all files in the input folder as input, counts how often each word matching the regular expression dfs[a-z.]+ occurs, and writes the result to the output folder.

 cd /usr/local/hadoop # Switch the current directory to /usr/local/hadoop directory
 mkdir ./input # Create the input folder in the current directory
 cp ./etc/hadoop/*.xml ./input # Copy the Hadoop configuration files into the new input folder
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+' 
 cat ./output/* # View the output

Viewing the results with cat ./output/* shows that the word dfsadmin, which matches the regular expression, occurs 1 time.

Running a test Hadoop example.png
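To see what the grep example computes, the same count can be approximated locally with standard tools. This is only a rough single-machine analogue for comparison, not how Hadoop implements the job; the function name count_matches is hypothetical:

```shell
# Print "count<TAB>match" for every string matching dfs[a-z.]+ in the
# given files, most frequent first.
count_matches() {
  grep -ohE 'dfs[a-z.]+' "$@" |  # -o: print only the matches, -h: no file names
    sort |                       # group identical matches together
    uniq -c |                    # count each group
    sort -rn |                   # highest count first
    awk '{ print $1 "\t" $2 }'   # normalize to count, tab, match
}
```

Typical use would be: count_matches ./input/*.xml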

If an error occurs while running, for example the following message:

Error running Hadoop example.png

If you see the message "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable", you can ignore it; this warning does not affect the normal operation of Hadoop.

Note: By default Hadoop does not overwrite result files, so running the example again reports an error. Delete the output folder first.

 rm -rf ./output # Execute in /usr/local/hadoop directory

Having verified that the Hadoop installation works, we can set the Hadoop environment variables, again in the ~/.bashrc file.

 gedit ~/.bashrc # Use the gedit text editor to open the .bashrc file

Add the following to the end of the .bashrc file. Make sure HADOOP_HOME points to the right location; if you followed the configuration above exactly, this part can be copied as is.

# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop

Configuration of Hadoop environment variables.png
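Later commands such as hdfs are typed without a path, which only works if Hadoop's bin directory is on PATH; the start and stop scripts live in sbin. A sketch of lines that would achieve this, assuming Hadoop is in /usr/local/hadoop. The PATH line is an assumption and is not part of the excerpt above:

```shell
# Hadoop Environment Variables (assumed additions to ~/.bashrc)
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
# Assumption: make the hadoop/hdfs commands (bin) and the start/stop
# scripts (sbin) available without typing their full path
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After source ~/.bashrc, the command which hadoop should print a path under $HADOOP_HOME/bin if the installation is where this sketch assumes.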

After saving, remember to close gedit; otherwise it occupies the terminal and the following commands cannot be run. You can press [Ctrl + C] to terminate the program.

After saving, don't forget to run the following command so that the configuration takes effect.

source ~/.bashrc

The Hadoop configuration files are located under /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires changes to two of them, core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format, and each setting is declared as a property with a name and a value.

Modify the configuration file core-site.xml (editing with gedit is easier: run gedit ./etc/hadoop/core-site.xml).

Insert the following between <configuration> and </configuration>.

 <description>Abase for other temporary directories.</description>

Similarly, modify the configuration file hdfs-site.xml.

 gedit ./etc/hadoop/hdfs-site.xml
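For reference, a sketch of what the two files might contain. fs.defaultFS and dfs.replication use the values from the official Hadoop single-node setup guide; hadoop.tmp.dir is the property that the description above belongs to, and its value here is an assumption. The helper names write_core_site and write_hdfs_site are hypothetical, and each call overwrites the target file:

```shell
# Write a pseudo-distributed core-site.xml into the given config directory.
write_core_site() {
  cat > "$1/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <!-- assumed location; any directory writable by the Hadoop user works -->
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF
}

# Write a pseudo-distributed hdfs-site.xml: one node, so one replica per block.
write_hdfs_site() {
  cat > "$1/hdfs-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF
}
```

Typical use would be: write_core_site /usr/local/hadoop/etc/hadoop and write_hdfs_site /usr/local/hadoop/etc/hadoop. Pasting the XML into gedit by hand gives the same result.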


After the configuration is complete, perform the formatting of the NameNode. (This command is required for the first start of Hadoop)

hdfs namenode -format

If it succeeds, you will see "successfully formatted" and "Exitting with status 0"; "Exitting with status 1" indicates an error.

NameNode formatting.png

Next, start Hadoop, that is, the NameNode and DataNode processes.
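Starting the NameNode and DataNode is normally done with the start-dfs.sh script in Hadoop's sbin directory. A minimal sketch, assuming Hadoop lives in /usr/local/hadoop as above; the function name start_hdfs is hypothetical:

```shell
# Run Hadoop's start-dfs.sh, failing early with a clear message if the
# script is not where we expect it.
start_hdfs() {
  hadoop_home="${1:-/usr/local/hadoop}"
  script="$hadoop_home/sbin/start-dfs.sh"
  if [ ! -x "$script" ]; then
    echo "start-dfs.sh not found under $hadoop_home/sbin" >&2
    return 1
  fi
  "$script" # Start the NameNode and DataNode processes
}
```

Typical use would be: start_hdfs, or start_hdfs /path/to/hadoop for a different installation directory.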

If the following SSH prompt appears "Are you sure you want to continue connecting", enter yes.

Notes on starting Hadoop.png

After startup completes, run jps to check. If the following four processes are present, NameNode, DataNode, SecondaryNameNode, and Jps, Hadoop started successfully.

 jps # View process to determine if Hadoop started successfully

Determining if Hadoop started successfully.png
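The check can also be automated. A minimal sketch that reads jps output (lines of the form "pid name") and prints the expected daemons that are missing; the function name missing_daemons is hypothetical:

```shell
# Read jps output on stdin and print each expected HDFS daemon that does
# not appear in it. Prints nothing when all three are running.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode; do
    # Compare the whole second field so NameNode does not match SecondaryNameNode
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}
```

Typical use would be: jps | missing_daemons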

Once Hadoop has started, you can also open the web interface at http://localhost:50070 to view NameNode and DataNode information and browse the files in HDFS.

Hadoop normal startup web interface.png

Start YARN

YARN was split out of MapReduce and is responsible for resource management and task scheduling. MapReduce runs on top of YARN, which provides high availability and high scalability. (Pseudo-distributed mode works without YARN, and this generally does not affect program execution.)

The command used above to start Hadoop starts only the HDFS daemons (NameNode, DataNode, and SecondaryNameNode). We can additionally start YARN and let it take care of resource management and task scheduling.

First, modify the configuration file mapred-site.xml. You need to rename the mapred-site.xml.template file to mapred-site.xml first.

 mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml # rename the file
 gedit ./etc/hadoop/mapred-site.xml # Open with gedit text editor
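A sketch of the content, assuming the setting from the official Hadoop single-node setup guide, which tells MapReduce to run on YARN. The helper name write_mapred_site is hypothetical, and calling it overwrites the target file:

```shell
# Write a mapred-site.xml that selects YARN as the MapReduce framework.
write_mapred_site() {
  cat > "$1/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
EOF
}
```

Typical use would be: write_mapred_site /usr/local/hadoop/etc/hadoop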

Next, modify the configuration file yarn-site.xml

 gedit ./etc/hadoop/yarn-site.xml # Open with gedit text editor
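A sketch of the content, again assuming the setting from the official Hadoop single-node setup guide, which enables the shuffle service that MapReduce jobs need on each NodeManager. The helper name write_yarn_site is hypothetical, and calling it overwrites the target file:

```shell
# Write a yarn-site.xml that enables the MapReduce shuffle auxiliary service.
write_yarn_site() {
  cat > "$1/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
EOF
}
```

Typical use would be: write_yarn_site /usr/local/hadoop/etc/hadoop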

Then you can start YARN and the history server. The history server must be running for jobs to be visible in the web interface.

Note: Before starting YARN, make sure that HDFS (dfs) is already started, i.e., that you have run the Hadoop start step above.
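In Hadoop 2.x, YARN is normally started with start-yarn.sh and the history server with mr-jobhistory-daemon.sh, both in the sbin directory. A minimal sketch, assuming Hadoop lives in /usr/local/hadoop; the function name start_yarn_stack is hypothetical:

```shell
# Start YARN, then the MapReduce job history server.
# Stops early if YARN itself fails to start.
start_yarn_stack() {
  hadoop_home="${1:-/usr/local/hadoop}"
  "$hadoop_home/sbin/start-yarn.sh" || return 1                  # Start YARN
  "$hadoop_home/sbin/mr-jobhistory-daemon.sh" start historyserver # Start the history server
}
```

Typical use would be: start_yarn_stack, run after HDFS is already up.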

Once they are started, jps shows two additional processes, NodeManager and ResourceManager, as shown in the figure below.

Launch YARN.png

After starting YARN, examples are run in the same way as before; only resource management and task scheduling differ. One advantage of running YARN is that you can watch jobs through the web interface at http://localhost:8088/cluster, as shown in the figure below.

YARN's web interface.png

YARN mainly provides better resource management and task scheduling for a cluster. If you do not want to start YARN, be sure to rename the configuration file mapred-site.xml back to mapred-site.xml.template, and rename it again when you need it. Otherwise, with that configuration file present and YARN not running, programs fail with the error "Retrying connect to server:"; this is why the file is initially named mapred-site.xml.template.

To shut down YARN, use the stop counterparts of the start commands (start turns a service on, stop turns it off), including stop historyserver for the history server.
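A minimal sketch of the shutdown, assuming the stop-yarn.sh and mr-jobhistory-daemon.sh scripts in Hadoop's sbin directory and an installation in /usr/local/hadoop; the function name stop_yarn_stack is hypothetical:

```shell
# Stop YARN, then the MapReduce job history server.
# Both steps run even if the first one reports a failure; the function
# returns non-zero if either step failed.
stop_yarn_stack() {
  hadoop_home="${1:-/usr/local/hadoop}"
  status=0
  "$hadoop_home/sbin/stop-yarn.sh" || status=1                             # Stop YARN
  "$hadoop_home/sbin/mr-jobhistory-daemon.sh" stop historyserver || status=1 # Stop the history server
  return "$status"
}
```

Typical use would be: stop_yarn_stack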

In our normal studies, it is sufficient for us to use pseudo-distribution.

