Login to SSH again.png
Installing the Java Environment
For the Java environment you can choose either Oracle's JDK or OpenJDK (which can be regarded as the open-source version of the JDK). Most current Linux distributions install OpenJDK by default; here we install OpenJDK 1.8.0.
Some CentOS 6.4 systems have OpenJDK 1.7 installed by default. As under Windows, we can check this with the following commands, and also check the value of the JAVA_HOME environment variable.
java -version # View the version of java
javac -version # View the version of the compile command Javac
echo $JAVA_HOME # View the value of the $JAVA_HOME environment variable
If the system does not have OpenJDK installed, we can install it with the yum package manager. (The installation process will prompt [y/N]; enter y to continue.)
yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel # Install OpenJDK 1.8.0
After installing OpenJDK with the above command, the default installation location is /usr/lib/jvm/java-1.8.0; use this location when configuring JAVA_HOME below.
Next, you need to configure the JAVA_HOME environment variable. For convenience, set it directly in ~/.bashrc, which is equivalent to configuring a Windows user environment variable that only takes effect for a single user: after the user logs in, the .bashrc file is read every time a shell terminal is opened.
To modify a file, you can either open it directly with the vim editor or use the gedit text editor, which is similar to Windows Notepad.
Choose either of the following commands.
vim ~/.bashrc # Use the vim editor to open the .bashrc file in the terminal
gedit ~/.bashrc # Use the gedit text editor to open the .bashrc file
Add the following single line at the end of the file (pointing to the JDK installation location).
Configure JAVA_HOME environment variable.png
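The line shown in the figure should look like the following, assuming the default OpenJDK install location mentioned above:

```shell
# Point JAVA_HOME at the OpenJDK installation (default yum install location)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
```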
Then you also need to make that environment variable take effect by executing the following command.
source ~/.bashrc # Make the variable settings take effect
After setting it up, let's check that it is set correctly, as shown in the figure below.
echo $JAVA_HOME # Check the value of the variable
$JAVA_HOME/bin/java -version # Same as executing java -version directly
Check that the JAVA_HOME environment variable is configured correctly.png
This installs the Java runtime environment required for Hadoop.
The download address for Hadoop 2.6.5 has already been provided in the software environment; you can open it directly in Firefox to download. The default download location is the Downloads folder in the user's home directory, as shown in the figure below.
Once the download is complete, we unzip Hadoop into /usr/local/.
tar -zxf ~/Downloads/hadoop-2.6.5.tar.gz -C /usr/local # Extract to the /usr/local directory
cd /usr/local/ # Switches the current directory to /usr/local
mv ./hadoop-2.6.5/ ./hadoop # Change the folder name to hadoop
chown -R root:root ./hadoop # Change the file permissions, root is the current username
Once Hadoop is unpacked it is ready to use. Enter the following commands to check whether Hadoop is available; if successful, the Hadoop version information will be displayed.
cd /usr/local/hadoop # Switch the current directory to /usr/local/hadoop directory
./bin/hadoop version # View the version information of Hadoop
Or you can simply type the following command to view it.
hadoop version # View the version information of Hadoop
View Hadoop version information.png
There are three Hadoop installation modes: standalone mode, pseudo-distributed mode, and distributed mode.
Standalone mode: Hadoop defaults to non-distributed mode (local mode) and runs without additional configuration. Non-distributed means a single Java process, which is convenient for debugging.
Pseudo-distributed mode: Hadoop can run in a pseudo-distributed manner on a single node. The Hadoop daemons run as separate Java processes, with the node acting as both NameNode and DataNode, and files are read from HDFS.
Distributed mode: uses multiple nodes to form a cluster to run Hadoop. This requires multiple hosts, which can also be virtual machines.
Hadoop Pseudo-Distributed Configuration
Now we can run some examples with Hadoop, which comes with many runnable examples. Execute the following command to list all of them.
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.5.jar # List all the examples
Here we run a grep example: we use the input folder as input, filter for words matching the regular expression dfs[a-z.]+, count the number of occurrences, and output the result to the output folder.
cd /usr/local/hadoop # Switch the current directory to the /usr/local/hadoop directory
mkdir ./input # Create the input folder in the current directory
cp ./etc/hadoop/*.xml ./input # Copy the hadoop configuration file to the new input folder input
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ./input ./output 'dfs[a-z.]+'
cat ./output/* # View the output
Viewing the results, the word dfsadmin, which matches the regular expression, occurs 1 time.
Running a test Hadoop example.png
If there is an error running, such as the following prompt.
Error running Hadoop example.png
If the prompt "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable" appears, this WARN can be ignored and does not affect the normal operation of Hadoop.
Note: Hadoop does not overwrite the result file by default, so running the above example again will prompt an error; you need to delete the output folder first.
rm -rf ./output # Execute in the /usr/local/hadoop directory
Having tested that our Hadoop installation is OK, we can set up the Hadoop environment variables, again configured in the ~/.bashrc file.
gedit ~/.bashrc # Use the gedit text editor to open the .bashrc file
Add the following to the end of the .bashrc file, making sure HADOOP_HOME points to the right location; if everything follows the previous configuration, this part can be copied as-is.
# Hadoop Environment Variables
Configuration of Hadoop environment variables.png
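The figure typically contains lines like the following, a sketch assuming the /usr/local/hadoop install location used earlier (the exact set of variables may vary slightly by setup):

```shell
# Hadoop Environment Variables (assumes Hadoop is installed at /usr/local/hadoop)
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# Add the bin and sbin directories to PATH so hadoop/start-dfs.sh can be run directly
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```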
After saving, remember to close the gedit program; otherwise it will occupy the terminal and the following commands cannot be executed. You can press Ctrl + C to terminate the program.
After saving, don't forget to execute the following command to make the configuration take effect.
source ~/.bashrc # Make the variable settings take effect
The Hadoop configuration files are located under /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying 2 configuration files: core-site.xml and hdfs-site.xml. Hadoop configuration files are in XML format, and each configuration item is implemented by declaring the property's name and value.
Modify the configuration file
core-site.xml (it's easier to edit via gedit, by typing the command gedit ./etc/hadoop/core-site.xml).
Insert the property definitions between the <configuration> and </configuration> tags.
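For reference, a typical pseudo-distributed core-site.xml looks like the following; the fs.defaultFS value is the standard single-node setting, and the hadoop.tmp.dir path assumes the /usr/local/hadoop install location used in this tutorial:

```xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```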
Similarly, modify the configuration file hdfs-site.xml (gedit ./etc/hadoop/hdfs-site.xml), again inserting the properties between the <configuration> and </configuration> tags.
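A typical pseudo-distributed hdfs-site.xml sets the replication factor to 1 (there is only one DataNode) and fixes the NameNode and DataNode storage directories; the paths below assume the same /usr/local/hadoop/tmp directory as above:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
```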
After the configuration is complete, format the NameNode. (This command is only required before the first start of Hadoop.)
hdfs namenode -format
If successful, you will see "successfully formatted" and "Exitting with status 0"; if it says "Exitting with status 1", an error occurred.
Next start Hadoop.
start-dfs.sh # Start the NameNode and DataNode processes
If the following SSH prompt appears "Are you sure you want to continue connecting", enter yes.
Notes on starting Hadoop.png
After startup is complete, you can check with the jps command. If the following four processes are present, NameNode, DataNode, SecondaryNameNode, and Jps, then Hadoop has started successfully.
jps # View processes to determine whether Hadoop started successfully
Determining if Hadoop started successfully.png
Once successfully started, you can also access the web interface at http://localhost:50070 to view NameNode and DataNode information, as well as browse files in HDFS online.
Hadoop normal startup web interface.png
YARN was separated out of MapReduce and is responsible for resource management and task scheduling. MapReduce runs on top of YARN, which provides high availability and high scalability. (Pseudo-distributed mode works fine without YARN, and it generally does not affect program execution.)
The start-dfs.sh command mentioned above only starts the HDFS daemons; we can additionally start YARN and let it take care of resource management and task scheduling.
First, modify the configuration file mapred-site.xml. You need to rename the mapred-site.xml.template file to mapred-site.xml first.
mv ./etc/hadoop/mapred-site.xml.template ./etc/hadoop/mapred-site.xml # Rename the file
gedit ./etc/hadoop/mapred-site.xml # Open with the gedit text editor and edit between the <configuration> tags
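The property to insert between the <configuration> tags is the standard one that tells MapReduce to run on YARN:

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```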
Next, modify the configuration file yarn-site.xml.
gedit ./etc/hadoop/yarn-site.xml # Open with the gedit text editor and edit between the <configuration> tags
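For yarn-site.xml, the standard pseudo-distributed setting enables the MapReduce shuffle service on the NodeManager:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```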
Then you can start YARN by executing the following commands. (Note: before starting YARN, make sure that HDFS is already running, i.e., that you have already executed start-dfs.sh.)
start-yarn.sh # Start YARN
mr-jobhistory-daemon.sh start historyserver # Start the history server so job runs can be viewed in the web interface
After starting, check with jps; you will see two additional processes, NodeManager and ResourceManager, as shown in the figure below.
After starting YARN, the method of running the examples is still the same; only the resource management and task scheduling differ. One advantage of starting YARN is that you can see what tasks are running through the web interface at http://localhost:8088/cluster, as shown in the figure below.