2.0 Hadoop Runtime Environment
Since Hadoop is designed for clusters, anyone learning it will sooner or later need to configure Hadoop on multiple computers. This presents two main obstacles for learners:
Expensive computer clusters. A cluster environment built from multiple physical computers requires expensive hardware.
Difficult deployment and maintenance. Installing an identical software environment on many computers is a significant amount of work, and the result is inflexible: once the environment changes, redeploying everything is hard.
To address these issues, we have a very mature solution: Docker.
Docker is a container management system that can run multiple "virtual machines" (containers) and assemble them into a cluster, much like real virtual machines. However, a virtual machine fully virtualizes a computer, which consumes a lot of hardware resources and is inefficient. Docker instead merely provides an isolated, reproducible runtime environment: all processes inside a container still execute in the host's kernel, so their efficiency is almost identical to that of processes running directly on the host (close to 100%).
This tutorial describes the usage of Hadoop with Docker as the underlying environment. If you are not familiar with Docker, please study the Docker Tutorial first.
Note: Windows users are advised to install Docker using a virtual machine solution.
Docker Deployment
After entering the Docker command line, pull a Linux image to serve as the Hadoop runtime environment. The CentOS image is recommended here (Debian and other images currently still have some issues).
docker pull centos:8
Then, you can view the local images with the docker images command:
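docker images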
Now, create a container:
docker run -d centos:8 /usr/sbin/init
You can view running containers with docker ps:
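docker ps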
We can also have Docker run a one-off container that prints "Hello World"; a typical command is:
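docker run centos:8 /bin/echo "Hello World"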
This indicates that Docker has been successfully installed and deployed.
Creating a Container
Hadoop supports running on a single device, mainly in two modes: standalone mode and pseudo-cluster mode.
This chapter covers the installation and standalone mode of Hadoop.
Configuring Java and SSH Environment
Now, create a container named java_ssh_proto in which to set up an environment containing Java and SSH. The --privileged flag together with the /usr/sbin/init entry point lets systemd run inside the container, which the sshd service below requires:
docker run -d --name=java_ssh_proto --privileged centos:8 /usr/sbin/init
Then, enter the container:
docker exec -it java_ssh_proto bash
Point the yum repositories at the USTC mirror:
sed -e 's|^mirrorlist=|#mirrorlist=|g' \
-e 's|^#baseurl=http://mirror.centos.org/$contentdir|baseurl=https://mirrors.ustc.edu.cn/centos|g' \
-i.bak \
/etc/yum.repos.d/CentOS-Linux-AppStream.repo \
/etc/yum.repos.d/CentOS-Linux-BaseOS.repo \
/etc/yum.repos.d/CentOS-Linux-Extras.repo \
/etc/yum.repos.d/CentOS-Linux-PowerTools.repo \
/etc/yum.repos.d/CentOS-Linux-Plus.repo
yum makecache
Install OpenJDK 8 and SSH services:
yum install -y java-1.8.0-openjdk-devel openssh-clients openssh-server
Then, enable SSH services:
systemctl enable sshd && systemctl start sshd
Note: For Ubuntu systems, use the following command to start the SSH service:
systemctl enable ssh && systemctl start ssh
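To confirm that the SSH daemon is actually running, you can check its status (shown here for the CentOS container above; on Ubuntu, query ssh instead):
systemctl status sshd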
If no errors occur, a prototype container with a Java runtime environment and an SSH environment has been created. This is a crucial container, so it is recommended to leave it with the exit command, then stop the container and save it as an image named java_ssh:
docker stop java_ssh_proto
docker commit java_ssh_proto java_ssh
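Afterwards, running docker images again should list the new java_ssh image alongside centos:8.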
Hadoop Installation
Download Hadoop
Hadoop official website: http://hadoop.apache.org/
Hadoop release versions download: https://hadoop.apache.org/releases.html
In current tests, the 3.1.x and 3.2.x release lines show the best compatibility. This tutorial uses version 3.1.4 as an example.
From a Hadoop 3.1.4 download mirror, fetch the tar.gz package and keep it for later use.
Create Hadoop Standalone Container
Now, create a container named hadoop_single using the previously saved java_ssh image:
docker run -d --name=hadoop_single --privileged java_ssh /usr/sbin/init
Copy the downloaded Hadoop package into the /root directory of the container:
docker cp <path where you stored the Hadoop compressed package> hadoop_single:/root/
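For example, if the package sits in your current directory on the host:
docker cp hadoop-3.1.4.tar.gz hadoop_single:/root/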
Enter the container:
docker exec -it hadoop_single bash
Go to the /root directory:
cd /root
Here, you should find the hadoop-3.1.4.tar.gz file that was just copied. Now, extract it:
tar -zxf hadoop-3.1.4.tar.gz
After extraction, you will get a folder named hadoop-3.1.4. Now, move it to a common location:
mv hadoop-3.1.4 /usr/local/hadoop
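If the move succeeded, listing the new location should show Hadoop's top-level directories (bin, etc, sbin, share, and so on):
ls /usr/local/hadoop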
Then, configure the environment variables:
echo "export HADOOP_HOME=/usr/local/hadoop" >> /etc/bashrc
echo "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin" >> /etc/bashrc
Exit the Docker container and re-enter.
At this point, the result of echo $HADOOP_HOME should be /usr/local/hadoop. Next, register JAVA_HOME and HADOOP_HOME in Hadoop's own configuration file, hadoop-env.sh:
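# JAVA_HOME can point at /usr here because the yum-installed OpenJDK provides /usr/bin/java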
echo "export JAVA_HOME=/usr" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "export HADOOP_HOME=/usr/local/hadoop" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
With Hadoop's internal environment variables in place, execute the following command to check whether everything succeeded:
hadoop version
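The first line of the output should read Hadoop 3.1.4, followed by build details.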
This indicates that your standalone Hadoop installation has been configured successfully.