5.0 HDFS Cluster
An HDFS cluster is built on top of a Hadoop cluster. Because HDFS is the core component of Hadoop, configuring an HDFS cluster is representative of how a Hadoop cluster is configured in general. Docker makes it easy to build such a cluster environment quickly.
Configuration on Each Machine
How should a Hadoop cluster be configured, and which settings belong on which machines? These questions come up naturally while learning. The configuration in this chapter is a typical example, but Hadoop's configuration options are far more numerous and varied than what is shown here.
The HDFS NameNode controls the DataNodes remotely over SSH, so the key configuration items are set on the NameNode, while node-specific settings are kept on each DataNode. In other words, a DataNode's configuration can differ from the NameNode's, and different DataNodes can also be configured differently.
However, to make it easier to build the cluster in this chapter, the same configuration files are synchronized to every node by packaging them into a Docker image. Keep this simplification in mind.
Specific Steps
The overall idea is as follows: start from the image that already contains Hadoop, add a configuration that every node in the cluster can share, and then create several containers from it to form the cluster.
Configuration Prototype
First, we will start the hadoop_proto image prepared earlier as a container:
docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init
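The steps that follow are carried out inside this container as the hadoop user. If you are not already attached to it, a shell can be opened with something like the following (assuming the image provides a hadoop user, as set up in the earlier chapters):

```
# Open an interactive shell in the temporary container as the hadoop user
docker exec -it hadoop_temp su - hadoop
```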
Enter the Hadoop configuration file directory:
cd $HADOOP_HOME/etc/hadoop
Now, a brief description of the role of the files here:
| File | Function |
|---|---|
| workers | Lists the hostnames or IP addresses of all DataNodes |
| core-site.xml | Hadoop core configuration |
| hdfs-site.xml | HDFS configuration items |
| mapred-site.xml | MapReduce configuration items |
| yarn-site.xml | YARN configuration items |
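As a quick check (not part of the original steps), you can confirm these files are present in the configuration directory:

```
# Still inside $HADOOP_HOME/etc/hadoop; list the files we are about to edit
ls workers core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
```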
We now design a simple cluster:
1 NameNode: nn
2 DataNodes: dn1 and dn2
First, edit workers and change the file content to:
dn1
dn2
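If you would rather write the file from the shell than open an editor, a heredoc achieves the same result (assuming the current directory is still $HADOOP_HOME/etc/hadoop):

```
# Overwrite workers with the two DataNode hostnames
cat > workers <<EOF
dn1
dn2
EOF
```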
Then edit core-site.xml and add the following configuration items:
<!-- Configure the HDFS host address and port -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
</property>
<!-- Configure Hadoop's temporary file directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:///home/hadoop/tmp</value>
</property>
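These properties go inside the file's <configuration> element. For reference, a minimal core-site.xml for this cluster would look roughly like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- HDFS host address and port -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn:9000</value>
    </property>
    <!-- Hadoop temporary file directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///home/hadoop/tmp</value>
    </property>
</configuration>
```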
Configure hdfs-site.xml and add the following configuration items:
<!-- Store two replicas of each data block -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<!-- Directory where the NameNode stores its metadata -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/name</value>
</property>
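The paths above live under the hadoop user's home directory. Hadoop normally creates them when the NameNode is formatted, but pre-creating them does no harm and makes the layout explicit (an optional step, not part of the original instructions):

```
# As the hadoop user: create the temporary and NameNode metadata directories
mkdir -p /home/hadoop/tmp /home/hadoop/hdfs/name
```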
Finally, configure SSH:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa hadoop@localhost
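As an optional check, confirm that the hadoop user can now SSH to itself without a password; the same key pair will be reused by every node once the image is cloned:

```
# Should print OK without prompting for a password
ssh -o StrictHostKeyChecking=no localhost echo OK
```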
With this, the cluster prototype is configured. Exit the container, stop it, and commit it as a new image named cluster_proto:
docker stop hadoop_temp
docker commit hadoop_temp cluster_proto
If you wish, you can now delete the temporary container hadoop_temp.
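For example (optional):

```
# Remove the stopped temporary container; the committed cluster_proto image is unaffected
docker rm hadoop_temp
```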
Deploy the Cluster
Next, deploy the cluster.
First, establish a dedicated network hnet for the Hadoop cluster:
docker network create --subnet=172.20.0.0/16 hnet
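Optionally, confirm the network was created with the expected subnet:

```
# hnet should be listed, and its subnet should be 172.20.0.0/16
docker network ls | grep hnet
docker network inspect hnet | grep Subnet
```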
Next, create cluster containers:
```
docker run -d --name=nn  --hostname=nn  --network=hnet --ip=172.20.1.0 \
    --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init

docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 \
    --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init

docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 \
    --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 \
    --privileged cluster_proto /usr/sbin/init
```
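As a final sanity check (an optional addition, not part of the original steps), verify that all three containers are running and can resolve each other's hostnames:

```
# All three containers should be listed as running
docker ps --filter name=nn --filter name=dn1 --filter name=dn2

# From the NameNode container, the DataNode hostnames should resolve and respond
# (assumes the ping utility is available in the image)
docker exec nn ping -c 1 dn1
docker exec nn ping -c 1 dn2
```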