5.0 HDFS Cluster
An HDFS cluster is built on top of a Hadoop cluster. Because HDFS is the core component of Hadoop, configuring an HDFS cluster is representative of how a Hadoop cluster is configured in general. Docker makes it easy to build such a cluster environment quickly.
Configuration on Each Machine
How should a Hadoop cluster be configured, and which settings belong on which machines? These questions come up naturally while learning. The configuration in this chapter is a typical example, but Hadoop's configuration options are far more numerous and varied than what is shown here.
The HDFS NameNode controls the DataNodes remotely over SSH, so the key configuration items are set on the NameNode, while node-specific settings are kept on each DataNode. In other words, a DataNode's configuration can differ from the NameNode's, and different DataNodes can also be configured differently.
However, to make it easier to build the cluster in this chapter, the same configuration files are synchronized to every node by packaging them into a Docker image. Keep this simplification in mind.
Specific Steps
The overall idea is as follows: start from the image that already contains Hadoop, add a configuration that every node in the cluster can share, and then create several containers from it to form the cluster.
Configuration Prototype
First, we will start the hadoop_proto image prepared earlier as a container:
docker run -d --name=hadoop_temp --privileged hadoop_proto /usr/sbin/init
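The steps that follow are carried out inside this container as the hadoop user. If you are not already attached to it, a shell can be opened with something like the following (assuming the image provides a hadoop user, as set up in the earlier chapters):

```
# Open an interactive shell in the temporary container as the hadoop user
docker exec -it hadoop_temp su - hadoop
```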
Enter the Hadoop configuration file directory:
cd $HADOOP_HOME/etc/hadoop
Now, a brief description of the role of the files here:
| File | Function |
|---|---|
| workers | Lists the hostnames or IP addresses of all DataNodes |
| core-site.xml | Hadoop core configuration |
| hdfs-site.xml | HDFS configuration items |
| mapred-site.xml | MapReduce configuration items |
| yarn-site.xml | YARN configuration items |
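As a quick check (not part of the original steps), you can confirm these files are present in the configuration directory:

```
# Still inside $HADOOP_HOME/etc/hadoop; list the files we are about to edit
ls workers core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml
```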
We now design a simple cluster:
1 NameNode: nn
2 DataNodes: dn1 and dn2
First, edit workers and change the file content to:
dn1
dn2
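If you would rather write the file from the shell than open an editor, a heredoc achieves the same result (assuming the current directory is still $HADOOP_HOME/etc/hadoop):

```
# Overwrite workers with the two DataNode hostnames
cat > workers <<EOF
dn1
dn2
EOF
```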
Then edit core-site.xml and add the following configuration items:
<!-- Configure the HDFS host address and port -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://nn:9000</value>
</property>
<!-- Configure Hadoop's temporary file directory -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:///home/hadoop/tmp</value>
</property>
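These properties go inside the file's <configuration> element. For reference, a minimal core-site.xml for this cluster would look roughly like this:

```
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- HDFS host address and port -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn:9000</value>
    </property>
    <!-- Hadoop temporary file directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///home/hadoop/tmp</value>
    </property>
</configuration>
```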
Configure hdfs-site.xml and add the following configuration items:
<!-- Store two replicas of each data block -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
<!-- Directory where the NameNode stores its metadata -->
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/name</value>
</property>
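The paths above live under the hadoop user's home directory. Hadoop normally creates them when the NameNode is formatted, but pre-creating them does no harm and makes the layout explicit (an optional step, not part of the original instructions):

```
# As the hadoop user: create the temporary and NameNode metadata directories
mkdir -p /home/hadoop/tmp /home/hadoop/hdfs/name
```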
Finally, configure SSH:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa hadoop@localhost
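As an optional check, confirm that the hadoop user can now SSH to itself without a password; the same key pair will be reused by every node once the image is cloned:

```
# Should print OK without prompting for a password
ssh -o StrictHostKeyChecking=no localhost echo OK
```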
With this, the cluster prototype is configured. Exit the container, stop it, and commit it as a new image named cluster_proto:
docker stop hadoop_temp
docker commit hadoop_temp cluster_proto
If you wish, you can now delete the temporary container hadoop_temp.
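For example (optional):

```
# Remove the stopped temporary container; the committed cluster_proto image is unaffected
docker rm hadoop_temp
```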
Deploy the Cluster
Next, deploy the cluster.
First, establish a dedicated network hnet for the Hadoop cluster:
docker network create --subnet=172.20.0.0/16 hnet
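Optionally, confirm the network was created with the expected subnet:

```
# hnet should be listed, and its subnet should be 172.20.0.0/16
docker network ls | grep hnet
docker network inspect hnet | grep Subnet
```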
Next, create cluster containers:
```
docker run -d --name=nn  --hostname=nn  --network=hnet --ip=172.20.1.0 \
    --add-host=dn1:172.20.1.1 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init

docker run -d --name=dn1 --hostname=dn1 --network=hnet --ip=172.20.1.1 \
    --add-host=nn:172.20.1.0 --add-host=dn2:172.20.1.2 \
    --privileged cluster_proto /usr/sbin/init

docker run -d --name=dn2 --hostname=dn2 --network=hnet --ip=172.20.1.2 \
    --add-host=nn:172.20.1.0 --add-host=dn1:172.20.1.1 \
    --privileged cluster_proto /usr/sbin/init
```
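As a final sanity check (an optional addition, not part of the original steps), verify that all three containers are running and can resolve each other's hostnames:

```
# All three containers should be listed as running
docker ps --filter name=nn --filter name=dn1 --filter name=dn2

# From the NameNode container, the DataNode hostnames should resolve and respond
# (assumes the ping utility is available in the image)
docker exec nn ping -c 1 dn1
docker exec nn ping -c 1 dn2
```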