
3.0 Hadoop Concepts


This chapter covers the concepts and components of Hadoop; it is a theoretical section. If you are in a hurry, you may skip it, but the author recommends against doing so, since it is closely tied to the chapters that follow.

Hadoop Overall Design

The Hadoop framework is designed to process large data sets on computer clusters, so it must be software that can be deployed across multiple computers. Hosts running Hadoop communicate with one another through sockets (over the network).

Hadoop mainly consists of two major components: HDFS and MapReduce. HDFS handles distributed data storage, while MapReduce handles data processing: mapping the data, reducing it, and summarizing the results.

The most fundamental principle of the Hadoop framework is to use a large number of computers working in parallel to speed up the processing of huge data sets. For example, a search engine company that needs to filter and summarize popular search terms from trillions of raw records must organize a large number of computers into a cluster to process that information. Handling it with a traditional database would take enormous time and storage space, on a scale no single computer can reach: assembling that much hardware into one high-speed machine is difficult, and even if it succeeded, the maintenance costs would be prohibitive.

Hadoop, by contrast, can run on thousands of inexpensive, mass-produced computers organized into a cluster.

A Hadoop cluster can store data and distribute processing tasks efficiently, which has many benefits. First, it lowers the cost of building and maintaining the machines. Second, a hardware failure on any single computer is not fatal to the system as a whole, because a cluster framework built at the application layer must assume that individual computers will fail.

HDFS

Hadoop Distributed File System, abbreviated as HDFS.

HDFS stores the cluster's files. Its core design follows the ideas of Google's GFS (Google File System), and it can store very large files.

In server clusters, file storage must be both efficient and stable, and HDFS achieves both.

HDFS achieves storage efficiency by letting the machines in the cluster handle requests independently. When a user (usually a backend program) issues a storage request, the responding server is often busy with other requests, which is the main cause of slow service. If, instead, the responding server simply assigns the user a data server, and the user then interacts with that data server directly, throughput is much higher.
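This division of labor is visible in the HDFS Java client. Below is a minimal sketch of reading a file, assuming a cluster reachable at hdfs://localhost:9000 and a file /demo/input.txt (both are hypothetical names for illustration): open() first asks the NameNode where the file's blocks live, and the returned stream then reads directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Assumes an HDFS cluster at hdfs://localhost:9000 (hypothetical address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations of the file;
        // the stream then pulls the actual bytes straight from the DataNodes.
        Path path = new Path("/demo/input.txt"); // hypothetical file
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}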

Storage stability is usually achieved by keeping multiple copies of the data, and HDFS uses this method as well. The storage unit of HDFS is the block: a file may be split into multiple blocks, which are stored on the physical storage devices. HDFS replicates each block n times, according to the configured replication factor, and places the copies on different DataNodes (the servers that store the data), so the failure of a single data node does not lose any data.
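As a sketch of how this replication factor is controlled from client code, the following assumes a reachable HDFS cluster and a hypothetical path /demo/data.txt; dfs.replication is the standard Hadoop configuration key for the number of copies.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies of each block HDFS keeps.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Files created through this FileSystem now get 3 replicas per block.
        Path path = new Path("/demo/data.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(path, (short) 2);
    }
}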

HDFS Nodes

HDFS runs on many different computers: some store the data, and some direct other computers to store data. Each such "computer" is called a node of the cluster.

NameNode

The NameNode is the node that directs the other nodes' storage. Any file system (FS) must be able to map file paths to the files themselves; the NameNode is the machine that stores this mapping information and serves mapping requests. It acts as the "administrator" of the entire HDFS system, so an HDFS cluster has only one NameNode.
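A short sketch of querying that mapping: listing a directory returns pure metadata (paths, sizes, replication factors), so the request is answered by the NameNode alone. The default configuration and the root path "/" here are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // listStatus() is served from the NameNode's metadata;
        // no DataNode is contacted because no file content is read.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.printf("%s\t%d bytes\treplication=%d%n",
                    status.getPath(), status.getLen(), status.getReplication());
        }
    }
}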

DataNode

The DataNode is a node that stores data blocks. When the NameNode accepts a file and splits it into blocks, the blocks are stored on the DataNodes it assigns. DataNodes store, read, and write data; the blocks they hold are analogous to the "sectors" of a hard disk and are the basic unit of HDFS storage.
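To see how a file decomposes into blocks spread over DataNodes, the client API can report block locations. A minimal sketch follows; the path /demo/big.log is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus file = fs.getFileStatus(new Path("/demo/big.log")); // hypothetical file

        // Each BlockLocation describes one block: its offset within the file,
        // its length, and the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}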

Secondary NameNode

The Secondary NameNode, also translated as the "secondary name node," is the NameNode's "secretary." The description is apt, because it cannot take over the NameNode's job, regardless of whether the NameNode is still able to work. Its role is to share the NameNode's load, back up the NameNode's state, and perform some administrative work when the NameNode requires it. If the NameNode fails, its backup data can be used to restore the NameNode. There can be multiple Secondary NameNodes.

MapReduce

The meaning of MapReduce is as plain as its name suggests: data processing is split into two stages, Map and Reduce. The Map stage maps input data to intermediate key-value pairs, and the Reduce stage summarizes those pairs into the final result.
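To make the two stages concrete, below is the classic WordCount example, a minimal sketch in the standard Hadoop Java MapReduce API; the input and output paths are assumed to be supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every word in the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}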
