Distributed File System of the Nutch Search Engine

Source: Internet
Author: User

1. Introduction

NDFS (the Nutch Distributed File System) stores a large number of stream-oriented files across a set of machines, with multi-host storage redundancy and load balancing.
Files are stored as blocks spread over the machines in the NDFS system. A traditional input/output stream interface is provided for reading and writing files.
Locating blocks and transferring data over the network are handled automatically by NDFS and are transparent to users. In addition, NDFS copes well with changes to
the set of machines used for storage, so you can easily add or remove a machine. When a machine becomes unavailable, NDFS automatically keeps files available.
As long as the machines that remain online provide enough storage space, the NDFS file system continues to operate normally.
NDFS is built on ordinary disks and does not require RAID controllers or other disk-array solutions.

2. Semantics

1) A file can be written only once. After it is written, it becomes read-only (although it can still be deleted).
2) Files are stream-oriented: bytes can only be appended at the end of a file, and the read and write pointers can only move forward.
3) There is no access control on stored files.

Therefore, all access to NDFS goes through validated client code, and no API is provided for access by other programs. Nutch thus serves as
the model user of NDFS.
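
The three rules above can be modelled in a few lines. The following is a minimal in-memory sketch, not NDFS code: the class WriteOnceStore and its Writer and Reader types are hypothetical names chosen for illustration, and a real implementation would store blocks on datanodes rather than in a map.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of write-once, append-only, forward-only files. */
class WriteOnceStore {
    // Files that have been written and closed; they are read-only from then on.
    private final Map<String, byte[]> sealed = new HashMap<>();

    /** Rule 1: a file that already exists cannot be written again. */
    Writer create(String name) {
        if (sealed.containsKey(name)) {
            throw new IllegalStateException(name + " already exists and is read-only");
        }
        return new Writer(name);
    }

    /** Deleting a sealed file is still allowed. */
    boolean delete(String name) {
        return sealed.remove(name) != null;
    }

    Reader open(String name) {
        byte[] data = sealed.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return new Reader(data);
    }

    /** Rule 2, write side: bytes can only be appended at the end. */
    class Writer {
        private final String name;
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private boolean closed;

        Writer(String name) { this.name = name; }

        void append(byte[] bytes) {
            if (closed) throw new IllegalStateException("file is sealed");
            buf.write(bytes, 0, bytes.length);
        }

        /** Closing the writer seals the file. */
        void close() {
            closed = true;
            sealed.put(name, buf.toByteArray());
        }
    }

    /** Rule 2, read side: the read pointer only moves forward. */
    static class Reader {
        private final byte[] data;
        private int pos;

        Reader(byte[] data) { this.data = data; }

        /** Returns the next byte as 0..255, or -1 at end of file. */
        int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        void seek(int newPos) {
            if (newPos < pos) throw new IllegalArgumentException("cannot seek backwards");
            pos = Math.min(newPos, data.length);
        }
    }

    public static void main(String[] args) {
        WriteOnceStore store = new WriteOnceStore();
        Writer w = store.create("foo.txt");
        w.append("hello".getBytes(StandardCharsets.UTF_8));
        w.close();

        Reader r = store.open("foo.txt");
        r.seek(1);
        System.out.println((char) r.read());
        System.out.println("deleted: " + store.delete("foo.txt"));
    }
}
```

Rule 3 needs no code here: the sketch simply performs no permission checks, which matches the absence of access control described above.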

3. System Design

NDFS contains two types of machines, namenodes and datanodes: namenodes maintain the namespace, while datanodes store
data blocks. An NDFS system contains one namenode and any number of datanodes. Each datanode is configured to communicate with
a single namenode.
1) namenode: stores the entire namespace and the layout of the file system. It is a critical point of failure and must not go down. However, it does
not do much work, so it is not a load bottleneck.
It maintains a table on disk: filename_0 -> blockid_a, blockid_b, ... blockid_x, etc.;
filename is a string, and blockid is a unique identifier. Each filename can have any number of blocks.
2) datanode: responsible for data storage. A block should have copies on multiple datanodes, while a single datanode holds at most
one copy of a given block.
It maintains a table: blockid_x -> array of bytes.

3) Cooperation: after a datanode starts, it contacts the namenode and reports which blocks it holds locally. From these reports the namenode
can build a tree that describes how to find each block in NDFS. This tree is updated in real time. Each datanode also sends regular messages to
the namenode to show that it is alive. If the namenode stops receiving these messages, it assumes that the datanode is down.

4) File read/write process: for example, when a client reads foo.txt, the steps are as follows.
A. The client contacts the namenode over the network and submits the filename "foo.txt".
B. The client receives a response from the namenode that lists the blocks making up "foo.txt" and the datanode sequence for each block.
C. The client reads the blocks in order. For each block, the client picks a suitable datanode from that block's datanode sequence
and sends it a request; the datanode then transmits the data to the client.
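
The two tables and read steps A to C above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the NDFS implementation: ReadPathSketch, BlockLocation, Namenode, Datanode and readFile are hypothetical names, network calls are replaced by direct method calls, and falling back to the next datanode in the sequence when one cannot serve the block is an assumption about what "suitable" means.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReadPathSketch {
    /** One entry of the namenode's response: a block id and the datanodes that hold a copy. */
    static class BlockLocation {
        final String blockId;
        final List<String> datanodes;

        BlockLocation(String blockId, List<String> datanodes) {
            this.blockId = blockId;
            this.datanodes = datanodes;
        }
    }

    /** Namenode side: filename -> ordered block ids, plus block id -> datanode names. */
    static class Namenode {
        final Map<String, List<String>> fileToBlocks = new HashMap<>();
        final Map<String, List<String>> blockToDatanodes = new HashMap<>();

        List<BlockLocation> open(String filename) {
            List<String> blocks = fileToBlocks.get(filename);
            if (blocks == null) {
                throw new IllegalArgumentException("no such file: " + filename);
            }
            List<BlockLocation> out = new ArrayList<>();
            for (String id : blocks) {
                out.add(new BlockLocation(id, blockToDatanodes.getOrDefault(id, new ArrayList<String>())));
            }
            return out;
        }
    }

    /** Datanode side: block id -> array of bytes. */
    static class Datanode {
        final Map<String, byte[]> blocks = new HashMap<>();

        byte[] readBlock(String blockId) {
            return blocks.get(blockId); // null if this datanode has no copy
        }
    }

    /** Client side: ask the namenode for locations, then fetch each block in order. */
    static byte[] readFile(Namenode nn, Map<String, Datanode> cluster, String filename) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (BlockLocation loc : nn.open(filename)) {      // steps A and B
            byte[] data = null;
            for (String name : loc.datanodes) {            // step C; fallback order is an assumption
                Datanode dn = cluster.get(name);
                if (dn != null) data = dn.readBlock(loc.blockId);
                if (data != null) break;
            }
            if (data == null) {
                throw new IllegalStateException("no reachable copy of " + loc.blockId);
            }
            out.write(data, 0, data.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Namenode nn = new Namenode();
        nn.fileToBlocks.put("foo.txt", Arrays.asList("blk_a", "blk_b"));
        nn.blockToDatanodes.put("blk_a", Arrays.asList("dn1", "dn2"));
        nn.blockToDatanodes.put("blk_b", Arrays.asList("dn2", "dn1"));

        Datanode dn1 = new Datanode();
        Datanode dn2 = new Datanode();
        for (Datanode dn : Arrays.asList(dn1, dn2)) {
            dn.blocks.put("blk_a", "hello ".getBytes());
            dn.blocks.put("blk_b", "world".getBytes());
        }
        Map<String, Datanode> cluster = new HashMap<>();
        cluster.put("dn1", dn1);
        cluster.put("dn2", dn2);

        System.out.println(new String(readFile(nn, cluster, "foo.txt")));
    }
}
```

Note that the namenode only hands out locations; the file bytes travel directly between datanode and client, which fits the earlier remark that the namenode does little work.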

4. System Availability

The availability of NDFS depends on block redundancy, that is, on how many datanodes hold a copy of the same block. If resources allow,
you can set three desired copies and a minimum of two (the DESIRED_REPLICATION and MIN_REPLICATION constants in fs.FSNamesystem).
When the number of copies of a block falls below MIN_REPLICATION, the namenode instructs a datanode to create a new copy.
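
The replication check described above can be illustrated with a short sketch. It is not taken from FSNamesystem: ReplicationSketch, underReplicated and pickTarget are hypothetical names, the minimum of two comes from the text, and choosing the first live datanode that does not already hold the block is an assumption made to respect the rule that a datanode holds at most one copy of a block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class ReplicationSketch {
    // Minimum number of copies, as stated in the text.
    static final int MIN_REPLICATION = 2;

    /** Returns the ids of blocks with fewer than MIN_REPLICATION copies, sorted by id. */
    static List<String> underReplicated(Map<String, Set<String>> blockToDatanodes) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockToDatanodes.entrySet()) {
            if (e.getValue().size() < MIN_REPLICATION) {
                out.add(e.getKey());
            }
        }
        Collections.sort(out);
        return out;
    }

    /** Picks the first live datanode that does not already hold the block (assumed policy). */
    static Optional<String> pickTarget(Set<String> holders, List<String> liveDatanodes) {
        for (String dn : liveDatanodes) {
            if (!holders.contains(dn)) {
                return Optional.of(dn);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> blocks = new HashMap<>();
        blocks.put("blk_a", new HashSet<>(Arrays.asList("dn1", "dn2", "dn3")));
        blocks.put("blk_b", new HashSet<>(Arrays.asList("dn1", "dn2")));

        // Simulate the namenode deciding that dn2 is down: drop it from every block's holders.
        for (Set<String> holders : blocks.values()) {
            holders.remove("dn2");
        }

        List<String> live = Arrays.asList("dn1", "dn3", "dn4");
        for (String id : underReplicated(blocks)) {
            String target = pickTarget(blocks.get(id), live).orElse("(no target available)");
            System.out.println(id + " -> create a new copy on " + target);
        }
    }
}
```

In this run blk_a still has two copies after dn2 is removed, so only blk_b is reported as under-replicated.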

5. Introduction to the net.nutch.fs Package
1) NDFS.java: contains two main functions, one for the namenode and one for the datanode.
2) FSNamesystem.java: maintains the namespace and implements namenode functions, such as finding blocks and the available datanode sequence.
3) FSDirectory.java: called by FSNamesystem to maintain the state of the namespace. It logs all namenode state and changes, so that when
the namenode crashes, it can be restored from this log.
4) FSDataset.java: used by the datanode to maintain its block sequence.
5) Block.java and DatanodeInfo: used to maintain block information.
6) FSResults.java and FSParam.java: used to pass parameters over the network.
7) FSConstants.java: contains constants for parameter tuning.
8) NDFSClient.java: used to read and write data.
9) TestClient.java: contains a main function that provides commands for accessing NDFS.

6. Simple Example
1) Create a namenode:
Machine A: java net.nutch.fs.NDFS$NameNode 9000 namedir
2) Create the datanodes:
Machine B: java net.nutch.fs.NDFS$DataNode datadir1 machineB 8000 machineA:9000
Machine C: java net.nutch.fs.NDFS$DataNode datadir2 machineC 8000 machineA:9000

After step 2, you have an NDFS system with one namenode and two datanodes. (They can also run on the same machine, with
NDFS installed in different directories.)

3) Client-side file access:
Create a file: java net.nutch.fs.TestClient machineA:9000 create foo.txt
Read the file: java net.nutch.fs.TestClient machineA:9000 get foo.txt
Rename the file: java net.nutch.fs.TestClient machineA:9000 rename foo.txt bar.txt
Read the file again: java net.nutch.fs.TestClient machineA:9000 get bar.txt
Delete the file: java net.nutch.fs.TestClient machineA:9000 delete bar.txt

 
