Nutch Hadoop Tutorial

How to install Nutch and Hadoop

Web searches and mailing list searches rarely turn up articles about how to install Nutch using the Hadoop (formerly NDFS) distributed file system (HDFS) and MapReduce. The purpose of this tutorial is to explain, step by step, how to run Nutch on a multi-node Hadoop file system, including both indexing (crawling) and searching across multiple machines.

This document does not cover Nutch or Hadoop architecture. It only describes how to get the system up and running. At the end of the tutorial, however, I will point you to the relevant resources if you want to learn more about the architecture of Nutch and Hadoop.

This tutorial makes a few assumptions:

First, I performed some setup using root-level access. This included creating the same user on multiple machines and creating directories outside of that user's home directory. Installing Nutch and Hadoop does not strictly require root access (although it is sometimes handy). If you do not have root access, you will need the same user on every machine you are using, and you will probably need to keep the file system inside your home directory.

Second, because Hadoop uses SSH to start its servers, all of the machines will need to have an SSH server running (not just a client).

Third, this tutorial uses Whitebox Enterprise Linux 3 Respin 2 (WHEL). Those who don't know Whitebox can think of it as a clone of Red Hat Enterprise Linux. You should be able to adapt the instructions to any Linux system, but Whitebox is what I used.

Fourth, this tutorial uses Nutch 0.8 dev revision 385702, and may not be compatible with future versions of Nutch or Hadoop.

Fifth, in this tutorial we install Nutch on 6 different machines. If you are using a different number of machines you should still be fine, but you should have at least two different machines to demonstrate the distributed capabilities of HDFS and MapReduce.

Sixth, in this guide we build Nutch from source. You can also get nightly builds of Nutch and Hadoop; I give those links later.

Seventh, keep in mind that this tutorial comes from my personal experience installing Nutch and Hadoop. If you run into errors, try searching and posting to the Nutch or Hadoop user mailing lists. And of course, suggestions that help improve the tutorial are welcome.

Network Settings

First, let me lay out the machines used in our setup. To install Nutch and Hadoop we used 7 PCs ranging from 750 MHz to 1.0 GHz. Each machine had at least 128 MB of RAM and at least a 10 GB hard drive. One machine had two 750 MHz CPUs and another had two 30 GB hard drives. All of these machines were purchased on clearance for under $500.00 each. I mention this to show that you do not need big hardware to get up and running with Nutch and Hadoop. Our machines are named like this:

Devcluster01
Devcluster02
Devcluster03
Devcluster04
Devcluster05
Devcluster06

Our master node is devcluster01. The master node runs the Hadoop services that coordinate the slave nodes (all of the other machines), and it is also the machine on which we run our crawls and deploy our search web application.

Download Nutch and Hadoop

Both Nutch and Hadoop can be downloaded from the Apache web site. The necessary Hadoop components are bundled with Nutch, so unless you plan to develop Hadoop itself, you only need to download Nutch.

After downloading the Nutch source from the repository, we need to build it. There are also nightly builds of both Nutch and Hadoop:

http://cvs.apache.org/dist/lucene/nutch/nightly/

http://cvs.apache.org/dist/lucene/hadoop/nightly/

I develop with Eclipse, so I used the Eclipse Subversion plug-in to check Nutch out of the repository. The Subversion plug-in for Eclipse can be installed from the update site below:

http://subclipse.tigris.org/update_1.0.x

If you are not using Eclipse, you will need a Subversion client. Once you have a client, you can browse the Nutch version control page:

http://lucene.apache.org/nutch/version_control.html

Alternatively, you can check out the Nutch repository directly with your client:

http://svn.apache.org/repos/asf/lucene/nutch/

You can check the code out through Eclipse or straight onto a standard file system; either way the result is the same. We build with Ant, so it is easiest if you have both Java and Ant installed.
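If you are not using Eclipse and just want the source on disk, a command-line checkout would look something like the sketch below. The trunk path here is an assumption about the repository layout at the time; adjust it (or add a revision with -r) to match what you actually want to build:

# check out the Nutch source (trunk path assumed) into a local 'nutch' directory
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch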

I am not going to explain how to install Java or Ant; if you are working with this software you should already know how, and there are plenty of tutorials on building with Ant. If you want a complete Ant reference, I recommend Erik Hatcher's book "Java Development with Ant".

Compile Nutch and Hadoop

Once you have downloaded Nutch, your download directory should contain the following folders and files:

+ bin
+ conf
+ docs
+ lib
+ site
+ src
build.properties (add this one)
build.xml
CHANGES.txt
default.properties
index.html
LICENSE.txt
README.txt

Add a new build.properties file, and inside it add a variable called dist.dir whose value is the location where you want Nutch to be built. If you are building on Linux, it would look something like this:

dist.dir=/path/to/build

This step is actually optional; Nutch will create a build directory inside the directory it was extracted to by default, but I prefer building into an external directory. You can name the build directory whatever you like, but I recommend using a new, empty folder. Keep in mind that if the build folder does not exist, you must create it yourself.

Build Nutch by invoking the Ant package task:

ant package

This should build Nutch into your build folder. When it completes, you are ready to start deploying and configuring Nutch.

Build the Deployment Architecture

When we deploy Nutch to all six machines, we will invoke the start-all.sh script to start the services on the master node and the data nodes. That script starts the Hadoop daemons on the master node and then sshes into each of the slave nodes and starts their daemons.

The start-all.sh script expects Nutch to be installed in exactly the same location on every machine. It also expects Hadoop to store its data at exactly the same paths on every machine.

To satisfy this, we create the following directory structure on every machine. The search directory is the Nutch installation directory. The filesystem directory is the root of the Hadoop file system. The home directory is the nutch user's home directory. On the master node we also install a Tomcat 5.5 server for searching.

/nutch
/search
(Nutch installation goes here)
/filesystem
/local (used for local directory for searching)
/home
(nutch user's home directory)
/tomcat (only in one server for searching)

I am not going to cover installing Tomcat; there are plenty of tutorials on how to do that. What I will say is that we removed all of the web application files from the webapps directory, created a folder named ROOT under webapps, and unpacked the Nutch web application file (nutch-0.8-dev.war) into that folder. This makes it easier to edit the configuration files inside the Nutch web application. So, log in as root to the master node and to each of the slave nodes, and create the nutch user and the various file system directories with the following commands:

ssh -l root devcluster01

mkdir /nutch
mkdir /nutch/search
mkdir /nutch/filesystem
mkdir /nutch/local
mkdir /nutch/home

groupadd users
useradd -d /nutch/home -g users nutch
chown -R nutch:users /nutch
passwd nutch (enter a password for the nutch user when prompted)

Again, if you do not have root-level access, you will still need the same user on every machine, since the start-all.sh script expects it. The user does not have to be named nutch, though that is what we use here, and you can put the file system inside that common user's home directory. Basically, you do not have to be root, but it helps.

The start-all.sh script that starts the daemons on the master and slave nodes needs to be able to log in over SSH without a password. For that we have to set up SSH keys on each of the nodes. Since the master node also starts daemons on itself, we need password-less login to the local machine as well.

You may see older tutorials or posts on the user list saying that you need to edit the SSH daemon to set the PermitUserEnvironment option, and that you need to set local environment variables through an environment file for SSH logins. This is no longer necessary. We no longer need to edit the SSH daemon, and the environment variables can be set in the hadoop-env.sh file. Open the hadoop-env.sh file with vi:

cd /nutch/search/conf
vi hadoop-env.sh

The following are the environment variables that need to be changed in the hadoop-env.sh file:
export HADOOP_HOME=/nutch/search
export JAVA_HOME=/usr/java/jdk1.5.0_06
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

There are other variables in this file that affect the behavior of Hadoop. If you start getting SSH errors when running the scripts, try changing the HADOOP_SSH_OPTS variable. Note also that after the initial copy, you can set HADOOP_MASTER in conf/hadoop-env.sh to have changes synced from the master to each slave machine; there is a section below that explains how to set this up.
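For example, if your SSH client needs extra options, you could set HADOOP_SSH_OPTS in hadoop-env.sh to something like the line below. The particular OpenSSH options shown are only an illustration, not required values:

# illustrative only: fail fast on unreachable hosts and never prompt for a password
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"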

Next we create the SSH keys on the master node and copy the public key to each of the slave nodes. This must be done as the nutch user we created earlier. Do not just su to the nutch user; start a new shell and log in as the nutch user directly. If you su, the password-less login may appear to fail when you test it, but it will work correctly when you log in fresh as the nutch user.

cd /nutch/home

ssh-keygen -t rsa (use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

On the master node, copy the public key you just created into a file named authorized_keys in the same directory:

cd /nutch/home/.ssh
cp id_rsa.pub authorized_keys

You only need to run ssh-keygen on the master node. On each slave node, once the file system has been created, you just copy the key over with scp:

scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys

You will have to enter the nutch user's password this one time. The first time you log in to each machine, an SSH prompt will also appear asking whether you want to add the machine to the known hosts; answer yes. Once the key is copied, you will no longer need to enter a password when logging in as the nutch user. Test it by logging in to the slave node you just copied the key to:

ssh devcluster02
nutch@devcluster02$ (a command prompt should appear without asking for a password)
hostname (should return the name of the slave node, here devcluster02)

Once the SSH keys are in place, we are ready to start deploying Nutch to all of the slave nodes.
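If you have several slave nodes, a small loop saves some typing when copying the key. This is just a convenience sketch using the host names from this tutorial; adjust the list to match your machines:

# copy the master's authorized_keys to every slave (you will be prompted for the nutch password on each)
for slave in devcluster02 devcluster03 devcluster04 devcluster05 devcluster06; do
  scp /nutch/home/.ssh/authorized_keys nutch@$slave:/nutch/home/.ssh/authorized_keys
done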

Deploy Nutch to a Single Machine

First we will deploy Nutch to a single node, the master node, but run it in distributed mode. This means it will use the Hadoop file system instead of the local file system. We start with a single node to make sure everything is up and running before moving on to the other slave nodes. Everything below is done as the nutch user. We set Nutch up on the master node and, when we are ready, copy the entire installation to each of the slave nodes. First, copy the Nutch build into the deployment directory with a command similar to the following:

cp -R /path/to/build/* /nutch/search

Then make sure all the shell scripts are in UNIX format and are executable.

dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
chmod 700 /nutch/search/bin/*.sh /nutch/search/bin/hadoop /nutch/search/bin/nutch
dos2unix /nutch/search/conf/*.sh
chmod 700 /nutch/search/conf/*.sh

When we first tried to install Nutch, we got bad interpreter and command not found errors because the scripts were in DOS format and not executable. Note that we do this for both the bin directory and the conf directory. The conf directory holds the hadoop-env.sh file, which is called by the other scripts.

There are a few scripts you should be aware of. The bin directory contains the nutch script, the hadoop script, the start-all.sh script, and the stop-all.sh script. The nutch script is used to do things such as start a Nutch crawl. The hadoop script lets you interact with the Hadoop file system. The start-all.sh script starts all of the servers on the master and slave nodes, and stop-all.sh stops them all.

If you want to see the options for Nutch, use the following command:

bin/nutch

Or, if you want to see the options for Hadoop, use:

bin/hadoop

If you want to see the options for a specific component, such as the distributed file system, pass the component name as an argument, like this:

bin/hadoop dfs

There are also a few configuration files you should be aware of. In the conf directory are nutch-default.xml, nutch-site.xml, hadoop-default.xml, and hadoop-site.xml. The nutch-default.xml file holds all of the default options for Nutch, and hadoop-default.xml holds all of the default options for Hadoop. To override any of these options, copy the property into the corresponding *-site.xml file and change its value there. Below I give an example hadoop-site.xml file, and later an example nutch-site.xml file.

There is also a file named slaves in the conf directory. That is where we put the names of the slave nodes. Since we run a data node on the same machine as the master node, we also need the local machine in the slave list. Here is what the slaves file looks like when you start out:

localhost

It should already look like this, so you should not need to make any changes yet. Later we will add all of the slave nodes to this file, one node per line. Below is an example hadoop-site.xml file.

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--put site-specific property overrides in this file. -->

<configuration>

<property>
<name>fs.default.name</name>
<value>devcluster01:9000</value>
<description>
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
</description>
</property>

<property>
<name>mapred.job.tracker</name>
<value>devcluster01:9001</value>
<description>
The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and
reduce task.
</description>
</property>

<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>

<property>
<name>dfs.name.dir</name>
<value>/nutch/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/nutch/filesystem/data</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/nutch/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/nutch/filesystem/mapreduce/local</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

Nutch uses the fs.default.name property to determine which file system to use. Since we are using the Hadoop file system, we point it at the Hadoop master, that is, the name node. In our case the name node runs on the server devcluster01, port 9000.

The Hadoop package really provides two components: the distributed file system and MapReduce. The distributed file system lets you store and replicate files across many commodity machines, and MapReduce lets you easily perform parallel programming tasks.

The distributed file system has name nodes and data nodes. When a client wants to use a file in the file system, it contacts the name node, which tells it which data nodes to contact to get the file. The name node is the coordinator: it stores which blocks (not whole files, though you can think of them that way for now) are on which machines, and which blocks need to be replicated to other data nodes. The data nodes do the heavy lifting; their job is to store the actual file blocks, serve them on request, and so on. So even if you run a name node and a data node on the same machine, they still communicate over sockets just as if the data node were on a different machine.

I am not going to go into how MapReduce works here; that is the topic of another tutorial, and when I understand it better I will write one. For now, just know that a MapReduce programming task is split into map operations (a -> b,c,d) and reduce operations (list -> a). Once a problem has been decomposed into map and reduce operations, multiple map operations and multiple reduce operations can be distributed to run concurrently on different servers. So instead of handing a file to a file system node, we hand a processing task to a node, which processes it and returns the result to the master node. The coordinating server for MapReduce is called the job tracker, and each node that executes processing runs a daemon called the task tracker, which connects to the job tracker.

The file system and MapReduce nodes communicate with their master nodes through a continuous heartbeat every 5-10 seconds or so. If the heartbeat stops, the master assumes that the node is down and does not use it for subsequent operations.

The mapred.job.tracker property specifies the host and port of the MapReduce job tracker, so I assume it is possible to run the name node and the job tracker on different machines. However, I have not done anything to verify this.

The mapred.map.tasks and mapred.reduce.tasks properties set the number of tasks to run in parallel. This should be a multiple of the number of machines you have. In our case, since we are starting with a single machine, we use 2 map and 2 reduce tasks. Later, as we add more nodes, we will increase these values.

The dfs.name.dir property is the directory the name node uses to store tracking and coordination information for the data nodes.

The dfs.data.dir property is the directory the data nodes use to store the actual file system data blocks. Keep in mind that this is expected to be the same on every node.

The mapred.system.dir property is the directory the MapReduce tracker uses to store its data. This is only on the tracker, not on the MapReduce slave hosts.

The mapred.local.dir property is the directory on each node where MapReduce stores its local data. I have found that MapReduce uses quite a bit of local space to do its jobs (on the order of gigabytes), though that may just be how my servers are configured. I have also found that the intermediate files produced by MapReduce do not always seem to be deleted when a task exits; again, that may be my configuration. This property is also expected to be the same on every node.

The dfs.replication property sets how many servers a file should be replicated to before it becomes available. Since we are currently using only a single server, we set this to 1. If you set this value higher than the number of data nodes available, you will start seeing lots of (Zero targets found, forbidden1.size = 1) type errors in the logs. As we add more nodes, we will increase this value.

Before you start the Hadoop servers, make sure you format the distributed file system for the name node:

bin/hadoop namenode -format

Now that we have configured hadoop-site.xml and the slaves file, it is time to start Hadoop on a single node and verify that it is working properly. To start all of the Hadoop servers (name node, data node, job tracker, task tracker) on the local machine, use the following commands as the nutch user:

cd /nutch/search
bin/start-all.sh

To stop all servers, you can use the following command:

bin/stop-all.sh

If everything was installed correctly, you should see output saying that the name node, data node, job tracker, and task tracker services have started. If you see that, the file system is ready to test. You can also look at the log files under /nutch/search/logs to see output from each of the daemons we just started.
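For example, you can watch a daemon's output as it runs by tailing its log. The exact file names depend on your user and host names; the name below is an assumption based on Hadoop's usual hadoop-<user>-<daemon>-<host>.log naming, so check what ls actually shows:

cd /nutch/search/logs
ls
tail -f hadoop-nutch-namenode-devcluster01.log (assumed name; use whatever ls lists)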

To test the file system, we will create a list of URLs that we will later use for the crawl. Run the following commands:

cd /nutch/search
mkdir urls
vi urls/urllist.txt

http://lucene.apache.org

You should now have a urls/urllist.txt file with a single line pointing to the Apache Lucene site. Now we add that directory to the file system; the Nutch crawl will later use this file as its list of URLs to crawl. To add the urls directory to the file system, run the following commands:

cd /nutch/search
bin/hadoop dfs -put urls urls

You should see output confirming that the directory was added to the file system. You can also verify that it was added by using the ls command:

cd /nutch/search
bin/hadoop dfs -ls

Interestingly, the distributed file system is user specific. If you store the urls directory as the nutch user, it is actually stored under /user/nutch/urls. This means that the user who performs the crawl and stores it in the distributed file system must also be the user who starts the search, or no results will come back. You can try this yourself by logging in as a different user and running the ls command shown above. It will not find the directory, because it is looking under /user/username rather than /user/nutch.

If everything works well, you can add other nodes and start crawling.

Deploy Nutch to Multiple Machines

Once you have the single node up and running, we can copy the configuration to the other nodes and set up the slaves file so those nodes are started too. First, if the servers are still running on the local node, use the stop-all.sh script to stop them.

Run the following commands to copy the installation to another machine. If you have followed the directory layout above, this should just work:

cd /nutch/search
scp -r /nutch/search/* nutch@computer:/nutch/search

Do this for every machine you want to use as a slave node. Then edit the slaves file, adding each slave node name to the file, one per line, as in the sketch below. You will also want to edit the hadoop-site.xml file and change the number of map and reduce tasks to a multiple of the number of machines you have; for our system with 6 data nodes I set the number of tasks to 32. The replication property can be changed at this point as well; a good starting value is 2 or 3. (See the note toward the bottom: you may also have to clean out the file system on the new data nodes.) Once that is done, you should be able to start all of the nodes.
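For example, with the machines used in this tutorial, the conf/slaves file would end up looking something like this (localhost stays in the list because the master node also runs a data node):

localhost
devcluster02
devcluster03
devcluster04
devcluster05
devcluster06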

Just as before, we use the same commands to start all of the nodes:

cd /nutch/search
bin/start-all.sh

A command like 'bin/slaves.sh uptime' is a good way to verify that the configuration is correct before invoking the start-all.sh script.

The first time you start all of the nodes, you may get SSH prompts asking whether you want to add the hosts to the known_hosts file. You must type yes for each one and press Enter. The output looks a bit odd the first time, but just keep entering yes and pressing Enter if the prompts keep appearing. You should see output showing all of the servers starting on the local machine and the task tracker and data node servers starting on the slave nodes. Once this is done, we are ready to start our crawl.

Perform the Nutch Crawl

Now that the distributed file system is up and running, we can start our Nutch crawl. In this tutorial we only crawl a single site. I am more concerned with demonstrating the distributed file system and MapReduce setup than with the finer points of Nutch crawling.

To make sure we crawl only a single site, we edit the crawl-urlfilter.txt file and set the filter to accept only lucene.apache.org:

cd /nutch/search
vi conf/crawl-urlfilter.txt

Change "line" that reads: +^http://([a-z0-9]*\.) *my. domain.name/
To read: +^http://([a-z0-9]*\.) *apache.org/

We have added our URL list to the distributed file system and edited the URL filter, so now it is time to start the crawl. Start the Nutch crawl with the following commands:

cd /nutch/search
bin/nutch crawl urls -dir crawled -depth 3

We are using the nutch crawl command. The urls argument is the URL directory we added to the distributed file system. The -dir crawled option sets the output directory, which also goes into the distributed file system. The depth is 3, meaning the crawl will only go 3 links deep. There are other options you can specify; see the command documentation, and the sketch below, for those options.
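For reference, a crawl with a few more of those options spelled out might look like the sketch below. The -topN and -threads values here are arbitrary examples, and option support can vary between revisions, so check bin/nutch crawl with no arguments for the usage on your build:

bin/nutch crawl urls -dir crawled -depth 3 -topN 1000 -threads 10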

You should see the crawl start, with output for the running jobs and the map and reduce percentages. You can also track the jobs by pointing your browser at the master node:

http://devcluster01:50030

You can also open a new terminal to a slave node and tail its log files to see detailed output for that node. The crawl may take a while to complete. When it is finished, we are ready to search.

Perform a Search

To search the index we just built, we need to do two things. First, we need to pull the index onto a local file system, and second, we need to set up and configure the Nutch web application. Although it is technically possible to search against the distributed file system, it is unwise.

DFS is where the results of the MapReduce processes, including the completed index, are stored, but searching against it is too slow. In a production system you would use MapReduce to build the index and store the results on DFS, then copy the index to a local file system for searching. If the index is very large (say a 100 million page index), you would split it into slices of 1-2 million pages each, copy the slices from DFS to local file systems, and have multiple search servers read from those local index slices. A full distributed search setup is really the subject of another tutorial; for now, just realize that you do not want to search against DFS, you want to search against a local file system.

Once the index has been created on DFS, you can use the hadoop copyToLocal command to move it to the local file system, like this:

bin/hadoop dfs -copyToLocal crawled /d01/local/

Your crawled directory should contain an index directory holding the actual index files. If instead it holds an indexes directory with folders such as part-xxxxx inside it, you can use the nutch merge command to merge the partial indexes into a single index. When the search site is pointed at the local file system, it looks for a directory containing either an index folder with a merged index or an indexes folder with partial indexes. This can be the tricky part: the search site will come up fine, but if it cannot find the index, every search will return empty.
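If you do end up with an indexes directory of part-xxxxx folders, a merge might look something like the sketch below. The argument order (merged output index first, then the directory of partial indexes) is an assumption; run bin/nutch merge with no arguments to confirm the usage on your revision:

bin/nutch merge /d01/local/crawled/index /d01/local/crawled/indexes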

If you installed a Tomcat server as described earlier, you should have Tomcat under /nutch/tomcat, with a folder named ROOT in the webapps directory containing the unpacked Nutch web application. Now we just need to configure that application for searching. We do this by editing the nutch-site.xml file in the WEB-INF/classes directory. Use the following commands:

cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
vi nutch-site.xml

The following is a template nutch-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>fs.default.name</name>
<value>local</value>
</property>

<property>
<name>searcher.dir</name>
<value>/d01/local/crawled</value>
</property>

</configuration>

The fs.default.name property is set to local so that the search uses the local index. In other words, we are not using DFS or MapReduce for searching; everything happens on the local machine.

The searcher.dir property is the directory on the local file system where the index and the supporting databases are stored. In our crawl command we used the crawled output directory, which stored the crawl results on DFS. We then copied that crawled folder to the /d01/local directory of our local file system, so we point this property at /d01/local/crawled. The directory it points to should contain not just the index but also the link database, segments, and so on; all of these databases are used by the search. That is why we copy the whole crawled directory and not just the index directory.
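After the copy, the local crawled directory typically contains the index plus the supporting databases. The layout below is a sketch of what a 0.8-era crawl usually produces; your directory names may differ slightly, so verify with ls:

/d01/local/crawled
/crawldb
/linkdb
/segments
/indexes (partial indexes, if present)
/index (merged index)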

Once the nutch-site.xml file has been edited, the application is ready to run. Start Tomcat with the following commands:

cd /nutch/tomcat
bin/startup.sh

Then point a browser at http://devcluster01:8080 (your search server) to see the Nutch search web application. If everything is configured properly, you should be able to enter queries and get results. If the site works but you get no results, it is probably because the index directory was not found. The searcher.dir property must point at the parent of the index directory, and that parent must also contain the segments, link database, and crawl database folders. The index folder must either be named index and contain a merged index (meaning the index files sit directly in the index directory, not inside subdirectories such as part-xxxxx), or be named indexes and contain part-xxxxx subdirectories that each hold index files. I prefer merged indexes to partial indexes.

Distributed Search

Although it is not really the subject of this tutorial, distributed search deserves some attention. In a production system you use DFS and MapReduce to create the index and the corresponding databases (crawldb and so on), but you search them from local file systems on dedicated search servers for speed and to avoid network overhead.

Here is a brief description of how to set up distributed search. In the nutch-site.xml file under Tomcat's WEB-INF/classes directory, point the searcher.dir property at a directory containing a search-servers.txt file. The search-servers.txt file looks like this:

devcluster01 1234
devcluster01 5678
devcluster02 9101

Each line contains a machine name and a port identifying a search server. This tells the web application to connect to search servers at those addresses.

On each search server, because we are searching local directories, you need to make sure the file system in its nutch-site.xml file points to local. One problem I ran into was using the same Nutch installation to run the distributed search server that I was already using to run DFS and MapReduce as a slave node. The trouble is that when the distributed search server starts, it goes looking for its files in DFS. It could not find them, and all I got was a log message saying server X has 0 slices.

I found it easiest to create a second Nutch installation in a separate folder and start the distributed search server from that installation. I just used the default configuration, with nothing set in nutch-site.xml or hadoop-site.xml. The default file system is local, so the distributed search server finds the files it needs on the local machine.

However you do it, if your index is on the local file system, the configuration needs to say so, as shown below. This is usually set in the hadoop-site.xml file.

<property>
<name>fs.default.name</name>
<value>local</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for DFS.</description>
</property>

On each search server, you then start a distributed search server using the nutch server command, like this:

bin/nutch server 1234 /d01/local/crawled

The port you start the server on must match the port you entered in the search-servers.txt file, and the path must be the parent of the local index folder. Once the distributed search servers are started on each machine, you can start the web site. Searching then works as usual, except that the results come from the distributed search server indexes. In the search site's logs (usually the catalina.out file) you should see messages telling you how many servers and index segments the site is connected to and searching; this lets you know whether the setup is correct. There is no command to shut down a distributed search server process; you have to kill it manually. The good news is that the web site keeps polling the servers listed in its search-servers.txt file to see whether they are alive, so you can take down a single distributed search server, swap out its index, bring it back up, and the site will automatically reconnect. The search as a whole never goes down; only parts of the index are ever offline.

In a production environment, searching is what consumes the most machines and power. The reason is that once an index grows beyond about 2 million pages, searches spend too much time reading from disk, so you cannot serve a 100 million page index from a single machine no matter how big its hard disk is. Fortunately, with distributed search you can have multiple dedicated search servers, each holding its own slice of the index and searching in parallel. This allows very large index systems to be searched efficiently.

A 100 million page system would require approximately 50 dedicated search servers to serve more than 20 queries per second. One way to avoid needing that many machines is to use multi-processor, multi-disk machines that run multiple search servers, each using a separate disk and index. Going this route can cut machine costs by as much as 50% and power costs by up to 75%. A multi-disk machine cannot handle as many queries per second as a dedicated single-disk machine, but it can handle a much larger number of index pages, so on average it is more efficient.
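As a sketch of that multi-disk approach, you might run one search server per disk on its own port, each reading its own index copy, and list each one in search-servers.txt. The ports and the /d02 path here are only illustrative:

bin/nutch server 1234 /d01/local/crawled &
bin/nutch server 1235 /d02/local/crawled &

with matching lines in search-servers.txt:

devcluster02 1234
devcluster02 1235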

Sync Code to the Slave Nodes

Nutch and Hadoop can synchronize changes from the master node out to the slave nodes. This is optional, however, because it slows down server startup, and you may not always want changes synchronized to the slaves automatically.

If you do want to enable this feature, here is how to configure your servers to sync from the master node. There are a few things to know beforehand. One, even though the slave nodes can sync from the master node, the base installation still has to be copied to each slave once so the sync scripts are available; we already did that above, so nothing more is needed. Two, the sync works by having the master node ssh into each slave node and call bin/hadoop-daemon.sh, and the script on the slave node then calls rsync back against the master node. This means that every slave node must be able to do a password-less login to the master node; earlier we set up password-less login in the other direction, so now we need to do the reverse. Three, if you have problems with the rsync options (I did, and had to change them because I was running an older version of SSH), look at the rsync call around line 82 of the bin/hadoop-daemon.sh script.

The first thing to do is set the Hadoop master variable in the conf/hadoop-env.sh file. Change the variable as follows:

export HADOOP_MASTER=devcluster01:/nutch/search

This file then needs to be copied to all of the slave nodes:

scp /nutch/search/conf/hadoop-env.sh nutch@devcluster02:/nutch/search/conf/hadoop-env.sh
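To push the file to every slave in one pass, a loop like the one below works. It assumes the host names used in this tutorial; adjust the list for your own machines:

for slave in devcluster02 devcluster03 devcluster04 devcluster05 devcluster06; do
  scp /nutch/search/conf/hadoop-env.sh nutch@$slave:/nutch/search/conf/hadoop-env.sh
done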

Finally, you need to log in to each slave node, create a default SSH key for that machine, and copy the public key back to the master node, where you append it to the /nutch/home/.ssh/authorized_keys file. Run the following on each slave node; when you copy the key file back to the master node, be sure to change the file name to that node's name so you do not overwrite other files:

ssh -l nutch devcluster02
cd /nutch/home/.ssh

ssh-keygen -t rsa (use empty responses for each prompt)
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /nutch/home/.ssh/id_rsa.
Your public key has been saved in /nutch/home/.ssh/id_rsa.pub.
The key fingerprint is:
a6:5c:c3:eb:18:94:0b:06:a1:a6:29:58:fa:80:0a:bc nutch@localhost

scp id_rsa.pub nutch@devcluster01:/nutch/home/devcluster02.pub

Once you have done this for each slave node, append the key files to the authorized_keys file on the master node:

cd /nutch/home
cat devcluster*.pub >> .ssh/authorized_keys

Once this is set up, whenever bin/start-all.sh is run, the code and scripts are synced from the master node to every slave node.

Conclusion

I know this is a long tutorial, but hopefully it has made you familiar with Nutch and Hadoop. They are complex applications, and as you have seen, setting them up is not necessarily an easy task. I hope this document helps you.

If you have any comments or suggestions, feel free to email me at nutch-dev@dragonflymc.com. If you have questions about Nutch or Hadoop themselves, they each have their own mailing lists. There are also a variety of resources available for using and developing Nutch and Hadoop.

Update: I no longer use rsync to sync code to the servers; I now use expect scripts and Python scripts to manage and automate the system. For distributed search I use 1-2 million pages per index slice. We now run multi-processor, multi-disk servers (4 disks per machine) hosting multiple search servers (one per disk) to reduce cost and power requirements. A server holding 8 million pages can handle about 10 queries per second.