I. Background
Recently, due to the needs of a project and a paper, I had to set up a vertical search environment. After consulting a lot of material I settled on Apache's stack: hadoop+hbase+nutch+es. I won't spend time introducing what these tools do; the various encyclopedias cover that well enough. I chose this scheme mainly for the following reasons:
1. Extensibility: although this is only an experimental environment, in the project it will be applied to production. As the data volume grows, new hardware can easily be added, so I chose the most popular distributed combination, hadoop+hbase.
2. Data-source compatibility: Nutch 2.x integrates Gora and Tika, which makes data ORM and parsing convenient.
3. Keeping up with the times: ES is very hot right now, and various benchmarks claim ES is faster and more stable than Solr. I have not tested this myself, but following in the footsteps of the GitHub gurus can't be too far wrong.
II. Preface
This part is pure ranting. Many domestic technical blogs are still stuck at the nutch1.x stage, full of irresponsible copy-and-paste reposts. A few pioneers claimed that the versions of these components must match one-to-one, and then everyone did the same, without the slightest spirit of exploration or questioning. Today I will be the first to eat the crab: who says gora0.3 only works with hbase0.92, and who says nutch2 can only be paired with es0.19? Since the open-source gurus released their latest versions as stable, there must be a way to make them compatible!
III. Installation and Configuration Process (Pseudo-distributed)
The entire experimental environment is built in pseudo-distributed mode, that is, a distributed environment with only one master; to scale out later you only need to keep configuring slaves. The system is Ubuntu Server 12.04.
Hadoop 1.2.1
The prerequisites for installing Hadoop are Java and passwordless SSH login. I won't say much about these since they are basic; either jdk1.6 or 1.7 will do.
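For completeness, a minimal passwordless-SSH setup for the account that will run Hadoop looks roughly like the following (my own sketch, not part of the original steps; skip it if you already have keys in place):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost   # should now log in without prompting for a password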
1. Download the hadoop 1.2.1 deb package from the stable directory on the official website
2. Install the Deb package for Hadoop
sudo dpkg -i /home/hadoop/hadoop_1.2.1-1_x86_64.deb
3. View the location of the installation
whereis hadoop
Output:
hadoop: /usr/bin/hadoop /etc/hadoop /usr/etc/hadoop /usr/bin/X11/hadoop /usr/include/hadoop /usr/share/hadoop
The /etc/hadoop directory contains Hadoop's various configuration files, and /usr/share/hadoop holds Hadoop's main jars and the monitoring pages.
4. Next, modify the configuration files:
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://xieminis.me/configuration.xsl"?>
<configuration>
<!-- File System Properties -->
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/name</value>
<!-- value should be a directory the account running Hadoop can access -->
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
<!-- value should be a directory the account running Hadoop can access -->
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://xieminis.me/configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<!-- in fully distributed mode, replace localhost with the master's intranet IP address; the port is arbitrary, just avoid conflicts -->
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="http://xieminis.me/configuration.xsl"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
hadoop-env.sh
Modify the Java path:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Also modify the PID path to one the account running Hadoop can access. The default is /var/run/hadoop; an account outside the sudo group has no permission there, and the directory is cleared on every reboot, so chown cannot fix it for good.
export HADOOP_PID_DIR=/home/hadoop/run/hadoop
export HADOOP_SECURE_DN_PID_DIR=/home/hadoop/run/hadoop
masters and slaves
For pseudo-distributed mode, both files are simply localhost. For fully distributed mode, put the master's IP address in masters and the slaves' IP addresses in slaves, one per line.
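In this pseudo-distributed setup each file therefore contains a single line (assuming the /etc/hadoop config directory found above):
/etc/hadoop/masters:
localhost
/etc/hadoop/slaves:
localhost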
That completes the basic configuration. If you want to learn more about the configuration options, refer to the article "Hadoop three configuration file parameter meaning description".
5. Start Hadoop
First format the namenode:
hadoop namenode -format
Then start everything:
start-all.sh
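As a quick sanity check that HDFS came up (my own addition, not in the original steps), list the root of the file system:
hadoop fs -ls /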
=============================================
HBase 0.94.11
1. Download the hbase-0.94.11 tar package from the stable directory on the official website and unpack it
tar -zxvf hbase-0.94.11.tar.gz
2. Go into the conf directory and modify hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
<!-- the port number and IP address must match the fs.default.name parameter in the Hadoop configuration -->
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
3. Modify the hbase-env.sh file
Add the following three lines:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle/
export HBASE_CLASSPATH=/etc/hadoop
export HBASE_MANAGES_ZK=true
At this point the configuration file changes are finished (for full distribution, also modify regionservers). For more configuration parameters and tuning, refer to the article "HBase Introduction 3 - HBase configuration file parameter settings and optimization".
4. Replace the Hadoop jar file
hbase0.94.11 supports hadoop1.0.4 by default; we can make it support hadoop1.2.1 by replacing hadoop-core:
rm /home/hadoop/hbase-0.94.11/lib/hadoop-core-1.0.4.jar
cp /usr/share/hadoop/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.94.11/lib
cp /usr/share/hadoop/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.94.11/lib
cp /usr/share/hadoop/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.94.11/lib
5. Start HBase
/home/hadoop/hbase-0.94.11/bin/start-hbase.sh
6. Use the jps command to check whether everything is running normally
The output is:
2032 NameNode
13764 HQuorumPeer
29069 Jps
2630 JobTracker
2280 DataNode
13889 HMaster
2535 SecondaryNameNode
2904 TaskTracker
14180 HRegionServer
Note that none of these processes should be missing; if one is, be sure to check the logs to find out what went wrong.
7. Try to run the HBase command
/home/hadoop/hbase-0.94.11/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011
hbase(main):001:0> list
TABLE
webpage
1 row(s) in 0.5270 seconds
If no errors are reported, the configuration was successful.
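If you want an extra sanity check beyond list, you can create and drop a throwaway table from the shell; this is my own quick test, not part of the original walkthrough:
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'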
==================================================================
ElasticSearch 0.90.5
Unlike most blogs, I do not install Nutch first but ES. Why? Because logically speaking Nutch is a crawler plus integrator, and ES is what Nutch integrates with, so ES should be installed first; this is what I call building from zero up.
1. Download the es 0.90.5 deb package and install it
sudo dpkg -i /home/hadoop/elasticsearch/elasticsearch-0.90.5.deb
2. See what was installed
whereis elasticsearch
Output:
elasticsearch: /etc/elasticsearch /usr/share/elasticsearch
The elasticsearch.yml file in the /etc/elasticsearch directory is the most important configuration file. Here we use the default configuration and do not modify it; anyone who needs special configuration can refer to the article "The distributed search Elasticsearch configuration file in detail".
The /usr/share/elasticsearch directory holds ES's main executables and jar packages.
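As an aside, if you did want to give the cluster a non-default name (which matters for the elasticindex step later), the relevant line in /etc/elasticsearch/elasticsearch.yml would look like the following, where my-search-cluster is just a placeholder name of my own:
cluster.name: my-search-cluster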
3. Check ES running status
ES is started by default after installation. Apparently the only way to stop it is to kill the process; to start it again, just type the elasticsearch command.
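A rough sketch of that, following the description above (the PID shown by ps will of course vary):
ps aux | grep elasticsearch   # find the running process and note its PID
sudo kill <pid>               # stop it
elasticsearch                 # start it again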
Use curl to check the running status of the ES cluster and get the cluster name:
curl -XGET 'localhost:9200/_cluster/health?pretty'
If you get the following output, you are successful.
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
======================================================================
Nutch 2.2.1
1. Download the tar package from the official website and unpack it
2. Modify the source code
Here I have to complain about the Nutch open-source gurus: you released a version with such an obvious bug, and several releases later it still isn't fixed. If you have your reasons, document them properly; why can't I find any official explanation?
Go into the src/java/org/apache/nutch/crawl directory and modify the public Map<String,Object> run(Map<String,Object> args) method in GeneratorJob.java.
Add the following three lines
// generate batchId
int randomSeed = Math.abs(new Random().nextInt());
String batchId = (curTime / 1000) + "-" + randomSeed;
getConf().set(BATCH_ID, batchId);
Otherwise nutch generate will throw a NullPointerException; I really don't understand what they were thinking.
3. Copy the HBase configuration file into Nutch
cp /home/hadoop/hbase-0.94.11/conf/hbase-site.xml /home/hadoop/nutch2.2.1/conf/
4. Copy the hbase0.92 jar package into Nutch's lib directory
This step is the key. Nutch's gora0.3 only supports up to hbase0.92 and defaults to hbase0.90. If you skip this step, Nutch will use the default 0.90 jar to operate the 0.94 HBase, which produces the wonderful error "java.lang.IllegalArgumentException: Not a host:port pair" (said to be a typical symptom of a low-version client talking to a higher-version server). But you cannot simply swap in a 0.94 jar either, because that leads to another wonderful error, "java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HColumnDescriptor.setMaxVersions(I)V". That one is apparently recorded in HBase's official JIRA as bug HBASE-8273: the return type of the setMaxVersions method was changed. Don't these people have any common sense about object-oriented collaborative programming? Couldn't you just have added a new method instead... Rant aside, here is how I solved it: since everyone says support for 0.92 is good, I tested using the 0.92 jar as the replacement. It is only one version away from 0.94, so it should not be too old to operate a 0.94 cluster, and the test did indeed work. Concretely, download the hbase0.92.2 release from the official website and copy the hbase-0.92.2.jar inside it to the /home/hadoop/nutch2.2.1/lib directory.
5. Modify nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36</value>
<!-- fill in anything here; I used my Chrome browser's UA -->
</property>
For a detailed explanation of each parameter in this file you can refer to the Excel file "Nutch configuration".
6. Modify ivy/ivy.xml
The first change is routine: find
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
this line and remove the comment markers around it.
Then comes the miracle-witnessing change, the one that lets nutch2.2.1 support es0.90.5. Find
<dependency org="org.elasticsearch" name="elasticsearch" rev="0.19.4" conf="*->default" />
this line and replace the rev value 0.19.4 with 0.90.5. This is the power of Ivy's dependency management; the moment you run ant is the moment you witness the miracle. If you skip this step, Nutch's elasticindex (indexing) step will throw a MasterNotDiscoveredException.
7. Modify the conf/gora.properties file, adding one line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
8. Run ant to compile. The first ant run will be slow, because Ivy has to download the dependency packages. If you watch the output on the screen carefully, you can see that when the build reaches the Elasticsearch dependency it successfully pulls down the 0.90.5 jar, along with lucene4.4.0. When it finishes you will see a runtime directory under the Nutch directory, with deploy for distributed crawling and local for local crawling. At this point all the installation and configuration is complete. Enjoy it!
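For reference, the exact ant invocation isn't spelled out above; assuming the stock Nutch 2.x build file, whose default target builds the runtime directory, the command to run from the Nutch home directory is simply:
ant runtime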
IV. A Quick Taste of the Crawl and Retrieval Process
1. Create a directory named urls
2. In the urls directory write a seed file, name it url, and put a link-rich page in it, such as http://blog.tianya.cn
3. Put the directory onto Hadoop's HDFS:
hadoop fs -copyFromLocal urls /home/hadoop/urls
4. Run nutch inject to inject the crawl seed pages into HBase:
bin/nutch inject /home/hadoop/urls
After this runs, you can see the table "webpage" in HBase.
5. Run the Nutch crawl cycle by executing the following commands in turn:
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
After these finish, scan the webpage table in HBase; there should already be a few hundred rows of results.
6. Index into Elasticsearch:
bin/nutch elasticindex <cluster name> -all
If the ES configuration file has not been modified, the cluster name here defaults to "elasticsearch".
7. Query with curl:
curl -XGET 'http://localhost:9200/_search?content=tianya'
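If that query comes back empty, a more conventional URI search against the content field (an assumption on my part that Nutch indexed a field by that name) would be:
curl -XGET 'http://localhost:9200/_search?q=content:tianya&pretty'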
If you want to query in Chinese, you can add a Chinese word segmenter; refer to "Distributed Search Elasticsearch Chinese word segmentation integration".
V. Summary
Although this post contains plenty of ranting, I still greatly respect the bloggers who write seriously and the gurus who earnestly answer questions on forums. That I was able to configure and install everything successfully also owes a lot to the inspiration from their blogs and answers, so my thanks to these selfless people. Next I will test the soundness and robustness of this configuration in actual experiments and projects, and future posts will record the problems encountered in use and their solutions.