How to use the Nutch search engine


As a new open-source web search engine, Nutch provides an alternative to commercial search engines. Individuals and enterprises can use Nutch to build a search platform tailored to their own needs and provide a search service that suits them, instead of passively accepting the various constraints of commercial search engines.

Nutch is based on Lucene: Lucene provides the text indexing and search APIs that Nutch uses. If you do not need to crawl data, use Lucene directly. A common scenario: you already have a data source and need to provide a search page for it; in that case, the best approach is to read the data directly from the database and build an index with the Lucene API. Nutch is suitable when the data cannot be obtained directly from a database, such as websites, or when the data sources are scattered.

The workflow of Nutch can be divided into two major parts: crawling and searching. The crawler fetches pages and builds an inverted index of the fetched data; the searcher then queries the inverted index to answer user requests. The index is the link between the two. The entire Nutch workflow is described below.

 

First, create an empty URL database and add the starting root URLs to it (step 1). Based on the URL database, generate a fetchlist in a newly created segment, which stores the URLs to be crawled (step 2). Fetch and download the corresponding web page content from the Internet according to the fetchlist (step 3). Parse the fetched content into text and data (step 4), extract the newly discovered web page URLs, and update the URL database (step 5). Repeat steps 2-5 until the specified crawl depth is reached. This is the entire crawling process of Nutch, which can be described as a loop: generate, fetch, update, and repeat.

After crawling completes, the fetched web pages are inverted-indexed, duplicate content and URLs are removed, and the multiple indexes are merged into a unified index database for searching. You can then submit a search request through the Nutch user interface provided by the Tomcat container; Lucene queries the index database and returns the results to the user, completing the search process.

1. Crawling an enterprise intranet

The enterprise intranet crawling (intranet crawling) method is suitable for scenarios with a small number of web servers and up to about one million web pages. It uses the crawl command to crawl the network. Before crawling, you need to perform a series of configurations for Nutch. The process is as follows:

1. Data capture: create a directory and a file containing the starting root URLs. We take crawling the Sohu website (http://www.sohu.com) as an example.

Create the URL list file

Create a urls folder and, inside it, create a file named url.txt containing the starting URL: http://www.sohu.com/.

Depending on the actual situation of the website being crawled, you can add other URLs to the end of the file or add other files containing URLs to the urls directory.
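For example, in a Linux or Cygwin shell the directory and file could be created as follows (a minimal sketch; the extra URL http://news.sohu.com/ is only illustrative):

mkdir urls
echo "http://www.sohu.com/" > urls/url.txt
# optionally append more starting URLs
echo "http://news.sohu.com/" >> urls/url.txt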

Modify the conf/crawl-urlfilter.txt file

The file conf/crawl-urlfilter.txt is mainly used to restrict the format of the URLs to be crawled; the URL patterns are described with regular expressions. Replace MY.DOMAIN.NAME with the domain name to be crawled and remove the preceding comment marker. For this article, the replaced line is:

+^http://([a-z0-9]*\.)*sohu.com/

This configuration file can also specify additional rules, such as the following, which lists file types that should not be fetched:

# skip image and other suffixes we can't yet parse
-\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|bmp|rar|js)$
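Putting the rules together, a minimal crawl-urlfilter.txt for this example might look like the sketch below. Rule order matters, since the first matching rule decides whether a URL is accepted, and the final "-." rejects everything not explicitly accepted; the exact default rules shipped with your Nutch version may differ.

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|jpg|png|ico|css|zip|exe|rar|js)$
# accept hosts in the sohu.com domain
+^http://([a-z0-9]*\.)*sohu.com/
# skip everything else
-.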
Modify the file conf/nutch-site.xml
Note: this file must be modified before crawling; otherwise, pages cannot be fetched.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>sohu.com</value>
    <description>sohu.com</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value></value>
    <description></description>
  </property>
</configuration>
Start crawling

After the Nutch configuration is complete, run the crawl command to start crawling:

# bin/nutch crawl urls -dir crawl -depth 5 -threads 4 -topN 1000

The meanings of parameters in the command line are as follows:

-dir specifies the directory where the crawl results are stored; here it is crawl;
-depth specifies the crawl depth starting from the root URLs; in this example it is set to 5;
-topN sets the maximum number of URLs fetched at each level; in this example it is set to 1000.

In addition, crawl has one more parameter, -threads, which sets the number of concurrent fetch threads. During crawling, you can follow the progress through the Nutch log. To save the log into the logs directory, append the following redirection to the command:

>& logs/crawl.log

For example:

bin/nutch crawl urls -dir crawl -depth 2 -threads 4 -topN 1000 >& logs/crawl.log
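While the crawl is running, one way to follow its progress (a sketch, using the log file produced by the redirection above) is:

tail -f logs/crawl.log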

Directories generated after execution:

    • crawldb and linkdb are the web link directories; they store the URLs and the link relationships between them, and they serve as the basis for crawling and re-crawling. By default, a page expires after 30 days.
    • segments is the main directory that stores the fetched web pages; the page content is stored both as raw byte[] content and as parsed text. Nutch crawls breadth-first, so each round of crawling generates a new segment directory.
    • index is the Lucene index directory, the complete index produced by merging all the indexes in the indexes directory. Note that the index only indexes the page content and does not store it, so you must access the segments directory to obtain the page content itself.
Run bin/nutch readdb crawl/crawldb -stats to view crawl statistics.
Run bin/nutch org.apache.nutch.searcher.NutchBean <keyword> to perform a simple search test.

2. Project deployment: copy nutch-1.2.war from the nutch-1.2 directory to Tomcat's webapps folder and start Tomcat; nutch-1.2.war will be automatically unpacked into a folder of the same name, nutch-1.2.
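As a sketch, assuming Tomcat is installed under $TOMCAT_HOME (adjust the paths to your own environment), the deployment step amounts to:

cp nutch-1.2.war $TOMCAT_HOME/webapps/
$TOMCAT_HOME/bin/startup.sh
# Tomcat unpacks the war into $TOMCAT_HOME/webapps/nutch-1.2/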

Configure nutch-1.2/WEB-INF/classes/nutch-site.xml

The modification is as follows:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>E:\cygwin\nutch-1.2\crawl</value>
  </property>
</configuration>

Note: E:\cygwin\nutch-1.2\crawl is the path where the crawl data was stored earlier.

Fixing garbled Chinese characters

Configure server.xml in Tomcat's conf folder.

Modify as follows:

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
Modify the page elements to prevent garbled Chinese characters

In addition, under webapps\nutch\zh\include you can create header.jsp as a copy of header.html, adding at the top:

<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>

In webapps\nutch\search.jsp, locate the header include and modify it as follows:

<% String pathl = language + "/include/header.jsp"; System.out.println(pathl); %>
<jsp:include page="<%= pathl %>"/>

3. Crawl directory analysis

A total of five folders are generated:

crawldb: stores the downloaded URLs and their download dates; it is used to check when pages need to be updated (re-fetched).

linkdb: stores the link relationships between URLs, obtained by analysis after downloading completes.

segments: stores the fetched pages. The number of subdirectories depends on the crawl depth; in general, each crawl round gets its own subdirectory, named with a timestamp such as 20101222185215 for easy management. Each subdirectory contains the following six sub-folders:

    • content: the raw content of each downloaded page.
    • crawl_fetch: the fetch status of each URL.
    • crawl_generate: the set of URLs to be fetched.
    • crawl_parse: the outlink data used to update crawldb.
    • parse_data: the outlinks and metadata parsed from each URL.
    • parse_text: the parsed text content of each URL.

indexes: stores the independent index created for each fetch round.

index: the Lucene index directory, the complete index produced by merging all the indexes in the indexes directory. Note that the index only indexes the page content and does not store it, so you must access the segments directory to obtain the page content itself.
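Besides readdb -stats mentioned earlier, Nutch provides reader tools that can be used to inspect these directories; a sketch, assuming the crawl directory used above:

# list the segments and their statistics
bin/nutch readseg -list -dir crawl/segments
# dump the link structure stored in linkdb into a text directory named linkdump
bin/nutch readlinkdb crawl/linkdb -dump linkdump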

2. Crawling the entire Internet

Crawling the entire Internet (whole-web crawling) is large-scale crawling. It offers finer control than the first method, using low-level commands such as inject, generate, fetch, and updatedb. The crawl volume is large and may take several machines weeks to complete.

1. Glossary:

Web database (webdb): the pages known to Nutch and the links contained in those pages. (The injector adds pages to the web database, for example from DMOZ; DMOZ, the Open Directory Project/ODP, is a set of manually edited and managed directories that provide results or data for search engines.) The content stored in webdb includes the URL, an MD5 digest of the content, the outlinks, the number of links to the page, and fetch information used to decide whether the content should be re-crawled; a page score indicates the importance of the page.

Segment set: a set of pages that are crawled and indexed as a single unit. It includes the following:

The fetchlist: the set of names of these pages.
The fetcher output: the set of fetched page files.
The index: the index output in Lucene format.

2. Fetch data and create the web database and segments

First, download a file that contains a large number of URLs. After the download completes, copy it to the Nutch home directory and decompress it.

Download and unzip the package in Linux:
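A sketch of the download and decompression steps, assuming the DMOZ dump published under http://rdf.dmoz.org/rdf/:

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz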

content.rdf.u8 contains about three million URLs. Here, only a random subset of the URLs is extracted for crawling, using the -subset 50000 option shown below. As in the first method, you must first create the file containing the starting root URLs and its parent directory.

# mkdir urls
# bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 50000 > urls/urllist

Add these URLs to crawldb using Nutch's inject command. Here, the directory crawl is the root directory where the crawl data is stored.

# bin/nutch inject crawl/crawldb urls

Then edit the file conf/nutch-site.xml; the content and method are similar to those described under "Crawling an enterprise intranet" and are skipped here. Next, start crawling. The whole crawl can be written as a shell script that performs the generate → fetch → update cycle; you only need to run this script for each crawl, and you can run it repeatedly as needed. A sketch of such a script is given below.
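The following is a minimal sketch of such a script, built only from the low-level commands shown elsewhere in this article (generate, fetch, updatedb); the crawl directory, the number of rounds, and the topN value are illustrative assumptions:

#!/bin/bash
# Sketch of the generate -> fetch -> update loop.
# Assumes the starting URLs have already been injected into crawl/crawldb.
depth=3        # number of generate/fetch/update rounds (illustrative)
topN=1000      # fetch at most topN URLs per round (illustrative)
for ((i=1; i<=depth; i++)); do
  echo "=== crawl round $i ==="
  # generate a fetchlist in a new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN $topN
  # pick the newest segment directory
  segment=`ls -d crawl/segments/2* | tail -1`
  # fetch the pages listed in the fetchlist
  bin/nutch fetch $segment
  # update crawldb with the newly discovered URLs
  bin/nutch updatedb crawl/crawldb $segment
done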

Finally, build the index. After crawling, you need to index the fetched content so that it can be searched and queried. The process is as follows:

# Invert all the links
# bin/nutch invertlinks crawl/linkdb crawl/segments/*
# Create the index
# bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

3. Data deployment and query: after the index is created, it can be deployed and queried in the same way as in the intranet case; the process is not repeated here.

 

 

Notes on crawling manually with Nutch commands

Recently, I have been studying Nutch and found information about crawling the whole web using its low-level commands.

First obtain the URL set. For testing, use the content.example.txt file from the http://rdf.dmoz.org/rdf/ directory, and create a folder named dmoz:

Command: bin/nutch org.apache.nutch.tools.DmozParser content.example.txt > dmoz/urls

Inject these URLs into the crawldb database:

Command: bin/nutch inject crawl/crawldb dmoz

Generate a fetchlist:

Command: bin/nutch generate crawl/crawldb crawl/segments

Save the newest directory under segments in the variable s1 for later use:

Command: s1=`ls -d crawl/segments/2* | tail -1`

Command: echo $s1

Note: the character used here is not a single quotation mark but the backtick (`), the key at the upper left of the keyboard shared with ~.
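Equivalently, the more readable $( ) form of shell command substitution can be used, for example:

Command: s1=$(ls -d crawl/segments/2* | tail -1)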

Run the fetcher to fetch the pages for these URLs:

Command: bin/nutch fetch $s1

Update the database, saving the fetched page information into crawldb:

Command: bin/nutch updatedb crawl/crawldb $s1

This completes the first round of fetching.

Next, select the top-10 scored URLs for the second and third rounds of fetching:

Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10

Command: s2=`ls -d crawl/segments/2* | tail -1`

Command: echo $s2

Command: bin/nutch fetch $s2

Command: bin/nutch updatedb crawl/crawldb $s2

Command: bin/nutch generate crawl/crawldb crawl/segments -topN 10

Command: s3=`ls -d crawl/segments/2* | tail -1`

Command: echo $s3

Command: bin/nutch fetch $s3

Command: bin/nutch updatedb crawl/crawldb $s3

Update the linkdb database based on the segments:

Command: bin/nutch invertlinks crawl/linkdb crawl/segments/*

Index creation:

Command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

You can use this command to query:

Command: bin/nutch org.apache.nutch.searcher.NutchBean FAQ (here, FAQ is the keyword to be searched).
