Nutch is deployed and related issues (Chinese garbled characters, etc.) are fixed.


Introduction
Nutch is an open-source Web search engine that provides high-quality search services.
It is a good full-text search solution for internal systems and small-to-medium websites.
Deployment of nutch
The latest version of nutch can be downloaded from the official nutch website. After downloading, decompress the package and it is ready to use. Because I am using a Windows system, the deployment below is done on Windows.
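Unpacking from the cygwin shell (introduced below) looks like this; the archive name is a sketch following the nutch-0.9 version used later in this article:

    # unpack the downloaded release and enter its directory
    tar -xzf nutch-0.9.tar.gz
    cd nutch-0.9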
Web Crawler settings
Nutch itself contains a crawler, which indexes the target site, and a Web interface for searching. Before the query interface can be used, the crawler must be set up to crawl the target site.
Description of some configuration files:

  • nutch\conf\nutch-default.xml
  • Set http.agent.name. If http.agent.name is empty, the crawler cannot start normally. You can set any name you like, for example Vik-robot.
  • indexer.mergeFactor / indexer.minMergeDocs: change both values to 500. The larger these two parameters, the better the performance, but the more memory consumed; if they are too large, a memory overflow may occur. In actual use, peak memory usage under the current parameters is 3xxM.
  • http.timeout is the maximum time to wait for a response; a connection that does not respond within it is dropped.
  • db.max.outlinks.per.page is the maximum number of outlinks processed on a single page. If you are crawling your own internal system, set it to a large number.
  • For explanations of the remaining options, see the comments in nutch-default.xml itself.
  • Create the file nutch\urls and put the crawler's starting URL(s) in it. For example:
    http://mysite.com/
  • nutch\conf\crawl-urlfilter.txt sets the filter rules for the URLs to be indexed, one rule per line: a leading - excludes matching URLs, and a leading + includes them.
  • The line -[?*!@=] excludes all dynamic URLs. Most systems today have many dynamic URLs, so this rule may leave you unable to capture any content.
  • URL filter rules must be set per system, and the specifics vary with the application. Here the popular forum Discuz! is used as an example; this filter captures only the board lists and post content.
    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/
    # accept hosts in MY.DOMAIN.NAME
    #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # discuz
    +^http://mysite.com/discuz/index.php$
    +^http://mysite.com/discuz/forumdisplay.php\?fid=\d+$
    +^http://mysite.com/discuz/forumdisplay.php\?fid=\d+&page=\d+$
    +^http://mysite.com/discuz/viewthread.php\?tid=\d+&extra=page%3d\d+$
    +^http://mysite.com/discuz/viewthread.php\?tid=\d+&extra=page%3d\d+&page=\d+$
    # skip everything else
    -.
  • nutch\conf\regex-urlfilter.txt: I am not sure whether this configuration file takes effect, but it is recommended to comment out -[?*!@=] here as well.
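  • The usual nutch convention is to put such overrides in nutch\conf\nutch-site.xml rather than editing nutch-default.xml itself. A minimal sketch with the values discussed above (the db.max.outlinks.per.page value is only illustrative):
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>Vik-robot</value><!-- any non-empty name; an empty value stops the crawler -->
      </property>
      <property>
        <name>indexer.mergeFactor</name>
        <value>500</value><!-- higher is faster but uses more memory -->
      </property>
      <property>
        <name>indexer.minMergeDocs</name>
        <value>500</value>
      </property>
      <property>
        <name>db.max.outlinks.per.page</name>
        <value>1000</value><!-- assumption: raise this for link-heavy internal sites -->
      </property>
    </configuration>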
  • Crawler execution

  • Because the nutch scripts are Linux shell scripts, cygwin must be used to run them on Windows. For details on using cygwin, see other articles on the Internet or the official cygwin website.
  • Run sh bin/nutch crawl urls -dir crawl -threads 2 -depth 100 -topN 1000000 >& crawl.log to crawl and index the whole site.
  • -threads: the number of page-fetching threads. In theory, more threads means more speed, but too many threads put a heavy burden on the server and affect its normal use.
  • -depth: the page crawl depth.
  • -topN: the maximum number of pages fetched at each level.
  • crawl.log: the file the log is written to.
  • After execution, a crawl directory holding the index files is generated under the nutch root directory.
  • Note: delete any existing crawl folder before running the command; if the crawl folder already exists, the command will not run normally.
  • Windows scheduled task creation
    After the index is built, the system does not update it automatically; use a Windows scheduled task to rebuild the index regularly. A sketch of such a task follows this list.
  • Specific practices: to be continued.
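  • As a sketch of what such a scheduled task could run (the cygwin path c:\cygwin and the nutch location d:\appserv\nutch are assumptions based on the paths used in this article):
    REM recrawl.bat - delete the old index and recrawl; all paths are assumptions
    c:\cygwin\bin\bash --login -c "cd /cygdrive/d/appserv/nutch && rm -rf crawl && bin/nutch crawl urls -dir crawl -threads 2 -depth 100 -topN 1000000 >& crawl.log"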
  • Web search interface deployment
    The nutch Web interface is developed in Java and needs to run in a Web container; Tomcat 6 is used here.
    Deploy to Tomcat

  1. Copy nutch-0.9.war to tomcat6\webapps and run tomcat6\bin\startup.bat to start Tomcat.
  2. Tomcat automatically unpacks the war file. Modify tomcat6\webapps\nutch\WEB-INF\classes\nutch-site.xml to set the location of the nutch index files:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>D:\appserv\nutch\crawl\</value>
      </property>
    </configuration>
  3. Restart Tomcat and test the search function. If no exception occurs, the service is running normally.

Problem fixes

  • Some Chinese characters on the search page are garbled. This is mainly caused by jsp:include: converting the included file nutch\zh\include\header.html from UTF-8 to GBK fixes the issue.
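    One way to do the conversion is from the cygwin shell with iconv (a sketch; the temporary file name is arbitrary):
    iconv -f UTF-8 -t GBK header.html > header.gbk && mv header.gbk header.html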
  • Garbled characters appear in Chinese searches. Modify the Tomcat configuration file tomcat6\conf\server.xml and add URIEncoding / useBodyEncodingForURI to the Connector:
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"
               useBodyEncodingForURI="true"/>
  • Fix garbled Web snapshots. In tomcat6\webapps\nutch\cached.jsp, change content = new String(bean.getContent(details)) to content = new String(bean.getContent(details), "UTF-8").
  • Apache integration. Modify the Apache configuration file conf\httpd.conf and add the following configuration:
    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    <IfModule mod_proxy.c>
        ProxyPass /nutch http://localhost:8080/nutch
        ProxyPassReverse /nutch http://localhost:8080/nutch
    </IfModule>
  • URL problems. After integration with Apache, the generated URLs are wrong: the prefix you see is the one configured in ProxyPass. There is currently no better solution than manually fixing every affected JSP page. Use findstr /s requestURI *.jsp to find them all, then, after the line String base = requestURI.substring(0, requestURI.lastIndexOf('/'));, replace the wrong host with the correct URL address.
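    The edit, as a sketch (the identifiers follow the JSP fragment quoted above; mysite.com stands in for the real host name):
    String base = requestURI.substring(0, requestURI.lastIndexOf('/'));
    base = base.replace("localhost:8080", "mysite.com"); // rewrite the proxied host to the public one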
  • Delete tomcat6\webapps\nutch\cached.jsp to disable the Web snapshot function. Because some pages require access permissions, snapshots are disabled here.
  • Chinese problem modification
    By default, nutch supports Chinese search, but it only segments Chinese into single characters. For example, a search for "China" (中国) without double quotation marks returns every page that contains either of the characters 中 or 国. For ease of use, the system is changed to add double quotation marks to the search content automatically.
  • Modify the tomcat6\webapps\nutch\search.jsp file: add a function that formats the search string, and run queryString through it.
    <%!
    // Quote every search term so that Chinese terms are matched as phrases
    // instead of being split into single characters.
    public static String format_query_str(String s) {
        s = s.replace("“", "\"").replace("”", "\""); // normalize full-width Chinese quotes to ASCII
        if (s.indexOf("\"") > -1) { // the user already quoted something: leave the query alone
            return s;
        }
        String[] ss = s.split(" ");
        String ret_s = "";
        for (String str : ss) {
            if (str.trim().equals("")) {
                continue;
            }
            if (str.indexOf("-") == 0) { // keep a leading minus (term exclusion) outside the quotes
                str = "-\"" + str.substring(1) + "\"";
            } else {
                str = "\"" + str + "\"";
            }
            ret_s += str + " ";
        }
        return ret_s.trim();
    }
    %>
    queryString = format_query_str(queryString);
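    With this in place, a query such as football -NFL is rewritten to "football" -"NFL", and a Chinese query such as 中国 becomes "中国", so it is matched as a phrase rather than character by character.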
  • Search help

  • Usage is similar to a common search engine: multiple keywords are supported, separated by spaces.
  • Chinese is segmented into single characters, so it is recommended to put double quotation marks around Chinese searches. For example, a search for "China" (中国) without quotation marks returns every page containing 中 or 国.
  • You can put a minus sign before a word to exclude it from the search results. For example, searching football -NFL finds pages that discuss football but do not mention "NFL".
  • English words are not case-sensitive, so uppercase and lowercase searches are equivalent.
