Nutch is deployed and related issues (Chinese garbled characters, etc.) are fixed.


Introduction
Nutch is an open-source Web search engine that provides high-quality search services.
It is a good full-text search solution for internal systems and small-to-medium websites.
Deployment of nutch
The latest version of nutch can be downloaded from the official nutch website. After downloading, decompress the package and it is ready to use. Because I am using a Windows system, the deployment below is done on Windows.
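Unpacking from the cygwin shell (introduced below) looks like this; the archive name is a sketch following the nutch-0.9 version used later in this article:

    # unpack the downloaded release and enter its directory
    tar -xzf nutch-0.9.tar.gz
    cd nutch-0.9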
Web Crawler settings
Nutch itself contains a crawler, which indexes the target site, and a Web interface for searching. Before the query interface can be used, the crawler must be set up to crawl the target site.
Description of some configuration files:

  • nutch\conf\nutch-default.xml
  • Set http.agent.name. If http.agent.name is empty, the crawler cannot start normally. You can set any name you like, for example Vik-robot.
  • indexer.mergeFactor / indexer.minMergeDocs: change both values to 500. The larger these two parameters, the better the performance, but the more memory consumed; if they are too large, a memory overflow may occur. In actual use, peak memory usage under the current parameters is 3xxM.
  • http.timeout is the maximum time to wait for a response; a connection that does not respond within it is dropped.
  • db.max.outlinks.per.page is the maximum number of outlinks processed on a single page. If you are crawling your own internal system, set it to a large number.
  • For explanations of the remaining options, see the comments in nutch-default.xml itself.
  • Create the file nutch\urls and put the crawler's starting URL(s) in it. For example:
    http://mysite.com/
  • nutch\conf\crawl-urlfilter.txt sets the filter rules for the URLs to be indexed, one rule per line: a leading - excludes matching URLs, and a leading + includes them.
  • The line -[?*!@=] excludes all dynamic URLs. Most systems today have many dynamic URLs, so this rule may leave you unable to capture any content.
  • URL filter rules must be set per system, and the specifics vary with the application. Here the popular forum Discuz! is used as an example; this filter captures only the board lists and post content.
    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]
    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/
    # accept hosts in MY.DOMAIN.NAME
    #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # discuz
    +^http://mysite.com/discuz/index.php$
    +^http://mysite.com/discuz/forumdisplay.php\?fid=\d+$
    +^http://mysite.com/discuz/forumdisplay.php\?fid=\d+&page=\d+$
    +^http://mysite.com/discuz/viewthread.php\?tid=\d+&extra=page%3d\d+$
    +^http://mysite.com/discuz/viewthread.php\?tid=\d+&extra=page%3d\d+&page=\d+$
    # skip everything else
    -.
  • nutch\conf\regex-urlfilter.txt: I am not sure whether this configuration file takes effect, but it is recommended to comment out -[?*!@=] here as well.
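  • The usual nutch convention is to put such overrides in nutch\conf\nutch-site.xml rather than editing nutch-default.xml itself. A minimal sketch with the values discussed above (the db.max.outlinks.per.page value is only illustrative):
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>Vik-robot</value><!-- any non-empty name; an empty value stops the crawler -->
      </property>
      <property>
        <name>indexer.mergeFactor</name>
        <value>500</value><!-- higher is faster but uses more memory -->
      </property>
      <property>
        <name>indexer.minMergeDocs</name>
        <value>500</value>
      </property>
      <property>
        <name>db.max.outlinks.per.page</name>
        <value>1000</value><!-- assumption: raise this for link-heavy internal sites -->
      </property>
    </configuration>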
  • Crawler execution

  • Because the nutch scripts are Linux shell scripts, cygwin must be used to run them on Windows. For details on using cygwin, see other articles on the Internet or the official cygwin website.
  • Run sh bin/nutch crawl urls -dir crawl -threads 2 -depth 100 -topN 1000000 >& crawl.log to crawl and index the whole site.
  • -threads: the number of page-fetching threads. In theory, more threads means more speed, but too many threads put a heavy burden on the server and affect its normal use.
  • -depth: the page crawl depth.
  • -topN: the maximum number of pages fetched at each level.
  • crawl.log: the file the log is written to.
  • After execution, a crawl directory holding the index files is generated under the nutch root directory.
  • Note: delete any existing crawl folder before running the command; if the crawl folder already exists, the command will not run normally.
  • Windows scheduled task creation
    After the index is built, the system does not update it automatically; use a Windows scheduled task to rebuild the index regularly. A sketch of such a task follows this list.
  • Specific practices: to be continued.
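  • As a sketch of what such a scheduled task could run (the cygwin path c:\cygwin and the nutch location d:\appserv\nutch are assumptions based on the paths used in this article):
    REM recrawl.bat - delete the old index and recrawl; all paths are assumptions
    c:\cygwin\bin\bash --login -c "cd /cygdrive/d/appserv/nutch && rm -rf crawl && bin/nutch crawl urls -dir crawl -threads 2 -depth 100 -topN 1000000 >& crawl.log"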
  • Web search interface deployment
    The nutch Web interface is developed in Java and needs to run in a Web container; Tomcat 6 is used here.
    Deploy to Tomcat

  1. Copy nutch-0.9.war to tomcat6\webapps and run tomcat6\bin\startup.bat to start Tomcat.
  2. Tomcat automatically unpacks the war file. Modify tomcat6\webapps\nutch\WEB-INF\classes\nutch-site.xml to set the location of the nutch index files:
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>D:\appserv\nutch\crawl\</value>
      </property>
    </configuration>
  3. Restart Tomcat and test the search function. If no exception occurs, the service is running normally.

Problem fixes

  • Some Chinese characters on the search page are garbled. This is mainly caused by jsp:include: converting the included file nutch\zh\include\header.html from UTF-8 to GBK fixes the issue.
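    One way to do the conversion is from the cygwin shell with iconv (a sketch; the temporary file name is arbitrary):
    iconv -f UTF-8 -t GBK header.html > header.gbk && mv header.gbk header.html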
  • Garbled characters appear in Chinese searches. Modify the Tomcat configuration file tomcat6\conf\server.xml and add URIEncoding / useBodyEncodingForURI to the Connector:
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"
               useBodyEncodingForURI="true"/>
  • Fix garbled Web snapshots. In tomcat6\webapps\nutch\cached.jsp, change content = new String(bean.getContent(details)) to content = new String(bean.getContent(details), "UTF-8").
  • Apache integration. Modify the Apache configuration file conf\httpd.conf and add the following configuration:
    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    <IfModule mod_proxy.c>
        ProxyPass /nutch http://localhost:8080/nutch
        ProxyPassReverse /nutch http://localhost:8080/nutch
    </IfModule>
  • URL problems. After integration with Apache, the generated URLs are wrong: the prefix you see is the one configured in ProxyPass. There is currently no better solution than manually fixing every affected JSP page. Use findstr /s requestURI *.jsp to find them all, then, after the line String base = requestURI.substring(0, requestURI.lastIndexOf('/'));, replace the wrong host with the correct URL address.
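    The edit, as a sketch (the identifiers follow the JSP fragment quoted above; mysite.com stands in for the real host name):
    String base = requestURI.substring(0, requestURI.lastIndexOf('/'));
    base = base.replace("localhost:8080", "mysite.com"); // rewrite the proxied host to the public one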
  • Delete tomcat6\webapps\nutch\cached.jsp to disable the Web snapshot function. Because some pages require access permissions, snapshots are disabled here.
  • Chinese problem modification
    By default, nutch supports Chinese search, but it only segments Chinese into single characters. For example, a search for "China" (中国) without double quotation marks returns every page that contains either of the characters 中 or 国. For ease of use, the system is changed to add double quotation marks to the search content automatically.
  • Modify the tomcat6\webapps\nutch\search.jsp file: add a function that formats the search string, and run queryString through it.
    <%!
    // Quote every search term so that Chinese terms are matched as phrases
    // instead of being split into single characters.
    public static String format_query_str(String s) {
        s = s.replace("“", "\"").replace("”", "\""); // normalize full-width Chinese quotes to ASCII
        if (s.indexOf("\"") > -1) { // the user already quoted something: leave the query alone
            return s;
        }
        String[] ss = s.split(" ");
        String ret_s = "";
        for (String str : ss) {
            if (str.trim().equals("")) {
                continue;
            }
            if (str.indexOf("-") == 0) { // keep a leading minus (term exclusion) outside the quotes
                str = "-\"" + str.substring(1) + "\"";
            } else {
                str = "\"" + str + "\"";
            }
            ret_s += str + " ";
        }
        return ret_s.trim();
    }
    %>
    queryString = format_query_str(queryString);
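    With this in place, a query such as football -NFL is rewritten to "football" -"NFL", and a Chinese query such as 中国 becomes "中国", so it is matched as a phrase rather than character by character.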
  • Search help

  • Usage is similar to a common search engine: multiple keywords are supported, separated by spaces.
  • Chinese is segmented into single characters, so it is recommended to put double quotation marks around Chinese searches. For example, a search for "China" (中国) without quotation marks returns every page containing 中 or 国.
  • You can put a minus sign before a word to exclude it from the search results. For example, searching football -NFL finds pages that discuss football but do not mention "NFL".
  • English words are not case-sensitive, so uppercase and lowercase searches are equivalent.
