[Study Notes on the Peking University Skynet Search Engine TSE] Part 1: Environment Setup

Last Update:2018-12-03 Source: Internet

Author: User

I recently read the book "Search Engine: Principles, Technologies and Systems" and downloaded the source code of its prototype system, the Peking University Skynet search engine TSE. I had no background in search engines and no web development experience, so I learned everything from scratch. To record what I learned for future use, and to share it so that we can study together, I organized my learning process into notes and published them on this blog. I am not an expert, just a beginner starting from zero. If there are mistakes in this article, corrections and criticism are welcome, as is further discussion. (QQ: 44452114, Weibo: http://weibo.com/liubing2000).

1. Introduction

Download the source code of TSE from http://sewm.pku.edu.cn/book/. The download contains many kinds of files, so start by getting a rough idea of what each one is.

First, readme.txt is an introduction to the software package. Read it first for its overview of the system's source code. Reading readme.txt first is a good habit in general.

Second, tse.mp3 and the accompanying lecture notes are the recording and handout of a talk on the TSE system given by Yan Hongfei, one of the authors of the book. You can follow the lecture notes while listening to Mr. Yan's explanation to get a preliminary understanding of how the system works;

Third, tse_tutorial.pdf was written by Hou Rui while learning the system and mainly covers how to set up the environment. However, I do not think it is detailed enough: newcomers to the system are unlikely to get a clear picture from it, so following it may not succeed. It is still useful as a reference;

Fourth, the book "Search Engine: Principles, Technologies and Systems" (hereinafter "Search") describes the workflow of a search engine in three stages: web page collection, preprocessing, and query service. The index.xxxxxx-xx.linux.tar.gz file corresponds to the preprocessing and query service subsystems of the TSE system;

Fifth, tse.xxxxxx-xxxx.linux.tar.gz corresponds to the web page collection subsystem of the TSE system (also known as the web page capture program or crawler).

The index package also contains the result data of the web page collection and preprocessing modules. After you decompress index.xxxxxx-xxxx.linux.tar.gz, you will see a data directory that stores these results: Tianwang.raw.2559638448 is the original web page data captured by the collection module (stored in Skynet format), Sun.iidx is the inverted index file (that is, the mapping from each keyword to the web pages that contain it), and Doc.idx is the web page index file (the mapping from a web page ID to the page's location in the original web page data file). It does not matter if these data files are unclear for now; they are introduced in detail in the second part of these notes.
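As a quick sanity check after unpacking, the data files described above can be looked for with a small shell function. This is only a sketch: the function name check_tse_data is made up for illustration, the directory to inspect is passed in as an argument because the exact layout of the archive is an assumption here, and file names are matched case-insensitively.

```shell
#!/bin/sh
# check_tse_data DIR
# Hypothetical helper: reports whether the three data files named in
# these notes are present directly inside DIR.
check_tse_data() {
    data_dir="$1"
    missing=0
    for pattern in 'Tianwang.raw.*' 'Sun.iidx' 'Doc.idx'; do
        # -iname matches case-insensitively; -maxdepth 1 stays in DIR itself.
        if [ -n "$(find "$data_dir" -maxdepth 1 -iname "$pattern" | head -n 1)" ]; then
            echo "found:   $pattern"
        else
            echo "missing: $pattern"
            missing=1
        fi
    done
    # Non-zero exit status if any file was not found.
    return "$missing"
}
```

For example, check_tse_data index/data (assuming the archive unpacks into index/ with the data directory inside it) prints one found/missing line per file and returns a non-zero status if any file is missing.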

It is easy to see why the TSE query service subsystem can run on its own: the three stages of a search engine are independent of each other. Just as in a commercial search engine, once the web page collection subsystem has run and collected a large amount of web page data, the query service can be offered after preprocessing. The queries are of course answered from the collected data rather than by fetching pages from the network in real time for every query, so the results may not match the live pages; keeping them current requires capturing the web pages again.

Since the query service subsystem can run independently, we can run it first and set aside preprocessing and web page collection for now. In other words, I will explain the system in reverse order: first the query service, then preprocessing, and finally web page collection. We start with the index.xxxxxx-xx.linux.tar.gz package.

2. Install apache2

Once the query service subsystem is running, it looks like Figure 1, which is no different from a commercial search engine.

Figure 1

Since this is a web application, you need to set up a web server. The TSE system uses the apache2 server. I have never done web development myself, so I will not comment on web development in general for fear of misleading you; I only share the steps I followed, and corrections are welcome. For details on installing and configuring apache2 on Ubuntu, refer to the following two articles:

http://www.blogjava.net/duanzhimin528/archive/2010/03/05/314564.html

http://blog.csdn.net/rookieding/article/details/7314054

My system is Ubuntu 10.10. The installation command is sudo apt-get install apache2. If the installation fails, run sudo apt-get update to refresh the package sources and try again.
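The install-then-retry procedure just described can be written as one small function. This is a sketch rather than a recommended installer: the function name install_with_retry and the APT variable are introduced here for illustration (APT exists only so the package command can be swapped out), and -y is added so apt-get does not stop to ask for confirmation.

```shell
#!/bin/sh
# Command used to talk to the package manager (assumption: apt-get via sudo).
APT="${APT:-sudo apt-get}"

# install_with_retry PKG
# Try to install PKG; if that fails, refresh the package lists and try once more.
install_with_retry() {
    pkg="$1"
    if $APT install -y "$pkg"; then
        return 0
    fi
    echo "install failed; refreshing package lists and retrying" >&2
    $APT update && $APT install -y "$pkg"
}
```

Calling install_with_retry apache2 then corresponds to the two commands above, run only as far as needed.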

After a successful installation, three directories are particularly important:

(1) /etc/init.d/. The apache2 script in this directory is used to start, stop, and restart the apache2 server.

Start: sudo /etc/init.d/apache2 start

Stop: sudo /etc/init.d/apache2 stop

Restart: sudo /etc/init.d/apache2 restart

(2) /etc/apache2/. This directory mainly contains the configuration files of the apache2 server.

apache2.conf is the main configuration file, and httpd.conf is the user configuration file. After installation, httpd.conf is empty and contains none of the configuration items described in online documentation.

The sites-available directory holds the available site configurations, and the sites-enabled directory holds the enabled ones. The sites-enabled directory contains a single link file, which points to ../sites-available/default, so this default file is the default site configuration file. Its main content is as follows:

<VirtualHost *:80>
ServerAdmin webmaster@localhost

# Web document root directory of the site; this will be modified later.
DocumentRoot /var/www/

......

(3) /var/www/. This is the web document root directory of the site mentioned above.

The site's web page files are placed here by default, but the location can be changed in the configuration file above. The TSE setup changes it, as described later.

After apache2 is installed, and before making any changes, open a browser and visit http://localhost/. If the page shows "It works!", as in Figure 2, the installation succeeded.
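To see these settings on a real machine, the enabled site links and the configured document root can be read back with a few lines of shell. The following is a minimal sketch; the function names list_enabled_sites and document_root are hypothetical helpers, and the paths in the usage note assume the Ubuntu layout described above.

```shell
#!/bin/sh
# list_enabled_sites DIR
# Print every symbolic link in DIR together with the file it points to.
list_enabled_sites() {
    for link in "$1"/*; do
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink "$link")"
    done
}

# document_root CONF
# Print the value of the first DocumentRoot directive in CONF.
# Apache directive names are case-insensitive, hence tolower().
document_root() {
    awk 'tolower($1) == "documentroot" { print $2; exit }' "$1"
}
```

For example, list_enabled_sites /etc/apache2/sites-enabled shows where each enabled site is linked from, and document_root /etc/apache2/sites-available/default prints the directory the default site serves.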

Figure 2

Why is this page displayed? As mentioned earlier, the web document root directory of the site is /var/www/ by default. This directory contains an index.html file, which is the file shown in the browser. Open it with vim index.html and you can see that its content is exactly what the web page displays.

3. Create a server directory and prepare files

Now start building your own web server: prepare the page files and the back-end search program of the TSE search engine, and place them in the corresponding directories of the web server (recall that the default root directory of the web site is /var/www/).

(1) Create the following directories if they do not exist: /var/www/html/, /var/www/html/YC-cgi-bin/index/, and /var/www/html/YC/TSE/;

(2) Compile the source code:

tar zxvf index.xxxxxx-xxxx.linux.tar.gz (decompress)

cd index (enter the index directory)

make (compile and build)

(3) Copy all the files in index/public_html to /var/www/html/, including the images and .jpg files, that is, the image files used on the page;

(4) Copy all the files in index/public_html to /var/www/html/YC/TSE/. I do not understand why this is needed;

(5) tse_tutorial.pdf says to "put the files produced by make in /var/www/html/YC-cgi-bin/index/". In fact, you do not need to copy everything in the index directory. You only need the executable programs generated by make and the data files (including the chseg directory, the data directory, and the words.dict link file); the source files (.h and .cpp) and the object files (.o) are not required. Of course, if you are unsure or find it too much trouble, you can simply copy everything. The /var/www/html/YC-cgi-bin/index/ directory is the path of the site's CGI programs, that is, the directory of the executable programs that implement the search function. You can look up further material on CGI yourself.
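Steps (1) and (3) to (5) can be sketched as a single shell function. This is an illustration under assumptions, not the procedure from tse_tutorial.pdf: the function name deploy_tse is hypothetical, the index directory is assumed to have already been built with make, and "everything except .h, .cpp, and .o files" is used as a simple stand-in for "the executables and data files". The YC/TSE directory is created under the web root, matching step (4).

```shell
#!/bin/sh
# deploy_tse INDEX_DIR WEB_ROOT
# Copy the page files and the built CGI files from INDEX_DIR into WEB_ROOT.
deploy_tse() {
    index_dir="$1"
    web_root="$2"
    cgi_dir="$web_root/YC-cgi-bin/index"

    # Step (1): create the target directories if they do not exist.
    mkdir -p "$web_root" "$cgi_dir" "$web_root/YC/TSE" || return 1

    # Steps (3) and (4): copy the page files (HTML and images) to both places.
    cp -R "$index_dir/public_html/." "$web_root/" || return 1
    cp -R "$index_dir/public_html/." "$web_root/YC/TSE/" || return 1

    # Step (5): copy everything else except sources and object files.
    for entry in "$index_dir"/*; do
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        case "$entry" in
            *.h|*.cpp|*.o|*/public_html) continue ;;
        esac
        cp -R "$entry" "$cgi_dir/" || return 1
    done
}
```

On the system described here this would be run with root privileges as deploy_tse index /var/www/html from the directory that contains index. Because cp -R copies a symbolic link such as words.dict as a link, check afterwards that the link still resolves in its new location.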

4. Modify the Web Server Configuration

(1) Configure the character set

Set the default character set, which defines the character set that the server declares to the client. (Because Apache's default character set is not a Chinese one (the referenced article gives it as Western European UTF-8), web pages containing Chinese characters are displayed as garbled text; changing the character set to GB2312 and restarting the apache service fixes this.) Quoted from: http://www.blogjava.net/duanzhimin528/archive/2010/03/05/314564.html.

Unlike what tse_tutorial.pdf says, do not modify /etc/httpd/conf/httpd.conf; instead, modify /etc/apache2/conf.d/charset and add the line AddDefaultCharset GB2312 to the file.
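The charset change can also be made from the command line. The sketch below appends the directive only if the exact line is not already present, so running it twice does no harm; the function name set_default_charset is a hypothetical helper, and writing to a file under /etc requires root privileges.

```shell
#!/bin/sh
# set_default_charset FILE CHARSET
# Append "AddDefaultCharset CHARSET" to FILE unless that exact line exists.
set_default_charset() {
    file="$1"
    charset="$2"
    line="AddDefaultCharset $charset"
    # -x matches whole lines only, -F treats the line as a fixed string.
    if [ -f "$file" ] && grep -qxF "$line" "$file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$file"
}
```

Run as root, set_default_charset /etc/apache2/conf.d/charset GB2312 corresponds to the manual edit described above. Apache has to be restarted for the change to take effect.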

(2) Configure the web site root directory and the CGI program directory

In section 3 the web page files were placed in /var/www/html/, so the web site must use this directory as its root, and the apache2 configuration has to be changed accordingly. Section 3 also placed the executable programs (CGI programs) in /var/www/html/YC-cgi-bin/index/, so the CGI program directory in the apache2 configuration has to be changed as well. Unlike what tse_tutorial.pdf says, do not modify /etc/httpd/conf/httpd.conf; instead, modify /etc/apache2/sites-available/default as follows:

DocumentRoot /var/www/
<Directory />
Options FollowSymLinks
AllowOverride None
</Directory>
<Directory /var/www/>

Changed to:

DocumentRoot /var/www/html/
<Directory />
Options FollowSymLinks
AllowOverride None
</Directory>
<Directory /var/www/html/>

ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
<Directory "/usr/lib/cgi-bin">

Changed to:

ScriptAlias /YC-cgi-bin/index/ /var/www/html/YC-cgi-bin/index/
<Directory "/var/www/html/YC-cgi-bin/index/">
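For readers who prefer not to edit the file by hand, the same four substitutions can be expressed with sed. This is a sketch that assumes the site file contains the default lines quoted above; the function name retarget_site is hypothetical. It writes the modified configuration to standard output and leaves the original file untouched, so the result can be reviewed before it is installed.

```shell
#!/bin/sh
# retarget_site CONF
# Print CONF with the document root moved to /var/www/html/ and the CGI
# alias moved to /YC-cgi-bin/index/. Other lines and indentation are kept.
retarget_site() {
    conf="$1"
    sed -e 's|DocumentRoot[[:space:]]*/var/www/*$|DocumentRoot /var/www/html/|' \
        -e 's|<Directory[[:space:]]*/var/www/*>|<Directory /var/www/html/>|' \
        -e 's|ScriptAlias[[:space:]]*/cgi-bin/[[:space:]]*/usr/lib/cgi-bin/|ScriptAlias /YC-cgi-bin/index/ /var/www/html/YC-cgi-bin/index/|' \
        -e 's|<Directory[[:space:]]*"/usr/lib/cgi-bin">|<Directory "/var/www/html/YC-cgi-bin/index/">|' \
        "$conf"
}
```

One way to use it is retarget_site /etc/apache2/sites-available/default > /tmp/default.new, then compare the two files with diff and, if the changes look right, copy the new file over the original as root.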

The TSE query service subsystem is now set up. Restart apache2:

sudo /etc/init.d/apache2 restart

Open your browser and enter http://localhost/ to display the search page of Peking University Skynet (Figure 1).

The page in Figure 1 is displayed because apache2 is now configured with /var/www/html/ as the server's root directory, and the index.html in that directory is the default page file.
