Tutorial | Application example: a Linux-based search engine implementation
A search engine is a tool that gives users quick access to information on the Web: the user enters keywords, the system searches a back-end database of Web pages, and the relevant page links and summaries are returned to the user. By scope, searches are generally divided into site search and global Web search. As the number of Web pages has grown rapidly, search engines have become the primary means of finding information online. Every large Web site now offers a page-search service, and many companies provide professional search services to large sites; Google, for example, has supplied search for Yahoo, and Baidu supplies search for domestic sites such as Sina and 263. Professional search services are expensive, however, and free search-engine software is mostly built for English, so neither fits the needs of an intranet environment such as a campus network.
A search engine generally consists of three parts: a Web-page collection program, back-end organization and storage of the page data, and retrieval of that data. The key factor in search-engine quality is query response time, that is, how to organize a large volume of page data to support full-text search.
GNU/Linux is an excellent network operating system, and its distributions integrate a large number of network applications: a Web server (Apache + PHP), a directory server (OpenLDAP), a scripting language (Perl), and a Web-page collection program (wget). By combining these, we can build a simple and efficient search-engine server.
I. Basic components and how they are used
1. Web Data collection
wget is an excellent page-collection program. It can easily mirror a site's content to a local directory, and it can flexibly restrict the types of pages collected, the recursion depth, the directories visited, the collection time, and so on. Using a dedicated collection program both reduces design effort and improves system performance. To keep the local data small, collect only HTML files, text files, and script output (ASP and PHP pages with their default results) rather than graphics or other data files.
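As a sketch of the collection step, the script below builds a list of sites and assembles a wget invocation with the restrictions just described. The hostnames, list path, and spool directory are placeholders; the final command is echoed rather than executed, since mirroring needs network access to the intranet in question.

```shell
#!/bin/sh
# Hypothetical list of intranet sites to mirror (hostnames are placeholders
# reusing the example domain from later in this article).
cat > /tmp/sites.txt <<'EOF'
http://www.27jd.zzb/
http://freemail.27jd.zzh/
EOF

# Mirror recursively to a local spool, keeping only page-like files and
# staying inside each site (-np). Echoed here as a dry run; remove the
# leading "echo" to perform the actual collection.
echo wget -i /tmp/sites.txt -r -l 5 -np \
     -A 'html,htm,txt,asp,php' \
     -P /var/spool/search/pages
```

The -l depth of 5 is only an example; as noted above, the depth should be tuned to each site.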
2. Web Data filtering
HTML files contain large numbers of tags, such as <body> and <table>, that have no search value, so the collected data must be filtered before it is added to the database. Perl, a widely used scripting language, has a rich library ecosystem that makes it easy to filter Web pages: with the HTML-Parser library, the text, title, and link data can be extracted from a page with little effort. The library is available from CPAN, whose Perl modules cover far more ground than page collection alone.
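To show what the filtering step produces, here is a crude stand-in using sed rather than the HTML-Parser programs the text describes; a real deployment would use HTML-Parser, which handles entities, comments, and broken markup far better than this sketch. The sample page and file paths are invented for illustration.

```shell
#!/bin/sh
# A toy page to filter (content is a placeholder).
cat > /tmp/sample.html <<'EOF'
<html><head><title>Webmail Home</title></head>
<body><table><tr><td>Welcome to the intranet mail service.</td></tr></table>
</body></html>
EOF

# Extract the <title> text (the role htitle plays in the real design).
sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' /tmp/sample.html > /tmp/sample.title

# Strip all remaining tags, leaving plain text for indexing (htext's role).
sed 's/<[^>]*>//g' /tmp/sample.html > /tmp/sample.txt
```

After running, /tmp/sample.title holds the page title and /tmp/sample.txt the tag-free body text, which is the material that goes into the directory database.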
3. Directory services
Directory services were developed for retrieval over large volumes of data. They first appeared in the X.500 protocol suite and were later carried over to TCP/IP as LDAP (Lightweight Directory Access Protocol), standardized in RFC 1777 (1995) and RFC 2251 (1997), among others. LDAP has become an industry standard, adopted by Sun, Lotus, Microsoft, and others in their products, although dedicated directory servers for the Windows platform remain rare. OpenLDAP is a free directory server that runs on Unix systems. It performs very well, is included in a number of Linux distributions (Red Hat, Mandrake, and others), and provides development interfaces for C, Perl, PHP, and more.
Using a directory service instead of an ordinary relational database as the back-end store for the page data rests on the directory service's technical advantages. A directory service simplifies the kinds of data processing it supports: it drops the time-consuming transaction machinery of a relational database and updates data by wholesale replacement instead. It is aimed at retrieval over large amounts of data (in typical use, retrievals outnumber updates by more than 10:1), emphasizes retrieval speed and full-text query, and provides complete data backup, which makes it well suited to a search-engine service. Its advantage in data retrieval is easy to see: query times are far lower than a relational database's. This reflects the principle of optimizing the data solution for the specific problem, in contrast to the common practice of pushing every large data-processing task onto an SQL server.
Choosing mature directory-service technology to speed up Web queries improves data-handling capacity concisely and effectively. It also demonstrates the advantage of running open software on a GNU/Linux system; after all, getting a directory server running on other platforms is not so easy.
4. Query program Design
The front end of the search engine is a Web page: the user enters keywords into a form, which is submitted to the Web server for processing. PHP scripts running under the Apache Web server query the keywords through PHP's LDAP functions. The main work is to construct a query from the keywords, submit it to the directory server, and display the results. As a Web server, Linux + Apache + PHP is in wide use and performs no worse than WinNT + IIS + ASP; current Linux distributions integrate Apache + PHP with the LDAP, PgSQL, IMAP, and other modules enabled by default.
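The query-construction step can be sketched from the command line before any PHP is written: turn a user keyword into an LDAP search filter over the title and contents attributes, then hand it to ldapsearch (the PHP ldap_search call takes the same filter string). The keyword, attribute names, and base DN follow this article's example schema; the ldapsearch command is echoed rather than run, since it needs a live directory server.

```shell
#!/bin/sh
# Build an LDAP filter that matches the keyword in either the page
# title or the page contents.
KEYWORD="webmail"
FILTER="(|(title=*${KEYWORD}*)(contents=*${KEYWORD}*))"
echo "$FILTER"

# Dry run of the query against the example suffix; remove the "echo"
# to query a running slapd.
echo ldapsearch -b 'dc=27jd,dc=zzb' "$FILTER" link title
```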
5. Planning Tasks
Page collection, data filtering, and directory-database updates should all run automatically. On Unix systems the cron daemon schedules tasks at specified times; to avoid affecting normal operation, this work can generally be scheduled late at night.
II. Specific steps and points to note
1. Configure wget Software
The wget package is included in the Red Hat 6.2 release and can be installed directly. Edit the addresses of the sites to be mirrored into a file, which wget reads via the -i parameter, and specify a local download directory for the mirrored sites. To avoid following duplicate links across the intranet, mirroring is generally confined to within each site, and the mirror depth can be set according to each site's circumstances.
2. Configure the OpenLDAP service
openldap-1.2.9 is included in the Red Hat 6.2 release, with its configuration files stored under /etc/openldap. The main configuration file is slapd.conf; the key point is to turn on the index options, which are critical to retrieval speed. The setup tool can be used to start the LDAP service by default at boot.
The LDAP service can load data from a text file in LDIF format, which makes it efficient to update the directory data. Note that LDIF entries are separated by blank lines, and that the directory service must be paused while ldif2ldbm imports the LDIF data into the catalog database.
3. Writing the data-filtering and LDIF-generation scripts
To filter page data conveniently, call Perl's HTML-Parser library, which must be compiled after download; the build produces the htext and htitle programs in its eg directory, which can be run from Perl as external programs, with the filtered results redirected to a temporary file. In this design the directory entries carry the attributes dn, link, title, modifydate, and contents, with the dn made unique through the link. The filtered page content is automatically encoded by the /usr/sbin/ldif program and placed in the LDIF file.
The basic LDIF file format is as follows:

dn: dc=27jd,dc=zzb
objectclass: top
objectclass: organization

dn: link=http://freemail.27jd.zzh/index.html,dc=27jd,dc=zzb
link: http://freemail.27jd.zzh/index.html
title: Webmail Home
modifydate: 2001-02-08
contents:: cgpxzwjtywls1vfsswokcgokikhvoag7ttotyrntw1dlym1hawzptc2zoagh7ydo0t
 kqyerh69pkz+qhisfpdxrsb29rxetww6o6u01uudogznjlzw1hawwumjdqzc56emjq
 T1azoibmcm
 Vlbwfpbc4yn2pklnp6ykrouya6idexljk5ljy0ljiy4sru08o7p6o6bwfpbgd1zxn00
 8o7p7/awe
 6jum1hawxndwvzdnlr16ky4dpdu6cg08o7p8p7okagznjlzw1hawwumjdqzc56emk/
 2sHuOqChoa
 Agikhyzog5qbf+zvegofkzo7z7zsrm4ich8s2o0bbcvkhyicch8sq1z9burcdtikhywftr1
 Lk+of
 Igofk8vmr1sr/w99kzsb7ptc2z08nk1nhpvlzk9bk/zfjc59bq0ms9qmgius3orlukcgok
 Cqakcg
 o=
objectclass: webpage
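An entry of this shape can be generated for each filtered page with a short script like the one below. The link doubles as the unique part of the dn, and the page text is base64-encoded so the "contents::" value survives arbitrary characters; the link, title, date, and text here are placeholders, and base64(1) stands in for the /usr/sbin/ldif encoder used in the article's actual setup.

```shell
#!/bin/sh
# Emit one directory entry in LDIF form for a filtered page.
LINK="http://freemail.27jd.zzh/index.html"
TITLE="Webmail Home"
B64=$(printf '%s' "Welcome to the intranet mail service." | base64)

cat <<EOF > /tmp/entry.ldif
dn: link=$LINK,dc=27jd,dc=zzb
link: $LINK
title: $TITLE
modifydate: 2001-02-08
contents:: $B64
objectclass: webpage
EOF

cat /tmp/entry.ldif
```

Appending such entries, separated by blank lines, builds the LDIF file that ldif2ldbm later imports.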
The basic slapd.conf file is as follows:

defaultaccess read
include /etc/openldap/slapd.at.conf
#include /etc/openldap/slapd.oc.conf
schemacheck off
sizelimit 20000
pidfile /var/run/slapd.pid
argsfile /var/run/slapd.args
#######################################################################
# LDBM Database Definitions
#######################################################################
database ldbm
dbcachesize 1000000
index contents,title
suffix "dc=27jd, dc=zzb"
directory /usr/tmp
rootdn "cn=root, dc=27jd, dc=zzb"
rootpw secret
Filtering a local directory of 40,000 HTML pages (about 300 MB) produced an LDIF file of about 180 MB; taking only the first 400 characters of each page's text as its content reduced the file to about 35 MB.
4. Configure PHP+LDAP Service
The PHP3 and PHP-LDAP modules are included in Red Hat 6.2 and are installed under /usr/lib/apache when a full installation is chosen; check that extension=ldap.so is enabled in the dynamic-extensions section of /etc/httpd/php3.ini. PHP3 provides a rich set of LDAP access functions that make it easy to search the directory data; general Apache + PHP programming is not repeated here. Note that PHP3's ldap_search function cannot handle a result set that exceeds the directory server's size limit, so set a larger sizelimit in the slapd configuration file as circumstances require. This problem is resolved in PHP4.
5. Task scheduling
crond is included in Red Hat 6.2 and started after the default installation. Its configuration lives in /etc/crontab and the /etc/cron.daily, /etc/cron.hourly, /etc/cron.weekly, and /etc/cron.monthly directories. You need only write the update steps (page collection, page filtering, LDIF generation, stopping the directory service, updating the catalog data, restarting the directory service) as a simple shell program and place it in the directory matching the desired update frequency.
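The nightly rebuild might be sketched as below, suitable for dropping into /etc/cron.daily. The site list, filter script name, and spool paths are hypothetical, and the init-script path assumes the Red Hat layout of the era; every step is wrapped in a dry-run function that only echoes the command, so the sequence can be inspected safely before the wrapper is changed to execute for real.

```shell
#!/bin/sh
# Nightly catalog rebuild (sketch). "run" is a dry-run wrapper that
# echoes each command; replace its body with "$@" to execute them.
run() { echo "$@"; }

run wget -i /etc/search/sites.txt -r -l 5 -np -A 'html,htm,txt' \
        -P /var/spool/search/pages              # 1. collect pages
run /usr/local/bin/filter-pages.pl              # 2. filter pages, build pages.ldif (hypothetical script)
run /etc/rc.d/init.d/ldap stop                  # 3. pause the directory service
run ldif2ldbm -i /var/spool/search/pages.ldif   # 4. rebuild the catalog database
run /etc/rc.d/init.d/ldap start                 # 5. resume service
```

Steps 3-5 reflect the requirement noted earlier that slapd must be paused while ldif2ldbm imports the data.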
III. Results and reflections
The above briefly introduces our search engine's implementation and the points to watch. It is only a directory-service design, built with our still superficial knowledge of GNU/Linux, that meets the needs of an intranet search engine; it cannot represent the real strength of GNU/Linux and the vast body of software it integrates.
In tests on a single SPARC Ultra 250 running Red Hat Linux 6.2, searching catalog data of 40,000 pages, a search engine built this way typically responds in about 3 seconds, and a complete catalog update takes about 4 hours, which meets the needs of an intranet. In fact, the main limit on response speed is that PHP3's ldap_search provides no way to restrict the result size, so the system slows when the result set is large: the user can browse only a small part of the results at a time, yet every query returns the full set from the server. ldap_search in PHP4 resolves this by accepting a sizelimit parameter.
Directory services have a very wide range of applications; in fact, most large information sites use directory-service technology to some degree to improve customer access efficiency. The directory service's approach of optimizing its design for specific application requirements is undoubtedly an inspiration for system development: for index-based information within a domain, an LDAP service far outperforms a traditional relational database system.
Building a Web server on GNU/Linux shows the charm and strength of open source: it simplifies system design, greatly improves working efficiency, and effectively reduces system cost. Program design shifts from complex, tedious, repetitive work done from scratch to four parts: abstracting the problem, decomposing the functions, finding the resources, and combining them. This places more emphasis on understanding the system, an open view, and the ability to learn, while open source provides a solid foundation for further optimization of the system.