Implementing a Search Engine on Linux
A search engine is a tool that gives users quick access to information on the Web. Its main function is to take the keywords a user enters, search a back-end database of web pages, and return links to and summaries of the relevant pages. By scope, search is generally divided into site search and global web search. With the rapid growth in the number of web pages, search engines have become an essential means of finding information on the Internet. Every large website offers search over its own pages, and a number of companies provide professional search engine services: Google provides search services for Yahoo, for example, and Baidu provides them for Sina, 263, and other Chinese websites. Professional search services are expensive, however, and free search engine software is mostly built for English-language search, so neither suits the needs of an intranet environment such as a campus network.
A search engine generally consists of three parts: a web page collection program, back-end organization and storage of the page data, and retrieval of the page data. The key factor that determines the quality of a search engine is the response time of a query, which comes down to how the large volume of web data is organized to support full-text retrieval.
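The point about data organization can be made concrete with an inverted index, a common structure for full-text lookup that maps each word to the pages containing it. The sketch below is only an illustration in Python, not the storage scheme this article builds; the function names and the sample pages are invented for the example, and the simple word splitting suits space-delimited text only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it.

    `pages` is a dict of URL -> already filtered page text.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A query then costs one dictionary lookup per keyword plus a set intersection, instead of a scan over every stored page, which is why the way the data is organized dominates response time.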
GNU/Linux is a good network operating system, and its distributions ship with a large amount of network application software, such as a web server (Apache + PHP), a directory server (OpenLDAP), a scripting language (Perl), and a web page collection program (Wget). By combining them, you can build a simple, efficient search engine server.
I. Basic components and usage
1. Web data collection
Wget is a capable web page collection program. It can mirror the content of a site to a local directory and lets you flexibly configure the types of pages collected, the recursion depth, directory limits, collection time, and so on. Handing page collection to a dedicated program both reduces the design difficulty and improves the performance of the system. To reduce the size of the local data, you can collect only the files that can be searched (HTML files, text files, and the default output of ASP and PHP scripts) and skip form files and other data files.
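As a rough sketch of such a collection run, the Python helper below assembles a Wget command line that limits the recursion depth, stays below the starting directory, and keeps only the searchable file types. The helper name, the default values, and the example URL and directory are assumptions made for illustration; the long options themselves are standard GNU Wget options.

```python
def build_wget_command(start_url, target_dir, depth=3,
                       accept=("html", "htm", "txt", "asp", "php"),
                       wait_seconds=1):
    """Return the argument list for a recursive, type-restricted Wget run."""
    return [
        "wget",
        "--recursive",                        # follow links from the start page
        f"--level={depth}",                   # limit the recursion depth
        "--no-parent",                        # never climb above the start directory
        f"--accept={','.join(accept)}",       # keep only these file suffixes
        f"--wait={wait_seconds}",             # pause between requests
        f"--directory-prefix={target_dir}",   # local directory for the mirror
        start_url,
    ]
```

For example, `build_wget_command("http://intranet.example/", "/var/spool/pages", depth=2)` yields a list that can be passed to `subprocess.run(..., check=True)` to start the collection. Building the list separately keeps the settings easy to inspect before anything is downloaded.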
2. Web data filtering
HTML files contain a large number of markup tags. These tags have no search value, so the collected data must be filtered before it is added to the database. Perl, a widely used scripting language, has a rich and powerful set of libraries that make this filtering easy. With the HTML-Parser library you can extract the text, the title, the links, and other data contained in a web page. The library can be downloaded from www.cpan.net, a site that hosts an extensive range of Perl programs, well beyond what is used here.
3. Directory service
A directory service is a service developed for retrieving large volumes of data. It first appeared in the X.500 protocol suite and was later carried over to TCP/IP as LDAP (Lightweight Directory Access Protocol). The relevant standards include RFC 1777, published in 1995, and RFC 2251, published in 1997. LDAP has become an industry standard and is widely used by Sun, Lotus, Microsoft, and other companies in their products, but dedicated directory servers for the Windows platform are rare. OpenLDAP is a free directory server that runs on UNIX systems. It performs very well, is included in a number of Linux distributions (Red Hat, Mandrake, and others), and provides development interfaces for C, Perl, PHP, and other languages.
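Before data reaches the directory server it has to pass the filtering step from section 2. The sketch below illustrates that step using Python's standard `html.parser` module as a stand-in for Perl's HTML-Parser library; it is not the article's own script, and the class name, function name, and returned field names are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class PageFilter(HTMLParser):
    """Collects the title, visible text, and link targets of one HTML page."""

    SKIPPED = {"script", "style"}  # their content is never searchable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def filter_page(html):
    """Strip the markup from one page and return its searchable fields."""
    parser = PageFilter()
    parser.feed(html)
    parser.close()

    def collapse(parts):
        # Join the fragments and squeeze runs of whitespace to one space.
        return " ".join(" ".join(parts).split())

    return {
        "title": collapse(parser.title_parts),
        "text": collapse(parser.text_parts),
        "links": parser.links,
    }
```

The returned fields are plain strings and a list, so they can be handed to whatever storage step follows.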