PHP Tutorial. Application Example 15: A search engine based on Linux.
A search engine is a tool that lets users find web page information quickly. Its main function is to take keywords entered by the user, search a back-end database of web pages, and return the links and summaries of the matching pages. Search scope generally falls into two kinds: search within a single site and search across the whole web. As the number of web pages has grown rapidly, search engines have become an essential means of finding information on the Internet, and most large websites already offer a web page search service. Companies that provide professional search engine services to large websites have also appeared, such as Google, and Baidu, which serves domestic websites such as Sina and 263. Professional search services are expensive, however, and free search engine software is mostly built around English-language retrieval, so neither suits the needs of an intranet environment (such as a campus network).
A search engine generally consists of three parts: a web page collection program, back-end organization and storage of the page data, and retrieval of the page data. The key factor that determines the quality of a search engine is the response time of its queries, that is, how a large volume of web page data is organized to support full-text search.
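To illustrate why the organization of page data drives query response time, here is a minimal sketch of an inverted index, a common structure for full-text search. It is not the storage scheme this article uses (that is the directory server described later). The function names and sample pages are hypothetical, and the tokenizer simply splits on word characters, so it would not segment Chinese text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map every word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical sample data standing in for collected pages.
    pages = {
        "http://www.example.edu/a.html": "Linux search engine",
        "http://www.example.edu/b.html": "Linux directory server",
    }
    index = build_index(pages)
    print(sorted(search(index, "linux server")))
```

Because a lookup touches only the entries for the query words rather than scanning every page, query time depends mainly on how many pages match, not on the total size of the collection.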
As an excellent network operating system, GNU/Linux bundles a large amount of network software, such as a web server (Apache + PHP), a directory server (OpenLDAP), a scripting language (Perl), and a web page collection program (Wget). By combining these, you can build a simple and efficient search engine server.
I. Basic components and usage
1. Web page data collection
Wget is an excellent web page collection program. It can easily mirror website content to a local directory, and it lets you flexibly configure the types of pages to collect, the recursion depth, the directory quota, and the collection time. Using a dedicated collection program not only reduces design difficulty but also improves system performance. To keep the local data small, collect only the file types that can be searched: HTML files, text files, and ASP and PHP script pages (only their default output is used). Image files and other data files are not collected.
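The collection settings described above can be sketched as a small helper that assembles a Wget command line. This is only an illustration under stated assumptions: the function name, the starting URL, the mirror directory, and the default depth and quota are hypothetical values, not taken from the article. The Wget options themselves (`--recursive`, `--level`, `--no-parent`, `--accept`, `--quota`, `--timestamping`, `--directory-prefix`) are standard.

```python
import shlex


def build_wget_command(url, dest_dir, depth=3,
                       accept=("html", "htm", "txt", "asp", "php"),
                       quota="100m"):
    """Return the argument list for a recursive, filtered Wget mirror run."""
    return [
        "wget",
        "--recursive",                      # follow links from the start page
        f"--level={depth}",                 # limit the recursion depth
        "--no-parent",                      # stay below the starting directory
        f"--accept={','.join(accept)}",     # keep only searchable file types
        f"--quota={quota}",                 # cap the total download size
        "--timestamping",                   # skip files unchanged since last run
        f"--directory-prefix={dest_dir}",   # local mirror directory
        url,
    ]


if __name__ == "__main__":
    # Hypothetical site and mirror directory.
    cmd = build_wget_command("http://www.example.edu/", "/var/spool/mirror")
    print(shlex.join(cmd))
    # To actually start the collection: subprocess.run(cmd, check=True)
```

The collection time is not a Wget option. Scheduling would normally be left to a tool such as cron, which runs the command at the chosen hour.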
2. Web page data filtering
Because HTML files contain a large number of markup tags, and this tag data has no search value of its own, the collected data must be filtered before it is added to the database. As a widely used scripting language, Perl has a powerful and rich set of libraries that make filtering web pages easy. With the HTML-Parser library you can conveniently extract the text, title, and link data contained in a page. The library can be downloaded from www.cpan.net, a site whose collection of Perl programs goes far beyond the scope of this article.
3. Directory service
A directory service is a service developed for retrieving large amounts of data. It first appeared in the X.500 protocol suite and was later adapted to TCP/IP as the Lightweight Directory Access Protocol (LDAP); the related standards are RFC 1777, published in 1995, and RFC 2251, published in 1997. LDAP has been widely adopted as an industry standard in products from companies such as Sun, Lotus, and Microsoft. Dedicated directory servers for the Windows platform are rare, however. OpenLDAP is a free directory server that runs on Unix systems; it performs well, is included in many Linux distributions (such as Redhat and Mandrake), and provides development interfaces for C, Perl, and PHP.
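To make the filtering step from section 2 concrete, the sketch below strips the markup from a page and keeps only its title, visible text, and links, which is the kind of record that would then be loaded into the directory. The article does this with Perl's HTML-Parser library; this example swaps in Python's standard `html.parser` module as a stand-in. The class name, function name, and record keys are hypothetical.

```python
from html.parser import HTMLParser


class PageFilter(HTMLParser):
    """Collect the title, visible text, and link targets of one page."""

    SKIP = {"script", "style"}  # elements whose content is not searchable text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._text = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if self._in_title:
            self.title += chunk
        elif chunk and not self._skip_depth:
            self._text.append(chunk)

    @property
    def text(self):
        return " ".join(self._text)


def filter_page(html):
    """Return the searchable parts of a page as a plain dictionary."""
    parser = PageFilter()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": parser.text, "links": parser.links}


if __name__ == "__main__":
    sample = ("<html><head><title>Campus News</title></head>"
              "<body><p>Hello <b>world</b></p>"
              "<a href='/a.html'>next</a></body></html>")
    print(filter_page(sample))
```

Keeping the title and links separate from the body text lets the retrieval side show a page title and URL in its results without parsing the page again.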