1. What is PHPdig?
PHPdig is a vertical search engine that is popular abroad (it is better described as a search technology that differs from traditional search engines than as a product). It is written in PHP and takes advantage of PHP's runtime efficiency, which improves search response speed. Like Google, Baidu, and other search engines, it can search the Internet. It can index txt, doc, xls, pdf, and other files, and offers strong content search and file parsing functions. Like traditional search engines, PHPdig rests on three basic technologies:
1. Spider Technology
2. Web structured information extraction technology or metadata collection technology
3. Word Segmentation and Indexing Technology
Unlike traditional search engines, PHPdig is suited to building more specialized, in-depth, personalized search engines. It is a strong choice for creating a vertical search engine for a particular field.
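To make the three technologies concrete, here is a minimal Python sketch of the same pipeline: a spider, an extractor for title, text, and links, and a word index. It is not PHPdig's code. The PAGES dictionary, the example.test URLs, and all function names are assumptions made up for illustration, and the pages are held in memory so the sketch runs without network access.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.test/": (
        "<html><head><title>Home</title></head>"
        '<body>Free software downloads <a href="http://example.test/qq">QQ</a></body></html>'
    ),
    "http://example.test/qq": (
        "<html><head><title>QQ page</title></head>"
        "<body>Chat software download</body></html>"
    ),
}


class PageExtractor(HTMLParser):
    """Stage 2: pull the title, visible text, and links out of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text.append(data)


def crawl(start):
    """Stage 1: follow links from the start page, visiting each URL once."""
    seen, queue, docs = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = PageExtractor()
        parser.feed(PAGES[url])
        docs[url] = (parser.title, " ".join(parser.text))
        queue.extend(parser.links)
    return docs


def build_index(docs):
    """Stage 3: split text into lowercase words and map each word to its URLs."""
    index = defaultdict(set)
    for url, (title, text) in docs.items():
        for word in re.findall(r"\w+", (title + " " + text).lower()):
            index[word].add(url)
    return index


docs = crawl("http://example.test/")
index = build_index(docs)
print(sorted(index["software"]))
```

The word-splitting step here relies on spaces between words, which is exactly the part that needs replacing for languages such as Chinese.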
2. How to obtain PHPdig?
PHPdig is free (its copyright terms still apply). The latest version is phpdig-1.8.9, but to avoid compatibility problems with Apache and MySQL versions, an earlier version is recommended. The website is http://www.phpdig.net, and the download page is http://www.phpdig.net/navigation.php?action=download. I tried phpdig-1.8.9 and ran into many problems; PHPdig-1.8.8 was less troublesome.
3. Steps
1. Get the product
Go to http://www.phpdig.net/navigation.php?action=download, download PHPdig-1.8.8 to the desktop, and decompress it into the Apache server's html directory, typically D:\usr\www\html\. (If Apache is not installed, install it first. We recommend Mappm-Server v1.1.9 Final, whose one-step, foolproof installer makes it easy to debug and run PHP/CGI and MySQL programs.)
2. Run and configure the PHPdig Database
Open your browser, enter http://localhost/phpdig/, and press Enter. The page lists all PHPdig files and folders. Find the default homepage file (default, index) and click search. The PHP file fails with the following error: Unable to connect to database: Check the connection script. This means the database connection could not be made, because the PHPdig database has not been configured yet. Go to the admin directory, find the install.php file, and click it to run it.
The interface is entirely in English (no PHPdig version currently supports a Chinese interface). That is not a problem: if you have experience localizing software into Chinese, you can translate it yourself. I provide my own translated cn-language.php file for download (copy it to the locales directory). In addition, you need to modify the config.php file (language and character settings) and the style.css file (font and style changes) under the includes directory.
After you open install.php, the system asks for the PHPdig administrator username and password; both default to admin. The following interface is then displayed (shown after Chinese localization):
(Figure 1)
The following information is required:
If you are testing locally, enter the default server name localhost (localhost is the default server name under Mappm-Server, that is, the server name of the MySQL database bundled with Mappm-Server). The database server port defaults to 3126 and can be left blank. The database socket field is empty by default. The default user name is root (the Mappm-Server default), and the password is the one you set when installing Mappm-Server. The PHPdig database name defaults to phpdig and can be changed freely. You can also add a prefix to the table names in the database; it is blank by default.
If you are deploying to a web server on the Internet, ask the hosting provider for the MySQL server name or IP address, database server port, socket, user name, and password. The database name and table prefix are set the same way as above.
Of the four radio buttons on the right, choose the default "create database" for a first-time installation; otherwise choose according to your situation.
After confirming that the information is correct, click the Install button. If the connection to the database fails, the error message "cannot connect to the Database" is displayed. If it succeeds, the system jumps directly to the management page, shown in Figure 2:
(Figure 2)
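The way the install form treats blank fields can be sketched in a few lines of Python. This is only an illustration of the rules described in this step, not PHPdig's install.php: the function names, the dictionary keys, and the "sites" table name are hypothetical.

```python
def connection_settings(host="localhost", port="", socket="", user="root",
                        password="", database="phpdig", prefix=""):
    """Normalise the install-form fields into one settings dict."""
    settings = {
        "host": host.strip() or "localhost",
        "user": user.strip(),
        "password": password,
        "database": database.strip() or "phpdig",
        "prefix": prefix.strip(),
    }
    # A blank port or socket means "let the database client use its default".
    if port.strip():
        settings["port"] = int(port)
    if socket.strip():
        settings["socket"] = socket.strip()
    return settings


def table_name(settings, base):
    """Apply the optional table prefix to a base table name."""
    return settings["prefix"] + base


# Local test install: only the password differs from the defaults.
local = connection_settings(password="secret")
# Hosted install: values supplied by the provider (hypothetical here).
hosted = connection_settings(host="db.example.test", port="3306",
                             user="site_user", password="secret",
                             database="site_db", prefix="pd_")
print(local)
print(table_name(hosted, "sites"))
```

A table prefix is mainly useful on shared hosting, where several applications may have to live in one database.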
3. Interface Introduction
Area 1 is a text input area. By default it contains three lines of text, each starting with http. Enter the address of the website you want to spider here (spidering only one website at a time is recommended).
Area 2 holds the spider options. Search depth is the number of directory levels the spider descends. Links per page is the maximum number of links the spider captures from a single page. Both default to 0, which means the whole site is spidered.
Area 3 displays database status information, including information on the spidered websites, keywords, and indexes.
Area 4 is a drop-down list of the URLs of spidered sites. Select a site there, then clear or update it in Area 5.
Area 5 provides the clear and update operations for the site selected in Area 4, along with links to related statistics and spider control.
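The two spider options in Area 2 can be illustrated with a small breadth-first crawl over an in-memory link graph. This sketch follows the description above and treats 0 as "no limit" for both options. It is an assumption about the behaviour, not PHPdig's actual spider code, and the LINKS graph and the spider function are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it, in order.
LINKS = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/c": [],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}


def spider(start, max_depth=0, links_per_page=0):
    """Breadth-first crawl; 0 for either option is treated as 'no limit'."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        links = LINKS.get(url, [])
        if links_per_page:
            links = links[:links_per_page]  # keep only the first N links
        for link in links:
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


print(spider("/"))                                 # whole site
print(spider("/", max_depth=1))                    # start page plus one level
print(spider("/", max_depth=2, links_per_page=1))  # narrow and shallow
```

Tightening either option shrinks the crawl, which is the practical way to keep a first test run short.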
4. Run spider for a specific site
Suppose you are interested in the content of the Tianji software channel. You can build a search engine dedicated to that content, and its coverage of the channel can be more comprehensive and deeper than Google's. Below, we use the Tianji software channel as an example of how to spider a website.
1) Enter http://soft.yesky.com in Area 1 of Figure 2, and leave search depth and links per page at their default of 0.
2) Click the spider button. The page jumps to the spider information page, and the program automatically starts spidering the content of http://soft.yesky.com.
Note: Spidering a website is very slow. If the site has a lot of content, the process may last from several hours to a day, but you do not need to worry about the script timing out, because the system's timeout is set to a maximum of 48 hours. During the process you can interrupt the spider and later resume a site that was not spidered completely. If you accidentally close the spider page, the spider does not actually stop and keeps consuming system resources. Reopen the spider page and click the stop spider link to release them.
(Figure 3)
5. Search Using PHPdig
After a while, the spider has captured information from http://soft.yesky.com into the server database, mainly page title, keyword, and page address information. You can then search it through search.php.
(Figure 4)
You can choose how many entries are shown in the search results and whether to use fuzzy or exact search. You can also restrict the search to a single site; by default, all spidered sites are searched.
(Figure 5)
Figure 5 shows the search results page for "QQ2006".
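The search options described in this section can be sketched against a toy keyword index. In this Python sketch, "exact" means the keyword equals the query and "fuzzy" is assumed to mean the query appears anywhere inside a keyword; PHPdig's own matching rules may differ. The INDEX contents, the a.test and b.test URLs, and the search function are hypothetical.

```python
# Hypothetical keyword index: keyword -> set of page URLs.
INDEX = {
    "qq2006": {"http://a.test/1"},
    "qq": {"http://a.test/2"},
    "qq2006beta": {"http://b.test/1"},
}


def search(index, query, mode="exact", limit=10, site=None):
    """Return up to `limit` matching URLs, optionally restricted to one site."""
    query = query.lower()
    hits = set()
    for word, urls in index.items():
        # Exact: keyword equals the query. Fuzzy (assumed): query is a substring.
        matched = (word == query) if mode == "exact" else (query in word)
        if matched:
            hits |= urls
    if site:
        hits = {url for url in hits if url.startswith(site)}
    return sorted(hits)[:limit]


print(search(INDEX, "QQ2006"))                              # exact match only
print(search(INDEX, "QQ2006", mode="fuzzy"))                # also matches qq2006beta
print(search(INDEX, "qq", mode="fuzzy", site="http://b.test/"))
```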
6. Existing Problems
Because of PHPdig's language settings, its word segmentation, and MySQL's character handling, searching for Chinese words in PHPdig is still unreliable in many cases. These issues need further work. You are welcome to discuss them in the PHPdig community.
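The word segmentation part of the problem can be seen in a short Python sketch. Splitting on spaces works for English but leaves a Chinese phrase as one long token, so a search for a word inside it finds nothing. Overlapping two-character tokens (bigrams) are one common workaround. This is an illustration of the general technique, not something PHPdig does, and the function names are hypothetical.

```python
import re

CJK = r"[\u4e00-\u9fff]"


def whitespace_tokens(text):
    """Split on spaces only, as a segmenter designed for English would."""
    return text.split()


def cjk_bigrams(text):
    """Split each run of CJK characters into overlapping two-character tokens."""
    tokens = []
    for run in re.findall(CJK + r"+|[A-Za-z0-9]+", text):
        if re.match(CJK, run):
            if len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens


text = "天极软件频道 QQ2006"
print(whitespace_tokens(text))  # the Chinese phrase stays one token
print(cjk_bigrams(text))        # "软件" (software) becomes searchable
```

Bigrams enlarge the index and can produce false matches, so a dictionary-based segmenter is usually preferable where one is available.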