Search engine core technology revealed

Source: Internet
Author: User
Editor's note: This is an excellent programming tutorial. It not only analyzes in detail how a search engine works, but also shares the author's own ideas for building one in PHP. The article is written in plain language, and both experts and beginners should find plenty of inspiration in it.

When it comes to web search engines, most people think of Yahoo. Yahoo did usher in the era of internet search. However, the technology Yahoo currently uses to search the web was not developed by the company itself. In August 2000, Yahoo adopted the technology of Google (www.google.com), a startup founded by students at Stanford University. The reason is simple: Google's search engine finds the required information faster and more accurately than the technology Yahoo used before.

Designing and building a powerful, efficient search engine and its database ourselves is not feasible in a short time, given the technology and money required. But since Yahoo itself relies on someone else's technology, can't we also make use of an existing search engine site?

Analysis of programming ideas

The idea is this: simulate a query by sending a correctly formatted request to a search engine site, receive the search results, analyze the returned HTML, strip the unneeded characters and code, and finally display what remains on our own web page in the format we want.

The key is to choose a search engine whose results are accurate (so that our search is meaningful), fast (because analyzing and redisplaying the results takes extra time), and concise (so that the HTML source is easy to analyze and strip). Google, a new-generation search engine, has all of these qualities, so we use it as the example here and look at how PHP can query Google (www.google.com) in the background and present the results in a customized way in the foreground.

Let's look first at how a Google query is composed. Go to www.google.com, enter "abcd" in the search box, and click the search button. The browser's address bar changes to "http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=", which shows that Google passes the query parameters and submits the query through a form using the GET method. We can use PHP's file() function to simulate this query.

Understanding the file() function

Syntax: array file(string filename);

file() reads an entire file into an array and returns that array. The file can be local or remote; a remote file must include the protocol. For example: $result = file("http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr="); This statement simulates searching Google for "abcd" and returns the results in the array variable $result, with each line of the page as one element. Because the file being read is remote, the protocol prefix "http://" must not be omitted.
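The line-per-element behavior of file() is easy to check without touching the network. The following is a minimal sketch that reads a temporary local file; the file name and its contents are made up for illustration. Reading a remote URL works the same way, provided the allow_url_fopen setting is enabled in the PHP configuration.

```php
<?php
// Write a small sample file so the example does not depend on the network.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "first line\nsecond line\nthird line\n");

// file() returns an array with one element per line; by default each
// element keeps its trailing newline.
$lines = file($path);

// FILE_IGNORE_NEW_LINES drops the trailing newline from every element.
$clean = file($path, FILE_IGNORE_NEW_LINES);

echo count($lines), "\n";  // prints 3
echo $clean[1], "\n";      // prints "second line"

unlink($path);
```

Because each element keeps its newline by default, joining the array with an empty separator reproduces the original text exactly.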

To let the user search for arbitrary keywords, we can create a text input box and a submit button, and replace the hard-coded "abcd" above with a variable:
<?php
echo '<form method="get"><input type="text" name="keywords"><input type="submit" value="Search"></form>';

if (isset($_GET['keywords'])) // Set only after the form is submitted, so the code below runs only after submission
{
$keywords = urlencode($_GET['keywords']); // URL-encode the user's input
$result = file("http://www.google.com/search?q=" . $keywords . "&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=");
// Substitute the variable into the query and store the returned page in the array $result
$result_string = join("", $result); // Merge the array $result into a single string
... // further processing
}
?>

The program above can now query whatever the user types, and the returned page is merged into the string variable $result_string. Note that the user's input must be URL-encoded with urlencode() so that non-ASCII characters, spaces, and other special characters are queried correctly. Imitating Google's own query format as closely as possible helps ensure correct search results.
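To see why the encoding step matters, here is a small sketch of the URL construction on its own. The helper name build_search_url() is hypothetical and not part of the original program; the query parameters are copied from the URL shown earlier in the article.

```php
<?php
// Hypothetical helper: builds the query URL from raw user input.
function build_search_url(string $keywords): string
{
    // urlencode() turns spaces into "+" and escapes characters such as "&"
    // that would otherwise be read as the start of a new query parameter.
    return 'http://www.google.com/search?q=' . urlencode($keywords)
         . '&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=';
}

echo build_search_url('php & mysql'), "\n";
// prints http://www.google.com/search?q=php+%26+mysql&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=
```

Without the encoding, the "&" in the input would cut the q parameter short and the search would run on "php " only.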

An analysis of Google

To keep things simple, assume that what we need from each search result is its title, URL, summary, and so on, which is a simple and typical requirement. What we have to do, then, is remove the header and footer of Google's results page (the Google logo, the repeat-search input box, the description of the search results, and so on), strip the original HTML formatting tags from the remaining results, and replace them with the format we want.

To do this, we must study the HTML source of Google's results page carefully and look for regularities. It is not hard to see that the body of the search results always lies between the first <p> tag and the second-to-last <p> tag in the source, and that the second-to-last <p> tag is immediately followed by the characters "table". This combination, "<p><table", therefore marks where the footer begins.

All of the following code goes, in order, into the "further processing" part of the program above.

$result_string = strstr($result_string, "<p>"); // Keep $result_string from the first <p> onward, removing the Google header
$position = strpos($result_string, "<p><table"); // Find the position of the "<p><table" combination
$result_string = substr($result_string, 0, $position); // Keep the string before the first "<p><table", removing the footer
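The header and footer stripping can be tried on a hand-written page. The sketch below assumes the markers are "<p>" and "<p><table", as the surrounding description suggests; the helper name extract_results_body() and the sample HTML are hypothetical. It also guards against the case where a marker is missing, in which strstr() or strpos() returns false.

```php
<?php
// Hypothetical helper: keeps only the part of the page between the first
// "<p>" and the first "<p><table".
function extract_results_body(string $html): string
{
    // Drop everything before the first <p> (the header).
    $body = strstr($html, '<p>');
    if ($body === false) {
        return ''; // no <p> at all: nothing that looks like results
    }

    // Cut at the "<p><table" marker (the footer), if it is present.
    $position = strpos($body, '<p><table');
    if ($position === false) {
        return $body;
    }
    return substr($body, 0, $position);
}

// Made-up sample page: header, two results, then a footer table.
$sample = '<html><img src="logo.gif"><p>Result one<p>Result two'
        . '<p><table><tr><td>footer</td></tr></table></html>';

echo extract_results_body($sample), "\n";  // prints <p>Result one<p>Result two
```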

Application and implementation

Now that we have the useful core of the HTML source, the remaining question is how to display it our own way. Looking at these search results, we find that the entries are also very regular: they are separated by <p> tags, that is, each entry is its own paragraph. Using this feature, we cut the entries apart with the explode() function:

Syntax: array explode(string separator, string string);

Returns an array containing the substrings of string obtained by splitting it at each occurrence of separator.

So:
$result_array = explode("<p>", $result_string); // Split the results on the string "<p>"

We now have an array $result_array in which each element is one search result entry. All that remains is to study each entry and its HTML formatting code and replace it as required. The following loop processes each entry in $result_array:
for ($i = 0; $i < count($result_array); $i++) {
... // handle each entry
}
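The split-and-loop step can be sketched on a made-up string as follows, assuming the entries are separated by "<p>". One detail worth knowing is that explode() yields an empty first element when the string starts with the separator, so this sketch filters empty pieces out before looping.

```php
<?php
// Made-up results body: three entries, each introduced by <p>.
$result_string = '<p>First entry<p>Second entry<p>Third entry';

// The string starts with the separator, so the first piece is empty.
$result_array = explode('<p>', $result_string);

// Drop empty pieces and renumber the array from 0.
$result_array = array_values(array_filter(
    $result_array,
    static function (string $entry): bool {
        return trim($entry) !== '';
    }
));

for ($i = 0; $i < count($result_array); $i++) {
    echo ($i + 1), ': ', $result_array[$i], "\n";
}
// prints:
// 1: First entry
// 2: Second entry
// 3: Third entry
```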

For each entry, some features are easy to spot: each entry consists of a title, a summary, a description, a category, and a URL, and each part sits on its own line, ending with a <br> tag. So we split again (the following code goes inside the loop above):
$every_item = explode("<br>", $result_array[$i]);

This gives us an array $every_item in which $every_item[0] is the title and $every_item[1] and $every_item[2] are the two summary lines. For $every_item[3], $every_item[4], and so on, a part that begins with "Introduction:" or "<font size=-1 color=#6f6f6f>Category:</font>" is a description or a category (some entries lack these parts), and a part that begins with "<font color=green>" is certainly the URL. Checks like these are usually done with regular expressions (omitted here). Replacement is just as convenient. For example, the title $every_item[0] is itself a link, and we would like to modify the link's attributes so that it opens in a new window:
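echo eregi_replace('<a href', '<a target="_blank" href', $every_item[0]); // show the title, with its link opening in a new window
for ($j = 1; $j < count($every_item); $j++) {
... // handle each item in the entry except the first (the first item is the title, already shown)
... // more format modification
}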

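This changes the link's attributes; many other display formats can likewise be modified, stripped, or replaced with the regular-expression function eregi_replace(). (eregi_replace() was removed in PHP 7; preg_replace() with the i modifier does the same job.)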
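As a rough illustration of the per-entry handling in current PHP, the sketch below splits one made-up entry on "<br>", rewrites the title link with preg_replace(), and labels the remaining parts by the markers described above. The sample entry, the label names, and the exact pattern are assumptions for illustration, not the author's original code.

```php
<?php
// Made-up entry in the shape the article describes.
$entry = '<a href="http://example.com/">Example title</a><br>'
       . 'A short summary line<br>'
       . '<font size=-1 color=#6f6f6f>Category:</font> Computers<br>'
       . '<font color=green>example.com/</font>';

$every_item = explode('<br>', $entry);

// Title: add target="_blank" so the link opens in a new window.
// The /i modifier makes the match case-insensitive, like eregi_replace().
$title = preg_replace('/<a\s+href/i', '<a target="_blank" href', $every_item[0]);

// Label each remaining part by the marker it contains.
$labels = [];
for ($j = 1; $j < count($every_item); $j++) {
    if (stripos($every_item[$j], '<font color=green>') !== false) {
        $labels[$j] = 'url';
    } elseif (stripos($every_item[$j], 'Category:') !== false) {
        $labels[$j] = 'category';
    } else {
        $labels[$j] = 'summary';
    }
}

echo $title, "\n";  // prints <a target="_blank" href="http://example.com/">Example title</a>
print_r($labels);   // 1 => summary, 2 => category, 3 => url
```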

So far we have extracted every part of each search result and can change the format of each part, even placing the results in a nicely designed table. A good program, however, should adapt to a variety of operating conditions, and this one is no exception. What we have discussed is only a framework for stripping the HTML of the search results. To do the job properly there is much more to consider, such as how many results to show and how many pages to split them into. You can even strip out Google-specific code such as "Category" and "Introduction" so that visitors never see traces of the original site. All of these requirements can be met by parsing and stripping the HTML in the same way. Now you can try it yourself and build a richly customized search engine.
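The paging question mentioned above can be handled once the entries are in an array. This is one possible sketch using array_chunk(); the entry values, the page size, and the variable names are placeholders rather than anything from the original program.

```php
<?php
// Placeholder entries standing in for the parsed search results.
$entries = ['r1', 'r2', 'r3', 'r4', 'r5'];
$per_page = 2;

// array_chunk() splits the list into pages of at most $per_page entries.
$pages = array_chunk($entries, $per_page);

// Pick a page by its 1-based number; fall back to an empty page if the
// number is out of range.
$page = 2;
$current = $pages[$page - 1] ?? [];

echo count($pages), "\n";            // prints 3
echo implode(', ', $current), "\n";  // prints r3, r4
```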

Excerpt from: http://tech.163.com/tm/010228/010228_15747.html
Author: Maxid
