Search Engine Technology: Core Secrets Revealed (PHP)

Author: Sand Rain

Editor's note: This is an excellent programming tutorial. It analyzes in detail how search engines work and also presents the author's ideas for writing a search engine in PHP. The article explains its subject in simple terms, and both experts and beginners should find it useful.

When it comes to web search engines, most people think of Yahoo. Indeed, Yahoo created the era of internet search. However, the technology Yahoo currently uses to search the web was not developed by the company itself. In August 2000, Yahoo adopted the technology of Google (www.google.com), a startup founded by Stanford University students. The reason is simple: Google's search engine finds the required information faster and more accurately than the technology Yahoo used before.

Designing and developing a powerful, efficient search engine and its database ourselves is probably not feasible in a short time, whether technically or financially. But since Yahoo itself uses someone else's technology, can't we also make use of an existing search engine site?

Analysis of programming ideas

The idea is this: simulate a query by sending a search command in the appropriate format to a search engine site, receive the returned search results, analyze their HTML code, strip out the superfluous characters and markup, and finally display the results in the required format on a page of our own website.

The key to this approach is choosing a search engine site that is accurate (so that our search is meaningful), fast (because analyzing and reformatting the results takes additional time), and concise in its result pages (so that the HTML source is easy to analyze and strip). Google, a new-generation search engine, has all of these qualities, so we use it as our example and look at how PHP can query Google (www.google.com) in the background and display the results in our own style in the foreground.

Let's first look at how Google's query commands are composed. Go to www.google.com, enter "abcd" in the search box, and click the search button. The browser's address bar changes to "http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=". Evidently, Google uses a form with the GET method to pass the query parameters and submit the query. We can use PHP's file() function to simulate this query process.
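To make the structure of that GET request concrete, here is a minimal sketch that splits the query URL quoted above into its parameters. It uses only the standard parse_url() and parse_str() functions; the variable names are chosen for illustration and are not from the original program.

```php
<?php
// The query URL observed in the browser's address bar (taken from the article).
$url = 'http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=';

// parse_url() extracts the part after "?"; parse_str() splits it into
// name => value pairs and URL-decodes each value.
$query = parse_url($url, PHP_URL_QUERY);
parse_str($query, $params);

foreach ($params as $name => $value) {
    // Re-encode for display, because the btnG value holds raw non-ASCII bytes.
    echo $name, ' = ', rawurlencode($value), "\n";
}
```

Only the q parameter carries the search keywords; the others (btnG, hl, lr) are sent along unchanged in the rest of the article.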

Understanding the file() function

Syntax: array file(string filename);

The return value is an array: the entire file is read into the array variable, one line per element. The file can be local or remote, and a remote file must specify the protocol used. For example, the statement $result = file("http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr="); simulates querying the word "abcd" on Google and returns the search results in the array variable $result, with each line as one element. The protocol name "http://" cannot be omitted, because the file being read is remote.
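The line-per-element behavior of file() can be checked without network access. The following is a small sketch that reads a temporary local file; the file contents and variable names are made up for the demonstration. For remote URLs, file() additionally requires the allow_url_fopen setting to be enabled, and it returns false when the read fails.

```php
<?php
// Create a small temporary file so the example runs without network access.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "first line\nsecond line\nthird line\n");

// file() returns one array element per line; each element keeps its
// trailing newline unless FILE_IGNORE_NEW_LINES is passed.
$lines   = file($path);
$trimmed = file($path, FILE_IGNORE_NEW_LINES);

echo count($lines), " lines read\n";
echo $trimmed[1], "\n";

unlink($path); // clean up the temporary file
```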

To let users search for arbitrary keywords, we can build a text input box and a submit button, and replace the search word "abcd" above with a variable:
<?php
echo '<form>'; // a form with no attributes: defaults to the GET method and submits to itself
echo '<input type="text" name="keywords">'; // text input box
echo '<input type="submit" value="Query">'; // submit button
echo '</form>';

if (isset($_GET['keywords'])) // keywords exists only after the form is submitted, so the code below runs only then
{
    $keywords = urlencode($_GET['keywords']); // URL-encode the user input
    $result = file("http://www.google.com/search?q=" . $keywords . "&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=");
    // substitute the variable into the query URL and save the query results in the array $result
    $result_string = join("", $result); // merge the array $result into one string, with nothing between the elements
    // ... further processing
}
?>

The program above can now run a query from user input and combine the returned result into the string variable $result_string. Note that the user input must be URL-encoded with the urlencode() function so that spaces and other special characters are queried correctly; this reproduces Google's query command as faithfully as possible and ensures that the search results are correct.
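The effect of urlencode() is easiest to see on input that contains spaces and an ampersand. The sketch below wraps the URL construction in a helper; build_search_url() is a hypothetical name introduced here for illustration and does not appear in the original program.

```php
<?php
// Hypothetical helper (not from the article): builds the query URL that
// the program above passes to file().
function build_search_url(string $keywords): string
{
    // urlencode() turns spaces into "+" and other special characters
    // into %XX escapes, so they cannot break the query string.
    return 'http://www.google.com/search?q=' . urlencode($keywords)
         . '&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=';
}

echo build_search_url('php & mysql'), "\n";
```

Without the encoding step, the "&" in the input would be read as the start of a new parameter and the query would be cut off after "php".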

Analyzing Google's results

To keep things simple, let's assume that all we really need from the search results are the titles, URLs, and descriptions, which is a concise and typical requirement. What we have to do, then, is remove the header and footer from Google's result page (a Google logo, a search-again input box, and a description of the search results), strip the original HTML formatting tags from the remaining results, and replace them with the format we want.

To do this, we must carefully analyze the HTML source of Google's result pages and find the patterns in it. It is not hard to see that the body of the search results always lies between the first <p> tag and the penultimate <p> tag of the source, that the penultimate <p> tag is immediately followed by a table tag, and that the combination "<p><table" occurs only once in the source. Using this feature, we can remove Google's header and footer.

All of the code below belongs in the "further processing" part of the program above.

$result_string = strstr($result_string, "<p>"); // keep the string from the first <p> onward, which removes the Google header
$position = strpos($result_string, "<p><table"); // find the position of the "<p><table" marker
$result_string = substr($result_string, 0, $position); // keep the string before the "<p><table" marker, which removes the footer
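The header and footer trimming can be tried on a small stand-in string. The sketch below is an illustration under stated assumptions: the sample markup is made up (real Google pages differ), and strip_header_and_footer() is a hypothetical helper name. It also guards the case where a marker is missing, since strstr() and strpos() both return false when nothing is found.

```php
<?php
// Made-up stand-in for a result page; real Google markup differs.
$result_string = '<html><body>LOGO and search box'
    . '<p>Result one<br>summary one'
    . '<p>Result two<br>summary two'
    . '<p><table><tr><td>footer</td></tr></table></body></html>';

// Hypothetical helper: returns only the result entries.
function strip_header_and_footer(string $html): string
{
    // Drop everything before the first <p> (the header).
    $body = strstr($html, '<p>');
    if ($body === false) {
        return ''; // no result entries found
    }
    // Cut at the "<p><table" marker (the footer), if it is present.
    $position = strpos($body, '<p><table');
    return $position === false ? $body : substr($body, 0, $position);
}

$entries_html = strip_header_and_footer($result_string);
echo $entries_html, "\n";
```

The explicit false checks matter: passing a false position straight to substr() would silently produce an empty string.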

Application and implementation

Now that we have the useful core of the HTML source, the remaining question is how to display its content in our own way. If we analyze the search result entries, we find that they are separated very regularly: each entry is its own paragraph, introduced by a <p> tag. Using this feature, we can cut the entries apart with the explode() function:

Syntax: array explode(string separator, string string);

Returns an array whose elements are the substrings obtained by splitting the string at each occurrence of the separator.

So:
$result_array = explode("<p>", $result_string); // cut the results apart at the string "<p>"

We get an array $result_array in which each element is one search result entry. All we have to do now is study each entry and its HTML formatting code, and then replace that code as required. The following loop processes each entry in $result_array.
for ($i = 0; $i < count($result_array); $i++) {
    // ... process each entry
}

Looking at a single entry, some features are easy to spot: each entry consists of a title, a summary, a description, a category, a URL, and so on, and each part ends with a line break, that is, a <br> tag. So we split again (the following statement goes inside the loop above):
$every_item = explode("<br>", $result_array[$i]);
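The two levels of splitting can be sketched on made-up data as follows. The sample markup and the $entries variable are assumptions for illustration; the entry and part layout of real result pages will differ. Note that explode() yields an empty first element when the string starts with the separator, so that element is skipped.

```php
<?php
// Made-up trimmed result text; real markup differs.
$result_string =
      '<p><a href="http://a.example/">Title A</a><br>Summary A<br><font color=green>a.example</font>'
    . '<p><a href="http://b.example/">Title B</a><br>Summary B<br><font color=green>b.example</font>';

$entries = [];
foreach (explode('<p>', $result_string) as $entry) {
    if (trim($entry) === '') {
        continue; // empty piece before the leading <p>
    }
    // Second split: one element per <br>-terminated part of the entry.
    $entries[] = explode('<br>', $entry);
}

foreach ($entries as $i => $every_item) {
    // strip_tags() removes the link markup and leaves the title text.
    echo $i, ': ', strip_tags($every_item[0]), ' (', count($every_item), " parts)\n";
}
```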

This gives us an array $every_item, in which $every_item[0] is the title and $every_item[1] and $every_item[2] are the two summary lines. For $every_item[3], $every_item[4], and so on, the beginning of the element tells us what it is: if it starts with "<font size=-1 color=#6f6f6f>Profile:</font>" or "<font size=-1 color=#6f6f6f>Category:</font>", it is a description or a category (some result entries do not have these items), and if it starts with "<font color=green>", it is certainly the URL. Comparisons of this kind are usually done with regular expressions (omitted here), which also make replacement convenient. For example, $every_item[0], which holds the title, is itself a link, and we would like to modify the link's attributes so that it opens in a new window:
echo eregi_replace(' ... {
    // ... process each item of the entry except the first (the title, which has already been displayed)
    // ... more formatting changes
}

This modifies the link attributes. Most of the remaining display formatting is modified, stripped, or replaced in the same way, using the regular expression replacement function eregi_replace().
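As an illustration of the link rewriting described above, here is a minimal sketch. It is not the author's original statement: eregi_replace() was removed in PHP 7.0, so the sketch uses preg_replace() with the case-insensitive /i flag instead, and open_links_in_new_window() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: adds target="_blank" to every <a> tag so that
// links open in a new window.
function open_links_in_new_window(string $html): string
{
    // The /i flag makes the match case-insensitive, as eregi_replace() was.
    return preg_replace('/<a\s/i', '<a target="_blank" ', $html);
}

$title = '<A href="http://www.example.com/">Example</A>';
echo open_links_in_new_window($title), "\n";
```

The pattern requires whitespace after "<a", so other tags that merely begin with the letter a (such as <address>) are left untouched.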

At this point we have every item of every search entry, and we can change the format of each item at will, or even lay the items out in a neat table. A good program should adapt to a variety of operating environments, and this one is no exception. What we have discussed here is only a framework for splitting the HTML of the search results. To do the job thoroughly, there is much more to consider, such as showing the total number of search results and dividing them into pages. You can even strip out Google-specific markup such as "Category" and "Profile", so that visitors do not see the original site. All of these requirements can be met by analyzing and stripping the HTML in the same way. With this, anyone can build a highly personalized search engine of their own.

