PHP Development: Search Engine Technology Tutorial

Source: Internet
Author: User
Tags: format, contains, explode, header, query, variable, urlencode

Designing and building a powerful, efficient search engine and its database from scratch is rarely feasible in a short time, whether technically or financially. But since Yahoo itself relies on other companies' search technology, why can't we also make use of an existing search engine site?

Analysis of programming ideas

The idea is this: we simulate a query by sending a search command in the appropriate format to a search engine site, receive the search results it returns, analyze the HTML code of those results, strip out the extra characters and code, and finally display what remains on our own web page in whatever format we need.

With this approach, the key is to choose a search engine that is accurate (so that our own search is meaningful), fast (because analyzing and redisplaying the results takes extra time), and concise in its output (so the HTML source is easy to analyze and strip). Google, a new-generation search engine, has all of these qualities, so we use it as our example to see how to run the Google search in the background with PHP and present the results in our own style in the foreground.

Let's first look at how Google's query commands are composed. Go to the Google website, type "abcd" in the search box, and click the search button. The browser's address bar changes to "http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-cn&lr=". Evidently, Google passes the query parameters and submits the query through a form using the GET method. We can use PHP's file() function to simulate this query.
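The structure of that query string can be sketched in PHP as follows. This is a minimal illustration, not the article's own code: the helper name build_search_url() is hypothetical, and the btnG parameter (the encoded label of the search button) is left out on the assumption that it is not needed to show how the URL is assembled.

```php
<?php
// Hypothetical helper (not from the original article): assembles a search URL
// with the same parameters seen in the browser's address bar.
function build_search_url(string $keywords): string
{
    $params = [
        'q'  => $keywords, // the search terms
        'hl' => 'zh-cn',   // interface language, as in the URL above
        'lr' => '',        // left empty, as in the URL above
    ];
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('abcd'), "\n";
// prints http://www.google.com/search?q=abcd&hl=zh-cn&lr=
```

Because the query travels in the URL itself, any script that can request a URL can reproduce what the browser's form does.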

Understanding the file() function

Syntax: array file(string filename);

The return value is an array: the entire file is read into it, one line per element. The file can be local or remote, and a remote file must specify the protocol to use. For example, $result = file("http://www.google.com/search?q=a ... mp;hl=zh-cn&lr="); simulates querying Google for the word "abcd" and returns the results page in the array variable $result, with each line of the page as one element. The protocol name "http://" cannot be omitted because the file being read is remote.
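The line-per-element behaviour of file() is easy to check on a local file. The sketch below writes a temporary file instead of contacting a remote site, so it runs without network access; the file contents are made up for the demonstration. Reading a remote URL the same way additionally requires the allow_url_fopen setting to be enabled.

```php
<?php
// Write three lines to a temporary file so the example needs no network access.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "first line\nsecond line\nthird line\n");

// file() returns one array element per line. Each element keeps its trailing
// newline unless FILE_IGNORE_NEW_LINES is passed.
$lines = file($path, FILE_IGNORE_NEW_LINES);
unlink($path); // remove the temporary file

echo count($lines), "\n"; // prints 3
echo $lines[1], "\n";     // prints second line
```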

If we want users to be able to search for any term, we can add a text input box and a submit button, and replace the search term "abcd" above with a variable:

<?php
echo '<form>'; // a form with no attributes: the default method is GET and it submits to this same page
echo '<input type="text" name="keywords">'; // builds a text input box
echo '<input type="submit" value="Search">'; // builds a submit button
echo '</form>';

// After submission PHP makes the input available as $_GET['keywords']
// (older PHP with register_globals exposed it as $keywords), so the code below runs only after a submit.
if (isset($_GET['keywords']))
{
    $keywords = urlencode($_GET['keywords']); // URL-encode the user's input
    // substitute the variable into the query and save the results in the array $result
    $result = file("http://www.google.com/search?q=" . $keywords . "&btnG=Google%CB%D1%CB%F7&hl=zh-cn&lr=");
    $result_string = join(" ", $result); // merge the array $result into one string, with a space between elements
    // ... further processing
}
?>

The program above can already run a query from user input and merge the returned result into the string variable $result_string. Note that urlencode() must be applied to the user's input; otherwise spaces and other special characters in the query will not be transmitted correctly. Encoding keeps our request as close as possible to a real Google query, which ensures that the search results are correct.
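A short sketch of what the encoding does, using a made-up search phrase that contains spaces and an ampersand. Without encoding, the "&" would be read as the start of a new parameter; parse_str() is used here only to confirm that the encoded value decodes back to the original input.

```php
<?php
$keywords = 'php & mysql tutorial';  // sample input, chosen for this demonstration
$encoded  = urlencode($keywords);    // spaces become "+", "&" becomes "%26"
echo $encoded, "\n";                 // prints php+%26+mysql+tutorial

// Decode the query string again to confirm that the value survives intact.
parse_str('q=' . $encoded . '&hl=zh-cn', $parsed);
echo $parsed['q'], "\n";             // prints php & mysql tutorial
```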

Analyzing Google's results

To keep things simple, let's assume that all we really need from each search result is its title, URL, and summary, which is a concise and typical requirement. All we have to do, then, is remove the header and footer from Google's results page (the Google logo, the search-again input box, and the description of the search results), strip the original HTML formatting tags from the remaining results, and replace them with the format we want.

To do this, we must carefully study the HTML source of a Google results page and find its regularities. It is not hard to see that the body of the search results always lies between the first <p> tag in the source and the second-to-last occurrence of another tag, that this second-to-last tag is immediately followed by a table character, and that this combination can therefore be used to find where the footer begins.

All of the code that follows belongs at the "further processing" placeholder in the program above.

$result_string = strstr($result_string, "<p>"); // keep the string from the first <p> onward, removing the Google header
// $footer_marker is a name assumed here; it must hold the tag-plus-table-character combination described above
$position = strpos($result_string, $footer_marker); // position of that combination
$result_string = substr($result_string, 0, $position); // keep only the part before it, removing the footer
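The header-and-footer stripping can be tried on a small sample string. Everything in this sketch is an assumption made for illustration: the sample page, the function name strip_header_footer(), and above all the two markers ("<p>" for the start of the results and "<!--footer-->" for the start of the footer), which stand in for whatever the real page uses. Unlike the fragment above, it also checks for the case where a marker is not found.

```php
<?php
// Sample page invented for this demonstration.
$page = '<html><img src="logo.gif"><form>search again</form>'
      . '<p>Result one<p>Result two'
      . '<!--footer--><table><tr><td>Next page</td></tr></table></html>';

// Hypothetical helper: keeps only the text from the first $start marker
// up to (not including) the first $footer marker that follows it.
function strip_header_footer(string $html, string $start, string $footer): string
{
    $body = strstr($html, $start);   // everything from the first $start onward
    if ($body === false) {
        return '';                   // start marker missing: nothing to keep
    }
    $pos = strpos($body, $footer);   // where the footer begins
    if ($pos === false) {
        return $body;                // footer marker missing: keep the rest
    }
    return substr($body, 0, $pos);   // drop the footer
}

echo strip_header_footer($page, '<p>', '<!--footer-->'), "\n";
// prints <p>Result one<p>Result two
```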

Application and implementation

Now that we have the useful core of the HTML source, the remaining question is how to display its content in our own way. Looking at the search result entries, we find that they too are delimited very regularly: each entry is a paragraph, separated from the next by a <p> tag. We can therefore use the explode() function to cut out each entry:

Syntax: array explode(string separator, string string);

Returns an array holding the pieces obtained by splitting the string at each occurrence of the separator.

So:

$result_array = explode("<p>", $result_string); // split the result on the string "<p>"

We now have an array $result_array in which each element is one search result entry. All that remains is to study each entry and its HTML formatting code and replace it as required. The following loop processes each entry in $result_array:

for ($i = 0; $i < count($result_array); $i++) {
    // ... process each entry
}
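The splitting and looping steps can be sketched on a short sample string. The sample text and the "<p>" and "<br>" separators are assumptions for this demonstration. One detail worth noting: when the string begins with the separator, explode() returns an empty first element, so the sketch filters out empty entries before looping.

```php
<?php
// Sample result body invented for this demonstration.
$result_string = '<p>First title<br>First summary<p>Second title<br>Second summary';

// Split into entries, then drop the empty element produced by the leading "<p>".
$result_array = explode('<p>', $result_string);
$result_array = array_values(array_filter($result_array, function ($entry) {
    return trim($entry) !== '';
}));

foreach ($result_array as $i => $entry) {
    $every_item = explode('<br>', $entry); // split one entry into its parts
    echo $i, ': ', $every_item[0], ' / ', $every_item[1], "\n";
}
// prints:
// 0: First title / First summary
// 1: Second title / Second summary
```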

Within each entry we can easily spot further regularities: an entry consists of a title, a summary, an introduction, a category, a URL, and so on, and each part ends with a line break, that is, a <br> tag. So we split again (the following code goes inside the loop above):

$every_item = explode("<br>", $result_array[$i]);

This gives us an array $every_item, where $every_item[0] is the title and $every_item[1] and $every_item[2] are the two lines of the summary. For $every_item[3], $every_item[4], and so on, look at how the element begins. If it starts with "Introduction:" or "<font size=-1 color=#6f6f6f>Category:</font>", it is an introduction or a category (some result entries do not have these items). If it starts with "<font color=green>", it is certainly the URL. Comparisons like these are usually done with regular expressions (omitted here), which also make replacement convenient. For example, the title in $every_item[0] is itself a link, and we may want to modify the link's attributes so that it opens in a new window:

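// preg_replace() with the i flag stands in for eregi_replace(), which was removed in PHP 7
echo preg_replace('/<a /i', '<a target="_blank" ', $every_item[0]);
for ($j = 1; $j < count($every_item); $j++) {
    // ... process each item of the entry, skipping the first (the title, already displayed)
    // ... more formatting changes
}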

This modifies the link attributes. Most of the remaining display changes (modifying, stripping, and replacing formatting) can likewise be done with regular-expression replacement: eregi_replace() in older PHP, preg_replace() in PHP 7 and later.
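The link rewriting and the prefix checks on each item might look like the sketch below. The function names open_in_new_window() and classify_item() are hypothetical, the sample strings are invented, and preg_replace() is used in place of eregi_replace() because the latter no longer exists in current PHP. The prefixes tested are the ones quoted in the text above.

```php
<?php
// Hypothetical helper: adds target="_blank" to every opening <a ...> tag,
// matching the tag name case-insensitively.
function open_in_new_window(string $html): string
{
    return preg_replace('/<a\s/i', '<a target="_blank" ', $html);
}

// Hypothetical helper: guesses what kind of item this is from how it begins.
function classify_item(string $item): string
{
    $item = ltrim($item);
    if (stripos($item, '<font color=green>') === 0) {
        return 'url';
    }
    if (stripos($item, 'Introduction:') === 0) {
        return 'introduction';
    }
    if (stripos($item, '<font size=-1 color=#6f6f6f>Category:</font>') === 0) {
        return 'category';
    }
    return 'summary'; // anything else is treated as summary text
}

echo open_in_new_window('<A href="http://example.com/">Example</A>'), "\n";
// prints <a target="_blank" href="http://example.com/">Example</A>
echo classify_item('<font color=green>example.com/</font>'), "\n";
// prints url
```

Plain prefix tests with stripos() are enough for fixed markers like these; regular expressions become worthwhile once the markers vary.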

So far we have obtained every item of every search entry and can change the format of each one at will, even lay them out in a nice table. However, a good program should adapt to a variety of situations, and this one is no exception. What we have discussed is only a framework for splitting the HTML of the search results. A complete solution has to consider much more, such as showing the total number of results and splitting them across multiple pages. You can even strip out Google-specific code such as the "Category" and "Introduction" items so that visitors do not see the original site. All of these can be handled by analyzing the HTML in the same way. Now you can try it yourself and build a highly personalized search engine.


