Baidu and Google: Search Engine Principles

Source: Internet
Author: User
Tags: html, page, website, server

Section 1: Search Engine Principles

1. Basic Concepts

From the Chinese Wikipedia: a (web) search engine is a system that automatically collects information from the Internet, organizes it in a certain way, and provides it for users to query.
From the English Wikipedia: "Web search engines provide an interface to search for information on the World Wide Web. Information may consist of web pages, images and other types of files."

2. Classification

Depending on how they work, search engines can be divided into two basic categories: full-text search engines and catalog directories.

A catalog directory is built by manually collecting and organizing site data into a database; examples include Yahoo China and the domestic category directories of Sohu, Sina, and NetEase. Some navigation sites on the Internet, such as "Home of Websites" (http://www.hao123.com/), can also be counted as primitive catalogs.

A full-text search engine automatically analyzes the hyperlinks of web pages, obtains page content by following those hyperlinks and analyzing HTML code, and organizes an index according to predesigned rules for users to query.

The difference between the two can be summed up in a nutshell: a catalog builds an index of sites manually, while full-text search builds an index of pages automatically. (People often compare search engines to database retrieval, which is actually wrong.)

3. How Full-Text Search Works

A full-text search engine generally comprises three parts: information collection, indexing, and searching. In more detail, it can be broken down into five components: the finder (crawler), the parser, the indexer, the user interface, and others.

(1) Information collection (web crawling): Information collection is carried out by the finder and the parser. Search engines use programs variously called web crawlers, web spiders, or robots to follow the hyperlinks on web pages.

To explain further: "robots" are web-based programs that collect HTML pages by requesting them from websites. A robot traverses the web space within a specified range, moving from one page to another and from one site to another, adding the collected pages to the page database. Whenever the robot encounters a new page, it searches all of its internal links; in theory, given an appropriate initial set of pages, the robot can traverse all the links reachable from that set and capture the entire web space.

Many open-source robots can be found in open-source communities on the Internet.
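
To make the traversal concrete, here is a minimal sketch of such a robot in Python, using only the standard library. It is illustrative only: real crawlers add politeness delays, robots.txt handling, deduplication, and far better error handling.

```python
# A minimal breadth-first "robot" using only the Python standard library.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    """Traverse the web from an initial page set into a page database."""
    queue = deque(seed_urls)
    page_db = {}  # url -> html: the "web page database"
    while queue and len(page_db) < max_pages:
        url = queue.popleft()
        if url in page_db:
            continue  # already collected
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; skip it
        page_db[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        # every internal link of a newly met page is queued for traversal
        queue.extend(urljoin(url, link) for link in parser.links)
    return page_db
```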

Key point 1: The core lies in HTML analysis, so rigorous, structured, readable, error-free HTML code is more easily analyzed and collected by robots. For example, a page with a malformed <body tag, or with no </body> at all, is harder to parse.
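
As a contrived illustration (neither fragment comes from any real site), the first snippet below is the kind of malformed markup this key point warns about, and the second is its well-formed equivalent:

```html
<!-- Broken: malformed <body tag, unclosed <p>, no </body> -->
<html><head><title>My page</title></head>
<body <p>Hello</html>

<!-- Well-formed equivalent -->
<html><head><title>My page</title></head>
<body><p>Hello</p></body></html>
```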

Key point 2: The search robot maintains a dedicated library of visited links. When it meets a hyperlink it has already seen, it compares the content and size of the old and new pages; if they are identical, the page is not collected again. So worrying about whether a revised page will still be included is unnecessary.
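
A rough sketch of the old-versus-new comparison just described, assuming a simple size-plus-digest check (real engines likely use more refined fingerprints such as shingling or simhash):

```python
import hashlib


def should_recollect(old_page: str, new_page: str) -> bool:
    """Skip re-collection when size and content digest both match."""
    if len(old_page) != len(new_page):
        return True  # size differs: the page was revised
    old_digest = hashlib.sha256(old_page.encode()).hexdigest()
    new_digest = hashlib.sha256(new_page.encode()).hexdigest()
    return old_digest != new_digest  # recollect only if content changed
```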

(2) Indexing: The process by which a search engine organizes information is called indexing. A search engine must not only store the collected information but also arrange it according to certain rules. The index can live in a large general-purpose database such as Oracle or Sybase, or be stored in a custom file format. Indexing is one of the more complex parts of search, involving page structure analysis, word segmentation, sorting, and other techniques; a good index can greatly improve retrieval speed.
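
As a minimal illustration of "arranging information according to certain rules", here is a toy inverted index in Python. Whitespace splitting stands in for real word segmentation, which is much harder in practice, especially for Chinese:

```python
from collections import defaultdict


def build_index(page_db):
    """Toy inverted index: maps each word to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in page_db.items():
        for word in text.lower().split():  # crude stand-in for segmentation
            index[word].add(url)
    return index
```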

Key point 1: Although search engines now support incremental indexing, building an index still takes considerable time, and engines update their indexes only periodically. So even after a crawler has visited a page, there may be an interval before the page can actually be searched.

Key point 2: Index quality is an important measure of how good or bad a search engine is.

(3) Searching: The user sends a query to the search engine, which accepts the query and returns data to the user. Some systems compute and evaluate the relevance of each page before returning results and rank by relevance, putting the most relevant pages first. Others compute a page-level score (PageRank) for every page in advance of any query; when results are returned, pages with a higher page level are placed ahead of pages with a lower one.
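
Continuing the toy example, here is a query over the inverted index built above, ranked by a crude relevance score (how many query words each page matches). Real engines combine many more signals:

```python
def search(index, query):
    """Return pages ordered by how many query words they contain."""
    scores = {}
    for word in query.lower().split():
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1
    # larger relevance first, like the relevance-ordered systems described
    return sorted(scores, key=scores.get, reverse=True)
```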

Key point 1: Different search engines use different ranking rules, so searching for the same keyword on different engines yields different orderings.


Section 2: How Baidu's Search Engine Works

What I know about Baidu search: because of my job, I have been fortunate to use Baidu's Blackstone enterprise search engine (that department has since been disbanded, mainly because Baidu's strategy moved closer to Google's: it no longer sells search engines separately, only search services). According to Baidu's sales staff, the Blackstone search core is the same as that of the main web search, possibly just a slightly older version, so I have reason to believe the two work in similar ways. Here are some simple notes and points to watch:

1. About the site update frequency

Baidu Search can set an update frequency and schedule per site. Large websites are generally updated quickly, with dedicated, independent crawlers tracking them, and Baidu is fairly diligent: small and medium sites are generally updated daily. So if you want your site updated faster, it is best to have links to it in a large directory (such as Yahoo, Sina, or NetEase), to have hyperlinks to your site on Baidu's own related sites, or to have your site linked from large sites, for example from blogs hosted on major portals.

2. About crawl depth

Baidu Search can configure the crawl depth, meaning it will not necessarily retrieve your entire site; it may index only the content of your home page, especially for small sites.

3. About collection of frequently unreachable sites

Baidu treats site outages specially: once it finds a site unreachable, especially a small or medium one, it automatically stops sending crawlers there. So choosing a good server and keeping the site reachable 24 hours a day is very important.

4. About sites that change IP address

Baidu Search can crawl by domain name or by IP address. If by domain name, it automatically resolves the name to the corresponding IP address, which creates two problems. First, if your site shares an IP address with other sites and one of them is punished by Baidu, your site gets implicated. Second, if you change your IP address, Baidu will find that your domain no longer matches the previous IP and may refuse to send crawlers to your site. So do not change IP addresses casually; if possible, use a dedicated IP, and keep the site stable.

5. About collecting static versus dynamic sites

Many people worry that pages with URLs like asp?id= are hard to collect while plain HTML pages are easy. In fact, the situation is not that bad: most current search engines support crawling and retrieving dynamic sites, including sites that require login, so there is no need to worry that a dynamic website will go unrecognized; Baidu Search's support for dynamic pages can be customized. Still, generate static pages where possible. Also note that most search engines remain helpless with script jumps (JS), frames, Flash hyperlinks, and dynamic pages containing illegal characters.

6. About index deletion

As mentioned earlier, the index must be built. In a typical good search engine, the index is a set of text files rather than a database, so deleting a single record from it is not a convenient thing. Baidu, for example, needs dedicated tools to manually delete an index record; according to Baidu employees, a team is responsible for exactly this: receive a complaint, delete the record, by hand. Of course, all indexes matching a rule can be deleted directly, for example all the indexes under a given site. There is also a mechanism (unverified) whereby outdated pages and cheat pages (mainly pages whose titles and keywords do not match their content) are dropped when the index is rebuilt.

7. About deduplication

Baidu's deduplication is not as ideal as Google's: it mainly looks at an article's title and source, and as long as those differ it will not automatically deduplicate, so there is no need to worry that republished content identical to other sites will quickly be punished. Google is different: not many pages with identical titles get included.

One more sentence: do not imagine that search engines are all that smart; they basically follow fixed rules and formulas. To avoid being punished by a search engine, just steer clear of breaking those rules.


Section 3: Google's Search Ranking Technology

In search, Google is stronger than Baidu, mainly because Google is more impartial, while Baidu involves many human factors (which also fits our national conditions). Google's impartiality comes from its ranking technology, PageRank.

Many people know PageRank as a quality level for a website: the higher, the better. PageRank is in fact computed by a specific formula; when we search for a keyword on Google, pages with a higher PageRank appear nearer the front. No human intervenes in this formula, so it is fair.

PageRank's original idea came from the management of academic papers: every paper ends with a list of references, and if an article is cited by many different papers, it can be considered an excellent article.

Similarly, put simply, PageRank makes an objective evaluation of a web page's importance. It does not merely count direct links; rather, it interprets a link from page A to page B as a vote cast by page A for page B, and evaluates page B's importance by the votes it receives. Moreover, PageRank weighs the importance of each voting page itself: because some pages are considered high-value, the pages they link to gain higher value in turn.
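
Here is a toy implementation of the standard published PageRank formula, PR(p) = (1 - d)/N + d · Σ PR(q)/out(q) over the pages q linking to p. It illustrates the voting idea, not whatever Google runs in production today:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a {page: [outgoing links]} graph."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform score
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: a simplification in this toy
            share = damping * rank[page] / len(targets)
            for target in targets:  # a link is a vote for its target
                new_rank[target] += share
        rank = new_rank
    return rank


# Example: B and C both "vote" for A, so A ends up the most important page.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```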

The PageRank formula is omitted here; the main factors affecting PageRank are:

1. The number of hyperlinks pointing to your website (how often your site is referenced by others): the larger this number, the more important and popular your site is, since it means other sites link to or recommend yours;

2. The importance of the sites that link to yours: a hyperlink from a high-quality site suggests that your site is excellent too;

3. Page-specific factors: the page's content, title, and URL, that is, the page's keywords and where they appear.


Section 4: How New Sites Should Respond to Search

The following is a summary based on the analysis above:

1. Why might a search engine not include your site? The following are possibilities (not absolute; circumstances differ):

(1) Your pages are an island with no links pointing to them; without hyperlinks to you from included sites, search engines cannot find you;
(2) The pages use types search engines cannot recognize (such as Flash, JS jumps, some dynamic pages, frames, etc.);
(3) Your server has been punished by the search engine, which refuses to include content from the same IP;
(4) The server's IP address was recently changed, and the search engine needs some time to re-crawl;
(5) The server is unstable, frequently down, or cannot withstand the load of crawling;
(6) The page code is poor, so the crawler cannot correctly parse the page content; please learn at least basic HTML syntax, and XHTML is recommended;
(7) The website uses the robots protocol (robots.txt) to refuse search engine crawling (see the sample file after this list);
(8) The pages use keyword cheating: keywords severely mismatched with the content, or some keyword densities too high;
(9) The pages contain illegal content;
(10) The same website has a large number of pages with identical titles, or page titles with no real meaning.
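
For item (7): the robots exclusion file lives at the site root as robots.txt. A minimal example (the directory name is a placeholder) that blocks all crawlers from one path while leaving the rest of the site open:

```
User-agent: *
Disallow: /private/
```

Writing `Disallow: /` instead would refuse crawling of the whole site, which is exactly item (7)'s scenario.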

2. What should a new site do? (for reference only)

(1) Exchange links with excellent websites;
(2) Submit your site widely to the directory listings of the major websites;
(3) Post in high-quality forums, and make the posts substantive; ideally do not merely reply, and leave your website address in what you post;
(4) Open a blog on the big sites (Sina, NetEase, CSDN) and promote your own site there;
(5) Use a good site-building program, preferably one that generates static pages and automatically generates keywords;
(6) Pay attention to the title of each page, as well as to the keywords and description in its <meta> tags.

For example, "an open source Jabber (XMPP)-based solution for internal instant messaging services";

Title section: <title>A solution for building an internal instant messaging service based on open-source Jabber (XMPP) - CSDN Blog</title>
Keywords section: <meta name="keywords" content="… installation, …">
Description section: <meta name="description" content="Jabber is a well-known instant messaging server; it is free, open-source software that lets users set up their own instant messaging server, deployable on the Internet or on a LAN.">

XMPP (Extensible Messaging and Presence Protocol) is a protocol based on Extensible Markup Language (XML) and used for instant messaging (IM) and online presence detection. It facilitates near-real-time operation between servers. The protocol may eventually allow Internet users to send instant messages to anyone else on the Internet, even across different operating systems and browsers. XMPP's technology comes from Jabber; in fact it is Jabber's core protocol, so XMPP is sometimes mistakenly called Jabber. Jabber is an IM application built on the XMPP protocol, and XMPP supports many applications besides Jabber.
