Basic Working Principles of Search Engines

Source: Internet
Author: User
Tags: web, database

A search engine works in three basic stages. First, it discovers and collects web pages from the Internet, extracts information from them, and builds an index. Next, it takes the keywords the user enters, quickly looks up matching documents in the index, and evaluates how relevant each document is to the query. Finally, it sorts the results by relevance and returns them to the user.
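The index-and-retrieve stages can be illustrated with a small inverted index. This is only a minimal sketch, not how any particular search engine works: the whitespace tokenizer, the sample pages under example.com, and the scoring rule (sum of term counts) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(pages):
    """Map each term to {url: number of occurrences of the term on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][url] = count
    return index


def search(index, query):
    """Return URLs that contain every query term, highest total term count first."""
    terms = tokenize(query)
    if not terms:
        return []
    # Look up the postings for each term without creating new index entries.
    postings = [index.get(term, {}) for term in terms]
    # Keep only pages that match all query terms.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Score = sum of term counts; ties are broken by URL for a stable order.
    scored = [(sum(posting[url] for posting in postings), url) for url in candidates]
    return [url for _, url in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical sample pages standing in for crawled documents.
pages = {
    "http://example.com/a": "search engine index search",
    "http://example.com/b": "web crawler follows links",
    "http://example.com/c": "search index",
}
index = build_index(pages)
print(search(index, "search index"))
```

A real engine would replace the term-count score with a relevance function that also weighs keyword position and link quality, but the lookup structure is the same idea.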

1 Working principle

1. Crawling web pages. Each independent search engine has its own web crawler (spider). The spider follows hyperlinks in web pages, moving from one site to another and reaching more pages through the links it finds. A crawled page is stored as a web snapshot. Because hyperlinks are used so widely on the Internet, a crawler that starts from a limited set of pages can, in theory, collect most pages on the Web.

2. Processing web pages. After fetching a page, the search engine must do a good deal of preprocessing before it can provide a retrieval service. The most important steps are extracting keywords and building the index. Other steps include removing duplicate pages, word segmentation (for Chinese), determining page types, analyzing hyperlinks, and calculating the importance and richness of pages.

3. Providing search services. The user enters keywords, and the search engine finds the pages in the index database that match them. To help the user judge the results, it shows a summary taken from the page and other information in addition to the page title and URL.

2 Search engine

As mentioned in the section on search engine classification, a full-text search engine builds its web database from information it extracts from websites. Search engines collect this information automatically in two ways. The first is the regular crawl: at regular intervals (for Google, generally every 28 days), the search engine sends out a "spider" program to search the websites within a certain range of IP addresses. When it finds a new site, it automatically extracts the site's information and URL and adds them to its database.
The second is the submitted-site crawl: the site owner submits the site's URL to the search engine, which within a certain period (from 2 days to several months) sends a "spider" program to the site, scans it, and stores the relevant information in the database for users to query. Because search engine indexing rules have changed a great deal, submitting a URL does not guarantee that the site will enter the search engine's database. The best approach is therefore to obtain external links, which give search engines more chances to find the site and include it automatically.

When a user searches by keyword, the search engine looks in its database. If it finds sites that match the user's request, it applies a special algorithm, usually based on how well the page matches the keywords, where and how often they occur, link quality, and so on, to calculate each page's relevance and ranking. It then returns the links to the user in order of relevance.

3 Directory index

A directory index differs from a full-text search engine in several ways.

First, a search engine retrieves sites automatically, while a directory index depends entirely on manual work. After a user submits a site, a directory editor personally browses it and decides whether to accept it, based on a set of in-house criteria or even on the editor's subjective impression.

Second, when a search engine indexes a site, the site is generally accepted as long as it does not violate the relevant rules. A directory index places much higher demands on a site, and even repeated submissions may not succeed. Getting listed is especially difficult for very large directories such as Yahoo!.
In addition, when submitting a site to a search engine, we generally do not need to consider how the site is classified, whereas a site submitted to a directory index must be placed in the most appropriate category. Finally, a search engine extracts the information about a site automatically from the site's own pages, so from the site owner's point of view there is more autonomy. A directory index requires the site information to be filled in by hand and imposes various restrictions. What is more, if the staff consider the category or the site information you submitted inappropriate, they can change it at any time, without consulting you first.

A directory index, as the name implies, files sites under the appropriate categories, so when looking for information users can either search by keyword or browse down through the category levels. A keyword search returns results much as a search engine does, ordered by how relevant the site information is, but with more human factors involved. When browsing by category, the order of the sites within a category is determined by the alphabetical order of their titles (with exceptions).

Search engines and directory indexes are now tending to merge and borrow from each other. Some formerly pure full-text search engines now also offer directory search; Google, for example, uses the Open Directory to provide category queries. Established directory indexes such as Yahoo! extend their search scope by working with Google and other search engines. In the default search mode, some directory-style search engines first return matching sites from their own directory, such as the Chinese portals Sohu, Sina, and NetEase, while others, such as Yahoo!, default to web search.
New Competitiveness, after an in-depth study of the principles of search engine marketing, holds that search engine promotion is promotion based on website content; this is the core idea of search engine marketing. The statement is simple, but careful analysis shows that it does capture the general law of search engine promotion. In the article "The content popularization thought of website promotion strategy", the author argues: "The website content is not only the life source of the large-scale ICP website, but also is of vital importance to the net marketing effect of the enterprise website." Website content is itself an effective means of website promotion, but this kind of promotion depends on the search engine as an information retrieval tool. The website content promotion strategy is therefore, in practice, a concrete application of the search engine promotion strategy.

4 Baidu and Google

Query processing and word segmentation

With the rise of the search economy, people have begun to pay more attention to the performance, technology, and daily traffic of the world's major search engines. Businesses decide whether to place advertisements based on a search engine's popularity and daily traffic. Ordinary users choose their preferred engine for finding information based on its performance and technology. Technical staff take representative search engines as objects of study. The rise of the search engine economy has shown once again the huge business opportunities hidden in the network. Without search, the Internet is nothing but empty, cluttered data and a lot of gold mines waiting to be mined. But how do you design an efficient search engine? We can use Baidu's techniques to explore how a practical search engine might be designed.
A search engine involves many technical points, such as query processing, ranking algorithms, page crawling algorithms, caching, and anti-spam measures. Commercial search engine providers such as Baidu and Google do not make these technical details public. We can treat an existing search engine as a black box: submit inputs to it, and roughly infer the unknown technical details inside from the outputs it returns.

Query processing and word segmentation are essential for a Chinese search engine, and Baidu, as a typical Chinese search engine, has always emphasized that its "Chinese processing" has key technologies and advantages that other search engines lack. So let us look at the so-called core technologies Baidu uses. We cover them in two parts: query processing and Chinese word segmentation.

Query processing

After a user submits a query, a search engine generally processes it in some way before extracting the relevant information from the index database. So what does Baidu do after accepting a user query?

1. Suppose the user submits more than one query string, such as "information retrieval theory tool". The search engine first splits the query into sub-query strings at delimiters such as spaces and punctuation. The query above is split into three substrings. This step is simple, so let us move on.

2. What if the submitted query contains repeated content? For the query "theory tool theory", Baidu treats the repeated string as if it occurred only once, so the query is handled as equivalent to "theory tool". Google evidently does not merge the repeats; instead it increases the weight of the repeated substring. How do we reach this conclusion?
We can submit "theory tool" to Baidu, which returns 341,000 documents, and glance over the first page of results. Next, we submit the query "theory tool theory" and look at the results. The number of returned documents is still the same. That alone does not tell us much, so compare the ordering of the first page of results: on Baidu the order has not changed at all, while on Google the ordering does change. This suggests that Baidu merges the repeated query strings into one and largely ignores the order of the strings (Google does take order into account).

3. How does a search engine handle a Chinese query that contains English words? For the query "movie bt download", Baidu keeps the English string inside the Chinese text as a single unit and uses it as a breakpoint that cuts the Chinese text apart, so the query above is cut at the English string. The English in the middle is treated as a whole whether it is a word found in an English dictionary or a string of random characters. To see why, check the results for "movie dfdfdf download". The same applies when the query contains numbers.

5 Core of optimization

1. Keep the site's program structure as concise as possible and remove fancy code; where possible, load it through JS calls instead. This is very important: search engine optimization and user experience are closely connected, and cumbersome code not only slows page loading but also puts users under heavy pressure, so it is not advisable.

2. Attend to the details of on-site SEO. To a large extent this means simplifying and revising the program structure. The details are static URLs, titles, keywords, and descriptions, although search engines no longer pay any attention to the keywords tag.

3. Attend to every aspect of off-site SEO.
This includes link exchanges and how ordinary external links should be built and managed. Do not look for shortcuts such as mass-posting external links or buying a large number of high-weight external links in one go; these are mistaken SEO ideas and strategies. We generally recommend conventional, traditional SEO.

4. Every SEO practitioner should recognize that the fundamental purpose of search engine optimization is to win users, so user experience cannot be ignored. We earn rankings on search engines because we provide valuable content to our users. Whether you are a webmaster or a professional SEO practitioner, you should therefore develop in the direction of users, products, and services rather than limit yourself to SEO. This is very important.

6 SEO optimization

Site URL

Giving a website well-described, standardized, simple URLs makes it easier for users to remember them and judge page content, and helps search engines crawl the site more effectively. URLs should be planned sensibly at the start of site design. Handling:

1. Use only canonical URLs within the system, so that users never encounter non-canonical URLs.
2. Do not put session IDs, statistics codes, or other unnecessary content in URLs.
3. Redirect the different forms of a URL to the canonical form with a 301 permanent redirect.
4. Enable alternate domain names in case users type the wrong one, and 301-redirect them permanently to the primary domain name.
5. Use robots.txt to prevent Baiduspider from crawling forms of a URL that you do not want shown to users.

Title information

The title of a web page tells users and search engines the main content of the page. When a user searches in Baidu web search, the title is the most important content shown in the summary.
When a search engine judges the weight of a page's content, the title is one of its main references. Suggested formats:

1. Home page: site name, or site name_description of the service or product offered.
2. Channel page: channel name_site name.
3. Article page: article title_channel name_site name.

Notes:

1. The title should have a clear subject and contain the most important content of the page.
2. Keep it concise; do not list information unrelated to the page content.
3. Users usually read from left to right, so put important content at the front of the title.
4. Use language familiar to users. If the site has both a Chinese and an English name, use the one users know best in the title.

Meta information

The meta description is part of the META tag and is located in the head section of the page's HTML.

Picture alt

It is recommended to add ALT text to pictures. When the connection is slow and a picture cannot be displayed, the ALT text lets users understand what the picture conveys, and it also lets search engines understand the picture's content. In the same way, when a picture is used for navigation, ALT text can tell the search engine what content the linked page contains.

Flash information

Baiduspider can read only text content. It cannot yet process Flash, pictures, and other non-text content, so Baidu cannot recognize text placed inside Flash or pictures. If you must use Flash, it is recommended that you add descriptive information to the OBJECT tag. This information is treated as a description of the Flash content and helps search engines understand it better.

Frames

Using frame and IFRAME structures is not recommended; content displayed through an IFRAME may be discarded by Baidu.
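The URL handling rules in the Site URL part of this section (one canonical form, no session IDs or statistics codes, a 301 redirect from alternate domains) might be sketched as follows. This is an illustrative sketch only: the host names, the list of dropped parameters, and the helper names `canonical_url` and `redirect_for` are hypothetical, and a real site would adapt them to its own URL scheme.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# All of these values are assumptions chosen for the example.
PRIMARY_HOST = "www.example.com"
ALTERNATE_HOSTS = {"example.com", "example.net"}
DROPPED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url):
    """Return the canonical form of a URL under the assumptions above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Map alternate domain names onto the primary domain name.
    if host in ALTERNATE_HOSTS:
        host = PRIMARY_HOST
    # Keep an explicit port if one was given (any userinfo is dropped).
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Drop session and statistics parameters; keep the rest in their original order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DROPPED_PARAMS
    ]
    # An empty path is normalized to "/", and fragments are discarded.
    return urlunsplit((parts.scheme, host, parts.path or "/", urlencode(query), ""))


def redirect_for(url):
    """Return (301, canonical URL) when a redirect is needed, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://example.com/news?id=7&sessionid=abc"))
print(redirect_for("http://www.example.com/news?id=7"))
```

In practice the redirect would be issued by the web server or application framework; the function here only decides whether one is needed and where it should point.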
