Lesson Two Notes: Search Engine Basics and Working Principles

Source: Internet
Author: User

Hello everyone. I specialize in SEO, and for several months I have been maintaining and optimizing the massage list site www.yziyuan.com, which has taught me a good deal. Today I want to share "search engine basics and working principles", the most fundamental concepts in SEO.

Part One: What is a search engine?

1, Definition

Official definition:

A search engine is a system that, following a certain strategy, uses specific computer programs to collect information from the Internet, organizes and processes that information, provides a retrieval service, and displays the information relevant to each query to the user. Baidu and Google are representative search engines.

My understanding is:

A search engine collects the content of target sites according to its own rules so that, when a user searches, it can show the user the content they want. The tool that carries out this service is what we call a search engine.

2, Classification

(1) Full-text indexing:

A full-text search engine extracts information (mainly page text) from websites across the whole Internet with its own retrieval programs, commonly known as "spider" or "robot" programs, and builds its own database; search results are served directly from that database. It retrieves the records that match the user's query and returns them in a certain order. Full-text search is what today's mainstream search engines use: Google is the representative abroad, and Baidu is the well-known one in China.

SEO should focus on this type of search engine, and in particular on:

Keyword match, keyword position, keyword frequency, and link quality.

The reason: when a user searches by keyword, the search engine looks in its database. If it finds pages that match the user's request, it uses a special algorithm, usually based on the factors listed above, to calculate the relevance and ranking of each page, and then returns the links in order of relevance. This kind of engine is characterized by a high recall rate.

(2) Directory index

Users find the information they need by browsing a category directory and do not have to rely on keywords. Although a directory index has a search function, strictly speaking it is not a real search engine, only a list of site links organized by category. The most representative directory indexes are Yahoo, the Sina category directory, and hao123.

(3) Meta search engine

A meta search engine accepts a user's query, searches on multiple search engines, and returns the results to the user. Well-known meta search engines include InfoSpace, Dogpile, and Vivisimo; there are Chinese-language meta search engines as well. As for how results are ordered, some list results by source engine, such as Dogpile, while others re-rank the results according to their own rules, such as Vivisimo.

(4) Vertical search engine

Vertical search engines have risen gradually since 2006. Unlike general web search engines, vertical search focuses on a specific field and specific needs (e.g., ticket search, travel search, lifestyle search, novel search, video search) and offers a better user experience within that field. Compared with the thousands of retrieval servers a general search engine needs, vertical search has low hardware costs, well-defined user needs, varied query methods, and high accuracy.

(5) Collection search engine: This type is similar to a meta search engine. The difference is that it does not query multiple search engines at the same time; instead, the user chooses from several search engines on offer, as in the search engine HotBot launched at the end of 2002.

(6) Portal Search engine

MSN Search, for example, has neither its own directory nor its own web page database; its search results come entirely from other search engines.

(7) Free link list

Free-for-all (FFA) link lists generally just scroll through link entries. A few have simple categories, but they are much smaller in scale than directory indexes such as Yahoo.

Summary: SEO cannot be separated from search engines; in a sense, SEO is a game played with the search engine. Doing SEO does not require writing code or understanding every technical detail of a search engine, but knowing the basics benefits our optimization work. Only when we understand these basic concepts can we optimize a site with confidence.

Part Two: The History of Search Engines

The Internet is developing at great speed, and the resources on the network are far beyond what people can imagine or control; without search engines we simply could not find what we want. In particular, social networks such as Facebook, Twitter, and Weibo, along with mobile applications, are booming, and in number of users, site traffic, and social influence they now far exceed Yahoo, Google, and other earlier Internet giants. So what does that have to do with SEO? Wherever there is a network there is search, and wherever there is search there is SEO.

So what is the history of the search engine, and how does knowing it help us optimize websites? Can we not do SEO without understanding how search engines developed? We can, but understanding that history helps us do website optimization better.

I will not go through the history of search engines in detail here; interested readers can look it up with a search engine.

Here is a look at the growth in the value of search engines:

These data illustrate several points:

(1) The search market is still growing at high speed, which is a great opportunity and a gold mine for people doing SEO.

(2) The growth of search engines shows that a large part of search engine companies' revenue comes from online advertising, of which SEM still accounts for a high share. SEO targets natural search rankings and can achieve the same value without spending a lot of money.

(3) Other kinds of search are getting closer and closer to users, so there are more and more places for SEO to show its talents.

(4) Given the competition among platforms, public attention, and the emphasis on business integrity, the future is also good news for companies doing SEO.

Summary:

From the development of search engines we can easily judge how important SEO will be in the future. Understanding the history of search engines helps SEO practitioners understand the development and transformation of SEM more deeply, and helps us grasp where things are heading. Only by keeping pace with the times can we keep making progress. Search engines develop quickly, and that is good for SEO: wherever there is search there are rankings, and wherever there are rankings SEO techniques apply. What we have to do is keep watching these changes so that we can make the most of SEO.

As we all know, the Internet is developing extremely fast, and the value of search engines has soared along with it. Why do we need this kind of search technology, and where did it come from? Take a library as an example. A library is a treasure house, but as its books and documents grow over time, finding and managing them inevitably becomes difficult. What can be done? With directory management, all the files in the library can be managed in an orderly way. The principle of our search engines originates from this traditional document retrieval technology. So how does a search engine actually work? Read on:

Part Three: The working principle of search engines

The working principle of a search engine can be divided into three stages:

(1) Crawling and fetching:

Search engine spiders visit site pages by following links and store the HTML code of those pages in their own database.

Crawling and fetching is the first step of a search engine's work; it mainly completes the data collection task.

Several key terms explained:

1, Spider:

(1) Definition: The spider is what I call the executor that fetches web data. It is actually a computer program; because the way it works resembles a real spider, professionals call it a search engine spider.

(2) Working process: The spider program sends an access request for a web page, the server returns the HTML code, and the spider program stores the received code in the original page database. When a spider visits any site, it first requests the robots.txt file in the site's root directory. If robots.txt prohibits search engines from fetching certain files or directories, the spider complies and does not fetch the banned URLs.

(3) Common search engine spider names:

Baidu spider, Yahoo China spider, Google spider, Microsoft Bing spider, Sogou spider, Soso spider, Youdao spider, and so on.
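The robots.txt check described in the working process can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration: the robots.txt content, the URLs, and the spider name `ExampleSpider` are made-up assumptions, and real spiders fetch the file over the network rather than from a string.

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real spider would download it from
# the root of the site before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""


def allowed_urls(robots_txt, urls, user_agent="ExampleSpider"):
    """Return only the URLs that robots.txt allows this spider to fetch."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://example.com/index.html",
        "http://example.com/admin/login",
    ]
    print(allowed_urls(ROBOTS_TXT, candidates))
```

A well-behaved spider would run a check like this before every fetch and skip anything that is disallowed.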

2, Following links

We all know that the whole Internet is made up of web pages connected to one another by links. To collect site data faster, search engines have their spiders follow the links on each page, crawling from one page to the next, just as a spider moves across its web. This is how spiders can quickly crawl the pages of the entire Internet.

Depending on the link structure of a site, a spider's crawling route can be divided into two types: depth-first crawling and breadth-first crawling.

A: Depth-first crawling: The spider follows a link from the page it has found and keeps going forward until there are no further links ahead, then returns to the first page and crawls forward along another link.

B: Breadth-first crawling: When the spider finds multiple links on a page, it does not follow one link all the way forward. Instead it crawls all the first-level links on the page, then follows the links found on those second-level pages to the third-level pages, and so on.

Therefore, when we build a site, its structure should support both crawling routes, and when we optimize pages we should lay out links for both as well. Search engine spiders like this kind of structure.
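The two crawling routes can be sketched on a tiny in-memory link graph. The page names and the `LINKS` structure below are invented for illustration; a real spider would discover links by fetching and parsing pages.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def depth_first(start, links):
    """Follow one link as far as possible before backing up."""
    order, seen = [], set()

    def visit(page):
        if page in seen:
            return
        seen.add(page)
        order.append(page)
        for target in links.get(page, []):
            visit(target)

    visit(start)
    return order


def breadth_first(start, links):
    """Visit every link on the current level before going one level deeper."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    print(depth_first("home", LINKS))
    print(breadth_first("home", LINKS))
```

Depth-first reaches deep pages such as `a1` early, while breadth-first covers all pages close to the homepage first.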

3, Targeted optimization techniques to attract spiders

A: Raise the weight of the site and its pages so that spiders visit more often.

B: Keep pages updated frequently and keep the content quality high.

C: Add inbound links.

D: Watch the click distance from the homepage: the closer a page is to the homepage, the higher its weight and the greater the chance that spiders crawl it.

4, Address library

The search engine builds an address library that stores page addresses, so that spiders do not crawl and fetch the same URLs repeatedly. The library contains pages that have already been fetched as well as pages that have been discovered but not yet fetched.

Must every URL in this address library have been found by a spider? The answer is no.

Some are seed URLs entered manually, and some are URLs that webmasters submit through the search engine's page submission form.

Another point to note: a submitted URL is not necessarily included in the index; that depends on the weight of the page you submit. In any case, search engine spiders still fetch pages by following links.
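A minimal sketch of such an address library might keep one queue of discovered-but-unfetched URLs and one set of fetched URLs. The class name `AddressLibrary` and its methods are hypothetical, chosen only to illustrate the idea; real search engines use far larger, persistent structures.

```python
from collections import deque


class AddressLibrary:
    """Tracks discovered URLs so that no URL is fetched twice."""

    def __init__(self, seeds=()):
        self.pending = deque()  # discovered but not yet fetched
        self.fetched = set()    # already fetched
        self._known = set()     # everything ever added
        for url in seeds:       # manually entered seed URLs
            self.add(url)

    def add(self, url):
        """Add a URL found by a spider or submitted by a webmaster."""
        if url in self._known:
            return False        # duplicate, ignored
        self._known.add(url)
        self.pending.append(url)
        return True

    def next_url(self):
        """Hand the next unfetched URL to a spider, or None if empty."""
        if not self.pending:
            return None
        url = self.pending.popleft()
        self.fetched.add(url)
        return url


if __name__ == "__main__":
    library = AddressLibrary(seeds=["http://example.com/"])
    library.add("http://example.com/about")
    library.add("http://example.com/")  # already known, ignored
    print(library.next_url(), list(library.pending))
```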

5, File storage

The pages that spiders fetch are stored in the original page database. Each URL has a unique file number.

6, Detection of copied content

Many webmasters have run into this problem: the logs show that a spider came to crawl, but the pages were never included in the index, and they do not know what to do. The explanation is simple. The spider probably found a lot of low-value content while crawling your site, such as reposted or pseudo-original content, and left, so your pages were not included. Spiders already perform a certain degree of copied-content detection while they crawl page content.

(2) Preprocessing

This stage means that the indexing program processes the pages that spiders have stored in the database, mainly performing text extraction, Chinese word segmentation, indexing, and related work.

This stage acts as a bridge. The search engine database holds so much data that ranking results could not be computed from scratch after the user types a keyword into the search box, yet searches feel very fast to us. The key is preprocessing, which, like crawling and fetching, is completed in advance in the background.

Some people think preprocessing is indexing. That is not the case: indexing is only one major step of preprocessing. So what is an index? An index is a structure that sorts the values of one or more columns in a database table.

Five jobs before indexing:

1, Extracting text:

We know that spiders fetch all the HTML code of a page, which contains a lot of information: text, CSS properties, a large number of HTML formatting tags, and JavaScript programs. Everything other than the text cannot take part in ranking, so it is removed. This removal is also called text extraction: extracting the text content of the page that can be used for ranking.

Note: besides the visible text, search engines can also extract some text that is not directly visible, such as the text in meta tags, image alt text, alternative text for Flash files, and link anchor text.
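Text extraction can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not how any particular search engine does it: it drops script and style content, keeps visible text, and also collects two kinds of invisible text mentioned above (the meta description and image alt text). The class and function names are my own.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text and ignores scripts, styles, and markup."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        attributes = dict(attrs)
        # Invisible text that can still be extracted.
        if tag == "img" and attributes.get("alt"):
            self.parts.append(attributes["alt"])
        if (tag == "meta" and attributes.get("name") == "description"
                and attributes.get("content")):
            self.parts.append(attributes["content"])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = ('<html><head><title>Down jackets</title>'
            '<style>p{color:red}</style></head>'
            '<body><p>Winter <b>sale</b></p>'
            '<script>var x = 1;</script></body></html>')
    print(extract_text(page))
```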

2, Chinese word segmentation

Chinese and English sentences differ in more than just characters versus letters. English words are separated from one another by spaces, while Chinese has no separator between words: all the words in a sentence run together. So the search engine must first work out which characters together form a word and which character is a word by itself. For example, "Bosideng down jacket" is split into two words, "Bosideng" and "down jacket".

There are generally two methods of Chinese word segmentation:

A: Dictionary matching, which is further divided into forward matching and reverse matching.

B: Statistics-based segmentation, which relies on search statistics.

These two methods are often used together. Baidu and Google sometimes segment the same phrase differently. For example, "search engine optimization" is treated as one complete word by Baidu, while Google splits it into three parts: "search", "engine", and "optimization". So when optimizing, pay attention to the characteristics of the keywords you choose; keyword selection techniques will be explained in detail later.

Note: what if we want a combination of words to be treated as one unit, without the segmentation technology splitting it?

We can do this: put the keyword in the page title and the H1 tag, and show it in bold. This gives the search engine a hint that the phrase is a single combination, so that it does not split it.
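Dictionary matching can be sketched as maximum matching: at each position, take the longest dictionary word that fits. The sketch below implements the forward and reverse variants. The tiny dictionaries are invented, and the Chinese string 搜索引擎优化 is my assumption for the "search engine optimization" example; how Baidu or Google actually segment text is not public.

```python
def forward_max_match(text, dictionary, max_len=6):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=6):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


if __name__ == "__main__":
    phrase = "搜索引擎优化"  # "search engine optimization"
    small = {"搜索", "引擎", "优化"}
    large = small | {"搜索引擎优化"}
    print(forward_max_match(phrase, small))  # split into three words
    print(forward_max_match(phrase, large))  # kept as one word
```

Whether a phrase stays whole depends entirely on the dictionary, which is one way two engines could end up segmenting the same query differently.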

3, Removing stop words

What is a stop word? Stop words are words that appear very often on a page but have no real effect on its content: structural particles (the Chinese counterparts of words like "the" and "of"), exclamations such as "ah" and "ha", and adverbs or prepositions such as "thus" and "but". In English, stop words include the, a, an, to, of, and so on.

Search engines remove stop words for two main purposes:

One is to make the subject of the indexed data more prominent and to reduce unnecessary computation.

The other is to check whether your content largely repeats other content in the database.

A reminder: do not casually copy an article from the Internet, add a few stop words, and post it on your own site. After reading the above, you should understand why.
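Stop word removal itself is a simple filter. The sketch below uses only the English stop words listed above; the `STOP_WORDS` set and the function name are illustrative, and real stop word lists are much longer.

```python
# The English stop words mentioned in the text.
STOP_WORDS = {"the", "a", "an", "to", "of"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop words that carry no subject information."""
    return [word for word in words if word.lower() not in stop_words]


if __name__ == "__main__":
    sentence = "The history of a search engine"
    print(remove_stop_words(sentence.split()))
```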

4, Removing noise:

The noise here is not noise in the everyday sense. It refers to junk, that is, superfluous text that contributes nothing to the page's subject. It is generally found in copyright notices, navigation bars, and ads. Noise is removed so that the main content of the page stands out.

For example: the "Categories" and "Archives" sections of a blog.

5, Deduplication

What does that mean? If the same article appears on different sites, or on the same site under different URLs, the search engine treats the copies as one file. It does not like duplicate content, so before indexing it identifies and removes the duplicates. This is called "deduplication".

How do search engines deduplicate? We do not need to master the technology, but we should note a few key points:

A: Simply inserting stop words (structural particles and the like) is very easy to detect, so use this trick with great caution.

B: Copying other people's articles and simply rearranging the paragraphs is another kind of pseudo-original content that should be used with caution.

This is because such operations do not change the characteristic keywords of the article, so these practices cannot escape the search engine's deduplication algorithm.

After the five steps above, the search engine has unique content, in units of words, that reflects the main subject of the page.
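Why adding stop words or shuffling paragraphs does not help can be shown with a toy fingerprint. The sketch below assumes one simple approach: pick the most frequent keywords after stop word removal and hash them. The function name, the `top_n` parameter, and the use of MD5 are my assumptions; actual search engine deduplication algorithms are not public and are more sophisticated.

```python
import hashlib
from collections import Counter

STOP_WORDS = {"the", "a", "an", "to", "of"}


def fingerprint(text, top_n=5):
    """Hash the most frequent keywords of a text (word order is ignored)."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    # Most frequent first; ties broken alphabetically so the result is stable.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:top_n]
    canonical = " ".join(sorted(top))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    original = "spider crawl page spider index page spider"
    # Same article with stop words added and the order changed.
    copied = "the index page spider of spider crawl page a spider"
    print(fingerprint(original) == fingerprint(copied))
```

Because the characteristic keywords and their frequencies are unchanged, both versions produce the same fingerprint.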

Next, the search engine program runs the extracted text through the segmentation program, so that each page is converted into a set of keywords. At the same time it records, for each keyword, its frequency and number of occurrences on the page, its format (title tag, bold, H tags, anchor text, etc.), and its position (for example, which paragraph). These are recorded in the form of weights. The result is stored in a place dedicated to these keyword-set structures: the index library, also called the "vocabulary index table".

What is a forward index?

Each file corresponds to a file ID, and the file content is represented as a set of keywords. Such a data structure is called a forward index. In this simplified picture the keywords are shown as words rather than as keyword IDs.

Here is a table to help everyone understand:

File ID | Content
File 1 | Keyword 1, keyword 2, keyword 7, keyword 10, ... keyword L
File 2 | Keyword 2, keyword 7, keyword 30, ... keyword M
File 3 | Keyword 2, keyword 70, keyword 35, ... keyword N
... | ...
File 7 | Keyword 2, keyword 7, ... keyword X
... | ...
File X | Keyword 7, keyword 50, ... keyword Y
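In code, a forward index like the table above can be sketched as a mapping from file ID to the keywords found in that file. The function name and the sample pages are made up for illustration, and the weight information described earlier is left out to keep the sketch short.

```python
def build_forward_index(pages):
    """Map each file ID to its distinct keywords, in order of appearance.

    `pages` maps a file ID to the list of keywords produced by
    segmentation, stop word removal, and noise removal.
    """
    return {
        file_id: list(dict.fromkeys(keywords))  # drop repeats, keep order
        for file_id, keywords in pages.items()
    }


if __name__ == "__main__":
    pages = {
        1: ["seo", "spider", "seo", "ranking"],
        2: ["spider", "index"],
    }
    print(build_forward_index(pages))
```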

What is an inverted index?

The forward index cannot be used directly for ranking. For example, if a user searches for keyword 2 and only the forward index exists, the ranking program would have to scan every file to find the ones that contain keyword 2, which is too slow to return rankings in real time. This is where the inverted index is used.

In the inverted index the keyword becomes the primary key. Each keyword corresponds to a series of files, all of which contain that keyword. When a user searches for a keyword, the ranking program looks the keyword up in the inverted index and immediately finds the files that contain it.

Please look at the table:

Keyword | Files
Keyword 1 | File 1, file 2, file 17, file 110, ... file L
Keyword 2 | File 2, file 7, file 30, ... file B
Keyword 3 | File 2, file 7, file 30, ... file U
... | ...
Keyword 6 | File 21, file 70, file 300, ... file K
... | ...
Keyword 7 | File 12, file 27, file 3, ... file L
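Turning a forward index into an inverted index is a matter of swapping keys and values. The sketch below uses a plain dictionary; the function name and sample data are illustrative only, and a real index would also store weights and positions alongside each file ID.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Map each keyword to the sorted list of file IDs that contain it."""
    inverted = defaultdict(list)
    for file_id in sorted(forward_index):
        for keyword in forward_index[file_id]:
            inverted[keyword].append(file_id)
    return dict(inverted)


if __name__ == "__main__":
    forward = {
        1: ["keyword 1", "keyword 2", "keyword 7"],
        2: ["keyword 2", "keyword 7"],
        3: ["keyword 2"],
    }
    for keyword, files in build_inverted_index(forward).items():
        print(keyword, files)
```

A lookup such as `inverted["keyword 2"]` now answers "which files contain this keyword" directly, without scanning every file.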

Processing of special documents:

Besides HTML files, search engines can also fetch the following text-based file types: PDF, Word, WPS, PPT, TXT, and the like. Note, however, that search engines cannot process non-text content such as images, video, and Flash, and they cannot execute scripts and programs. So in SEO, use these on your site as little as possible.

Calculation of link relationships:

After the search engine has fetched pages, it must also calculate in advance which links on each page point to which other pages, what inbound links each page has, and what anchor text those links use. These complex link relationships form the link weight of sites and pages. Google's PR value is an important expression of these relationships; it will be explained in detail later.
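To give a feel for how link relationships can become a weight, here is a minimal sketch of the textbook PageRank iteration. It is a simplification under stated assumptions: the damping factor of 0.85, the iteration count, and the tiny link graph are illustrative, and this is not Google's production algorithm, which is not public.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank over its outbound links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical site: every page links back to the homepage.
    links = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "c": ["home"]}
    for page, score in sorted(pagerank(links).items()):
        print(page, round(score, 3))
```

In this toy graph the homepage ends up with the highest score because every other page links to it, while page `c`, which has no inbound links, ends up with the lowest.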

(3) Ranking:

Ranking is the stage that interacts with the user: when the user enters keywords, the ranking program calls the data in the index library, calculates relevance, and generates the search results page in a certain format.

1, Processing the search terms

A: Chinese word segmentation.

B: Removing stop words.

C: Instruction processing: by default the search engine joins the segmented words with "AND" logic. For example, when a user searches for "website construction", the search engine assumes by default that the user wants pages that contain both "website" and "construction".

A common search instruction is the minus sign. What other search instructions are there, and how are they used? A later section will explain them in detail.

D: Spelling correction: if the user enters an obviously wrong word or a misspelled English word, the search engine prompts the user with the correct wording or spelling, for example when a query about site-building skills is mistyped.

E: Integrated search triggers: some queries trigger integrated results. For example, a search for a celebrity brings up pictures, videos, and other content. This suits hot topics.

2, How does file matching proceed?

This is done quickly in the inverted index. Look at the table:

Keyword | Files
Keyword 1 | File 1, file 2, file 17, file 110, ... file L
Keyword 2 | File 1, file 7, file 30, ... file B
Keyword 3 | File 2, file 7, file 30, ... file U
... | ...
Keyword 6 | File 21, file 70, file 300, ... file K
... | ...
Keyword 7 | File 12, file 27, file 3, ... file L

If the user's query contains both keyword 2 and keyword 3, the ranking program finds exactly the files that contain both keyword 2 and keyword 3, and returns them.
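With the default "AND" logic, matching comes down to intersecting the file lists of the query keywords. The sketch below uses the visible file numbers for keyword 2 and keyword 3 from the table above; the function name is hypothetical.

```python
def match_all(inverted_index, keywords):
    """Return the file IDs that contain every one of the keywords."""
    postings = [set(inverted_index.get(keyword, ())) for keyword in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    inverted = {
        "keyword 2": [1, 7, 30],
        "keyword 3": [2, 7, 30],
    }
    print(match_all(inverted, ["keyword 2", "keyword 3"]))
```

Files 7 and 30 appear under both keywords, so they are the ones returned for this query.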

3, How is the initial subset selected?

There are enormous numbers of pages on the Internet, and a single keyword can match tens of thousands of them. If the search engine calculated relevance for every one of those pages directly, it would take far too long. In fact users do not need to see all those thousands of pages; they only need one or two useful ones. So the search engine selects 100 files for the user's search term and returns those. Which 100 does it choose? That depends on how well your pages match the keywords the user searched for. Pages with high weight enter the search engine's initial subset first.

4, Calculating relevance

After the subset is selected, the relevance of its pages is calculated. We do not need to know exactly how search engines calculate relevance, but knowing the factors that affect it helps us optimize our sites further.

SEO should pay attention to the following factors:

A: How common the keyword is, for example in a query such as "Come on UFO": the more common a word is, the less it contributes to the meaning of the query.

B: Frequency and density

Provided there is no keyword stuffing, it is generally believed that the more often and the more densely the search term appears on a page, the more relevant the page is to the search term.

C: Keyword position and form

Position mainly concerns whether the page is the homepage or a second-level page. Form mainly concerns the title tag, bold text, and H1.

D: Keyword distance: for example, if the search term is "website construction", a page on which the complete phrase "website construction" appears many times is more relevant than a page on which "website" appears without "construction" after it, or on which the two words never appear together.

E: Link analysis and page weight

The relationship between links and weight lies mainly in anchor text: the more inbound links use the search term as their anchor text, the higher the relevance.
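Two of the factors above, frequency and density (B) and keyword distance (D), are easy to compute for a single page. The sketch below only measures these signals; how a search engine combines them into a final relevance score is not public, and the function names are my own.

```python
def keyword_signals(words, term):
    """Count how often a term appears and what share of the page it makes up."""
    total = len(words)
    frequency = sum(1 for word in words if word == term)
    density = frequency / total if total else 0.0
    return {"frequency": frequency, "density": density}


def phrase_count(words, phrase):
    """Count the places where the words of the phrase appear side by side."""
    size = len(phrase)
    return sum(
        1
        for start in range(len(words) - size + 1)
        if words[start:start + size] == phrase
    )


if __name__ == "__main__":
    page = "website construction tips for website owners".split()
    print(keyword_signals(page, "website"))
    print(phrase_count(page, ["website", "construction"]))
```

A page where the complete phrase occurs often scores higher on the distance signal than a page where the same words appear only far apart.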

5, Ranking filtering and adjustment

Once the matching subset has been selected and relevance calculated, the overall ranking is largely settled. Ranking filtering mainly adjusts the positions of sites that cheat or are suspected of cheating. Even if such sites scored high on weight and relevance in the earlier steps, the search engine filters them out in this final step.

6, Displaying the ranking results

What is mainly displayed is the original page's title tag, description tag, snapshot date, and other data.

Note: for some sites, the search engine generates the page summary dynamically instead of using the description from the page itself.

7, The role of search engine caching:

The search engine records the terms that users search for often and stores the ranking results for those terms in its cache. When a user searches for one of these terms again, the search engine returns the cached content directly. This shortens the response time and greatly improves ranking efficiency.
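The caching idea can be sketched as a thin wrapper around the ranking step. The `CachedSearch` class and the `search_fn` parameter are hypothetical names; a real search engine cache would also limit its size and expire entries when the index changes.

```python
class CachedSearch:
    """Stores the results of earlier queries so repeats are answered at once."""

    def __init__(self, search_fn):
        self._search_fn = search_fn  # the expensive ranking computation
        self._cache = {}
        self.misses = 0

    def search(self, query):
        # Normalize case and spacing so trivially different queries share a key.
        key = " ".join(query.lower().split())
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._search_fn(key)
        return self._cache[key]


if __name__ == "__main__":
    def slow_ranking(query):
        print("computing ranking for", query)
        return [query + " result 1", query + " result 2"]

    engine = CachedSearch(slow_ranking)
    engine.search("website construction")  # computed
    engine.search("Website  Construction")  # served from the cache
    print(engine.misses)
```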

Summary:

The above is a detailed introduction to the whole working process of a search engine. These are only concepts; in reality the work steps and algorithms of search engines are far more complex than we imagine. That does not matter: to do SEO, understanding the basic concepts above is enough. Search engine algorithms are still being refined, so interested readers can keep an eye on them, which may lead to new breakthroughs in their own optimization work. The concepts described above are roughly the basic principles of the mainstream search engines.

Xiao Xin

December 30, 2012 Sunday Night
