Behind the search engine, supporting this "simplicity", lies a very complex set of search technologies.
The further questions are: What kind of company is a search engine company, and what do its employees actually do? What is the profit model of search engine companies, and what are their margins? Can a new search engine company become as successful as Baidu or Google? How do search engine companies compete with one another? What stage has the development of search engines reached? What will search engines be used for tomorrow...
Open the Google or Baidu homepage and type in any word you want to search for. In less than 10 seconds, the browser shows you page after page of search results. One keyword, "Baidu", one click on "Search": the operation is so simple that users take it to be a natural function of the Internet.
However, behind the search engine, what supports this "simplicity" is a very complex body of search technology.
As we all know, we live in an age of information explosion. Every day, far more information is produced than anyone can take in. Faced with such a flood of new and stored information, how do people find what is useful or urgently needed? The answer is: search. And so search engines came into being.
So what exactly is a search engine? Someone once put it vividly: "A search engine is like a huge vacuum cleaner that can suck up anything in the ocean of the Internet, no matter how deep it lies on the ocean floor." The explanation is not accurate, but it is vivid. Let us take a look at how a real search engine works.
Spider Program
In fact, a search engine does not fetch websites; it fetches individual web pages. Continuing with the ocean as an image of the Internet: this ocean is made up of countless web pages, and the pages are connected by links into the vast, boundless "net" of the Internet.
The tool a search engine uses to fetch web pages is called the "spider program". It crawls from one web page to another along the links between pages and selectively brings pages back.
We know that every Internet page is written in HTML. What the "spider program" visits is not the rendered page we see day to day, but the HTML source code behind it. If the "spider" judges a page useful, it copies the page's HTML source, sends it back to the search engine's servers for storage, and then continues its journey to the next page.
In theory, starting from a single page and following that page's links, a spider can reach every web page on the Internet, just as, knowing one person, you could in principle build a connection to everyone in the world through the people you know, the people they know, and so on. The "spider program" works on the same principle.
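To make the spider's working principle concrete, here is a minimal sketch of such a crawler in Python. It illustrates the link-following idea described above, not any real engine's implementation; the seed URL is a placeholder, and a production spider would also need politeness delays, robots.txt handling, and much more.

```python
# A minimal sketch of a "spider program": starting from one seed page, it
# follows links breadth-first, fetching each page's HTML source and
# storing it, as the text describes. Illustrative only.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, max_pages: int = 10) -> dict[str, str]:
    stored: dict[str, str] = {}        # url -> HTML source, our "server group"
    queue, seen = deque([seed]), {seed}
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                   # skip unreachable pages
        stored[url] = html             # "send the source back for storage"
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return stored

# crawl("https://example.com")  # placeholder seed URL
```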
Different search engines have different "spider programs" with different capabilities. How many web pages the spider can fetch per day is one indicator; its ability to avoid fetching duplicate pages is another; how it picks up the newest pages is a third. The strength of the "spider" is therefore the first source of the differences between search engines. One simple duplicate-detection technique is sketched below.
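As one illustration of the duplicate-avoidance indicator just mentioned, a spider can fingerprint each page it fetches and skip pages it has already seen. The exact-match SHA-256 hash below is an assumption for illustration; real engines use more robust near-duplicate detection such as shingling.

```python
# A minimal sketch of one way a spider can avoid storing duplicate pages:
# hash each page's text and skip any page whose fingerprint was seen before.
import hashlib

seen_fingerprints: set[str] = set()

def is_duplicate(page_text: str) -> bool:
    fingerprint = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False

print(is_duplicate("hello world"))  # False: first time seen
print(is_duplicate("hello world"))  # True: exact duplicate
```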
Indexing and Sorting
Important as the spider program is, it is not the core difference between search engines. The core difference lies in how a search engine indexes the fetched pages and what sorting rules it applies to them.
The source code of the fetched pages is stored on the search engine's large server clusters, like thousands upon thousands of books scattered through a huge library. If those books are never indexed and sorted, finding one of them is like finding a needle in a haystack. Indexing means analyzing, organizing, and distilling every word on every page, and filing each page into the appropriate index libraries.
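The data structure typically behind this is an inverted index: a mapping from each word to the set of pages containing it, so a query becomes a dictionary lookup rather than a scan over every stored page. Below is a minimal sketch with two invented sample pages; real index libraries are vastly larger and record much more, such as word positions and frequencies.

```python
# A minimal sketch of building an inverted index over fetched pages.
from collections import defaultdict

pages = {
    "page1.html": "search engines crawl and index web pages",
    "page2.html": "spiders crawl the web along links",
}

index: dict[str, set[str]] = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():  # real engines tokenize far more carefully
        index[word].add(url)

# Query: which pages contain the word "crawl"?
print(sorted(index["crawl"]))  # -> ['page1.html', 'page2.html']
```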
Obviously, the words on every page cannot be analyzed by hand; this process, too, is done by programs, and the word segmentation technology at its heart is critical. For example, segmentation directly determines whether a search engine, given the query "任务" ("task"), will build an index entry that matches a page containing the sentence "李主任务必参加会议" ("Director Li must attend the meeting"), in which the characters 任务 merely straddle the boundary between 主任 ("director") and 务必 ("must"). Early search engines, including Google, once made exactly this mistake, returning pages about Director Li's meeting for the query "task".
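To see why segmentation matters, here is a minimal sketch of forward maximum matching, one common Chinese segmentation strategy (not necessarily what Baidu or Google actually use), applied to the sentence above. The toy dictionary is an assumption for illustration; real engines use far larger lexicons and statistical models.

```python
# Forward maximum matching (FMM): at each position, greedily take the
# longest dictionary word. Toy dictionary, for illustration only.
DICTIONARY = {"李", "主任", "务必", "参加", "会议", "任务"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def segment(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches;
        # fall back to a single character if nothing matches.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

# "李主任务必参加会议" segments as 李 / 主任 / 务必 / 参加 / 会议, so the
# cross-boundary substring 任务 ("task") is never produced as an index
# term, and the page no longer matches the query 任务.
print(segment("李主任务必参加会议"))
```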
After indexing comes sorting: once you submit a search request, in what order should the search engine return the results? Obviously, the information the user most wants should sit at the front of the results, but what counts as the information the user most wants? On this, opinions differ.
In the search engine field, the most famous sorting rule is the "hyperlink analysis" of Baidu founder Li Yanhong (Robin Li), who applied for a US patent on the technology. Hyperlink analysis holds that the importance of a web page is determined by the number of other web pages linking to it, somewhat as a person's importance can be judged by how many people in the world know him. Using a similar rule based on the number of links pointing to a page, Google created its own distinctive PageRank technology.
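The idea behind both hyperlink analysis and PageRank can be illustrated with a short power-iteration sketch. The three-page link graph and the damping factor of 0.85 are illustrative assumptions, not either company's actual data or parameters.

```python
# A minimal power-iteration sketch of PageRank-style link analysis:
# a page's score is fed by the scores of the pages that link to it.
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # a dangling page shares its rank with everyone
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Page C is linked to by both A and B, so it ends up most "important".
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```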
It can be said that it is precisely by mastering these core technologies of web indexing and sorting rules that Google and Baidu have won their current standing in their respective markets.
Anti-cheating
Fetch the useful pages on the Internet, build a web index according to the sorting rules, and a search engine can return highly relevant pages quickly when a user searches. However, if an engine relied solely on fixed crawling and sorting rules, malicious websites could exploit those rules for so-called "website optimization". Southern Weekend examined this problem in detail in the article "The War Between Search Engines and Cheating Websites". One simple anti-cheating heuristic is sketched after this paragraph.
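As a flavor of what anti-cheating logic can look like, here is a minimal sketch of one heuristic: flagging pages where a single term makes up an implausibly large share of the text ("keyword stuffing"). The 20% threshold is invented for illustration; real anti-spam systems combine many signals.

```python
# A minimal keyword-stuffing detector: flag a page if its most frequent
# word accounts for more than `threshold` of all words on the page.
from collections import Counter

def looks_stuffed(text: str, threshold: float = 0.20) -> bool:
    words = text.lower().split()
    if not words:
        return False
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > threshold

print(looks_stuffed("cheap phones cheap phones cheap phones buy now"))    # True
print(looks_stuffed("a short normal sentence about web search quality"))  # False
```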
From the introduction above, we can see that a search engine does far more than play the simple role of a "vacuum cleaner". First, it needs a powerful "spider" to collect the web pages that keep multiplying and changing on the Internet every day, and it must invest heavily in servers to store that information. It must then index and sort all the collected pages, while remaining constantly vigilant against every kind of cheating.
This workflow also shows what resources it takes to run a good search engine. The "spider program", "word segmentation technology", "sorting rules", and "anti-cheating programs" all demand large numbers of programmers working continuously to improve the software's efficiency and capability, while storing the massive volume of page and index information demands heavy investment in server clusters. At present, Google has thousands of technical staff worldwide developing search engine technology, covering search in more than 100 languages; Baidu currently has more than 700 employees, over half of them technical staff devoted to the single problem of Chinese search.
Given the complexity of the search engine business, "focus" is another essential factor on top of resource investment. "We will continue to focus on R&D and promotion in the field of Chinese search," Li Yanhong, president of Baidu and by now a star entrepreneur, once said publicly.
With this understanding of the technical core of a search engine, it is not hard to see why "full, new, fast, and accurate" have become the criteria for judging a search engine's quality. "Full" means the index library should cover as many pages as possible: according to the latest statistics, of the roughly 2 billion Chinese web pages, Baidu's index library has collected about 0.8 billion, while Google's figure is 0.5 billion. "New" means the newest pages must make it into the index library: Baidu's index library is refreshed in full every month, but it is also updated several times a day, each update taking in a batch of the newest pages. "Fast" refers to the speed with which the engine returns results after a query is submitted; "accurate" refers to the relevance and precision of the results.
Having understood search engines at the technical level, one can better understand how search engines make their profits and why the market is so enthusiastic about search engine companies.