Search Engine Fundamentals


Summary: I recently read the book "This Is the Search Engine: Core Technical Details" and am briefly recording some notes here.

__________________________________________________

Contents

"1" Search engine overview

2. Basic technologies of search engines

3. Platform foundation of search engines

4. Improvement and optimization of search results

__________________________________________________

"1" Search engine overview

Over the past 15 years, information on the Internet has expanded rapidly, and manually screening it for useful content is no longer feasible, which is why search engines emerged. Their development can be divided into four eras.

    • Category directories. Yahoo is representative of this era: pages were collected entirely by hand, so quality was high but efficiency was low.
    • Text retrieval. Classic information retrieval models were used to score the similarity between the user's query keywords and page content; collection size grew, but quality was not very good. Early AltaVista is an example.
    • Link analysis. The typical example is Google's PageRank; coverage expanded greatly and quality improved, followed closely by a variety of cheating methods.
    • User-centric. Most search engines today return the same results for the same query, but different users may care about different things; the future is likely to focus more on this variability among users.

Speaking of development, we have to mention the three main goals of search engines; however they evolve, the following three goals remain a good yardstick:

    • More complete: how to ingest more of the relevant pages?
    • Faster: how to quickly return results from hundreds of millions of pages?
    • More accurate: how to present the results the user is most interested in?

2. Basic technologies of search engines

This part describes the basic technologies of a search engine in terms of the following four components, which are also its key stages:

    • Web crawler
    • Build an index
    • Content Search
    • Link analysis

2.1 Web crawler

A web crawler is the search engine's download system. Its role is content acquisition: it follows links on the World Wide Web to continuously crawl and collect web pages of all kinds. The Internet, however, contains a vast number of pages, and new content is generated every day. By crawl target and scope, crawlers can be roughly divided into the following categories:

    • Batch crawler: has a clear crawl target and range, and stops once it is reached.
    • Incremental crawler: responds in a timely manner to constantly updated web pages; general-purpose commercial engines are typically of this kind.
    • Vertical crawler: crawls only a specific domain, filtering pages by topic.

While crawling, how should the crawler decide which page to fetch next? The main strategies are listed below; a minimal breadth-first sketch follows the list.

    • Breadth-first: the simplest approach; links found in a page are appended to the crawl queue in turn.
    • Partial PageRank: PageRank is a page-importance measure; the next crawl target is chosen by PageRank values computed periodically over the pages downloaded so far.
    • OPIC (Online Page Importance Computation): when the current page is downloaded, its importance is split evenly among its outgoing links and the most important page is selected next; no iterative computation is needed, so it is faster.
    • Large-site priority: a simple idea that measures page importance at the level of the website the page belongs to.
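
As a rough illustration of the breadth-first strategy (my own Python sketch, not the book's code), assume the caller supplies fetch_page and extract_links helpers:

    from collections import deque

    def bfs_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
        """Breadth-first crawl: newly discovered links join the tail of the queue."""
        queue = deque(seed_urls)          # crawl frontier (FIFO)
        seen = set(seed_urls)             # avoid crawling the same URL twice
        pages = {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()         # FIFO order gives breadth-first traversal
            html = fetch_page(url)        # caller-supplied downloader
            pages[url] = html
            for link in extract_links(html):   # caller-supplied link extractor
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return pages

The other strategies mainly change how the next URL is chosen from the frontier (by partial PageRank, accumulated importance, or site-level importance) rather than the overall loop.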

Next, a brief introduction to an important problem for search engines: crawling the deep (dark) web. The deep web refers to pages that are hard to reach by conventional crawling, and there are a great many of them. Some pages have no links pointing to them, and for others the main content sits in a database (as with Ctrip), with no links to the individual records. Deep-web mining is a major research focus of commercial search engines; Google works on it, and Baidu's "Aladdin" program targets the same problem.

2.2 Building an index

For a search engine, the index is one of the most important core technologies. Faced with a huge volume of web content, how can it quickly find all the pages that contain the user's query words? The inverted index plays the key role here.

We treat each web page as a document whose content is made up of words. To quickly return documents for a user's search term, we build a word-to-document storage structure. The inverted index is a concrete storage form of this word-document matrix: given a word, it lets you quickly retrieve the list of documents containing that word. An inverted index consists of two parts: the word dictionary and the inverted (postings) file.
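
As a toy illustration (my own sketch, not code from the book), an in-memory inverted index in Python can map each word to a postings list of (document ID, term frequency) pairs:

    from collections import defaultdict, Counter

    def build_inverted_index(docs):
        """docs: dict doc_id -> text. Returns word -> list of (doc_id, term frequency)."""
        index = defaultdict(list)
        for doc_id, text in docs.items():
            counts = Counter(text.lower().split())   # naive tokenization
            for word, tf in counts.items():
                index[word].append((doc_id, tf))     # one postings entry per document
        return index

    docs = {1: "search engine basics", 2: "an inverted index for search"}
    print(build_inverted_index(docs)["search"])      # [(1, 1), (2, 1)]

Here the dictionary keys play the role of the word dictionary and the lists play the role of the inverted file.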

Word dictionaries are mainly stored in two ways: hash tables with chained lists, or tree structures.

Index creation methods:

(1) Two-pass document traversal

On the first scan of the document collection, this method does not start building the index immediately; instead it collects some global statistics, such as the number of documents N in the collection, the number of distinct words it contains, and, for each word, the number of documents in which it appears (its document frequency, DF). With these three kinds of information, the final size of the index is known, so enough memory can be allocated up front to hold the inverted index. The second scan then actually builds each word's postings list, i.e. for every word, the ID of each document containing it and the number of times the word occurs in that document (its term frequency, TF).
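
A minimal sketch of the two-pass idea (illustrative only, not the book's implementation): the first pass gathers document frequencies so that exact-size postings lists can be preallocated, and the second pass fills them in.

    from collections import Counter

    def two_pass_index(docs):
        """docs: dict doc_id -> text. Returns word -> list of (doc_id, tf)."""
        # Pass 1: global statistics, i.e. document frequency (DF) per word.
        df = Counter()
        for text in docs.values():
            df.update(set(text.lower().split()))

        # Allocate exactly DF[word] postings slots for each word.
        index = {word: [None] * n for word, n in df.items()}
        filled = Counter()

        # Pass 2: fill in the (doc_id, term frequency) postings.
        for doc_id, text in docs.items():
            for word, tf in Counter(text.lower().split()).items():
                index[word][filled[word]] = (doc_id, tf)
                filled[word] += 1
        return index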

(2) Sorting method

The sorting method improves on this. During indexing, a fixed-size buffer is always kept in memory to hold intermediate results of the dictionary and the index; when the allocated space is used up, the intermediate results are written to disk and the in-memory area is emptied so it can hold the next round of intermediate results. Because only a fixed amount of memory is needed, this method can index document collections of any size.

(3) Merge method

When the allocated memory quota is used up, the sorting method writes only the intermediate postings to disk while the dictionary stays in memory; as more and more documents are processed, the dictionary grows larger and larger, occupying more and more memory and leaving less for later intermediate results. The merge method improves on this: every time in-memory data is written to disk, all intermediate results, including the dictionary, are written out, so memory can be emptied completely and subsequent indexing can again use the entire fixed memory budget. The partial indexes on disk are then merged into the final index.
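
A rough sketch of the flush-and-merge idea shared by the sorting and merge methods (my own illustration; the sorted "disk runs" are simulated with in-memory lists to keep the example self-contained):

    from collections import defaultdict
    import heapq

    def blocked_index(token_stream, buffer_limit=10000):
        """token_stream yields (word, doc_id) pairs.
        Returns the merged index: word -> sorted list of doc_ids."""
        runs = []          # each run stands in for a sorted partial index on disk
        buffer = []

        def flush():
            if buffer:
                runs.append(sorted(buffer))   # write one sorted partial index "to disk"
                buffer.clear()                # free the whole fixed memory budget

        for word, doc_id in token_stream:
            buffer.append((word, doc_id))
            if len(buffer) >= buffer_limit:   # memory quota exhausted
                flush()
        flush()

        # Merge step: k-way merge of the sorted runs into the final index.
        merged = defaultdict(list)
        for word, doc_id in heapq.merge(*runs):
            if not merged[word] or merged[word][-1] != doc_id:
                merged[word].append(doc_id)
        return merged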

Index update policies:

    • Full rebuild strategy
    • Re-merge strategy
    • In-place update strategy
    • Hybrid strategy

2.3 Content Search

Content retrieval models are the theoretical basis of search engine ranking; they are used to compute the relevance between web pages and queries.

Common retrieval models (a toy vector-space example follows the list):

    • Boolean model
    • Vector space model
    • Probabilistic model
    • Language model
    • Machine-learned ranking (learning to rank)
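
As a toy illustration of the vector space model (my own sketch, not the book's), documents can be represented as TF-IDF vectors and ranked by cosine similarity to the query vector:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """docs: dict doc_id -> text. Returns doc_id -> {word: tf-idf weight}."""
        n = len(docs)
        tokenized = {d: Counter(t.lower().split()) for d, t in docs.items()}
        df = Counter()
        for counts in tokenized.values():
            df.update(counts.keys())
        return {
            d: {w: tf * math.log(n / df[w]) for w, tf in counts.items()}
            for d, counts in tokenized.items()
        }

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(weight * v[w] for w, weight in u.items() if w in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

The query is turned into a vector using the same IDF statistics, and documents are ranked by their cosine similarity to it.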

Retrieval system evaluation metrics (a small metrics helper follows the table):

    • Precision: the proportion of relevant documents among the search results, A/(A+B)
    • Recall: the proportion of all relevant documents that are retrieved, A/(A+C)
    • P@10: the number of relevant documents among the top 10 results
    • MAP (mean average precision): takes the rank positions of the returned results into account, scoring each relevant document by the reciprocal of its rank and averaging

                            Relevant to query    Not relevant to query
    In search results       A                    B
    Not in search results   C                    D
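
A small helper (my own sketch) that computes these metrics from a ranked result list and a set of relevant document IDs, using the standard definitions:

    def precision_recall(results, relevant):
        """results: ranked list of returned doc_ids; relevant: set of relevant doc_ids."""
        a = sum(1 for d in results if d in relevant)         # relevant and retrieved (A)
        precision = a / len(results) if results else 0.0     # A / (A + B)
        recall = a / len(relevant) if relevant else 0.0      # A / (A + C)
        return precision, recall

    def precision_at_k(results, relevant, k=10):
        """P@10 when k = 10."""
        return sum(1 for d in results[:k] if d in relevant) / k

    def average_precision(results, relevant):
        """Precision at each rank where a relevant document appears, averaged;
        MAP is this value averaged over many queries."""
        hits, total = 0, 0.0
        for rank, d in enumerate(results, start=1):
            if d in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0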

2.4 Link Analysis

To find pages that satisfy the user's request, a search engine mainly considers two factors: on one hand, the similarity score between the user's query and the page content, i.e. the relevance of the page to the query; on the other hand, a score computed by link analysis, i.e. the importance of the page. Link analysis is a class of methods that derive page importance from the link structure of the web.

There are many link analysis algorithms; by underlying model they fall mainly into two categories:

    • Random walk models: from a page, jump with some probability to one of the links it contains.
    • Subset propagation models: assign initial weight to certain subsets of pages and propagate it to other pages according to specific rules.

Common algorithms (a minimal PageRank sketch follows the list):

    • PageRank
    • HITS
    • SALSA
    • Topic-sensitive PageRank
    • Hilltop
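
A minimal PageRank power-iteration sketch (my own illustration with a conventional damping factor of 0.85, not the book's code):

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict page -> list of pages it links to. Returns page -> score."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page in pages:
                targets = links.get(page, [])
                if targets:
                    share = damping * rank[page] / len(targets)
                    for t in targets:      # random-walk step: follow an out-link
                        new_rank[t] += share
                else:
                    for t in pages:        # dangling page: jump anywhere
                        new_rank[t] += damping * rank[page] / n
            rank = new_rank
        return rank

    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))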

3. Platform foundation of search engines

This section is mainly about the platform that supports a search engine, chiefly its cloud storage and cloud computing models.

A commercial search engine must store huge amounts of data and also process them; cloud storage and cloud computing are the solutions to this problem.

Such a large amount of data cannot live on a single server; it must be stored in a distributed fashion. This raises problems such as data on multiple servers becoming inconsistent when it is updated, and how to select which server should serve a request.

Let us first introduce some basic principles:

(1) CAP principle

CAP is the abbreviation of Consistency, Availability, and Partition tolerance.

A data system cannot satisfy all three at once. Cloud storage systems tend to emphasize availability and partition tolerance, sacrificing some consistency.

(2) ACID principle

This is the principle relational databases follow. ACID is the abbreviation of Atomicity, Consistency, Isolation, and Durability.

(3) BASE principle

Unlike ACID, large cloud storage systems sacrifice strong data consistency in exchange for high availability, because users are generally more sensitive to the service being unavailable than to temporarily stale data.

Its three aspects are:

    • Basically Available
    • Soft State: replicas do not need to be synchronized at every moment
    • Eventual Consistency: the data only has to become consistent within some time window

Google's cloud storage and cloud computing architecture

Cloud storage:

    • GFS file system: consists of a master server, chunk servers, and GFS clients
    • Chubby lock service: a coarse-grained lock service for distributed systems
    • BigTable: a structured/semi-structured storage model for massive data, essentially a three-dimensional mapping table keyed by row key, column key, and timestamp (see the toy sketch after this list)
    • Megastore: suitable for real-time interaction, whereas GFS and BigTable are geared toward background processing
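
As a toy illustration of BigTable's (row key, column key, timestamp) to value mapping (my own sketch, not Google's actual API):

    import time

    class TinyTable:
        """A toy BigTable-style store: (row_key, column_key, timestamp) -> value."""

        def __init__(self):
            self._cells = {}   # (row, column) -> list of (timestamp, value), newest first

        def put(self, row, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            versions = self._cells.setdefault((row, column), [])
            versions.append((ts, value))
            versions.sort(key=lambda v: v[0], reverse=True)   # newest version first

        def get(self, row, column, timestamp=None):
            """Return the newest value at or before `timestamp` (latest if None)."""
            for ts, value in self._cells.get((row, column), []):
                if timestamp is None or ts <= timestamp:
                    return value
            return None

    table = TinyTable()
    table.put("com.example/index", "contents:html", "<html>v1</html>", timestamp=1)
    table.put("com.example/index", "contents:html", "<html>v2</html>", timestamp=2)
    print(table.get("com.example/index", "contents:html"))                # <html>v2</html>
    print(table.get("com.example/index", "contents:html", timestamp=1))   # <html>v1</html>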

Cloud computing (a toy MapReduce word-count sketch follows the list):

    • MapReduce
    • Percolator: incremental processing, as a complement to MapReduce
    • Pregel: a large-scale graph computation model
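
A toy in-process word count in the MapReduce style (illustrative only; a real MapReduce job distributes the map, shuffle, and reduce phases across many machines):

    from collections import defaultdict

    def map_phase(doc_id, text):
        """Map: emit (word, 1) for every word in the document."""
        for word in text.lower().split():
            yield word, 1

    def reduce_phase(word, counts):
        """Reduce: sum the counts emitted for one word."""
        return word, sum(counts)

    def mapreduce_wordcount(docs):
        groups = defaultdict(list)                 # shuffle: group values by key
        for doc_id, text in docs.items():
            for word, count in map_phase(doc_id, text):
                groups[word].append(count)
        return dict(reduce_phase(w, c) for w, c in groups.items())

    print(mapreduce_wordcount({1: "web search", 2: "search engine"}))
    # {'web': 1, 'search': 2, 'engine': 1}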

Other cloud storage systems:

    • Dynamo (Amazon)
    • PNUTS (Yahoo!)
    • Haystack (Facebook)

4. Improvement and optimization of search results

As mentioned earlier, the three goals a search engine pursues are being faster, more complete, and more accurate. Achieving them is not easy and requires a great deal of additional processing. This part looks at the following aspects of how a search engine improves its results, its search quality, and its performance.

4.1 Cheating Analysis

Cheating methods:

    • Content cheating: stuffing irrelevant keywords, content farms (lots of low-quality content)
    • Link cheating: link farms, mutual linking, etc.
    • Page-hiding cheating: serving different content to the crawler, hiding irrelevant keywords, redirects, etc.
    • Web 2.0 cheating

The overall ideas behind anti-cheating:

    • Trust propagation
    • Distrust propagation
    • Anomaly discovery

(1) The so-called trust propagation model works as follows. Out of the huge volume of web data, some completely trustworthy pages, i.e. pages that are certainly not cheating (think of them as a whitelist), are selected by technical or semi-manual means. The algorithm takes these whitelist pages as the starting point and gives their nodes high trust scores; whether any other page is cheating is then judged from its link relationships to the whitelist nodes. Whitelist nodes spread their trust scores outward along links; if a node ends up with a trust score above a certain threshold it is considered fine, while a page below the threshold is considered a cheating page.
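
A minimal TrustRank-style propagation sketch (my own illustration of the general idea; the decay factor and the score capping are arbitrary choices, not the book's algorithm):

    def propagate_trust(links, whitelist, iterations=20, decay=0.85):
        """links: dict page -> list of pages it links to.
        whitelist: set of seed pages given full trust.
        Returns page -> trust score in [0, 1]."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        trust = {p: (1.0 if p in whitelist else 0.0) for p in pages}
        for _ in range(iterations):
            new_trust = {p: (1.0 if p in whitelist else 0.0) for p in pages}
            for page in pages:
                targets = links.get(page, [])
                if targets and trust[page] > 0:
                    share = decay * trust[page] / len(targets)   # split trust over out-links
                    for t in targets:
                        new_trust[t] = min(1.0, new_trust[t] + share)
            trust = new_trust
        return trust

    # Pages whose final score falls below a chosen threshold would be flagged as likely cheating.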

(2) The distrust propagation model is similar in framework to the trust propagation model. The biggest difference is that the initial page subset is not a set of trustworthy pages but a collection of pages known to be untrustworthy (think of them as a blacklist). Blacklist page nodes are given distrust scores, which are propagated along link relationships; if a page node ends up with a distrust score greater than a set threshold, it is considered a cheating page.

(3) The anomaly discovery model is also a highly abstract algorithmic framework. Its basic assumption is that cheating pages necessarily exhibit features that differ from normal pages, whether in their content or in their links. Designing a concrete algorithm usually means first finding a collection of cheating pages, analyzing their anomalous features, and then using those features to identify other cheating pages.

As long as manipulating search results can bring profit, the motivation to cheat will always exist, especially in an era when online marketing plays an ever larger role. Cheating and anti-cheating form an interactive process of mutual suppression and mutual promotion, a repeated story of "as the defense grows a foot, the attack grows ten." The methods above are mainly technical means of fighting cheating, but at present purely technical means cannot completely solve the problem; manual and technical means must be combined to achieve better results.

Technical means can be divided into relatively general methods and more specialized ones. General methods offer some degree of prevention against new cheating techniques that may appear, but precisely because of their generality they are not very targeted and may not work well against specific cheating methods. Specialized anti-cheating methods tend to come after the fact: only once a form of cheating has occurred and become serious can its characteristics be summarized and filtered out afterwards. Manual means complement technical ones strongly, since people can discover new cheating patterns early, which serves as a kind of early-warning measure. Viewed along the time dimension of suppressing cheating, general anti-cheating methods emphasize prevention, manual means emphasize discovery, and specialized methods emphasize after-the-fact handling; they are intrinsically related and complement one another.

4.2 Analyzing User Intent

Accurately analyzing the user's search intent is currently a major focus of search engine research.

User intent can be roughly divided into:

    • Navigational
    • Informational
    • Transactional

Search logs are an important data source for mining user intent:

    • Click graph: the links a user clicks after issuing a query are more likely to be the results they were hoping for.
    • Query session: queries a user issues within a short period of time are usually related to one another.
    • Query graph: builds structural relationships among user queries.

A user may not be able to come up with the right search terms, or may mistype a keyword; in such cases the engine needs to help the user clarify their search intent.

The common methods are (a toy spelling-correction sketch follows the list):

    • Related searches
    • Query spelling correction
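
As a toy illustration of query error correction (my own sketch; real engines rely on query logs and noisy-channel language models rather than plain edit distance), a misspelled term can be mapped to the closest dictionary term:

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, start=1):
                prev, dp[j] = dp[j], min(
                    dp[j] + 1,          # deletion
                    dp[j - 1] + 1,      # insertion
                    prev + (ca != cb),  # substitution (free if characters match)
                )
        return dp[len(b)]

    def correct(term, dictionary, max_distance=2):
        """Return the dictionary term closest to `term`, if it is close enough."""
        best = min(dictionary, key=lambda w: edit_distance(term, w))
        return best if edit_distance(term, best) <= max_distance else term

    print(correct("serach", {"search", "engine", "crawler"}))  # search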

4.3 Web page deduplication

Statistically, a considerable proportion of web pages are near-duplicates or identical, as high as 29%. If a search returns a large number of similar pages, the quality of the results obviously suffers, which is why web page deduplication is necessary.

Deduplication generally happens after the crawler fetches a page and before it is indexed. A deduplication algorithm has to balance accuracy against running efficiency.

The typical framework of a deduplication algorithm:

    • Feature Extraction
    • Document Fingerprint generation
    • Similarity calculation

Several typical deduplication algorithms (a small shingling sketch follows the list):

    • Shingling: uses contiguous word sequences (shingles) in a document as its features.
    • I-Match: first builds a global feature dictionary from the corpus, then intersects it with each document's features.
    • SimHash: probably the best-performing deduplication algorithm at present.
    • SpotSig
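
A minimal shingling sketch (my own illustration): each document becomes a set of contiguous word n-grams, and two documents are compared by the Jaccard similarity of their shingle sets. Real systems hash and sample the shingles (or use SimHash fingerprints) so that comparisons stay cheap.

    def shingles(text, k=3):
        """Return the set of contiguous k-word sequences (shingles) in the text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Similarity of two shingle sets: 1.0 means identical, 0.0 means disjoint."""
        return len(a & b) / len(a | b) if (a or b) else 1.0

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(doc1), shingles(doc2)))  # high value -> near-duplicate pages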

4.4 Caching mechanism

A caching mechanism speeds up responses to users and saves computing resources.

The goals of the cache system are to maximize the cache hit rate and to keep the cache consistent with the index.

The cached objects are primarily web search results and the inverted (postings) lists of query terms.

Cache eviction strategies mainly include dynamic policies and hybrid policies; a minimal LRU sketch follows.
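
A minimal LRU (least recently used) eviction sketch for a query-result cache (my own illustration of one common dynamic policy, not any particular engine's implementation):

    from collections import OrderedDict

    class LRUCache:
        """Query-result cache that evicts the least recently used entry when full."""

        def __init__(self, capacity=1000):
            self.capacity = capacity
            self._data = OrderedDict()        # query -> cached results, in recency order

        def get(self, query):
            if query not in self._data:
                return None                   # cache miss: fall back to the index
            self._data.move_to_end(query)     # mark as most recently used
            return self._data[query]

        def put(self, query, results):
            self._data[query] = results
            self._data.move_to_end(query)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)    # evict the least recently used entry

    cache = LRUCache(capacity=2)
    cache.put("search engine", ["result1", "result2"])
    cache.put("web crawler", ["result3"])
    cache.get("search engine")           # touching it makes it most recently used
    cache.put("pagerank", ["result4"])   # evicts "web crawler"
    print(list(cache._data))             # ['search engine', 'pagerank']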
