Summary of Chapter 19 of Introduction to Information Retrieval

Contents
  • User query type
  • Shingling

 

I. Introduction to Web Search

Previously, we searched fixed, traditional document collections, but web search is quite different: the size of the web document collection cannot be reliably estimated, and the documents come in many forms;

Generally, the Web uses a browser/server (B/S) architecture: the client is a browser, the server is a web server, and data is transmitted over HTTP;

When the browser sends a request and receives the server's response, it simply ignores the parts of the response it cannot interpret;

The web document set is massive, but this information is useless if it cannot be searched. Therefore, web search is very important.

The results returned by a web search must be both relevant and authoritative;

 

One difficulty is that some web pages consist only of images and contain no text;

Static page: fixed content;

Dynamic page: content generated by interacting with a database;

 

A set of web pages can be viewed as a graph: a node represents a web page and an edge represents a hyperlink;

Note: the web graph may not be strongly connected, i.e., node A may not be able to reach node B by following links;

 

The number of web pages with in-degree i is roughly proportional to 1/i^α, a power law; the exponent α is reported to be about 2.1;

 

To improve the user experience, we generally need to:

(1) keep the search interface simple, and keep the returned result pages as clean as possible;

(2) keep the precision of the returned results as high as possible;

User query type

(1) Informational query: find pages on a topic of interest;

(2) Navigational query: find the official homepage of a specific entity, such as a company;

(3) Transactional query: the user wants to carry out an action, such as downloading a file;

II. Web spam (cheating web pages)

The main motive behind web spam is economic benefit;

1. Repeating keywords on a page to boost its ranking;

2. Rendering repeated keywords in the page's background color, so users cannot see them but the search engine still indexes them;

3. Paid inclusion: paying the search engine to rank one's pages at the top;

4. Cloaking: the page served to the crawler differs from the page served to a user's browser; for example, the server returns a relevant document when the crawler fetches the page, but serves the user a different page;

5. Doorway page: user --> doorway page --> commercial page; the doorway page itself looks relevant, but visiting it redirects the user straight to another page;

III. Sponsored search

On the results page, the right-hand side is reserved for sponsors; the more a sponsor pays, the higher its ad is ranked;

Billing is usually CPC (cost per click): each time a user clicks the ad, the sponsor pays the search-engine company;

Some people use click spam (fraudulent clicks) to force sponsors to pay the search-engine company more;

IV. Comparing index sizes

Note:

1. A search engine may return web pages that are not in its index;

2. A search engine does not return every page in its index;

Therefore, it is difficult to accurately estimate the index size;

Note: while collecting pages, a search engine's crawler may fall into a spider trap: once the server is visited, it automatically generates an endless stream of pages for the crawler to collect;

Compare the index scale of two search engines

Given two search engines E1 and E2, we sample pages at random from each, then measure the fraction of E2's sample that E1 also indexes and the fraction of E1's sample that E2 also indexes;

For example:

|E1|/|E2| = y/x = (1/6)/(1/2) = 1/3; therefore E1's index is about one third the size of E2's;

y is the fraction of pages sampled from E2 that are also indexed by E1;

x is the fraction of pages sampled from E1 that are also indexed by E2;
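The y/x estimate above can be sketched as a small Python simulation. This is a minimal sketch; the toy indexes and the function name `estimate_size_ratio` are illustrative assumptions, not taken from the chapter:

```python
import random

def estimate_size_ratio(sample_e1, sample_e2, in_e1, in_e2):
    """Capture-recapture estimate of |E1| / |E2|.

    sample_e1 / sample_e2: pages sampled uniformly from each engine's index.
    in_e1 / in_e2: predicates testing whether a page is indexed by
    E1 / E2 (in practice, by querying the engine for the page's URL).
    """
    # x: fraction of E1's sample that E2 also indexes
    x = sum(in_e2(p) for p in sample_e1) / len(sample_e1)
    # y: fraction of E2's sample that E1 also indexes
    y = sum(in_e1(p) for p in sample_e2) / len(sample_e2)
    # |E1 ∩ E2| ≈ x·|E1| ≈ y·|E2|, hence |E1|/|E2| ≈ y/x
    return y / x

# Toy illustration with known indexes (sets of page ids):
E1 = set(range(0, 1000))      # |E1| = 1000
E2 = set(range(500, 3500))    # |E2| = 3000, overlap = 500
random.seed(0)
s1 = random.sample(sorted(E1), 200)
s2 = random.sample(sorted(E2), 200)
ratio = estimate_size_ratio(s1, s2, E1.__contains__, E2.__contains__)
print(round(ratio, 2))  # close to the true ratio 1000/3000 ≈ 0.33
```

In practice neither sample is truly uniform and the membership tests are noisy, which is exactly why the sampling techniques of the next section are needed.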

V. Random web-page sampling methods

Above, we estimated search-engine index sizes from randomly drawn web pages, but truly uniform sampling of the web is impossible, so several approximate random-sampling techniques are used:

(1) Random search: track users' query logs and pick pages at random from the query results;

(2) Random IP addresses: generate IP addresses at random and collect all web pages from the server at each address;

(3) Random walk: if the web graph were strongly connected, a random walk would converge to a stationary distribution from which pages could be sampled;

(4) Random query: generate a random query and submit it to E1; pick a page from the results, randomly extract 6-8 low-frequency words from that page, and use them as a query to E2;

VI. Near duplicates

About 40% of the pages on the web are duplicates: some are exact duplicates, others near duplicates, e.g., identical content with different creation dates;

Search engines need to avoid indexing duplicate pages;

For detecting exact duplicates, a fingerprint can be computed for each web page and compared;
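As a minimal sketch of fingerprint-based exact-duplicate detection: the choice of SHA-256 and the whitespace normalization are assumptions of this sketch; the chapter does not prescribe a particular hash:

```python
import hashlib

def fingerprint(page_text: str) -> str:
    """Fixed-length fingerprint of a page's content (SHA-256 here;
    any stable hash works for exact-duplicate detection)."""
    # Collapse whitespace so trivial formatting differences don't
    # change the fingerprint; real systems normalize more aggressively.
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two pages collide on the fingerprint iff their normalized text matches:
seen = {}
pages = {
    "http://a.example/1": "hello   world",
    "http://b.example/1": "hello world",   # exact duplicate after normalization
    "http://c.example/1": "hello there",
}
for url, text in pages.items():
    fp = fingerprint(text)
    if fp in seen:
        print(f"{url} duplicates {seen[fp]}")
    else:
        seen[fp] = url
```

Exact fingerprints only catch identical content; the shingling technique below is what handles *near* duplicates.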

Shingling

The k-shingles of document D are analogous to k-grams: each is a sequence of k consecutive terms. For example, the 3-shingles of "a hello world a hello" are "a hello world", "hello world a", and "world a hello";

Method 1: compute the similarity of two documents' shingle sets with the Jaccard coefficient; doing this exactly for all document pairs is too expensive;
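A minimal sketch of k-shingling and the Jaccard coefficient (the function names are illustrative):

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-shingles (sequences of k consecutive terms) of a document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = shingles("a hello world a hello")
print(sorted(d1))  # ['a hello world', 'hello world a', 'world a hello']

d2 = shingles("a hello world a goodbye")
print(jaccard(d1, d2))  # 0.5 (2 shared shingles out of 4 distinct)
```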

Method 2:

We can instead use random permutations (min-hashing):

 

 

Steps:

(1) map each k-shingle of each document to an integer with a hash function;

(2) apply a random permutation to these hashed values;

(3) for each document, keep the minimum value under the permutation (its min-hash);

(4) if the two documents' minima agree across most of many random permutations, the documents are likely near duplicates; the fraction of permutations on which the minima agree estimates the Jaccard similarity of the shingle sets;

 

 

 

 

 
