Search Related Terms


From: http://banditjava.iteye.com/blog/253184

Recently, monner shared the search engine text "Principles, Technologies, and Systems", which I found very rewarding. Below I list the terms from its appendix. Let's take a look.

Appendix. Terms
B:
Semi-structured data. Compared with plain text, web page data has some structure, expressed through HTML markup. Compared with data in relational databases, which rest on a strict theoretical model, the structure that HTML markup provides is much weaker. Data on the web is therefore called semi-structured data; this is a basic feature of web data.

Boolean model. The term has different meanings in different contexts within information retrieval. When discussing a query submitted by a user, it refers to the logical operations that combine the result subsets of the individual query components into the final result set. When comparing documents in the vector space model, it refers to the case where each component of a document vector takes only the values 1 and 0, indicating whether or not the corresponding feature term occurs in the document.
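
The two senses of the Boolean model can be sketched in a few lines of Python. This is only an illustration: the three toy documents, the `postings` dictionary, and the AND/OR/NOT reading of the query operations are assumptions, not part of the original glossary.

```python
# Toy document collection (hypothetical data).
docs = {1: "web search engine", 2: "web search log", 3: "cache policy"}

# Map each word to the set of documents that contain it.
postings = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        postings.setdefault(word, set()).add(doc_id)

# Sense 1: Boolean operations combine per-term result subsets.
result = (postings["web"] & postings["search"]) - postings["log"]  # web AND search NOT log
either = postings["engine"] | postings["cache"]                    # engine OR cache

# Sense 2: a Boolean document vector holds only 1 (term occurs) or 0 (term absent).
vocabulary = sorted(postings)
words_in_doc1 = set(docs[1].split())
binary_vector = [1 if word in words_in_doc1 else 0 for word in vocabulary]
```

Here set intersection, union, and difference play the roles of AND, OR, and NOT.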

C:
Recall. A measure of the quality of a retrieval system: the number of documents retrieved by the system that are relevant to the query, as a percentage of all documents relevant to the query.
Query. An expression of the user's information need, written in the input language and rules provided by the information system. Common input languages include keywords and Boolean connectors.
Precision. A measure of the quality of a retrieval system: the number of relevant documents retrieved by the system, as a percentage of all documents retrieved. It reflects the "correctness" of the search results.
Dictionary (vocabulary). The set of all distinct terms in a document (or in a collection of documents).
Term frequency (TF). TF(i, j) is the number of times the term Ti appears in the document Dj.
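
The recall and precision entries above can be illustrated with a small calculation. This is a minimal sketch; the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the system returns four documents, three are truly relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved_docs, relevant_docs)
```

With this data, two of the four retrieved documents are relevant, and two of the three relevant documents were found.
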
D:
Agent (proxy program). A program, process, or system that, after receiving a user's request in an application, completes the task and returns the result without user supervision. In information retrieval, an agent searches an archive or information repository for content related to the keywords given by the user; it is sometimes called an intelligent agent.

Inverted file. A method for organizing and indexing documents that is built around a set of keywords. Each keyword in the set corresponds to a list of records, and each record contains a document number, information about the keyword's occurrences in that document, and so on.
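
As a rough illustration of the inverted file, the sketch below maps each keyword to a list of records. Using word positions as the per-document "information about the keyword" is an assumption made for the example, as are the function name and the toy documents.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a list of (doc_id, positions) records."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((doc_id, pos_list))
    return dict(index)

# Hypothetical three-document collection.
docs = {1: "web search engine", 2: "search the web", 3: "engine room"}
index = build_inverted_index(docs)
```

Looking up `index["search"]` then yields the documents that contain the word without scanning the documents themselves.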

Inverse document frequency (IDF). Generally, IDF(Ti) = log(N / ni), where N is the total number of documents and ni is the number of those N documents that contain the term Ti.
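
Term frequency and inverse document frequency can be computed directly from their definitions. This is a sketch under assumptions: it uses the natural logarithm and returns 0 for a term that appears in no document, neither of which the glossary specifies, and the corpus is toy data.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Number of times the term appears in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """log(N / n_i), where n_i is the number of documents containing the term."""
    n_i = sum(1 for doc in corpus if term in doc)
    if n_i == 0:
        return 0.0  # assumption: unseen terms get weight 0 rather than an error
    return math.log(len(corpus) / n_i)

# Hypothetical corpus of four tokenized documents.
corpus = [
    ["web", "search", "engine"],
    ["web", "page"],
    ["search", "search", "log"],
    ["web", "crawler"],
]
weight = tf("search", corpus[2]) * idf("search", corpus)
```

A term that appears in few documents, such as "engine" here, gets a higher IDF than one that appears in most of them, such as "web".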

Dynamic web page. A web page that can be obtained only by submitting query information.
Dynamic summary (dynamic abstract). A method for summarizing documents. When responding to a user's query, the search engine locates the query words in the document, extracts the text surrounding them, and returns it to the user. Because different queries match different words in a document, dynamic summarization may produce different summaries for the same document.

G:
Bag-of-words assumption (shared vocabulary). The most basic assumption of information retrieval technology: the meaning of a document can be expressed by the set of keywords it contains.
H:
Hypertext Markup Language (HTML). One of the key web technologies. It provides a standard way to express hypertext documents in ASCII format.
Cache. A concept that appears often in computer science. Its basic idea is to exploit the locality principle to build an intermediate layer between two components of different speeds. It can sit between the CPU and RAM, or between an application's I/O operations and the disk. In search engines, various in-memory caches, including the query cache, click cache, and inverted list cache, are designed to ease the conflict between the demand for fast queries and slow disk access.

J:
Static web page. A web page that can be obtained without submitting query information.
Mirror web page. A web page whose content is identical to that of another page, without any modification.
Locality principle. A property of program behavior that includes temporal locality and spatial locality. Temporal locality means that if a piece of data has been accessed, it is likely to be accessed again in the near future. Spatial locality means that if a piece of data has just been accessed, the data stored next to it is likely to be accessed soon.

Denial of service (DoS). An attack that floods a website's servers with a large volume of requests that demand responses, consuming network bandwidth or system resources until the network or system is overwhelmed and stops providing normal service.

L:
Link analysis. Web pages and the links between them can be viewed as a huge directed graph. Link analysis refers to techniques that use the link information between web pages to judge their importance (or relevance). Common link information includes a page's out-links, in-links, and anchor text. Common link analysis algorithms include PageRank, HITS, SALSA, PHITS, and Bayesian algorithms.
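
To make link analysis concrete, here is a minimal power-iteration sketch of PageRank, one of the algorithms named above. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a dict of page -> out-links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank to the pages it links to, in equal shares.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical directed graph: an edge A -> B means page A links to page B.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this graph, page C has the most in-links and ends up with the highest score, while page D, which nothing links to, ends up with the lowest.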

M:
MD5 (Message Digest 5). A message digest algorithm, defined in RFC 1321. Its basic function is to convert a message of any length into a 128-bit digest. The probability that two different messages have the same digest is extremely small, and how close two digests are says nothing about how close the two messages are.
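
The 128-bit digest can be observed with Python's standard `hashlib` module. This is just a small demonstration; the helper name `md5_digest` and the sample messages are made up for the example.

```python
import hashlib

def md5_digest(message: bytes) -> bytes:
    """Return the raw MD5 digest of a message of any length."""
    return hashlib.md5(message).digest()

# Two nearly identical messages (hypothetical inputs).
a = md5_digest(b"search engine")
b = md5_digest(b"search engines")

digest_bits = len(a) * 8  # the digest is always 16 bytes long
# Count how many bits differ between the two digests.
differing_bits = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

Although the two messages differ by a single character, their digests are unrelated, which is why a search engine can use digests to detect exact duplicates but not near duplicates.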

Anchor text. The descriptive text of a link in an HTML document, which tells readers about the nature or characteristics of the page the link points to. For example, if a web page contains <a href="http://www.cctv.com">News Channel</a>, then "News Channel" is the anchor text of the link to http://www.cctv.com on that page.

Hub page (directory-style web page). A web page that provides many hyperlinks pointing to other web pages. It is the counterpart of the authority page.
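
Anchor text can be pulled out of a page with the standard `html.parser` module. This is a minimal sketch; the class name `AnchorTextParser` and the surrounding sample HTML are assumptions, and nested or malformed links are not handled.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorTextParser()
parser.feed('<p>See the <a href="http://www.cctv.com"> News Channel </a> page.</p>')
```

After parsing, `parser.anchors` holds the link target together with its anchor text.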

Q:
Zipf's law. A law of word frequency distribution proposed by the American scholar G. K. Zipf in the 1940s. It can be stated as follows: count the occurrences of each word in a long document, sort the words by frequency in descending order, and number them with natural numbers, so that the most frequent word has rank 1, the second most frequent has rank 2, and so on. If F denotes the frequency and R denotes the rank, then F = C/R, where C is a constant.
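
The ranking procedure behind Zipf's law can be sketched as follows. The token list is toy data constructed so that the product of rank and frequency is exactly constant; real text only follows the law approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ordered, start=1)
    ]

# Hypothetical token stream: "the" x 6, "of" x 3, "web" x 2.
tokens = ("the " * 6 + "of " * 3 + "web " * 2).split()
table = rank_frequency(tokens)
```

In the last column, rank times frequency plays the role of the constant C in F = C/R.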

Word segmentation. Mainly used in Chinese information processing: dividing a sentence into a sequence of words. For example, the phrase "Network and Distributed System Laboratory", which is written in Chinese without spaces between words, is segmented into the word sequence "network", "and", "distributed", "system", "laboratory".

Full-text retrieval. A method (or a level of granularity) of text information retrieval. Its defining feature is that not only can every word in a document be retrieved, but every occurrence of a word can be retrieved as well.

Authority page. The counterpart of the hub (directory-style) page. An authority page usually covers a specific topic and is linked to by many other web pages.
S:
Hash table. A data structure that supports fast lookup. When the table is built, a hash function computes an index code for each data item. Because these codes appear random, the data is spread more evenly across the table, which can greatly reduce the time needed for later lookups.

Digital library. The collection, organization, and presentation of digital information objects, together with the related information technologies that deliver these objects to users. It includes services that allow users to locate, retrieve, and obtain information objects.

Search engine (SE). An application software system on the Web. It collects and discovers information on the Web according to certain policies, processes and organizes that information, and provides users with a web information query service.

Index term carrier. HTML tag information that identifies the font, case, and other presentation attributes of an indexed word in a document.

T:
Stop word. A word that appears so frequently in documents that it has little value for retrieval. For example, common stop words in English include "the", "a", and "it". In Chinese, common stop words include function words and particles (rendered in translation as "yes" and "location").

Throughput. The total number of tasks completed by the system per unit of time. For a search engine, it is the maximum number of user queries the system can serve per unit of time (per second).
U:
URL (Uniform Resource Locator). A description specification used to locate information resources on the Internet. A web page is usually located by a URL of the form "http://host/path/file.html", and an FTP resource by a URL of the form "ftp://host/path/file".

URL domain name depth. The number of subdomains contained in the domain name part of a web page's URL.
URL directory depth. The number of directory levels in a web page's URL after the domain name part is removed, that is, in the localpath section of url = scheme://host/localpath. If the URL is http://www.pku.edu.cn, the directory depth is 0; if it is http://www.pku.edu.cn/cs, the directory depth is 1.
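
The directory depth of a URL can be computed with the standard `urllib.parse` module. This is a sketch under one assumption: every non-empty path segment counts as a level, including a trailing file name, which the glossary does not settle.

```python
from urllib.parse import urlsplit

def directory_depth(url):
    """Count the non-empty segments of the local path of a URL."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])

# The two examples from the glossary entry above.
depth_root = directory_depth("http://www.pku.edu.cn")
depth_cs = directory_depth("http://www.pku.edu.cn/cs")
```

A crawler might use such a depth to prioritize shallow pages, although that use is not something the glossary itself states.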

W:
Page outdegree. The number of hyperlinks on a web page that point to other web pages.
Noise cleaning. The process of identifying and removing web page noise, that is, information unrelated to the main content of the page, such as advertisements and copyright notices.
Gatherer. The process or thread in the web page collection subsystem that fetches a web page given its URL. A collection subsystem usually runs many gatherers at the same time.
Page indegree. The number of hyperlinks across the whole Web that point to a given web page.
Web page collection subsystem (crawler). In a search engine system, the subsystem that fetches web pages one by one by following the links between HTML documents. Because it crawls along the hyperlinks of the Web, this kind of program is also called a spider. Crawler, spider, robot, and bot generally refer to the same thing.

Document Object Model (DOM). DOM converts an XML document into a set of objects, and the resulting object model can then be processed freely. This mechanism is also described as "random access", because any part of the data can be accessed at any time and then modified or deleted, or new data can be inserted.

Automatic text categorization (ATC). Using a computer program to determine which of a set of predefined categories a given document belongs to.
X:
First in, first out (FIFO). A page replacement algorithm that always evicts the page that was loaded into main memory earliest, that is, the page that has stayed in main memory the longest.
Relevance ranking. The ordering of the results returned by an information retrieval system, in which the order of the items reflects the system's judgment of how relevant each result is to the query.
Vector space model (VSM). Based on the bag-of-words assumption: given a set of documents whose total vocabulary is Σ, each document can be represented by a vector whose elements quantify the occurrence of the corresponding words in that document. A group of documents can then be seen as points in a vector space, so the notion of distance in that space can be used to evaluate the similarity between two documents.
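
The vector space model can be illustrated with raw term counts as vector elements. The glossary speaks only of "distance" in general; cosine similarity is used here as one common choice, and the three toy documents are assumptions.

```python
import math
from collections import Counter

def to_vector(tokens, vocabulary):
    """Represent a document as term counts over a shared vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical documents, already tokenized.
doc1 = "web search engine".split()
doc2 = "web search log".split()
doc3 = "cache replacement policy".split()
vocabulary = sorted(set(doc1) | set(doc2) | set(doc3))
v1, v2, v3 = (to_vector(d, vocabulary) for d in (doc1, doc2, doc3))
sim_12 = cosine_similarity(v1, v2)
sim_13 = cosine_similarity(v1, v3)
```

Documents that share two of their three words score higher than documents that share none.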

Response time. The time between the submission of a request (or query) and the display of the response. For a search engine, it is the time from when the user submits a query until the returned results appear. In practice, because this time depends on constantly changing network conditions, the time the retrieval system itself takes to complete a query is usually used as the response time.

Deduplication (replica or near-replica detection). Removing mirror pages and reprinted pages from the collected set of web pages.
Protocol. A set of rules that coordinates the operation of the functional units involved in communication.
Information retrieval (IR). Organizing and storing information in a certain way, and finding the relevant information according to the user's needs.
Information retrieval model (IR model). A set of assumptions and algorithms used to rank a document set with respect to user queries. An IR model can be expressed as a four-tuple <D, Q, F, R(Qi, Dj)>, where D is a document set, Q is a query set, F is a framework for modeling documents and queries, and R(Qi, Dj) is a ranking function that assigns a ranking value to the relevance between the query Qi and the document Dj. Common information retrieval models include set-theoretic models, algebraic models, and probabilistic models.

Y:
User query log. Information recorded automatically by the system when a user submits a query. It includes the query keywords, the submission time, the user's IP address, and the page number requested (query results are usually displayed in pages of 10 results each, the first page is numbered 1, and when the user turns pages the page number is that of the result page the user selected).

User click log. Information recorded automatically by the system when a user browses the query results and clicks on a page. It usually includes the time of the click, the URL of the clicked page, the user's IP address, the serial number of the clicked page (its position in the query results), and the query words that led to the clicked page.

Metadata. Data that describes the attributes of a class of resources (or objects), used to locate and manage those resources and to facilitate data retrieval.
Meta search engine. Also known as an integrated search engine. It sends the user's query to multiple independent search engines, collects their results, selects and re-ranks them according to some algorithm, and returns the final result to the user.

Z:
Chinese information processing. Processing and operating on the sound, form, and meaning of Chinese-language information. It includes technologies for the input, output, recognition, conversion, compression, storage, retrieval, analysis, understanding, and generation of characters, words, phrases, sentences, and passages.

Topic-specific (focused) crawling. A topic-oriented information collection system. Its main task is to fetch, with limited network bandwidth, storage capacity, and time, as many web pages as possible that are closely related to the topic.

Reprinted web page (near-replica web page). A web page whose content is basically the same as another page's but may carry some additional editing information. Although some changes have been made, the main content is unchanged; that is, apart from web page noise (such as advertisements and copyright notices), the rest of the text is the same. A reprinted web page is also called an approximate mirror web page.

Least frequently used (LFU). A data replacement policy for maintaining cache contents. When the cache is full and new data arrives, it always evicts the existing item that has been used least frequently in the past. The granularity of replacement can be chosen to suit the application scenario.

Least recently used (LRU). A data replacement policy for maintaining cache contents. When the cache is full and new data arrives, it always evicts the existing item that has gone unused for the longest time.
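
A minimal LFU cache might look like the sketch below. Several details are assumptions: the class name `LFUCache`, counting a `put` as one use, and breaking ties in favor of evicting the item inserted earliest. The linear scan for the victim keeps the code short but would be too slow for a large cache.

```python
class LFUCache:
    """Evict the least frequently used entry when the cache is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._values = {}
        self._counts = {}  # key -> number of uses so far

    def get(self, key):
        if key not in self._values:
            return None
        self._counts[key] += 1
        return self._values[key]

    def put(self, key, value):
        if self.capacity <= 0:
            return
        if key not in self._values and len(self._values) >= self.capacity:
            # Linear scan for the smallest use count (ties: earliest inserted).
            victim = min(self._counts, key=self._counts.get)
            del self._values[victim]
            del self._counts[victim]
        self._values[key] = value
        self._counts[key] = self._counts.get(key, 0) + 1

# Hypothetical usage: cache query results under the query string.
lfu = LFUCache(2)
lfu.put("q1", "r1")
lfu.put("q2", "r2")
lfu.get("q1")        # q1 has now been used more often than q2
lfu.put("q3", "r3")  # cache is full, so q2 is evicted
```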
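
For comparison, an LRU cache can be sketched with `collections.OrderedDict`, which remembers insertion order and lets an entry be moved to the end when it is used. The class name `LRUCache` and the sample keys are made up for the example.

```python
from collections import OrderedDict

class LRUCache:
    """Evict the entry that has gone unused for the longest time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # oldest use first, most recent use last

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry

# Hypothetical usage: cache query results under the query string.
lru = LRUCache(2)
lru.put("q1", "r1")
lru.put("q2", "r2")
lru.get("q1")        # q1 becomes the most recently used entry
lru.put("q3", "r3")  # cache is full, so q2 is evicted
```

Unlike LFU, this policy ignores how often an item was used and looks only at when it was last used.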
