1. Overview of Traditional Search Engine Sorting Algorithms
1.1 Overview of Search Engine Sorting Algorithms
Search engines sort query results according to certain rules so that users can view them; these rules constitute the search engine's sorting algorithm. Common sorting algorithms currently include the direct-hit sorting algorithm, PageRank, paid ranking (bid-for-placement) services, and the word frequency and position weighted sorting algorithm. The direct-hit sorting algorithm is a dynamic sorting algorithm: the results returned by the search engine change according to users' clicks and page browsing time. PageRank is the sorting algorithm used by the well-known search engine Google; it uses the link structure of the Web to compute a PR value for each page and sorts by that value. Paid ranking is a keyword ranking service sold to websites, with the search engine billing per click (or per time period). The word frequency and position weighted sorting algorithm sorts pages by keyword relevance. This article mainly discusses the word frequency and position weighted sorting algorithm.
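To make the PageRank idea concrete, the following is a minimal power-iteration sketch, not the production algorithm; the damping factor 0.85, the iteration count, and the tiny link graph are all illustrative assumptions.

```python
# Minimal PageRank power iteration; damping d = 0.85 and the toy link
# graph are illustrative assumptions. links maps a page to its out-links.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    pr = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(iters):
        nxt = dict.fromkeys(pages, (1.0 - d) / len(pages))
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * pr[p] / len(outs)  # p shares its PR equally
        pr = nxt
    return pr

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```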
1.2 Word Frequency and Position Weighted Sorting Algorithm
The word frequency and position weighted sorting algorithm is a basic algorithm in web page ranking. Lucene, the well-known open-source full-text search library, is built on this idea and has been widely used in search engines: many systems, such as the German website search system ifinder and the open-source search engine Nutch, sort search results based on Lucene. The algorithm uses the relevance between a query keyword and a web page as the sorting criterion, and this relevance is obtained by a weighted calculation over the frequency and position of the keyword within the page. The basic steps are: collect web pages; parse each part of a page; filter out stop words; extract keywords; compute the relevance between the query terms and the page from the positions and frequencies of the keywords; and display the results to the user in order of relevance. Note that this method extracts keywords from the entire page, so the relevance computation is strongly affected by the large amounts of advertising and navigation information found on real pages. If pages can first be segmented into blocks and noise such as advertisements and navigation removed by suitable rules, the accuracy of the relevance computation can be improved effectively.
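As a concrete illustration of the steps above, here is a minimal sketch of frequency-and-position weighted relevance; the position weights (title outranking body) and the field names are assumptions, since the paper does not fix them.

```python
# Sketch of word frequency and position weighted relevance. The position
# weights and field names are assumed values for illustration only.
from collections import Counter

POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def relevance(query_terms, page_fields):
    """page_fields maps a position (field) name to its list of tokens."""
    score = 0.0
    for position, tokens in page_fields.items():
        freq = Counter(tokens)
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in query_terms:
            score += weight * freq[term]  # frequency weighted by position
    return score

page = {"title": ["search", "engine"],
        "body": ["search", "results", "sorted", "by", "relevance"]}
print(relevance(["search", "engine"], page))  # 3 + 3 + 1 = 7.0
```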
2. Introduction to Web Page Segmentation Algorithms
At present there has been a good deal of research on web page segmentation, among which the most effective method is the vision-based page segmentation (VIPS) algorithm proposed by Microsoft Research Asia; this article discusses only VIPS. In the VIPS algorithm, the structure of a web page is defined as a triple

Ω = (V, Φ, Δ)

where V denotes the set of all semantic blocks on the page. These semantic blocks do not overlap one another, and each semantic block can itself be described by a triple of the same form, so the definition applies recursively. Φ denotes the set of all separators on the current page. In fact, once two semantic blocks on a page are determined, the separator between them is also determined; the separators in VIPS do not actually exist on the page but are virtual. Separators include horizontal separators and vertical separators, and every separator has a certain width and height. Δ describes the relationship between pairs of semantic blocks in V. Each δ ∈ Δ is a pair such as (V_i, V_j), indicating that a separator lies between blocks V_i and V_j.
VIPS uses visual cues on the page, such as background color, font color and size, borders, and the spacing between logical blocks, together with the DOM tree, to segment the page. It has three steps: block extraction, separator extraction, and semantic block reconstruction. These three steps together form one complete round of semantic block detection. The page is first divided into several relatively large semantic blocks, and the hierarchy of these blocks is recorded; detection then continues within each large semantic block until every block's DoC (degree of coherence) reaches the predefined PDoC (permitted degree of coherence). The segmentation effect of VIPS is shown in Figure 1.
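The three-step loop can be abstracted as the following toy sketch; the Block class and the pre-assigned DoC values are stand-ins for the real DOM- and layout-based analysis, which is considerably more involved.

```python
# Toy abstraction of the VIPS control loop: split a block until every
# leaf's DoC reaches the permitted DoC (PDoC). DoC values are pre-assigned
# here; real VIPS derives them from visual cues and the DOM tree.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Block:
    label: str
    doc: float                          # degree of coherence (DoC)
    children: List["Block"] = field(default_factory=list)

def segment(block: Block, pdoc: float) -> List[Block]:
    if block.doc >= pdoc or not block.children:
        return [block]                  # coherent enough (or atomic): stop
    leaves = []
    for child in block.children:        # one round of block extraction and
        leaves += segment(child, pdoc)  # separator-guided reconstruction
    return leaves

page = Block("page", 0.2, [
    Block("header", 0.9),
    Block("content", 0.4, [Block("article", 0.8), Block("ads", 0.7)]),
])
print([b.label for b in segment(page, 0.6)])  # ['header', 'article', 'ads']
```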
3. Improvement of the Search Engine Sorting Algorithm
3.1 Web Page Purification Rules
After a web page has been segmented into blocks, identifying noise such as advertisements and navigation bars within the semantic blocks is the key to purifying the page and improving the accuracy of search engine ranking. Through extensive statistics and analysis, the author has summarized rules that identify noise blocks using the amount of text and the number of links in a block, the spatial and relative position of a block, and the content attributes of a block. First, define the origin of the page window as the top-left corner of the page; let x be the horizontal coordinate of a block's center in the window, y the vertical coordinate of the block's center, m the page width, and n the page height. The relative coordinates (x/m, y/n) are used to define the spatial position of a block within the page. The rules are as follows (a sketch implementing them is given after the list):
(1) If a block lies at the top, bottom, left, or right edge of the page (as delimited by the relative position thresholds R1-R4) and the ratio of its text count to its link count is less than F1, the block is a noise block.
(2) If a block lies in the middle of the page and the ratio of its text count to its link count is less than F2, the block is a noise block.
(3) If the entire content of a block is a Flash file, the block is regarded as a noise block.
(4) If a block contains a text control spanning more than 3 lines, the block is judged to be a noise block.
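The following is a hedged sketch of the four rules; the concrete values of F1, F2 and the edge test built on R1-R4 are assumptions, as the paper leaves them as tunable thresholds.

```python
# Sketch of the purification rules. F1, F2 and the R1..R4 edge thresholds
# are tunable; the defaults below are assumptions for illustration.
def is_noise(block, m, n, f1=0.5, f2=0.3, r=(0.2, 0.8, 0.2, 0.8)):
    """block: dict with center x, y, text/link counts and content flags."""
    r1, r2, r3, r4 = r
    rx, ry = block["x"] / m, block["y"] / n        # relative position
    on_edge = rx < r1 or rx > r2 or ry < r3 or ry > r4
    ratio = block["n_text"] / max(block["n_links"], 1)
    if on_edge and ratio < f1:                     # rule (1)
        return True
    if not on_edge and ratio < f2:                 # rule (2)
        return True
    if block.get("is_flash", False):               # rule (3)
        return True
    if block.get("text_control_lines", 0) > 3:     # rule (4)
        return True
    return False

nav = {"x": 40, "y": 500, "n_text": 5, "n_links": 30}
print(is_noise(nav, m=1000, n=1000))  # True: edge block, link-heavy
```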
3.2 Description of the Improved Sorting Algorithm
The improved algorithm uses VIPS to segment web pages into blocks and applies the rules above to purify them, thereby optimizing the sorting algorithm. The specific algorithm is as follows (a toy run follows the listing):
Input: web page library P and query Q; thresholds F1, F2, R1, R2, R3, R4.
Output: sorted page set SP.
(1) Normalize each web page, regularizing irregular tags.
(2) Use VIPS to segment each web page p_i.
(3) Apply the rules to identify and remove noise blocks, purifying each page.
(4) Filter out stop words.
(5) Extract keywords.
(6) Compute the relevance between the query Q and each page p_i by weighting the positions and frequencies of the keywords.
(7) Return SP.
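The following toy run wires these steps together under simplified assumptions: pages arrive pre-segmented into (text, is_noise) blocks, the stop-word list is a small sample, and relevance is reduced to plain term frequency over the purified text.

```python
# End-to-end toy run of the improved algorithm: purify each page's blocks,
# then rank by relevance over the purified text. Blocks are pre-labeled
# (text, is_noise) pairs and scoring is simplified to term frequency.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "of"}  # sample list

def purify(blocks):
    return " ".join(text for text, noisy in blocks if not noisy)

def score(blocks, query):
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    tokens = [t for t in purify(blocks).lower().split()
              if t not in STOP_WORDS]
    return sum(tokens.count(t) for t in terms)

library = {  # hypothetical two-page library P
    "p1": [("student exam report", False), ("cheap ads click now", True)],
    "p2": [("advertising links directory", False)],
}
sp = sorted(library, key=lambda p: score(library[p], "student report"),
            reverse=True)
print(sp)  # ['p1', 'p2']: the purified page about students ranks first
```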
As the algorithm shows, every document is purified before it participates in relevance computation, which is carried out in real time as the user searches. Formula (1) is a typical formula, essentially Lucene's scoring function, for computing a document's score against the user's query terms:

S_j = C_j · Σ_i ( √(t_ij) · F_i² · B · R_j )   (1)

where S_j is the score of document j for the user's query; C_j reflects how many of the query terms occur in document j; t_ij is the frequency of query term i in document j; F_i is the inverse document frequency of query term i; and B is the field boost parameter set during indexing, usually 1.0. The query normalization factor that also appears in Lucene's formula is independent of the document and does not affect document ranking, so it is omitted here. Formula (2) defines the length normalization factor:

R_j = 1 / √(l_j)   (2)

where l_j is the length of document j. It can be seen that segmenting and purifying a page has a comparatively large effect on R_j, and R_j is an important factor affecting the ranking score in formula (1). Assume there are two texts:
A.txt: I am a student. Here is an advertisement.
B.txt: I am a student.
According to formula (2), R is computed as 0.3125 for A.txt and 0.5 for B.txt. B.txt is A.txt with the advertising content removed; after purification, its R value rises and its overall ranking score increases. After a web page is segmented and purified, the improved algorithm uses the purified page rather than the entire page for retrieval, improving the correctness of the ranking.
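To see the length norm's effect numerically, here is a sketch of formula (1) applied to the two texts; the tf and idf variants follow classic Lucene and are assumptions where the text leaves details open, and the stored-norm rounding behind the paper's exact 0.3125 is not reproduced.

```python
# Sketch of formula (1): S_j = C_j * sum_i sqrt(t_ij) * F_i^2 * B * R_j
# with R_j = 1/sqrt(l_j). The tf/idf variants follow classic Lucene and
# are assumptions where the text leaves details open.
import math
from collections import Counter

def idf(term, docs):
    n = sum(1 for d in docs if term in d)        # docs containing the term
    return 1.0 + math.log(len(docs) / (1.0 + n))

def score(doc, query_terms, docs, boost=1.0):
    freq = Counter(doc)
    r = 1.0 / math.sqrt(len(doc))                # length norm R_j
    c = sum(1 for t in query_terms if freq[t]) / len(query_terms)  # C_j
    return c * sum(math.sqrt(freq[t]) * idf(t, docs) ** 2 * boost * r
                   for t in query_terms)

a = "i am a student here is an advertisement".split()  # A.txt
b = "i am a student".split()                           # B.txt
docs = [a, b]
print(score(a, ["student"], docs))  # lower: longer page, smaller R_j
print(score(b, ["student"], docs))  # higher once the ad text is removed
```

Removing the advertisement shortens the document, which raises R_j and hence the score; this is exactly the effect the improved algorithm exploits.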