[Modern Information Retrieval] Search Engine Course Project

Last Update:2015-01-02 Source: Internet

Author: User



1. Assignment requirements

News search: crawl 3-4 sports news sites in a targeted way, and implement extraction, indexing and retrieval of the information on these sites. The collection must contain at least 100,000 pages. Results can be sorted by attributes such as relevance, time and heat (popularity, which we need to define ourselves), and similar news items are clustered automatically.

2. Problem analysis

Topic analysis: we divide the task into four parts: crawling the news data, building the inverted index, implementing the vector space model, and building the front-end interface.

The system consists of four main modules: web crawler, index construction, document scoring, and sorted display. These modules contain several sub-modules: web information extraction, data storage, text analysis, TF-IDF weight calculation, vector space modeling, sorting by relevance, heat and time, similar-document clustering, and related-search recommendations. The following is the overall design diagram of the search engine:

3. Web crawler

In this project, the web crawler is built on the open-source Scrapy framework.

Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages.

Flowchart of this module:

3.1 Defining the format of the extracted data

Before designing the crawler, we need to determine the format of the extracted data. Based on the requirements analysis, we decided to extract the following 10 fields:
Artical: The body of the news, used for text analysis and index building;
ID: The unique identifier of the news, used during crawling to avoid fetching the same news page twice;
Keyword: Keywords of the news, used for related-search recommendations;
Show: The number of comments on the news, used in the definition of heat so that results can be sorted by heat;
Reply: The number of replies to the news comments, used in the definition of heat;
Total: The number of participants in the news discussion, used in the definition of heat;
Source: The source of the news, kept for later expansion, to distinguish sina, sohu, 163, etc.;
Time: The publication date of the news, used for later sorting by time;
Title: The headline of the news;
URL: A link to the news, used in the web display.
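
As a rough illustration of this record format, the sketch below models one crawled page as a plain Python dataclass and drops pages whose ID has already been seen. It is only a sketch under assumptions: the lowercase field names, the field types, and the helpers `dedupe` and `to_json_line` are hypothetical and are not taken from the project's code, which defines its data format inside Scrapy.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class NewsRecord:
    """One crawled news page; fields mirror the 10 fields listed above."""
    artical: str   # news body (spelling follows the field list)
    id: str        # unique identifier, used to skip repeated pages
    keyword: str   # keywords for related-search recommendations
    show: int      # number of comments
    reply: int     # number of replies to comments
    total: int     # number of participants
    source: str    # e.g. "sina", "sohu", "163"
    time: str      # publication date (assumed here to be an ISO date string)
    title: str     # headline
    url: str       # link shown on the results page


def dedupe(records):
    """Keep only the first record seen for each id."""
    seen = set()
    unique = []
    for record in records:
        if record.id not in seen:
            seen.add(record.id)
            unique.append(record)
    return unique


def to_json_line(record):
    """Serialize one record as a single JSON line for storage."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Storing one JSON object per line keeps the crawler output easy to stream into the indexing step without loading the whole file at once.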

3.2 News page parsing
After determining the data format, the next step is to fetch a sports news page and parse out the fields defined above. We extract the page information with Scrapy's HTML parsing class HtmlXPathSelector. Here is a simple example to illustrate:
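
A minimal sketch of XPath-based field extraction is given here. To stay dependency-free it uses the limited XPath support in Python's standard `xml.etree.ElementTree` instead of Scrapy's selector, so it only handles well-formed markup. The page layout (`h1#title`, `span.pub-time`, `div#content`), the sample page, and the function name `parse_news_page` are all assumptions made for illustration; real news sites need their own expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real pages differ per site.
SAMPLE_PAGE = """<html>
  <body>
    <h1 id="title">Example headline</h1>
    <span class="pub-time">2014-12-30</span>
    <div id="content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>"""


def parse_news_page(page_source):
    """Extract title, time and body from one page using XPath-style paths."""
    root = ET.fromstring(page_source)
    title = root.find(".//h1[@id='title']").text
    pub_time = root.find(".//span[@class='pub-time']").text
    paragraphs = root.findall(".//div[@id='content']/p")
    body = "\n".join((p.text or "") for p in paragraphs)
    return {"title": title, "time": pub_time, "artical": body}


if __name__ == "__main__":
    print(parse_news_page(SAMPLE_PAGE))
```

In a Scrapy spider, the same kind of XPath expressions would be applied to the downloaded response through the selector class rather than to a string literal.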

4. Building the inverted index

The inverted index is a core component of information retrieval. This module has four key steps: extracting the news content from the JSON files, segmenting the news content into words, counting term frequencies and document frequencies, and calculating TF-IDF weights.

4.1 Document word segmentation
Among the common Chinese word segmentation tools, IKAnalyzer is widely used: it segments well, is easy to extend, and is well regarded. We therefore chose IKAnalyzer for this project. IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through 4 major versions.
Examples of segmentation results:
IK Analyzer 2012 supports both fine-grained segmentation and smart segmentation. The following are examples of the two modes.
Original text 1 (in Chinese; shown here in translation, with the tokens below given as English glosses):
IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has released 3 major versions.
Smart segmentation result:
ikanalyzer | is | a | open source | of | based on | java | language | development | of | lightweight | of | Chinese | word segmentation | toolkit | from | 2006 | December | launch | version 1.0 | start | ikanalyzer | already | push | out | 3 | major | version
Finest-grained segmentation result:
ikanalyzer | is | a | one | a | open source | of | based on | java | language | development | of | lightweight | Chinese | word segmentation | toolkit | tool | kit | from | 2006 | year | 12 | month | launch | 1.0 | version | start | ikanalyzer | already | launch | out | 3 | a | major | version
Installation and deployment: deployment is simple. Place IKAnalyzer2012.jar in the project's lib directory, and place the IKAnalyzer.cfg.xml and stopword.dic files in the class root directory (for a web project, usually the WEB-INF/classes directory, alongside the Hibernate, log4j and other configuration files).
Main process: traverse all the TXT documents of news bodies extracted in the previous step, segment each one, and save the segmentation results locally.
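
The traversal itself can be sketched in a few lines. The sketch below is written in Python rather than Java and does not call IKAnalyzer: `segment` is a hypothetical placeholder that merely splits on whitespace (which is not adequate for Chinese text), and the directory layout and the function name `segment_directory` are assumptions. A real tokenizer would be passed in through the `tokenizer` parameter.

```python
from pathlib import Path


def segment(text):
    """Placeholder tokenizer: whitespace split, standing in for a real segmenter."""
    return text.split()


def segment_directory(src_dir, dst_dir, tokenizer=segment):
    """Segment every .txt file in src_dir and save the tokens under dst_dir.

    Tokens are written space-separated, one output file per input file.
    Returns the number of documents processed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):
        tokens = tokenizer(path.read_text(encoding="utf-8"))
        (dst / path.name).write_text(" ".join(tokens), encoding="utf-8")
        count += 1
    return count
```

Saving the segmented text to disk means the slower segmentation step runs once, and index construction can be repeated without redoing it.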
4.2 Inverted index construction

The assignment requires 100,000 documents, which is not a large amount of data. We therefore build the index in memory: all statistics are collected in memory, and the index is built from a single scan of the collection.

Main flow: traverse the 100,000 segmented documents from the previous step. In the loop, we record the positions and the number of occurrences of each term in each document, as well as its total number of occurrences across all 100,000 documents. In the implementation, we use two HashMaps to store the results. Each uses the term as the key and the document name, positions in the document and other properties as the value (joined with a special separator character). One map holds the per-document statistics and the other holds the statistics over all documents.
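
The single-pass, two-map construction can be sketched as follows. This is an illustration in Python, not the project's Java code: it uses nested dictionaries instead of separator-joined strings as values, and the names `build_index`, `postings` and `collection_freq` are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Build an in-memory inverted index in one scan.

    docs maps a document name to its list of tokens.
    Returns two maps keyed by term:
      postings[term][doc_name] -> list of positions of the term in that document
      collection_freq[term]    -> total occurrences across all documents
    """
    postings = defaultdict(dict)
    collection_freq = defaultdict(int)
    for name, tokens in docs.items():
        for position, term in enumerate(tokens):
            postings[term].setdefault(name, []).append(position)
            collection_freq[term] += 1
    return postings, collection_freq


if __name__ == "__main__":
    docs = {"d1": ["a", "b", "a"], "d2": ["b"]}
    postings, collection_freq = build_index(docs)
    # Term frequency in a document is the length of its position list;
    # document frequency is the number of documents in the posting map.
    print(len(postings["a"]["d1"]), len(postings["b"]), collection_freq["b"])
```

Because the position lists are kept, both the term frequency per document and the document frequency per term fall out of the same structure without a second pass.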

4.3 Calculating TF-IDF weights

TF-IDF is the best-known weighting scheme in information retrieval. The TF-IDF weight is calculated by the following formula:

Here df_t is the number of documents in which the term t appears, from which the inverse document frequency is derived. tf_{t,d} is the number of times that t appears in document d; it is a document-dependent quantity called the term frequency. The weight increases with the term frequency and with the rarity of the term in the collection.
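
As a hedged illustration, the sketch below implements one common variant of the weight, w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t), where N is the total number of documents. The choice of this log-scaled variant, the symbol N, and the function name `tf_idf` are assumptions; the project may use a different form of the formula.

```python
import math


def tf_idf(tf, df, n_docs):
    """TF-IDF weight of a term in a document (log-scaled variant, assumed).

    tf:     number of times the term appears in the document
    df:     number of documents that contain the term
    n_docs: total number of documents in the collection
    """
    if tf <= 0 or df <= 0:
        return 0.0  # a term that does not occur carries no weight
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


if __name__ == "__main__":
    # A term in 10 of 1,000 documents is rarer than one in 500 of them,
    # so it receives a higher weight at the same term frequency.
    print(tf_idf(3, 10, 1000), tf_idf(3, 500, 1000))
```

A term that appears in every document gets weight 0 under this variant, which matches the intuition that it does not help distinguish documents.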

5. Scoring and ranking

We use the vector space model to compute the similarity between a document and a query: both the query and the document are represented as vectors in the term space, and the similarity between the two vectors is measured with cosine similarity.

Document vectorization: each document is represented by its TF-IDF weights.

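Similarity formula:

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^2}\,\sqrt{\sum_{i} d_i^2}}$$

where q_i and d_i are the weights of term i in the query vector and the document vector.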
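
Cosine similarity between a query and a document can be sketched over sparse vectors, here represented as dictionaries from term to weight. The dictionary representation and the function name `cosine_similarity` are assumptions for illustration, not the project's implementation.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors.

    Each vector is a dict mapping term -> weight; missing terms count as 0.
    Returns 0.0 if either vector has zero length.
    """
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if query_norm == 0.0 or doc_norm == 0.0:
        return 0.0
    return dot / (query_norm * doc_norm)
```

Only terms that occur in the query contribute to the dot product, so scoring a document costs time proportional to the query length plus the document's vector length.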

Sorting: results are ranked by their similarity score, from highest to lowest. Simple sorting by time or by heat is also supported.
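
The three sort orders can be sketched as below. The heat definition used here (the plain sum of the Show, Reply and Total counts) is purely hypothetical, since the actual definition is left to the implementer; the result dictionaries, the `score` key, and the function names are also assumptions.

```python
def heat(doc):
    """Hypothetical heat score: comments + replies + participants."""
    return doc["show"] + doc["reply"] + doc["total"]


# Each sort mode maps a result to the value it is ordered by.
SORT_KEYS = {
    "relevance": lambda doc: doc["score"],  # similarity score
    "time": lambda doc: doc["time"],        # ISO dates sort chronologically as strings
    "heat": heat,
}


def rank(results, by="relevance"):
    """Return results ordered from highest to lowest by the chosen attribute."""
    return sorted(results, key=SORT_KEYS[by], reverse=True)
```

For example, `rank(results, by="time")` lists the most recent news first, and `rank(results, by="heat")` lists the most discussed news first.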

6. Front-end interface

The interface is simple and clean. It consists of a user input page and a results page.

Effect:

At present, the generation of result summaries (snippets) is still crude.

All the code will be open-sourced on GitHub. It has not been uploaded yet and will be updated soon.

Simandou XIAOP

Source: http://www.cnblogs.com/panweishadow/
For non-commercial purposes, you are free to reprint, but please retain the original author information and article link URL.


