[Modern Information Retrieval] Search Engine Course Project
I. Topic Requirements
News search: targeted crawling of 3-4 sports news sites to extract, index, and retrieve the information on those sites. The number of pages must be no less than 100,000. Results should be sortable by attributes such as relevance, time, and popularity (which must be defined by ourselves), and automatic clustering of similar news should be supported.
II. Problem Analysis
Topic analysis: we divide the task into four parts: crawling the news data, building the inverted index, implementing the vector space model, and building the front-end interface.
The system thus consists of four modules: web crawler, index construction, document scoring, and ranked display. These modules in turn contain several sub-modules: web information extraction, data storage, text analysis, TF-IDF weight calculation, vector space model construction, sorting by relevance/popularity/time, similar-document clustering, and related-search recommendation. The design diagram of the whole search engine is shown below:
III. Web Crawler
In this project, the web crawler is built on the open-source Scrapy framework.
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages.
Flowchart of this module:
3.1 Defining the Extracted Data Format
To design the crawler, we first need to determine the format of the extracted data. Based on our requirements analysis, we decided to extract the following ten fields (a minimal Scrapy Item sketch of these fields follows the list):
Artical: the body of the news, used for text analysis and index building;
ID: the unique identifier of the news, used during crawling to avoid fetching the same page twice;
Keyword: the keywords of the news, used for related-search recommendation;
Show: the number of comments on the news, used in the popularity definition for sorting by popularity;
Reply: the number of replies to the news comments, used in the popularity definition for sorting by popularity;
Total: the number of participants in the news discussion, used in the popularity definition for sorting by popularity;
Source: the source site of the news, kept for later extension and for distinguishing sina, sohu, 163, etc.;
Time: the publication date of the news, used for sorting by time;
Title: the headline of the news;
URL: the link to the news page, used for display in the web front end.
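These fields map naturally onto a Scrapy Item. The following is a minimal sketch; the class name NewsItem is our own placeholder rather than a name confirmed by the project, while the field names follow the list above.

```python
# Minimal sketch of the extracted-data format as a Scrapy Item.
# The class name NewsItem is a placeholder; fields follow the list above.
from scrapy.item import Item, Field

class NewsItem(Item):
    Artical = Field()  # news body, for text analysis and index building
    ID = Field()       # unique identifier, prevents repeated crawling
    Keyword = Field()  # keywords, for related-search recommendation
    Show = Field()     # comment count, input to the popularity score
    Reply = Field()    # reply count, input to the popularity score
    Total = Field()    # participant count, input to the popularity score
    Source = Field()   # source site: sina, sohu, 163, ...
    Time = Field()     # publication date, for sorting by time
    Title = Field()    # headline
    URL = Field()      # link to the page, for display in the front end
```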
3.2 News Page Parsing
After determining the data format, the next step is to fetch a sports news page and parse out the fields defined above. We extract the page information with the HTML parsing class HtmlXPathSelector. A simple example:
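Below is a minimal spider sketch in the style of the legacy HtmlXPathSelector API named above (current Scrapy versions use scrapy.Spider and response.xpath instead). The start URL, the items module path, and the XPath expressions are illustrative placeholders, not the selectors actually used for the target sites.

```python
# Sketch of a parse step with the legacy HtmlXPathSelector API.
# XPath expressions below are illustrative, not the real site selectors.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from myproject.items import NewsItem  # hypothetical module path for the Item above

class SportsNewsSpider(BaseSpider):
    name = "sports_news"
    start_urls = ["http://sports.sina.com.cn/"]  # example entry point

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = NewsItem()  # the Item defined in section 3.1
        item["Title"] = hxs.select("//h1/text()").extract()
        item["Artical"] = hxs.select('//div[@id="artibody"]//p/text()').extract()
        item["Time"] = hxs.select('//span[@class="time"]/text()').extract()
        item["URL"] = response.url
        return item
```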
IV. Building the Inverted Index
The inverted index is a key component of information retrieval. This module has four key steps: extracting the news content from the JSON files, segmenting the content into words, counting term frequencies and document frequencies, and calculating TF-IDF weights.
4.1 Document Segmentation
Among the common Chinese word segmentation tools, the IKAnalyzer segmenter is widely used, works well, is easy to extend, and is well regarded, so we chose it for this project. IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through four major versions.
Examples of the segmentation effect:
The IKAnalyzer 2012 version supports both fine-grained segmentation and smart segmentation; below are examples of the two modes.
Original text:
IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java. Since the release of version 1.0 in December 2006, IKAnalyzer has gone through 3 major versions.
Smart segmentation result:
ikanalyzer | is | an | open source | based on | java | language | developed | lightweight | chinese | word segmentation | toolkit | since | december 2006 | the release of | version 1.0 | ikanalyzer | has | launched | 3 | major | versions
Fine-grained segmentation result:
ikanalyzer | is | a | one | a | open source | of | based on | java | language | developed | of | lightweight | chinese | word | segmentation | toolkit | tool | kit | since | 2006 | year | 12 | month | the release of | 1.0 | version | ikanalyzer | has | launched | 3 | a | major | versions
Installation and deployment: deployment is very simple: place IKAnalyzer2012.jar in the project's lib directory, and put IKAnalyzer.cfg.xml and stopword.dic in the class root directory (for a web project this is usually WEB-INF/classes, alongside configuration files such as those for Hibernate and log4j).
Main process: we traverse all the TXT documents of news bodies extracted in the previous step, segment each one, and save the segmentation results locally.
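The project itself segments with the Java-based IKAnalyzer; as a runnable stand-in in the same spirit, here is a Python sketch using jieba, which likewise offers a smart mode and a fine-grained (full) mode. The directory names are placeholders.

```python
# Python sketch of the segmentation pass; jieba stands in for the
# Java-based IKAnalyzer used in the project. Both offer a smart mode
# and a fine-grained mode.
import os
import jieba

def segment_corpus(src_dir, dst_dir):
    """Segment every TXT news body in src_dir, save token streams to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(src_dir, name), encoding="utf-8") as f:
            text = f.read()
        tokens = jieba.cut(text)                  # smart segmentation
        # tokens = jieba.cut(text, cut_all=True)  # fine-grained segmentation
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as f:
            f.write(" | ".join(tokens))
```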
4.2 Inverted Index Build Process
The project calls for 100,000 documents, which is not a very large amount of data, so we use a memory-based index construction method: the entire build is done in memory, with a single scan producing the statistics and the index in one pass.
Main process: we traverse the 100,000 segmented documents from the previous step. In the loop we record the positions and number of occurrences of each term within each document, as well as its totals across all 100,000 documents. In the implementation we use two HashMaps keyed by the term, with the document name, positions, and other attributes as values (separated by special delimiter characters); one map holds the per-document statistics and the other the statistics over all documents. A sketch of this single-pass build follows.
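The write-up above describes two Java HashMaps; the following Python sketch shows the same single-pass, in-memory build, using nested dictionaries rather than delimiter-joined strings for clarity.

```python
# Single-pass, in-memory inverted index build.
# postings: term -> document -> list of positions (per-document statistics)
# doc_freq: term -> number of documents containing it (collection statistics)
from collections import defaultdict

def build_index(docs):
    """docs maps document name -> list of tokens from the segmentation step."""
    postings = defaultdict(lambda: defaultdict(list))
    doc_freq = defaultdict(int)
    for doc_name, tokens in docs.items():
        seen = set()
        for pos, term in enumerate(tokens):
            postings[term][doc_name].append(pos)
            if term not in seen:
                seen.add(term)
                doc_freq[term] += 1
    return postings, doc_freq
```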
4.3 Calculating TF-IDF Weights
The most famous weighting scheme in information retrieval is TF-IDF. The weight of term $t$ in document $d$ is computed as

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}$$

where $\mathrm{df}_t$ is the number of documents in which term $t$ appears, so that $\log\frac{N}{\mathrm{df}_t}$ (with $N$ the total number of documents) is the inverse document frequency, and $\mathrm{tf}_{t,d}$ is the number of times $t$ appears in $d$, a document-specific quantity called the term frequency. The weight increases with the term frequency and with the rarity of the term in the collection.
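On top of the index sketched in 4.2, the weights follow directly from the formula. This sketch assumes raw term counts for tf and a log-scaled idf, since the post does not specify a tf variant.

```python
# TF-IDF weights from the postings and document frequencies of section 4.2,
# assuming raw term counts for tf and a log-scaled idf.
import math

def tf_idf(postings, doc_freq, n_docs):
    """Return term -> document -> TF-IDF weight."""
    weights = {}
    for term, docs in postings.items():
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = {doc: len(positions) * idf
                         for doc, positions in docs.items()}
    return weights
```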
V. Scoring and Ranking
We use the vector space model to compute the similarity between a document and a query: both the query and the document are represented as vectors in the term space, and the similarity between the two vectors is measured with cosine similarity.
Document vectorization: each document is represented by its vector of TF-IDF weights.
The similarity formula:

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}\cdot\vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$
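A minimal sketch of this computation, with the query and document each given as a dict mapping term to TF-IDF weight:

```python
# Cosine similarity between a query vector and a document vector,
# each represented as a dict of term -> TF-IDF weight.
import math

def cosine_similarity(q_vec, d_vec):
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)
```

Documents are then ranked by sorting on this score in descending order.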
Sorting method: results are ranked by their similarity scores; simple re-sorting by time or by popularity is also supported.
VI. Front-End Interface
The interface is simple and concise, consisting of a user input page and a results page.
The result:
At present, the snippet (abstract) generation is still rather crude.
All the code will be open-sourced on GitHub; it has not been uploaded yet and will be posted soon.