1 Introduction
This assignment builds a Lucene-based search engine for the "Tiger Flutter Basketball" website. It analyzes and indexes pages from the site's three main sections: "The Latest News" (mainly NBA news), "Tiger Flutter Pedestrian Street" (a general forum, similar in nature to a Tieba bar), and "Tiger Flutter Wet" (SHH, the basketball posting area).
1.1 Design Purpose
A search engine is a very useful program: it makes finding and retrieving target information easier and faster. This program indexes the post titles on three sub-pages of the Tiger Flutter basketball site and, for each matching entry, returns the title, time, source, body content, and original URL.
(1) When the program starts, it prints the message "Please enter the module you want to query: 1 for Tiger Flutter news, 2 for the pedestrian street, 3 for SHH, 4 for comprehensive search". The input is read with a Java Scanner, and the four numbers stand for the four kinds of search a user may want.
(2) After the module number is entered, the message "Please enter the crawler depth (crawl pages)" is displayed. This value is the number of pages the user wants to search; the more pages, the more content is produced.
(3) The user is then prompted for the maximum number of displays, that is, the maximum number of entries to show.
(4) Finally, the user enters a query string and presses Enter; the program then finds up to that number of entries matching the query string.
(5) If Tiger Flutter news is selected, the program shows up to the maximum number of entries, each with the news source, the release time, the URL of the original page, and the full body content (formatted as on the original page).
(6) If SHH or the Tiger Flutter pedestrian street is selected, the program shows up to the maximum number of topics, each with the post date, source, original URL, and body text.
(7) If the user chooses 4 (comprehensive search), up to the maximum number of entries is shown for each section (news, SHH, pedestrian street), and each entry shows its corresponding structure and body content, which makes selection and viewing convenient.
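The input steps above can be sketched with java.util.Scanner. This is only an illustration: the class name SearchPrompt, the Options holder, and the wording of the last two prompts are assumptions, not taken from the original program.

```java
import java.util.Scanner;

class SearchPrompt {

    /** Holder for the four values typed in by the user (hypothetical helper type). */
    static class Options {
        int module;      // 1 = news, 2 = pedestrian street, 3 = SHH, 4 = comprehensive search
        int pages;       // crawler depth: how many pages to crawl
        int maxResults;  // maximum number of entries to display
        String query;    // search string
    }

    /** Prints a prompt and parses the next input line as an integer. */
    static int askInt(Scanner in, String prompt) {
        System.out.println(prompt);
        return Integer.parseInt(in.nextLine().trim());
    }

    /** Reads the four inputs in the order described in steps (1) to (4). */
    static Options readOptions(Scanner in) {
        Options o = new Options();
        o.module = askInt(in, "Please enter the module you want to query: "
                + "1 for Tiger Flutter news, 2 for the pedestrian street, "
                + "3 for SHH, 4 for comprehensive search");
        if (o.module < 1 || o.module > 4) {
            throw new IllegalArgumentException("module must be 1-4, got " + o.module);
        }
        o.pages = askInt(in, "Please enter the crawler depth (crawl pages):");
        // The wording of the next two prompts is assumed.
        o.maxResults = askInt(in, "Please enter the maximum number of displays:");
        System.out.println("Please enter the query string:");
        o.query = in.nextLine().trim();
        return o;
    }

    public static void main(String[] args) {
        // Canned input keeps the sketch runnable anywhere;
        // pass new Scanner(System.in) for interactive use.
        Options o = readOptions(new Scanner("4\n2\n10\nKobe\n"));
        System.out.println("module=" + o.module + ", pages=" + o.pages
                + ", max=" + o.maxResults + ", query=" + o.query);
    }
}
```

Reading every value with nextLine() and parsing it afterwards avoids the common problem of nextInt() leaving a line break in the buffer before the query string is read.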
1.2 Design Notes
The program is written in Java and was edited, compiled, and debugged in IntelliJ IDEA.
The JDK version is 1.7, and the external toolkits used are Jsoup and Lucene.
2 Overall Design
2.1 Function Module Design
The main functions to be implemented in this program are:
(1) The user can choose which page to search
(2) The user can customize the crawler depth, that is, the number of pages to crawl
(3) The user can select the maximum number of displays
(4) The user enters a search string, and results are displayed according to it
The overall function of the program is shown in Figure 1:
2.2 Flowchart Design
The overall process of the program is shown in Figure 2:
Figure 2 Overall design flowchart
3 Detailed Design
3.1 Design Overview
The search engine design consists of four parts:
The first part is the crawler, which crawls the pages to be searched and gets their HTML;
The second part filters and analyzes the HTML and finally transforms it into the required strings;
The third part uses the Lucene framework to index the processed strings and to search according to the search string;
The fourth part displays the required information as strings (body URL, body content, source, etc.).
3.2 Web Crawler and Content Extraction and Analysis
1. Initialization: Before crawling, Lucene needs to be initialized.
An in-RAM index is used here, which avoids wasting space; the remaining variables are set up as specified in the Lucene manual.
2. Get: Fetch the HTML content at the corresponding URL via the get method of Jsoup's connect:
org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
3. Filter:
Main function:
public static Elements[] getNews(Document doc)
Function purpose:
The getNews function takes the Document object obtained above as a parameter and returns the collections of elements under the corresponding filter tags.
Function core code:
Function description: We find the tags that hold the news title and URL by analyzing the HTML code of the three pages.
Here we take Tiger Flutter news as an example of how the tags are filtered (the other pages are similar and are not analyzed one by one):
From this we can see that all Tiger Flutter news titles are placed under the class="list-head" tag, and all other information is placed under the class="other-info" tag, so we create an array of Elements to store the HTML of these two tags and return it.
4. Analysis
Main function:
public static HashMap<String, String> analyze(Elements[] e)
Function purpose: take the Elements array obtained in the previous step and return a HashMap whose key is the URL (String) of the corresponding news content and whose value is information such as the corresponding news title and news source (String).
Implementation method: The first step is URL extraction, where a regular expression is used
String regex = "https://[\\w+\\.?/?]+\\.[A-Za-z]+";
to extract the URL from the corresponding tag. (Because the tags were already filtered in the previous step, each element is guaranteed to contain only one URL, so a regular expression is sufficient.)
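The URL extraction step can be sketched with java.util.regex. The pattern is a cleaned-up reading of the expression quoted above; the class name UrlExtractor, the method firstUrl, and the sample HTML fragment are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {

    // "https://", then word characters, '+', '.', '?' or '/',
    // ending in a dot followed by letters (for example ".html").
    static final Pattern URL = Pattern.compile("https://[\\w+\\.?/?]+\\.[A-Za-z]+");

    /** Returns the first https URL found in the fragment, or null when there is none. */
    static String firstUrl(String htmlFragment) {
        Matcher m = URL.matcher(htmlFragment);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up fragment in the shape of a filtered title tag.
        String fragment =
                "<a href=\"https://voice.example.com/nba/12345.html\" target=\"_blank\">Some title</a>";
        System.out.println(firstUrl(fragment));
    }
}
```

One caveat of this pattern: because it must end in a dot followed by letters, a link without such a suffix is cut short. For instance, https://example.com/path would be matched only as https://example.com.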
Afterwards, the eachText() method of Elements is used to get the text content. After some further tweaking, the news time and other additional information are appended to the title.
Core code:
Finally, the hash table news is returned.
3.3 Building and Searching the Lucene Index
1. Building an Index
(1) Main function:
public static void creatIndex(HashMap news, IndexWriter w)
(2) Function: process the contents of the HashMap obtained in the previous step and index its different parts.
(3) Function implementation: an iterator traverses the key-value pairs of the HashMap, takes out their contents, and calls the addDoc method to build the index.
(4) addDoc method: takes three parameters and adds them to the index as the three fields "url", "Newstitle", and "Newsother".
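The traversal in step (3) can be sketched in plain Java. This is a sketch under stated assumptions: the DocSink interface is a hypothetical stand-in for the addDoc method (the real one writes to a Lucene IndexWriter), and the tab separator between title and extra information inside each map value is assumed, since the original does not show how the value is split.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

class IndexFeed {

    /** Hypothetical stand-in for addDoc; the real method would add a Lucene document. */
    interface DocSink {
        void addDoc(String url, String newsTitle, String newsOther);
    }

    /** Assumed separator between title and extra info in each value (not from the original). */
    static final String SEP = "\t";

    /** Walks the map with an iterator and hands each entry to the sink as three fields. */
    static int feed(Map<String, String> news, DocSink sink) {
        int count = 0;
        Iterator<Map.Entry<String, String>> it = news.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            String[] parts = entry.getValue().split(SEP, 2);
            String title = parts[0];
            String other = parts.length > 1 ? parts[1] : "";
            sink.addDoc(entry.getKey(), title, other);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> news = new LinkedHashMap<String, String>();
        news.put("https://example.com/1.html", "Title A" + SEP + "Source X 2018-01-01");
        int added = feed(news, new DocSink() {
            public void addDoc(String url, String newsTitle, String newsOther) {
                System.out.println(url + " | " + newsTitle + " | " + newsOther);
            }
        });
        System.out.println("documents handed to the sink: " + added);
    }
}
```

Keeping the indexing call behind a small interface lets the traversal be checked without a Lucene index; in the real program the sink would be the addDoc method itself.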
2. Search
(1) Main function:
public static void search(String querystr, int num, Directory index)
(2) Function: search the index `index` for the search string querystr and obtain the "hits".
(3) Function implementation: adapted from the code in the Lucene user manual:
3.4 Display of Content
Here I use two methods to get the information needed to display the content correctly. The first is Boilerpipe, the method recommended by the teacher. I found that it can extract the body from some simple pages, but on some complex pages (especially pages with many comments) the extracted body is empty. To solve this problem, I wrote a fallback, which is enabled when the body returned by Boilerpipe is detected to be empty.
The idea of the fallback is simple: locate the tag that holds the content on the page, take its body, and then clean it up with String replace and similar methods.
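The fallback logic can be sketched as follows. This is a minimal illustration, not the original code: the BodyExtractor interface is a hypothetical abstraction (in the real program the primary extractor would wrap Boilerpipe), and the particular replacements in stripTags are assumptions about what "String replace and similar methods" might look like.

```java
class BodyFallback {

    /** Hypothetical abstraction over "something that extracts body text from HTML". */
    interface BodyExtractor {
        String extract(String html);
    }

    /** Returns the primary result, or the backup result when the primary is null or blank. */
    static String extractWithFallback(BodyExtractor primary, BodyExtractor backup, String html) {
        String body = primary.extract(html);
        if (body == null || body.trim().isEmpty()) {
            body = backup.extract(html);
        }
        return body;
    }

    /** Backup idea: turn <br> and </p> into line breaks, then drop the remaining tags. */
    static String stripTags(String fragment) {
        return fragment
                .replaceAll("(?i)<br\\s*/?>", "\n")
                .replaceAll("(?i)</p>", "\n")
                .replaceAll("<[^>]+>", "")
                .replace("&nbsp;", " ")
                .trim();
    }

    public static void main(String[] args) {
        // Stand-in for an extractor that fails on a complex page.
        BodyExtractor alwaysEmpty = new BodyExtractor() {
            public String extract(String html) { return ""; }
        };
        BodyExtractor tagStripper = new BodyExtractor() {
            public String extract(String html) { return stripTags(html); }
        };
        System.out.println(
                extractWithFallback(alwaysEmpty, tagStripper, "<p>Line one<br>Line two</p>"));
    }
}
```

Regex-based tag stripping is fragile on arbitrary HTML; it is used here only to keep the sketch free of external libraries.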
4 Test and Run
4.1 Program Testing
Once the program code was basically complete, it went through repeated debugging and modification until the crawler finally ran as expected. Debugging mainly covered the format of the displayed results (for example, removing the corresponding line-break tags), the ordering of the results, and the completeness of the displayed content.
4.2 Program Run
Search of the Tiger Flutter SHH board:
The SHH content shown is as follows:
Search of Tiger Flutter news:
Search results:
Search of the Tiger Flutter pedestrian street forum:
Search results:
Comprehensive search:
Search results:
5 Summary
I've learned a lot from this experiment:
1. I came to understand the Lucene search engine framework, so working with other search engines should come more easily;
2. Through reading the documentation and actual use, I mastered the basic usage of Jsoup;
3. I improved my Java programming ability.
Shortcomings of this experiment:
1. The code has redundancy; next time, more encapsulation and inheritance could make the code more readable;
2. Highlighting was not implemented;
3. Only 3 pages are analyzed; later, more pages could be analyzed (the principles are similar) to make the code more complete;
4. Because many searches are for players' names, and some names are irregular, Lucene's tokenizer may not segment them accurately, so search results sometimes do not match
(for example, when searching for "Kobe", an xxx entry may rank higher than the XXX entry).
(I enjoyed this experiment; writing it means my many years as a Tiger Flutter JRS were not in vain.)
(If you want the source code, send me a private message; I won't embarrass myself by posting it here.)
"Tiger flutter Basketball" web search engine based on Lucene Framework (Java edition)