Lucene-based case development: Collecting the Zongheng novel overview page (lucene case)


Please indicate the source when reprinting: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now live at www.llwjy.com ~ thank you for visiting ~
-------------------------------------------------------------------------------------------------

In the previous blog post we collected the Zongheng Chinese Novels update list page and obtained the URLs of the novel overview pages, so in this post we cover collecting the profile information from a novel overview page. Example address: http://book.zongheng.com/book/362857.html


Page Analysis

Before starting, we recommend taking a look at the overview page itself, especially the area containing the information we want to collect.


From this page we need to obtain the title, author name, category, word count, introduction, latest chapter name, chapter list page URL, and tags. Right-click the page and view its source code, and you will notice the following:


In order to do SEO for 360 Search, the site puts some key information about the novel into the head of the page, which greatly reduces the complexity of our regular expressions. Because these regular expressions are all similar, we only walk through the book name here; the rest can be read from the source code below. The title sits on line 33 of the page source, and we need to extract the value in the middle of that tag (here, the book "Fei Xian Jue"), so the regular expression for this field is <meta name="og:novel:book_name" content="(.*?)"/>. The other head fields follow the same pattern, so from this part of the source we can easily obtain the title, author name, latest chapter, introduction, category, and chapter list page URL. For the tag and word count fields we need to look further down the page source; a simple search locates the markup that contains both attributes.
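
Before moving on to those two fields, here is a quick standalone illustration of the head meta-tag extraction just described. The class name and the one-line sample HTML are invented for the demo; only the regular expression is the book_name pattern from above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaTagDemo {
    public static void main(String[] args) {
        // Invented one-line stand-in for the real page head
        String pageSource = "<meta name=\"og:novel:book_name\" content=\"Fei Xian Jue\"/>";
        Matcher matcher = Pattern
                .compile("<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>")
                .matcher(pageSource);
        if (matcher.find()) {
            // Capture group 1 is the text matched by (.*?), i.e. the book title
            System.out.println(matcher.group(1)); // Fei Xian Jue
        }
    }
}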


For the word count attribute we can use the simple regular expression <span itemprop="wordCount">(\d*?)</span>, but for the tag attribute we need two steps to get the desired content.

Step 1: Get the HTML fragment of the keyword block (line 234 of the page source); the regular expression for this step is <div class="keyword">(.*?)</div>;

Step 2: Extract the tag text from the HTML fragment obtained in step 1; the regular expression for this step is <a.*?>(.*?)</a>. A standalone sketch of this extraction follows below.
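
The following minimal sketch shows the word count and two-step tag extraction in action. The class name and the sample HTML fragment are invented for the demo; only the regular expressions come from the analysis above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordDemo {
    public static void main(String[] args) {
        // Invented sample fragment shaped like the Zongheng page markup
        String pageSource = "<span itemprop=\"wordCount\">104000</span>"
                + "<div class=\"keyword\"><a href=\"#\">xianxia</a><a href=\"#\">adventure</a></div>";

        // Word count: a single capture is enough
        Matcher wordCount = Pattern
                .compile("<span itemprop=\"wordCount\">(\\d*?)</span>")
                .matcher(pageSource);
        if (wordCount.find()) {
            System.out.println(wordCount.group(1)); // 104000
        }

        // Step 1: capture the inner HTML of the keyword div
        Matcher outer = Pattern
                .compile("<div class=\"keyword\">(.*?)</div>")
                .matcher(pageSource);
        if (outer.find()) {
            String keyHtml = outer.group(1);
            // Step 2: pull the text out of every <a> element
            Matcher inner = Pattern.compile("<a.*?>(.*?)</a>").matcher(keyHtml);
            while (inner.find()) {
                System.out.println(inner.group(1)); // xianxia, then adventure
            }
        }
    }
}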


Code Implementation

As with the update list page, this collector inherits the crawler base class; for how to disguise the request, refer to the previous blog post. Here we focus on two methods in the DoRegex class.

Method 1:

String getFirstString(String dealStr, String regexStr, int n)
The first parameter is the string to process, i.e. the source code of the web page; the second is the regular expression for the content to find; the third is the index of the capture group to extract. The method finds the first match in the given string and returns the specified capture group.
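
The DoRegex implementation itself is not reproduced in this post. A minimal sketch consistent with the behavior described above might look like the following; the DOTALL flag and the null handling are my assumptions, not the author's confirmed code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoRegexSketch {
    // Hypothetical reconstruction of DoRegex.getFirstString
    public static String getFirstString(String dealStr, String regexStr, int n) {
        if (dealStr == null || regexStr == null) {
            return null;
        }
        // DOTALL lets (.*?) cross line breaks in the page source (assumption)
        Matcher matcher = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        if (matcher.find()) {
            // Return capture group n of the first match only
            return matcher.group(n);
        }
        return null;
    }
}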

Method 2:

String getString(String dealStr, String regexStr, String splitStr, int n)
Parameters 1, 2, and 4 correspond to parameters 1, 2, and 3 of method 1; splitStr is a separator. The method finds every match of the regular expression in the given string and joins the extracted capture groups with the specified separator.
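
Under the same assumptions, getString could be sketched as the following companion method (it would sit in the same utility class as the getFirstString sketch above):

    // Hypothetical reconstruction of DoRegex.getString: collect capture group n
    // of every match and join the pieces with splitStr
    public static String getString(String dealStr, String regexStr, String splitStr, int n) {
        if (dealStr == null || regexStr == null) {
            return null;
        }
        StringBuilder result = new StringBuilder();
        Matcher matcher = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        while (matcher.find()) {
            if (result.length() > 0) {
                result.append(splitStr);
            }
            result.append(matcher.group(n));
        }
        return result.toString();
    }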


Running Result

Running the main method in the source code below prints the page URL followed by each collected field in turn: title, author, introduction, category, latest chapter, chapter list page URL, word count, and tags.

Source Code

With the two methods above explained, the source code below should be straightforward to follow.

/**
 * @Description: novel overview (introduction) page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class IntroPage extends CrawlBase {
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\"/>";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\"/>";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\"/>";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\"/>";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\"/>";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add relevant header information to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "UTF-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author: lulei
     * @Description: get the book title
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: book introduction
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: tags
     */
    private String keyWords() {
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, "", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}

----------------------------------------------------------------------------------------------------
Ps: I recently found that other websites may repost this blog without including the source link above. If you want to view the latest version, please follow the source links at the top of this post.
