Lucene-based case development: Collecting the Zongheng novel overview page (lucene case)


Please indicate the source when reprinting: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now live at www.llwjy.com ~ thank you for visiting ~
-------------------------------------------------------------------------------------------------

In the previous blog post we collected the Zongheng Chinese Novels update list page and obtained the URLs of the novel overview pages, so in this post we cover collecting the profile information from a novel overview page. Example address: http://book.zongheng.com/book/362857.html


Page Analysis

Before starting, we recommend taking a look at the overview page itself, especially the area containing the information we want to collect.


From this page we need to obtain the title, author name, category, word count, introduction, latest chapter name, chapter list page URL, and tags. Right-click the page and view its source code, and you will notice the following:


In order to do SEO for 360 Search, the site puts some key information about the novel into the head of the page, which greatly reduces the complexity of our regular expressions. Because these regular expressions are all similar, we only walk through the book name here; the rest can be read from the source code below. The title sits on line 33 of the page source, and we need to extract the value in the middle of that tag (here, the book "Fei Xian Jue"), so the regular expression for this field is <meta name="og:novel:book_name" content="(.*?)"/>. The other head fields follow the same pattern, so from this part of the source we can easily obtain the title, author name, latest chapter, introduction, category, and chapter list page URL. For the tag and word count fields we need to look further down the page source; a simple search locates the markup that contains both attributes.
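
Before moving on to those two fields, here is a quick standalone illustration of the head meta-tag extraction just described. The class name and the one-line sample HTML are invented for the demo; only the regular expression is the book_name pattern from above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaTagDemo {
    public static void main(String[] args) {
        // Invented one-line stand-in for the real page head
        String pageSource = "<meta name=\"og:novel:book_name\" content=\"Fei Xian Jue\"/>";
        Matcher matcher = Pattern
                .compile("<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>")
                .matcher(pageSource);
        if (matcher.find()) {
            // Capture group 1 is the text matched by (.*?), i.e. the book title
            System.out.println(matcher.group(1)); // Fei Xian Jue
        }
    }
}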


For the word count attribute we can use the simple regular expression <span itemprop="wordCount">(\d*?)</span>, but for the tag attribute we need two steps to get the desired content.

Step 1: Get the HTML fragment of the keyword block (line 234 of the page source); the regular expression for this step is <div class="keyword">(.*?)</div>;

Step 2: Extract the tag text from the HTML fragment obtained in step 1; the regular expression for this step is <a.*?>(.*?)</a>. A standalone sketch of this extraction follows below.
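
The following minimal sketch shows the word count and two-step tag extraction in action. The class name and the sample HTML fragment are invented for the demo; only the regular expressions come from the analysis above.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordDemo {
    public static void main(String[] args) {
        // Invented sample fragment shaped like the Zongheng page markup
        String pageSource = "<span itemprop=\"wordCount\">104000</span>"
                + "<div class=\"keyword\"><a href=\"#\">xianxia</a><a href=\"#\">adventure</a></div>";

        // Word count: a single capture is enough
        Matcher wordCount = Pattern
                .compile("<span itemprop=\"wordCount\">(\\d*?)</span>")
                .matcher(pageSource);
        if (wordCount.find()) {
            System.out.println(wordCount.group(1)); // 104000
        }

        // Step 1: capture the inner HTML of the keyword div
        Matcher outer = Pattern
                .compile("<div class=\"keyword\">(.*?)</div>")
                .matcher(pageSource);
        if (outer.find()) {
            String keyHtml = outer.group(1);
            // Step 2: pull the text out of every <a> element
            Matcher inner = Pattern.compile("<a.*?>(.*?)</a>").matcher(keyHtml);
            while (inner.find()) {
                System.out.println(inner.group(1)); // xianxia, then adventure
            }
        }
    }
}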


Code Implementation

As with the update list page, this collector inherits the crawler base class; for how to disguise the request, refer to the previous blog post. Here we focus on two methods in the DoRegex class.

Method 1:

String getFirstString(String dealStr, String regexStr, int n)
The first parameter is the string to process, i.e. the source code of the web page; the second is the regular expression for the content to find; the third is the index of the capture group to extract. The method finds the first match in the given string and returns the specified capture group.
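
The DoRegex implementation itself is not reproduced in this post. A minimal sketch consistent with the behavior described above might look like the following; the DOTALL flag and the null handling are my assumptions, not the author's confirmed code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoRegexSketch {
    // Hypothetical reconstruction of DoRegex.getFirstString
    public static String getFirstString(String dealStr, String regexStr, int n) {
        if (dealStr == null || regexStr == null) {
            return null;
        }
        // DOTALL lets (.*?) cross line breaks in the page source (assumption)
        Matcher matcher = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        if (matcher.find()) {
            // Return capture group n of the first match only
            return matcher.group(n);
        }
        return null;
    }
}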

Method 2:

String getString(String dealStr, String regexStr, String splitStr, int n)
Parameters 1, 2, and 4 correspond to parameters 1, 2, and 3 of method 1; splitStr is a separator. The method finds every match of the regular expression in the given string and joins the extracted capture groups with the specified separator.
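
Under the same assumptions, getString could be sketched as the following companion method (it would sit in the same utility class as the getFirstString sketch above):

    // Hypothetical reconstruction of DoRegex.getString: collect capture group n
    // of every match and join the pieces with splitStr
    public static String getString(String dealStr, String regexStr, String splitStr, int n) {
        if (dealStr == null || regexStr == null) {
            return null;
        }
        StringBuilder result = new StringBuilder();
        Matcher matcher = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        while (matcher.find()) {
            if (result.length() > 0) {
                result.append(splitStr);
            }
            result.append(matcher.group(n));
        }
        return result.toString();
    }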


Running Result

Running the main method in the source code below prints the page URL followed by each collected field in turn: title, author, introduction, category, latest chapter, chapter list page URL, word count, and tags.

Source Code

With the two methods above explained, the source code below should be straightforward to follow.

/**
 * @Description: novel overview (introduction) page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class IntroPage extends CrawlBase {
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/>";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\"/>";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\"/>";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\"/>";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\"/>";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\"/>";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add relevant header information to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "UTF-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author: lulei
     * @Description: get the book title
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: book introduction
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: tags
     */
    private String keyWords() {
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, "", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}

----------------------------------------------------------------------------------------------------
Ps: I recently found that other websites may repost this blog without including the source link above. If you want to view the latest version, please follow the source links at the top of this post.
