Lucene-based Case Development: Capturing the Zongheng Novel Introduction Page


When reprinting, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now online at www.llwjy.com ~ your feedback is welcome ~
-------------------------------------------------------------------------------------------------

In the previous post we built a simple crawler for the Zongheng novel update-list page and obtained the URLs of the novel introduction pages, so in this post we cover collecting the information on the Zongheng novel introduction page. Case address: http://book.zongheng.com/book/362857.html


Page analysis

Before starting, it is recommended that you open the introduction page yourself and locate the regions that hold the information we want to collect.


On this page we need to obtain the book title, author name, category, word count, introduction, latest chapter name, chapter list page URL, and tags. Right-click the page and choose "View page source", and we find the following:


To improve SEO, the site puts the novel's key information into meta tags in the page head, which greatly reduces the complexity of our regular expressions. Since all of these fields look very similar, we use the book title as a simple example; the regular expressions for the rest can be found in the source code below. The title sits on line 33 shown above, and we need to extract the book name from the middle of that tag, so we use the regular expression "<meta name="og:novel:book_name" content="(.*?)" />"; the other fields follow the same pattern. From this part of the source we can easily get the title, author name, latest chapter, introduction, category, and chapter list page URL. For the remaining two fields, tags and word count, we need to keep analyzing the page source. With a simple search we find the section of the source that contains both of the attributes we still need.
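As a quick standalone illustration of the meta-tag extraction above (not part of the crawler project itself), the lookup boils down to a Pattern/Matcher call; the class name, helper method, and sample HTML line below are made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaExtractDemo {
    // Return group 1 of the first match of regex in html, or "" when there is no match
    static String extract(String html, String regex) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Illustrative line modeled on the head of the zongheng.com page source
        String html = "<meta name=\"og:novel:book_name\" content=\"SomeTitle\" />";
        // The regex from the text: capture the content attribute of the meta tag
        String title = extract(html, "<meta name=\"og:novel:book_name\" content=\"(.*?)\" />");
        System.out.println(title); // prints SomeTitle
    }
}
```

The reluctant quantifier `.*?` matters here: a greedy `.*` would run past the closing quote when several attributes sit on one line.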


For the word-count attribute, we can get it with the simple regular expression "<span itemprop="wordCount">(\d*?)</span>". For the tag attribute, we need two steps to get what we want.

Step one: get the HTML fragment that contains the keywords, i.e. line 234 of the page source; the regular expression for this step is "<div class="keyword">(.*?)</div>";

Step two: further extract from the HTML fragment obtained in step one to get the content we want; the regular expression for this step is "<a.*?>(.*?)</a>".
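The two-step extraction above can be sketched as follows; the class name, helper method, and sample HTML fragment are invented for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordExtractDemo {
    static List<String> extractKeywords(String html) {
        List<String> keywords = new ArrayList<String>();
        // Step one: isolate the inner HTML of the keyword <div>
        Matcher outer = Pattern.compile("<div class=\"keyword\">(.*?)</div>").matcher(html);
        if (outer.find()) {
            // Step two: pull the text of each <a> tag out of that fragment
            Matcher tags = Pattern.compile("<a.*?>(.*?)</a>").matcher(outer.group(1));
            while (tags.find()) {
                keywords.add(tags.group(1));
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Illustrative fragment shaped like the keyword block described above
        String html = "<div class=\"keyword\"><a href=\"#\">xianxia</a><a href=\"#\">adventure</a></div>";
        System.out.println(extractKeywords(html)); // [xianxia, adventure]
    }
}
```

Narrowing to the keyword div first keeps the second regex from matching every other link on the page.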


Code implementation

For collecting pages other than the update list, we likewise inherit from the CrawlBase class; for how to disguise the request, refer to the previous post. Here we focus on two methods of the DoRegex class.

Method One:

String getFirstString(String dealStr, String regexStr, int n)
The first parameter is the string to process (here, the page source code), the second is the regular expression for the content to find, and the third is the position (group index) of the content to extract within the regular expression. The method finds the first match of the regular expression in the given string and returns the specified extracted group.

Method Two:

String getString(String dealStr, String regexStr, String splitStr, int n)
The 1st, 2nd, and 4th parameters here correspond to the 1st, 2nd, and 3rd parameters of method one; splitStr is a delimiter. The method finds all matches of the regular expression in the given string and joins the extracted content with the specified delimiter.
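The DoRegex class itself is not listed in this post, so the following is only a hypothetical sketch of how these two methods could be implemented with java.util.regex; the class name, the DOTALL flag, and the empty-string fallback are all assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two DoRegex methods described above;
// the real implementation may differ.
public class DoRegexSketch {

    // Return group n of the first match of regexStr in dealStr, or "" if none.
    public static String getFirstString(String dealStr, String regexStr, int n) {
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        return m.find() ? m.group(n) : "";
    }

    // Join group n of every match of regexStr in dealStr with splitStr.
    public static String getString(String dealStr, String regexStr, String splitStr, int n) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(splitStr);
            }
            sb.append(m.group(n));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\">one</a><a href=\"#\">two</a>";
        System.out.println(getFirstString(html, "<a.*?>(.*?)</a>", 1)); // one
        System.out.println(getString(html, "<a.*?>(.*?)</a>", ",", 1)); // one,two
    }
}
```

DOTALL is assumed so that `.` also matches newlines, since HTML attributes and tag bodies can be split across lines in the page source.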


Run results


Source code

With the two methods above introduced, I believe the source code below will be easy to follow.

package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

/**
 * @Description: Introduction page
 */
public class IntroPage extends CrawlBase {
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\" />";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\" />";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\" />";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\" />";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\" />";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\" />";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add related header information to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Get the book title
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Book introduction
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Tags
     */
    private String keyWords() {
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, " ", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}

----------------------------------------------------------------------------------------------------
PS: I have recently found that other sites may reproduce these posts without a source link. For more Lucene-based case development articles, visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html

