Lucene-based Case Development: Capturing the Zongheng Novel Introduction Page


When reprinting, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now online at www.llwjy.com ~ your feedback is welcome ~
-------------------------------------------------------------------------------------------------

In the previous post we built a simple crawler for the Zongheng novel update-list page and obtained the URLs of the novel introduction pages, so in this post we cover collecting the information on the Zongheng novel introduction page. Case address: http://book.zongheng.com/book/362857.html


Page analysis

Before starting, it is recommended that you open the introduction page yourself and locate the regions that hold the information we want to collect.


On this page we need to obtain the book title, author name, category, word count, introduction, latest chapter name, chapter list page URL, and tags. Right-click the page and choose "View page source", and we find the following:


To improve SEO, the site puts the novel's key information into meta tags in the page head, which greatly reduces the complexity of our regular expressions. Since all of these fields look very similar, we use the book title as a simple example; the regular expressions for the rest can be found in the source code below. The title sits on line 33 shown above, and we need to extract the book name from the middle of that tag, so we use the regular expression "<meta name="og:novel:book_name" content="(.*?)" />"; the other fields follow the same pattern. From this part of the source we can easily get the title, author name, latest chapter, introduction, category, and chapter list page URL. For the remaining two fields, tags and word count, we need to keep analyzing the page source. With a simple search we find the section of the source that contains both of the attributes we still need.
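As a quick standalone illustration of the meta-tag extraction above (not part of the crawler project itself), the lookup boils down to a Pattern/Matcher call; the class name, helper method, and sample HTML line below are made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaExtractDemo {
    // Return group 1 of the first match of regex in html, or "" when there is no match
    static String extract(String html, String regex) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Illustrative line modeled on the head of the zongheng.com page source
        String html = "<meta name=\"og:novel:book_name\" content=\"SomeTitle\" />";
        // The regex from the text: capture the content attribute of the meta tag
        String title = extract(html, "<meta name=\"og:novel:book_name\" content=\"(.*?)\" />");
        System.out.println(title); // prints SomeTitle
    }
}
```

The reluctant quantifier `.*?` matters here: a greedy `.*` would run past the closing quote when several attributes sit on one line.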


For the word-count attribute, we can get it with the simple regular expression "<span itemprop="wordCount">(\d*?)</span>". For the tag attribute, we need two steps to get what we want.

Step one: get the HTML fragment that contains the keywords, i.e. line 234 of the page source; the regular expression for this step is "<div class="keyword">(.*?)</div>";

Step two: further extract from the HTML fragment obtained in step one to get the content we want; the regular expression for this step is "<a.*?>(.*?)</a>".
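The two-step extraction above can be sketched as follows; the class name, helper method, and sample HTML fragment are invented for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordExtractDemo {
    static List<String> extractKeywords(String html) {
        List<String> keywords = new ArrayList<String>();
        // Step one: isolate the inner HTML of the keyword <div>
        Matcher outer = Pattern.compile("<div class=\"keyword\">(.*?)</div>").matcher(html);
        if (outer.find()) {
            // Step two: pull the text of each <a> tag out of that fragment
            Matcher tags = Pattern.compile("<a.*?>(.*?)</a>").matcher(outer.group(1));
            while (tags.find()) {
                keywords.add(tags.group(1));
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Illustrative fragment shaped like the keyword block described above
        String html = "<div class=\"keyword\"><a href=\"#\">xianxia</a><a href=\"#\">adventure</a></div>";
        System.out.println(extractKeywords(html)); // [xianxia, adventure]
    }
}
```

Narrowing to the keyword div first keeps the second regex from matching every other link on the page.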


Code implementation

For collecting pages other than the update list, we likewise inherit from the CrawlBase class; for how to disguise the request, refer to the previous post. Here we focus on two methods of the DoRegex class.

Method One:

String getFirstString(String dealStr, String regexStr, int n)
The first parameter is the string to process (here, the page source code), the second is the regular expression for the content to find, and the third is the position (group index) of the content to extract within the regular expression. The method finds the first match of the regular expression in the given string and returns the specified extracted group.

Method Two:

String getString(String dealStr, String regexStr, String splitStr, int n)
The 1st, 2nd, and 4th parameters here correspond to the 1st, 2nd, and 3rd parameters of method one; splitStr is a delimiter. The method finds all matches of the regular expression in the given string and joins the extracted content with the specified delimiter.
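The DoRegex class itself is not listed in this post, so the following is only a hypothetical sketch of how these two methods could be implemented with java.util.regex; the class name, the DOTALL flag, and the empty-string fallback are all assumptions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the two DoRegex methods described above;
// the real implementation may differ.
public class DoRegexSketch {

    // Return group n of the first match of regexStr in dealStr, or "" if none.
    public static String getFirstString(String dealStr, String regexStr, int n) {
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        return m.find() ? m.group(n) : "";
    }

    // Join group n of every match of regexStr in dealStr with splitStr.
    public static String getString(String dealStr, String regexStr, String splitStr, int n) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(splitStr);
            }
            sb.append(m.group(n));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\">one</a><a href=\"#\">two</a>";
        System.out.println(getFirstString(html, "<a.*?>(.*?)</a>", 1)); // one
        System.out.println(getString(html, "<a.*?>(.*?)</a>", ",", 1)); // one,two
    }
}
```

DOTALL is assumed so that `.` also matches newlines, since HTML attributes and tag bodies can be split across lines in the page source.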


Run results


Source code

With the two methods above introduced, I believe the source code below will be easy to follow.

package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

/**
 * @Description: Introduction page
 */
public class IntroPage extends CrawlBase {
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\" />";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\" />";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\" />";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\" />";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\" />";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\" />";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add related header information to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Get the book title
     */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Get the author name
     */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Book introduction
     */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Book category
     */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Latest chapter
     */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Chapter list page URL
     */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Word count
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author: lulei
     * @Description: Tags
     */
    private String keyWords() {
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, " ", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}

----------------------------------------------------------------------------------------------------
PS: I have recently found that other sites may reproduce these posts without a source link. For more Lucene-based case development articles, visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html

