A Detailed Look at a Java Douban Movie Crawler: The Growth of a Small Crawler (with Source Code)


I have used crawlers before, for example using Nutch to crawl from designated seeds and building search on top of the crawled data, and I have also skimmed some of its source code. Nutch, of course, handles crawling very comprehensively and meticulously. Whenever I watched page information scroll past on the screen as it was crawled and processed, it felt like black magic. Since I happened to be reviewing Spring MVC, I decided to write a small crawler of my own. It did not matter if it was simple or had a few small bugs; all I needed was for it to crawl the information I wanted from a given seed site. There were plenty of exceptions to solve along the way: some came from API misuse, some from abnormal HTTP response statuses, and some from database reads and writes. Through this cycle of hitting exceptions and fixing them, Jewelcrawler (my son's nickname) grew to the point where it can crawl data on its own, and it also picked up a small extra skill: sentiment analysis based on the Word2vec algorithm.

There may still be unknown exceptions waiting to be resolved, and some performance aspects need optimizing, such as the interaction with the database and data reads and writes. But I will not have much energy to spend on this before the end of the year, so today I am writing a brief summary. The previous two articles focused mainly on features and results; this one describes how Jewelcrawler came to be. The code is on GitHub (source address at the end of the article) for anyone interested. (It is for learning and exchange only; please do not use it commercially, and be considerate of Douban. A little more sincerity, a little less harm.)

Environment

Development tool: IntelliJ IDEA 14

Database: MySQL 5.5, with Navicat as the database administration tool (used to connect to and query the database)

Language: Java

Dependency management: Maven

Version control: Git

Directory structure

Where:

com.ansj.vec is a Java implementation of the Word2vec algorithm.

com.jackie.crawler.doubanmovie is the crawler implementation module. Some of its packages are empty because those modules are not yet in use. Among the others:

    • constants package: constant classes
    • crawl package: the crawler entry program
    • entity package: entity classes that map to database tables
    • test package: test classes
    • utils package: utility classes

The resource module holds configuration files and resource files, such as:

    • beans.xml: Spring context configuration file
    • seed.properties: seed file
    • Stopwords.dic: stop word dictionary
    • Comment12031715.txt: crawled short-review data
    • TokenizerResult.txt: the result file after word segmentation with IKAnalyzer
    • VECTOR.MOD: model data trained with the Word2vec algorithm

The test module is for writing unit tests.

Database configuration

1. Add the dependencies

Jewelcrawler is managed with Maven, so you only need to add the appropriate dependencies to pom.xml.

<dependency>
  <groupId>org.springframework</groupId>
  <artifactId>spring-jdbc</artifactId>
  <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
  <groupId>commons-pool</groupId>
  <artifactId>commons-pool</artifactId>
  <version>1.6</version>
</dependency>
<dependency>
  <groupId>commons-dbcp</groupId>
  <artifactId>commons-dbcp</artifactId>
  <version>1.4</version>
</dependency>
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.38</version>
</dependency>

2. Declaring the data source bean

We need to declare the data source bean in beans.xml:

<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
  <property name="driverClassName" value="${jdbc.driver}"/>
  <property name="url" value="${jdbc.url}"/>
  <property name="username" value="${jdbc.username}"/>
  <property name="password" value="${jdbc.password}"/>
</bean>

Note: this binds an external configuration file, jdbc.properties, and the data source parameters are read from that file.
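For reference, here is a minimal sketch of how the four ${jdbc.*} placeholders could map onto keys in jdbc.properties. The article does not show the real file, so every value below (host, database name, user, password) is made up for illustration, and the class name JdbcPropertiesSketch is hypothetical. The sketch uses only java.util.Properties to show the key/value shape; it is not Spring's own placeholder resolution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: illustrates the shape of jdbc.properties, not Spring's resolver.
class JdbcPropertiesSketch {

    // Example content; every value here is an assumption, not the project's real config.
    static final String SAMPLE =
            "jdbc.driver=com.mysql.jdbc.Driver\n"
            + "jdbc.url=jdbc:mysql://localhost:3306/doubanmovie\n"
            + "jdbc.username=root\n"
            + "jdbc.password=secret\n";

    // Parses properties text the same way a .properties file would be parsed.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Replaces a single "${key}" placeholder with its value; fails loudly if the key is missing.
    static String resolve(String placeholder, Properties props) {
        if (!placeholder.startsWith("${") || !placeholder.endsWith("}")) {
            return placeholder; // not a placeholder, return unchanged
        }
        String key = placeholder.substring(2, placeholder.length() - 1);
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println(resolve("${jdbc.driver}", props));
        System.out.println(resolve("${jdbc.url}", props));
    }
}
```

A missing key raises an exception here on purpose, so a typo in a key name shows up immediately rather than as a confusing connection error later.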

If you run into an error such as SQL [insert into user (id) values (?)]; Field 'name' doesn't have a default value, the workaround is to set the corresponding column of the table to auto-increment.

Problems Parsing Pages

The crawled page data needs its DOM structure parsed to extract the data you want. I ran into the following errors along the way.

org.htmlparser.Node cannot be resolved

Workaround: add the jar dependency

<dependency>
  <groupId>org.htmlparser</groupId>
  <artifactId>htmlparser</artifactId>
  <version>1.6</version>
</dependency>

org.apache.http.HttpEntity cannot be resolved

Workaround: add the jar dependency

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.2</version>
</dependency>

These were problems encountered along the way; in the end I used Jsoup for page parsing.

Slow Downloads from the Maven Repository

I previously used the default Maven Central repository, and downloading jars was slow. I do not know whether this was my network or something else. Later I found the Aliyun Maven mirror online; after switching, updates that used to crawl finished almost in seconds. Highly recommended.

<mirrors>
  <mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
  </mirror>
</mirrors>

Find Maven's settings.xml file and add this mirror.

How to read files under the resource module

For example, reading the seed.properties file:

@Test
public void testFile() {
    // Resolve the classpath resource to a file on disk and print its size in bytes
    File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
    System.out.print("===========" + seedFile.length() + "===========");
}
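One caveat: getResource(...).getPath() works when the resource is a plain file on disk, but it can break once the project is packaged into a jar, or when the path contains characters that get URL-encoded (a space becomes %20). A hedged alternative is to read the resource as a stream. In the sketch below, the class name SeedLoader, the key seed.url and its value are all assumptions, since the article does not show the contents of seed.properties.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: loads a .properties resource without going through java.io.File.
class SeedLoader {

    // Loads key=value pairs from any stream and closes the stream afterwards.
    static Properties load(InputStream in) {
        Properties props = new Properties();
        try (InputStream stream = in) {
            props.load(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Looks the resource up on the classpath; this also works for resources inside a jar.
    static Properties loadFromClasspath(String name) {
        InputStream in = SeedLoader.class.getResourceAsStream(name);
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + name);
        }
        return load(in);
    }

    public static void main(String[] args) {
        // Stand-in for the real file so the sketch runs without the project's resources.
        byte[] sample = "seed.url=https://movie.douban.com/\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        Properties seeds = load(new ByteArrayInputStream(sample));
        System.out.println(seeds.getProperty("seed.url"));
    }
}
```

Inside the project, the call would look like SeedLoader.loadFromClasspath("/seed.properties").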

About Regular Expressions

When you use a regular expression and want to read what matched a defined pattern, you need to call the Matcher's find method before you can use the group method to get the substring. Calling group directly will not give you the result you want.
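The difference is easy to reproduce. The sketch below is only an illustration: the class name MatcherFindDemo, the pattern and the sample URL are assumptions (the URL shape is a guess at what a Douban movie link looks like), not code from Jewelcrawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: find() must be called before group().
class MatcherFindDemo {

    // Assumed pattern: captures the numeric id in a "/subject/<id>/" style URL.
    static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    // Correct usage: call find() first, then read the capturing group.
    static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Incorrect usage: group() before any match attempt throws IllegalStateException.
    static boolean groupWithoutFindThrows(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        try {
            m.group(1);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/12031715/";
        System.out.println(extractId(url));              // id captured after find()
        System.out.println(groupWithoutFindThrows(url)); // true: group() alone throws
    }
}
```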

I took a look at the source of the Matcher class:

package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are stateless,
     * so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * No default constructor.
     */
    Matcher() {
    }

    /**
     * All matchers have the state used by Pattern during a match.
     */
    Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
    }

    /**
     * Returns the input subsequence matched by the previous match.
     *
     * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
     * the expressions <i>m.</i><tt>group()</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
     * are equivalent. </p>
     *
     * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
     * string. This method will return the empty string when the pattern
     * successfully matches the empty string in the input. </p>
     *
     * @return The (possibly empty) subsequence matched by the previous match,
     *         in string form
     *
     * @throws IllegalStateException
     *         If no match has yet been attempted,
     *         or if the previous match operation failed
     */
    public String group() {
        return group(0);
    }

    /**
     * Returns the input subsequence captured by the given group during the
     * previous match operation.
     *
     * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
     * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
     * are equivalent. </p>
     *
     * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
     * to right, starting at one. Group zero denotes the entire pattern, so
     * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
     * </p>
     *
     * <p> If the match was successful but the group specified failed to match
     * any part of the input sequence, then <tt>null</tt> is returned. Note
     * that some groups, for example <tt>(a*)</tt>, match the empty string.
     * This method will return the empty string when such a group successfully
     * matches the empty string in the input. </p>
     *
     * @param group
     *        The index of a capturing group in this matcher's pattern
     *
     * @return The (possibly empty) subsequence captured by the group
     *         during the previous match, or <tt>null</tt> if the group
     *         failed to match part of the input
     *
     * @throws IllegalStateException
     *         If no match has yet been attempted,
     *         or if the previous match operation failed
     *
     * @throws IndexOutOfBoundsException
     *         If there is no capturing group in the pattern
     *         with the given index
     */
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }

    /**
     * Attempts to find the next subsequence of the input sequence that matches
     * the pattern.
     *
     * <p> This method starts at the beginning of this matcher's region, or, if
     * a previous invocation of the method was successful and the matcher has
     * not since been reset, at the first character not matched by the previous
     * match.
     *
     * <p> If the match succeeds then more information can be obtained via the
     * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods. </p>
     *
     * @return <tt>true</tt> if, and only if, a subsequence of the input
     *         sequence matches this matcher's pattern
     */
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

    /**
     * Initiates a search to find a Pattern within the given bounds.
     * The groups are filled with default values and the match of the root
     * of the state machine is called. The state machine will hold the state
     * of the match as it proceeds in this matcher.
     *
     * Matcher.from is not set here, because it is the "hard" boundary
     * of the start of the search which anchors will set to. The from param
     * is the "soft" boundary of the start of the search, meaning that the
     * regex tries to match at that index but ^ won't match there. Subsequent
     * calls to the search methods start at a new "soft" boundary which is
     * the end of the previous match.
     */
    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }

    ...

}

The reason is this: if you call group without first calling find, then group() calls group(int group), whose method body begins with a check for first < 0. That condition clearly holds here, because the initial value of first is -1, so the exception is thrown. But if you call find first, you can see that it eventually calls search(nextSearchIndex). Note that nextSearchIndex has been assigned the value of last, and last is initially 0. Execution then jumps to the search method:

boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}

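nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after a successful call to find, first is no longer -1 and the exception is not thrown. (If the match fails, search resets first to -1, so a following call to group still throws.)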
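The same bookkeeping explains why find can be called in a loop: each search starts at last, the end of the previous match, so repeated calls walk through every match in the input. A minimal sketch, with a hypothetical class name FindLoopDemo and a made-up sample string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: repeated find() calls resume where the previous match ended.
class FindLoopDemo {

    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits by calling find() until it returns false.
    static List<String> allNumbers(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // After a search that fails, first is -1 again, so group() throws once more.
    static boolean groupAfterFailedFindThrows(String text) {
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            // consume all matches; the final find() call is the one that fails
        }
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String sample = "79032 records, 48471 pages";
        System.out.println(allNumbers(sample));
        System.out.println(groupAfterFailedFindThrows(sample));
    }
}
```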

The source code has been uploaded to Baidu Netdisk: http://pan.baidu.com/s/1dFwtvNz

The problems mentioned above are fairly scattered; they are a summary of issues I ran into and how I solved them. You will run into other problems in practice, and questions or suggestions are welcome ^^.

Finally, here is some of the data crawled so far.

Record table

It stores 79,032 records, of which 48,471 are pages that have been crawled.

Movie table

2,964 film and television works have been crawled so far.

Comments table

29,711 records have been crawled.

