I have used crawlers before: for example, I used Nutch to crawl from specified seeds and built search on top of the crawled data, and I also read through some of its source code. Nutch, of course, handles crawling very comprehensively and meticulously. Whenever I watched crawled pages and processing messages scroll by on the screen, it always felt like black magic. Since I was going over Spring MVC anyway, I took the opportunity to write a small crawler of my own. It did not matter if it was simple or had a few small bugs; all I needed was for it to crawl the information I wanted from a given seed site. There were exceptions to solve along the way: some from API misuse, some from abnormal HTTP request statuses, some from database read and write problems. Through this cycle of hitting exceptions and fixing them, Jewelcrawler (my son's nickname) has become able to crawl data on its own, and it has even picked up a small skill: sentiment analysis based on the word2vec algorithm.
There may still be unknown exceptions waiting to be resolved, and some performance aspects need optimizing, such as the interaction with the database and data reads and writes. But with the end of the year approaching I do not have much energy to spend on this, so today I am writing a brief summary. The previous two articles focused mainly on features and results; this one covers how Jewelcrawler came into being, and the code is available on GitHub (source address at the end of the article) for anyone interested (for exchange and study only; please do not use it for other purposes, and be considerate of Douban. A little more sincerity, a little less harm).
Environment
Development tool: IntelliJ IDEA 14
Database: MySQL 5.5, plus the database administration tool Navicat (used to connect to and query the database)
Language: Java
Dependency management: Maven
Version control: Git
Directory structure
In it:
com.ansj.vec is the Java implementation of the word2vec algorithm
com.jackie.crawler.doubanmovie is the crawler implementation module
Some of its packages are empty because those modules are not in use yet. Of the rest:
- The constants package holds constant classes
- The crawl package holds the crawler entry program
- The entity package holds the entity classes mapped to database tables
- The test package holds test classes
- The utils package holds utility classes
The resource module holds configuration files and resource files, such as
- beans.xml: the Spring context configuration file
- seed.properties: the seed file
- stopwords.dic: the stop word dictionary
- comment12031715.txt: the crawled short-review data
- tokenizerResult.txt: the result file after word segmentation with IKAnalyzer
- vector.mod: the model data trained with the word2vec algorithm
The test module is for writing unit tests.
Database configuration
1. Add the dependencies
Jewelcrawler is managed with Maven, so you only need to add the appropriate dependencies to pom.xml.
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>commons-pool</groupId>
    <artifactId>commons-pool</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
2. Declare the data source bean
We need to declare the data source bean in beans.xml
<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="${jdbc.driver}"/>
    <property name="url" value="${jdbc.url}"/>
    <property name="username" value="${jdbc.username}"/>
    <property name="password" value="${jdbc.password}"/>
</bean>
Note: this binds an external configuration file, jdbc.properties, and the actual data source parameters are read from that file.
If you run into an error like SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value, the workaround is to make the corresponding column of the table an auto-increment column.
Problems parsing pages
The crawled page data has to be parsed as a DOM structure to extract the data you want. Along the way I ran into the following errors.
org.htmlparser.Node cannot be resolved
Workaround: add the jar dependency
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>
org.apache.http.HttpEntity cannot be resolved
Workaround: add the jar dependency
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>
Of course, these are just problems I ran into along the way; in the end I used Jsoup for page parsing.
Slow downloads from the Maven repository
I previously used the default Maven Central repository, and downloading jars was slow; I do not know whether that was my network or something else. Later I found the Aliyun Maven mirror online and switched to it. Compared with before, downloads now finish almost instantly. Highly recommended.
<mirrors>
    <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>
Find Maven's settings.xml file and add this mirror.
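One way to read files under the resource module
For example, reading the seed.properties file:
// imports needed by the enclosing test class
import java.io.File;
import org.junit.Test;

@Test
public void testFile() {
    // resolve the file from the classpath root and print its size in bytes
    File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
    System.out.print("===========" + seedFile.length() + "===========");
}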
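The File-based lookup above works while the resource sits as a plain file on disk; a path obtained this way generally stops resolving once the project is packaged into a jar. As a hedged alternative, here is a minimal sketch that loads a properties file through an InputStream instead. The class name SeedLoader, its method names, and the key name seed are assumptions made for illustration; the actual contents of seed.properties are not shown in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class SeedLoader {

    /** Loads key=value pairs from any stream, e.g. a classpath resource. */
    public static Properties load(InputStream in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Loads a classpath resource; returns empty Properties if it does not exist. */
    public static Properties loadFromClasspath(String name) {
        try (InputStream in = SeedLoader.class.getResourceAsStream(name)) {
            return in == null ? new Properties() : load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In the project this would be loadFromClasspath("/seed.properties").
        // An in-memory stream stands in for the file here; the key "seed"
        // and the URL are made up for the example.
        byte[] sample = "seed=https://example.com/start\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        Properties props = load(new ByteArrayInputStream(sample));
        System.out.println(props.getProperty("seed"));
    }
}
```

Returning an empty Properties object for a missing resource is a design choice of this sketch; throwing an exception instead may be preferable when the seed file is mandatory.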
About regular expressions
When matching against a defined pattern with a regular expression, you need to call the Matcher's find method first; only then can you use the group method to retrieve the matched substring. Calling group directly will not give you the result you want.
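A minimal sketch of this behavior, using only java.util.regex. The class name FindBeforeGroup, the helper method names, and the sample input are made up for illustration and are not part of Jewelcrawler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindBeforeGroup {

    /** Returns capture group 1 of the first match, or null if nothing matches. */
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // find() has to succeed before group() may be called
        return m.find() ? m.group(1) : null;
    }

    /** Returns true if calling group() without find() throws IllegalStateException. */
    public static boolean groupWithoutFindFails(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "<span class=\"year\">(2016)</span>"; // made-up sample input
        String regex = "\\((\\d{4})\\)";                    // four digits in parentheses
        System.out.println(groupWithoutFindFails(regex, html)); // prints true
        System.out.println(firstGroup(regex, html));            // prints 2016
    }
}
```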
I looked at the source of the Matcher class:
package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are stateless,
     * so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * No default constructor.
     */
    Matcher() {
    }

    /**
     * All matchers have the state used by Pattern during a match.
     */
    Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
    }

    ...

    /**
     * Returns the input subsequence matched by the previous match.
     *
     * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
     * the expressions <i>m.</i><tt>group()</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt>&nbsp;<i>m.</i><tt>end())</tt>
     * are equivalent.  </p>
     *
     * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
     * string.  This method will return the empty string when the pattern
     * successfully matches the empty string in the input.  </p>
     *
     * @return The (possibly empty) subsequence matched by the previous match,
     *         in string form
     *
     * @throws  IllegalStateException
     *          If no match has yet been attempted,
     *          or if the previous match operation failed
     */
    public String group() {
        return group(0);
    }

    /**
     * Returns the input subsequence captured by the given group during the
     * previous match operation.
     *
     * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
     * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt>&nbsp;<i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
     * are equivalent.  </p>
     *
     * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
     * to right, starting at one.  Group zero denotes the entire pattern, so
     * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
     * </p>
     *
     * <p> If the match was successful but the group specified failed to match
     * any part of the input sequence, then <tt>null</tt> is returned. Note
     * that some groups, for example <tt>(a*)</tt>, match the empty string.
     * This method will return the empty string when such a group successfully
     * matches the empty string in the input.  </p>
     *
     * @param  group
     *         The index of a capturing group in this matcher's pattern
     *
     * @return  The (possibly empty) subsequence captured by the group
     *          during the previous match, or <tt>null</tt> if the group
     *          failed to match part of the input
     *
     * @throws  IllegalStateException
     *          If no match has yet been attempted,
     *          or if the previous match operation failed
     *
     * @throws  IndexOutOfBoundsException
     *          If there is no capturing group in the pattern
     *          with the given index
     */
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }

    /**
     * Attempts to find the next subsequence of the input sequence that matches
     * the pattern.
     *
     * <p> This method starts at the beginning of this matcher's region, or, if
     * a previous invocation of the method was successful and the matcher has
     * not since been reset, at the first character not matched by the previous
     * match.
     *
     * <p> If the match succeeds then more information can be obtained via the
     * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods.  </p>
     *
     * @return  <tt>true</tt> if, and only if, a subsequence of the input
     *          sequence matches this matcher's pattern
     */
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

    /**
     * Initiates a search to find a Pattern within the given bounds.
     * The groups are filled with default values and the match of the root
     * of the state machine is called. The state machine will hold the state
     * of the match as it proceeds in this matcher.
     *
     * Matcher.from is not set here, because it is the "hard" boundary
     * of the start of the search which anchors will set to. The from param
     * is the "soft" boundary of the start of the search, meaning that the
     * regex tries to match at that index but ^ won't match there. Subsequent
     * calls to the search methods start at a new "soft" boundary which is
     * the end of the previous match.
     */
    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from = from < 0 ? 0 : from;
        this.first = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }
    ...
}
The reason is this: if you call group without first calling find, you can see that group() calls group(int group), whose body begins with the check if (first < 0). That condition clearly holds here, because the initial value of first is -1, so the exception is thrown. But if you call find first, you can see that it eventually calls search(nextSearchIndex). Note that nextSearchIndex has been assigned the value of last, and last is initially 0. Execution then jumps into the search method:
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from = from < 0 ? 0 : from;
    this.first = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
nextSearchIndex is passed in as from, and from is assigned to first in the method body. So after a successful call to find, first is no longer -1 and the exception is not thrown. If the match fails, search resets first to -1, and group throws again.
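To make the role of first and last concrete, here is a toy model of that state handling. It is not the JDK code: the class ToyMatcher is hypothetical, and it searches for a literal substring with String.indexOf instead of running the regex state machine. Only the bookkeeping mirrors what the source above does.

```java
public class ToyMatcher {
    private final String needle;
    private final String text;
    private int first = -1; // start of the last match; -1 means no current match
    private int last = 0;   // end of the last match, where the next search starts

    public ToyMatcher(String needle, String text) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        this.needle = needle;
        this.text = text;
    }

    /** Looks for the next occurrence, starting at the end of the previous one. */
    public boolean find() {
        int index = text.indexOf(needle, last);
        if (index < 0) {
            first = -1; // a failed search clears the match, as in search()
            return false;
        }
        first = index;
        last = index + needle.length();
        return true;
    }

    /** Start index of the current match; same guard as group(). */
    public int start() {
        if (first < 0) {
            throw new IllegalStateException("No match found");
        }
        return first;
    }

    /** The matched text; throws unless the most recent find() succeeded. */
    public String group() {
        if (first < 0) {
            throw new IllegalStateException("No match found");
        }
        return text.substring(first, last);
    }

    public static void main(String[] args) {
        ToyMatcher m = new ToyMatcher("ab", "xxabyyab");
        while (m.find()) {
            System.out.println(m.group() + " at " + m.start());
        }
    }
}
```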
The source code has been uploaded to Baidu network disk: http://pan.baidu.com/s/1dFwtvNz
The problems covered above are fairly scattered; they are a summary of issues I ran into and how I solved them. You will run into other problems in practice, and questions or suggestions are welcome ^ ^.
Finally, here are some figures for the data crawled so far.
Record table
It holds 79,032 records, of which 48,471 are pages that have been crawled.
Movie table
2,964 film and TV works have been crawled so far.
Comments table
29,711 records have been crawled.