Tinyspider is now open source!


Tinyspider is a network data-crawling framework based on Tiny HtmlParser. Maven coordinates:
<dependency>
    <groupId>org.tinygroup</groupId>
    <artifactId>tinyspider</artifactId>
    <version>0.0.12</version>
</dependency>



Web crawlers are generally used for full-text search or content extraction.


The Tiny framework has basic support for this as well. While the functionality is not extensive, it makes building a full-text search index or extracting data from a web page very convenient.
Framework Features

    • Powerful node-filtering capabilities
    • Supports submitting data via both POST and GET
    • Avoids reprocessing pages that have already been handled
    • Supports crawling content from multiple sites
    • Strong HTML fault tolerance

Framework Design

Web crawler

public interface Spinder {
    /**
     * Adds a site visitor.
     *
     * @param siteVisitor
     */
    void addSiteVisitor(SiteVisitor siteVisitor);


    /**
     * Adds a watcher.
     *
     * @param watcher
     */
    void addWatcher(Watcher watcher);


    /**
     * Processes a URL.
     *
     * @param url
     */
    void processUrl(String url);


    /**
     * Processes a URL with parameters.
     *
     * @param url
     * @param parameter
     */
    void processUrl(String url, Map<String, Object> parameter);


    /**
     * Sets the URL repository.
     *
     * @param urlRepository
     */
    void setUrlRepository(UrlRepository urlRepository);
}



A crawler needs at least one site visitor, which is used to access URLs. If no site visitor matches a URL, that URL is ignored and not processed further.
A crawler needs at least one watcher, which filters the content fetched from a URL and processes the nodes it matches. Without a watcher, the content crawled back would be of no value.
A crawler needs at least one URL repository, which is used to judge whether a URL has already been crawled and processed. Without a URL repository, there is no way to tell whether a URL has been handled, which in many cases leads to an infinite loop that never exits.
Of course, a crawler must also be able to process URLs.
Site visitor: since a crawler can have multiple site visitors, each needs an isMatch method to tell the crawler whether it should handle a given URL.
Access mode: you can choose whether data is fetched via GET or POST.
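The article does not show the SiteVisitor interface itself, so here is a plausible sketch based on the description above. Only isMatch is named in the text; the visit method and the PrefixSiteVisitor class are my own illustrative assumptions, not the framework's real API.

```java
import java.util.Map;

// Hypothetical sketch of a site visitor, based on the description above.
// isMatch is the only method the article names; everything else is assumed.
interface SiteVisitor {
    /** Tells the crawler whether this visitor should handle the URL. */
    boolean isMatch(String url);

    /** Fetches the page content via GET or POST (assumed method). */
    String visit(String url, Map<String, Object> parameter);
}

// A minimal visitor that claims every URL under a given site prefix.
class PrefixSiteVisitor implements SiteVisitor {
    private final String prefix;

    PrefixSiteVisitor(String prefix) {
        this.prefix = prefix;
    }

    public boolean isMatch(String url) {
        return url != null && url.startsWith(prefix);
    }

    public String visit(String url, Map<String, Object> parameter) {
        // A real visitor would issue an HTTP GET or POST here;
        // network access is omitted in this sketch.
        throw new UnsupportedOperationException("network access omitted");
    }
}
```

A crawler would call isMatch on each registered visitor in turn and hand the URL to the first one that returns true.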
URL repository

public interface UrlRepository {
    /**
     * Returns whether the URL already exists in the repository.
     *
     * @param url
     * @return
     */
    boolean isExist(String url);


    /**
     * Returns whether the URL with the given parameters already exists in the repository.
     *
     * @param url
     * @param parameter
     * @return
     */
    boolean isExist(String url, Map<String, Object> parameter);


    /**
     * Puts the content if the URL is not present; replaces it if it already exists.
     *
     * @param url
     * @param content
     */
    void putUrlWithContent(String url, String content);


    /**
     * Puts the content if the URL is not present; replaces it if it already exists.
     *
     * @param url
     * @param parameter
     * @param content
     */
    void putUrlWithContent(String url, Map<String, Object> parameter,
            String content);


    /**
     * Returns the content if the URL exists, or throws a runtime exception if it does not.
     *
     * @param url
     * @return
     */
    String getContent(String url);


    /**
     * Returns the content if the URL exists, or throws a runtime exception if it does not.
     *
     * @param url
     * @param parameter
     * @return
     */
    String getContent(String url, Map<String, Object> parameter);
}



The URL repository manages URLs and their content. The methods are simple and straightforward, so no further introduction is needed.
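To make the contract concrete, here is a minimal in-memory sketch of such a repository. This is my own illustration, not the framework's implementation; it is shown without the implements clause so it compiles standalone, and it flattens the URL plus its parameters into a single map key.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory URL repository sketch (illustrative only).
// A real implementation might persist to disk or a database.
public class MemoryUrlRepository {
    private final Map<String, String> store = new HashMap<String, String>();

    // Reduce a URL plus optional parameters to one lookup key.
    private String key(String url, Map<String, Object> parameter) {
        return parameter == null ? url : url + "?" + parameter.toString();
    }

    public boolean isExist(String url) {
        return store.containsKey(key(url, null));
    }

    public boolean isExist(String url, Map<String, Object> parameter) {
        return store.containsKey(key(url, parameter));
    }

    public void putUrlWithContent(String url, String content) {
        store.put(key(url, null), content);
    }

    public void putUrlWithContent(String url, Map<String, Object> parameter,
            String content) {
        store.put(key(url, parameter), content);
    }

    public String getContent(String url) {
        String content = store.get(key(url, null));
        if (content == null) {
            throw new RuntimeException("URL not found: " + url);
        }
        return content;
    }

    public String getContent(String url, Map<String, Object> parameter) {
        String content = store.get(key(url, parameter));
        if (content == null) {
            throw new RuntimeException("URL not found: " + url);
        }
        return content;
    }
}
```

This is exactly the piece that prevents the infinite loops mentioned earlier: before fetching, the crawler checks isExist and skips URLs it has already stored.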
Watcher

public interface Watcher {
    /**
     * Sets the node filter.
     *
     * @param filter
     */
    void setNodeFilter(NodeFilter<HtmlNode> filter);

    /**
     * Gets the node filter.
     *
     * @return
     */
    NodeFilter<HtmlNode> getNodeFilter();

    /**
     * Adds a processor.
     *
     * @param processor
     */
    void addProcessor(Processor processor);


    /**
     * Gets the processor list.
     *
     * @return
     */
    List<Processor> getProcessorList();
}



A watcher must have exactly one node filter, but it can have multiple processors.
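That one-filter, many-processors design can be sketched as a simple dispatch loop: the filter selects nodes, and every registered processor is applied to each selected node. All types below are simplified stand-ins of my own (strings instead of HtmlNode, a toy filter interface), not the framework's real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the watcher dispatch loop: one node filter,
// many processors. Types are simplified stand-ins, not framework classes.
public class WatcherSketch {
    interface Processor {
        void process(String node); // stand-in for process(HtmlNode node)
    }

    interface NodeFilter {
        List<String> findNodes(List<String> document); // stand-in filter
    }

    private NodeFilter nodeFilter;
    private final List<Processor> processors = new ArrayList<Processor>();

    public void setNodeFilter(NodeFilter filter) {
        this.nodeFilter = filter;
    }

    public void addProcessor(Processor processor) {
        processors.add(processor);
    }

    // Called once a page is fetched: filter the nodes, then fan out
    // every matched node to every registered processor.
    public void watch(List<String> document) {
        for (String node : nodeFilter.findNodes(document)) {
            for (Processor p : processors) {
                p.process(node);
            }
        }
    }
}
```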
Processor

public interface Processor {
    /**
     * Processes a matched node.
     *
     * @param node
     */
    void process(HtmlNode node);
}



The processor is simple: it handles a matched node.
As an example, visit http://www.oschina.net/question?catalog=1 and you will see a long list of technical questions.
Let's write a program that extracts these titles:
Writing the crawler

public static void main(String[] args) {
    Spinder spinder = new SpinderImpl();
    Watcher watcher = new WatcherImpl();
    watcher.addProcessor(new PrintOsChinaProcessor());
    QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
    nodeFilter.setNodeName("div");
    nodeFilter.setIncludeAttribute("class", "qbody");
    watcher.setNodeFilter(nodeFilter);
    spinder.addWatcher(watcher);
    spinder.processUrl("http://www.oschina.net/question?catalog=1");
}



Writing the processor

public class PrintOsChinaProcessor implements Processor {
    public void process(HtmlNode node) {
        FastNameFilter<HtmlNode> filter = new FastNameFilter<HtmlNode>(node);
        filter.setNodeName("h2");
        filter.setIncludeNode("a");
        HtmlNode h2 = filter.findNode();
        if (h2 != null) {
            System.out.println(h2.getSubNode("a").getContent());
        }
    }
}



The output of your run may differ from the results below, because the site's data is constantly changing.

Joseph Ring Question, a piece of code to ask for explanation
To recommend a share, reply to the front-end open source JS
MySQL What situation use MyISAM, when use InnoDB?
Phpstorm use Sogou input Chinese appear disorderly problem how to solve?
How to achieve the effect of entertainment vane in fast seeding in Android
Use Java to do mobile background development!
The alert dialog box for Chrome 29 is beautiful, with wood and wood.
Eclipse+adt+android Environment Configuration Issues
The doubts about Android Holderview
Egg ache from one company to another company is a person developed have wood has
Wsunit official visit is not
Android Ask the Big God to show me what's wrong
Questions about Hibernate search query results that do not match the database
Find a good book or PDF for Oracle
About the implementation of the Wrap in Notepad
Swing Online HTML text editor
Network blocking issues under Android
How to do the file on-line system (code on line)
Ztree node is set to check multi box how to get only leaf nodes, no other nodes
How to set the uploaded image does not automatically compress
JS Regular expression problem
Eclipse often loading descriptor for XXX and then snaps
About the Android development XML display problem
RMI remote objects are shared, right?
Participate in open source projects how to write documents
How does PHP list all the files on the server as a file icon?
A simple question in PHP? Please help out, rookie.
Consult SOLR query word breaker, the result is empty problem
Is there a problem with this code, and how do I run an error?
Switch the splash screen issue in the jquery mobile page
You help me, I'll tell you a joke. Tut
asp: How does JS get the value in a cookie?
Android phone interception and processing
IIS7 How to display the error PHP?
When installing VirtualBox, you are prompted to install the Universal Serial Bus controller, do you want to install it?
API to get Sina Weibo news
The factory should not have default behavior
How to deal with unused code left over from the development process
ireport You cannot use Sybase driver com.sybase.jdbc3.jdbc.SybDriver when designing a report template?
Some questions about the use of Druid.



Summary: as the example shows, getting data from a web page is really easy. With only about 20 lines of code, we captured the data we wanted; to grab more, you simply refine the analysis layer by layer.

