Java Multithreaded Crawler Example

Last Update:2018-07-25 Source: Internet

Author: User


I learned how a web crawler works long ago but never implemented one. Writing one today, I ran into a lot of difficulties, especially with multithreaded synchronization. The likely reason is that I am not familiar enough with multithreading and have had little hands-on practice with it.

Here are the results first:

Starting the crawler ...
...
Currently 1 thread(s) waiting
Currently 2 thread(s) waiting
Currently 3 thread(s) waiting
Currently 4 thread(s) waiting
Currently 5 thread(s) waiting
...
Crawled page http://dev.yesky.com successfully, depth 2, by thread thread-9
Currently 7 thread(s) waiting
Crawled page http://www.cnblogs.com/rexyoung/archive/2012/05/01/2477960.html successfully, depth 2, by thread thread-2
Currently 8 thread(s) waiting
Crawled page http://www.hjenglish.com successfully, depth 2, by thread thread-0
Currently 9 thread(s) waiting
Crawled page http://www.cnblogs.com/snandy/archive/2012/05/01/2476675.html successfully, depth 2, by thread thread-5
Currently 10 thread(s) waiting
Crawled 159 pages in total
Total time: 53 seconds

This crawl started from the cnblogs.com (Blog Park) home page and went only two levels deep. With 10 threads it took 53 seconds in total, which seems a reasonable speed. Here is the complete code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawler {
    ArrayList<String> allurlSet = new ArrayList<String>();      // all page URLs; consider a HashSet for more efficient deduplication
    ArrayList<String> notCrawlurlSet = new ArrayList<String>(); // page URLs not yet crawled
    HashMap<String, Integer> depth = new HashMap<String, Integer>(); // depth of every page URL
    int crawDepth = 2;    // crawl depth
    int threadCount = 10; // number of threads
    int count = 0;        // number of threads currently in the waiting state
    public static final Object signal = new Object(); // object used for inter-thread communication

    public static void main(String[] args) {
        final WebCrawler wc = new WebCrawler();
//      wc.addUrl("http://www.126.com", 1);
        wc.addUrl("http://www.cnblogs.com", 1);
        long start = System.currentTimeMillis();
        System.out.println("Starting the crawler ...");
        wc.begin();

        while (true) {
            // Finished when nothing is left and only main is alive, or when every worker is waiting
            if (wc.notCrawlurlSet.isEmpty() && Thread.activeCount() == 1 || wc.count == wc.threadCount) {
                long end = System.currentTimeMillis();
                System.out.println("Crawled " + wc.allurlSet.size() + " pages in total");
                System.out.println("Total time: " + (end - start) / 1000 + " seconds");
                System.exit(0);
                break;
            }
        }
    }

    private void begin() {
        for (int i = 0; i < threadCount; i++) {
            new Thread(new Runnable() {
                public void run() {
//                  System.out.println("Entered " + Thread.currentThread().getName());
//                  while (!notCrawlurlSet.isEmpty()) { ----------------------------------(1)
//                      String tmp = getAUrl();
//                      crawler(tmp);
//                  }
                    while (true) {
//                      System.out.println("Entered " + Thread.currentThread().getName());
                        String tmp = getAUrl();
                        if (tmp != null) {
                            crawler(tmp);
                        } else {
                            synchronized (signal) { // ------------------(2)
                                try {
                                    count++;
                                    System.out.println("Currently " + count + " thread(s) waiting");
                                    signal.wait();
                                } catch (InterruptedException e) {
                                    e.printStackTrace();
                                }
                            }
                        }
                    }
                }
            }, "thread-" + i).start();
        }
    }

    // Take one URL from the not-yet-crawled list, or return null if the list is empty
    public synchronized String getAUrl() {
        if (notCrawlurlSet.isEmpty())
            return null;
        String tmpAUrl;
//      synchronized (notCrawlurlSet) {
            tmpAUrl = notCrawlurlSet.get(0);
            notCrawlurlSet.remove(0);
//      }
        return tmpAUrl;
    }

//  public synchronized boolean isEmpty() {
//      boolean f = notCrawlurlSet.isEmpty();
//      return f;
//  }

    public synchronized void addUrl(String url, int d) {
        notCrawlurlSet.add(url);
        allurlSet.add(url);
        depth.put(url, d);
    }

    // Crawl the page at sUrl
    public void crawler(String sUrl) {
        URL url;
        try {
            url = new URL(sUrl);
//          HttpURLConnection urlconnection = (HttpURLConnection) url.openConnection();
            URLConnection urlconnection = url.openConnection();
            urlconnection.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
            // Read from the connection (not url.openStream()) so the User-Agent header is actually sent
            InputStream is = urlconnection.getInputStream();
            BufferedReader bReader = new BufferedReader(new InputStreamReader(is));
            StringBuffer sb = new StringBuffer(); // sb holds the crawled page content
            String rLine = null;
            while ((rLine = bReader.readLine()) != null) {
                sb.append(rLine);
                sb.append("\r\n");
            }
            bReader.close();

            int d = depth.get(sUrl);
            System.out.println("Crawled page " + sUrl + " successfully, depth " + d + ", by thread " + Thread.currentThread().getName());
            if (d < crawDepth) {
                // Parse the page content and extract links from it
                parseContext(sb.toString(), d + 1);
            }
//          System.out.println(sb.toString());
        } catch (IOException e) {
//          crawlurlSet.add(sUrl);
//          notCrawlurlSet.remove(sUrl);
            e.printStackTrace();
        }
    }

    // Extract URLs from the page content
    public void parseContext(String context, int dep) {
        String regex = "<a href.*?/a>";
//      String regex = "<title>.*?</title>";
//      String s = "fdfd<title>i</title><a href=\"http://www.iteye.com/blogs/tag/google\">Google</a>fdfd<>";
//      String regex = "http://.*?>";
        Pattern pt = Pattern.compile(regex);
        Matcher mt = pt.matcher(context);
        while (mt.find()) {
//          System.out.println(mt.group());
            Matcher myurl = Pattern.compile("href=\".*?\"").matcher(mt.group());
            while (myurl.find()) {
                String str = myurl.group().replaceAll("href=\"|\"", "");
//              System.out.println("URL is: " + str);
                if (str.contains("http:")) { // skip addresses that are not URLs
                    if (!allurlSet.contains(str)) {
                        addUrl(str, dep); // add a new URL
                        if (count > 0) { // if any thread is waiting, wake one up
                            synchronized (signal) { // ---------------------(2)
                                count--;
                                signal.notify();
                            }
                        }
                    }
                }
            }
        }
    }
}

I was stuck for a long time at the two places marked (1) and (2) above. Both actually come down to the same point of multithreading knowledge.

At first, the thread body was written like this:

	while (!notCrawlurlSet.isEmpty()) { // ----------------------------------(1)
		String tmp = getAUrl();
		crawler(tmp);
	}

As soon as a thread started, it checked whether notCrawlurlSet was empty. With multiple threads, though, notCrawlurlSet is not empty at the start, so every thread entered the loop. Although I had declared getAUrl() as synchronized, as soon as one thread left getAUrl() another thread could enter it. Here is the first version of getAUrl():

	public synchronized String getAUrl() {
		String tmpAUrl;
//		synchronized (notCrawlurlSet) {
			tmpAUrl = notCrawlurlSet.get(0);
			notCrawlurlSet.remove(0);
//		}
		return tmpAUrl;
	}

Each call removes one element from notCrawlurlSet. So when one thread's call to getAUrl() takes the last element and leaves notCrawlurlSet empty, another thread that has already passed the isEmpty() check in its loop enters getAUrl() and fails, because get(0) on an empty list throws an IndexOutOfBoundsException. I later changed getAUrl() to:

	public synchronized String getAUrl() {
		if (notCrawlurlSet.isEmpty())
			return null;
		String tmpAUrl;
//		synchronized (notCrawlurlSet) {
			tmpAUrl = notCrawlurlSet.get(0);
			notCrawlurlSet.remove(0);
//		}
		return tmpAUrl;
	}
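
The fix works because the emptiness check and the removal now happen under the same lock. The same "check and take in one step" idea can be sketched with `java.util.concurrent.ConcurrentLinkedQueue`, whose `poll()` returns null on an empty queue instead of throwing. This is only an illustration, not the code used in this article: the class `PollDemo`, the method `drain`, and the example.com URLs are hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of the original WebCrawler.
class PollDemo {
    // Starts `threads` workers that drain a queue holding `urls` entries
    // and returns how many entries were taken in total.
    static int drain(int urls, int threads) {
        Queue<String> notCrawled = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < urls; i++) {
            notCrawled.add("http://example.com/page" + i); // placeholder URLs
        }
        AtomicInteger taken = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                // poll() removes and returns the head atomically, or returns
                // null when the queue is empty, so there is no gap between
                // "is it empty?" and "take the first element".
                while (notCrawled.poll() != null) {
                    taken.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return taken.get();
    }

    public static void main(String[] args) {
        System.out.println("URLs taken: " + drain(1000, 10)); // prints "URLs taken: 1000"
    }
}
```

Every URL is taken exactly once and no thread ever calls a "get first element" operation on an empty collection, which is the property the corrected getAUrl() provides.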

The thread's run method was changed to:

					while (true) {
//						System.out.println("Entered " + Thread.currentThread().getName());
						String tmp = getAUrl();
						if (tmp != null) {
							crawler(tmp);
						} else {
							synchronized (signal) {
								try {
									count++;
									System.out.println("Currently " + count + " thread(s) waiting");
									signal.wait();
								} catch (InterruptedException e) {
									e.printStackTrace();
								}
							}
						}
					}

That is, when a thread starts it calls getAUrl() to take a URL from notCrawlurlSet; if none is available, it waits on signal. But where should it be woken up? Clearly, when notCrawlurlSet has elements again, that is, when it is no longer empty. There is an important variable here, count, which records the number of waiting threads. signal.notify() is called only when count is greater than 0, that is, only when some thread is waiting. This is implemented in the parseContext function:

				if (str.contains("http:")) { // skip addresses that are not URLs
					if (!allurlSet.contains(str)) {
						addUrl(str, dep); // add a new URL
						if (count > 0) { // if any thread is waiting, wake one up
							synchronized (signal) {
								count--;
								signal.notify();
							}
						}
					}
				}

The count variable also solved another problem for me. Once all the threads had started they crawled pages correctly, but I did not know how to end them, because each thread runs an infinite loop. With count, I know how many threads are waiting. When the number of waiting threads equals threadCount, the crawl is finished: every thread is waiting, so none of them will add a new URL to notCrawlurlSet, and all pages within the specified depth have been crawled.
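
The waiting and termination logic described above can also be sketched in a more compact form, in which the URL queue and the waiting counter are guarded by the same lock so that the check and the wake-up cannot interleave. This is a sketch under assumptions, not the code used in this article: the class `Frontier`, its methods, and the synthetic `links` map that stands in for real pages are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch; not the article's WebCrawler.
class Frontier {
    private final Queue<String> notCrawled = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    private final int threadCount;
    private int waiting = 0;      // same role as `count` in the article
    private boolean done = false; // set once every worker is idle

    Frontier(int threadCount) {
        this.threadCount = threadCount;
    }

    // Adds a URL if it has not been seen before and wakes waiting workers.
    synchronized void add(String url) {
        if (seen.add(url)) {
            notCrawled.add(url);
            notifyAll();
        }
    }

    // Returns the next URL, or null once the queue is empty and all workers are idle.
    synchronized String take() {
        while (notCrawled.isEmpty() && !done) {
            waiting++;
            if (waiting == threadCount) {
                done = true;      // nobody is left to add URLs: the crawl is finished
                notifyAll();
            } else {
                try {
                    wait();       // re-checked in the loop after every wake-up
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    done = true;
                    notifyAll();
                }
            }
            waiting--;
        }
        return notCrawled.poll();
    }

    // Crawls a synthetic link graph with `threadCount` workers; returns the visited URLs.
    static Set<String> crawl(Map<String, List<String>> links, String start, int threadCount) {
        Frontier frontier = new Frontier(threadCount);
        Set<String> visited = Collections.synchronizedSet(new HashSet<String>());
        frontier.add(start);
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                String url;
                while ((url = frontier.take()) != null) {
                    visited.add(url);
                    for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                        frontier.add(next);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c", "d"));
        links.put("d", Arrays.asList("a"));
        System.out.println(crawl(links, "a", 4).size() + " pages visited"); // prints "4 pages visited"
    }
}
```

Because the workers return from `take()` with null when the crawl is finished, the threads end on their own and the main thread can simply `join()` them instead of polling a counter in a loop.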

One small reflection: understanding the principle is one thing, but implementing it can be much more demanding.

I have revised the code several times. Here are the places that still need improvement, along with my thoughts:

1: Saving the crawled pages. How to store them is also a problem: how to generate the directory structure, how to classify pages automatically, and so on. For classification, a Bayes classifier could be considered, with pages then stored by category.

2: Page deduplication. If there are too many URLs to fit in memory, what can be done? One option is to compress them first, for example with MD5, which also yields a hash value; the simplest approach is then hash-based deduplication. A Bloom filter is another option. Yet another is to use a key-value database for deduplication. I am not very familiar with key-value databases, but they should work much like a hash, with the efficiency problems already solved for you by the database.

3: Pages with different URLs may have the same content. How can page similarity be determined? One approach is to extract the main text of each page (there is a block function method for this) and then compute the similarity of the extracted texts with the vector cosine method.

4: Incremental crawling. After a page has been crawled, when should it be crawled again? This can be decided by the page's update frequency. For example, the news on the Sina home page is updated fairly often, so it should be re-crawled more frequently.
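
Two of the ideas in this list, hashing URLs for deduplication (item 2) and cosine similarity between page texts (item 3), can be sketched briefly. This is a minimal illustration under assumptions, not a finished design: the class `CrawlIdeas` and all of its methods are hypothetical names, the digests are kept in an in-memory `HashSet`, and the text is split into words on non-letter characters, which would not be enough for Chinese text without a proper word segmenter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the names are illustrative only.
class CrawlIdeas {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns the MD5 digest of a URL as 32 hexadecimal characters.
    static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time a URL is offered and false for repeats.
    synchronized boolean markIfNew(String url) {
        return seenDigests.add(md5Hex(url));
    }

    // Counts word occurrences in a text (lower-cased, split on non-letters/digits).
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity of two term-count vectors: dot(a, b) / (|a| * |b|).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty text is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        CrawlIdeas ideas = new CrawlIdeas();
        System.out.println(ideas.markIfNew("http://www.cnblogs.com")); // true: first time seen
        System.out.println(ideas.markIfNew("http://www.cnblogs.com")); // false: duplicate
        System.out.println(cosine(termCounts("java web crawler"),
                                  termCounts("java crawler example")));
    }
}
```

A similarity close to 1 suggests that two pages have nearly the same text, while 0 means they share no words. The threshold at which two pages count as duplicates would have to be chosen by experiment.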

These are my thoughts for now; I will keep improving the crawler.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
