Implementing a Web Crawler in Java

Source: Internet
Author: User

Last night I used a web crawler I wrote myself to download more than 30,000 pictures from a website, which was very satisfying. Today I would like to share a few points with you.

I. Summary of Contents

1: Java can also be used to implement a web crawler

2: Simple use of the jsoup.jar package

3: It can crawl a website's pictures, animated images, and compressed archives

4: Multithreading can be considered to speed up downloads

II. Preparatory Work

1: Install the Java JDK

2: Download jsoup.jar

3: Install Eclipse or another programming environment

4: Create a new Java project and import jsoup.jar

III. Steps

1: Use the java.net package to connect to the website and get the page's source code

2: Use the jsoup package to parse and iterate over the source code to extract the final image URLs you want

3: Write your own download method to save the pictures locally according to those image URLs (a minimal sketch of this flow is shown below)
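As a rough sketch of this three-step flow (not the author's full program, which follows in the next section), the snippet below condenses the steps into one small program using jsoup; the site URL, the img selector, and the save directory are illustrative assumptions and would need to be adapted for another site.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StepsSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: fetch the page source (Jsoup wraps the java.net connection here)
        Document doc = Jsoup.connect("http://so.sccnn.com/").get();
        // Step 2: parse and iterate over the source to collect image URLs
        for (Element img : doc.select("img[src$=.jpg]")) {
            String src = img.attr("src");
            String name = src.substring(src.lastIndexOf('/') + 1);
            // Step 3: download each image to a local file
            try (InputStream in = new URL(src).openStream();
                 FileOutputStream fos = new FileOutputStream("e:\\mydownload\\" + name)) {
                in.transferTo(fos); // Java 9+: copy the stream straight to the file
            }
        }
    }
}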

IV. Program Implementation and Effect

For example, for the search result pages of the target site, we can download all the HD images in the main area, from the first page to the last.




Program implementation:

package com.kendy.spider;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MySpider {

    /**
     * Connect to the URL, read the page source and return it as a string.
     */
    public static String getHtmlFromUrl(String url, String encoding) {
        StringBuffer html = new StringBuffer();
        InputStreamReader isr = null;
        BufferedReader buf = null;
        String str = null;
        try {
            URL urlObj = new URL(url);
            URLConnection con = urlObj.openConnection();
            isr = new InputStreamReader(con.getInputStream(), encoding);
            buf = new BufferedReader(isr);
            while ((str = buf.readLine()) != null) {
                html.append(str + "\n");
            }
            sop(html.toString());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (isr != null) {
                try {
                    buf.close();
                    isr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return html.toString();
    }

    /**
     * Download the file at the given URL into the given local directory.
     */
    public static void download(String url, String path) {
        File file = null;
        FileOutputStream fos = null;
        String downloadName = url.substring(url.lastIndexOf("/") + 1);
        HttpURLConnection httpCon = null;
        URLConnection con = null;
        URL urlObj = null;
        InputStream in = null;
        byte[] size = new byte[1024];
        int num = 0;
        try {
            file = new File(path + downloadName);
            // if (!file.exists()) {
            //     file.mkdir();
            // }
            fos = new FileOutputStream(file);
            if (url.startsWith("http")) {
                urlObj = new URL(url);
                con = urlObj.openConnection();
                httpCon = (HttpURLConnection) con;
                in = httpCon.getInputStream();
                while ((num = in.read(size)) != -1) {
                    for (int i = 0; i < num; i++) {
                        fos.write(size[i]);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                in.close();
                fos.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void sop(Object obj) {
        System.out.println(obj);
    }

    public static void seperate(char c) {
        for (int x = 0; x < 100; x++) {
            System.out.print(c);
        }
        sop("");
    }

    /**
     * @author kendy
     * @version 1.0
     */
    public static void main(String[] args) throws Exception {
        int i = 0;
        String picUrl = null;
        List<String> list = new ArrayList<>();
        for (int x = 11; x <= 16; x++) {
            String url = "http://so.sccnn.com/search/%d6%d0%b9%fa%c3%ce/" + x + ".html";
            String encoding = "utf-8";
            String html = getHtmlFromUrl(url, encoding);
            Document doc = Jsoup.parse(html);
            // <a> elements with an href attribute in the main area of the result page
            Elements elements = doc.select("table tbody tr td table tbody tr td table tbody tr td div[valign] a[href]");
            // Elements pngs = doc.select("img[src$=.png]"); // images with a .png extension
            for (Element element : elements) {
                picUrl = element.attr("href");
                if (picUrl.startsWith("http") && picUrl.endsWith("html")) {
                    i++;
                    list.add(picUrl);
                    sop("url " + i + ": " + picUrl);
                    Document document = Jsoup.connect(picUrl).get();
                    Elements els = document.select("div.PhotoDiv div font img");
                    for (Element el : els) {
                        String pictureUrl = el.attr("src");
                        System.out.println("------" + pictureUrl);
                        download(pictureUrl, "e:\\mydownload\\");
                    }
                }
            }
            seperate('*');
        }
        sop(list.size());
    }

    /*
    public static void main(String[] args) {
        int i = 0;
        String picUrl = null;
        String url = "http://so.sccnn.com/";
        String encoding = "utf-8";
        String html = getHtmlFromUrl(url, encoding);
        Document doc = Jsoup.parse(html);
        Elements elements = doc.getElementsByTag("img");
        for (Element element : elements) {
            picUrl = element.attr("src");
            if (picUrl.startsWith("http") && picUrl.endsWith("jpg")) {
                i++;
                sop("picture " + i + "-------------" + picUrl);
                download(picUrl, "e:\\mydownload\\");
            }
        }
        sop("End...");
    }
    */
}
Remarks:

1: The custom getHtmlFromUrl() method together with Jsoup.parse() can actually be replaced by a single Jsoup.connect(url).get() call (see the first sketch after these notes)

2: The custom getHtmlFromUrl and download methods can be improved to handle both HTTP and HTTPS requests (see the second sketch after these notes)

3: You can consider using multithreading to speed up the download: have the class implement the Runnable interface, override the run method, and pass the start page and end page as constructor parameters to initialize each instance (see the third sketch after these notes).
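For remark 1, here is a minimal sketch of fetching and parsing in a single call; the user agent and timeout values are illustrative assumptions, not part of the original program.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectSketch {
    public static void main(String[] args) throws Exception {
        // One call replaces getHtmlFromUrl() + Jsoup.parse(): connect, download and parse
        Document doc = Jsoup.connect("http://so.sccnn.com/")
                .userAgent("Mozilla/5.0") // illustrative; some sites reject the default agent
                .timeout(10_000)          // 10-second connect/read timeout
                .get();
        System.out.println(doc.title());
    }
}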
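For remark 2, one possible sketch of a download method that serves both HTTP and HTTPS: HttpsURLConnection extends HttpURLConnection, so a single cast covers both schemes, and try-with-resources plus a chunked write replaces the byte-by-byte loop of the original. The timeout values and the example call in main are assumptions.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DownloadSketch {
    public static void download(String url, String dir) throws Exception {
        String name = url.substring(url.lastIndexOf('/') + 1);
        // HttpsURLConnection extends HttpURLConnection, so this cast works for http and https
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setConnectTimeout(10_000);
        con.setReadTimeout(10_000);
        try (InputStream in = con.getInputStream();
             FileOutputStream fos = new FileOutputStream(dir + name)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                fos.write(buf, 0, n); // write each chunk in one call
            }
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        download("http://so.sccnn.com/favicon.ico", "e:\\mydownload\\"); // illustrative call
    }
}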
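For remark 3, a sketch of the multithreaded variant: a Runnable crawler whose constructor takes the start and end page, so the page range 11-16 from the main program can be split across threads. The class name PageSpider and the crawlPage() placeholder are assumptions standing in for the parse-and-download logic shown above.

public class PageSpider implements Runnable {
    private final int startPage;
    private final int endPage;

    public PageSpider(int startPage, int endPage) {
        this.startPage = startPage;
        this.endPage = endPage;
    }

    @Override
    public void run() {
        for (int page = startPage; page <= endPage; page++) {
            crawlPage(page); // fetch, parse and download one result page
        }
    }

    private void crawlPage(int page) {
        // placeholder for the jsoup parsing and download code from the main program
    }

    public static void main(String[] args) {
        // split pages 11-16 across two threads
        new Thread(new PageSpider(11, 13)).start();
        new Thread(new PageSpider(14, 16)).start();
    }
}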

