A simple crawler in Java: grabbing CoolApk user avatars

Source: Internet
Author: User

Crawler idea

Start from the personal center page of a user who has many fans on the site. From that user's fan list, collect each fan's personal center link, avatar link, and user name, and put them into separate queues. Two threads then do the work: one takes a user link from its queue, fetches that user's fan list, and adds the newly found links to the queues; the other takes links from the avatar link queue and downloads the pictures.
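The two-thread arrangement described above can be sketched with the standard library alone. This is a minimal illustration, not the article's code: it assumes a java.util.concurrent.LinkedBlockingQueue in place of a hand-written queue, a hypothetical DONE marker to tell the consumer that no more links are coming, and a list append standing in for the actual download.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class TwoThreadSketch {
    // Hypothetical end-of-work marker (an assumption, not part of the article's code)
    static final String DONE = "__DONE__";

    // Runs one producer and one consumer; returns the avatar links the consumer handled
    static List<String> run(List<String> avatarUrls) {
        BlockingQueue<String> avatarQueue = new LinkedBlockingQueue<>();
        List<String> downloaded = Collections.synchronizedList(new ArrayList<>());

        // Producer: plays the role of the thread that discovers avatar links
        Thread userThread = new Thread(() -> {
            for (String url : avatarUrls) {
                avatarQueue.add(url);
            }
            avatarQueue.add(DONE);
        });

        // Consumer: plays the role of the thread that downloads avatars
        Thread headThread = new Thread(() -> {
            try {
                while (true) {
                    String url = avatarQueue.take(); // blocks until a link is available
                    if (url.equals(DONE)) {
                        break;
                    }
                    downloaded.add(url); // stand-in for the real download
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        userThread.start();
        headThread.start();
        try {
            userThread.join();
            headThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downloaded;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a.jpg", "b.jpg")));
    }
}
```

Because take() blocks while the queue is empty, the consumer needs neither a polling loop nor a shared "is the producer still running" flag.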

Crawler analysis

Open a user's fan list in a browser (http://coolapk.com/u/[user id]/contacts) and view the page source.

We can see that the fan list sits in a ul tag whose id is dataList, and that each li tag inside that ul holds the information for one user. Looking closer, the img tag in each li is the user's avatar, the text of the h4 tag is the user name, and the href attribute of the a tag inside the h4 is the link to the user's personal center.
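To make that structure concrete, here is a small sketch that pulls the three fields out of a hand-written sample fragment. Everything in it is an assumption for illustration: the sample markup, names, and URLs are invented, and the fragment is deliberately well-formed so that the JDK's XML parser and XPath can read it without extra libraries. Real pages are rarely well-formed XML, which is why the crawler itself uses Jsoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FanListMarkupSketch {
    // Invented sample that mirrors the structure described above (not real site data)
    static final String SAMPLE =
        "<ul id='dataList'>"
        + "<li><img src='http://example.com/avatar/1001.jpg'/><h4><a href='/u/1001'>alice</a></h4></li>"
        + "<li><img src='http://example.com/avatar/1002.jpg'/><h4><a href='/u/1002'>bob</a></h4></li>"
        + "</ul>";

    // Returns one {userName, personalCenterHref, avatarSrc} triple per li element
    static List<String[]> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                "//ul[@id='dataList']/li", doc, XPathConstants.NODESET);
            List<String[]> fans = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node li = items.item(i);
                String avatar = xpath.evaluate("img/@src", li); // avatar link
                String href = xpath.evaluate("h4/a/@href", li); // personal center link
                String name = xpath.evaluate("h4/a", li);       // user name (text of the a tag)
                fans.add(new String[] {name, href, avatar});
            }
            return fans;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        for (String[] fan : extract(SAMPLE)) {
            System.out.println(fan[0] + " " + fan[1] + " " + fan[2]);
        }
    }
}
```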

Observation also shows that a user's fan list link = personal center link + "/contacts".

With that, we can start crawling avatars.
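As a small worked example of that rule, the helper below builds a fan list link from the host and the href taken from the page. The class and method names are hypothetical, and the trailing-slash handling is an added precaution rather than something the site is known to require.

```java
class FanListUrlSketch {
    // Builds "<host><profileHref>/contacts"; names here are illustrative assumptions
    static String fanListUrl(String host, String profileHref) {
        String base = host + profileHref;
        // Avoid a double slash if the href happens to end with "/"
        if (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return base + "/contacts";
    }

    public static void main(String[] args) {
        System.out.println(fanListUrl("http://coolapk.com", "/u/12202"));
    }
}
```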

Libraries used

Jsoup:

Purpose: parse and manipulate HTML elements. Download: https://jsoup.org/download

HttpClient:

Purpose: download pictures. Download: http://hc.apache.org/downloads.cgi

Code

Main.java

    package main;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main {
        // Browser User-Agent
        private static String ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36";
        // Host address
        private static final String host = "http://coolapk.com";
        // Local directory where avatars are saved
        private static final String SAVE_PATH = "d:/coolapk/";
        // Whether UserThread is still running; volatile because HeadThread reads it
        private static volatile boolean isRun = false;
        // Queue of fan list page links
        private static MyQueue<String> userUrlQueue = new MyQueue<>();
        // Queue of avatar links
        private static MyQueue<String> userHeadUrlQueue = new MyQueue<>();
        // Queue of user names
        private static MyQueue<String> userNameQueue = new MyQueue<>();

        public static void main(String[] args) throws Exception {
            userUrlQueue.put("http://coolapk.com/u/12202/contacts");
            File f = new File(SAVE_PATH);
            // Create the folder if it does not exist
            if (!f.exists()) {
                f.mkdirs();
            }
            start();
        }

        /**
         * Start both threads
         */
        private static void start() {
            // Set the flag before either thread runs so HeadThread does not exit early
            isRun = true;
            new UserThread().start();
            new HeadThread().start();
        }

        /**
         * Fetch one fan list page and queue the links found on it
         * @throws Exception
         */
        private static void getUserUrl() throws Exception {
            String url = userUrlQueue.poll();
            if (url == null) {
                return;
            }
            Connection connection = Jsoup.connect(url);
            connection.userAgent(ua);
            Document document = connection.get();
            Element ulElement = document.getElementById("dataList");
            // This page has no fan list
            if (ulElement == null) {
                return;
            }
            Elements liElements = ulElement.getElementsByTag("li");
            for (Element li : liElements) {
                Element img = li.getElementsByTag("img").first();
                Element h4 = li.getElementsByTag("h4").first();
                Element a = (h4 == null) ? null : h4.getElementsByTag("a").first();
                if (img == null || a == null) {
                    continue;
                }
                // Avatar link of this user
                String userHeadUrl = img.attr("src");
                // Fan list URL of this user
                String userUrl = host + a.attr("href") + "/contacts";
                // User name of this user
                String userName = a.text();
                // Users whose avatar is already saved locally are not queued again
                if (!new File(SAVE_PATH + userName + ".jpg").exists()) {
                    userUrlQueue.put(userUrl);
                    // Queue the name before the avatar link so HeadThread always finds both
                    userNameQueue.put(userName);
                    userHeadUrlQueue.put(userHeadUrl);
                }
            }
        }

        /**
         * Download a picture and save it locally
         * @param imgUrl
         * @param localPath
         * @throws Exception
         */
        private static void getImage(String imgUrl, String localPath) throws Exception {
            // try-with-resources closes the client, response, and streams even on failure
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse resp = httpClient.execute(new HttpGet(imgUrl));
                 InputStream inputStream = resp.getEntity().getContent();
                 FileOutputStream fileOutputStream = new FileOutputStream(localPath)) {
                byte[] buf = new byte[1024];
                int len;
                while ((len = inputStream.read(buf)) != -1) {
                    fileOutputStream.write(buf, 0, len);
                }
            }
        }

        /**
         * Thread that collects links
         * @author zyw
         */
        public static class UserThread extends Thread {
            @Override
            public void run() {
                // Work until the queue userUrlQueue is empty
                while (!userUrlQueue.isEmpty()) {
                    try {
                        getUserUrl();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
                // No pages left to fetch; HeadThread may stop once its queue drains
                isRun = false;
            }
        }

        /**
         * Thread that downloads avatars
         * @author zyw
         */
        public static class HeadThread extends Thread {
            @Override
            public void run() {
                // Keep going while avatars are queued or UserThread is still working
                while (!userHeadUrlQueue.isEmpty() || isRun) {
                    try {
                        if (userHeadUrlQueue.isEmpty()) {
                            // Nothing to download yet; wait briefly instead of spinning
                            Thread.sleep(100);
                            continue;
                        }
                        String imgUrl = userHeadUrlQueue.poll();
                        String userName = userNameQueue.poll();
                        getImage(imgUrl, SAVE_PATH + userName + ".jpg");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

MyQueue.java

    package main;

    import java.util.LinkedList;

    /**
     * Thread-safe queue
     * @author zyw
     *
     * @param <T>
     */
    public class MyQueue<T> {
        private final LinkedList<T> userUrlQueue = new LinkedList<T>();
        private final Object lock = new Object();

        /**
         * Returns whether the queue is empty
         * @return
         */
        public boolean isEmpty() {
            synchronized (lock) {
                return userUrlQueue.isEmpty();
            }
        }

        /**
         * Inserts an element at the end of the queue
         * @param t
         */
        public void put(T t) {
            synchronized (lock) {
                userUrlQueue.addLast(t);
            }
        }

        /**
         * Takes an element from the head of the queue
         * @return the head element, or null if the queue is empty
         */
        public T poll() {
            synchronized (lock) {
                // pollFirst returns null on an empty list, where removeFirst would throw
                return userUrlQueue.pollFirst();
            }
        }
    }
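For comparison, the JDK already ships a thread-safe queue with the same put/poll/isEmpty shape. The sketch below is an alternative, not the author's approach: it uses java.util.concurrent.ConcurrentLinkedQueue, whose poll() returns null when the queue is empty. The class and method names are made up for the example.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class StdlibQueueSketch {
    // Drains the queue in FIFO order and joins the items with ';'
    static String drain(Queue<String> queue) {
        StringBuilder sb = new StringBuilder();
        String item;
        // poll() returns null once the queue is empty, so no exception handling is needed
        while ((item = queue.poll()) != null) {
            sb.append(item).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("http://coolapk.com/u/12202/contacts");
        System.out.println(drain(queue));
    }
}
```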

