Writing a simple web crawler on Android

Source: Internet
Author: User

First, the basics of web crawlers

A web crawler travels across the Internet and fetches the pages it finds, which is where the idea of "crawling" comes from. How does a crawler traverse the network? The Internet can be seen as one large graph: each page is a node, and each link between pages is an edge. Graph traversal is either breadth-first or depth-first, but a depth-first traversal may go too deep or fall into a "black hole" (a crawler trap), so most crawlers do not use it. Instead, while traversing breadth-first, the crawler gives a priority to the pages waiting to be visited; this is called preferential traversal.
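To make preferential traversal concrete, here is a minimal sketch over an in-memory graph. It is not the code of the demo app below: the class name PreferentialTraversal, the links map and the priority scores are all assumptions made up for illustration. The frontier is a priority queue, so the page with the highest score is visited first and ties are broken in discovery order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

class PreferentialTraversal {
    /** A page waiting in the frontier, with the score used to order it. */
    private static final class Entry {
        final String url;
        final int priority;
        final long order; // discovery counter, keeps ties in first-come order

        Entry(String url, int priority, long order) {
            this.url = url;
            this.priority = priority;
            this.order = order;
        }
    }

    /**
     * Returns the order in which pages reachable from seed are visited.
     * links maps a page to its outgoing links; priority maps a page to its
     * score (higher is visited first, pages without a score count as 0).
     */
    static List<String> crawlOrder(String seed,
                                   Map<String, List<String>> links,
                                   Map<String, Integer> priority) {
        PriorityQueue<Entry> frontier = new PriorityQueue<>(
                Comparator.comparingInt((Entry e) -> e.priority).reversed()
                        .thenComparingLong(e -> e.order));
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        long counter = 0;

        frontier.add(new Entry(seed, priority.getOrDefault(seed, 0), counter++));
        seen.add(seed);
        while (!frontier.isEmpty()) {
            Entry current = frontier.poll();
            visited.add(current.url);
            for (String next : links.getOrDefault(current.url, Collections.<String>emptyList())) {
                if (seen.add(next)) { // add() is false for pages already queued
                    frontier.add(new Entry(next, priority.getOrDefault(next, 0), counter++));
                }
            }
        }
        return visited;
    }
}
```

With every score left at 0 this degrades to a plain breadth-first traversal, which is why the two ideas are usually described together.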

A real crawler starts from a set of seed links. A seed link is a starting node, and the pages that a seed page links to are its child nodes (intermediate nodes). Non-HTML documents, such as Excel files, are terminal nodes of the graph, because no hyperlinks can be extracted from them. Throughout the traversal the crawler maintains a visited table that records which nodes (links) have already been processed, so that they are skipped when they turn up again.
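The seed links, the terminal nodes and the visited table can be sketched in a few lines of plain Java. Everything here is an assumption for illustration: the class name SeedCrawl, the in-memory links map that stands in for real downloads, and the isHtml rule that decides by file extension whether a node can have children.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SeedCrawl {
    /** Hypothetical rule: only URLs that look like HTML pages are parsed for links. */
    static boolean isHtml(String url) {
        return url.endsWith(".html") || url.endsWith(".htm") || url.endsWith("/");
    }

    /** Breadth-first crawl from the seeds; returns every link in discovery order. */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>(); // the visited table

        for (String seed : seeds) {
            if (visited.add(seed)) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (!isHtml(page)) {
                continue; // terminal node: nothing to extract from it
            }
            for (String child : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (visited.add(child)) { // already recorded links are skipped
                    queue.add(child);
                }
            }
        }
        return new ArrayList<>(visited);
    }
}
```

A link is recorded in the visited table when it is first discovered rather than when it is fetched, so the same page is never queued twice even if many pages point to it.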

Second, a simple Android web crawler demo

First, a look at the result.

I scraped this page and wrote an app around it.

It really works.

The ListView is styled as cards, and the colors have a nice papery texture.

I rather like it, anyway.

Now let's see how it's done.

Take a look at what each class does:

MainActivity: the Activity of the main screen

MainAdapter: the ListView adapter

NetworkClass: connects to the network, uses HttpClient to send the request and receive the response, and returns the content. What comes back is simply the web page, that is, a big chunk of HTML that still has to be parsed.

News: this class has two attributes, the title of a headline and the URL that the headline points to.

NewsActivity: the news detail screen

PullListView: a customized ListView with pull-down-to-refresh and pull-up-to-load-more
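The article never lists the News class itself, so the following is only a guess at its shape: a plain holder for the two attributes described above. setTitle, setUrl and getUrl are the names called in the listings further down; getTitle is an assumption added for symmetry.

```java
/** Minimal sketch of the News holder: one headline and the URL it points to. */
class News {
    private String title; // headline text shown in the list
    private String url;   // address of the page the headline links to

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
```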

Then start with onCreate():

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    initView();

    // First load: fetch the news list on a background thread.
    MainThread mt = new MainThread(newsUrl);
    final Thread t = new Thread(mt, "MainThread");
    t.start();

    // Pull down: reload the first page.
    pullListView.setOnRefreshListener(new PullListView.OnRefreshListener() {
        @Override
        public void onRefresh() {
            isGetMore = false;
            MainThread mt = new MainThread(newsUrl);
            Thread t = new Thread(mt, "MainThread");
            t.start();
        }
    });

    // Pull up: load the next page, if there is one.
    pullListView.setOnGetMoreListener(new PullListView.OnGetMoreListener() {
        @Override
        public void onGetMore() {
            isGetMore = true;
            if (num > 1) {
                MainThread mt = new MainThread(nextPage);
                Thread t = new Thread(mt, "MainThread");
                t.start();
            }
        }
    });

    // Item click: open the detail screen with the URL of the clicked headline.
    pullListView.setOnItemClickListener(new AdapterView.OnItemClickListener() {
        @Override
        public void onItemClick(AdapterView<?> parent, View view, int position, long id) {
            Intent intent = new Intent(MainActivity.this, NewsActivity.class);
            intent.putExtra("url", list.get(position - 1).getUrl());
            startActivity(intent);
        }
    });
}

This is basically the initial setup of the data.

Then a new thread is started: a network request is involved, so it has to run off the main thread. After that come the pull-down and pull-up listeners of the ListView and the item click binding.

So the main work happens inside the thread.

Before looking at the thread, look at NetworkClass:

package com.example.katherine_qj.news;

import android.net.http.HttpResponseCache;
import android.util.Log;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Created by KATHERINE-QJ on 2016/7/24.
 */
public class NetworkClass {
    public String getDataByGet(String url) {
        Log.e("qwe", "content");
        String content = "";
        HttpClient httpClient = new DefaultHttpClient();
        Log.e("qwe", "content1");
        /* Sending a request and receiving a response with HttpClient is simple.
           It generally takes the following steps:
           1. Create the HttpClient object.
           2. Create an instance of the request method and specify the request URL:
              an HttpGet object for a GET request, an HttpPost object for a POST request.
           3. To send request parameters, call setParams(HttpParams params), which HttpGet
              and HttpPost have in common; for an HttpPost object you can also call
              setEntity(HttpEntity entity).
           4. Call execute(HttpUriRequest request) on the HttpClient object to send the
              request; the method returns an HttpResponse.
           5. Call getAllHeaders(), getHeaders(String name) and so on of HttpResponse to read
              the response headers; call getEntity() to get the HttpEntity that wraps the
              response content. The program reads the server's response through this object.
           6. Release the connection, whether or not the execution succeeded. */
        HttpGet httpGet = new HttpGet(url);
        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            // HttpResponse holds the result the server returns for the request.
            // getStatusLine() gets the status line of this response.
            // The compared value is missing from the source listing; 200 (OK) is assumed here.
            if (httpResponse.getStatusLine().getStatusCode() == 200) {
                // getEntity() gets the message entity of this response, if any.
                InputStream is = httpResponse.getEntity().getContent();
                BufferedReader reader = new BufferedReader(new InputStreamReader(is));
                String line;
                while ((line = reader.readLine()) != null) {
                    content += line;
                }
            }
        } catch (IOException e) {
            Log.e("http", e.toString());
        }
        Log.e("sdf", content);
        return content;
    }
}

The comments are very detailed.

In short, there is one method, getDataByGet, which takes a url parameter, sends the request and returns the content of the page as the string content.
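One detail of that read loop is easy to miss: readLine() strips the line terminator, so appending the lines with += glues them together without line breaks. The sketch below is a possible variant, not the author's code. It keeps the line breaks, closes the reader, and fetches with java.net.HttpURLConnection instead of Apache HttpClient, which is no longer part of the Android SDK since API 23. The class name StreamText, the UTF-8 charset and the 10 second timeouts are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class StreamText {
    /** Reads the whole stream as text, putting a '\n' after every line. */
    static String readAll(InputStream in) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    /** Fetches a page with HttpURLConnection; returns "" unless the status is 200. */
    static String fetch(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        try {
            connection.setConnectTimeout(10_000); // assumed timeout, in milliseconds
            connection.setReadTimeout(10_000);
            if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return "";
            }
            return readAll(connection.getInputStream());
        } finally {
            connection.disconnect(); // release the connection in every case
        }
    }
}
```

As with getDataByGet, fetch has to be called from a background thread on Android. If the scraped site does not serve UTF-8, the charset passed to InputStreamReader would have to be changed accordingly.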

Next is the thread that uses this class.

public class MainThread implements Runnable {
    private String url;

    public MainThread(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        NetworkClass networkClass = new NetworkClass(); // create the network class
        // The string returned by this class is the HTML that needs to be parsed.
        content = networkClass.getDataByGet(url);
        Log.e("qwe", content);
        // MSG_LOADED is a placeholder name: the numeric message code is missing
        // from the source listing.
        handler.sendEmptyMessage(MSG_LOADED);
    }
}

This thread fetches the content and then uses the handler to tell the main thread to parse it.
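The same hand-off can be tried outside Android, where there is no Handler. In this rough sketch a second executor stands in for the main thread: the download runs on the worker, and the parsing step is posted to the "main" executor once the content is there. The class name BackgroundFetch and both parameters are made up for the example; the app itself uses Thread and android.os.Handler as shown in the listings.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.Supplier;

class BackgroundFetch {
    /**
     * Runs download on the worker executor, then runs parse on the main
     * executor, which plays the role of the Android main thread here.
     */
    static <T> CompletableFuture<T> fetchThenParse(Supplier<String> download,
                                                   Function<String, T> parse,
                                                   Executor worker,
                                                   Executor main) {
        return CompletableFuture
                .supplyAsync(download, worker)  // like MainThread.run()
                .thenApplyAsync(parse, main);   // like Handler.handleMessage()
    }
}
```

For a quick experiment, two executors from Executors.newSingleThreadExecutor() are enough; remember to shut them down afterwards.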

private final android.os.Handler handler = new android.os.Handler() {
    public void handleMessage(Message msg) {
        switch (msg.what) {
            // MSG_LOADED is a placeholder name: the numeric message code is missing
            // from the source listing.
            case MSG_LOADED:
                analyseHtml();
                if (isGetMore) {
                    mainAdapter.notifyDataSetChanged();
                    /* Every call to notifyDataSetChanged() redraws the interface. When several
                       properties of the views have to change, finish all the changes first and
                       only then call notifyDataSetChanged() to redraw. */
                } else {
                    mainAdapter = new MainAdapter(MainActivity.this, list);
                    pullListView.setAdapter(mainAdapter);
                }
                pullListView.refreshComplete();
                pullListView.getMoreComplete();
                break;
        }
    }
};

analyseHtml();

The actual parsing happens in this method, so here is the code that parses the page:

public void analyseHtml() {
    if (content != null) {
        int x = 0;
        Document document = Jsoup.parse(content); // parse the HTML string
        if (!isGetMore) {
            list.clear();
            Element element = document.getElementById("fanye3942"); // get the node fanye3942
            String text = element.text(); // get the text part of this node
            System.out.print(text);
            num = Integer.parseInt(text.substring(text.lastIndexOf('/') + 1, text.length() - 1));
            System.out.print(num);
        }
        Elements elements = document.getElementsByClass("c3942"); // all nodes with the class c3942
        while (true) {
            if (x == elements.size()) {
                System.out.print(elements.size());
                break; // traversed to the end, exit
            }
            News news = new News();
            news.setTitle(elements.get(x).attr("title")); // the title of each node
            news.setUrl(elements.get(x).attr("href"));
            // list.add(news);
            if (!isGetMore || x > 10) {
                list.add(news);
                if (x >= 25) {
                    break;
                }
            } // this is because our school's web page contains repeated entries
            x++;
        }
        if (num > 1) {
            // There is paging here: this is the URL of the next page, which the thread
            // requests when the list is pulled up.
            nextPage = url + "/" + --num + ".htm";
            System.out.println("qqqqqqqqqqq" + nextPage);
        }
    }
}
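The paging arithmetic in analyseHtml() can be tried on its own. The sketch below is a slightly more forgiving variant rather than the original code: instead of cutting off exactly one trailing character, it reads the run of digits that follows the last '/'. The class name Paging, the sample pager text "1/25 " and the example.com address are assumptions; the article does not show what the pager node really contains.

```java
class Paging {
    /**
     * Reads the total page count from pager text such as "1/25 " (hypothetical
     * sample): the digits right after the last '/'. Throws NumberFormatException
     * when no digits follow the slash.
     */
    static int pageCount(String pagerText) {
        int start = pagerText.lastIndexOf('/') + 1;
        int end = start;
        while (end < pagerText.length() && Character.isDigit(pagerText.charAt(end))) {
            end++;
        }
        return Integer.parseInt(pagerText.substring(start, end));
    }

    /** Builds the URL of the next page to load, mirroring url + "/" + --num + ".htm". */
    static String nextPageUrl(String baseUrl, int num) {
        return baseUrl + "/" + (num - 1) + ".htm";
    }
}
```

For example, Paging.pageCount("1/25 ") gives 25, and Paging.nextPageUrl("http://example.com/news", 25) gives "http://example.com/news/24.htm", so the pages count down as the list is pulled up.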

The Document object gives us access to every element of the HTML page.

So on Android, Jsoup is used to turn content into a Document object.

Then you can take it apart and pull out whatever data you need.

At first I wondered what fanye3942 and c3942 were. They turned out to be the id and the class of the nodes that hold the data we need.

Just like this.

Then the data is added to the list, and the ListView is bound to that list.

That is how the main page works. Jumping to the detail page works because each News object also stores its URL: on a click the URL is passed to NewsActivity, which parses and displays the page with the same idea.

package com.example.katherine_qj.news;

import android.app.Activity;
import android.os.Bundle;
import android.os.Message;
import android.util.Log;
import android.widget.EditText;
import android.widget.TextView;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by KATHERINE-QJ on 2016/7/25.
 */
public class NewsActivity extends Activity {
    private TextView textTitle;
    private TextView textEdit;
    private TextView textDetail;
    private String title;
    private String edit;
    private String detail;
    private StringBuilder text;
    private String url;
    private Document document;
    private String content;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_news);
        initView();
        url = getIntent().getStringExtra("url");
        Log.e("qqq", url);
        // Fetch the detail page on a background thread.
        NewsThread newsThread = new NewsThread(url);
        final Thread t = new Thread(newsThread, "NewsActivity");
        t.start();
    }

    public void initView() {
        textTitle = (TextView) findViewById(R.id.textTitle);
        textEdit = (TextView) findViewById(R.id.textEdit);
        textDetail = (TextView) findViewById(R.id.textDetail);
    }

    private final android.os.Handler handler = new android.os.Handler() {
        public void handleMessage(Message msg) {
            if (msg.what == 1001) {
                document = Jsoup.parse(content);
                analyseHtml(document);
                textTitle.setText(title);
                textEdit.setText(edit);
                textDetail.setText(text);
            }
        }
    };

    public class NewsThread implements Runnable {
        String url;

        public NewsThread(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            NetworkClass networkClass = new NetworkClass();
            content = networkClass.getDataByGet(url);
            System.out.print("qqq" + content);
            handler.sendEmptyMessage(1001);
        }
    }

    // The ids and class names below are specific to the scraped site.
    public void analyseHtml(Document document) {
        if (document != null) {
            Element element = document.getElementById("nrys");
            Elements elements = element.getAllElements();
            title = elements.get(1).text();
            edit = elements.get(4).text();
            Element mElement = document.getElementById("vsb_content_1031");
            if (mElement != null) {
                Elements mElements = mElement.getAllElements();
                text = new StringBuilder();
                for (Element child : mElements) {
                    if (child.className().equals("nrzwys") || child.tagName().equals("strong")) {
                        continue;
                    }
                    // Skip elements whose text is empty or only whitespace.
                    if (!child.text().trim().isEmpty()) {
                        text.append(child.text()).append("\n");
                    }
                    if (child.className().equals("vsbcontent_end")) {
                        break;
                    }
                }
            }
        }
    }
}

That is everything about writing a simple web crawler on Android. The article is quite detailed, and I hope it helps you in your Android development.
