Using Jsoup to crawl national administrative division data (provinces, cities, counties, towns, and villages)

Source: Internet
Author: User

I have recently been working on something that needs geographic data for the whole country, from provinces down to counties, townships, and subdistricts. After many searches on Baidu and Google, I could not find a complete data set. Persistence paid off and I eventually found a fairly complete one in another blog post, but it is only accurate down to the town level and has no village-level data (after analyzing the data source I later understood why, hehe). The data that blogger provides also contains some redundancy. Being a perfectionist with a touch of obsessive-compulsiveness, I decided I had to crawl this data myself.

That blog post is also rich in content, and its author implemented everything in PHP. Java ranked first among programming languages in 2015, so we cannot fall behind. Below, let's look together at how to crawl the data we want from the web pages...

Step one, preparation (data source + tool):

Data source (official data, by far the most comprehensive and authoritative): http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/

Crawling tool (crawler): http://jsoup.org/

Step two, data source analysis:

First of all, I will not explain how to use Jsoup here; if you are interested, look it up yourself.

Developers should get to know what a range of software tools can do, so that they know where to start when a problem comes up in everyday development. I encourage everyone to pay more attention to the tools around them, to be prepared for a rainy day. Before this project I did not know how to use Jsoup, but I knew what it could be used for; when I needed it, I looked up the documentation and learned it myself.

The data source above was published by the National Bureau of Statistics for 2013, so its accuracy and authority are self-evident.

Next we analyze the structure of the data source, starting from the home page:

From the source of the home page we can make the following three observations:

    1. The entire page layout is controlled by table tags. If we want to select the hyperlinks with Jsoup, we must note that the region holding the provinces and municipalities is not the only place that uses a table. The page contains multiple tables, so we cannot parse the data by simply selecting table, like this:
      Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
      Elements rowProvince = connect.select("table");

    2. How many parts of the page contain hyperlinks. Perhaps the authorities anticipated that programmers would need this data, because the page is very clean. Apart from the record-number link at the bottom, which is an extra hyperlink, all other links can be crawled directly.
    3. The pattern of the province data. Each table row that contains valid information has the class attribute provincetr. This is important, and why it matters will become clear further down. Each data row contains multiple td tags, and each td tag contains a hyperlink, which is exactly the hyperlink we want. The text of the hyperlink is the name of the province (or municipality, etc.).

Next we look at the general data pages (the pages that display the city, county, and town levels):

These three pages are grouped together because analysis shows that their structure is exactly the same. The only difference is the class attribute of the data rows (tr) in the HTML table, which is citytr, countytr, and towntr respectively. Everything else is identical, so one common method can handle crawling all three kinds of page.
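
As a minimal sketch of that idea, the row class can be looked up by level and turned into a selector string of the form "tr.<class>", which is how the full program further down builds its selectors. The class name LevelSelectors and the method rowSelector are hypothetical names chosen for this illustration; only the row class names come from the pages themselves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of the original program): maps an
// administrative level to the CSS selector for that level's data rows.
class LevelSelectors {
    // Row classes named in the article, keyed by level.
    private static final Map<Integer, String> ROW_CLASS = new LinkedHashMap<>();

    static {
        ROW_CLASS.put(1, "provincetr"); // province
        ROW_CLASS.put(2, "citytr");     // city
        ROW_CLASS.put(3, "countytr");   // county
        ROW_CLASS.put(4, "towntr");     // town
        ROW_CLASS.put(5, "villagetr");  // village
    }

    /** Builds the selector for the data rows of the given level, e.g. "tr.citytr". */
    static String rowSelector(int level) {
        String cssClass = ROW_CLASS.get(level);
        if (cssClass == null) {
            throw new IllegalArgumentException("Unknown level: " + level);
        }
        return "tr." + cssClass;
    }

    public static void main(String[] args) {
        // The city, county, and town pages differ only in this selector.
        for (int level = 2; level <= 4; level++) {
            System.out.println(level + " -> " + rowSelector(level));
        }
    }
}
```

Because only the selector changes between levels, a single recursive method can take the level as a parameter instead of having one method per page type.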

  

Finally, take a look at the village level data page:

The village-level data differs from the city, county, and town data in two ways. This is the lowest level, so the rows contain no links and cannot be crawled by following links as with the levels above; and the class of the table rows is villagetr. Apart from these two points, each data row contains three columns: the first is the code (citycode), the second is the urban-rural classification (which does not exist in the city, county, and town data), and the third is the name.
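
The column layout can be sketched with a small value class. This is only an illustration under stated assumptions: the class DivisionRow and the method fromCells are hypothetical, the sample cell values are made up, and it is assumed that city, county, and town rows have two cells (code and name) while village rows have three. Taking the first cell as the code and the last cell as the name works for both layouts.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value class for one table row, built from its cell texts.
class DivisionRow {
    final String code;           // first column: division code
    final String classification; // middle column, only present on village rows (else null)
    final String name;           // last column: name

    DivisionRow(String code, String classification, String name) {
        this.code = code;
        this.classification = classification;
        this.name = name;
    }

    /** Accepts two cells (code, name) or three cells (code, classification, name). */
    static DivisionRow fromCells(List<String> cells) {
        if (cells.size() < 2 || cells.size() > 3) {
            throw new IllegalArgumentException("Expected 2 or 3 cells, got " + cells.size());
        }
        String code = cells.get(0).trim();
        String name = cells.get(cells.size() - 1).trim();
        String classification = cells.size() == 3 ? cells.get(1).trim() : null;
        return new DivisionRow(code, classification, name);
    }

    public static void main(String[] args) {
        // Made-up sample values, not real division codes.
        DivisionRow town = fromCells(Arrays.asList("000000000000", "Example Town"));
        DivisionRow village = fromCells(Arrays.asList("000000000001", "000", "Example Village"));
        System.out.println(town.code + " " + town.name);
        System.out.println(village.code + " " + village.classification + " " + village.name);
    }
}
```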

Having grasped all the above points, we can start coding.

Step three, the code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls national province, city, county, town, and village data.
 * @author liushaofeng
 * @date 2015-10-11 12:19:39 AM
 * @version 1.0.0
 */
public class JsoupTest
{
    // Maps each level to the class attribute of that level's table rows
    private static Map<Integer, String> cssMap = new HashMap<Integer, String>();
    private static BufferedWriter bufferedWriter = null;

    static
    {
        cssMap.put(1, "provincetr"); // province
        cssMap.put(2, "citytr");     // city
        cssMap.put(3, "countytr");   // county
        cssMap.put(4, "towntr");     // town
        cssMap.put(5, "villagetr");  // village
    }

    public static void main(String[] args) throws IOException
    {
        int level = 1;
        initFile();

        // Fetch the nationwide list of provinces
        Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
        if (connect == null)
        {
            closeStream();
            return;
        }
        Elements rowProvince = connect.select("tr." + cssMap.get(level));
        for (Element provinceElement : rowProvince) // traverse each row of provinces
        {
            Elements select = provinceElement.select("a");
            for (Element province : select) // each province (e.g. Sichuan)
            {
                parseNextLevel(province, level + 1);
            }
        }
        closeStream();
    }

    private static void initFile()
    {
        try
        {
            // Open the output file in append mode
            bufferedWriter = new BufferedWriter(new FileWriter(new File("D:\\cityinfo.txt"), true));
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private static void closeStream()
    {
        if (bufferedWriter != null)
        {
            try
            {
                bufferedWriter.close();
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
            bufferedWriter = null;
        }
    }

    private static void parseNextLevel(Element parentElement, int level) throws IOException
    {
        try
        {
            // Sleep between requests, or you may see various error status codes.
            // The delay value was lost in this copy of the code; 500 ms is an assumed placeholder.
            Thread.sleep(500);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }

        Document doc = connect(parentElement.attr("abs:href"));
        if (doc != null)
        {
            // Get the data rows of the table for this level
            Elements newsHeadlines = doc.select("tr." + cssMap.get(level));
            for (Element element : newsHeadlines)
            {
                printInfo(element, level + 1);
                Elements select = element.select("a");
                // On recursion, check whether this is village-level data:
                // village rows contain no a tags, so the recursion stops there
                if (select.size() != 0)
                {
                    parseNextLevel(select.last(), level + 1);
                }
            }
        }
    }

    /**
     * Writes one line of data to the data file.
     * @param element the crawled table row
     * @param level the level written into the output line
     */
    private static void printInfo(Element element, int level)
    {
        try
        {
            // Output format: name{level}[code]
            bufferedWriter.write(element.select("td").last().text() + "{" + level + "}["
                + element.select("td").first().text() + "]");
            bufferedWriter.newLine();
            bufferedWriter.flush();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private static Document connect(String url)
    {
        if (url == null || url.isEmpty())
        {
            throw new IllegalArgumentException("The input url('" + url + "') is invalid!");
        }
        try
        {
            // The timeout value was lost in this copy of the code; 100 seconds is an assumed placeholder.
            return Jsoup.connect(url).timeout(100 * 1000).get();
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return null;
        }
    }
}
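
The program follows each link through attr("abs:href"), which asks Jsoup for the link's absolute URL, resolved against the address of the page it was found on. The sketch below shows what that resolution amounts to using only the JDK. The class LinkResolver is a hypothetical name, and the relative links "51.html" and "51/5101.html" are assumed examples rather than values taken from the site.

```java
import java.net.URI;

// Hypothetical helper: resolves a relative href against the page it appears on.
class LinkResolver {
    /** Returns the absolute URL for href as seen from pageUrl. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String home = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/";
        // Assumed relative links, for illustration only.
        String provincePage = resolve(home, "51.html");
        String cityPage = resolve(provincePage, "51/5101.html");
        System.out.println(provincePage);
        System.out.println(cityPage);
    }
}
```

Each level's links are relative to the page that contains them, which is why the recursion resolves every link against its own page rather than against the home page.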

Crawling the data takes a long time, so you just have to wait patiently, hehe. Because the program runs for so long, please do not print the output to the console, or it may affect the run...

Finally, the format of the data is as follows (the value in "{}" is the level, and the content of "[]" is the code):
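
Given that line layout (name, then the level in braces, then the code in brackets), a line from the output file can be split back into its parts with a regular expression. This is a sketch for post-processing only: the class OutputLineParser is a hypothetical name, and the sample line is made up rather than taken from the real output.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper for lines shaped like name{level}[code].
class OutputLineParser {
    private static final Pattern LINE =
        Pattern.compile("^(.*)\\{(\\d+)\\}\\[([^\\]]*)\\]$");

    /** Returns {name, level, code}, or null if the line does not match the layout. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Made-up sample line, not real crawled data.
        String[] parts = parse("Example Town{4}[000000000000]");
        System.out.println("name=" + parts[0] + ", level=" + parts[1] + ", code=" + parts[2]);
    }
}
```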

Once you have the data above, what you do with it is up to you. The code above can be run directly; after crawling the data source, you can convert the output into whatever format you need.

For subsequent processing of the final result, see the blog post: http://www.cnblogs.com/liushaofeng89/p/4937714.html

If you found this post helpful, please remember to click "Recommend" at the bottom right!

When reprinting, please cite the source: http://www.cnblogs.com/liushaofeng89/p/4873086.html
