Jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods.
Recently I needed geographic data covering the whole country, from provinces and cities down to counties, towns, and streets. After searching Baidu and Google every which way, I could not find a complete data set. I finally found a relatively complete one, but its data only went down to the town level; there was no village-level data (after analyzing the data source I later understood why). On top of that, the blogger's data contained some redundancy. Being obsessive and a perfectionist, I decided I would have to crawl this data myself.
That blog post is rich in content, and its author implemented the crawler in PHP. As users of the language that topped the 2015 programming language rankings, we cannot be outdone, so below I will show you how to crawl the data we want from the web with Java...
Step one, preparation (data source + tools):
Data source (the most comprehensive and authoritative official data available to date): http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/
Crawling tool: http://jsoup.org/
Step two, data source analysis:
First of all, I will not explain how to use Jsoup here; if you are interested, you can look into it yourself.
As developers we should get to know the software tools around us, so that when we hit a problem during development we at least know where to start; I encourage everyone to keep an eye on such tools as preparation for a rainy day. Before doing this project I did not know how to use Jsoup, but I knew what it could be used for, so when I needed it I consulted the documentation and worked it out myself.
The data source above was published by the National Bureau of Statistics of the People's Republic of China in 2013, so its accuracy and authority are self-evident.
Next we analyze the structure of the data source, starting with the home page:
By analyzing the home page's source code, we can observe the following three points:
1. The entire page layout is controlled by table tags. If we want to select the hyperlinks with Jsoup, we must note that the table holding the provinces is not the only one on the page; there are multiple tables, so we cannot simply parse the data with:

Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
Elements rowProvince = connect.select("table");
2. Hyperlinks appear in many parts of the page. Perhaps the officials anticipated that programmers would need such data: the pages are clean, and apart from the record-filing number at the bottom, which is a redundant hyperlink, every other link can be crawled directly.
3. The pattern of the province data. Every table row that contains valid information has the class attribute provincetr. This attribute is important; as to why, read on. Each of these rows contains multiple td tags, each td tag contains a hyperlink, and that hyperlink is the one we want; its text is the name of the province (or municipality, etc.).
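To make point 3 concrete, here is a minimal sketch of pulling the province links out of the home page with Jsoup. The class name ProvincePreview is mine; the selector tr.provincetr comes straight from the analysis above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProvincePreview {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/").get();
        // Each row of valid data has class "provincetr"; each <a> inside it is one province.
        for (Element link : doc.select("tr.provincetr a")) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}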
Next we look at the general data pages (the pages that display the city-, county-, and town-level data):
The reason for treating these three pages together is that analysis shows all three levels' data pages are structured exactly the same. The only difference in the HTML source is the class attribute of the data rows in the table, which is citytr, countytr, or towntr respectively; everything else is identical. This lets us handle the data crawl for these three kinds of pages in one common way.
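Because only the row class differs between the three levels, a single parameterized method can cover them all. A minimal sketch under that assumption (LevelPagePreview and printRows are names of my own choosing):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LevelPagePreview {
    // cssClass is "citytr", "countytr", or "towntr", depending on the page being parsed.
    static void printRows(String pageUrl, String cssClass) throws Exception {
        Document doc = Jsoup.connect(pageUrl).get();
        for (Element row : doc.select("tr." + cssClass)) {
            // The first <td> holds the region code, the last <td> holds the name.
            System.out.println(row.select("td").first().text() + " " + row.select("td").last().text());
        }
    }
}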
Finally, let's look at the village-level data page:
At the village level the data format differs from the city, county, and town pages. This is the lowest level, so the rows contain no links, and we cannot crawl it the way we crawl the city, county, and town data. Here the class attribute of the table's data rows is villagetr. Apart from these two differences, each row contains three columns: the first is the region code, the second is the urban-rural classification code (the city, county, and town pages do not have this column), and the third is the name.
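A minimal sketch of reading those three columns from a village-level page (VillageRowReader is my own name; note these rows contain plain text rather than links):

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class VillageRowReader {
    // Assumes doc is an already-fetched village-level page.
    static void printVillages(Document doc) {
        for (Element row : doc.select("tr.villagetr")) {
            Elements tds = row.select("td");
            String code = tds.get(0).text();           // region code
            String classification = tds.get(1).text(); // urban-rural classification code
            String name = tds.get(2).text();           // name
            System.out.println(code + " " + classification + " " + name);
        }
    }
}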
Having grasped all the above points, we can start coding.
Step three, coding implementation:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Crawls the national province/city/county/town/village data.
 * @author liushaofeng
 */
public class JsoupTest {
    private static Map<Integer, String> cssMap = new HashMap<Integer, String>();
    private static BufferedWriter bufferedWriter = null;

    static {
        cssMap.put(1, "provincetr"); // province
        cssMap.put(2, "citytr");     // city
        cssMap.put(3, "countytr");   // county
        cssMap.put(4, "towntr");     // town
        cssMap.put(5, "villagetr");  // village
    }

    public static void main(String[] args) throws IOException {
        int level = 1;
        initFile();
        // Fetch the province-level information for the whole country.
        Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
        Elements rowProvince = connect.select("tr." + cssMap.get(level));
        for (Element provinceElement : rowProvince) { // each row of provinces
            for (Element province : provinceElement.select("a")) { // each province (e.g. Sichuan)
                parseNextLevel(province, level + 1);
            }
        }
        closeStream();
    }

    private static void initFile() {
        try {
            bufferedWriter = new BufferedWriter(new FileWriter(new File("D:\\cityinfo.txt"), true));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void closeStream() {
        if (bufferedWriter != null) {
            try {
                bufferedWriter.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
            bufferedWriter = null;
        }
    }

    private static void parseNextLevel(Element parentElement, int level) throws IOException {
        try {
            // Sleep between requests, or the server may return various error status codes.
            // (The interval was lost from the original listing; 500 ms is an assumption.)
            Thread.sleep(500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Document doc = connect(parentElement.attr("abs:href"));
        if (doc != null) {
            Elements rows = doc.select("tr." + cssMap.get(level)); // the data rows of the table
            for (Element element : rows) {
                printInfo(element, level + 1);
                // Village-level rows contain no <a> tag, which is how the recursion bottoms out.
                Elements select = element.select("a");
                if (select.size() != 0) {
                    parseNextLevel(select.last(), level + 1);
                }
            }
        }
    }

    /**
     * Writes one row of data to the data file, as name{level}[code].
     * @param element the crawled data element
     * @param level the administrative level
     */
    private static void printInfo(Element element, int level) {
        try {
            bufferedWriter.write(element.select("td").last().text() + "{" + level + "}["
                + element.select("td").first().text() + "]");
            bufferedWriter.newLine();
            bufferedWriter.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static Document connect(String url) {
        if (url == null || url.isEmpty()) {
            throw new IllegalArgumentException("The input url('" + url + "') is invalid!");
        }
        try {
            // The timeout value was lost from the original listing; 30 seconds is an assumption.
            return Jsoup.connect(url).timeout(30 * 1000).get();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
The data crawl is a long process, so just wait patiently. Because the program runs for a long time, avoid printing output to the console, as that can slow the program down.
The final data comes out in the format below (the "{}" holds the administrative level, the "[]" holds the region code):
Municipal District {3}[110100000000]
Dongcheng District {4}[110101000000]
Donghua Street Subdistrict Office {5}[110101001000]
Multi Fu Xiang Community Neighborhood Committee {6}[110101001001]
Silver Gate Community Neighborhood Committee {6}[110101001002]
East Factory Community Neighborhood Committee {6}[110101001005]
Zhi De Community Neighborhood Committee {6}[110101001006]
South Pool Community Neighborhood Committee {6}[110101001007]
Huang Tugang Community Neighborhood Committee {6}[110101001008]
Dengshikou Community Neighborhood Committee {6}[110101001009]
Justice Road Community Neighborhood Committee {6}[110101001010]
Ganyu Community Neighborhood Committee {6}[110101001011]
Plant Community Neighborhood Committee {6}[110101001013]
Shao Nine Community neighborhood Committee {6}[110101001014]
Wangfujing Community Neighborhood Committee {6}[110101001015]
Jingshan Subdistrict Office {5}[110101002000]
Longfu Temple Community Neighborhood Committee {6}[110101002001]
Auspicious Community Neighborhood Committee {6}[110101002002]
Yellow Gate Community Neighborhood Committee {6}[110101002003]
Zhong Gu Community Neighborhood Committee {6}[110101002004]
Wei Jia Community Neighborhood Committee {6}[110101002005]
Wang Sesame Community Neighborhood Committee {6}[110101002006]
Jingshan Street Community Neighborhood Committee {6}[110101002008]
Imperial Root North Street Community Neighborhood Committee {6}[110101002009]
Intersection subdistrict Office {5}[110101003000]
Eastern Community Neighborhood Committee {6}[110101003001]
Fuk Cheung Community Neighborhood Committee {6}[110101003002]
Daxing Community Neighborhood Committee {6}[110101003003]
Fu Study Community Neighborhood Committee {6}[110101003005]
Gulou Court Community Neighborhood Committee {6}[110101003007]
Ju ER Community Neighborhood Committee {6}[110101003008]
South Gongs and Drums Alley Community Neighborhood Committee {6}[110101003009]
Anding Door Subdistrict Office {5}[110101004000]
North Headlines Community Neighborhood Committee {6}[110101004001]
North Percussion Alley Community Neighborhood Committee {6}[110101004002]
Guozijian Community Neighborhood Committee {6}[110101004003]
......
Once you have this data, what you do with it is up to you. The code above can be run as-is to crawl the data source, and you can then convert the output into whatever format you need.
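For instance, if you want to load the output file back into structured records, here is a minimal sketch of parsing the name{level}[code] lines; the regex and the class name CityInfoParser are my own:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CityInfoParser {
    public static void main(String[] args) throws Exception {
        // Matches lines like "Dongcheng District {4}[110101000000]".
        Pattern pattern = Pattern.compile("(.+)\\{(\\d+)\\}\\[(\\d+)\\]");
        try (BufferedReader reader = new BufferedReader(new FileReader("D:\\cityinfo.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = pattern.matcher(line);
                if (m.matches()) {
                    System.out.println("name=" + m.group(1).trim()
                        + ", level=" + m.group(2) + ", code=" + m.group(3));
                }
            }
        }
    }
}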