Straight to the point: the download links first.
1, http://blog.jjonline.cn/soft/J_Position/ajing.sql.gz
A phpMyAdmin export of the MySQL database, gzip-compressed. The database is named Ajing and contains 6 tables. The table without a suffix holds the original data: each row is a village, with every level from province down to village. The other 5 tables, with suffixes, are linked through their respective administrative codes. For example, Hubei Province has ID 420 (really 42; province codes are stored as 3 digits in the database, so the trailing 0 is redundant), Yichang has ID 4205 (padded with 8 zeros it becomes 420500000000), Dangyang (my hometown, a county-level city) has ID 420582 (padded with 6 zeros it becomes 420582000000), and so on.
Size: 17164601 bytes (16.3M)
Modification Date: July 16, 2014, 13:02:02
md5:a170d11e82a2532ce29574c46739b9ca
sha1:6d0fe378e2d6ab007e5f5977b4039e9e28de5431
crc32:2aabf023
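The zero-padding described for this first file can be sketched in a few lines of Python. This is only an illustration, not something shipped in the archives: the function name to_full_code is made up here, and it assumes a full code is always 12 digits, as in the 420500000000 and 420582000000 examples above.

```python
def to_full_code(short_code: str) -> str:
    """Right-pad a shortened administrative code with zeros to 12 digits.

    Hypothetical helper: the 12-digit width is taken from the examples
    in the article (420500000000, 420582000000).
    """
    if not short_code.isdigit() or len(short_code) > 12:
        raise ValueError(f"not a valid code prefix: {short_code!r}")
    return short_code.ljust(12, "0")


print(to_full_code("4205"))    # Yichang  -> 420500000000
print(to_full_code("420582"))  # Dangyang -> 420582000000
```

The 3-digit province ID 420 pads to the same value as the real 2-digit code 42, which is why the trailing 0 in the database is harmless.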
2, http://blog.jjonline.cn/soft/J_Position/ajing_position.7z
This is the same phpMyAdmin-exported SQL file as in the first download, compressed to 7z format with the 360 compression tool; the uncompressed SQL file is a little over 170 MB, which is too big to share as-is.
Size: 10363487 bytes (9.88M)
Modification Date: July 16, 2014, 13:09:03
md5:2a3916a6617f7507fadb98e34341f59e
sha1:517f07dc7221bae0da5857bb77941e50388b4ce0
crc32:c4ff8237
3, http://blog.jjonline.cn/soft/J_Position/j_position.7z
This file contains only the table without a suffix from the MySQL database described under the first file.
Size: 6206567 bytes (5.91M)
Modification Date: July 15, 2014, 23:08:57
md5:ec7f7f500e7888fb36639fd76a598337
sha1:e0ee991f7b2ae8b1ea96ddbf49badf7b6434b853
crc32:75d1a75a
4, http://blog.jjonline.cn/soft/J_Position/positionJson.7z
This file contains the JSON data generated after parsing the web pages: text files with a .json extension, saved separately for the city, county, town, and village levels. At every level the files are named by province code; for Hubei, for example, the cities are saved in ./positionjson/city/420.json, the counties in ./positionjson/county/420.json, the towns in ./positionjson/town/420.json, the villages in ./positionjson/village/420.json, and so on.
These JSON files are there so that you can read them and insert the data into a database in your own format.
Size: 5384282 bytes (5.13M)
Modification Date: July 16, 2014, 13:12:22
md5:d862a925839f1358984607a63e79c701
sha1:e47dd7c7bc2a815e15664d7529c7dc2268b8cb36
crc32:edb0b0ac
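The md5, sha1, and crc32 values listed for each archive are there so a download can be checked. A minimal sketch of such a check in Python, using only the standard library; the function names file_checksums and matches are invented for this example and are not part of the release.

```python
import hashlib
import zlib


def file_checksums(path, chunk_size=1 << 20):
    """Return md5, sha1 and crc32 of a file as lowercase hex strings."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    with open(path, "rb") as fh:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
    }


def matches(path, expected):
    """Compare the file against a dict of published values (case-insensitive)."""
    actual = file_checksums(path)
    return all(actual[name] == value.lower() for name, value in expected.items())
```

For the first archive, for example, one would call matches("ajing.sql.gz", {"md5": "a170d11e82a2532ce29574c46739b9ca", "crc32": "2aabf023"}) after downloading it; a False result means the file is damaged or is not the one described here.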
====
I still remember that two years ago, when I did not yet know PHP very well, I was the one looking for shared data like this. What I found online was either very old or incomplete. So I posted messages in every group I could. The idea was simple: there had to be e-commerce programmers in those groups, and for them pulling out a national list of provinces, cities, and counties should be easy. I waited a day without any response. Either people thought the request too simple to bother with a rookie, or everyone was busy with their own work and had no time for a small message.
This experience affected me deeply. Knowledge on the web is full of errors, gaps, and outdated material, and articles get copied back and forth, so finding what you need often costs extra time searching and filtering. The most annoying part is seeing a wrong or incomplete article reposted everywhere, so that dozens of pages of search results for a keyword all show the same thing ...
Well, that was off-topic; back to the point. The source of the province, city, county, town, and village data shared in this article is: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/index.html The website needs no introduction; it is currently the most complete and authoritative data source.
The data source is in HTML format, so the pages have to be crawled and the data and its hierarchy parsed out of them. At first I wrote the program to parse the data while crawling, without taking into account that the total number of HTML files is more than 40,000. Only after the program had run for 12 hours did I find that some data had been missed (the data source server occasionally acts up and returns a 404 error, and the program did not handle this case). Once I found the problem I changed strategy: first download every page of the data source, then read the local HTML to generate the JSON files. After the download, I wrote a program that checks the files as it reads them, to make sure every HTML page of the data source had been downloaded successfully. Downloading and checking integrity took 16 hours, fully automated. That is a bit long, but with more than 40,000 HTML pages, even at 1 second per page it takes 40,000 seconds (about 11 hours), so 16 hours including the integrity check is reasonable!
(Folder properties of all the downloaded HTML files)
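The "download everything first, then check" strategy described above can be sketched roughly as follows. This is not the program I actually ran; it is a small Python illustration, and the names fetch_url, local_name, mirror_pages, and missing_pages, the retry count, and the file naming scheme are all assumptions made for the example.

```python
import os
import time
import urllib.request
from urllib.parse import urlsplit


def fetch_url(url, timeout=30):
    """Download one page; raises an OSError subclass on 404 and network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def local_name(url):
    """Map a URL to a flat local file name (hypothetical naming scheme)."""
    path = urlsplit(url).path.strip("/")
    return path.replace("/", "_") or "index.html"


def mirror_pages(urls, dest_dir, fetch=fetch_url, retries=3, delay=1.0):
    """Save every page locally, retrying failures; return URLs that still failed."""
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for url in urls:
        path = os.path.join(dest_dir, local_name(url))
        if os.path.exists(path) and os.path.getsize(path) > 0:
            continue  # already downloaded on an earlier run
        for _ in range(retries):
            try:
                data = fetch(url)
            except OSError:
                # HTTPError (e.g. an occasional 404) and URLError are OSErrors.
                time.sleep(delay)
                continue
            with open(path, "wb") as fh:
                fh.write(data)
            break
        else:
            failed.append(url)  # every attempt failed
    return failed


def missing_pages(urls, dest_dir):
    """Integrity check: list URLs that have no non-empty local file."""
    return [
        url for url in urls
        if not os.path.isfile(os.path.join(dest_dir, local_name(url)))
        or os.path.getsize(os.path.join(dest_dir, local_name(url))) == 0
    ]
```

Because pages that already exist locally are skipped, the download can simply be run again until missing_pages returns an empty list, and only then does the parsing step start.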
Next came regular-expression matching on each page (and turning the relationships between pages into MySQL fields), which produced all the JSON files in the fourth archive above. The JSON files were then read and inserted into MySQL, finally yielding a complete province, city, county, town, and village database for other applications to use. In this process, generating the JSON was the more time-consuming part: about 30 minutes per province on average (provinces and municipalities are handled at the same level), so the 31 provinces (and municipalities) took most of a day. Reading the JSON into the database was faster: inserting 1000 rows at a time, it was done in 2 hours.
(Resource usage of CentOS in the local virtual machine while the JSON was being read and inserted into MySQL)
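Inserting 1000 rows at a time, as mentioned above, looks roughly like this. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server; the table name place, its two columns, and the shape of the rows are assumptions for illustration and do not describe the real tables in the Ajing database.

```python
import sqlite3
from itertools import islice


def batches(rows, size=1000):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(conn, rows, size=1000):
    """Insert (code, name) rows in batches, committing once per batch."""
    total = 0
    for chunk in batches(rows, size):
        conn.executemany("INSERT INTO place (code, name) VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total


# Demo with made-up rows; real rows would come from the JSON files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE place (code TEXT PRIMARY KEY, name TEXT)")
demo_rows = ((str(420000000000 + i), f"name {i}") for i in range(2500))
print(insert_in_batches(conn, demo_rows))  # prints 2500
```

Batching keeps memory use bounded, which matters on a small virtual machine, and one commit per batch is much cheaper than one commit per row.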
Someone asked: "Why not just parse the HTML and insert the result directly into MySQL, instead of generating JSON and then reading it back in?" In fact, this is for the convenience of other programmers; the JSON here can be regarded simply as another way of storing the data.
Looking at the second figure in this article, the 512M of memory in the small CentOS virtual machine has been eaten up, and the swap is nearly exhausted as well.
MySQL database and JSON format data for all provinces, municipalities, counties, towns and villages in the country