Python crawler: an example of collecting city bus and subway station and line data
Urban bus and subway data reflect a city's public transportation and can be used to mine its traffic structure, plan road networks, and select bus stop locations. However, such data is usually held by specific departments and is hard to obtain. Online maps, on the other hand, contain a large amount of public transportation and subway information, and the way they return that data can be parsed and collected with a Python crawler. The following describes how to use a Python crawler to collect city bus and subway stations and lines.
First, crawl the names of all bus and subway lines in the study city, in the form "XX Road" or "Metro Line X". Websites such as Tuba Transit, the bus network 8684, and Bendibao list bus line names indexed by number or letter, and a simple Python crawler can collect them; a minimal sketch is shown below. For more detail, see the blogger WenWu_Both, who described how to crawl the data of all bus stops in a city from 8684. That crawl collected detailed stop information, but it lacked the coordinates of the bus stops and lines. This is maddening: without spatial coordinates there is no mapping and no analysis, so this article focuses on obtaining stop coordinates and line geometry.
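For illustration, here is a minimal sketch of collecting line names from 8684-style index pages. The URL pattern and the CSS selector are assumptions about that site's layout and will likely need adjusting against the actual pages:

import requests
from bs4 import BeautifulSoup

def fetch_line_names(city="xian"):
    """Collect bus line names from 8684-style index pages (hypothetical URL/markup)."""
    names = []
    for idx in list("123456789"):
        url = "https://{}.8684.cn/list{}".format(city, idx)  # assumed URL pattern: lines indexed by first digit
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select(".list a"):  # assumed markup: one <a> per line name in the list container
            names.append(a.get_text(strip=True))
    return names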
Take Tuba Transit as an example. After clicking a bus line, its detailed stop information and a map are displayed. The blogger got excited, feeling success was near, but packet capture turned up nothing that could be parsed. Limited by the blogger's own skill, readers who know how to capture the stop and line coordinates there are kindly asked to advise. It was pure desperation: the fat meat was at the lips but could not be eaten.
With no luck on those sites, the next idea was the online maps themselves: find the backend address the map calls and request it directly. Readers familiar with front-end development can try this on their own; the blogger's front-end skill stops at hello world. Still, it is a workable line of thinking, and practice proved it feasible.
So the map API it is. How do you find it by capturing packets? Open the map homepage, enter the name of a bus line in the target city, and capture the requests to locate the station and line information. As the captured traffic shows, the busline_list field details the stations and the line; it contains two entries, which are the two directions of the same bus line and differ slightly, so take note. With the entry point found, the crawler can get to work.
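As an illustration of the response shape, the fields consumed by the parsing code further below look roughly like this. The field names are taken from that code; the values are made-up placeholders and other fields are omitted:

# Sketch of one element of jsonData["data"]["busline_list"] (placeholder values)
busline = {
    "key_name": "...",        # lookup key of the line
    "name": "...",            # line name for one direction of the route
    "front_name": "...",      # origin stop
    "terminal_name": "...",   # terminus stop
    "xs": "108.1,108.2",      # comma-separated longitudes of the line geometry
    "ys": "34.1,34.2",        # comma-separated latitudes of the line geometry
    "stations": [
        {"station_id": "...", "name": "...", "xy_coords": "108.1;34.1"},
    ],
}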
The main crawling code, which is also quite simple, follows. The main function works like this: first construct the request parameters, including the line name, city code, geographic range, and zoom level; the geographic range can be obtained from a coordinate picker. URL-encode the parameters, send the request, and check whether the returned data is usable (note: a line may be out of service or nonexistent, and requests sent too fast may also trip the anti-crawler mechanism and require manual verification, which the blogger ran into while crawling, hence the random sleep added later). Then parse the JSON: extratStations and extractLine extract the required fields, and stations and lines are stored separately.
import json
import random
import urllib.parse
import urllib.request
from time import sleep

import pandas as pd


def main():
    df = pd.read_excel(".xlsx")  # spreadsheet of line names collected earlier (path elided in the original)
    BaseUrl = ("https://ditu.amap.com/service/poiInfo?query_type=TQUERY"
               "&pagesize=20&pagenum=1&qii=true&cluster_state=5&need_utd=true"
               "&utd_sceneid=1000&div=PC1000&signature=true&is_classify=true&")
    for bus in df[u"line"]:  # column of line names
        params = {
            'keywords': bus,   # line name as the search keyword
            'zoom': '11',
            'city': '000000',  # city code (anonymized in the original)
            'geoobj': '107.623|33.696|109.817|34.745'  # bounding box from a coordinate picker
        }
        print(bus)
        paramMerge = urllib.parse.urlencode(params)
        # print(paramMerge)
        targetUrl = BaseUrl + paramMerge
        stationFile = "./busStation/" + bus + ".csv"
        lineFile = "./busLine/" + bus + ".csv"
        req = urllib.request.Request(targetUrl)
        res = urllib.request.urlopen(req)
        content = res.read()
        jsonData = json.loads(content)
        if jsonData["data"]["message"] and jsonData["data"]["busline_list"]:
            busList = jsonData["data"]["busline_list"]  # busline list
            busListSlt = busList[0]  # busList holds the two directions of the same bus line; crawl one of them
            busStations = extratStations(busListSlt)
            busLine = extractLine(busListSlt)
            writeStation(busStations, stationFile)
            writeLine(busLine, lineFile)
            # random sleep against the anti-crawler checks (second randint's arguments assumed)
            sleep(random.random() * random.randint(0, 7) + random.randint(0, 5))
        else:
            continue
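writeStation and writeLine are not listed in the post. A minimal sketch of what they might look like, writing each record list to a CSV file with the column order produced by the parsers below:

import csv

def writeStation(stations, path):
    """Write station records ([id, line, stop, gcj_x, gcj_y, wgs_x, wgs_y]) to CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(stations)

def writeLine(line, path):
    """Write line vertex records to CSV, one point per row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(line)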
The blogger's parsing functions are attached below:
def extratStations(busListSlt):
    busName = busListSlt["name"]
    stationSet = []
    stations = busListSlt["stations"]
    for bs in stations:
        tmp = []
        tmp.append(bs["station_id"])
        tmp.append(busName)
        tmp.append(bs["name"])
        cor = bs["xy_coords"].split(";")  # "lng;lat" in GCJ-02
        tmp.append(cor[0])
        tmp.append(cor[1])
        wgs84cor1 = gcj02towgs84(float(cor[0]), float(cor[1]))
        tmp.append(wgs84cor1[0])
        tmp.append(wgs84cor1[1])
        stationSet.append(tmp)
    return stationSet


def extractLine(busListSlt):
    # busList contains the two directions of the line; keep the naming fields
    keyName = busListSlt["key_name"]
    busName = busListSlt["name"]
    fromName = busListSlt["front_name"]
    toName = busListSlt["terminal_name"]
    lineSet = []
    Xstr = busListSlt["xs"]
    Ystr = busListSlt["ys"]
    Xset = Xstr.split(",")
    Yset = Ystr.split(",")
    length = len(Xset)
    for i in range(length):
        tmp = []
        tmp.append(keyName)
        tmp.append(busName)
        tmp.append(fromName)
        tmp.append(toName)
        tmp.append(Xset[i])
        tmp.append(Yset[i])
        wgs84cor2 = gcj02towgs84(float(Xset[i]), float(Yset[i]))
        tmp.append(wgs84cor2[0])
        tmp.append(wgs84cor2[1])
        lineSet.append(tmp)
    return lineSet
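The gcj02towgs84 called above converts Amap's GCJ-02 coordinates to WGS-84 and is not listed in the post. A common approximate implementation, applying the standard offset formulas in reverse, looks like this:

import math

a = 6378245.0                 # semi-major axis used by the GCJ-02 transform
ee = 0.00669342162296594323   # eccentricity squared

def _transform_lat(x, y):
    ret = (-100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y
           + 0.2 * math.sqrt(abs(x)))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return ret

def _transform_lng(x, y):
    ret = (300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y
           + 0.1 * math.sqrt(abs(x)))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return ret

def gcj02towgs84(lng, lat):
    """Approximately convert GCJ-02 (lng, lat) to WGS-84; returns [lng, lat]."""
    dlat = _transform_lat(lng - 105.0, lat - 35.0)
    dlng = _transform_lng(lng - 105.0, lat - 35.0)
    radlat = lat / 180.0 * math.pi
    magic = 1 - ee * math.sin(radlat) ** 2
    sqrtmagic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((a * (1 - ee)) / (magic * sqrtmagic) * math.pi)
    dlng = (dlng * 180.0) / (a / sqrtmagic * math.cos(radlat) * math.pi)
    # GCJ-02 = WGS-84 + offset, so subtract the offset estimated at the GCJ-02 point
    return [lng * 2 - (lng + dlng), lat * 2 - (lat + dlat)]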
The crawler collects raw data like the following:
The following figure shows the processed data for one bus line's stops and geometry. Because different map vendors use different coordinate systems, the coordinates deviate to varying degrees and need correction. In a follow-up article, the blogger will explain in detail how to batch-correct and normalize the coordinates of these stops and lines.