The first two articles (http://www.php.cn/code/11829.html) covered why I scraped Mobike's interface and the results of the data analysis. This article releases the source code directly, for learning purposes.
Disclaimer:
This crawler is for learning and research purposes only; please do not use it for any illegal purpose. Any legal dispute arising from misuse is solely the user's responsibility.
If you don't have the patience to read the whole article, go straight to the source:

git clone https://github.com/derekhe/mobike-crawler
python3 crawler.py
Please don't forget to give it a star!
Directory structure
\analysis - Jupyter notebooks for the data analysis
\influx-importer - imports the data into InfluxDB (not fully finished)
\modules - proxy module
\web - real-time visualization module, written mainly to learn React; see here for the result
crawler.py - core crawler code
importToDb.py - imports the data into PostgreSQL for analysis
sql.sql - SQL for creating the tables
start.sh - script that keeps the crawler running continuously
Approach
The core code lives in crawler.py. The data is first stored in a SQLite3 database, then deduplicated and exported to CSV files to save space.
Mobike's API returns the bikes within a square area, so by shifting that square step by step I can crawl the data for the entire region.
left, top, right and bottom define the crawl boundary; currently it is the square area of Chengdu from the high-tech zone south to Nanhu Lake. offset defines the crawl step: at the current 0.002, a full pass completes within 15 minutes on a $5 DigitalOcean server.
def start(self):
    left = 30.7828453209
    top = 103.9213455517
    right = 30.4781772402
    bottom = 104.2178123382
    offset = 0.002

    if os.path.isfile(self.db_name):
        os.remove(self.db_name)

    try:
        with sqlite3.connect(self.db_name) as c:
            c.execute('''CREATE TABLE mobike
                (Time DATETIME, bikeIds VARCHAR(12), bikeType TINYINT,
                 distId INTEGER, distNum TINYINT, type TINYINT,
                 x DOUBLE, y DOUBLE)''')
    except Exception as ex:
        pass
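As a quick sanity check on the workload behind that 15-minute claim, here is a rough estimate, using the bounds and step from the code above, of how many grid cells (and therefore API requests) one full pass covers:

```python
# Rough workload estimate for one full pass over the crawl area,
# using the bounds and step from the start() code above.
left, right = 30.7828453209, 30.4781772402    # latitude range (north to south)
top, bottom = 103.9213455517, 104.2178123382  # longitude range (west to east)
offset = 0.002

lat_steps = int(abs(right - left) / offset) + 1
lon_steps = int(abs(bottom - top) / offset) + 1
requests_per_pass = lat_steps * lon_steps

print(lat_steps, lon_steps, requests_per_pass)  # 153 149 22797
```

That is on the order of twenty thousand requests per pass, which is why the 250-thread pool and the proxy pool described later matter.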
Then 250 threads are started. If you want to ask why I didn't use processes: well, I hadn't learned them yet ~~~ they would actually work too, and might even be more efficient.
The group_data call at the end deduplicates the crawled data, eliminating the overlap between adjacent small squares.
executor = ThreadPoolExecutor(max_workers=250)
print("Start")
self.total = 0

lat_range = np.arange(left, right, -offset)
for lat in lat_range:
    lon_range = np.arange(top, bottom, offset)
    for lon in lon_range:
        self.total += 1
        executor.submit(self.get_nearby_bikes, (lat, lon))

executor.shutdown()
self.group_data()
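The article does not show group_data itself. A minimal sketch of what such a deduplication step could look like, assuming a simplified table layout (the repo's actual column set is wider), is:

```python
import csv
import sqlite3

# Hypothetical sketch of a group_data-style step: SELECT DISTINCT removes the
# duplicate rows produced by overlapping squares, and the result goes to CSV.
with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE mobike (bikeId TEXT, x DOUBLE, y DOUBLE)")
    conn.executemany(
        "INSERT INTO mobike VALUES (?, ?, ?)",
        [("b1", 104.061, 30.672),
         ("b1", 104.061, 30.672),  # same bike seen from two overlapping squares
         ("b2", 104.073, 30.665)])
    rows = conn.execute("SELECT DISTINCT bikeId, x, y FROM mobike").fetchall()

with open("mobike.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

print(len(rows))  # 2 unique bikes
```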
The core API code is here. It is the WeChat Mini Program's API endpoint: fill in a few variables and that's it, very simple.
def get_nearby_bikes(self, args):
    try:
        url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"
        payload = "latitude=%s&longitude=%s&errMsg=getMapCenterLocation" % (args[0], args[1])

        headers = {
            'charset': "utf-8",
            'platform': "4",
            'referer': "https://servicewechat.com/wx40f112341ae33edb/1/",
            'content-type': "application/x-www-form-urlencoded",
            'user-agent': "MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN",
            'host': "mwx.mobike.com",
            'connection': "Keep-Alive",
            'accept-encoding': "gzip",
            'cache-control': "no-cache"
        }

        self.request(headers, payload, args, url)
    except Exception as ex:
        print(ex)
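To illustrate the response shape the crawler parses, here is a small sketch; the field names are inferred from the INSERT statement in the request() code, and the sample values are made up rather than captured from the live API:

```python
import json

# Made-up sample in the shape the crawler's insert code expects; field names
# are inferred from the INSERT statement, not taken from a real response.
sample = """{
  "object": [
    {"bikeIds": "8501234567", "biketype": 0, "distId": 1, "distNum": 1,
     "type": 0, "distX": 104.0612, "distY": 30.6721}
  ]
}"""

bikes = json.loads(sample)["object"]
for b in bikes:
    print(b["bikeIds"], b["distX"], b["distY"])

print(len(bikes))  # 1
```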
Finally, you may ask: won't frequent crawling get the IP blocked? Mobike does rate-limit by IP, but the workaround is simple: use a large number of proxies.
I have a proxy pool that collects roughly 8000+ proxies per day. ProxyProvider fetches this pool directly and exposes a pick() function that randomly selects a proxy from the top 50 by score. Note that my proxy pool is updated hourly, and the JsonBlob proxy list referenced in the code is only a sample; most of those proxies will be dead after a while.
There is a proxy-scoring mechanism here. Instead of picking a proxy purely at random, I sort the proxies by score: each successful request raises a proxy's score, and each failed request lowers it. This way the fastest, most reliable proxies are quickly favored. The scores could also be saved and reloaded on the next run if needed.
class ProxyProvider:
    def __init__(self, min_proxies=200):
        self._bad_proxies = {}
        self._minProxies = min_proxies
        self.lock = threading.RLock()
        self.get_list()

    def get_list(self):
        logger.debug("Getting proxy list")
        r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b",
                         timeout=10)
        proxies = ujson.decode(r.text)
        logger.debug("Got %s proxies", len(proxies))
        self._proxies = list(map(lambda p: Proxy(p), proxies))

    def pick(self):
        with self.lock:
            self._proxies.sort(key=lambda p: p.score, reverse=True)
            proxy_len = len(self._proxies)
            # the constants here were lost in the original formatting;
            # 50 matches the "top 50" described in the text
            max_range = 50 if proxy_len > 50 else proxy_len
            proxy = self._proxies[random.randrange(1, max_range)]
            proxy.used()
            return proxy
In actual use, select a proxy through ProxyProvider.pick() and then use it. If anything goes wrong with the proxy, call proxy.fatal_error() to lower its score so that it won't be selected again later.
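The Proxy class itself is not shown in this article (it presumably lives in the repo's modules directory). A minimal sketch consistent with the scoring mechanism described above, where success(), error() and the -1000 penalty are assumptions, might look like:

```python
class Proxy:
    """Hypothetical sketch of a scored proxy; not the repo's actual class."""

    def __init__(self, url):
        self.url = url
        self.score = 0

    def used(self):
        # called by ProxyProvider.pick(); could track last-used time
        pass

    def success(self):
        self.score += 1     # reward each successful request

    def error(self):
        self.score -= 1     # penalize an ordinary failure

    def fatal_error(self):
        self.score = -1000  # push the proxy to the bottom of the ranking

p = Proxy("http://1.2.3.4:8080")
p.success(); p.success(); p.error()
print(p.score)  # 1
p.fatal_error()
print(p.score)  # -1000
```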
def request(self, headers, payload, args, url):
    while True:
        proxy = self.proxyProvider.pick()
        try:
            response = requests.request("POST", url, data=payload, headers=headers,
                                        proxies={"https": proxy.url},
                                        timeout=5, verify=False)
            with self.lock:
                with sqlite3.connect(self.db_name) as c:
                    try:
                        print(response.text)
                        decoded = ujson.decode(response.text)['object']
                        self.done += 1
                        for x in decoded:
                            c.execute("INSERT INTO mobike VALUES (%d,'%s',%d,%d,%s,%s,%f,%f)" % (
                                int(time.time()) * 1000,  # multiplier lost in the original formatting; ms assumed
                                x['bikeIds'], int(x['biketype']), int(x['distId']),
                                x['distNum'], x['type'], x['distX'], x['distY']))

                        # progress stats: percent done, crawl rate, and estimated
                        # total/remaining time (the multipliers 100 and 60 were
                        # lost in the original formatting and are assumed)
                        timespend = datetime.datetime.now() - self.start_time
                        percent = self.done / self.total
                        total = timespend / percent
                        print(args, self.done, percent * 100,
                              self.done / timespend.total_seconds() * 60,
                              total, total - timespend)
                    except Exception as ex:
                        print(ex)
            break
        except Exception as ex:
            proxy.fatal_error()
OK, that's basically it ~~~ feel free to study the rest of the code on your own ~~~