The first two articles analyzed why I crawled the Mobike API and what the data showed. This article gets straight to the point and provides directly runnable source code for study.
Disclaimer:
This crawler is for learning and research purposes only. Do not use it for illegal purposes; you are solely responsible for any legal disputes that arise.
If you have no patience to read the article, go straight to the code:

```shell
git clone https://github.com/derekhe/mobike-crawler
python3 crawler.py
```
Don't forget to give it a star!
Directory structure
\analysis - Jupyter notebooks for data analysis
\influx-importer - import into InfluxDB
\modules - proxy module
\web - real-time graphical display module, which I wrote mainly to learn React; see here for the effect.
crawler.py - core crawler code
importToDb.py - import into the database for analysis
sql.sql - SQL statements used to create the table
start.sh - script for continuous running
Ideas
The core code lives in crawler.py. The data is first stored in a sqlite3 database, then de-duplicated and exported to a CSV file to save space.
The Mobike API returns the bicycles within a square area, so by sliding that square step by step across the map I can capture data for the whole region.
left, top, right, and bottom define the capture range; currently it is a square area covering Chengdu inside the ring expressway and around South Lake. offset defines the capture step. At 0.002, a full sweep can be completed within 15 minutes on a cheap DigitalOcean server.
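Given those bounds and the 0.002 step, the request volume per sweep is easy to estimate: np.arange(start, stop, step) yields ceil((stop - start) / step) points per axis, and the crawler issues one request per (lat, lon) grid cell. A quick check (plain math instead of numpy, same count):

```python
import math

# capture bounds and step, copied from start() in crawler.py
left, top = 30.7828453209, 103.9213455517
right, bottom = 30.4781772402, 104.2178123382
offset = 0.002

# latitude decreases from left to right, longitude increases from top to bottom,
# mirroring np.arange(left, right, -offset) and np.arange(top, bottom, offset)
lat_steps = math.ceil((right - left) / -offset)   # grid rows
lon_steps = math.ceil((bottom - top) / offset)    # grid columns
total_requests = lat_steps * lon_steps
print(lat_steps, lon_steps, total_requests)
```

So each full sweep is on the order of 23,000 requests, which is why the thread pool and proxy pool below matter.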
```python
def start(self):
    left = 30.7828453209
    top = 103.9213455517
    right = 30.4781772402
    bottom = 104.2178123382
    offset = 0.002

    if os.path.isfile(self.db_name):
        os.remove(self.db_name)

    try:
        with sqlite3.connect(self.db_name) as c:
            c.execute('''CREATE TABLE mobike
                (Time DATETIME, bikeIds VARCHAR(12), bikeType TINYINT,
                 distId INTEGER, distNum TINYINT, type TINYINT,
                 x DOUBLE, y DOUBLE)''')
    except Exception as ex:
        pass
```
Then 250 threads are started. As for why I didn't use coroutines: I simply hadn't learned them at the time. They would actually work too, and might even be more efficient.
Since the captured data needs de-duplication to eliminate overlap between adjacent square areas, the group_data call at the end takes care of this.
```python
executor = ThreadPoolExecutor(max_workers=250)
print("Start")
self.total = 0
lat_range = np.arange(left, right, -offset)
for lat in lat_range:
    lon_range = np.arange(top, bottom, offset)
    for lon in lon_range:
        self.total += 1
        executor.submit(self.get_nearby_bikes, (lat, lon))

executor.shutdown()
self.group_data()
```
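The group_data implementation itself is in the repository; as a rough sketch of what such a de-duplication step could look like (the function name and CSV layout here are my own, not from the source):

```python
import csv
import sqlite3

def group_data_sketch(db_name, csv_name):
    # Adjacent squares overlap, so the same bike can be inserted
    # several times per sweep; SELECT DISTINCT collapses duplicates.
    with sqlite3.connect(db_name) as c:
        rows = c.execute("SELECT DISTINCT * FROM mobike").fetchall()
    # Export the unique rows to CSV, which is cheaper to keep around
    # than the sqlite database itself.
    with open(csv_name, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Anything beyond exact-duplicate removal (e.g. grouping by bike ID per time slice) would need the real code from the repo.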
The core API call is below. It is the WeChat mini program's API endpoint; just fill in a few parameters and it works, very simple.
```python
def get_nearby_bikes(self, args):
    try:
        url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"
        payload = "latitude=%s&longitude=%s&errMsg=getMapCenterLocation" % (args[0], args[1])
        headers = {
            'charset': "utf-8",
            'platform': "4",
            'referer': "https://servicewechat.com/wx40f112341ae33edb/1/",
            'content-type': "application/x-www-form-urlencoded",
            'user-agent': "MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN",
            'host': "mwx.mobike.com",
            'connection': "Keep-Alive",
            'accept-encoding': "gzip",
            'cache-control': "no-cache"
        }

        self.request(headers, payload, args, url)
    except Exception as ex:
        print(ex)
```
Finally, you may ask: doesn't the crawler's IP get blocked all the time? Mobike does throttle by IP, but the workaround is simple: use a large number of proxies.
I have a proxy pool with more than 8,000 proxies per day. ProxyProvider fetches this pool directly and provides a pick() function to randomly select one of the top 50 proxies. Note that my proxy pool is updated hourly; the jsonblob proxy list referenced in the code is only an example, and most of its entries will have expired after a while.
A proxy scoring mechanism is used here. Instead of choosing a proxy at random, I sort proxies by score: each successful request earns points, and failures deduct them. This way, the fastest and most reliable proxies surface within a short time. If needed, the scores can be saved and reused later.
```python
class ProxyProvider:
    def __init__(self, min_proxies=200):
        self._bad_proxies = {}
        self._minProxies = min_proxies
        self.lock = threading.RLock()
        self.get_list()

    def get_list(self):
        logger.debug("Getting proxy list")
        r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b",
                         timeout=10)
        proxies = ujson.decode(r.text)
        logger.debug("Got %s proxies", len(proxies))
        self._proxies = list(map(lambda p: Proxy(p), proxies))

    def pick(self):
        with self.lock:
            self._proxies.sort(key=lambda p: p.score, reverse=True)
            proxy_len = len(self._proxies)
            max_range = 50 if proxy_len > 50 else proxy_len
            proxy = self._proxies[random.randrange(1, max_range)]
            proxy.used()
            return proxy
```
In actual use, call proxyProvider.pick() to select a proxy, then use it. If the proxy has any problems, call proxy.fatal_error() to lower its score so that it won't be selected again.
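The Proxy class itself is not shown above; here is a minimal sketch consistent with how pick() and request() use it. Only fatal_error(), used(), and the score-based sort are named in the article; the exact point values and the success() method are my own assumptions:

```python
class Proxy:
    """Sketch of a scored proxy wrapper: ProxyProvider.pick() sorts
    by .score, and callers adjust the score after each request."""

    def __init__(self, url):
        self.url = url
        self.score = 0

    def used(self):
        # bookkeeping hook called by pick(); a real version might
        # record a last-used timestamp to spread the load
        pass

    def success(self):
        self.score += 1      # reward a completed request (assumed value)

    def fatal_error(self):
        self.score -= 100    # sink broken proxies to the bottom of the sort
```

With large penalties for fatal errors and small rewards for successes, a dead proxy quickly drops out of the top-50 window that pick() samples from.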
```python
def request(self, headers, payload, args, url):
    while True:
        proxy = self.proxyProvider.pick()
        try:
            response = requests.request(
                "POST", url, data=payload, headers=headers,
                proxies={"https": proxy.url},
                timeout=5, verify=False
            )

            with self.lock:
                with sqlite3.connect(self.db_name) as c:
                    try:
                        print(response.text)
                        decoded = ujson.decode(response.text)['object']
                        self.done += 1
                        for x in decoded:
                            c.execute("INSERT INTO mobike VALUES (%d,'%s',%d,%d,%s,%s,%f,%f)" % (
                                int(time.time()) * 1000, x['bikeIds'], int(x['biketype']),
                                int(x['distId']), x['distNum'], x['type'],
                                x['distX'], x['distY']))

                        timespend = datetime.datetime.now() - self.start_time
                        percent = self.done / self.total
                        total = timespend / percent
                        print(args, self.done, percent * 100,
                              self.done / timespend.total_seconds() * 60,
                              total, total - timespend)
                    except Exception as ex:
                        print(ex)
            break
        except Exception as ex:
            proxy.fatal_error()
```
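The progress printout at the end of request() is simple proportional ETA math: the elapsed time divided by the fraction of squares done projects the total sweep duration, and subtracting the elapsed time gives the remainder. Isolated below with made-up progress numbers:

```python
import datetime

done, total = 300, 22797                   # hypothetical progress counters
timespend = datetime.timedelta(minutes=2)  # elapsed since self.start_time
percent = done / total                     # fraction of grid squares finished
estimated_total = timespend / percent      # projected duration of a full sweep
remaining = estimated_total - timespend    # time left at the current pace
rate_per_min = done / timespend.total_seconds() * 60  # requests per minute
print(remaining, rate_per_min)
```

This assumes a steady request rate; early in a sweep, while the proxy scores are still settling, the projection will overshoot.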
Okay, that's basically it. The rest of the code you can study on your own.
The above is the detailed content of the Mobike crawler source code walkthrough. For more information, see other related articles in the PHP community!