The first two articles analyzed why I crawled the Mobike API and what the data showed. This article gets straight to the point and provides directly runnable source code for study.
Disclaimer:
This crawler is for learning and research purposes only. Do not use it for illegal purposes; you are solely responsible for any legal disputes that arise.
If you have no patience to read the article, go straight to the code:

```shell
git clone https://github.com/derekhe/mobike-crawler
python3 crawler.py
```
Don't forget to give it a star!
Directory structure
\analysis - Jupyter notebooks for data analysis
\influx-importer - import into InfluxDB
\modules - proxy module
\web - real-time graphical display module, which I wrote mainly to learn React; see here for the effect.
crawler.py - core crawler code
importToDb.py - import into the database for analysis
sql.sql - SQL statements used to create the table
start.sh - script for continuous running
Ideas
The core code lives in crawler.py. The data is first stored in a sqlite3 database, then de-duplicated and exported to a CSV file to save space.
The Mobike API returns the bicycles within a square area, so by sliding that square step by step across the map I can capture data for the whole region.
left, top, right, and bottom define the capture range; currently it is a square area covering Chengdu inside the ring expressway and around South Lake. offset defines the capture step. At 0.002, a full sweep can be completed within 15 minutes on a cheap DigitalOcean server.
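Given those bounds and the 0.002 step, the request volume per sweep is easy to estimate: np.arange(start, stop, step) yields ceil((stop - start) / step) points per axis, and the crawler issues one request per (lat, lon) grid cell. A quick check (plain math instead of numpy, same count):

```python
import math

# capture bounds and step, copied from start() in crawler.py
left, top = 30.7828453209, 103.9213455517
right, bottom = 30.4781772402, 104.2178123382
offset = 0.002

# latitude decreases from left to right, longitude increases from top to bottom,
# mirroring np.arange(left, right, -offset) and np.arange(top, bottom, offset)
lat_steps = math.ceil((right - left) / -offset)   # grid rows
lon_steps = math.ceil((bottom - top) / offset)    # grid columns
total_requests = lat_steps * lon_steps
print(lat_steps, lon_steps, total_requests)
```

So each full sweep is on the order of 23,000 requests, which is why the thread pool and proxy pool below matter.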
```python
def start(self):
    left = 30.7828453209
    top = 103.9213455517
    right = 30.4781772402
    bottom = 104.2178123382
    offset = 0.002

    if os.path.isfile(self.db_name):
        os.remove(self.db_name)

    try:
        with sqlite3.connect(self.db_name) as c:
            c.execute('''CREATE TABLE mobike
                (Time DATETIME, bikeIds VARCHAR(12), bikeType TINYINT,
                 distId INTEGER, distNum TINYINT, type TINYINT,
                 x DOUBLE, y DOUBLE)''')
    except Exception as ex:
        pass
```
Then 250 threads are started. As for why I didn't use coroutines: I simply hadn't learned them at the time. They would actually work too, and might even be more efficient.
Since the captured data needs de-duplication to eliminate overlap between adjacent square areas, the group_data call at the end takes care of this.
```python
executor = ThreadPoolExecutor(max_workers=250)
print("Start")
self.total = 0
lat_range = np.arange(left, right, -offset)
for lat in lat_range:
    lon_range = np.arange(top, bottom, offset)
    for lon in lon_range:
        self.total += 1
        executor.submit(self.get_nearby_bikes, (lat, lon))

executor.shutdown()
self.group_data()
```
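The group_data implementation itself is in the repository; as a rough sketch of what such a de-duplication step could look like (the function name and CSV layout here are my own, not from the source):

```python
import csv
import sqlite3

def group_data_sketch(db_name, csv_name):
    # Adjacent squares overlap, so the same bike can be inserted
    # several times per sweep; SELECT DISTINCT collapses duplicates.
    with sqlite3.connect(db_name) as c:
        rows = c.execute("SELECT DISTINCT * FROM mobike").fetchall()
    # Export the unique rows to CSV, which is cheaper to keep around
    # than the sqlite database itself.
    with open(csv_name, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Anything beyond exact-duplicate removal (e.g. grouping by bike ID per time slice) would need the real code from the repo.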
The core API call is below. It is the WeChat mini program's API endpoint; just fill in a few parameters and it works, very simple.
```python
def get_nearby_bikes(self, args):
    try:
        url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"
        payload = "latitude=%s&longitude=%s&errMsg=getMapCenterLocation" % (args[0], args[1])
        headers = {
            'charset': "utf-8",
            'platform': "4",
            'referer': "https://servicewechat.com/wx40f112341ae33edb/1/",
            'content-type': "application/x-www-form-urlencoded",
            'user-agent': "MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN",
            'host': "mwx.mobike.com",
            'connection': "Keep-Alive",
            'accept-encoding': "gzip",
            'cache-control': "no-cache"
        }

        self.request(headers, payload, args, url)
    except Exception as ex:
        print(ex)
```
Finally, you may ask: doesn't the crawler's IP get blocked all the time? Mobike does throttle by IP, but the workaround is simple: use a large number of proxies.
I have a proxy pool with more than 8,000 proxies per day. ProxyProvider fetches this pool directly and provides a pick() function to randomly select one of the top 50 proxies. Note that my proxy pool is updated hourly; the jsonblob proxy list referenced in the code is only an example, and most of its entries will have expired after a while.
A proxy scoring mechanism is used here. Instead of choosing a proxy at random, I sort proxies by score: each successful request earns points, and failures deduct them. This way, the fastest and most reliable proxies surface within a short time. If needed, the scores can be saved and reused later.
```python
class ProxyProvider:
    def __init__(self, min_proxies=200):
        self._bad_proxies = {}
        self._minProxies = min_proxies
        self.lock = threading.RLock()
        self.get_list()

    def get_list(self):
        logger.debug("Getting proxy list")
        r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b",
                         timeout=10)
        proxies = ujson.decode(r.text)
        logger.debug("Got %s proxies", len(proxies))
        self._proxies = list(map(lambda p: Proxy(p), proxies))

    def pick(self):
        with self.lock:
            self._proxies.sort(key=lambda p: p.score, reverse=True)
            proxy_len = len(self._proxies)
            max_range = 50 if proxy_len > 50 else proxy_len
            proxy = self._proxies[random.randrange(1, max_range)]
            proxy.used()
            return proxy
```
In actual use, call proxyProvider.pick() to select a proxy, then use it. If the proxy has any problems, call proxy.fatal_error() to lower its score so that it won't be selected again.
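The Proxy class itself is not shown above; here is a minimal sketch consistent with how pick() and request() use it. Only fatal_error(), used(), and the score-based sort are named in the article; the exact point values and the success() method are my own assumptions:

```python
class Proxy:
    """Sketch of a scored proxy wrapper: ProxyProvider.pick() sorts
    by .score, and callers adjust the score after each request."""

    def __init__(self, url):
        self.url = url
        self.score = 0

    def used(self):
        # bookkeeping hook called by pick(); a real version might
        # record a last-used timestamp to spread the load
        pass

    def success(self):
        self.score += 1      # reward a completed request (assumed value)

    def fatal_error(self):
        self.score -= 100    # sink broken proxies to the bottom of the sort
```

With large penalties for fatal errors and small rewards for successes, a dead proxy quickly drops out of the top-50 window that pick() samples from.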
```python
def request(self, headers, payload, args, url):
    while True:
        proxy = self.proxyProvider.pick()
        try:
            response = requests.request(
                "POST", url, data=payload, headers=headers,
                proxies={"https": proxy.url},
                timeout=5, verify=False
            )

            with self.lock:
                with sqlite3.connect(self.db_name) as c:
                    try:
                        print(response.text)
                        decoded = ujson.decode(response.text)['object']
                        self.done += 1
                        for x in decoded:
                            c.execute("INSERT INTO mobike VALUES (%d,'%s',%d,%d,%s,%s,%f,%f)" % (
                                int(time.time()) * 1000, x['bikeIds'], int(x['biketype']),
                                int(x['distId']), x['distNum'], x['type'],
                                x['distX'], x['distY']))

                        timespend = datetime.datetime.now() - self.start_time
                        percent = self.done / self.total
                        total = timespend / percent
                        print(args, self.done, percent * 100,
                              self.done / timespend.total_seconds() * 60,
                              total, total - timespend)
                    except Exception as ex:
                        print(ex)
            break
        except Exception as ex:
            proxy.fatal_error()
```
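The progress printout at the end of request() is simple proportional ETA math: the elapsed time divided by the fraction of squares done projects the total sweep duration, and subtracting the elapsed time gives the remainder. Isolated below with made-up progress numbers:

```python
import datetime

done, total = 300, 22797                   # hypothetical progress counters
timespend = datetime.timedelta(minutes=2)  # elapsed since self.start_time
percent = done / total                     # fraction of grid squares finished
estimated_total = timespend / percent      # projected duration of a full sweep
remaining = estimated_total - timespend    # time left at the current pace
rate_per_min = done / timespend.total_seconds() * 60  # requests per minute
print(remaining, rate_per_min)
```

This assumes a steady request rate; early in a sweep, while the proxy scores are still settling, the projection will overshoot.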
Okay, that's basically it. The rest of the code you can study on your own.
The above is the detailed content of the Mobike crawler source code walkthrough. For more information, see other related articles in the PHP community!