Mobike Crawler Source Analysis

Tags: postgres, database
The first two articles analyzed why I crawled Mobike's interface (http://www.php.cn/code/11829.html) and what came out of the data analysis. This article provides the crawler's source code directly, for learning.

Disclaimer:
This crawler is for learning and research purposes only; please do not use it for illegal purposes. Any legal disputes arising from its use are the sole responsibility of the user.

If you don't have the patience to read the whole article, go straight to:

    git clone https://github.com/derekhe/mobike-crawler
    python3 crawler.py

Please don't forget to give it a star!

Directory structure

    • \analysis - Jupyter notebooks doing the data analysis

    • \influx-importer - imports into InfluxDB (not yet complete)

    • \modules - proxy module

    • \web - real-time graphical display module, written just to learn React; for the effect, see here

    • crawler.py - core crawler code

    • importToDB.py - imports into the Postgres database for analysis (a rough sketch of such an import follows this list)

    • sql.sql - SQL for creating the table

    • start.sh - script that keeps the crawler running continuously
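importToDB.py and sql.sql themselves are not shown in this article. As a rough idea of what a CSV-to-Postgres import can look like, here is a minimal sketch using psycopg2's COPY support; the connection string, table and column names, and CSV file name are my assumptions based on the schema shown later, not the repository's actual code:

    # Hypothetical CSV-to-Postgres import; the real importToDB.py and sql.sql
    # live in the repository. Connection string, table/column names and the
    # CSV file name are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=mobike user=postgres")
    with conn, conn.cursor() as cur, open("bikes.csv") as f:
        cur.copy_expert(
            "COPY mobike (time, bikeids, biketype, distid, distnum, type, x, y) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f
        )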

Approach

The core code lives in crawler.py. The data is first stored in an SQLite3 database, then deduplicated and exported to a CSV file to save space.

Mobike's API returns the bikes inside a square area, so I can crawl the whole city's data by sliding that square across the map, block by block.

left, top, right and bottom define the crawl boundary; it is currently the square area of Chengdu from the high-tech zone south to Nanhu (South Lake). offset defines the crawl step: at 0.002, a full pass completes within 15 minutes on a $5 DigitalOcean server.

    def start(self):
        left = 30.7828453209
        top = 103.9213455517
        right = 30.4781772402
        bottom = 104.2178123382

        offset = 0.002

        # Start each run with a fresh database.
        if os.path.isfile(self.db_name):
            os.remove(self.db_name)

        try:
            with sqlite3.connect(self.db_name) as c:
                c.execute("""CREATE TABLE mobike
                    (Time DATETIME, bikeIds VARCHAR(12), bikeType TINYINT,
                     distId INTEGER, distNum TINYINT, type TINYINT,
                     x DOUBLE, y DOUBLE)""")
        except Exception as ex:
            pass
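As a quick sanity check on that 15-minute figure (my own back-of-the-envelope arithmetic, not from the article): dividing the bounding box by the 0.002 step gives roughly 152 × 148 ≈ 22,500 grid points per pass, which works out to about 25 API requests per second.

    # Back-of-the-envelope request count for one full sweep; my own estimate,
    # not from the original article.
    left, right = 30.7828453209, 30.4781772402    # latitude bounds
    top, bottom = 103.9213455517, 104.2178123382  # longitude bounds
    offset = 0.002

    lat_steps = int(abs(right - left) / offset)   # 152
    lon_steps = int(abs(bottom - top) / offset)   # 148
    per_pass = lat_steps * lon_steps              # 22,496 requests
    print(per_pass, per_pass / (15 * 60))         # ~25 requests per second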

Then 250 threads are started. You may ask why I didn't use processes; hem~~ I hadn't learned them yet~~~ they would actually work too, and might even be more efficient.

The group_data at the end is there because the crawled data must be deduplicated, to eliminate the overlap between adjacent small squares (a sketch of this step follows the code below).

        # Sweep the bounding box with a 250-thread pool; every grid point
        # becomes one get_nearby_bikes task.
        executor = ThreadPoolExecutor(max_workers=250)
        print("Start")
        self.total = 0

        lat_range = np.arange(left, right, -offset)
        for lat in lat_range:
            lon_range = np.arange(top, bottom, offset)
            for lon in lon_range:
                self.total += 1
                executor.submit(self.get_nearby_bikes, (lat, lon))

        executor.shutdown()
        self.group_data()
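group_data itself is not shown in this article (it lives in the repository). One plausible shape for the dedup-and-export step it performs, keeping one row per bike ID and writing a CSV, is the sketch below; the query and the file name are my assumptions, not the repository's actual code:

    # Hypothetical sketch of the dedup-and-export step; the real group_data is
    # in the repository. Keeping one row per bikeIds is one plausible way to
    # drop the duplicates produced by overlapping query squares.
    import csv
    import sqlite3

    def group_data(db_name, csv_name="bikes.csv"):
        with sqlite3.connect(db_name) as c:
            # SQLite allows bare columns with GROUP BY; this keeps one
            # arbitrary row per bike.
            rows = c.execute("SELECT * FROM mobike GROUP BY bikeIds").fetchall()

        with open(csv_name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Time", "bikeIds", "bikeType", "distId",
                             "distNum", "type", "x", "y"])
            writer.writerows(rows)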

The most central API code is here. It is the WeChat Mini Program's API endpoint; fill in a few variables and you're done, very simple.

    def get_nearby_bikes(self, args):
        try:
            url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"

            payload = "latitude=%s&longitude=%s&errMsg=getMapCenterLocation" % (args[0], args[1])

            # Headers mimicking the WeChat Mini Program client.
            headers = {
                'charset': "utf-8",
                'platform': "4",
                'referer': "https://servicewechat.com/wx40f112341ae33edb/1/",
                'content-type': "application/x-www-form-urlencoded",
                'user-agent': "MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN",
                'host': "mwx.mobike.com",
                'connection': "Keep-Alive",
                'accept-encoding': "gzip",
                'cache-control': "no-cache"
            }

            self.request(headers, payload, args, url)
        except Exception as ex:
            print(ex)
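For a quick standalone test of the endpoint, something like the sketch below should work (my own example, not from the repository; it assumes the API is still live, uses sample coordinates inside the bounding box above, and takes the bikeIds/distX/distY field names from the INSERT statement later in this article):

    # Standalone sanity check of the endpoint; my own sketch, not from the repo.
    import requests
    import ujson

    url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"
    payload = "latitude=30.66&longitude=104.06&errMsg=getMapCenterLocation"  # sample point
    headers = {'content-type': "application/x-www-form-urlencoded", 'platform': "4"}

    r = requests.post(url, data=payload, headers=headers, timeout=5)
    bikes = ujson.decode(r.text)['object']
    print(len(bikes), "bikes returned")
    for b in bikes[:3]:
        print(b['bikeIds'], b['distX'], b['distY'])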

Finally, you may ask: won't the IP get banned with such frequent crawling? Mobike does rate-limit access by IP, but the workaround is simple: use a large number of proxies.

I have a proxy pool that collects more than 8,000 proxies on a typical day. ProxyProvider fetches this pool directly and exposes a pick() function that randomly selects among the 50 highest-scoring proxies. Note that my proxy pool is updated hourly, but the JsonBlob proxy list referenced in the code is only a sample and will be largely stale after a while.

The scoring mechanism works like this: instead of picking a proxy purely at random, I sort the proxies by score. Each successful request raises a proxy's score and each failed request lowers it, so at any moment the fastest, most reliable proxies are preferred. The scores can also be saved for the next run if needed (a sketch of this follows the request code below).
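The Proxy class itself is not shown in the article. Below is a minimal sketch consistent with how the surrounding code uses it; only the interface (url, score, used(), fatal_error()) is taken from the article, and the scoring deltas are my assumptions:

    # Hypothetical minimal Proxy class; the real one lives in the repository.
    class Proxy:
        def __init__(self, url):
            self.url = url
            self.score = 0

        def used(self):
            pass  # called whenever the proxy is handed out

        def success(self):
            self.score += 1    # successful request: raise the score (assumed delta)

        def error(self):
            self.score -= 1    # failed request: lower the score (assumed delta)

        def fatal_error(self):
            self.score -= 100  # effectively removes the proxy from rotation (assumed delta)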

    import logging
    import random
    import threading

    import requests
    import ujson

    logger = logging.getLogger(__name__)

    class ProxyProvider:
        def __init__(self, min_proxies=200):
            self._bad_proxies = {}
            self._minProxies = min_proxies
            self.lock = threading.RLock()

            self.get_list()

        def get_list(self):
            logger.debug("Getting proxy list")
            r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b",
                             timeout=10)
            proxies = ujson.decode(r.text)
            logger.debug("Got %s proxies", len(proxies))
            self._proxies = list(map(lambda p: Proxy(p), proxies))

        def pick(self):
            with self.lock:
                # Sort the best-scoring proxies to the front, then pick randomly
                # among the top 50 (or fewer, if the pool is small).
                self._proxies.sort(key=lambda p: p.score, reverse=True)
                proxy_len = len(self._proxies)
                max_range = 50 if proxy_len > 50 else proxy_len
                proxy = self._proxies[random.randrange(1, max_range)]
                proxy.used()

                return proxy

In actual use, select a proxy via proxyProvider.pick() and then use it. If anything goes wrong with the proxy, call proxy.fatal_error() to drop its score so it will not be selected again.

    def request(self, headers, payload, args, url):
        while True:
            # Keep retrying with a new proxy until one request goes through.
            proxy = self.proxyProvider.pick()

            try:
                response = requests.request("POST", url, data=payload, headers=headers,
                                            proxies={"https": proxy.url},
                                            timeout=5, verify=False)

                with self.lock:
                    with sqlite3.connect(self.db_name) as c:
                        try:
                            print(response.text)
                            decoded = ujson.decode(response.text)['object']
                            self.done += 1
                            for x in decoded:
                                c.execute("INSERT INTO mobike VALUES (%d,'%s',%d,%d,%s,%s,%f,%f)" % (
                                    int(time.time()) * 1000, x['bikeIds'], int(x['biketype']),
                                    int(x['distId']), x['distNum'], x['type'],
                                    x['distX'], x['distY']))

                            # Progress report: percentage done, crawl rate per
                            # minute, estimated total time and time remaining.
                            timespend = datetime.datetime.now() - self.start_time
                            percent = self.done / self.total
                            total = timespend / percent
                            print(args, self.done, percent * 100,
                                  self.done / timespend.total_seconds() * 60,
                                  total, total - timespend)
                        except Exception as ex:
                            print(ex)

                        break
            except Exception as ex:
                proxy.fatal_error()
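Persisting the scores between runs, as suggested above, could be as simple as dumping them to a JSON file. A minimal sketch (my own; the file name is made up, and the url/score attributes match the Proxy sketch earlier):

    # Hypothetical score persistence between runs; not part of the repository.
    import json

    def save_scores(proxies, path="proxy_scores.json"):
        with open(path, "w") as f:
            json.dump({p.url: p.score for p in proxies}, f)

    def load_scores(proxies, path="proxy_scores.json"):
        try:
            with open(path) as f:
                scores = json.load(f)
        except FileNotFoundError:
            return  # first run: keep default scores
        for p in proxies:
            p.score = scores.get(p.url, 0)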

OK, that's basically it~~~ study the rest of the code yourself~~~
