Mobike Crawler Source Analysis

Tags: postgres, database
The first two articles analyzed why I crawled Mobike's interface (http://www.php.cn/code/11829.html) and what came out of the data analysis. This article provides the crawler's source code directly, for learning.

Disclaimer:
This crawler is for learning and research purposes only; please do not use it for illegal purposes. Any legal disputes arising from its use are the sole responsibility of the user.

If you don't have the patience to read the whole article, go straight to:

    git clone https://github.com/derekhe/mobike-crawler
    python3 crawler.py

Please don't forget to give it a star!

Directory structure

    • \analysis - Jupyter notebooks doing the data analysis

    • \influx-importer - imports into InfluxDB (not yet complete)

    • \modules - proxy module

    • \web - real-time graphical display module, written just to learn React; for the effect, see here

    • crawler.py - core crawler code

    • importToDB.py - imports into the Postgres database for analysis (a rough sketch of such an import follows this list)

    • sql.sql - SQL for creating the table

    • start.sh - script that keeps the crawler running continuously
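importToDB.py and sql.sql themselves are not shown in this article. As a rough idea of what a CSV-to-Postgres import can look like, here is a minimal sketch using psycopg2's COPY support; the connection string, table and column names, and CSV file name are my assumptions based on the schema shown later, not the repository's actual code:

    # Hypothetical CSV-to-Postgres import; the real importToDB.py and sql.sql
    # live in the repository. Connection string, table/column names and the
    # CSV file name are assumptions.
    import psycopg2

    conn = psycopg2.connect("dbname=mobike user=postgres")
    with conn, conn.cursor() as cur, open("bikes.csv") as f:
        cur.copy_expert(
            "COPY mobike (time, bikeids, biketype, distid, distnum, type, x, y) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f
        )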

Approach

The core code lives in crawler.py. The data is first stored in an SQLite3 database, then deduplicated and exported to a CSV file to save space.

Mobike's API returns the bikes inside a square area, so I can crawl the whole city's data by sliding that square across the map, block by block.

left, top, right and bottom define the crawl boundary; it is currently the square area of Chengdu from the high-tech zone south to Nanhu (South Lake). offset defines the crawl step: at 0.002, a full pass completes within 15 minutes on a $5 DigitalOcean server.

    def start(self):
        left = 30.7828453209
        top = 103.9213455517
        right = 30.4781772402
        bottom = 104.2178123382

        offset = 0.002

        # Start each run with a fresh database.
        if os.path.isfile(self.db_name):
            os.remove(self.db_name)

        try:
            with sqlite3.connect(self.db_name) as c:
                c.execute("""CREATE TABLE mobike
                    (Time DATETIME, bikeIds VARCHAR(12), bikeType TINYINT,
                     distId INTEGER, distNum TINYINT, type TINYINT,
                     x DOUBLE, y DOUBLE)""")
        except Exception as ex:
            pass
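As a quick sanity check on that 15-minute figure (my own back-of-the-envelope arithmetic, not from the article): dividing the bounding box by the 0.002 step gives roughly 152 × 148 ≈ 22,500 grid points per pass, which works out to about 25 API requests per second.

    # Back-of-the-envelope request count for one full sweep; my own estimate,
    # not from the original article.
    left, right = 30.7828453209, 30.4781772402    # latitude bounds
    top, bottom = 103.9213455517, 104.2178123382  # longitude bounds
    offset = 0.002

    lat_steps = int(abs(right - left) / offset)   # 152
    lon_steps = int(abs(bottom - top) / offset)   # 148
    per_pass = lat_steps * lon_steps              # 22,496 requests
    print(per_pass, per_pass / (15 * 60))         # ~25 requests per second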

Then 250 threads are started. You may ask why I didn't use processes; hem~~ I hadn't learned them yet~~~ they would actually work too, and might even be more efficient.

The group_data at the end is there because the crawled data must be deduplicated, to eliminate the overlap between adjacent small squares (a sketch of this step follows the code below).

        # Sweep the bounding box with a 250-thread pool; every grid point
        # becomes one get_nearby_bikes task.
        executor = ThreadPoolExecutor(max_workers=250)
        print("Start")
        self.total = 0

        lat_range = np.arange(left, right, -offset)
        for lat in lat_range:
            lon_range = np.arange(top, bottom, offset)
            for lon in lon_range:
                self.total += 1
                executor.submit(self.get_nearby_bikes, (lat, lon))

        executor.shutdown()
        self.group_data()
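group_data itself is not shown in this article (it lives in the repository). One plausible shape for the dedup-and-export step it performs, keeping one row per bike ID and writing a CSV, is the sketch below; the query and the file name are my assumptions, not the repository's actual code:

    # Hypothetical sketch of the dedup-and-export step; the real group_data is
    # in the repository. Keeping one row per bikeIds is one plausible way to
    # drop the duplicates produced by overlapping query squares.
    import csv
    import sqlite3

    def group_data(db_name, csv_name="bikes.csv"):
        with sqlite3.connect(db_name) as c:
            # SQLite allows bare columns with GROUP BY; this keeps one
            # arbitrary row per bike.
            rows = c.execute("SELECT * FROM mobike GROUP BY bikeIds").fetchall()

        with open(csv_name, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["Time", "bikeIds", "bikeType", "distId",
                             "distNum", "type", "x", "y"])
            writer.writerows(rows)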

The most central API code is here. It is the WeChat Mini Program's API endpoint; fill in a few variables and you're done, very simple.

    def get_nearby_bikes(self, args):
        try:
            url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"

            payload = "latitude=%s&longitude=%s&errMsg=getMapCenterLocation" % (args[0], args[1])

            # Headers mimicking the WeChat Mini Program client.
            headers = {
                'charset': "utf-8",
                'platform': "4",
                'referer': "https://servicewechat.com/wx40f112341ae33edb/1/",
                'content-type': "application/x-www-form-urlencoded",
                'user-agent': "MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN",
                'host': "mwx.mobike.com",
                'connection': "Keep-Alive",
                'accept-encoding': "gzip",
                'cache-control': "no-cache"
            }

            self.request(headers, payload, args, url)
        except Exception as ex:
            print(ex)
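For a quick standalone test of the endpoint, something like the sketch below should work (my own example, not from the repository; it assumes the API is still live, uses sample coordinates inside the bounding box above, and takes the bikeIds/distX/distY field names from the INSERT statement later in this article):

    # Standalone sanity check of the endpoint; my own sketch, not from the repo.
    import requests
    import ujson

    url = "https://mwx.mobike.com/mobike-api/rent/nearbyBikesInfo.do"
    payload = "latitude=30.66&longitude=104.06&errMsg=getMapCenterLocation"  # sample point
    headers = {'content-type': "application/x-www-form-urlencoded", 'platform': "4"}

    r = requests.post(url, data=payload, headers=headers, timeout=5)
    bikes = ujson.decode(r.text)['object']
    print(len(bikes), "bikes returned")
    for b in bikes[:3]:
        print(b['bikeIds'], b['distX'], b['distY'])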

Finally, you may ask: won't the IP get banned with such frequent crawling? Mobike does rate-limit access by IP, but the workaround is simple: use a large number of proxies.

I have a proxy pool that collects more than 8,000 proxies on a typical day. ProxyProvider fetches this pool directly and exposes a pick() function that randomly selects among the 50 highest-scoring proxies. Note that my proxy pool is updated hourly, but the JsonBlob proxy list referenced in the code is only a sample and will be largely stale after a while.

The scoring mechanism works like this: instead of picking a proxy purely at random, I sort the proxies by score. Each successful request raises a proxy's score and each failed request lowers it, so at any moment the fastest, most reliable proxies are preferred. The scores can also be saved for the next run if needed (a sketch of this follows the request code below).
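The Proxy class itself is not shown in the article. Below is a minimal sketch consistent with how the surrounding code uses it; only the interface (url, score, used(), fatal_error()) is taken from the article, and the scoring deltas are my assumptions:

    # Hypothetical minimal Proxy class; the real one lives in the repository.
    class Proxy:
        def __init__(self, url):
            self.url = url
            self.score = 0

        def used(self):
            pass  # called whenever the proxy is handed out

        def success(self):
            self.score += 1    # successful request: raise the score (assumed delta)

        def error(self):
            self.score -= 1    # failed request: lower the score (assumed delta)

        def fatal_error(self):
            self.score -= 100  # effectively removes the proxy from rotation (assumed delta)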

    import logging
    import random
    import threading

    import requests
    import ujson

    logger = logging.getLogger(__name__)

    class ProxyProvider:
        def __init__(self, min_proxies=200):
            self._bad_proxies = {}
            self._minProxies = min_proxies
            self.lock = threading.RLock()

            self.get_list()

        def get_list(self):
            logger.debug("Getting proxy list")
            r = requests.get("https://jsonblob.com/31bf2dc8-00e6-11e7-a0ba-e39b7fdbe78b",
                             timeout=10)
            proxies = ujson.decode(r.text)
            logger.debug("Got %s proxies", len(proxies))
            self._proxies = list(map(lambda p: Proxy(p), proxies))

        def pick(self):
            with self.lock:
                # Sort the best-scoring proxies to the front, then pick randomly
                # among the top 50 (or fewer, if the pool is small).
                self._proxies.sort(key=lambda p: p.score, reverse=True)
                proxy_len = len(self._proxies)
                max_range = 50 if proxy_len > 50 else proxy_len
                proxy = self._proxies[random.randrange(1, max_range)]
                proxy.used()

                return proxy

In actual use, select a proxy via proxyProvider.pick() and then use it. If anything goes wrong with the proxy, call proxy.fatal_error() to drop its score so it will not be selected again.

    def request(self, headers, payload, args, url):
        while True:
            # Keep retrying with a new proxy until one request goes through.
            proxy = self.proxyProvider.pick()

            try:
                response = requests.request("POST", url, data=payload, headers=headers,
                                            proxies={"https": proxy.url},
                                            timeout=5, verify=False)

                with self.lock:
                    with sqlite3.connect(self.db_name) as c:
                        try:
                            print(response.text)
                            decoded = ujson.decode(response.text)['object']
                            self.done += 1
                            for x in decoded:
                                c.execute("INSERT INTO mobike VALUES (%d,'%s',%d,%d,%s,%s,%f,%f)" % (
                                    int(time.time()) * 1000, x['bikeIds'], int(x['biketype']),
                                    int(x['distId']), x['distNum'], x['type'],
                                    x['distX'], x['distY']))

                            # Progress report: percentage done, crawl rate per
                            # minute, estimated total time and time remaining.
                            timespend = datetime.datetime.now() - self.start_time
                            percent = self.done / self.total
                            total = timespend / percent
                            print(args, self.done, percent * 100,
                                  self.done / timespend.total_seconds() * 60,
                                  total, total - timespend)
                        except Exception as ex:
                            print(ex)

                        break
            except Exception as ex:
                proxy.fatal_error()
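Persisting the scores between runs, as suggested above, could be as simple as dumping them to a JSON file. A minimal sketch (my own; the file name is made up, and the url/score attributes match the Proxy sketch earlier):

    # Hypothetical score persistence between runs; not part of the repository.
    import json

    def save_scores(proxies, path="proxy_scores.json"):
        with open(path, "w") as f:
            json.dump({p.url: p.score for p in proxies}, f)

    def load_scores(proxies, path="proxy_scores.json"):
        try:
            with open(path) as f:
                scores = json.load(f)
        except FileNotFoundError:
            return  # first run: keep default scores
        for p in proxies:
            p.score = scores.get(p.url, 0)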

OK, that's basically it~~~ study the rest of the code yourself~~~
