Speed up your crawler?

Source: Internet
Author: User
Tags: societe generale


Project Introduction


This article shows how to use asynchronous modules in Python to improve the efficiency of a crawler.
The target we need to crawl is the wealth management product information on the website https://www.rong360.com/licai-bank/list/p1, which looks like this:






We need to crawl 86,394 wealth management product records, 10 per page, that is, 8,640 pages.
In the article Python crawler (16): using Scrapy to crawl bank wealth management product information (more than 120,000 records), we used the crawler framework Scrapy to implement the crawler, scraped 127,130 records, and stored them in MongoDB; the whole process took 3 hours. In principle, Scrapy is a good choice for implementing this crawler, but can the speed be improved? This article demonstrates how to use the asynchronous modules in Python (aiohttp and asyncio) to improve the efficiency of the crawler.
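Before walking through the project, here is a minimal sketch (not part of the original script) of the pattern the rest of the article relies on: aiohttp sends the HTTP requests, asyncio runs many of them concurrently, and a semaphore caps how many requests are in flight at once. The concurrency limit of 10 and the page range below are placeholder values.

import asyncio
import aiohttp

async def fetch(sem, session, url):
    # The semaphore limits the number of simultaneous requests
    async with sem:
        async with session.get(url) as response:
            return await response.text()

async def main(urls):
    sem = asyncio.Semaphore(10)  # placeholder concurrency limit
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(sem, session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Example: fetch the first three list pages concurrently
pages = asyncio.get_event_loop().run_until_complete(
    main(["https://www.rong360.com/licai-bank/list/p%d" % i for i in range(1, 4)])
)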


Crawler project


Our crawler works in two steps:


    1. Crawl the wealth management product information on all 8,640 pages, merge it, and save it to a CSV file;
    2. Read the CSV file and write it to a MySQL database.


First, we crawl the wealth management product information from all of the pages and store it in a CSV file. We use aiohttp and asyncio to speed up the crawler; the complete Python code is as follows:


import re
import time
import logging
import asyncio
import aiohttp
import pandas as pd

# Set the log format
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

df = pd.DataFrame(columns=['name', 'bank', 'currency', 'startDate', 'endDate',
                           'period', 'proType', 'profit', 'amount'])

# Asynchronous HTTP request
async def fetch(sem, session, url):
    async with sem:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        async with session.get(url, headers=headers) as response:
            return await response.text()

# Parse the web page with regular expressions
async def parser(html):
    tbody = re.findall(r"<tbody>[\s\S]*?</tbody>", html)[0]
    trs = re.findall(r"<tr [\s\S]*?</tr>", tbody)
    for tr in trs:
        tds = re.findall(r"<td[\s\S]*?</td>", tr)
        name, bank = re.findall(r'title="(.+?)"', ''.join(tds))
        name = name.replace('&amp;', '').replace('quot;', '')
        currency, startDate, endDate, amount = re.findall(r'<td>(.+?)</td>', ''.join(tds))
        period = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[5]))
        proType = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[6]))
        profit = ''.join(re.findall(r'<td class="td8">(.+?)</td>', tds[7]))
        df.loc[df.shape[0] + 1] = [name, bank, currency, startDate, endDate,
                                   period, proType, profit, amount]
        logger.info(str(df.shape[0]) + '\t' + name)

# Download and process one page
async def download(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            await parser(html)
        except Exception as err:
            print(err)

# URLs of all pages
urls = ["https://www.rong360.com/licai-bank/list/p%d" % i for i in range(1, 8641)]

# Time the whole crawl
print('*' * 50)
t3 = time.time()

# Handle the asynchronous I/O with the asyncio module
loop = asyncio.get_event_loop()
sem = asyncio.Semaphore(100)  # concurrency limit; the exact value was lost in the original formatting
tasks = [asyncio.ensure_future(download(sem, url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

df.to_csv('E://rong.csv')

t4 = time.time()
print('Total time: %s' % (t4 - t3))
print('*' * 50)
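A small side note on the event-loop code at the end of the script: on Python 3.7 and newer, the explicit get_event_loop() / ensure_future() / run_until_complete() sequence can be replaced by asyncio.run(). A sketch of that variant is below (the concurrency limit of 100 is an assumption, just as above):

async def main():
    # Assumed concurrency limit; create the semaphore inside the running loop
    sem = asyncio.Semaphore(100)
    await asyncio.gather(*(download(sem, url) for url in urls))

asyncio.run(main())  # Python 3.7+ entry point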


The output is as follows (the middle part of the output has been omitted and replaced with ......):


**************************************************
2018-10-17 13:33:50,717 - INFO: 10	金百合第245期
2018-10-17 13:33:50,749 - INFO: 20	金荷恒升2018年第26期
......
2018-10-17 14:03:34,906 - INFO: 86381	翠竹同益1M22期FGAB15015A
2018-10-17 14:03:35,257 - INFO: 86391	润鑫月月盈2号
Total time: 1787.4312353134155
**************************************************


As you can see, this crawler scraped 86,391 records in 1,787.4 seconds, that is, under 30 minutes. Although we got 3 fewer records than expected, the loss is negligible. Take a look at the data in the CSV file:
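Those few missing records almost certainly correspond to pages whose request or parsing failed inside download(), where the exception is only printed and the page is silently skipped. If you want to recover them, one option (a hedged sketch built on the functions defined above, not part of the original script) is to remember the failing URLs and give them a few extra passes after the main run:

failed_urls = []

# Same as download(), but records pages that could not be processed
async def download_tracking(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            await parser(html)
        except Exception as err:
            print(url, err)
            failed_urls.append(url)

# Retry whatever failed, up to a few rounds
def retry_failed(loop, sem, max_rounds=3):
    for _ in range(max_rounds):
        if not failed_urls:
            break
        retry = failed_urls[:]
        del failed_urls[:]
        tasks = [asyncio.ensure_future(download_tracking(sem, u)) for u in retry]
        loop.run_until_complete(asyncio.gather(*tasks))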




OK, we are now one step closer to our goal: save this CSV file to MySQL. For the specific method, you can refer to the article Python uses the pandas library to read and write a MySQL database: https://www.jianshu.com/p/238a13995b2b. The complete Python code is as follows:


# -*- coding: utf-8 -*-
# Import the necessary modules
import pandas as pd
from sqlalchemy import create_engine

# Initialize the database connection, using the pymysql driver
engine = create_engine('mysql+pymysql://root:******@localhost:33061/test', echo=True)

print("Read CSV file...")
# Read the local CSV file
df = pd.read_csv("E://rong.csv", sep=',', encoding='gb18030')

# Store the DataFrame as a MySQL table, without the index column
df.to_sql('rong',
          con=engine,
          index=False,
          index_label='name'
          )
print("Write to MySQL successfully!")


The output is as follows (it took a little over 10 seconds):


Read CSV file...
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8mb4' and `Collation` = 'utf8mb4_bin'
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,455 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine SELECT CAST('test collated returns' AS CHAR CHARACTER SET utf8mb4) COLLATE utf8mb4_bin AS anon_1
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine DESCRIBE `rong`
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,459 INFO sqlalchemy.engine.base.Engine ROLLBACK
2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine CREATE TABLE rong (`Unnamed: 0` BIGINT, name TEXT, bank TEXT, currency TEXT, `startDate` TEXT, `endDate` TEXT, enduration TEXT, `proType` TEXT, profit TEXT, amount TEXT)
2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,867 INFO sqlalchemy.engine.base.Engine COMMIT
2018-10-17 15:07:02,909 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2018-10-17 15:07:03,973 INFO sqlalchemy.engine.base.Engine INSERT INTO rong (`Unnamed: 0`, name, bank, currency, `startDate`, `endDate`, enduration, `proType`, profit, amount) VALUES (%(Unnamed: 0)s, %(name)s, %(bank)s, %(currency)s, %(startDate)s, %(endDate)s, %(enduration)s, %(proType)s, %(profit)s, %(amount)s)
2018-10-17 15:07:03,974 INFO sqlalchemy.engine.base.Engine (
{'Unnamed: 0': 1, 'name': 'Long Letter 20183773', 'bank': 'Longjiang Bank', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-14', 'enduration': '99 days', 'proType': 'Not protected', 'profit': '4.8%', 'amount': '50,000'},
{'Unnamed: 0': 2, 'name': 'Fu Ying Jia NDHLCS20180055B', 'bank': 'Ningbo Donghai Bank', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-17', 'enduration': '179 days', 'proType': 'Guaranteed proceeds', 'profit': '4.8%', 'amount': '50,000'},
{'Unnamed: 0': 3, 'name': 'Xin Yue 2018 6th', 'bank': 'Inactive farming firm', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-21', 'enduration': '212 days', 'proType': 'Non-guaranteed', 'profit': '4.8%', 'amount': '50,000'},
{'Unnamed: 0': 4, 'name': 'Anxin MTLC18165', 'bank': 'Min tai firm', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-15', 'enduration': '49 days', 'proType': 'Not protected', 'profit': '4.75%', 'amount': '50,000'},
{'Unnamed: 0': 5, 'name': 'The private line • Ruyi ADRY181115A', 'bank': 'Agricultural Bank', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-16', 'enduration': '90 days', 'proType': 'Non-guaranteed', 'profit': '4.75%', 'amount': '1 million'},
{'Unnamed: 0': 6, 'name': 'Steady Growth (2018) 176', 'bank': 'Weihai Commercial Bank', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-15', 'enduration': '91 days', 'proType': 'Not protected', 'profit': '4.75%', 'amount': '50,000'},
{'Unnamed: 0': 7, 'name': 'Titi J18071', 'bank': 'Wenzhou Bank', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-16', 'enduration': '96 days', 'proType': 'Not protected', 'profit': '4.75%', 'amount': '10,000'},
{'Unnamed: 0': 8, 'name': 'Private bank client 84618042', 'bank': 'Societe Generale', 'currency': 'RMB', 'startDate': '2018-10-12', 'endDate': '2018-10-17', 'enduration': '99 days', 'proType': 'Non-guaranteed', 'profit': '4.75%', 'amount': '500,000'}
... displaying 10 of 86391 total bound parameter sets ...
{'Unnamed: 0': 86390, 'name': 'Run Xin monthly surplus 3rd No. RX1M003', 'bank': 'Zhuhai Huarun Bank', 'currency': 'RMB', 'startDate': '2015-06-24', 'endDate': '2015-06-30', 'enduration': '35 days', 'proType': 'Not protected', 'profit': '4.5%', 'amount': '50,000'},
{'Unnamed: 0': 86391, 'name': 'Yun Xin Yue Yue 2nd', 'bank': 'Zhuhai Huarun Bank', 'currency': 'RMB', 'startDate': '2015-06-17', 'endDate': '2015-06-23', 'enduration': '35 days', 'proType': 'Non-guaranteed', 'profit': '4.4%', 'amount': '50,000'})
2018-10-17 15:07:14,106 INFO sqlalchemy.engine.base.Engine COMMIT
Write to MySQL successfully!


If you're still not convinced, we can take a look at the data in MySQL:
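Besides opening a MySQL client, the same SQLAlchemy engine can be reused from pandas to read a few rows back and confirm the import (a small sketch using the same placeholder connection string as above):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:******@localhost:33061/test')

# Pull a handful of rows back out of the new table
print(pd.read_sql('SELECT name, bank, profit, amount FROM rong LIMIT 5;', con=engine))

# Double-check that the row count matches the CSV
print(pd.read_sql('SELECT COUNT(*) AS n FROM rong;', con=engine))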





Summary


Let's compare this crawler with the Scrapy version. The Scrapy crawler scraped 127,130 records in 3 hours, while this crawler scraped 86,391 records in about half an hour. At Scrapy's rate (roughly 42,000 records per hour), the same 86,391 records would have taken about 2 hours, so the asynchronous crawler finishes the task in about one quarter of the time of the Scrapy crawler.
Finally, let's look at the 10 banks with the most wealth management products on offer (ranked from most to fewest) by entering the following MySQL command:


use test;
SELECT bank, count(*) AS product_num
FROM rong
GROUP BY bank
ORDER BY product_num DESC
LIMIT 10;
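As an aside, roughly the same ranking can be computed without leaving Python, straight from the CSV produced earlier (an optional pandas sketch; the MySQL query above is what produces the result shown next):

import pandas as pd

df = pd.read_csv("E://rong.csv", encoding='gb18030')

# Count products per bank and keep the 10 banks with the most products
top10 = (df.groupby('bank')
           .size()
           .sort_values(ascending=False)
           .head(10)
           .rename('product_num'))
print(top10)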


The output results are as follows:





