Python multi-process crawler and multiple data storage methods (Python crawler practice 2)
1. Multi-process crawler
For crawls that involve a large amount of data, Python's multi-process or multi-thread mechanisms can be used to process the work. Multi-processing refers to allocating the program across multiple CPU processes, with each CPU handling only one process at a time; multi-threading means that multiple "sub-process"-like threads inside a single process work cooperatively at the same time. Python has several modules for multi-process and multi-thread tasks; here the multiprocessing module is used to build a multi-process crawler. During testing it turned out that the target website has an anti-crawler mechanism, so the crawler reports errors when the number of URLs and processes is large.
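As a quick illustration of the mechanism (separate from the article's own script in the next section), here is a minimal sketch of dispatching page fetches through multiprocessing.Pool. The fetch function, the placeholder URLs and the process count are assumptions made for the example, not part of the original code.

from multiprocessing import Pool
import requests

def fetch(url):
    # Runs in a separate worker process; each worker fetches one page.
    return url, requests.get(url, timeout=10).status_code

if __name__ == '__main__':
    # Placeholder URLs; a real crawler would build its own URL list.
    urls = ['http://example.com/page/%d' % i for i in range(1, 5)]
    pool = Pool(processes=4)           # four worker processes
    results = pool.map(fetch, urls)    # blocks until every page has been fetched
    pool.close()
    pool.join()
    print(results)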
2. Code content
#!/usr/bin/python
# _*_ coding: utf-8 _*_

import re
import time
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
3. Store the crawled data in the MongoDB database

#!/usr/bin/python
# _*_ coding: utf-8 _*_

import re
import time
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
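The part of the script that actually writes to MongoDB is cut off above, so the following is only a minimal sketch of how records shaped like the crawler's output could be inserted with pymongo. The connection address and the qiushi database / qiushi_info collection names are assumptions; the field names mirror the columns used in the MySQL table later in the article, and the sample values are made up.

import pymongo

# One made-up record in the shape the crawler collects.
duanzi_list = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# Assumes a MongoDB server on localhost:27017; database and collection names are illustrative.
client = pymongo.MongoClient('mongodb://localhost:27017/')
collection = client['qiushi']['qiushi_info']
if duanzi_list:
    collection.insert_many(duanzi_list)       # write all scraped records in one call
print(collection.count_documents({}))         # verify how many documents are stored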
4. Insert the data into the MySQL database

Insert the data obtained by the crawler into MySQL, a relational database, so that it is stored permanently. First create a database and a table in MySQL, as shown below:
1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Use the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment, username varchar(64) not null, level int default 0, laugh_count int default 0, comment_count int default 0, content text default '') engine=InnoDB charset='utf8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+------------------------------------------------------------------+
| Table       | Create Table                                                     |
+-------------+------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+------------------------------------------------------------------+
1 row in set (0.00 sec)
The code that writes the data to the MySQL database is as follows:
#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import time
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
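The insert statements are likewise missing from the truncated script above, so here is a minimal sketch of writing one record into the qiushi_info table created earlier using pymysql. The connection credentials and the sample values are placeholders, not taken from the original post.

import pymysql

# Placeholder credentials; adjust host/user/password to match the local MySQL/MariaDB server.
conn = pymysql.connect(host='127.0.0.1', user='root', password='', db='qiushi', charset='utf8')
try:
    with conn.cursor() as cursor:
        sql = ('insert into qiushi_info (username, level, laugh_count, comment_count, content) '
               'values (%s, %s, %s, %s, %s)')
        # One made-up record in the shape the crawler collects.
        cursor.execute(sql, ('user1', 10, 120, 8, 'some duanzi text'))
    conn.commit()        # persist the inserted row
finally:
    conn.close()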
5. Write the crawler data to a CSV file

A CSV file stores values separated by commas (,) and can be opened either as plain text or in Excel. Because CSV is such a common data storage format, the crawled data is also saved to a CSV file.
The code for saving data to a CSV file is as follows:
#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
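Because the CSV-writing part of the script is also cut off, the following Python 3 sketch shows how the same fields could be written with the standard csv module; the output file name and the sample row are arbitrary, not the original author's.

import csv

# One made-up row in the shape the crawler collects.
rows = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# 'qiushi.csv' is an arbitrary output name; newline='' avoids blank lines on Windows.
with open('qiushi.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['username', 'level', 'laugh_count', 'comment_count', 'content'])
    writer.writeheader()      # first line: column names
    writer.writerows(rows)    # one CSV line per scraped record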
6. Write the crawled data to a text file

#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
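The text-file version is truncated at the same point, so here is a minimal Python 3 sketch of appending scraped records to a plain text file; the file name, field layout and separator line are arbitrary choices, not the original author's.

# One made-up record in the shape the crawler collects.
duanzi_list = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# 'qiushi.txt' is an arbitrary output name; one block of lines per record.
with open('qiushi.txt', 'a', encoding='utf-8') as f:
    for item in duanzi_list:
        f.write('username: %s\n' % item['username'])
        f.write('level: %s  laugh_count: %s  comment_count: %s\n'
                % (item['level'], item['laugh_count'], item['comment_count']))
        f.write('content: %s\n' % item['content'])
        f.write('-' * 40 + '\n')    # separator between records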