Python multi-process crawler and multiple data storage methods (Python crawler practice 2)
1. Multi-process crawler
For crawls that involve a large amount of data, Python's multi-process or multi-thread mechanisms can be used to process the work. Multi-processing refers to allocating the program across multiple CPU processes, with each CPU handling only one process at a time; multi-threading means that multiple "sub-process"-like threads inside a single process work cooperatively at the same time. Python has several modules for multi-process and multi-thread tasks; here the multiprocessing module is used to build a multi-process crawler. During testing it turned out that the target website has an anti-crawler mechanism, so the crawler reports errors when the number of URLs and processes is large.
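As a quick illustration of the mechanism (separate from the article's own script in the next section), here is a minimal sketch of dispatching page fetches through multiprocessing.Pool. The fetch function, the placeholder URLs and the process count are assumptions made for the example, not part of the original code.

from multiprocessing import Pool
import requests

def fetch(url):
    # Runs in a separate worker process; each worker fetches one page.
    return url, requests.get(url, timeout=10).status_code

if __name__ == '__main__':
    # Placeholder URLs; a real crawler would build its own URL list.
    urls = ['http://example.com/page/%d' % i for i in range(1, 5)]
    pool = Pool(processes=4)           # four worker processes
    results = pool.map(fetch, urls)    # blocks until every page has been fetched
    pool.close()
    pool.join()
    print(results)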
2. Code content
#!/usr/bin/python
# _*_ coding: utf-8 _*_

import re
import time
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
3. Store the crawled data in the MongoDB database

#!/usr/bin/python
# _*_ coding: utf-8 _*_

import re
import time
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
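The part of the script that actually writes to MongoDB is cut off above, so the following is only a minimal sketch of how records shaped like the crawler's output could be inserted with pymongo. The connection address and the qiushi database / qiushi_info collection names are assumptions; the field names mirror the columns used in the MySQL table later in the article, and the sample values are made up.

import pymongo

# One made-up record in the shape the crawler collects.
duanzi_list = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# Assumes a MongoDB server on localhost:27017; database and collection names are illustrative.
client = pymongo.MongoClient('mongodb://localhost:27017/')
collection = client['qiushi']['qiushi_info']
if duanzi_list:
    collection.insert_many(duanzi_list)       # write all scraped records in one call
print(collection.count_documents({}))         # verify how many documents are stored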
4. Insert the data into the MySQL database

Insert the data obtained by the crawler into MySQL, a relational database, so that it is stored permanently. First create a database and a table in MySQL, as shown below:
1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Use the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment, username varchar(64) not null, level int default 0, laugh_count int default 0, comment_count int default 0, content text default '') engine=InnoDB charset='utf8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+------------------------------------------------------------------+
| Table       | Create Table                                                     |
+-------------+------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+------------------------------------------------------------------+
1 row in set (0.00 sec)
The code that writes the data to the MySQL database is as follows:
#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import time
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
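The insert statements are likewise missing from the truncated script above, so here is a minimal sketch of writing one record into the qiushi_info table created earlier using pymysql. The connection credentials and the sample values are placeholders, not taken from the original post.

import pymysql

# Placeholder credentials; adjust host/user/password to match the local MySQL/MariaDB server.
conn = pymysql.connect(host='127.0.0.1', user='root', password='', db='qiushi', charset='utf8')
try:
    with conn.cursor() as cursor:
        sql = ('insert into qiushi_info (username, level, laugh_count, comment_count, content) '
               'values (%s, %s, %s, %s, %s)')
        # One made-up record in the shape the crawler collects.
        cursor.execute(sql, ('user1', 10, 120, 8, 'some duanzi text'))
    conn.commit()        # persist the inserted row
finally:
    conn.close()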
5. Write the crawler data to a CSV file

A CSV file stores values separated by commas (,) and can be opened either as plain text or in Excel. Because CSV is such a common data storage format, the crawled data is also saved to a CSV file.
The code for saving data to a CSV file is as follows:
#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
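Because the CSV-writing part of the script is also cut off, the following Python 3 sketch shows how the same fields could be written with the standard csv module; the output file name and the sample row are arbitrary, not the original author's.

import csv

# One made-up row in the shape the crawler collects.
rows = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# 'qiushi.csv' is an arbitrary output name; newline='' avoids blank lines on Windows.
with open('qiushi.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['username', 'level', 'laugh_count', 'comment_count', 'content'])
    writer.writeheader()      # first line: column names
    writer.writerows(rows)    # one CSV line per scraped record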
6. Write the crawled data to a text file

#!/usr/bin/python
# _*_ coding: utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/

import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, obtain the html data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke)
    '''
    html = get_web_html(url)
    # (the regular expressions and the rest of the script are truncated in the original post)
    usernames = re.findall(r'
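The text-file version is truncated at the same point, so here is a minimal Python 3 sketch of appending scraped records to a plain text file; the file name, field layout and separator line are arbitrary choices, not the original author's.

# One made-up record in the shape the crawler collects.
duanzi_list = [
    {'username': 'user1', 'level': 10, 'laugh_count': 120, 'comment_count': 8, 'content': 'some duanzi text'},
]

# 'qiushi.txt' is an arbitrary output name; one block of lines per record.
with open('qiushi.txt', 'a', encoding='utf-8') as f:
    for item in duanzi_list:
        f.write('username: %s\n' % item['username'])
        f.write('level: %s  laugh_count: %s  comment_count: %s\n'
                % (item['level'], item['laugh_count'], item['comment_count']))
        f.write('content: %s\n' % item['content'])
        f.write('-' * 40 + '\n')    # separator between records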