Python multi-thread crawler and multiple data storage methods (Python crawler practice 2)

1. Multi-process Crawler

For crawls that involve a large amount of data, Python's multi-process or multi-thread mechanisms can be used to process the pages in parallel. Multi-processing spreads the work across several CPU processes that run independently of each other, while multi-threading runs several cooperating "sub-tasks" (threads) inside a single process. Python has several modules for multi-process and multi-thread programming; here the multiprocessing module is used to build the crawler. During testing it turned out that the target website has an anti-crawler mechanism: when the number of URLs and processes grows large, the crawler starts to report errors.
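To make the distinction concrete, here is a minimal sketch (not taken from the original article) that fetches a handful of placeholder URLs with a process pool; multiprocessing.dummy provides a Pool with the same interface that is backed by threads, so swapping one import for the other switches between the two models.

# Minimal sketch with placeholder URLs: fetch pages with a process pool.
# multiprocessing.dummy.Pool has the same API but uses threads instead of processes.
import requests
from multiprocessing import Pool
# from multiprocessing.dummy import Pool   # uncomment to use threads instead

HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}

def fetch(url):
    # download one page and report how many characters it contains
    resp = requests.get(url, headers=HEADERS)
    return url, len(resp.text)

if __name__ == '__main__':
    urls = ['http://example.com/?page=%d' % i for i in range(1, 5)]  # placeholder URLs
    pool = Pool(4)                      # 4 worker processes (or threads)
    results = pool.map(fetch, urls)     # results come back in the parent process
    pool.close()
    pool.join()
    for url, size in results:
        print("%s -> %d characters" % (url, size))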

2. Code content
#!/usr/bin/python
# _*_ coding:utf-8 _*_
import re
import time
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, get the HTML data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    response = None  # returned unchanged if the request fails
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke) on the page
    '''
    html = get_web_html(url)
    usernames = re.findall(r'...', html)  # the regex patterns and the rest of this script are truncated in the source
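The regular expressions and the main section that drives the Pool are cut off in the code above. The sketch below is a hypothetical completion, continuing from the functions already defined: the regex patterns, the page-URL template, and the field names are assumptions, not the article's original code, and results are gathered from pool.map in the parent process because worker processes do not share the global duanzi_list.

# Hypothetical completion of the truncated script above; the regex patterns
# and the URL template are assumptions, not the original code.
def scrap_qiushi_info(url):
    html = get_web_html(url)
    if not html:
        return []
    usernames = re.findall(r'<h2>(.*?)</h2>', html, re.S)                   # assumed pattern
    contents = re.findall(r'<div class="content">(.*?)</div>', html, re.S)  # assumed pattern
    return [{'username': u.strip(), 'content': c.strip()}
            for u, c in zip(usernames, contents)]

if __name__ == '__main__':
    # assumed URL template for the paginated list pages
    urls = ['http://www.qiushibaike.com/8hr/page/%d/' % i for i in range(1, 11)]
    pool = Pool(processes=4)
    for page_items in pool.map(scrap_qiushi_info, urls):
        duanzi_list.extend(page_items)   # collect in the parent; workers do not share the list
    pool.close()
    pool.join()
    print(len(duanzi_list))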
3. Store the crawled data in the MongoDB database

#!/usr/bin/python
# _*_ coding:utf-8 _*_
import re
import time
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, get the HTML data of the url web site
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    response = None  # returned unchanged if the request fails
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke) on the page
    '''
    html = get_web_html(url)
    usernames = re.findall(r'...', html)  # the regex patterns and the rest of this script (including the MongoDB insert) are truncated in the source
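The MongoDB insert itself is also truncated above. As a rough sketch, assuming the records are dicts collected in duanzi_list and that the database and collection are named qiushi and qiushi_info (these names are assumptions), the write step with pymongo could look like this:

# Hypothetical MongoDB storage step; the connection URI and the database/
# collection names are assumptions, adjust them to your environment.
import pymongo

def save_to_mongo(items):
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    collection = client['qiushi']['qiushi_info']
    if items:
        collection.insert_many(items)    # items is a list of dicts
    client.close()

# usage: save_to_mongo(duanzi_list)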

4. Insert into the MySQL database

Insert the data obtained by the crawler into the relational database MySQL for permanent storage. First, create a database and a table in MySQL, as shown below:

1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Use the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment, username varchar(64) not null, level int default 0, laugh_count int default 0, comment_count int default 0, content text default '') engine=InnoDB charset='utf8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

The code for writing the data to the MySQL database is as follows:

#!/usr/bin/python
# _*_ coding:utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/
import re
import time
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, get the HTML data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    response = None  # returned unchanged if the request fails
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke) on the page
    '''
    html = get_web_html(url)
    usernames = re.findall(r'...', html)  # the regex patterns and the rest of this script (including the MySQL insert) are truncated in the source
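The part that actually inserts rows with pymysql is truncated in the code above. A minimal sketch is given below, assuming the records are dicts in duanzi_list whose keys match the columns of the qiushi_info table created earlier; the host, user, and password are placeholders.

# Hypothetical MySQL storage step; host, user and password are placeholders,
# the columns follow the qiushi_info table definition shown earlier.
import pymysql

def save_to_mysql(items):
    conn = pymysql.connect(host='127.0.0.1', user='root', password='your_password',
                           db='qiushi', charset='utf8')
    cursor = conn.cursor()
    sql = ("insert into qiushi_info(username, level, laugh_count, comment_count, content) "
           "values (%s, %s, %s, %s, %s)")
    for item in items:
        cursor.execute(sql, (item['username'], item.get('level', 0),
                             item.get('laugh_count', 0), item.get('comment_count', 0),
                             item['content']))
    conn.commit()
    cursor.close()
    conn.close()

# usage: save_to_mysql(duanzi_list)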

5. Write crawler data to a CSV file

CSV files use commas (,) as field separators and can be read either as plain text or with Excel. Because CSV is a very common data storage format, the crawled data is also saved to a CSV file here.

The code for saving data to a CSV file is as follows:

#!/usr/bin/python
# _*_ coding:utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/
import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, get the HTML data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    response = None  # returned unchanged if the request fails
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke) on the page
    '''
    html = get_web_html(url)
    usernames = re.findall(r'...', html)  # the regex patterns and the rest of this script (including the CSV writing) are truncated in the source
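The csv-writing step is truncated in the code above. A minimal sketch using the csv module follows; the output file name and the column order are assumptions.

# Hypothetical CSV storage step; file name and column order are assumptions.
import csv

def save_to_csv(items, filename='qiushi.csv'):
    with open(filename, 'wb') as f:    # Python 2; on Python 3 use open(filename, 'w', newline='')
        writer = csv.writer(f)
        writer.writerow(['username', 'level', 'laugh_count', 'comment_count', 'content'])
        for item in items:
            writer.writerow([item['username'], item.get('level', 0),
                             item.get('laugh_count', 0), item.get('comment_count', 0),
                             item['content']])

# usage: save_to_csv(duanzi_list)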
6. Write the crawled data to a text file

#!/usr/bin/python
# _*_ coding:utf-8 _*_
# blog: http://www.cnblogs.com/cloudlab/
import re
import csv
import time
import requests

duanzi_list = []

def get_web_html(url):
    '''
    @params: url, get the HTML data of the url website
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    response = None  # returned unchanged if the request fails
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            response = req.text.encode('utf8')
    except Exception as e:
        print e
    return response

def scrap_qiushi_info(url):
    '''
    @params: url, get the information of each duanzi (joke) on the page
    '''
    html = get_web_html(url)
    usernames = re.findall(r'...', html)  # the regex patterns and the rest of this script (including the text-file writing) are truncated in the source
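The text-file writing step is truncated as well. A minimal sketch, assuming one tab-separated record per line and a hypothetical output file name:

# Hypothetical text-file storage step; file name and line layout are assumptions.
def save_to_txt(items, filename='qiushi.txt'):
    with open(filename, 'w') as f:
        for item in items:
            # one record per line, fields separated by tabs
            f.write('%s\t%s\n' % (item['username'], item['content']))

# usage: save_to_txt(duanzi_list)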

 
