Python Selenium: crawling blog content and storing it in a MySQL database (implementation code)

Source: Internet
Author: User


A previous article described how to crawl CSDN blog abstracts. Data crawled with Selenium is usually stored in a TXT text file, but plain text is hard to process and analyze. This article describes how to crawl my personal blog information with Selenium and store it in a MySQL database for later analysis; for example, you can analyze the times at which the blogs were published, combine the article topics with WordCloud, and rank the articles by number of reads.
This is a basic article, and I hope it helps you. If there are errors or deficiencies in it, please bear with me. The next article will briefly explain the data analysis process.

I. Crawler results
Crawling address: http://blog.csdn.net/Eastmount



The results of crawling the data and storing it in the MySQL database are as follows:


The running process is shown below:

II. Complete code analysis

The complete code is as follows:

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
import re
import time
import os
import codecs
import MySQLdb

# Open the Firefox browser and set the wait time
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Get the total page number from the pager at the bottom of the page
def getPage():
    print 'getPage'
    number = 0
    texts = driver.find_element_by_xpath("//div[@id='papelist']").text
    print 'page', texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)  # regular expression to find the numbers
    print 'page number: ' + str(m[1])
    return int(m[1])

# Main function
def main():
    # Get the total number of lines in the txt file
    count = len(open("Blog_URL.txt", 'rU').readlines())
    print count
    n = 0
    urlfile = open("Blog_URL.txt", 'r')
    # Loop over the bloggers and extract the information of each one
    while n < count:  # only a couple of blogs are crawled here; normally count covers all bloggers
        url = urlfile.readline()
        url = url.strip("\n")
        print url
        driver.get(url)
        # Get the total page number
        allPage = getPage()
        print u'Total pages: ', allPage
        time.sleep(2)
        # Database operations
        try:
            conn = MySQLdb.connect(host='localhost', user='root',
                                   passwd='000000', port=123456, db='test01')
            cur = conn.cursor()  # database cursor
            # Fix: UnicodeEncodeError: 'latin-1' codec can't encode character
            conn.set_character_set('utf8')
            cur.execute('SET NAMES utf8;')
            cur.execute('SET CHARACTER SET utf8;')
            cur.execute('SET character_set_connection=utf8;')

            # Content processing
            m = 1  # start from the first page
            while m <= allPage:
                ur = url + "/article/list/" + str(m)
                print ur
                driver.get(ur)

                # Title
                article_title = driver.find_elements_by_xpath("//div[@class='article_title']")
                for title in article_title:
                    con = title.text
                    con = con.strip("\n")
                    # print con + '\n'

                # Abstract
                article_description = driver.find_elements_by_xpath("//div[@class='article_description']")
                for description in article_description:
                    con = description.text
                    con = con.strip("\n")
                    # print con + '\n'

                # Information (publication time, reads, comments)
                article_manage = driver.find_elements_by_xpath("//div[@class='article_manage']")
                for manage in article_manage:
                    con = manage.text
                    con = con.strip("\n")
                    # print con + '\n'

                num = 0
                print u'length', len(article_title)
                while num < len(article_title):
                    # Insert 8 values per row
                    sql = '''insert into csdn_blog
                                 (URL, Author, Artitle, Description, Manage, FBTime, YDNum, PLNum)
                             values(%s, %s, %s, %s, %s, %s, %s, %s)'''
                    Artitle = article_title[num].text
                    Description = article_description[num].text
                    Manage = article_manage[num].text
                    print Artitle
                    print Description
                    print Manage
                    # Get the author from the end of the URL
                    Author = url.split('/')[-1]
                    # Get the number of reads and comments
                    mode = re.compile(r'\d+\.?\d*')
                    YDNum = mode.findall(Manage)[-2]
                    PLNum = mode.findall(Manage)[-1]
                    print YDNum
                    print PLNum
                    # Get the publication time: everything before the "read" label
                    # (on the real Chinese page this label is the Chinese word for "read")
                    end = Manage.find(u'Read')
                    FBTime = Manage[:end]
                    cur.execute(sql, (url, Author, Artitle, Description, Manage,
                                      FBTime, YDNum, PLNum))
                    num = num + 1
                else:
                    print u'Database insert succeeded'
                m = m + 1
        # Exception handling
        except MySQLdb.Error, e:
            print "Mysql Error %d: %s" % (e.args[0], e.args[1])
        finally:
            cur.close()
            conn.commit()
            conn.close()
        n = n + 1
    else:
        urlfile.close()
        print 'Load over'

main()

Put each blogger's blog address (URL) in the Blog_URL.txt file, one per line, as shown in the figure. Note that the author previously wrote code to crawl the URLs of all CSDN experts; it is omitted here, and readers can add other bloggers' addresses to crawl more.
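As a rough illustration (the exact contents depend on which bloggers you want to crawl), Blog_URL.txt simply lists one blog home page per line, for example:

http://blog.csdn.net/Eastmount
http://blog.csdn.net/michaelzhou224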

The analysis process is as follows.
1. Get each blogger's total page number
First, read the blogger's address from Blog_URL.txt, then visit it and get the total page number. The code is as follows:

# Get the total page number from the pager at the bottom of each blogger's blog page
def getPage():
    print 'getPage'
    number = 0
    texts = driver.find_element_by_xpath("//div[@id='papelist']").text
    print 'page', texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)  # regular expression to find the numbers
    print 'page number: ' + str(m[1])
    return int(m[1])

For example, a total page number of 17 is obtained, as shown in the figure:
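To make the regular expression step concrete, here is a minimal standalone sketch; the pager string below is a made-up English stand-in for the real (Chinese) pager text, so the exact wording is an assumption:

# coding: utf-8
import re

# Hypothetical pager text standing in for what //div[@id='papelist'] returns
texts = '235 articles in total, 17 pages: 1 2 3 4 5 ... next'
m = re.findall(r'(\w*[0-9]+)\w*', texts)  # collect every number in the text
print m      # ['235', '17', '1', '2', '3', '4', '5']
print m[1]   # '17' -- the second number is the total page count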

2. Analyze the DOM tree
Here, page flipping is done by constructing the URL directly, which is more convenient.
Such as: http://blog.csdn.net/Eastmount/article/list/2
Therefore, you only need to: 1. get the total page number; 2. crawl the information on each page; 3. set the URL to flip to the next page; 4. crawl again.
Alternatively, you can click "next page" to jump to the next page; when there is no "next page" link left, the crawl for this blogger ends and the next blogger is crawled, as sketched below.
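A minimal sketch of this click-based alternative (not the method used in the full code above; the link text and exception handling here are assumptions that may need adjusting for the real page):

# coding: utf-8
# Minimal sketch: keep clicking "next page" until the link disappears
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
driver.get("http://blog.csdn.net/Eastmount")

while True:
    # ... crawl the articles on the current page here ...
    try:
        # Assumed link text: on the Chinese page the label is u'下一页' ("next page")
        next_link = driver.find_element_by_link_text(u'下一页')
    except NoSuchElementException:
        break              # no "next page" link left: this blogger is finished
    next_link.click()
    time.sleep(2)          # give the next page time to load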

3. Get details: title, summary, and time
Inspect the elements and analyze each blog page. (If you crawl with BeautifulSoup instead, the site returns a "Forbidden" error.)
It turns out that each article is wrapped in a <div></div> block, as shown in the figure below; you only need to locate that element.

Locate that position and crawl its content; here the title, abstract, and time need to be located separately.

The code is as follows. Note that the three sets of values are obtained in the same while loop, so they correspond to each other by index.

# Title
article_title = driver.find_elements_by_xpath("//div[@class='article_title']")
for title in article_title:
    con = title.text
    con = con.strip("\n")
    print con + '\n'

# Abstract
article_description = driver.find_elements_by_xpath("//div[@class='article_description']")
for description in article_description:
    con = description.text
    con = con.strip("\n")
    print con + '\n'

# Information
article_manage = driver.find_elements_by_xpath("//div[@class='article_manage']")
for manage in article_manage:
    con = manage.text
    con = con.strip("\n")
    print con + '\n'

num = 0
print u'length', len(article_title)
while num < len(article_title):
    Artitle = article_title[num].text
    Description = article_description[num].text
    Manage = article_manage[num].text
    print Artitle, Description, Manage
    num = num + 1

4. Special string processing
The following code gets the blogger's name from the end of the URL, extracts the time from the string, and gets the number of reads:

# Get the blogger's name from the end of the URL
url = "http://blog.csdn.net/Eastmount"
print url.split('/')[-1]
# Output: Eastmount

# Get the numbers from the string
name = "2015-09-08 18:06 Read(909) Comment(0)"
print name
import re
mode = re.compile(r'\d+\.?\d*')
print mode.findall(name)
# Output: ['2015', '09', '08', '18', '06', '909', '0']
print mode.findall(name)[-2]
# Output: 909

# Get the time
end = name.find(r'Read')
print name[:end]
# Output: 2015-09-08 18:06

import time, datetime
a = time.strptime(name[:end].strip(), '%Y-%m-%d %H:%M')
print a
# Output: time.struct_time(tm_year=2015, tm_mon=9, tm_mday=8, tm_hour=18, tm_min=6,
#         tm_sec=0, tm_wday=1, tm_yday=251, tm_isdst=-1)
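If you want to store FBTime in the DATETIME column as an actual date value rather than the raw string, a minimal sketch (assuming the same date format as above) is:

# coding: utf-8
import datetime

manage = "2015-09-08 18:06 Read(909) Comment(0)"
end = manage.find('Read')
# Parse the date part into a datetime object; MySQLdb can insert this
# directly into a DATETIME column
fbtime = datetime.datetime.strptime(manage[:end].strip(), '%Y-%m-%d %H:%M')
print fbtime   # 2015-09-08 18:06:00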

III. Database operations
The code for creating a table using an SQL statement is as follows:

CREATE TABLE `csdn` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `URL` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `Author` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT 'author',
  `Artitle` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'title',
  `Description` varchar(400) COLLATE utf8_bin DEFAULT NULL COMMENT 'summary',
  `Manage` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'information',
  `FBTime` datetime DEFAULT NULL COMMENT 'publication date',
  `YDNum` int(11) DEFAULT NULL COMMENT 'number of reads',
  `PLNum` int(11) DEFAULT NULL COMMENT 'number of comments',
  `DZNum` int(11) DEFAULT NULL COMMENT 'number of likes',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=9371 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

The created table is shown in the figure:

When calling MySQL from Python, the following article is recommended:
Python Topic 9. Basic knowledge of MySQL database programming
The core code is as follows:

# coding: utf-8
import MySQLdb

try:
    conn = MySQLdb.connect(host='localhost', user='root',
                           passwd='000000', port=123456, db='test01')
    cur = conn.cursor()

    # Insert data
    sql = '''insert into student values(%s, %s, %s)'''
    cur.execute(sql, ('yz', '200', '10'))

    # View data
    print u'\nInserted data:'
    cur.execute('select * from student')
    for data in cur.fetchall():
        print '%s %s %s' % data
    cur.close()
    conn.commit()
    conn.close()
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])

Note: during crawling, some blogs use the new version of the page, and the total page number cannot be obtained.
For example: http://blog.csdn.net/michaelzhou224
In this case, simply skip these links and save them to a file. The core code is as follows:

# Get the total page number from the pager at the bottom of each blogger's blog page
def getPage():
    print 'getPage'
    number = 0
    # texts = driver.find_element_by_xpath("//div[@id='papelist']").text
    texts = driver.find_element_by_xpath("//div[@class='pagelist']").text
    print 'testsss'
    print u'Page text: ', texts
    if texts == "":
        print u'Page number is 0, website error'
        return 0
    m = re.findall(r'(\w*[0-9]+)\w*', texts)  # regular expression to find the numbers
    print u'Pages: ' + str(m[1])
    return int(m[1])

Modify the main function:

error = codecs.open("Blog_Error.txt", 'a', 'utf-8')

# Loop over the bloggers and extract the information of each one
while n < count:  # only a couple of blogs are crawled here; normally count covers all bloggers
    url = urlfile.readline()
    url = url.strip("\n")
    print url
    driver.get(url + "/article/list/1")
    # print driver.page_source
    # Get the total page number
    allPage = getPage()
    print u'Total pages: ', allPage
    # If 0 is returned, record the error URL; otherwise the program would break here
    if allPage == 0:
        error.write(url + "\r\n")
        print u'Error URL', url
        continue  # skip to the next blogger
    time.sleep(2)
    # Database operations
    try:
        .....

Finally, I hope this article helps you. If there are errors or deficiencies in it, please bear with me~
