An earlier article covered how to crawl CSDN blog summaries and related information. Typically, data crawled with a Selenium crawler is stored in a txt file, which makes further processing and analysis difficult. This article crawls my personal blog information with Selenium and then stores it in a MySQL database so the data can be analyzed, for example which time periods I published the most posts, a WordCloud analysis of the topics, and a ranking of articles by read count.
This is a basic article; I hope it helps you. If there are errors or shortcomings in it, please forgive me. The next article will briefly explain the data analysis process.
I. Crawling Results
The address crawled is: http://blog.csdn.net/Eastmount
The results of crawling and storing to the MySQL database are as follows:
The running process is as follows:
II. Complete Code Analysis
The complete code looks like this:
# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
import re
import time
import os
import codecs
import MySQLdb

# Open the Firefox browser and set the wait/load time
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Get the total page count shown at the bottom of each blogger's blog page
def getPage():
    print 'getPage'
    number = 0
    texts = driver.find_element_by_xpath("//div[@id='papelist']").text
    print 'page number', texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)  # regular expression to find the numbers
    print 'page: ' + str(m[1])
    return int(m[1])

# Main function
def main():
    # Get the total number of lines in the txt file
    count = len(open("Blog_url.txt", "rU").readlines())
    print count
    n = 0
    urlfile = open("Blog_url.txt", 'r')
    # Loop over each blogger and crawl the article information
    while n < count:  # 2 bloggers are crawled here; normally it is count bloggers
        url = urlfile.readline()
        url = url.strip("\n")
        print url
        driver.get(url)
        # Get the total page count
        allPage = getPage()
        print u'Total number of pages: ', allPage
        time.sleep(2)
        # Database operations
        try:
            conn = MySQLdb.connect(host='localhost', user='root',
                                   passwd='123456', port=3306, db='test01')
            cur = conn.cursor()  # database cursor
            # Fixes: UnicodeEncodeError: 'latin-1' codec can't encode character
            conn.set_character_set('utf8')
            cur.execute('SET NAMES utf8;')
            cur.execute('SET CHARACTER SET utf8;')
            cur.execute('SET character_set_connection=utf8;')

            # Process the contents of each page
            m = 1  # page 1
            while m <= allPage:
                ur = url + "/article/list/" + str(m)
                print ur
                driver.get(ur)

                # Title
                article_title = driver.find_elements_by_xpath("//div[@class='article_title']")
                for title in article_title:
                    # print url
                    con = title.text
                    con = con.strip("\n")
                    # print con + '\n'

                # Summary
                article_description = driver.find_elements_by_xpath("//div[@class='article_description']")
                for description in article_description:
                    con = description.text
                    con = con.strip("\n")
                    # print con + '\n'

                # Information (publish time, read count, comment count)
                article_manage = driver.find_elements_by_xpath("//div[@class='article_manage']")
                for manage in article_manage:
                    con = manage.text
                    con = con.strip("\n")
                    # print con + '\n'

                num = 0
                print u'length', len(article_title)
                while num < len(article_title):
                    # Insert data: 8 values
                    sql = '''insert into csdn_blog
                             (URL, Author, Artitle, Description, Manage, FBTime, YDNum, PLNum)
                             values(%s, %s, %s, %s, %s, %s, %s, %s)'''
                    Artitle = article_title[num].text
                    Description = article_description[num].text
                    Manage = article_manage[num].text
                    print Artitle
                    print Description
                    print Manage
                    # Get the author
                    Author = url.split('/')[-1]
                    # Get the read count and comment count
                    mode = re.compile(r'\d+\.?\d*')
                    YDNum = mode.findall(Manage)[-2]
                    PLNum = mode.findall(Manage)[-1]
                    print YDNum
                    print PLNum
                    # Get the publish time (u'阅读' is the Chinese label for "Read" on the page)
                    end = Manage.find(u'阅读')
                    FBTime = Manage[:end]
                    cur.execute(sql, (url, Author, Artitle, Description,
                                      Manage, FBTime, YDNum, PLNum))
                    num = num + 1
                else:
                    print u'Database insert succeeded'
                m = m + 1

        # Exception handling
        except MySQLdb.Error, e:
            print "Mysql Error %d: %s" % (e.args[0], e.args[1])
        finally:
            cur.close()
            conn.commit()
            conn.close()
        n = n + 1
    else:
        urlfile.close()
        print 'Load over'

main()
Put the blog URLs of the users you want to crawl into the Blog_url.txt file. Note that the author previously wrote code to collect the URLs of all CSDN experts; it is omitted here to avoid hitting other people's blogs and inflating their read counts.
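For example, to crawl only my own blog, Blog_url.txt would contain a single line:

http://blog.csdn.net/Eastmount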
The analysis process is as follows.
1. Get the total number of pages
First read the blogger's address from Blog_url.txt, then open the page and get the total number of pages. The code is as follows:
# Get the total page count shown at the bottom of each blogger's blog page
def getPage():
    print 'getPage'
    number = 0
    texts = driver.find_element_by_xpath("//div[@id='papelist']").text
    print 'page', texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)  # regular expression to find the numbers
    print 'Pages: ' + str(m[1])
    return int(m[1])
For example, for my blog it gets a total of 17 pages.
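For reference, here is a small sketch of how getPage() extracts that number. The sample pager text below is an assumption (the real text on the page is Chinese); the code assumes the page count is the second number that appears in it:

import re

texts = '235 articles, 17 pages in total'    # hypothetical pager text
m = re.findall(r'(\w*[0-9]+)\w*', texts)     # all numbers in the text
print m          # ['235', '17']
print int(m[1])  # 17 -- the total page count returned by getPage()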
2. Page-turning DOM tree analysis
The blog here is paged by changing the URL directly, which is quite convenient.
For example: http://blog.csdn.net/Eastmount/article/list/2
Therefore, you only need to: 1. get the total page count; 2. crawl the information on each page; 3. set the URL to loop through the pages; 4. crawl the next page.
Alternatively, you can click "Next Page" to jump; when there is no "Next Page" button left, stop jumping, finish this blogger, and then crawl the next blogger.
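A minimal sketch of that click-based alternative, reusing the driver from the complete code above (it assumes the pager link text is the Chinese label u'下一页', i.e. "Next Page"):

import time

while True:
    # ... crawl the information on the current page here ...
    try:
        next_link = driver.find_element_by_link_text(u'下一页')  # "Next Page"
        next_link.click()
        time.sleep(2)
    except Exception:
        break  # no "Next Page" link left, so this blogger is finished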
3. Get the details: title, summary, and time
Then inspect the elements on each blog page; if you crawl with BeautifulSoup instead, you will get a "Forbidden" error.
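For illustration, a minimal sketch of the kind of plain request that triggers it (this is an assumption about CSDN's behaviour at the time: requests without real browser headers were answered with HTTP 403, which is why Selenium driving a real browser is used instead):

import urllib2

try:
    html = urllib2.urlopen("http://blog.csdn.net/Eastmount").read()
except urllib2.HTTPError, e:
    print e.code  # typically 403 (Forbidden)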
You will find that each article is wrapped in a <div></div>, as shown below, so you only need to locate that element.
Once located, the content can be crawled; here the title, summary, and time need to be located separately.
The code is shown below. Note that the three values are fetched in the same while loop, so they correspond to each other one-to-one.
# Title
article_title = driver.find_elements_by_xpath("//div[@class='article_title']")
for title in article_title:
    con = title.text
    con = con.strip("\n")
    print con + '\n'

# Summary
article_description = driver.find_elements_by_xpath("//div[@class='article_description']")
for description in article_description:
    con = description.text
    con = con.strip("\n")
    print con + '\n'

# Information
article_manage = driver.find_elements_by_xpath("//div[@class='article_manage']")
for manage in article_manage:
    con = manage.text
    con = con.strip("\n")
    print con + '\n'

num = 0
print u'length', len(article_title)
while num < len(article_title):
    Artitle = article_title[num].text
    Description = article_description[num].text
    Manage = article_manage[num].text
    print Artitle, Description, Manage
    num = num + 1
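An equivalent and slightly more idiomatic way to pair the three element lists is zip, which removes the manual num counter; a minimal sketch:

# Pair title, summary and info elements by position with zip
for title, description, manage in zip(article_title, article_description, article_manage):
    Artitle = title.text.strip("\n")
    Description = description.text.strip("\n")
    Manage = manage.text.strip("\n")
    print Artitle, Description, Manage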
4. Special string handling
Get the blogger's name from the part of the URL after the last '/', and extract the numbers and the time from the info string. The code is as follows:
# Get the blogger's name from the URL
url = "http://blog.csdn.net/Eastmount"
print url.split('/')[-1]
# Output: Eastmount

# Get the numbers from the info string
# ('阅读' and '评论' are the Chinese labels for "Read" and "Comments" on the page)
name = "2015-09-08 18:06 阅读(909) 评论(0)"
print name
import re
mode = re.compile(r'\d+\.?\d*')
print mode.findall(name)
# Output: ['2015', '09', '08', '18', '06', '909', '0']
print mode.findall(name)[-2]
# Output: 909

# Get the time
end = name.find('阅读')
print name[:end]
# Output: 2015-09-08 18:06
import time, datetime
a = time.strptime(name[:end].strip(), '%Y-%m-%d %H:%M')  # strip the trailing space before parsing
print a
# Output: time.struct_time(tm_year=2015, tm_mon=9, tm_mday=8, tm_hour=18, tm_min=6,
#         tm_sec=0, tm_wday=1, tm_yday=251, tm_isdst=-1)
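Since the table stores YDNum/PLNum as int columns and FBTime as datetime, it can be cleaner to convert the extracted strings before the INSERT. A minimal sketch using the variables from the complete code above (MySQL will also coerce the raw strings, so this step is optional):

import datetime

YDNum = int(mode.findall(Manage)[-2])    # read count as int
PLNum = int(mode.findall(Manage)[-1])    # comment count as int
FBTime = datetime.datetime.strptime(Manage[:end].strip(), '%Y-%m-%d %H:%M')
cur.execute(sql, (url, Author, Artitle, Description, Manage, FBTime, YDNum, PLNum))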
III. Database-Related Operations
The SQL statement for creating the table is as follows (adjust the varchar lengths to suit your data):
CREATE TABLE `csdn` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `URL` varchar(100) COLLATE utf8_bin DEFAULT NULL,
  `Author` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'author',
  `Artitle` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'title',
  `Description` varchar(400) COLLATE utf8_bin DEFAULT NULL COMMENT 'summary',
  `Manage` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT 'info',
  `FBTime` datetime DEFAULT NULL COMMENT 'publish time',
  `YDNum` int(11) DEFAULT NULL COMMENT 'read count',
  `PLNum` int(11) DEFAULT NULL COMMENT 'comment count',
  `DZNum` int(11) DEFAULT NULL COMMENT 'like count',
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=9371 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
The created table looks as shown below:
For calling MySQL from Python, I recommend reading the following article:
[Python] Topic 9. MySQL Database Programming Basics
The core code is as follows:
# coding:utf-8
import MySQLdb

try:
    conn = MySQLdb.connect(host='localhost', user='root', passwd='123456',
                           port=3306, db='test01')
    cur = conn.cursor()

    # Insert data
    sql = '''insert into student values(%s, %s, %s)'''
    cur.execute(sql, ('yxz', '111111', '10'))

    # View the data
    print u'\nInserted data:'
    cur.execute('select * from student')
    for data in cur.fetchall():
        print '%s %s %s' % data
    cur.close()
    conn.commit()
    conn.close()
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
Finally, I hope this article is helpful to you. If there are errors or shortcomings in it, please forgive me~
Improve efficiency, advance research, teach earnestly, and live a beautiful life.
(By: Eastmount 2017-03-13 1:30 p.m. http://blog.csdn.net/eastmount/)
[Python crawler] Selenium crawls content and saves it to a MySQL database