Python crawler: crawling a dynamic web page and storing the data in a MySQL database

Overview

The following code is a Python web crawler that scrapes the dynamic page http://hb.qq.com/baoliao/. The "latest" and "featured" content on this page is generated dynamically by JavaScript, so what you see when inspecting page elements in the browser differs from the raw page source.

[Screenshot: the raw page source]

[Screenshot: the inspected page elements, showing the JavaScript-rendered content]

Therefore, it is not practical to extract this content from the page source with regular expressions alone; we need the URL that the JavaScript actually requests.

Below is the complete approach, with source code, for fetching the content and storing it in the database.

Implementation approach:

Find the URL that the dynamic page actually requests – use regular expressions to extract what you need – parse the content – store the content

Each step of the process, explained:

Fetch the URL of the dynamic page that is actually accessed:

In Firefox, right-click the page and choose **Inspect Element with Firebug** (install the Firebug plugin first if you don't have it), then open the **Net (Network)** tab. Reload the page to capture the requests it makes, including their URLs; each URL can be opened directly in the browser. For this site, the dynamic data is served from: http://baoliao.hb.qq.com/api/report/NewIndexReportsList/cityid/18/num/20/pageno/1?callback=jquery183019859437816181613_1440723895018&_=1440723895472
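Once you have this URL, it can be fetched like any static resource. A minimal sketch (Python 2.7, using the same urllib2 approach and User-Agent header as the full source below):

# -*- coding: utf-8 -*-
import urllib2

url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/cityid/18"
       "/num/20/pageno/1?callback=jquery183019859437816181613_1440723895018"
       "&_=1440723895472")

# A browser-like User-Agent keeps the server from rejecting the script
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                        "Gecko/20100101 Firefox/40.0"}
request = urllib2.Request(url, headers=header)
response = urllib2.urlopen(request)
html = response.read()
print html[:200]  # the response is JSONP: jquery...({...json...})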


Regular Expressions:

There are two ways to use regular expressions here; for a brief introduction, see the author's earlier post, "Python implements simple crawlers and regular expressions".
More detail is available online; search for the keywords: regular expression python
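As a concrete illustration, the response from the URL above is JSONP (JSON wrapped in a jQuery callback), and a single capture group strips the wrapper. A minimal sketch using the same pattern as the full source below, run against a made-up response:

# -*- coding: utf-8 -*-
import re

# A JSONP response looks like: jquery1830...({...json...})
jsonp = 'jquery183019859437816181613_1440723895018({"data":{"list":[]}})'

# Capture everything between the outermost parentheses
reg_jason = r'.*?jquery.*?\((.*)\)'
match = re.findall(reg_jason, jsonp, re.S)
print match[0]  # -> {"data":{"list":[]}}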

JSON:

For an introduction to JSON, search the web for the keywords: json python
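Once the wrapper is stripped, json.loads turns the payload into a plain Python dict. A minimal sketch, using a made-up record in the same shape as this API's response:

# -*- coding: utf-8 -*-
import json

# Hypothetical payload in the same shape as the API's response
payload = '''{"data": {"list": [
    {"create_time": "2015-08-28", "title": "example",
     "content": "example text", "counter_clicks": "12"}
]}}'''

data = json.loads(payload)
for item in data['data']['list']:
    print item['create_time'], item['title'], item['counter_clicks']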

Storing to the database:

For usage, search the web for the keywords: 1. mysql 2. mysql python
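A minimal sketch of the storage step with MySQLdb, using the same example host, credentials, and table name as the full source below; the parameterized VALUES (%s, %s, %s, %s) form lets the driver escape each value:

# -*- coding: utf-8 -*-
import MySQLdb

# Example connection values from this article; replace with your own
db = MySQLdb.connect("192.168.85.21", "root", "123456", "test1", charset='utf8')
cursor = db.cursor()
# Column sizes here are illustrative
cursor.execute("CREATE TABLE IF NOT EXISTS yh1 "
               "(time VARCHAR(40), title VARCHAR(100), "
               "text VARCHAR(1000), clicks VARCHAR(10))")
# Parameterized insert: the driver quotes and escapes each value
cursor.execute("INSERT INTO yh1 (time, title, text, clicks) "
               "VALUES (%s, %s, %s, %s)",
               ("2015-08-28", "example", "example text", "12"))
db.commit()
db.close()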

Source code and comments

Note: the Python version used is 2.7.

#!/usr/bin/python
# Specify the encoding
# -*- coding: utf-8 -*-

# Import the Python libraries we need
import urllib2
import re
import MySQLdb
import json

# Define the crawler class
class crawl1:
    def getHtml(self, url=None):
        # Browser-like User-Agent so the server serves the normal response
        user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                      "Gecko/20100101 Firefox/40.0")
        header = {"User-Agent": user_agent}
        request = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(request)
        html = response.read()
        return html

    def getContent(self, html, reg):
        # Apply a regular expression to the page and return all matches
        content = re.findall(reg, html, re.S)
        return content

    # Connect to the MySQL database
    def connectDB(self):
        host = "192.168.85.21"
        dbName = "test1"
        user = "root"
        password = "123456"
        # charset='utf8' is added here so Chinese displays correctly;
        # it must match the encoding of the database
        db = MySQLdb.connect(host, user, password, dbName, charset='utf8')
        return db

    # Create a table. CREATE TABLE IF NOT EXISTS creates the table
    # createTableName only when it does not already exist
    def creatTable(self, createTableName):
        # Column sizes are illustrative
        createTableSql = ("CREATE TABLE IF NOT EXISTS " + createTableName +
                          " (time VARCHAR(40), title VARCHAR(100), "
                          "text VARCHAR(1000), clicks VARCHAR(10))")
        DB_create = self.connectDB()
        cursor_create = DB_create.cursor()
        cursor_create.execute(createTableSql)
        DB_create.close()
        print 'create table ' + createTableName + ' successfully'
        return createTableName

    # Insert data into the table
    def inserttable(self, insertTable, insertTime, insertTitle,
                    insertText, insertClicks):
        # Parameterized values: the driver quotes and escapes them safely
        insertContentSql = ("INSERT INTO " + insertTable +
                            " (time, title, text, clicks) "
                            "VALUES (%s, %s, %s, %s)")
        DB_insert = self.connectDB()
        cursor_insert = DB_insert.cursor()
        cursor_insert.execute(insertContentSql,
                              (insertTime, insertTitle,
                               insertText, insertClicks))
        DB_insert.commit()
        DB_insert.close()
        print 'insert contents to ' + insertTable + ' successfully'


url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/cityid/18"
       "/num/20/pageno/1?callback=jquery183019859437816181613_1440723895018"
       "&_=1440723895472")

# Regular expressions to get the JSONP payload and, per field,
# the time, title, text content, and clicks (views)
reg_jason = r'.*?jquery.*?\((.*)\)'
reg_time = r'.*?"create_time":"(.*?)"'
reg_title = r'.*?"title":"(.*?)".*?'
reg_text = r'.*?"content":"(.*?)".*?'
reg_clicks = r'.*?"counter_clicks":"(.*?)"'

# Instantiate the crawl1() object
crawl = crawl1()
html = crawl.getHtml(url)
html_jason = re.findall(reg_jason, html, re.S)

html_need = json.loads(html_jason[0])

print len(html_need)
print len(html_need['data']['list'])

table = crawl.creatTable('yh1')
for i in range(len(html_need['data']['list'])):
    creatTime = html_need['data']['list'][i]['create_time']
    title = html_need['data']['list'][i]['title']
    content = html_need['data']['list'][i]['content']
    clicks = html_need['data']['list'][i]['counter_clicks']
    crawl.inserttable(table, creatTime, title, content, clicks)
