Overview
The following is a Python web crawler that scrapes the dynamic page http://hb.qq.com/baoliao/. The "latest" and "featured" content on this page is generated dynamically by JavaScript, so what you see when inspecting the page elements differs from the raw page source.
(Screenshot: the raw web page source.)
(Screenshot: the same content as shown by the element inspector.)
Consequently, regular expressions applied to the raw page source cannot extract this content directly.
The approach and full source code for fetching the content and storing it in a database follow.
Implementation approach:
Find the URL the page actually requests for its dynamic data – use a regular expression to extract the JSON payload – parse the content – store the content in the database.
Each step is explained below:
Finding the URL actually requested for the dynamic data:
In Firefox, right-click the page and choose **Inspect Element with Firebug** (install the Firebug plugin if you don't have it), then open the **Net** tab. Reload the page to see the response information for every request it makes, including the request URLs; each URL can also be opened directly in the browser. For this site, the dynamic data is served from: http://baoliao.hb.qq.com/api/report/NewIndexReportsList/cityid/18/num/20/pageno/1?callback=jquery183019859437816181613_1440723895018&_=1440723895472
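Once the real request URL is known, it can be downloaded like any static page, provided a browser-like User-Agent header is sent. The snippet below sketches building such a request with Python 3's urllib.request (the article's own listing uses Python 2's urllib2) without actually hitting the network:

```python
import urllib.request

# Pretend to be Firefox so the server returns the normal response.
user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
              "Gecko/20100101 Firefox/40.0")
url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/"
       "cityid/18/num/20/pageno/1"
       "?callback=jquery183019859437816181613_1440723895018&_=1440723895472")

request = urllib.request.Request(url, headers={"User-Agent": user_agent})

# The actual download would then be:
#   html = urllib.request.urlopen(request).read()
print(request.get_header("User-agent"))
```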
Regular expressions:
There are two common ways to use regular expressions in Python; for a brief introduction, see the author's earlier post on implementing a simple crawler with regular expressions. For more detail, search the web for the keywords: regular expression python.
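As a sketch of this step: the response from the URL above is JSONP, i.e. JSON wrapped in a callback call, and the regular expression used later in the article strips that wrapper. The sample payload below is invented for illustration; the pattern is the one from the article's code:

```python
import json
import re

# Invented sample JSONP response: the JSON payload is wrapped
# in a jQuery callback call, as the site's API does.
jsonp = ('jquery183019859437816181613_1440723895018'
         '({"data": {"list": [{"title": "demo"}]}})')

# Capture everything between the callback's parentheses.
reg_jason = r'.*?jquery.*?\((.*)\)'
payload = re.findall(reg_jason, jsonp, re.S)[0]

data = json.loads(payload)
print(data['data']['list'][0]['title'])
```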
JSON:
For an introduction to handling JSON in Python, search the web for the keywords: json python.
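To illustrate the parsing step, here is how the fields the crawler later stores (create_time, title, content, counter_clicks) can be pulled out of a parsed payload. The sample record is invented, but its key names mirror those used in the article's code:

```python
import json

# Invented sample mirroring the shape of the site's API response:
# a top-level "data" object holding a "list" of report records.
raw = '''
{"data": {"list": [
    {"create_time": "2015-08-28 10:00",
     "title": "sample title",
     "content": "sample body",
     "counter_clicks": "42"}
]}}
'''

reports = json.loads(raw)['data']['list']
rows = [(r['create_time'], r['title'], r['content'], r['counter_clicks'])
        for r in reports]
print(rows[0][1])
```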
Storing to the database:
For how to use MySQL from Python, search the web for the keywords: 1. mysql 2. mysql python.
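The storage step boils down to a CREATE TABLE IF NOT EXISTS followed by parameterized INSERTs. The sketch below uses the standard-library sqlite3 module purely for illustration (the article itself uses MySQLdb against a MySQL server); the table and column names match the article's code, and note that MySQLdb uses %s placeholders where sqlite3 uses ?:

```python
import sqlite3

# An in-memory database stands in for the article's MySQL server.
db = sqlite3.connect(':memory:')
cur = db.cursor()

# Same schema idea as the article's yh1 table.
cur.execute("CREATE TABLE IF NOT EXISTS yh1 "
            "(time VARCHAR(40), title VARCHAR(100), "
            "text VARCHAR(1000), clicks VARCHAR(10))")

# Parameterized INSERT: placeholders handle quoting and escaping safely.
row = ("2015-08-28 10:00", "sample title", "sample body", "42")
cur.execute("INSERT INTO yh1 (time, title, text, clicks) "
            "VALUES (?, ?, ?, ?)", row)
db.commit()

print(cur.execute("SELECT COUNT(*) FROM yh1").fetchone()[0])
```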
Source code with comments
Note: the code targets Python 2.7.
```python
#!/usr/bin/python
# Specify the encoding
# -*- coding: utf-8 -*-

# Import the Python libraries used below
import urllib
import urllib2
import re
import MySQLdb
import json


# Define the crawler
class crawl1:

    def getHtml(self, url=None):
        # Send a browser-like User-Agent header
        user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                      "Gecko/20100101 Firefox/40.0")
        header = {"User-Agent": user_agent}
        request = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(request)
        html = response.read()
        return html

    def getContent(self, html, reg):
        # Note: re.findall takes the pattern first, then the string
        content = re.findall(reg, html, re.S)
        return content

    # Connect to the MySQL database
    def connectDB(self):
        host = "192.168.85.21"
        dbName = "test1"
        user = "root"
        password = "123456"
        # charset='utf8' is added here so Chinese displays correctly in the
        # database; it must match the encoding of the database itself
        db = MySQLdb.connect(host, user, password, dbName, charset='utf8')
        return db

    # Create a table. CREATE TABLE IF NOT EXISTS only creates the table
    # createTableName when it does not already exist
    def creatTable(self, createTableName):
        # The column lengths were garbled in the original post;
        # the values below are assumed
        createTableSql = ("CREATE TABLE IF NOT EXISTS " + createTableName +
                          " (time VARCHAR(40), title VARCHAR(100), "
                          "text VARCHAR(1000), clicks VARCHAR(10))")
        db_create = self.connectDB()
        cursor_create = db_create.cursor()
        cursor_create.execute(createTableSql)
        db_create.close()
        print 'create table ' + createTableName + ' successfully'
        return createTableName

    # Insert data into the table
    def insertTable(self, insertTable, insertTime, insertTitle,
                    insertText, insertClicks):
        insertContentSql = ("INSERT INTO " + insertTable +
                            " (time, title, text, clicks) "
                            "VALUES (%s, %s, %s, %s)")
        db_insert = self.connectDB()
        cursor_insert = db_insert.cursor()
        cursor_insert.execute(insertContentSql,
                              (insertTime, insertTitle,
                               insertText, insertClicks))
        db_insert.commit()
        db_insert.close()
        print 'insert contents to ' + insertTable + ' successfully'


url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/"
       "cityid/18/num/20/pageno/1"
       "?callback=jquery183019859437816181613_1440723895018&_=1440723895472")

# Regular expressions to get the JSON payload, time, title,
# text content and clicks (views)
reg_jason = r'.*?jquery.*?\((.*)\)'
reg_time = r'.*?"create_time":"(.*?)"'
reg_title = r'.*?"title":"(.*?)".*?'
reg_text = r'.*?"content":"(.*?)".*?'
reg_clicks = r'.*?"counter_clicks":"(.*?)"'

# Instantiate a crawl1() object
crawl = crawl1()
html = crawl.getHtml(url)
html_jason = re.findall(reg_jason, html, re.S)
html_need = json.loads(html_jason[0])

print len(html_need)
print len(html_need['data']['list'])

table = crawl.creatTable('yh1')
for i in range(len(html_need['data']['list'])):
    creatTime = html_need['data']['list'][i]['create_time']
    title = html_need['data']['list'][i]['title']
    content = html_need['data']['list'][i]['content']
    clicks = html_need['data']['list'][i]['counter_clicks']
    crawl.insertTable(table, creatTime, title, content, clicks)
```
Python crawler crawls Dynamic Web pages and stores data in MySQL database