Overview
The following is a Python web crawler that scrapes the dynamic page http://hb.qq.com/baoliao/. The "latest" and "featured" content on this page is generated dynamically by JavaScript, so what you see when inspecting the page elements differs from the raw page source.
(Screenshot: the raw web page source.)
(Screenshot: the same content as shown by the element inspector.)
Consequently, regular expressions applied to the raw page source cannot extract this content directly.
The approach and full source code for fetching the content and storing it in a database follow.
Implementation approach:
Find the URL the page actually requests for its dynamic data – use a regular expression to extract the JSON payload – parse the content – store the content in the database.
Each step is explained below:
Finding the URL actually requested for the dynamic data:
In Firefox, right-click the page and choose **Inspect Element with Firebug** (install the Firebug plugin if you don't have it), then open the **Net** tab. Reload the page to see the response information for every request it makes, including the request URLs; each URL can also be opened directly in the browser. For this site, the dynamic data is served from: http://baoliao.hb.qq.com/api/report/NewIndexReportsList/cityid/18/num/20/pageno/1?callback=jquery183019859437816181613_1440723895018&_=1440723895472
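Once the real request URL is known, it can be downloaded like any static page, provided a browser-like User-Agent header is sent. The snippet below sketches building such a request with Python 3's urllib.request (the article's own listing uses Python 2's urllib2) without actually hitting the network:

```python
import urllib.request

# Pretend to be Firefox so the server returns the normal response.
user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
              "Gecko/20100101 Firefox/40.0")
url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/"
       "cityid/18/num/20/pageno/1"
       "?callback=jquery183019859437816181613_1440723895018&_=1440723895472")

request = urllib.request.Request(url, headers={"User-Agent": user_agent})

# The actual download would then be:
#   html = urllib.request.urlopen(request).read()
print(request.get_header("User-agent"))
```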
Regular expressions:
There are two common ways to use regular expressions in Python; for a brief introduction, see the author's earlier post on implementing a simple crawler with regular expressions. For more detail, search the web for the keywords: regular expression python.
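As a sketch of this step: the response from the URL above is JSONP, i.e. JSON wrapped in a callback call, and the regular expression used later in the article strips that wrapper. The sample payload below is invented for illustration; the pattern is the one from the article's code:

```python
import json
import re

# Invented sample JSONP response: the JSON payload is wrapped
# in a jQuery callback call, as the site's API does.
jsonp = ('jquery183019859437816181613_1440723895018'
         '({"data": {"list": [{"title": "demo"}]}})')

# Capture everything between the callback's parentheses.
reg_jason = r'.*?jquery.*?\((.*)\)'
payload = re.findall(reg_jason, jsonp, re.S)[0]

data = json.loads(payload)
print(data['data']['list'][0]['title'])
```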
JSON:
For an introduction to handling JSON in Python, search the web for the keywords: json python.
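To illustrate the parsing step, here is how the fields the crawler later stores (create_time, title, content, counter_clicks) can be pulled out of a parsed payload. The sample record is invented, but its key names mirror those used in the article's code:

```python
import json

# Invented sample mirroring the shape of the site's API response:
# a top-level "data" object holding a "list" of report records.
raw = '''
{"data": {"list": [
    {"create_time": "2015-08-28 10:00",
     "title": "sample title",
     "content": "sample body",
     "counter_clicks": "42"}
]}}
'''

reports = json.loads(raw)['data']['list']
rows = [(r['create_time'], r['title'], r['content'], r['counter_clicks'])
        for r in reports]
print(rows[0][1])
```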
Storing to the database:
For how to use MySQL from Python, search the web for the keywords: 1. mysql 2. mysql python.
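The storage step boils down to a CREATE TABLE IF NOT EXISTS followed by parameterized INSERTs. The sketch below uses the standard-library sqlite3 module purely for illustration (the article itself uses MySQLdb against a MySQL server); the table and column names match the article's code, and note that MySQLdb uses %s placeholders where sqlite3 uses ?:

```python
import sqlite3

# An in-memory database stands in for the article's MySQL server.
db = sqlite3.connect(':memory:')
cur = db.cursor()

# Same schema idea as the article's yh1 table.
cur.execute("CREATE TABLE IF NOT EXISTS yh1 "
            "(time VARCHAR(40), title VARCHAR(100), "
            "text VARCHAR(1000), clicks VARCHAR(10))")

# Parameterized INSERT: placeholders handle quoting and escaping safely.
row = ("2015-08-28 10:00", "sample title", "sample body", "42")
cur.execute("INSERT INTO yh1 (time, title, text, clicks) "
            "VALUES (?, ?, ?, ?)", row)
db.commit()

print(cur.execute("SELECT COUNT(*) FROM yh1").fetchone()[0])
```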
Source code with comments
Note: the code targets Python 2.7.
```python
#!/usr/bin/python
# Specify the encoding
# -*- coding: utf-8 -*-

# Import the Python libraries used below
import urllib
import urllib2
import re
import MySQLdb
import json


# Define the crawler
class crawl1:

    def getHtml(self, url=None):
        # Send a browser-like User-Agent header
        user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) "
                      "Gecko/20100101 Firefox/40.0")
        header = {"User-Agent": user_agent}
        request = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(request)
        html = response.read()
        return html

    def getContent(self, html, reg):
        # Note: re.findall takes the pattern first, then the string
        content = re.findall(reg, html, re.S)
        return content

    # Connect to the MySQL database
    def connectDB(self):
        host = "192.168.85.21"
        dbName = "test1"
        user = "root"
        password = "123456"
        # charset='utf8' is added here so Chinese displays correctly in the
        # database; it must match the encoding of the database itself
        db = MySQLdb.connect(host, user, password, dbName, charset='utf8')
        return db

    # Create a table. CREATE TABLE IF NOT EXISTS only creates the table
    # createTableName when it does not already exist
    def creatTable(self, createTableName):
        # The column lengths were garbled in the original post;
        # the values below are assumed
        createTableSql = ("CREATE TABLE IF NOT EXISTS " + createTableName +
                          " (time VARCHAR(40), title VARCHAR(100), "
                          "text VARCHAR(1000), clicks VARCHAR(10))")
        db_create = self.connectDB()
        cursor_create = db_create.cursor()
        cursor_create.execute(createTableSql)
        db_create.close()
        print 'create table ' + createTableName + ' successfully'
        return createTableName

    # Insert data into the table
    def insertTable(self, insertTable, insertTime, insertTitle,
                    insertText, insertClicks):
        insertContentSql = ("INSERT INTO " + insertTable +
                            " (time, title, text, clicks) "
                            "VALUES (%s, %s, %s, %s)")
        db_insert = self.connectDB()
        cursor_insert = db_insert.cursor()
        cursor_insert.execute(insertContentSql,
                              (insertTime, insertTitle,
                               insertText, insertClicks))
        db_insert.commit()
        db_insert.close()
        print 'insert contents to ' + insertTable + ' successfully'


url = ("http://baoliao.hb.qq.com/api/report/NewIndexReportsList/"
       "cityid/18/num/20/pageno/1"
       "?callback=jquery183019859437816181613_1440723895018&_=1440723895472")

# Regular expressions to get the JSON payload, time, title,
# text content and clicks (views)
reg_jason = r'.*?jquery.*?\((.*)\)'
reg_time = r'.*?"create_time":"(.*?)"'
reg_title = r'.*?"title":"(.*?)".*?'
reg_text = r'.*?"content":"(.*?)".*?'
reg_clicks = r'.*?"counter_clicks":"(.*?)"'

# Instantiate a crawl1() object
crawl = crawl1()
html = crawl.getHtml(url)
html_jason = re.findall(reg_jason, html, re.S)
html_need = json.loads(html_jason[0])

print len(html_need)
print len(html_need['data']['list'])

table = crawl.creatTable('yh1')
for i in range(len(html_need['data']['list'])):
    creatTime = html_need['data']['list'][i]['create_time']
    title = html_need['data']['list'][i]['title']
    content = html_need['data']['list'][i]['content']
    clicks = html_need['data']['list'][i]['counter_clicks']
    crawl.insertTable(table, creatTime, title, content, clicks)
```
Python crawler crawls Dynamic Web pages and stores data in MySQL database