Python: a Hupu forum crawler

Python, as a high-level programming language, needs no introduction; it has been popular in programming circles for a while now. I am fairly new to it myself, so to keep up with the times I decided to learn a bit. As "Lu Xun" said: "Learning something and never applying it is just hooliganism." So I used Python to write a crawler for the Hupu forum. The script is a bit rough; it is offered for beginners to learn from and discuss, and it also makes later review convenient for me. I had originally planned to write an analysis of the crawled Hupu posts, but I ran out of steam and never did. Still, as a Spurs fan, it is an honor that our team ranks in the top three by heat.

  Preparation: install Python, install MySQL, and optionally a virtual machine (later the script is put on a server to run as a daily scheduled task).

1. Install Python: choose a 3.x release; installation steps omitted.

2. Install MySQL: choose version 5.6 or above; installation steps omitted. (The tables the scripts expect are sketched just after this list.)

3. Virtual machine: any Linux distribution; setup steps omitted.
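
The scripts below write into a MySQL database named spider with three tables: hupuforum_spurs_note_daytmp (the daily snapshot), hupuforum_spurs_note (the accumulated posts, including a sub_text column for the post body), and hupuforum_authors_info (author profiles). The original post does not show its table definitions, so the following is only a minimal sketch: the column names come from the INSERT and UPDATE statements in Parts 1 to 3, and the column types are my own guesses.

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', charset='utf8')
cur = conn.cursor()
cur.execute('create database if not exists spider default character set utf8')
cur.execute('use spider')

# daily snapshot written by Part 1; sub_text stays NULL here and is filled in
# on the accumulated table by Part 2
cur.execute("""
create table if not exists hupuforum_spurs_note_daytmp (
    subname            varchar(255),
    subname_link       varchar(255),
    author             varchar(100),
    author_link        varchar(255),
    author_create_time varchar(50),
    read_reply_number  varchar(50),
    last_reply_writer  varchar(100),
    last_reply_time    varchar(50),
    theme_title        varchar(50),
    sub_text           varchar(1000)
)""")

# accumulated posts table with exactly the same columns,
# so that Part 2's "insert into ... select *" works
cur.execute('create table if not exists hupuforum_spurs_note like hupuforum_spurs_note_daytmp')

# author profiles written by Part 3
cur.execute("""
create table if not exists hupuforum_authors_info (
    author         varchar(100),
    author_link    varchar(255),
    author_visited varchar(50),
    author_info    varchar(1000),
    author_status  varchar(20)
)""")

conn.commit()
cur.close()
conn.close()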

Requirements Description

Crawl Hupu forum posts to capture each post's content, author, heat (read/reply counts), and more.

Writing scripts

The work is divided into three parts. Part 1 parses the list pages at the current link and extracts each post's title, author, and read/reply information. Part 2 fetches the body content of each post. Part 3 extracts data about the post authors, which gives raw material for later analysis. The specific scripts follow. One note: encoding, encoding, encoding. Thank you!
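
On the encoding point, the pattern used throughout the scripts is: hand BeautifulSoup the raw response bytes, name the encoding explicitly for every file, and connect to MySQL with charset='utf8'. A minimal sketch of that pattern (the URL is just an example list page):

import requests
from bs4 import BeautifulSoup

# parse the raw bytes and let BeautifulSoup work out the page encoding,
# rather than trusting the text that requests guesses
resp = requests.get('https://bbs.hupu.com/spurs-1')
bs_obj = BeautifulSoup(resp.content, 'html.parser')

# always name the encoding when writing intermediate files,
# otherwise Chinese text can be garbled by the platform default codec
with open('web_spider_original.txt', 'w', encoding='utf8') as f:
    f.write(str(bs_obj))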

Note: Hupu's anti-crawler measures limit each sub-forum to its first 10 list pages (I tried and failed to break through this, thank you!). My workaround is to run the script on the server as a daily crawl and let the data accumulate over time.
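
Since the page limit cannot be broken, being gentle with the requests at least avoids making things worse. The sketch below is my own addition, not the author's code: the User-Agent value and the two-second pause are arbitrary choices and not a guaranteed way around Hupu's anti-crawler measures. The daily accumulation itself is simply the Part 1 script run once a day on the Linux server.

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; any realistic browser string will do

for page in range(1, 11):  # only the first 10 list pages are reachable
    url = 'https://bbs.hupu.com/spurs-{}'.format(page)
    resp = requests.get(url, headers=headers, timeout=10)
    bs_obj = BeautifulSoup(resp.content, 'html.parser')
    # ... parse the page as in Part 1 below ...
    time.sleep(2)  # pause between pages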

Part 1: crawl the post title, author, creation time, read/reply counts, author link and so on, and load them into the local MySQL database.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import json
import time
import pymysql
import importlib, sys
importlib.reload(sys)

forum_note_sum = []  # holds one record per scraped post
list_d = ['original', 'translate', 'talk']  # tag words: if the first link in the title is one of these, take the next link instead
type = sys.getfilesystemencoding()

# num: the record index within one list page; extract the post's author and other fields
def parent_li_web(num):
    forum_note_record = {}
    try:
        parent_tiezi = bs_obj.find('ul', class_='for-list').find_all('li')[num]
        div_one = parent_tiezi.find('div', class_='titlelink box')
        div_two = parent_tiezi.find('div', class_='author box')
        span_three = parent_tiezi.find('span', class_='ansour box').string.strip()
        div_four = parent_tiezi.find('div', class_='endreply box')
        subname = div_one.a.string
        sublink = 'https://bbs.hupu.com' + div_one.a['href']
        team_tmp = theme_tmp
        for i in list_d:
            if i == subname:
                subname = div_one.find_all('a')[1].string
                sublink = 'https://bbs.hupu.com' + div_one.find_all('a')[1]['href']
                # print(i, subname, sublink)
        forum_note_record.update({
            'subname': subname,
            'subname_link': sublink,
            'author': div_two.a.string,
            'author_link': div_two.a['href'],
            'author_create_time': div_two.find('a', style='color: #808080; cursor:initial; ').string,
            'read_reply_number': span_three,
            'last_reply_writer': div_four.span.string,
            'last_reply_time': div_four.a.string,
            'team_tmp': team_tmp
        })
        forum_note_sum.append(forum_note_record)
    except:
        return None

if __name__ == '__main__':
    # all_spurs_note
    begin_time = time.time()
    print('--------- Script start time: {} ------------'.format(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())))
    team_list = ['rockets', 'warriors', 'cavaliers', 'spurs', 'lakers', 'celtics', 'thunder', 'clippers',
                 'timberwolves', 'mavericks', 'knicks', 'bulls', 'nets', 'sixers', 'jazz', 'pacers',
                 'blazers', 'heat', 'suns', 'grizzlies', 'wizards', 'pelicans', 'bucks', 'kings',
                 'raptors', 'nuggets', 'hawks', 'hornets', 'pistons', 'magic']
    for li in team_list:
        forum_note_sum_code = []
        theme_tmp = li
        for i in range(1, 11, 1):  # because of Hupu's anti-crawler limit only 10 pages are reachable; later the script runs daily on Linux as a scheduled task
            url = 'https://bbs.hupu.com/{}-{}'.format(li, i)
            print(url)
            wb_string = requests.get(url)
            bs_obj = BeautifulSoup(wb_string.content, 'html.parser')
            with open('web_spider_original.txt', 'w', encoding='utf8') as f:
                f.write(str(bs_obj))
                f.write('\r' * 10 + '-----I am the dividing line-----' + '\r' * 10)
            for j in range(1, 61, 1):  # each list page holds 60 posts
                parent_li_web(j)

    with open('hupu_spider_spurs_load.txt', 'w', encoding='utf8') as f:
        for item in forum_note_sum:
            json.dump(item, f, ensure_ascii=False)
            f.write('\r')

    # insert into MySQL
    conn = pymysql.connect(host='localhost', user='root', passwd='1234', db='spider', port=3306, charset='utf8')
    cur = conn.cursor()
    cur.execute('delete from hupuforum_spurs_note_daytmp')
    with open('hupu_spider_spurs_load.txt', 'r', encoding='utf8') as f:
        for item in f:
            item = json.loads(item)  # convert the JSON string back into a dict
            # print(item)
            cur.execute('insert into hupuforum_spurs_note_daytmp(subname,subname_link,author,author_link,author_create_time,read_reply_number,last_reply_writer,last_reply_time,theme_title) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)',
                        (item['subname'], item['subname_link'], item['author'], item['author_link'],
                         item['author_create_time'], item['read_reply_number'], item['last_reply_writer'],
                         item['last_reply_time'], item['team_tmp']))
    conn.commit()
    cur.close()
    conn.close()
    print('Finished! This run took {} seconds'.format(time.time() - begin_time))
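
A quick sanity check after Part 1 has run is to count the rows that landed in the temporary table per team. This check is my own addition, not part of the original scripts.

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
cur.execute('select theme_title, count(*) from hupuforum_spurs_note_daytmp group by theme_title')
for theme, cnt in cur.fetchall():
    print(theme, cnt)  # roughly 600 rows per team is expected (10 pages x 60 posts)
cur.close()
conn.close()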

  Part 2: fill in each post's body text and update existing fields (read/reply counts, last reply).

# coding=utf8
import time
import requests
from bs4 import BeautifulSoup
import pymysql
import signal

begin_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
sub_cur = conn.cursor()
# move any new posts from the daily temp table into the accumulated table
cur.execute('insert into hupuforum_spurs_note select * from hupuforum_spurs_note_daytmp where subname_link not in (select a.subname_link from hupuforum_spurs_note a);')
# refresh the read/reply counters and last-reply fields of posts we already have
cur.execute('update hupuforum_spurs_note a, hupuforum_spurs_note_daytmp b set a.read_reply_number=b.read_reply_number, a.last_reply_writer=b.last_reply_writer, a.last_reply_time=b.last_reply_time where a.subname_link=b.subname_link')
# conn.commit()
cur.execute('use spider;')
conn.commit()
cur.execute('select subname_link from hupuforum_spurs_note where sub_text is null;')
for url in cur.fetchall():
    url = list(url)
    # print(url)
    try:
        wb_page = requests.get(url[0], timeout=2)  # some requests hang during real runs, so set a timeout
        bs_obj = BeautifulSoup(wb_page.content, 'html.parser')
        tmp_text = bs_obj.select('#tpc > div > div.floor_box > table.case > tbody > tr > td > div.quote-content')
        sub_text = tmp_text[0].get_text(strip=True)
        sub_text = sub_text.replace('\'', '')
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format(sub_text[:1000], str(url).replace('[', '').replace(']', ''))
        # print(sql)
        sub_cur.execute(sql)
        conn.commit()
        print('success')
    except IndexError as e:  # the selector found nothing, i.e. the post page no longer exists
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('page does not exist', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except pymysql.err.InternalError as e:  # the body contains emoji or other four-byte UTF-8 characters
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('incorrect content format, resulting in an error!', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except requests.exceptions.ReadTimeout as e:  # the page response timed out
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('web page open timeout', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except Exception as e:  # any other error
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('other types of errors', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
conn.commit()
cur.close()
sub_cur.close()
conn.close()
end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
print('Finished. Task start time: {}, end time: {}'.format(begin_time, end_time))

Part 3: crawl the registered-user information of each post author.

# coding=utf8
import time
import requests
from bs4 import BeautifulSoup
import pymysql

begin_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
sub_cur = conn.cursor()
cur.execute('select distinct author_link from hupuforum_spurs_note;')
for author_url in cur.fetchall():
    try:
        author_url = list(author_url)
        wb_obj = requests.get(author_url[0], timeout=2)
        bs_obj = BeautifulSoup(wb_obj.content, 'html.parser')
        author = bs_obj.select('#main > div.personal > div.personal_right > h3 > div')[0].string
        # keep only the number of profile visits by stripping the surrounding text
        author_visited = bs_obj.select('#main > div.personal > div.personal_right > h3 > span')[0].string.replace('have a', '').replace('visits by visitors', '')
        author_info = bs_obj.select('#main > div.personal > div.personal_right > div')[0].get_text(strip=True)
        sub_cur.execute('insert into hupuforum_authors_info(author,author_link,author_visited,author_info,author_status) values(%s,%s,%s,%s,%s)',
                        (author, author_url[0], author_visited, author_info, 'normal'))
    except IndexError as e:  # the profile page could not be parsed (deleted or inaccessible account)
        sub_cur.execute('insert into hupuforum_authors_info(author,author_link,author_visited,author_info,author_status) values(%s,%s,%s,%s,%s)',
                        ('', author_url[0], '', '', 'cannot access'))
    conn.commit()
conn.commit()
cur.close()
conn.close()
end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
print('Finished. Task start time: {}, end time: {}'.format(begin_time, end_time))
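
With all three parts in place, the analysis that was originally planned can start from a simple query. The sketch below pulls the most-read threads into pandas; pandas is my own addition, and it assumes the read_reply_number field looks like "reads / replies", which may need adjusting to the exact text the forum shows.

import pymysql
import pandas as pd

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
df = pd.read_sql('select subname, author, read_reply_number from hupuforum_spurs_note', conn)
conn.close()

# read_reply_number is stored as raw text; split it into numeric reads/replies before sorting
parts = df['read_reply_number'].str.split('/', n=1, expand=True)
df['reads'] = pd.to_numeric(parts[0].str.strip(), errors='coerce')
df['replies'] = pd.to_numeric(parts[1].str.strip(), errors='coerce')

print(df.sort_values('reads', ascending=False).head(10)[['subname', 'author', 'reads', 'replies']])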
