Python: a Hupu forum crawler

Python, as a high-level programming language, needs no introduction; it has been popular in programming circles for a while now. I am fairly new to it myself, so to keep up with the times I decided to learn a bit. As "Lu Xun" said: "Learning something and never applying it is just hooliganism." So I used Python to write a crawler for the Hupu forum. The script is a bit rough; it is offered for beginners to learn from and discuss, and it also makes later review convenient for me. I had originally planned to write an analysis of the crawled Hupu posts, but I ran out of steam and never did. Still, as a Spurs fan, it is an honor that our team ranks in the top three by heat.

  Preparation: install Python, install MySQL, and optionally a virtual machine (later the script is put on a server to run as a daily scheduled task).

1. Install Python: choose a 3.x release; installation steps omitted.

2. Install MySQL: choose version 5.6 or above; installation steps omitted. (The tables the scripts expect are sketched just after this list.)

3. Virtual machine: any Linux distribution; setup steps omitted.
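
The scripts below write into a MySQL database named spider with three tables: hupuforum_spurs_note_daytmp (the daily snapshot), hupuforum_spurs_note (the accumulated posts, including a sub_text column for the post body), and hupuforum_authors_info (author profiles). The original post does not show its table definitions, so the following is only a minimal sketch: the column names come from the INSERT and UPDATE statements in Parts 1 to 3, and the column types are my own guesses.

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', charset='utf8')
cur = conn.cursor()
cur.execute('create database if not exists spider default character set utf8')
cur.execute('use spider')

# daily snapshot written by Part 1; sub_text stays NULL here and is filled in
# on the accumulated table by Part 2
cur.execute("""
create table if not exists hupuforum_spurs_note_daytmp (
    subname            varchar(255),
    subname_link       varchar(255),
    author             varchar(100),
    author_link        varchar(255),
    author_create_time varchar(50),
    read_reply_number  varchar(50),
    last_reply_writer  varchar(100),
    last_reply_time    varchar(50),
    theme_title        varchar(50),
    sub_text           varchar(1000)
)""")

# accumulated posts table with exactly the same columns,
# so that Part 2's "insert into ... select *" works
cur.execute('create table if not exists hupuforum_spurs_note like hupuforum_spurs_note_daytmp')

# author profiles written by Part 3
cur.execute("""
create table if not exists hupuforum_authors_info (
    author         varchar(100),
    author_link    varchar(255),
    author_visited varchar(50),
    author_info    varchar(1000),
    author_status  varchar(20)
)""")

conn.commit()
cur.close()
conn.close()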

Requirements Description

Crawl Hupu forum posts to capture each post's content, author, heat (read/reply counts), and more.

Writing scripts

The work is divided into three parts. Part 1 parses the list pages at the current link and extracts each post's title, author, and read/reply information. Part 2 fetches the body content of each post. Part 3 extracts data about the post authors, which gives raw material for later analysis. The specific scripts follow. One note: encoding, encoding, encoding. Thank you!
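
On the encoding point, the pattern used throughout the scripts is: hand BeautifulSoup the raw response bytes, name the encoding explicitly for every file, and connect to MySQL with charset='utf8'. A minimal sketch of that pattern (the URL is just an example list page):

import requests
from bs4 import BeautifulSoup

# parse the raw bytes and let BeautifulSoup work out the page encoding,
# rather than trusting the text that requests guesses
resp = requests.get('https://bbs.hupu.com/spurs-1')
bs_obj = BeautifulSoup(resp.content, 'html.parser')

# always name the encoding when writing intermediate files,
# otherwise Chinese text can be garbled by the platform default codec
with open('web_spider_original.txt', 'w', encoding='utf8') as f:
    f.write(str(bs_obj))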

Note: Hupu's anti-crawler measures limit each sub-forum to its first 10 list pages (I tried and failed to break through this, thank you!). My workaround is to run the script on the server as a daily crawl and let the data accumulate over time.
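
Since the page limit cannot be broken, being gentle with the requests at least avoids making things worse. The sketch below is my own addition, not the author's code: the User-Agent value and the two-second pause are arbitrary choices and not a guaranteed way around Hupu's anti-crawler measures. The daily accumulation itself is simply the Part 1 script run once a day on the Linux server.

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; any realistic browser string will do

for page in range(1, 11):  # only the first 10 list pages are reachable
    url = 'https://bbs.hupu.com/spurs-{}'.format(page)
    resp = requests.get(url, headers=headers, timeout=10)
    bs_obj = BeautifulSoup(resp.content, 'html.parser')
    # ... parse the page as in Part 1 below ...
    time.sleep(2)  # pause between pages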

Part 1: crawl the post title, author, creation time, read/reply counts, author link and so on, and load them into the local MySQL database.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import json
import time
import pymysql
import importlib, sys
importlib.reload(sys)

forum_note_sum = []  # holds one record per scraped post
list_d = ['original', 'translate', 'talk']  # tag words: if the first link in the title is one of these, take the next link instead
type = sys.getfilesystemencoding()

# num: the record index within one list page; extract the post's author and other fields
def parent_li_web(num):
    forum_note_record = {}
    try:
        parent_tiezi = bs_obj.find('ul', class_='for-list').find_all('li')[num]
        div_one = parent_tiezi.find('div', class_='titlelink box')
        div_two = parent_tiezi.find('div', class_='author box')
        span_three = parent_tiezi.find('span', class_='ansour box').string.strip()
        div_four = parent_tiezi.find('div', class_='endreply box')
        subname = div_one.a.string
        sublink = 'https://bbs.hupu.com' + div_one.a['href']
        team_tmp = theme_tmp
        for i in list_d:
            if i == subname:
                subname = div_one.find_all('a')[1].string
                sublink = 'https://bbs.hupu.com' + div_one.find_all('a')[1]['href']
                # print(i, subname, sublink)
        forum_note_record.update({
            'subname': subname,
            'subname_link': sublink,
            'author': div_two.a.string,
            'author_link': div_two.a['href'],
            'author_create_time': div_two.find('a', style='color: #808080; cursor:initial; ').string,
            'read_reply_number': span_three,
            'last_reply_writer': div_four.span.string,
            'last_reply_time': div_four.a.string,
            'team_tmp': team_tmp
        })
        forum_note_sum.append(forum_note_record)
    except:
        return None

if __name__ == '__main__':
    # all_spurs_note
    begin_time = time.time()
    print('--------- Script start time: {} ------------'.format(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())))
    team_list = ['rockets', 'warriors', 'cavaliers', 'spurs', 'lakers', 'celtics', 'thunder', 'clippers',
                 'timberwolves', 'mavericks', 'knicks', 'bulls', 'nets', 'sixers', 'jazz', 'pacers',
                 'blazers', 'heat', 'suns', 'grizzlies', 'wizards', 'pelicans', 'bucks', 'kings',
                 'raptors', 'nuggets', 'hawks', 'hornets', 'pistons', 'magic']
    for li in team_list:
        forum_note_sum_code = []
        theme_tmp = li
        for i in range(1, 11, 1):  # because of Hupu's anti-crawler limit only 10 pages are reachable; later the script runs daily on Linux as a scheduled task
            url = 'https://bbs.hupu.com/{}-{}'.format(li, i)
            print(url)
            wb_string = requests.get(url)
            bs_obj = BeautifulSoup(wb_string.content, 'html.parser')
            with open('web_spider_original.txt', 'w', encoding='utf8') as f:
                f.write(str(bs_obj))
                f.write('\r' * 10 + '-----I am the dividing line-----' + '\r' * 10)
            for j in range(1, 61, 1):  # each list page holds 60 posts
                parent_li_web(j)

    with open('hupu_spider_spurs_load.txt', 'w', encoding='utf8') as f:
        for item in forum_note_sum:
            json.dump(item, f, ensure_ascii=False)
            f.write('\r')

    # insert into MySQL
    conn = pymysql.connect(host='localhost', user='root', passwd='1234', db='spider', port=3306, charset='utf8')
    cur = conn.cursor()
    cur.execute('delete from hupuforum_spurs_note_daytmp')
    with open('hupu_spider_spurs_load.txt', 'r', encoding='utf8') as f:
        for item in f:
            item = json.loads(item)  # convert the JSON string back into a dict
            # print(item)
            cur.execute('insert into hupuforum_spurs_note_daytmp(subname,subname_link,author,author_link,author_create_time,read_reply_number,last_reply_writer,last_reply_time,theme_title) values(%s,%s,%s,%s,%s,%s,%s,%s,%s)',
                        (item['subname'], item['subname_link'], item['author'], item['author_link'],
                         item['author_create_time'], item['read_reply_number'], item['last_reply_writer'],
                         item['last_reply_time'], item['team_tmp']))
    conn.commit()
    cur.close()
    conn.close()
    print('Finished! This run took {} seconds'.format(time.time() - begin_time))
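
A quick sanity check after Part 1 has run is to count the rows that landed in the temporary table per team. This check is my own addition, not part of the original scripts.

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
cur.execute('select theme_title, count(*) from hupuforum_spurs_note_daytmp group by theme_title')
for theme, cnt in cur.fetchall():
    print(theme, cnt)  # roughly 600 rows per team is expected (10 pages x 60 posts)
cur.close()
conn.close()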

  Part 2: fill in each post's body text and update existing fields (read/reply counts, last reply).

# coding=utf8
import time
import requests
from bs4 import BeautifulSoup
import pymysql
import signal

begin_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
sub_cur = conn.cursor()
# move any new posts from the daily temp table into the accumulated table
cur.execute('insert into hupuforum_spurs_note select * from hupuforum_spurs_note_daytmp where subname_link not in (select a.subname_link from hupuforum_spurs_note a);')
# refresh the read/reply counters and last-reply fields of posts we already have
cur.execute('update hupuforum_spurs_note a, hupuforum_spurs_note_daytmp b set a.read_reply_number=b.read_reply_number, a.last_reply_writer=b.last_reply_writer, a.last_reply_time=b.last_reply_time where a.subname_link=b.subname_link')
# conn.commit()
cur.execute('use spider;')
conn.commit()
cur.execute('select subname_link from hupuforum_spurs_note where sub_text is null;')
for url in cur.fetchall():
    url = list(url)
    # print(url)
    try:
        wb_page = requests.get(url[0], timeout=2)  # some requests hang during real runs, so set a timeout
        bs_obj = BeautifulSoup(wb_page.content, 'html.parser')
        tmp_text = bs_obj.select('#tpc > div > div.floor_box > table.case > tbody > tr > td > div.quote-content')
        sub_text = tmp_text[0].get_text(strip=True)
        sub_text = sub_text.replace('\'', '')
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format(sub_text[:1000], str(url).replace('[', '').replace(']', ''))
        # print(sql)
        sub_cur.execute(sql)
        conn.commit()
        print('success')
    except IndexError as e:  # the selector found nothing, i.e. the post page no longer exists
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('page does not exist', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except pymysql.err.InternalError as e:  # the body contains emoji or other four-byte UTF-8 characters
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('incorrect content format, resulting in an error!', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except requests.exceptions.ReadTimeout as e:  # the page response timed out
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('web page open timeout', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
    except Exception as e:  # any other error
        sql = """update hupuforum_spurs_note set sub_text=\'{}\' where subname_link={};""".format('other types of errors', str(url).replace('[', '').replace(']', ''))
        sub_cur.execute(sql)
        conn.commit()
conn.commit()
cur.close()
sub_cur.close()
conn.close()
end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
print('Finished. Task start time: {}, end time: {}'.format(begin_time, end_time))

Part 3: crawl the registered-user information of each post author.

# coding=utf8
import time
import requests
from bs4 import BeautifulSoup
import pymysql

begin_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
cur = conn.cursor()
sub_cur = conn.cursor()
cur.execute('select distinct author_link from hupuforum_spurs_note;')
for author_url in cur.fetchall():
    try:
        author_url = list(author_url)
        wb_obj = requests.get(author_url[0], timeout=2)
        bs_obj = BeautifulSoup(wb_obj.content, 'html.parser')
        author = bs_obj.select('#main > div.personal > div.personal_right > h3 > div')[0].string
        # keep only the number of profile visits by stripping the surrounding text
        author_visited = bs_obj.select('#main > div.personal > div.personal_right > h3 > span')[0].string.replace('have a', '').replace('visits by visitors', '')
        author_info = bs_obj.select('#main > div.personal > div.personal_right > div')[0].get_text(strip=True)
        sub_cur.execute('insert into hupuforum_authors_info(author,author_link,author_visited,author_info,author_status) values(%s,%s,%s,%s,%s)',
                        (author, author_url[0], author_visited, author_info, 'normal'))
    except IndexError as e:  # the profile page could not be parsed (deleted or inaccessible account)
        sub_cur.execute('insert into hupuforum_authors_info(author,author_link,author_visited,author_info,author_status) values(%s,%s,%s,%s,%s)',
                        ('', author_url[0], '', '', 'cannot access'))
    conn.commit()
conn.commit()
cur.close()
conn.close()
end_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
print('Finished. Task start time: {}, end time: {}'.format(begin_time, end_time))
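
With all three parts in place, the analysis that was originally planned can start from a simple query. The sketch below pulls the most-read threads into pandas; pandas is my own addition, and it assumes the read_reply_number field looks like "reads / replies", which may need adjusting to the exact text the forum shows.

import pymysql
import pandas as pd

conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='1234', db='spider', charset='utf8')
df = pd.read_sql('select subname, author, read_reply_number from hupuforum_spurs_note', conn)
conn.close()

# read_reply_number is stored as raw text; split it into numeric reads/replies before sorting
parts = df['read_reply_number'].str.split('/', n=1, expand=True)
df['reads'] = pd.to_numeric(parts[0].str.strip(), errors='coerce')
df['replies'] = pd.to_numeric(parts[1].str.strip(), errors='coerce')

print(df.sort_values('reads', ascending=False).head(10)[['subname', 'author', 'reads', 'replies']])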
