A friend said his site had run into trouble, and he wanted to know how many of the dead links were pages already indexed by the search engine. So I thought through the process. Getting the count from the site itself is of course not accurate, but there is nothing better: the real index exists only inside the search engine's own database...
Query the status code of each indexed page. The process: fetch the indexed URL from the search results > resolve the real URL behind the redirect > get its status code.
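The first step, pulling the indexed URLs out of a result page, can be sketched as below. The HTML fragment is made up, but `c-showurl` is the class the full script selects on; this sketch uses only the Python 3 standard library so it runs anywhere, whereas the script itself does the same job with BeautifulSoup's `soup.select('.c-showurl')`.

```python
# Minimal sketch of step one: collect hrefs of <a class="c-showurl"> links.
# The fragment is a made-up stand-in for a Baidu result page; on the real
# page those hrefs point at a redirect, not the destination page itself.
from html.parser import HTMLParser

class ShowUrlParser(HTMLParser):
    """Collect href values of <a> tags whose class list includes c-showurl."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'c-showurl' in attrs.get('class', '').split():
            self.links.append(attrs.get('href'))

fragment = '''
<div class="result">
  <a class="c-showurl" href="http://www.baidu.com/link?url=abc123">www.123.com/page1</a>
  <a class="other" href="http://example.com/">not a result link</a>
</div>
'''
parser = ShowUrlParser()
parser.feed(fragment)
print(parser.links)  # → ['http://www.baidu.com/link?url=abc123']
```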
But the implementation runs slowly, and I am not sure whether the slow step is the BeautifulSoup parsing or resolving the real URL from the Location header.
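One cheap way to settle that question is to accumulate wall-clock time per step. A Python 3 sketch: `parse_serp` and `resolve_real_url` below are hypothetical stand-ins (sleeps) for the real BeautifulSoup parse and the HEAD-request Location lookup; swap their bodies for the real calls to profile the actual script.

```python
# Time each pipeline step separately to find the bottleneck.
import time
from collections import defaultdict

step_totals = defaultdict(float)

def timed(name):
    """Decorator: add each call's wall-clock time to step_totals[name]."""
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                step_totals[name] += time.time() - t0
        return inner
    return wrap

@timed('parse')
def parse_serp(html):
    time.sleep(0.01)  # stand-in for soup = bs(html, "lxml") + select()
    return ['http://www.baidu.com/link?url=abc123']

@timed('resolve')
def resolve_real_url(url):
    time.sleep(0.03)  # stand-in for requests.head(url).headers['Location']
    return 'http://www.123.com/page1'

for link in parse_serp('<html>...</html>'):
    resolve_real_url(link)

for name, total in sorted(step_totals.items()):
    print('%-8s %.3fs' % (name, total))
```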
```python
# coding: utf-8
import urllib2
import requests
from bs4 import BeautifulSoup as bs

domain = 'www.123.com'   # the domain to query
page_num = 10 * 10       # the first factor is the number of result pages to crawl


def gethtml(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        # 'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        # logged-in Baidu session cookie; the long value is trimmed here
        'Cookie': 'BDUSS=...; ispeed_lsm=2; PSTM=1465195705; ...',
        # 'Host': 'www.baidu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36',
    }
    req = urllib2.Request(url=url, headers=headers)
    html = urllib2.urlopen(req, timeout=30).read()
    return html


def status(url):
    # return the HTTP status code
    return requests.get(url).status_code


status_file = open('url_status.txt', 'a+')
for i in range(10, page_num, 10):
    url = 'https://www.baidu.com/s?wd=site%3A' + domain + '&pn=' + str(i)
    html = gethtml(url)
    soup = bs(html, "lxml")
    for result in soup.select('.c-showurl'):
        # print result.get('href')
        urls = result.get('href')
        # url_list.append(urls)
        header = requests.head(urls).headers
        header_url = header['Location']   # the real URL behind Baidu's redirect
        code = status(header_url)
        if code == 404:
            print code, header_url        # print the status code and the real URL
            status_file.write(str(code) + ' ' + header_url + '\n')  # write status code and link to file
status_file.close()
```
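If the slow part turns out to be waiting on the network, checking URLs concurrently is the usual fix. A Python 3 sketch with `concurrent.futures`: `fetch_status` here is a hypothetical stand-in that treats any `/gone/` path as dead, so the example runs offline; for real use, replace its body with `requests.head(url, allow_redirects=True).status_code`.

```python
# Check many URLs in parallel; threads overlap the network waits.
from concurrent.futures import ThreadPoolExecutor

def fetch_status(url):
    # Hypothetical stand-in so the sketch runs without a network.
    # Real use: return requests.head(url, allow_redirects=True).status_code
    return 404 if '/gone/' in url else 200

def find_dead_links(urls, workers=20):
    """Check urls concurrently and return the ones that came back 404."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(fetch_status, urls))
    return [u for u, s in zip(urls, statuses) if s == 404]

urls = [
    'http://www.123.com/ok',
    'http://www.123.com/gone/1',
    'http://www.123.com/gone/2',
]
print(find_dead_links(urls))  # → ['http://www.123.com/gone/1', 'http://www.123.com/gone/2']
```

`pool.map` preserves input order, which is why zipping the statuses back against `urls` is safe here.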
Code snippet for reference:
```python
# coding: utf-8
import sys
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3

question_word = "Foodie Programmer"
url = "http://www.baidu.com/s?wd=" + urllib.quote(
    question_word.decode(sys.stdin.encoding).encode('gbk'))
htmlpage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlpage)
print len(soup.findAll("table", {"class": "result"}))
for result_table in soup.findAll("table", {"class": "result"}):
    a_click = result_table.find("a")
    print "-----title----\n" + a_click.renderContents()   # title
    print "----link----\n" + str(a_click.get("href"))     # link
    print "----description----\n" + result_table.find(    # description
        "div", {"class": "c-abstract"}).renderContents()
    print
```
SEO: searching for dead links among indexed pages