SEO: searching for dead links among indexed pages

Source: Internet
Author: User

A friend said his site had gone down, and he wanted to know how many of its indexed pages were now dead links, so I thought through the process. Getting the count from a site: query is of course not accurate, but there is nothing better; the real index exists only inside the search engine's own database ...

Query the status code of each indexed page. The process: get the indexed URL > resolve the real URL > get the status code.

But the implementation is slow, and I don't know whether the slow step is the BeautifulSoup parsing or the Location lookup that resolves the real URL. (A faster variant is sketched after the code.)

# coding: utf-8
import urllib2
import re
import requests
from bs4 import BeautifulSoup as bs

domain = 'www.123.com'   # the domain name to query
page_num = 10 * 10       # the first digit is the number of result pages to crawl

def gethtml(url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        # 'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Cookie': 'BDUSS=...',  # placeholder: the original post carried the author's own Baidu session cookie here
        # 'Host': 'www.baidu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36',
    }
    req = urllib2.Request(url=url, headers=headers)
    html = urllib2.urlopen(req, timeout=30).read()
    return html

def status(url):
    # return the status code of a URL
    return requests.get(url).status_code

status_file = open('url_status.txt', 'a+')
for i in range(10, page_num, 10):
    url = 'https://www.baidu.com/s?wd=site%3A' + domain + '&pn=' + str(i)
    html = gethtml(url)
    soup = bs(html, "lxml")
    for link in soup.select('.c-showurl'):
        baidu_url = link.get('href')
        # Baidu result links are redirects; the Location header holds the real URL
        header = requests.head(baidu_url).headers
        header_url = header['location']
        code = status(header_url)   # fetch the status code once and reuse it
        if code == 404:
            print code, header_url  # print the status code and the real URL
            status_file.write(str(code) + ' ' + header_url + '\n')  # write the status code and link to the file
status_file.close()
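Most of the time here goes into network round trips: each result costs a HEAD request to read the Location header plus a full GET just for the status code, all sequential and on fresh connections. Below is a minimal sketch of a faster variant, assuming Python 3, the same .c-showurl redirect links, and the requests library: a single HEAD request per URL that follows the redirect chain and reports the final status, a shared session for connection reuse, and a small thread pool to run the checks concurrently.

# coding: utf-8
# Sketch of a faster dead-link check (Python 3, requests + concurrent.futures).
import concurrent.futures
import requests

session = requests.Session()  # one session -> pooled connections
session.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 6.1; WOW64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/49.0.2623.112 Safari/537.36')

def check(baidu_url):
    """Follow the Baidu redirect and return (status_code, final_url)."""
    try:
        # One HEAD request resolves the redirect chain; the final response
        # carries the real page's status code and URL.
        resp = session.head(baidu_url, allow_redirects=True, timeout=10)
        return resp.status_code, resp.url
    except requests.RequestException as exc:
        return None, '%s (%s)' % (baidu_url, exc)

def find_dead_links(baidu_urls, workers=10):
    """Check URLs concurrently; return the (code, url) pairs that came back 404."""
    dead = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for code, final_url in pool.map(check, baidu_urls):
            if code == 404:
                dead.append((code, final_url))
    return dead

One caveat: some servers answer HEAD with 405 Method Not Allowed, so a production version would fall back to GET for those responses.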

A code snippet for reference:

# coding: utf-8
import sys
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

question_word = "Foodie Programmer"
url = "http://www.baidu.com/s?wd=" + urllib.quote(question_word.decode(sys.stdin.encoding).encode('GBK'))
htmlpage = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlpage)
print len(soup.findAll("table", {"class": "result"}))
for result_table in soup.findAll("table", {"class": "result"}):
    a_click = result_table.find("a")
    print "----title----\n" + a_click.renderContents()  # title
    print "----link----\n" + str(a_click.get("href"))   # link
    print "----description----\n" + result_table.find("div", {"class": "c-abstract"}).renderContents()  # description
    print
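For comparison, here is a rough Python 3 translation of the same snippet using requests and bs4. The table.result and div.c-abstract selectors are carried over from the original and reflect old Baidu markup, so treat them as placeholders; the GBK re-encoding is dropped on the assumption that UTF-8 query strings are accepted.

# coding: utf-8
# Python 3 / bs4 translation of the snippet above (selectors kept from the
# original and likely stale; UTF-8 query encoding assumed to be accepted).
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup

question_word = "Foodie Programmer"
url = "http://www.baidu.com/s?wd=" + quote(question_word)
soup = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")

results = soup.find_all("table", {"class": "result"})
print(len(results))
for result_table in results:
    a_click = result_table.find("a")
    print("----title----\n" + a_click.get_text())        # title
    print("----link----\n" + str(a_click.get("href")))   # link
    abstract = result_table.find("div", {"class": "c-abstract"})
    print("----description----\n" + (abstract.get_text() if abstract else ""))
    print()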
