Crawl 404 errors across an entire site with Python


Links are an important ranking factor in SEO. To keep a good position in search engines, you should regularly check that the links on your site are still valid; in particular, large site changes can easily introduce broken links. There are online tools for detecting link problems within a site, such as Google Analytics, Bing Webmaster Tools, and brokenlinkcheck.com. Even so, we can write one ourselves, and doing so in Python is very easy.

Original article: How to Check Broken Links with 404 Error in Python

Author: Xiao Ling

Translation: yushulx

How to check a site for 404 errors

To help search engines crawl a website, most sites provide a sitemap.xml. So the basic steps are (sketched in code right after this list):

    1. Read sitemap.xml to get all of the site's page links.

    2. Fetch each page and extract all the links it contains, which may be inbound (internal) or outbound (external).

    3. Check the HTTP status of every link.
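
Putting the three steps together, the whole crawler is roughly the following loop. This is a minimal sketch: the function names getpages, readhref, and crawllinks are assumptions matching the snippets developed below, and shutdown_event is the cancellation flag introduced in the next section.

pages = getpages()              # step 1: page URLs from sitemap.xml
for page in pages:
    links = readhref(page)      # step 2: links found on each page
    crawllinks(links)           # step 3: check the status of every link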

Software Installation

The BeautifulSoup library makes it very convenient to parse page elements:

pip install beautifulsoup4
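
A quick parse is enough to confirm the install worked (a trivial example, not part of the original crawler):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<a href='http://example.com'>demo</a>")
print soup.select('a')[0].get('href')   # prints http://example.com
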
How to crawl a Web page using Python

Because the program can run for a long time, we register a handler for the keyboard interrupt (Ctrl+C) so the crawl can be cancelled at any time:

import signal
import threading

# Event checked throughout the crawler to see whether the user cancelled.
shutdown_event = threading.Event()

def ctrl_c(signum, frame):
    shutdown_event.set()
    raise SystemExit('\nCancelling ...')

signal.signal(signal.SIGINT, ctrl_c)

Use BeautifulSoup to parse sitemap.xml and collect the page URLs:

from bs4 import BeautifulSoup
from urllib2 import urlopen, HTTPError, URLError

def getpages(self):
    # Collect every page URL listed in the sitemap's <loc> tags.
    # (The method name is assumed; the original excerpt only shows the body.)
    pages = []
    try:
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        f.close()
        soup = BeautifulSoup(xml)
        urltags = soup.find_all("url")
        print "The number of url tags in sitemap: ", str(len(urltags))
        for urltag in urltags:
            pages.append(urltag.find_next("loc").text)
    except (HTTPError, URLError) as e:
        print e
    return pages
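
The snippets call a build_request helper and a GAME_OVER constant that the excerpt never defines. A minimal sketch, assuming build_request simply wraps urllib2.Request with a browser-like User-Agent so servers do not reject the crawler:

import urllib2

GAME_OVER = True  # sentinel returned when the user cancels (value assumed)

def build_request(url):
    # Some servers refuse requests without a familiar User-Agent.
    return urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})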

Read each page and parse all the links out of its HTML:

def querylinks(self, result):
    # Extract the absolute links from a downloaded page.
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')
    for element in elements:
        if shutdown_event.is_set():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            # element.get() returns None when an <a> tag has no href
            print 'href error!!!'
            continue
    return links

def readhref(self, url):
    # Download a page in 10 KB chunks, then hand the content to querylinks().
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        while not shutdown_event.is_set():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except (HTTPError, URLError) as e:
        print e
    if shutdown_event.is_set():
        return GAME_OVER
    return self.querylinks(result)
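
Note that querylinks() keeps only absolute URLs (those starting with 'http') and silently drops relative links such as /about.html. If you want to check those too, one option, not part of the original code, is to resolve them against the page URL:

from urlparse import urljoin

def resolve_link(page_url, href):
    # Turns '/about.html' into 'http://example.com/about.html';
    # absolute URLs pass through unchanged.
    return urljoin(page_url, href)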

Check the HTTP status code returned for each link, and log any 404s:

def crawllinks(self, links, file=None):
    # Request every link and report its HTTP status; 404s are written to file.
    for link in links:
        if shutdown_event.is_set():
            return GAME_OVER
        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except (HTTPError, URLError) as e:
            # URLError has no code attribute, so fall back to 0
            status_code = getattr(e, 'code', 0)
        if status_code == 404 and file is not None:
            file.write(link + '\n')
        print str(status_code), ':', link
    return GAME_OVER
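
A minimal driver tying the pieces together might look like the following. The Crawler class name and the report file are assumptions; the complete program is in the repository linked below.

crawler = Crawler()  # hypothetical class holding the methods above
report = open('404.txt', 'w')
for page in crawler.getpages():
    links = crawler.readhref(page)
    if shutdown_event.is_set():
        break
    crawler.crawllinks(links, report)
report.close()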
Source

https://github.com/yushulx/crawl-404

