Links are an important factor in SEO. To rank better in search engines, you should regularly check that the links on your site are still valid; in particular, large site changes can easily leave broken links behind. There are online tools that can detect such problems, such as Google Analytics, Bing Webmaster Tools, brokenlinkcheck.com, and so on. Even with these tools available, we can also write one ourselves, and Python makes it very easy.
Original article: How to Check Broken Links with 404 Error in Python
Author: Xiao Ling
Translation: Yushulx
How to check site 404 errors
To make a website easier for search engines to crawl, most sites provide a sitemap.xml. So the basic steps are:
Read sitemap.xml and collect all of the site's page links.
Fetch each of those pages and extract every link it contains, whether internal or external.
Check the HTTP status of all of the collected links.
Software Installation
The BeautifulSoup library makes it very convenient to parse page elements:
pip install beautifulsoup4
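As a quick illustration (not part of the original article), here is a minimal sketch of how BeautifulSoup pulls <a> tags out of an HTML string, assuming Python 2 and the bs4 package installed above:

from bs4 import BeautifulSoup

# Parse a small HTML snippet and print every href value.
html = '<html><body><a href="http://example.com">example</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('a'):
    print a.get('href')   # -> http://example.com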
How to crawl a Web page using Python
Because the program can run for a long time, we register a signal handler so that it can be interrupted from the keyboard at any time:
import signal
import threading

def ctrl_c(signum, frame):
    # Set the shared event so running loops can stop, then exit.
    global shutdown_event
    shutdown_event.set()
    raise SystemExit('\nCancelling ...')

shutdown_event = threading.Event()
signal.signal(signal.SIGINT, ctrl_c)
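The snippets below call a build_request() helper that comes from the repository linked at the end of this article. Its real implementation is in that source; purely as an assumption for readability, a minimal version might just wrap the URL in a urllib2.Request with a browser-like User-Agent header:

import urllib2

def build_request(url):
    # Assumed sketch of the repo's helper: attach a browser-like User-Agent,
    # since some servers reject requests from the default Python agent.
    return urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})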
Use BeautifulSoup to parse sitemap.xml:
from urllib2 import urlopen, HTTPError, URLError
from bs4 import BeautifulSoup

# Excerpt from a method in the repo: read sitemap.xml and collect the page URLs.
pages = []
try:
    request = build_request("http://kb.dynamsoft.com/sitemap.xml")
    f = urlopen(request, timeout=3)
    xml = f.read()
    soup = BeautifulSoup(xml)
    urltags = soup.find_all("url")
    print "The number of url tags in sitemap: ", str(len(urltags))
    # Each <url> tag contains a <loc> tag holding the page address.
    for urltag in urltags:
        link = urltag.find_next("loc").text
        pages.append(link)
    f.close()
except HTTPError as e:
    print e.code
except URLError as e:
    print e.reason
return pages
Parse the HTML of each page and extract all of its links:
def querylinks(self, result):
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')

    for element in elements:
        if shutdown_event.is_set():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            print 'href error!!!'
            continue

    return links

def readhref(self, url):
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        # Download the page in 10 KB chunks, checking the shutdown flag
        # between chunks so Ctrl+C interrupts long downloads quickly.
        while not shutdown_event.is_set():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except HTTPError as e:
        print e.code
    except URLError as e:
        print e.reason

    if shutdown_event.is_set():
        return GAME_OVER
    return self.querylinks(result)
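Note that readhref() checks shutdown_event both while downloading and before handing the content to querylinks(), so a Ctrl+C caught by the signal handler above stops the crawl quickly even in the middle of a large page. GAME_OVER is the sentinel value the code returns to signal that the run should end.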
Check the HTTP status code returned by each link:
def crawllinks(self, links, file=None):
    for link in links:
        if shutdown_event.is_set():
            return GAME_OVER

        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except HTTPError as e:
            status_code = e.code
        except URLError as e:
            print e.reason

        # Record broken links to the report file and print every result.
        if status_code == 404:
            if file != None:
                file.write(link + '\n')
        print str(status_code), ':', link

    return GAME_OVER
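To tie the steps together, a minimal driver could look like the sketch below. It is only an illustration: it assumes the sitemap-reading excerpt above is wrapped in a method hypothetically named getpages(), that the methods belong to a crawler class hypothetically named LinkChecker, and that broken links are written to a local 404.txt file; the actual structure is in the source linked below.

# Hypothetical driver built from the pieces above (names are assumptions).
checker = LinkChecker()                    # crawler object holding the methods above
pages = checker.getpages()                 # step 1: page URLs from sitemap.xml

with open('404.txt', 'w') as report:
    for page in pages:
        if shutdown_event.is_set():        # stop promptly after Ctrl+C
            break
        links = checker.readhref(page)     # step 2: links found on the page
        if links == GAME_OVER:
            break
        checker.crawllinks(links, report)  # step 3: check status, record 404s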
Source
https://github.com/yushulx/crawl-404
Crawl 404 errors across an entire site with Python