Links are an important factor in SEO. To rank better in search engines, you should regularly check that the links on your site are still valid; in particular, large site changes can easily leave broken links behind. There are online tools that can detect such problems, such as Google Analytics, Bing Webmaster Tools, brokenlinkcheck.com, and so on. Even with these tools available, we can also write one ourselves, and Python makes it very easy.
Original article: How to Check Broken Links with 404 Error in Python
Author: Xiao Ling
Translation: Yushulx
How to check site 404 errors
To make a website easier for search engines to crawl, most sites provide a sitemap.xml. So the basic steps are:
Read sitemap.xml and collect all of the site's page links.
Fetch each of those pages and extract every link it contains, whether internal or external.
Check the HTTP status of all of the collected links.
Software Installation
The BeautifulSoup library makes it very convenient to parse page elements:
pip install beautifulsoup4
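As a quick illustration (not part of the original article), here is a minimal sketch of how BeautifulSoup pulls <a> tags out of an HTML string, assuming Python 2 and the bs4 package installed above:

from bs4 import BeautifulSoup

# Parse a small HTML snippet and print every href value.
html = '<html><body><a href="http://example.com">example</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('a'):
    print a.get('href')   # -> http://example.com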
How to crawl a Web page using Python
Because the program can run for a long time, we register a signal handler so that it can be interrupted from the keyboard at any time:
import signal
import threading

def ctrl_c(signum, frame):
    # Set the shared event so running loops can stop, then exit.
    global shutdown_event
    shutdown_event.set()
    raise SystemExit('\nCancelling ...')

shutdown_event = threading.Event()
signal.signal(signal.SIGINT, ctrl_c)
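The snippets below call a build_request() helper that comes from the repository linked at the end of this article. Its real implementation is in that source; purely as an assumption for readability, a minimal version might just wrap the URL in a urllib2.Request with a browser-like User-Agent header:

import urllib2

def build_request(url):
    # Assumed sketch of the repo's helper: attach a browser-like User-Agent,
    # since some servers reject requests from the default Python agent.
    return urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})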
Use BeautifulSoup to parse sitemap.xml:
from urllib2 import urlopen, HTTPError, URLError
from bs4 import BeautifulSoup

# Excerpt from a method in the repo: read sitemap.xml and collect the page URLs.
pages = []
try:
    request = build_request("http://kb.dynamsoft.com/sitemap.xml")
    f = urlopen(request, timeout=3)
    xml = f.read()
    soup = BeautifulSoup(xml)
    urltags = soup.find_all("url")
    print "The number of url tags in sitemap: ", str(len(urltags))
    # Each <url> tag contains a <loc> tag holding the page address.
    for urltag in urltags:
        link = urltag.find_next("loc").text
        pages.append(link)
    f.close()
except HTTPError as e:
    print e.code
except URLError as e:
    print e.reason
return pages
Parse the HTML of each page and extract all of its links:
def querylinks(self, result):
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')

    for element in elements:
        if shutdown_event.is_set():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            print 'href error!!!'
            continue

    return links

def readhref(self, url):
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        # Download the page in 10 KB chunks, checking the shutdown flag
        # between chunks so Ctrl+C interrupts long downloads quickly.
        while not shutdown_event.is_set():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except HTTPError as e:
        print e.code
    except URLError as e:
        print e.reason

    if shutdown_event.is_set():
        return GAME_OVER
    return self.querylinks(result)
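Note that readhref() checks shutdown_event both while downloading and before handing the content to querylinks(), so a Ctrl+C caught by the signal handler above stops the crawl quickly even in the middle of a large page. GAME_OVER is the sentinel value the code returns to signal that the run should end.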
Check the HTTP status code returned by each link:
def crawllinks(self, links, file=None):
    for link in links:
        if shutdown_event.is_set():
            return GAME_OVER

        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except HTTPError as e:
            status_code = e.code
        except URLError as e:
            print e.reason

        # Record broken links to the report file and print every result.
        if status_code == 404:
            if file != None:
                file.write(link + '\n')
        print str(status_code), ':', link

    return GAME_OVER
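To tie the steps together, a minimal driver could look like the sketch below. It is only an illustration: it assumes the sitemap-reading excerpt above is wrapped in a method hypothetically named getpages(), that the methods belong to a crawler class hypothetically named LinkChecker, and that broken links are written to a local 404.txt file; the actual structure is in the source linked below.

# Hypothetical driver built from the pieces above (names are assumptions).
checker = LinkChecker()                    # crawler object holding the methods above
pages = checker.getpages()                 # step 1: page URLs from sitemap.xml

with open('404.txt', 'w') as report:
    for page in pages:
        if shutdown_event.is_set():        # stop promptly after Ctrl+C
            break
        links = checker.readhref(page)     # step 2: links found on the page
        if links == GAME_OVER:
            break
        checker.crawllinks(links, report)  # step 3: check status, record 404s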
Source
https://github.com/yushulx/crawl-404
Crawl 404 errors across an entire site with Python