Environment: Python 3.6.5 + PyCharm
Notes:
First, Python 3 has no urllib2 module; it was replaced by urllib.request, so simply use import urllib.request.
Second, the data passed to the regular expression must first be converted to a string with .decode('utf-8'). Otherwise Python raises an error saying that a string pattern cannot be used on a bytes-like object.
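As a minimal sketch of the first note, the snippet below builds a request with urllib.request and a custom user agent without touching the network. The helper name build_request is hypothetical and only used for illustration; the URL and the wswp user agent are the ones used in this post.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Build a Request object that carries a custom User-agent header.

    build_request is a hypothetical helper for illustration only.
    """
    headers = {'User-agent': user_agent}
    return urllib.request.Request(url, headers=headers)


request = build_request('http://example.webscraping.com/sitemap.xml')

# Inspect the request locally; nothing is downloaded here.
print(request.full_url)
print(request.get_header('User-agent'))
```

Passing this object to urllib.request.urlopen() would perform the actual download, which is what the crawler in this post does.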
Both points are illustrated by the crawler code below.
### 1. A sitemap crawler that sets the user agent, catches exceptions, and retries failed downloads
The default user agent, wswp, stands for "Web Scraping with Python".

    import re
    import urllib.error
    import urllib.request

    # Written by Lisongbo
    def rocky_dnload(url, user_agent='wswp', num_retries=2):
        print('Downloading:', url)
        lisongbo_he = {'User-agent': user_agent}
        request = urllib.request.Request(url, headers=lisongbo_he)
        try:
            html = urllib.request.urlopen(request).read()
        except urllib.error.URLError as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # Retry 5xx HTTP errors
                    return rocky_dnload(url, user_agent, num_retries - 1)
        return html

    def rocky_crawl_sitemap(url):
        sitemap = rocky_dnload(url)  # Download the sitemap file
        # sitemap = sitemap.decode('utf-8')  # must add this
        links = re.findall('<loc>(.*?)</loc>', sitemap)  # Extract the sitemap links from the <loc> tags
        for link in links:  # Download each link
            html = rocky_dnload(link)
            # Scrape html here

    rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Running the crawler produces this error:
Downloading: http://example.webscraping.com/sitemap.xml
Traceback (most recent call last):
  File "c:/users/klooa/my_env/book9/test.py", line, in <module>
    rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')
  File "c:/users/klooa/my_env/book9/test.py", line A, in rocky_crawl_sitemap
    links = re.findall('<loc>(.*?)</loc>', sitemap)  # Extract the sitemap links from the <loc> tags
  File "C:\Users\klooa\AppData\Local\Programs\Python\Python36\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
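The error can be reproduced without any network access. The sketch below uses a small hand-written bytes value (a hypothetical sample, not the real sitemap) in place of what urlopen(...).read() returns, and shows that the pattern and the data must be of the same type.

```python
import re

# Hypothetical sample standing in for the bytes returned by urlopen(...).read().
raw = (b'<urlset><url><loc>'
       b'http://example.webscraping.com/places/default/view/afghanistan-1'
       b'</loc></url></urlset>')

# A str pattern applied to bytes data raises TypeError.
try:
    re.findall('<loc>(.*?)</loc>', raw)
    error_message = None
except TypeError as e:
    error_message = str(e)

print(type(raw).__name__, '->', error_message)

# Matching types work in either direction.
str_links = re.findall('<loc>(.*?)</loc>', raw.decode('utf-8'))  # str pattern, str data
bytes_links = re.findall(b'<loc>(.*?)</loc>', raw)               # bytes pattern, bytes data
print(str_links)
print(bytes_links)
```

Decoding to str is usually the more convenient choice here, because the extracted links are then ordinary strings that can be passed straight back to the download function.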
To fix it, add the following line right after the sitemap is downloaded and before the call to re.findall:
sitemap = sitemap.decode('utf-8')
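One thing to watch for: the download function in this post returns None when a download fails, and calling .decode() on None would raise an AttributeError. The following is a minimal sketch of a parsing helper that guards against that case. The name extract_sitemap_links and the sample bytes are hypothetical, and UTF-8 is assumed as the encoding, as in the fix above.

```python
import re


def extract_sitemap_links(raw, encoding='utf-8'):
    """Return the URLs found in <loc> tags of a downloaded sitemap.

    raw is the bytes object returned by the download function, or None
    if the download failed. extract_sitemap_links is a hypothetical helper.
    """
    if raw is None:
        # Nothing was downloaded, so there is nothing to parse.
        return []
    text = raw.decode(encoding)  # bytes -> str, so a str pattern can be used
    return re.findall('<loc>(.*?)</loc>', text)


# Hypothetical sample data in place of a real download.
sample = (b'<urlset>'
          b'<url><loc>http://example.webscraping.com/places/default/view/afghanistan-1</loc></url>'
          b'<url><loc>http://example.webscraping.com/places/default/view/aland-islands-2</loc></url>'
          b'</urlset>')

links = extract_sitemap_links(sample)
print(links)
print(extract_sitemap_links(None))
```

Keeping the parsing separate from the download also makes it possible to test the regular expression without hitting the site.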
With the decode line added, the crawler's output is:
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/places/default/view/afghanistan-1
Downloading: http://example.webscraping.com/places/default/view/aland-islands-2
Downloading: http://example.webscraping.com/places/default/view/albania-3
Downloading: http://example.webscraping.com/places/default/view/algeria-4
Downloading: http://example.webscraping.com/places/default/view/american-samoa-5
Downloading: http://example.webscraping.com/places/default/view/andorra-6
Downloading: http://example.webscraping.com/places/default/view/angola-7
Downloading: http://example.webscraping.com/places/default/view/anguilla-8
Downloading: http://example.webscraping.com/places/default/view/antarctica-9
Downloading: http://example.webscraping.com/places/default/view/antigua-and-barbuda-10
Downloading: http://example.webscraping.com/places/default/view/argentina-11
Downloading: http://example.webscraping.com/places/default/view/armenia-12
Download error: Too Many Requests
Downloading: http://example.webscraping.com/places/default/view/aruba-13
Download error: Too Many Requests
......
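In the output above, each "Too Many Requests" error is followed directly by the next URL rather than by a retry of the same one. That is consistent with the retry condition in the download function, which only retries when the error has a code in the 5xx range. The sketch below isolates that condition, assuming the "Too Many Requests" errors correspond to HTTP status 429 (the standard code for that reason phrase). The helpers should_retry and make_http_error are hypothetical names used only for this illustration.

```python
import io
import urllib.error
from email.message import Message


def should_retry(error, num_retries):
    """Mirror the crawler's retry condition: retries left and a 5xx status code."""
    return num_retries > 0 and hasattr(error, 'code') and 500 <= error.code < 600


def make_http_error(code, reason):
    """Build an HTTPError locally for testing (hypothetical helper, no network)."""
    return urllib.error.HTTPError(
        'http://example.webscraping.com/', code, reason, Message(), io.BytesIO(b''))


too_many = make_http_error(429, 'Too Many Requests')      # assumed status for the log lines
unavailable = make_http_error(503, 'Service Unavailable')  # a typical 5xx error
no_code = urllib.error.URLError('connection refused')      # plain URLError has no code

for err in (too_many, unavailable, no_code):
    print(err.reason, '->', should_retry(err, num_retries=2))
```

Under this condition a 4xx response is reported once and skipped, so pages that hit the rate limit are simply not downloaded.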
Python3 Study (2): Resolving "cannot use a string pattern on a bytes-like object" in a sitemap crawler