Python3 Study (2): Resolving "cannot use a string pattern on a bytes-like object" in a sitemap crawler

Source: Internet
Author: User

Python 3.6.5 + PyCharm

Notes:

First, urllib2 no longer exists in Python 3. It was replaced by urllib.request, so a plain import urllib.request is all that is needed.

Second, the data passed to the regular expression must be decoded first with .decode('utf-8'). The download returns bytes, and matching a string pattern against bytes raises the error "cannot use a string pattern on a bytes-like object".
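As a small illustration of the first point, the sketch below builds a request with urllib.request but does not send it, so it runs offline. The URL and the `wswp` user agent are the ones used in this post; the variable names are only for illustration.

```python
import urllib.request  # Python 3 replacement for the old urllib2 module

# Build the request object only; no network traffic happens until urlopen() is called.
url = 'http://example.webscraping.com/sitemap.xml'
request = urllib.request.Request(url, headers={'User-agent': 'wswp'})

print(request.full_url)
print(request.get_header('User-agent'))
```

Passing this object to urllib.request.urlopen() and calling .read() on the response yields bytes, which is where the second point comes in.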

The crawler below shows both points in context.

### 1. Sitemap crawler: sets the user agent, catches exceptions, and retries failed downloads

    # wswp: Web Scraping with Python
    # -- Written by Lisongbo
    import re
    import urllib.request


    def rocky_dnload(url, user_agent='wswp', num_retries=2):
        """Download a URL and return the raw response body (bytes), or None on error."""
        print('Downloading:', url)
        lisongbo_he = {'User-agent': user_agent}
        request = urllib.request.Request(url, headers=lisongbo_he)
        try:
            html = urllib.request.urlopen(request).read()
        except urllib.request.URLError as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # retry 5xx HTTP errors
                    return rocky_dnload(url, user_agent, num_retries - 1)
        return html


    def rocky_crawl_sitemap(url):
        sitemap = rocky_dnload(url)  # download the sitemap file
        # sitemap = sitemap.decode('utf-8')  # must add this
        links = re.findall('<loc>(.*?)</loc>', sitemap)  # extract the sitemap links from the loc tags
        for link in links:  # download each link
            html = rocky_dnload(link)  # scrape html here


    rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')

Running it produces an error:

    Downloading: http://example.webscraping.com/sitemap.xml
    Traceback (most recent call last):
      File "c:/users/klooa/my_env/book9/test.py", line, in <module>
        rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')
      File "c:/users/klooa/my_env/book9/test.py", line A, in rocky_crawl_sitemap
        links = re.findall('<loc>(.*?)</loc>', sitemap)  # extract the sitemap links from the loc tags
      File "C:\Users\klooa\AppData\Local\Programs\Python\Python36\lib\re.py", line 222, in findall
        return _compile(pattern, flags).findall(string)
    TypeError: cannot use a string pattern on a bytes-like object
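The error can be reproduced without any download. In this minimal sketch, the bytes literal is a made-up stand-in for what urlopen(...).read() returns; only the pattern comes from the crawler.

```python
import re

# Hypothetical stand-in for the downloaded sitemap: raw bytes, not str.
sitemap = b'<urlset><url><loc>http://example.webscraping.com/places/default/view/afghanistan-1</loc></url></urlset>'

try:
    re.findall('<loc>(.*?)</loc>', sitemap)  # str pattern applied to bytes data
    message = None
except TypeError as e:
    message = str(e)

print(message)
```

The re module requires the pattern and the searched data to be of the same type: both str or both bytes.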

To fix it, add the following line right after the sitemap is downloaded, before the call to re.findall:

    sitemap = sitemap.decode('utf-8')
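A short offline sketch of the fix follows. The sitemap bytes are hypothetical sample data, and the bytes-pattern variant is an alternative shown for comparison, not the approach this post uses.

```python
import re

# Hypothetical sitemap bytes standing in for the downloaded file.
raw = (b'<urlset>'
       b'<url><loc>http://example.webscraping.com/places/default/view/afghanistan-1</loc></url>'
       b'<url><loc>http://example.webscraping.com/places/default/view/aland-islands-2</loc></url>'
       b'</urlset>')

# Fix used in this post: decode the bytes to str, then apply the str pattern.
text = raw.decode('utf-8')
links = re.findall('<loc>(.*?)</loc>', text)

# Alternative: keep the data as bytes, use a bytes pattern, and decode each match.
links_from_bytes = [m.decode('utf-8') for m in re.findall(b'<loc>(.*?)</loc>', raw)]

print(links)
```

Decoding once up front is usually the simpler choice, because every later string operation on the sitemap then works with str.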

After the fix, the output is:

    Downloading: http://example.webscraping.com/sitemap.xml
    Downloading: http://example.webscraping.com/places/default/view/afghanistan-1
    Downloading: http://example.webscraping.com/places/default/view/aland-islands-2
    Downloading: http://example.webscraping.com/places/default/view/albania-3
    Downloading: http://example.webscraping.com/places/default/view/algeria-4
    Downloading: http://example.webscraping.com/places/default/view/american-samoa-5
    Downloading: http://example.webscraping.com/places/default/view/andorra-6
    Downloading: http://example.webscraping.com/places/default/view/angola-7
    Downloading: http://example.webscraping.com/places/default/view/anguilla-8
    Downloading: http://example.webscraping.com/places/default/view/antarctica-9
    Downloading: http://example.webscraping.com/places/default/view/antigua-and-barbuda-10
    Downloading: http://example.webscraping.com/places/default/view/argentina-11
    Downloading: http://example.webscraping.com/places/default/view/armenia-12
    Download error: Too Many Requests
    Downloading: http://example.webscraping.com/places/default/view/aruba-13
    Download error: Too Many Requests
    ......
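The Too Many Requests errors in this output are reported once and not retried. Assuming the server sent them with status 429 (the standard code for that reason phrase), they fall outside the 5xx range that the download function retries. The helper `should_retry` below is hypothetical and only restates that condition so it can be checked in isolation.

```python
def should_retry(status_code, num_retries):
    """Mirror the crawler's rule: retry only 5xx errors while retries remain."""
    return num_retries > 0 and 500 <= status_code < 600


print(should_retry(503, 2))  # server error with retries left
print(should_retry(429, 2))  # Too Many Requests is a 4xx client error
print(should_retry(503, 0))  # server error, but no retries left
```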

-- Written by Lisongbo
