Python 3 Study (2): Resolving "cannot use a string pattern on a bytes-like object" in a Sitemap Crawler

Last Update:2018-06-29 Source: Internet

Author: User


Python 3.6.5 + PyCharm

Notes:

First, Python 3 no longer has urllib2; its functionality moved to urllib.request, so you can simply write import urllib.request.

Second, the variable passed to the regular expression must first be converted to a string with .decode('utf-8'). Otherwise Python raises the error: cannot use a string pattern on a bytes-like object.
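As a minimal sketch of the first note, the snippet below builds a request with urllib.request without touching the network. The URL and the wswp user agent come from the crawler code in this article; the variable names are only illustrative.

```python
import urllib.request  # Python 3 home of what Python 2 offered in urllib2
import urllib.error    # URLError and HTTPError are defined here

# Build (but do not send) a request that carries a custom user agent.
headers = {'User-agent': 'wswp'}
request = urllib.request.Request('http://example.webscraping.com/sitemap.xml',
                                 headers=headers)

print(request.full_url)                  # the URL the request targets
print(request.get_header('User-agent'))  # the header value that was set

# urllib.request imports URLError from urllib.error,
# so both names refer to the same class.
print(urllib.request.URLError is urllib.error.URLError)
```

Because both names point at the same exception class, catching urllib.request.URLError, as the crawler does, behaves the same as catching urllib.error.URLError.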

As shown in the following code.

    ### First, a sitemap crawler: it sets the user agent, catches exceptions, and retries failed downloads.
    ### wswp: Web Scraping with Python
    import re
    import urllib.request
    ## -- Written by Lisongbo

    def rocky_dnload(url, user_agent='wswp', num_retries=2):
        print('Downloading:', url)
        lisongbo_he = {'User-agent': user_agent}
        request = urllib.request.Request(url, headers=lisongbo_he)
        try:
            html = urllib.request.urlopen(request).read()
        except urllib.request.URLError as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # Retry 5xx HTTP errors
                    return rocky_dnload(url, user_agent, num_retries - 1)
        return html

    def rocky_crawl_sitemap(url):
        sitemap = rocky_dnload(url)  # Download the sitemap file
        # sitemap = sitemap.decode('utf-8')  # Must add this.
        links = re.findall('<loc>(.*?)</loc>', sitemap)  # Extract the sitemap links from the loc tags
        for link in links:
            html = rocky_dnload(link)  # Download each link
            # Scrape HTML here

    rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')

Running the code produces this error:

Downloading: http://example.webscraping.com/sitemap.xml
Traceback (most recent call last):
  File "c:/users/klooa/my_env/book9/test.py", line, in <module>
    rocky_crawl_sitemap('http://example.webscraping.com/sitemap.xml')
  File "c:/users/klooa/my_env/book9/test.py", line A, in rocky_crawl_sitemap
    links = re.findall('<loc>(.*?)</loc>', sitemap)  # Extract the sitemap links from the loc tags
  File "C:\Users\klooa\AppData\Local\Programs\Python\Python36\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
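The error can be reproduced without any download. The read() call on a urlopen response returns bytes, while the pattern given to re.findall is a str. The sketch below uses a small hand-written sample (an assumption, not the real sitemap) to trigger the same TypeError.

```python
import re

# Hand-written stand-in for what urlopen(...).read() returns: a bytes object.
sitemap = (b'<urlset><url><loc>'
           b'http://example.webscraping.com/places/default/view/afghanistan-1'
           b'</loc></url></urlset>')

message = None
try:
    re.findall('<loc>(.*?)</loc>', sitemap)  # str pattern applied to bytes data
except TypeError as e:
    message = str(e)
    print('TypeError:', message)
```

The pattern and the data must be of the same type, either both str or both bytes.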

To fix it, add the following line right after the sitemap is downloaded:

    sitemap = sitemap.decode('utf-8')
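One way to make the fix harder to forget is to do the decoding inside a small helper. The sketch below is an assumption layered on the article's code: extract_sitemap_links is a hypothetical name, the sample bytes are hand-written rather than downloaded, and the sitemap is assumed to be UTF-8 encoded, as in the fix above.

```python
import re

def extract_sitemap_links(sitemap, encoding='utf-8'):
    """Return the URLs found inside <loc> tags.

    Accepts either bytes (as returned by urlopen(...).read()) or str.
    """
    if isinstance(sitemap, bytes):
        # Convert bytes to str so that it matches the str pattern below.
        sitemap = sitemap.decode(encoding)
    return re.findall('<loc>(.*?)</loc>', sitemap)

# Hand-written sample data, not the real sitemap.
sample = (b'<urlset>'
          b'<url><loc>http://example.webscraping.com/places/default/view/afghanistan-1</loc></url>'
          b'<url><loc>http://example.webscraping.com/places/default/view/aland-islands-2</loc></url>'
          b'</urlset>')

links = extract_sitemap_links(sample)
for link in links:
    print(link)
```

An alternative that avoids decoding is a bytes pattern such as rb'<loc>(.*?)</loc>', but the matches are then bytes objects rather than str.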

After this change, the output is:

Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/places/default/view/afghanistan-1
Downloading: http://example.webscraping.com/places/default/view/aland-islands-2
Downloading: http://example.webscraping.com/places/default/view/albania-3
Downloading: http://example.webscraping.com/places/default/view/algeria-4
Downloading: http://example.webscraping.com/places/default/view/american-samoa-5
Downloading: http://example.webscraping.com/places/default/view/andorra-6
Downloading: http://example.webscraping.com/places/default/view/angola-7
Downloading: http://example.webscraping.com/places/default/view/anguilla-8
Downloading: http://example.webscraping.com/places/default/view/antarctica-9
Downloading: http://example.webscraping.com/places/default/view/antigua-and-barbuda-10
Downloading: http://example.webscraping.com/places/default/view/argentina-11
Downloading: http://example.webscraping.com/places/default/view/armenia-12
Download error: Too Many Requests
Downloading: http://example.webscraping.com/places/default/view/aruba-13

Download error: Too Many Requests

......
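The Too Many Requests lines above are not followed by a retry because the download function only retries 5xx errors, and "Too Many Requests" is the reason phrase of HTTP status 429, a 4xx code. A minimal sketch of that check, with should_retry as a hypothetical helper name:

```python
def should_retry(status_code, num_retries):
    """Mirror the crawler's retry condition: retries left and a 5xx status."""
    return num_retries > 0 and 500 <= status_code < 600

print(should_retry(503, 2))  # server-side error with retries left
print(should_retry(429, 2))  # Too Many Requests is a 4xx status
print(should_retry(503, 0))  # 5xx status but no retries left
```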

## -- Written by Lisongbo
