Python Learning Lesson Two: Using BeautifulSoup to Crawl Links Matching a Regular Expression
- Consult the BeautifulSoup documentation (use the version that corresponds to your installed release)
Document Link https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import io
import re
import sys
from urllib import request
from bs4 import BeautifulSoup

# Change the default encoding of standard output
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8")

resp = request.urlopen("http://news.baidu.com/").read().decode("utf-8")
soup = BeautifulSoup(resp, "html.parser")
# Keep only <a> tags whose href matches //news.baidu
listUrls = soup.find_all("a", href=re.compile(r".*//news\.baidu.*"))
for url in listUrls:
    print(url["href"])
Final effect:
http://news.baidu.com/view.html
http://news.baidu.com/advanced_news.html
http://news.baidu.com/pianhao.html
http://news.baidu.com/n?bypass=lamp&m=pagesother&v=newsgx
http://news.baidu.com/n?cmd=6&loc=0&name=%B1%B1%BE%A9
http://news.baidu.com/history.html
http://news.baidu.com/newscode.html
http://news.baidu.com/licence.html
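The same `find_all` + `re.compile` filtering technique can be verified offline with a small inline HTML snippet, so you do not depend on the live page. The markup below is a made-up sample, not Baidu's actual HTML:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup for demonstration only
html = """
<a href="http://news.baidu.com/view.html">View</a>
<a href="http://www.baidu.com/">Home</a>
<a href="http://news.baidu.com/history.html">History</a>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all accepts a compiled regex as an attribute filter:
# only <a> tags whose href matches //news.baidu are returned
links = [a["href"] for a in soup.find_all("a", href=re.compile(r".*//news\.baidu.*"))]
print(links)
```

Running this prints the two `news.baidu.com` links and drops the `www.baidu.com` one, which mirrors what the full script does against the live page.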