Tools: Python 3.6, PyCharm
Libraries: BeautifulSoup 4 (bs4) + urllib
Step one: read the HTML source code

from bs4 import BeautifulSoup
import urllib.request  # import the urllib library

url = 'https://www.pythontab.com/html/pythonhexinbiancheng/index.html'  # the page link
request = urllib.request.urlopen(url)
html = request.read()  # read the page source code
Step two: get the titles and links

soup = BeautifulSoup(html, 'html.parser')  # parse the HTML
title_links = soup.select('#catlist > li > a')  # find the titles and links
source_list = []  # store a dictionary of titles and links
for title_link in title_links:
    data = {
        'title': title_link.get_text(),
        'link': title_link.get('href')
    }
    source_list.append(data)
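The selector logic can be checked without hitting the network by feeding BeautifulSoup a small inline snippet that mirrors the page's `#catlist` structure (the markup below is an illustrative stand-in, not the real page):

```python
from bs4 import BeautifulSoup

# a minimal stand-in for the real page, mirroring the #catlist structure
html = '''
<ul id="catlist">
  <li><a href="/html/article1.html">Article One</a></li>
  <li><a href="/html/article2.html">Article Two</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
title_links = soup.select('#catlist > li > a')  # same CSS selector as the crawler
source_list = [{'title': a.get_text(), 'link': a.get('href')} for a in title_links]
print(source_list)
```

This prints a list of two dictionaries, one per `<a>` tag, each with a `title` and a `link` key.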
Step three: create a lesson folder under the current directory and store the article files in it

for dic in source_list:  # traverse every dictionary
    request = urllib.request.urlopen(dic['link'])
    html = request.read()
    soup = BeautifulSoup(html, 'html.parser')
    text_p = soup.select('#Article > div.content > p')  # get the data under the p tags
    text = []  # store the article content
    for p in text_p:
        text.append(p.get_text().encode('utf-8'))  # take the text portion of each p tag, i.e. the article content
    name = dic['title']
    with open('lesson/%s.txt' % name, 'wb') as f:  # write the article to a file
        for line in text:
            f.write(line)
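Note that `open('lesson/...', 'wb')` raises `FileNotFoundError` if the lesson folder does not already exist; the loop assumes it has been created beforehand. A one-line sketch to create it safely before the loop runs:

```python
import os

# create the lesson folder if it does not already exist;
# exist_ok=True makes the call a no-op when the folder is present
os.makedirs('lesson', exist_ok=True)
```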
Data crawling is complete.
Note: the code above crawls a single page. To crawl more pages, use the following code:
from bs4 import BeautifulSoup
import urllib.request  # import the urllib library

url_list = ['https://www.pythontab.com/html/pythonhexinbiancheng/index.html']  # collect the page links
for i in range(2, 20):
    url = 'https://www.pythontab.com/html/pythonhexinbiancheng/%s.html' % i
    url_list.append(url)

for url in url_list:
    request = urllib.request.urlopen(url)
    html = request.read()  # read the page source code
    soup = BeautifulSoup(html, 'html.parser')  # parse the HTML
    title_links = soup.select('#catlist > li > a')  # find the titles and links
    source_list = []  # store a dictionary of titles and links
    for title_link in title_links:
        data = {
            'title': title_link.get_text(),
            'link': title_link.get('href')
        }
        source_list.append(data)

    for dic in source_list:  # traverse every dictionary
        request = urllib.request.urlopen(dic['link'])
        html = request.read()
        soup = BeautifulSoup(html, 'html.parser')
        text_p = soup.select('#Article > div.content > p')  # get the data under the p tags
        text = []  # store the article content
        for p in text_p:
            text.append(p.get_text().encode('utf-8'))  # take the text portion of each p tag
        name = dic['title']
        # replace characters that are not allowed in filenames
        filename = ('%s.txt' % name).replace('/', '_').replace('*', '@') \
            .replace('"', 'o').replace('?', 'W').replace(':', 'm')
        with open('lesson/' + filename, 'wb') as f:  # write the article to a file
            for line in text:
                f.write(line)
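The chain of `.replace()` calls maps each forbidden filename character to an arbitrary substitute. A more compact alternative (a sketch, not the article's code; `safe_filename` is a hypothetical helper name) handles the whole set of characters Windows forbids with a single regular expression:

```python
import re

def safe_filename(name):
    # hypothetical helper: map every character Windows forbids in
    # filenames (\ / * ? : " < > |) to an underscore
    return re.sub(r'[\\/*?:"<>|]', '_', name)

print(safe_filename('Python: what is a "closure"?'))  # Python_ what is a _closure__
```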
This completes a first BS4 crawl of all the advanced tutorials in the Python Chinese developer community (pythontab).