Using Python to crawl articles from the prose network (sanwen.net)
The environment is Python 2.7 with bs4 and requests.
Install both with pip:
sudo pip install bs4
sudo pip install requests
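A quick sanity check that both libraries are importable (just my own check, not from the original post):

import bs4
import requests

print bs4.__version__, requests.__version__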
Here is a brief introduction to using bs4, since it is what parses the crawled pages; the only two methods we need are find and find_all.
The difference between find and find_all is what they return: find returns the first matching tag together with its contents, while find_all returns a list of all matching tags.
For example, I wrote a small test.html (a handful of div tags with ids such as "one" through "four") to test the difference between find and find_all. The code for test.py is then:
from bs4 import BeautifulSoup
import lxml

if __name__ == '__main__':
    s = BeautifulSoup(open('test.html'), 'lxml')
    print s.prettify()
    print "------------------------------"
    print s.find('div')
    print s.find_all('div')
    print "------------------------------"
    print s.find('div', id='one')
    print s.find_all('div', id='one')
    print "------------------------------"
    print s.find('div', id='two')
    print s.find_all('div', id='two')
    print "------------------------------"
    print s.find('div', id='three')
    print s.find_all('div', id='three')
    print "------------------------------"
    print s.find('div', id='four')
    print s.find_all('div', id='four')
    print "------------------------------"
After running it, you can see that the results barely differ when grabbing a single specified tag, but when a group of tags matches, the difference between the two shows up clearly.
So pay attention to which of the two you actually need when using them; otherwise you will run into errors.
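For instance (a small illustration of my own, not from the original post), the typical mistakes look like this:

from bs4 import BeautifulSoup

s = BeautifulSoup('<div id="one">hello</div>', 'lxml')

print s.find('div', id='one').text         # OK: find returns a single Tag
print s.find_all('div', id='one')[0].text  # OK: index into the list first
# s.find_all('div', id='one').text         # AttributeError: a list has no .text
# s.find('div', id='missing').text         # AttributeError: find returned None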
The next step is to fetch the pages with requests. I don't know why other people bother writing headers and all that.
I just access the site directly: the get method fetches the several category sub-pages of the prose network, and then a loop crawls every listing page of each category.
def get_html():
    url = "https://www.sanwen.net/"
    # the six category sub-pages of the site
    two_html = ['sanwen', 'shige', 'zawen', 'suibi', 'rizhi', 'novel']
    for doc in two_html:
        i = 1
        if doc == 'sanwen':
            print "running sanwen -----------------------------"
        if doc == 'shige':
            print "running shige ------------------------------"
        if doc == 'zawen':
            print 'running zawen -------------------------------'
        if doc == 'suibi':
            print 'running suibi -------------------------------'
        if doc == 'rizhi':
            print 'running rizhi -------------------------------'
        if doc == 'novel':
            print 'running xiaoxiaoshuo -------------------------'
        while i < 10:
            par = {'p': i}  # builds e.g. https://www.sanwen.net/sanwen/?p=1
            res = requests.get(url + doc + '/', params=par)
            if res.status_code == 200:
                soup(res.text)
            i += 1
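For completeness, the imports the whole script relies on (collected here by me; the original post does not list them in one place) are:

import re
import requests
from bs4 import BeautifulSoup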
In this part of the code I do not handle the case where res.status_code is not 200. The resulting problem is that no error is shown and the content of that page is silently lost. I then analyzed the site's pages and found that the pagination URLs look like www.sanwen.net/rizhi/?p=1.
The maximum value of p is 10, which I find hard to understand; the last site I crawled had more than 100 pages. Never mind, I will analyze that some other time. Then the content of each page is fetched with the get method.
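Just to confirm how requests builds that pagination URL (a quick check of my own, not part of the original script):

import requests

res = requests.get('https://www.sanwen.net/rizhi/', params={'p': 1})
print res.url          # https://www.sanwen.net/rizhi/?p=1
print res.status_code  # 200 when the page is reachable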
After each listing page is obtained, the title and the author are parsed out of it. The code looks like this.
def soup(html_text):
    s = BeautifulSoup(html_text, 'lxml')
    # every article is an <li> inside the category list
    link = s.find('div', class_='categorylist').find_all('li')
    for i in link:
        if i != s.find('li', class_='page'):  # skip the pagination <li>
            title = i.find_all('a')[1]
            author = i.find_all('a')[2].text
            url = title.attrs['href']
            # some titles contain '/' or '//', which would break the file name later
            sign = re.compile(r'(//)|/')
            match = sign.search(title.text)
            file_name = title.text
            if match:
                file_name = sign.sub('a', title.text)
There is a pitfall when getting the title: why on earth do some people put slashes in their titles, and not just one but sometimes two? This directly caused errors in the file name when I wrote the files later, so I wrote a regular expression to replace the slashes.
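A quick check of what that regular expression does (the sample title here is made up):

import re

sign = re.compile(r'(//)|/')
title = u'a/title//with/slashes'   # hypothetical title containing slashes
print sign.sub('a', title)         # prints aatitleawithaslashes, safe as a file name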
Finally, the essay content itself is obtained. The article address comes from the analysis of each listing page, and the content is then requested directly. Originally I wanted to fetch the articles one by one simply by modifying the webpage address, to save effort.
def get_content(url):
    # url is the relative href taken from the listing page
    res = requests.get('https://www.sanwen.net' + url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text, 'lxml')
        contents = soup.find('div', class_='content').find_all('p')
        content = ''
        for i in contents:
            content += i.text + '\n'
        return content
The last step is to write everything to a file, and then we are done.
f = open(file_name + '.txt', 'w')
print 'running w txt' + file_name + '.txt'
f.write(title.text + '\n')
f.write(author + '\n')
content = get_content(url)
f.write(content)
f.close()
The remaining problem is that I don't know why some essays get lost: I can only get a little over 400 articles, which is far fewer than the prose network actually has, even though they really are fetched page by page. I hope someone can look into this; some pages may simply not be reachable. Of course, I suspect it also has something to do with my dormitory network.
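One thing that would help diagnose it (a sketch of my own, not the author's code; fetch_page is a hypothetical helper) is to log pages whose status code is not 200 instead of skipping them silently:

import requests

def fetch_page(url, page):
    # return the HTML of one listing page, or None, and report failures
    res = requests.get(url, params={'p': page})
    if res.status_code == 200:
        return res.text
    print 'failed:', res.url, res.status_code
    return None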
I almost forgot.
The code is messy, but I never stop.