Novel site: https://www.qu.la/paihangbang/
Goal: fetch the name of each novel and its link from every leaderboard, then write them to a table that Excel can open.
Press F12 to inspect the page elements and find the class of the information you want to locate.
See the code for a detailed explanation.
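Before the full script, here is a small offline sketch of what "locating by class" means in practice. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The HTML fragment is made up (it only imitates the class names used in the script), and SAMPLE_HTML, BookLinkParser and extract_books are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up fragment imitating one leaderboard block; the real page may differ.
SAMPLE_HTML = """
<div class="index_toplist mright mbottom">
  <div class="toptab"><span>Fantasy</span></div>
  <div class="topbooks">
    <ul>
      <li><a href="/book/1/" title="Book One">Book One</a></li>
      <li><a href="/book/2/" title="Book Two">Book Two</a></li>
    </ul>
  </div>
</div>
<div class="footer"><a href="/about/" title="About">About</a></div>
"""


class BookLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags inside <div class=target_class>."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # <div> nesting depth inside the target div; 0 means outside
        self.books = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif attrs.get("class") == self.target_class:
                self.depth = 1
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.books.append((attrs.get("title", ""), attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_books(html, target_class="topbooks", base="http://www.qu.la/"):
    parser = BookLinkParser(target_class)
    parser.feed(html)
    # urljoin avoids a doubled slash when href already starts with "/"
    return [(title, urljoin(base, href)) for title, href in parser.books]


books = extract_books(SAMPLE_HTML)
print(books)
```

Only the links inside the div with class topbooks are collected; the footer link is ignored because it sits outside that div.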
# coding: utf-8
# The declaration above is written so that the source is transcoded correctly.
import codecs  # used below to create the output file with proper transcoding

import requests
from bs4 import BeautifulSoup

__author__ = 'Administrator'

"""
The get_html function fetches the HTML page at the given URL and returns its text.
Everything could be written in a single function, but that function would be very bloated.
Writing this common step as a separate, self-contained function makes it easy to reuse later.
"""


def get_html(url):
    try:
        r = requests.get(url, timeout=30)  # requests timeouts are in seconds
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""


"""
The get_content function extracts the required information and writes it to a table that Excel can open.
"""


def get_content(url):
    url_list = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'html.parser')

    # The two kinds of leaderboard blocks are told apart by their class attribute.
    category_list = soup.find_all('div', class_='index_toplist mright mbottom')
    history_list = soup.find_all('div', class_='index_toplist mbottom')

    for cate in category_list:
        # name is already a str; codecs.open encodes it as UTF-8 when writing
        name = cate.find('div', class_='toptab').span.text
        with codecs.open('novel_list.csv', 'a+', 'utf-8') as f:
            f.write('\nFiction category: {}\n'.format(name))

        book_list = cate.find('div', class_='topbooks').find_all('li')
        for book in book_list:
            link = 'http://www.qu.la/' + book.a['href']
            title = book.a['title']
            url_list.append(link)
            with codecs.open('novel_list.csv', 'a+', 'utf-8') as f:
                f.write('novel name: {} \t novel address: {}\n'.format(title, link))

    for cate in history_list:
        name = cate.find('div', class_='toptab').span.string
        with codecs.open('novel_list.csv', 'a+', 'utf-8') as f:
            f.write('\nFiction type: {}\n'.format(name))

        general_list = cate.find(style='display:block;')  # find the overall leaderboard
        book_list = general_list.find_all('li')
        for book in book_list:
            link = 'http://www.qu.la/' + book.a['href']
            title = book.a['title']
            url_list.append(link)
            with codecs.open('novel_list.csv', 'a+', 'utf-8') as f:
                f.write('novel name: {:<} \t novel address: {:<}\n'.format(title, link))

    return url_list


def main():
    # Leaderboard address
    base_url = 'http://www.qu.la/paihangbang/'
    # Get links to all the novels in the leaderboards
    url_list = get_content(base_url)


if __name__ == '__main__':
    main()
This post is mainly a record of an encoding problem.
After the first run finished, the resulting Excel table was garbled.
So I started debugging.
I set a breakpoint at each step and inspected each variable (name, title) to see what encoding these values actually had.
While debugging, I added .encode('utf-8') to basically every value.
Each of them then showed up as a string variable.
However, the output was still garbled after adding it.
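One possible reason the extra .encode('utf-8') calls could not help, assuming the script runs under Python 3 (the post does not say which version was used): .encode() returns bytes, and formatting bytes into a str inserts their repr rather than the text. The sketch below illustrates this and shows that passing the encoding when opening the file is enough. The sample title and the temporary file path are made up for the illustration.

```python
import os
import tempfile

title = "小说"                     # tag text from the parser is already a str
encoded = title.encode("utf-8")    # bytes, not str

# Formatting bytes into a str embeds the repr, e.g. b'\xe5...', not the characters.
line_from_bytes = "novel name: {}".format(encoded)
line_from_str = "novel name: {}".format(title)

print(type(encoded).__name__)
print(line_from_bytes)
print(line_from_str)

# Let the file object do the encoding instead of encoding each value by hand.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(line_from_str)

with open(path, "rb") as f:
    on_disk = f.read()
print(on_disk == line_from_str.encode("utf-8"))
```

The built-in open(..., encoding='utf-8') is used here; for this purpose it plays the same role as codecs.open(..., 'utf-8').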
Then I tried writing the information to a TXT file, and that worked.
So the problem was in writing the Excel file: the data was not being written with the correct encoding, so I changed the call to codecs.open('novel_list.csv', 'a', 'utf-8').
That finally solved the problem.
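If Excel still shows garbled characters for a UTF-8 CSV, a commonly used alternative is the 'utf-8-sig' codec, which writes a byte order mark (BOM) at the start of the file; Excel generally uses that mark to recognise the file as UTF-8. The following is a minimal sketch under that assumption, not the method used in the script above. The function name write_rows_for_excel, the sample row and the temporary path are hypothetical.

```python
import codecs
import csv
import os
import tempfile


def write_rows_for_excel(path, rows):
    # 'utf-8-sig' puts a BOM at the start of the file before the UTF-8 data.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["novel name", "novel address"])
        writer.writerows(rows)


rows = [("示例小说", "http://www.qu.la/book/1/")]  # made-up sample row
path = os.path.join(tempfile.mkdtemp(), "novel_list.csv")
write_rows_for_excel(path, rows)

# Check that the file really starts with the UTF-8 BOM.
with open(path, "rb") as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))

# Reading back with 'utf-8-sig' strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
print(read_back)
```

Using csv.writer also puts the name and the address into separate columns, instead of writing one tab-separated text line per novel.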
This approach is for reference only. Encoding problems keep coming up, and they really are a headache.
Python Crawler Learning 3: A simple crawl of novel site information