Introduction:
Yesterday I taught myself the "Python Web Crawler in Action" course on NetEase Cloud Classroom (video link). The teacher explains things very clearly; following along and practicing once is enough to master the basics of crawling. Highly recommended!
I also came across another student's notes on the course online, which are very detailed and worth consulting first. Portal: please click.
This article is my own learning record, written while following the video. Welcome to read it ~ ~ ~
Experiment: Sina News homepage crawler practice
http://news.sina.com.cn/china/
First, preparation
Browser built-in developer tools (Chrome, for example)
Python 3 requests library
Python 3 BeautifulSoup4 library (note that BeautifulSoup4 and the old BeautifulSoup are not the same package)
Jupyter Notebook
Second, analysis before crawling
Taking Chrome as an example, the analysis steps before crawling are:
- Press F12 to open the developer tools;
- Click Network;
- Refresh the page (press F5);
- Find Doc;
- Find the first entry on the left under the Name column (in about 90% of cases the link to crawl is the first one);
- Click Headers on the right;
- Find the request URL and the request method.
Third, writing the first web crawler
Requests library
- A network resource capture kit
- Improves on urllib2's shortcomings and lets users access network resources in the simplest possible way
- Accesses network resources through REST operations
Jupyter
Use Jupyter to fetch the page and print it, then press Ctrl-F in the browser to locate the corresponding content and decide what we want to extract from the page.
Test Example:
import requests
res = requests.get('http://www.sina.com.cn/')
res.encoding = 'utf-8'
print(res.text)
IV. Analyzing web elements with BeautifulSoup4
Test Example:
from bs4 import BeautifulSoup

# note: an <h1 id="title"> element is assumed in the sample HTML so that the
# later h1 / #title examples have something to select
html_sample = '''
<html>
<body>
<h1 id="title">Hello World</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link">This is link2</a>
</body>
</html>
'''

soup = BeautifulSoup(html_sample, 'lxml')
print(soup.text)
V. Basic operations of BeautifulSoup
Use select to find elements with the h1 tag:
header = soup.select('h1')
print(header)
print(header[0])
print(header[0].text)
Use select to find elements with the a tag:
alink = soup.select('a')
print(alink)
for link in alink:
    print(link)
    print(link.text)
Use select to find all elements with id title (prefix the id with #):
alink = soup.select('#title')
print(alink)
Use select to find all elements with class link (prefix the class with a dot .):
soup = BeautifulSoup(html_sample, 'lxml')
for link in soup.select('.link'):
    print(link)
Use select to find the href of every a tag:
alinks = soup.select('a')
for link in alinks:
    print(link['href'])  # principle: a tag's attributes are wrapped in a dictionary
VI. Observing how to crawl Sina News information
The key is finding the CSS selectors that locate the elements we want.
VII. Building the Sina News crawler
Crawl the time, title, and link of each news item:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'lxml')
for news in soup.select('.news-item'):
    if len(news.select('h2')) > 0:
        h2 = news.select('h2')[0].text
        time = news.select('.time')[0].text
        a = news.select('a')[0]['href']
        print(time, h2, a)
Crawling the news article page
The news site is: http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml
Get the article title, time, and source
This involves converting between strings and datetime objects:
from datetime import datetime

# string to datetime: strptime (timesource is the date string taken from the page,
# e.g. '2017年12月06日10:26')
dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
# datetime to string: strftime
dt.strftime('%Y-%m-%d')
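A minimal sketch of pulling the title, raw time string, and source out of the article page (the selectors #artibodyTitle and .time-source are assumptions about the page structure):

import requests
from bs4 import BeautifulSoup

res = requests.get('http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'lxml')
title = soup.select('#artibodyTitle')[0].text                      # article title
timesource = soup.select('.time-source')[0].contents[0].strip()   # raw date string for strptime above
source = soup.select('.time-source a')[0].text                     # news source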
Assemble the article body and get the editor's name
Steps to assemble the article text:
1. Crawl the page;
2. Get the paragraphs;
3. Remove the last line (the editor information);
4. Strip the whitespace;
5. Join the paragraphs with \n (any other separator can be substituted here);
Finally, the whole thing can be shortened to a single line, as in the sketch below.
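A minimal sketch of these steps (soup is the BeautifulSoup object of the article page from above; the #artibody and .article-editor selectors are assumptions about the page structure):

article = []
for p in soup.select('#artibody p')[:-1]:       # all paragraphs except the last (editor) line
    article.append(p.text.strip())              # strip the whitespace
article_text = '\n'.join(article)               # join with \n

# the one-line shorthand:
article_text = '\n'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])

# editor name, with the leading label removed (selector and label are assumptions):
editor = soup.select('.article-editor')[0].text.lstrip('责任编辑：')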
Crawl the number of comments on a news item
Explanation:
The comments are loaded by JS code. Since it is JS, there is a good chance the data arrives via Ajax, so first look under XHR; but none of the responses there contain the comment total (2 at the time). The only option left is to go under JS and do a carpet search to find which response contains that comment total of 2, and it finally turns up.
Find the request URL and the request method.
The number of comments updates in real time, so don't be surprised if it looks different today ^_^
Then we can write the code to get it, as in the sketch below.
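A minimal sketch, assuming the comment API URL found under JS looks like the one below (the exact URL and the news ID in it are illustrative assumptions):

import json
import requests

comment_url = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
               '&channel=gn&newsid=comos-fypnyqi1126795&group=&compress=0'
               '&ie=utf-8&oe=utf-8&page=1&page_size=20')
comments = requests.get(comment_url)
jd = json.loads(comments.text.strip('var data='))   # remove the "var data=" wrapper
print(jd['result']['count']['total'])               # total number of comments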
Explanation:
The response looks like var data={......}, which resembles a JSON string. Remove the var data= prefix and parse the rest as JSON.
As you can see, the resulting jd dictionary contains the comment information.
Go back to the Chrome developer tools and check the number of comments against it.
Get the news identifier (news ID)
Method 1: string slicing
# get the news ID by slicing the URL
newsurl = 'http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml'
newsid = newsurl.split('/')[-1].rstrip('.shtml').lstrip('doc-i')
newsid
Method 2: regular expressions
import re
m = re.search('doc-i(.*).shtml', newsurl)
newsid = m.group(1)
newsid
VIII. Building a function to get the comment count
Tidy up the method we just used to get the comment count and wrap it in a function. After that, given any news article link, we can get its total number of comments, as in the sketch below.
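A minimal sketch of such a function (the comment API URL template is the same assumption as above):

import re
import json
import requests

commentURL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
              '&channel=gn&newsid=comos-{}&group=&compress=0'
              '&ie=utf-8&oe=utf-8&page=1&page_size=20')

def getCommentCounts(newsurl):
    m = re.search('doc-i(.+).shtml', newsurl)
    newsid = m.group(1)                               # news ID extracted from the URL
    comments = requests.get(commentURL.format(newsid))
    jd = json.loads(comments.text.strip('var data='))
    return jd['result']['count']['total']

# usage:
getCommentCounts('http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml')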
IX. Building the information extraction function for the article page
Consolidate the title, time, source, article body, editor, and comment-count extraction above into a single function; a sketch follows.
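A minimal sketch (all CSS selectors are assumptions about the article page's structure; getCommentCounts is the function from the previous section):

from datetime import datetime

import requests
from bs4 import BeautifulSoup

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    result['title'] = soup.select('#artibodyTitle')[0].text
    timesource = soup.select('.time-source')[0].contents[0].strip()
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    result['source'] = soup.select('.time-source a')[0].text
    result['article'] = '\n'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    result['editor'] = soup.select('.article-editor')[0].text.lstrip('责任编辑：')
    result['comments'] = getCommentCounts(newsurl)    # from section VIII
    return result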
X. Getting each news item's link from the list page
If there is nothing we want under Doc, there is reason to suspect that this page generates its data asynchronously, so we need to look under XHR and JS instead.
Sometimes you will find that the asynchronous data is not under XHR but under JS. That is because the data is wrapped in a JS function, and Chrome's developer tools treat it as a JS file, so it gets placed under JS.
Under JS, find the response we are interested in, then click Preview to inspect it; once we are sure it is what we are looking for, go to Headers to check the Request URL and Request Method.
In general the first entry under JS is probably the one we want, so pay special attention to it.
1. Select the Network tab
2. Select JS
3. Find the list link containing page=2
Handling the paging links:
Note that you need to trim the head and tail of the response and turn it into standard JSON format, as in the sketch below.
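A minimal sketch of the trimming (list_url is a placeholder for the Request URL found in step 3; the callback name comes from the response wrapper):

import json
import requests

list_url = '...'                  # placeholder: the paging request URL found in step 3
res = requests.get(list_url)
# the response body looks like: newsloadercallback({...});
jd = json.loads(res.text.lstrip('newsloadercallback(').rstrip(');'))
jd['result']['data']              # the list of news entries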
XI. Building a function to parse the list links
Tidy up the previous steps and encapsulate them in a function:
def parseListLinks(url):
    newsdetails = []
    res = requests.get(url)
    jd = json.loads(res.text.lstrip('newsloadercallback(').rstrip(');'))
    for ent in jd['result']['data']:
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails
12. Use a for loop to generate multiple page links, as in the sketch below.
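A minimal sketch (list_url_template is a placeholder for the paging API URL found earlier, with the page number replaced by {}; the page range is an arbitrary example):

list_url_template = '...'          # placeholder: the paging URL with {} where the page number was
url_list = []
for i in range(1, 10):             # pages 1 to 9
    url_list.append(list_url_template.format(i))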
13. Batch-crawl the article details of every news item on every page, as in the sketch below.
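A minimal sketch of the batch crawl, combining the functions above:

news_total = []
for list_url in url_list:
    news_total.extend(parseListLinks(list_url))   # details of every article on every page
len(news_total)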
14. Use pandas to organize the data
pandas, Python for data analysis:
- Originated from R
- Table-like format
- Provides an efficient, easy-to-use DataFrame that lets users quickly manipulate and analyze data
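A minimal sketch of organizing the crawled results with pandas:

import pandas

df = pandas.DataFrame(news_total)   # news_total comes from the batch-crawl step
df.head()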
Save the data to a database, as in the sketch below.
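A minimal sketch, saving the DataFrame into a local SQLite database via pandas (the file and table names are arbitrary):

import sqlite3

with sqlite3.connect('news.sqlite') as db:
    df.to_sql('news', con=db, if_exists='replace')   # write the table, replacing any existing one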
After fighting my way to this point, the first web crawler is finally finished. Looking at the final result gives a real sense of accomplishment! ^_^
Anyone interested is welcome to give it a try and to discuss and exchange ideas ~ ~ ~
If you find this article useful, please give it a like. Thank you for your support!
Special gift: GitHub code portal
Thank you for reading patiently. If this was even a little helpful, feel free to light up a star on my GitHub. Thanks ~ ~ ~
Python crawler learning record (code and detailed steps included)