Python Crawler Learning Record (code and detailed steps included)


Introduction:

Yesterday I taught myself the course "Python Network Crawler in Action" on NetEase Cloud Classroom (video link). The instructor explains everything very clearly, and following along once is enough to master the basics of crawling. Highly recommended!

I also came across another student's write-up of the course online. It is very detailed and well worth studying first. Portal: please click.

This article is my own learning record, kept in sync with the video. Happy reading ~~~

Experiment: Sina News homepage crawler practice

http://news.sina.com.cn/china/

I. Preparation

    • Browser built-in developer tools (for example, Chrome)

    • Python3 Requests Library

    • Python3 beautifulsoup4 library (note that beautifulsoup4 and BeautifulSoup are not the same package)

    • Jupyter Notebook

II. Analysis before crawling

Taking Chrome as an example, the analysis steps before crawling are:

    1. Press F12 to open the developer tools;
    2. Click Network;
    3. Refresh the page (press F5);
    4. Find the Doc tab;
    5. Find the first entry on the left under the Name column (90% of the time the link you need to crawl is the first one);
    6. Click Headers on the right;
    7. Find the Request URL and the Request Method.

III. Writing the first web crawler

Requests library
    • A network resource capture kit
    • Improves on urllib2's shortcomings, letting users access network resources in the simplest way possible
    • Accesses network resources using REST operations (GET, POST, PUT, DELETE)
Jupyter

Use Jupyter to crawl the page and print it, then press Ctrl-F to search for the content we are after and confirm it is actually present in the crawled page.

Test Example:

    import requests

    res = requests.get('http://www.sina.com.cn/')
    res.encoding = 'utf-8'
    print(res.text)

IV. Analyzing web page elements with BeautifulSoup4

Test Example:

    from bs4 import BeautifulSoup

    # sample HTML; the <h1 id="title"> line is inferred from the h1/#title examples below
    html_sample = '''
    <html>
      <body>
        <h1 id="title">Hello World</h1>
        <a href="#" class="link">This is link1</a>
        <a href="# link2" class="link">This is link2</a>
      </body>
    </html>'''

    soup = BeautifulSoup(html_sample, 'lxml')
    print(soup.text)

V. Basic operations of BeautifulSoup

Use select to find elements with the h1 tag

    header = soup.select('h1')
    print(header)
    print(header[0])
    print(header[0].text)

Use select to find all a tags

    alink = soup.select('a')
    print(alink)
    for link in alink:
        print(link)
        print(link.text)

Use select to find all elements with id title (prefix the id with #)

    alink = soup.select('#title')
    print(alink)

Use select to find all elements with class link (prefix the class with a dot .)

    soup = BeautifulSoup(html_sample, 'lxml')
    for link in soup.select('.link'):
        print(link)

Use select to get the href attribute of every a tag

    alinks = soup.select('a')
    for link in alinks:
        print(link['href'])  # principle: a tag's attributes are wrapped in a dictionary

VI. Observing how to crawl the Sina News information

The key is working out the CSS selector that locates the element you want

    • Chrome developer tools (after opening the developer tools, click the element-picker icon in the upper-left corner, then click an element on the page to inspect it)

      [Figure: locating an element with the Chrome developer tools' element picker]

    • Firefox developer Tools
    • Infolite (requires bypassing the firewall to install)

VII. Building the Sina News crawler

Crawl the time, title, and link of each news item

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('http://news.sina.com.cn/china')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    for news in soup.select('.news-item'):
        if len(news.select('h2')) > 0:
            h2 = news.select('h2')[0].text
            time = news.select('.time')[0].text
            a = news.select('a')[0]['href']
            print(time, h2, a)

Crawling the news article page

The example article is: http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml

[Figure: overview of the article information to extract]

Get the article title, time, and source

This involves converting between strings and datetime objects:

    from datetime import datetime

    # string to datetime --- strptime
    # timesource is the time string scraped from the page, e.g. '2017年12月06日10:23'
    dt = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    # datetime to string --- strftime
    dt.strftime('%Y-%m-%d')

Organize the article body and get the editor's name

Steps to organize the article body:

1. Crawl the page;

2. Get the paragraphs;

3. Remove the last line (the editor information);

4. Strip the whitespace;

5. Join the paragraphs with \n (any other separator can be used here instead);

The whole thing can finally be condensed into a single line, as sketched below.
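A minimal sketch of that one-liner, assuming the article body sits under the selector #artibody and the editor line under .article-editor (both selector names are assumptions and may differ on the live page):

    # join every paragraph except the last one (the editor line), stripped of whitespace
    article = '\n'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    # the editor's name, with the leading "责任编辑：" label removed
    editor = soup.select('.article-editor')[0].text.lstrip('责任编辑：')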

Crawl the news comment count

Explanation:

The comment count is loaded by JavaScript, and since it is JS, it very likely arrives via an Ajax request. So first check the XHR tab, but none of those responses contain the total comment count (2). Then go through the JS entries one by one in a carpet-style search to find which response contains the total comment count of 2, and finally it turns up.

Find the request URL and the request method

The comment count updates in real time, so don't be surprised if the number you see differs ^_^

Then you can write the code.

Explanation:

The response looks like var data={......}, i.e. a JSON object assigned to a JavaScript variable. Strip the leading var data= so that the remainder is a standard JSON string.

As you can see, after parsing, the comment information is inside the resulting object (named jd here).
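A small sketch of this step, assuming comments holds the response of the comment request found above (the key path into the JSON is an assumption based on the response seen in DevTools):

    import json

    # strip the leading "var data=" so the remainder parses as standard JSON
    jd = json.loads(comments.text.strip('var data='))
    # the total comment count sits inside the parsed object
    print(jd['result']['count']['total'])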

Go back to the Chrome developer tools and check the comment count against it.

Get news identifier (news ID)

Method 1: string splitting

    # get the news id by splitting the article URL
    newsurl = 'http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml'
    newsid = newsurl.split('/')[-1].rstrip('.shtml').lstrip('doc-i')
    newsid

Method 2: regular expressions

    import re

    m = re.search('doc-i(.*).shtml', newsurl)
    newsid = m.group(1)
    newsid

VIII. Building a function to get the comment count

Tidy up the comment-count code above and wrap it in a function. After that, given any news article link, we can get its total comment count, as sketched below.
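A sketch of such a function. The comment API URL below is the one observed under the JS tab while following the course; both it and the key path into the JSON are assumptions that may have changed since:

    import re
    import json
    import requests

    # Request URL of the comment data, with the news id replaced by {}
    commentURL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js'
                  '&channel=gn&newsid=comos-{}&group=&compress=0&ie=utf-8&oe=utf-8'
                  '&page=1&page_size=20')

    def getCommentCounts(newsurl):
        # extract the news id from the article URL
        m = re.search('doc-i(.*).shtml', newsurl)
        newsid = m.group(1)
        # fetch the comment data and strip the leading "var data=" to get valid JSON
        comments = requests.get(commentURL.format(newsid))
        jd = json.loads(comments.text.strip('var data='))
        return jd['result']['count']['total']

    news = 'http://news.sina.com.cn/o/2017-12-06/doc-ifypnyqi1126795.shtml'
    print(getCommentCounts(news))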

IX. Building a function to extract the article information
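A sketch of such a function, combining the title, time, source, article body, editor, and comment count from the previous sections. All the selector names used here (#artibodyTitle, .time-source, #artibody, .article-editor) are assumptions about the page structure at the time and may need adjusting:

    from datetime import datetime

    def getNewsDetail(newsurl):
        result = {}
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'lxml')
        result['title'] = soup.select('#artibodyTitle')[0].text
        result['newssource'] = soup.select('.time-source span a')[0].text
        timesource = soup.select('.time-source')[0].contents[0].strip()
        result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
        result['article'] = '\n'.join(
            [p.text.strip() for p in soup.select('#artibody p')[:-1]])
        result['editor'] = soup.select('.article-editor')[0].text.lstrip('责任编辑：')
        result['comments'] = getCommentCounts(newsurl)
        return result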

X. Getting the link of every news item from the list page

If there is nothing we want under Doc, there is reason to suspect that the page generates its data asynchronously, so we need to look under XHR and JS instead.

Sometimes the asynchronous data is not under XHR but under JS. That is because the data is wrapped in a JavaScript function call, so Chrome's developer tools treat it as a JS file and place it under JS.

Under JS, find the entry that contains the information we are interested in, then click Preview to inspect it. Once we are sure it is what we are looking for, go to Headers to check the Request URL and Request Method.

In general, the first entry under JS is probably the one we are looking for, so pay special attention to it.

1. Select the Network tab

2. Select JS

3. Find the paging link (it contains page=2)

Working with the paging link

Note that you need to strip the head and tail (the callback wrapper) to turn the response into standard JSON.

XI. Building a function to parse the list-page links

Tidy up the previous steps and encapsulate them in a function.

    def parseListLinks(url):
        newsdetails = []
        res = requests.get(url)
        # strip the JSONP callback wrapper so the body parses as JSON
        jd = json.loads(res.text.lstrip('newsloadercallback(').rstrip(');'))
        for ent in jd['result']['data']:
            newsdetails.append(getNewsDetail(ent['url']))
        return newsdetails

XII. Use a for loop to generate multiple page links
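A sketch, assuming page_url is the Request URL found under the JS tab with the page number replaced by {} (the URL below is only a placeholder, not the real endpoint):

    # placeholder for the real Request URL observed in DevTools
    page_url = 'http://api.roll.news.sina.com.cn/...&page={}&callback=newsloadercallback'
    urls = [page_url.format(i) for i in range(1, 4)]   # e.g. pages 1 to 3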

XIII. Batch-crawl the article of every news item on every page
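A sketch of the batch crawl, reusing parseListLinks from section XI and the urls list generated in the previous step:

    news_total = []
    for url in urls:
        news_total.extend(parseListLinks(url))   # each call returns a list of news dicts
    len(news_total)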

XIV. Use pandas to organize the data

pandas: Python for data analysis

    • Its DataFrame is modeled on R's data.frame
    • Table-like format
    • Provides an efficient, easy-to-use DataFrame that lets users quickly manipulate and analyze data
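A minimal sketch of this step, assuming news_total is the list of news dictionaries collected above:

    import pandas

    df = pandas.DataFrame(news_total)
    df.head()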

Save the data to a database
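A minimal sketch using the standard-library sqlite3 together with pandas' to_sql; the database file name and table name are just examples:

    import sqlite3

    # write the DataFrame into a local SQLite database
    with sqlite3.connect('news.sqlite') as db:
        df.to_sql('news', con=db, if_exists='replace', index=False)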

Having fought my way to this point, the first web crawler is finally finished. Looking at the final result gives a real sense of accomplishment! ^_^

Anyone interested is welcome to try it out and discuss ~~~

If you find the article useful, please give it a like. Thank you for your support!

Bonus: GitHub code portal

Thank you for reading patiently. If this helped even a little, please light up a star on my GitHub. Thanks ~~~

