Python Crawler: Scraping Article Data from Woshipm (Everyone Is a Product Manager)


Crawl content:

The titles, page views, and thumbnails of the latest articles listed on the left side of the Woshipm homepage (www.woshipm.com).

Ideas:

1. Use BeautifulSoup to parse the webpage:

variable_name = BeautifulSoup(webpage_text, 'lxml')

2. Describe where the content to be crawled is located on the page:

variable_name = variable_name.select('selector')

3. Extract the content we want.

Next, let's look at the specific implementation.

1. First, install the libraries we will use: BeautifulSoup, requests, and lxml. For the installation method, see my earlier article "Python Getting Started: How to Use a Third-Party Library?". BeautifulSoup and lxml are common third-party libraries for parsing webpages. Then import the BeautifulSoup and requests libraries.
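If you use pip, the install is a one-liner (note that the PyPI package name for BeautifulSoup is beautifulsoup4):

```shell
pip install beautifulsoup4 requests lxml
```
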

from bs4 import BeautifulSoup
import requests

2. After importing the third-party libraries, we need to describe where the information we want to crawl is located.

1 url = 'http://www.woshipm.com'
2 web_data = requests.get(url)
3 soup = BeautifulSoup(web_data.text, 'lxml')
4 titles = soup.select('h2.stream-list-title > a')
5 pageviews = soup.select('footer > span.post-views')
6 imgs = soup.select('div.stream-list-image > a > img')

Let's look at it one by one.

Line 1: Specify the webpage from which the information is obtained;

Line 2: web_data is the variable name; the requests library's get method requests the content of this webpage;

Line 3: soup is the variable name; BeautifulSoup is called with the lxml parser to parse the webpage. Specifically, web_data.text is the text content of the fetched webpage;

Line 4: specify the location of the titles, implemented with the select method. The CSS selector inside the brackets describes where the titles sit on the page. That selector is obtained as follows:

Open the page in Chrome, move the mouse over a title, right-click and choose Inspect; the panel on the right highlights the code corresponding to the title. Right-click that code and choose Copy > Copy selector. BeautifulSoup supports CSS selectors but does not support XPath.

This gives us the path to the title. The copied path is quite long; I deleted the earlier part and kept only the last 2-3 layers before the title to represent its path.
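To see why trimming works, here is a small sketch on a hypothetical markup snippet (the real woshipm page is more complex, and the full selector below is made up for illustration). It uses Python's built-in html.parser so it runs without lxml; the trimmed selector matches the same element as the full copied one:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a nested page structure.
html = '<div id="content"><div class="main"><h2 class="stream-list-title"><a href="#">T</a></h2></div></div>'
soup = BeautifulSoup(html, 'html.parser')

# A full selector as Chrome might copy it (hypothetical):
full = soup.select('#content > div.main > h2.stream-list-title > a')
# The trimmed version, keeping only the last layers before the title:
short = soup.select('h2.stream-list-title > a')
print(full == short)  # True
```

As long as the last few layers uniquely identify the element, the shorter selector finds the same tags.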

The selectors on the fifth line (pageviews) and sixth line (imgs) are obtained the same way.
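Here is a self-contained sketch of what the three selectors match, using a hypothetical snippet of one article card (the real homepage markup may differ) and the built-in html.parser so no network or lxml is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one article card on the woshipm homepage.
html = """
<div class="stream-list-image"><a href="/p/1"><img src="/thumb1.jpg"></a></div>
<h2 class="stream-list-title"><a href="/p/1">Article One</a></h2>
<footer><span class="post-views">123</span></footer>
"""

soup = BeautifulSoup(html, 'html.parser')
titles = soup.select('h2.stream-list-title > a')
pageviews = soup.select('footer > span.post-views')
imgs = soup.select('div.stream-list-image > a > img')

print(titles[0].get_text())     # Article One
print(pageviews[0].get_text())  # 123
print(imgs[0].get('src'))       # /thumb1.jpg
```

get_text() returns the text inside a tag, while get('src') reads an attribute value; both are used again in the next step.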

3. After completing the above two steps, we extract our target information and load it into a dictionary. The usage of dictionaries is described in the article "Python Getting Started: Four Basic Data Structures".

for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)

Here, three kinds of data are crawled: titles (titles), page views (pageviews), and images (imgs). zip lets us walk through the three lists together. The usage of the for loop is covered in the article "Python Getting Started: for Loops and while Loops".
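As a standalone illustration of how zip pairs the three lists (the values below are made up):

```python
titles = ['Article One', 'Article Two']
pageviews = ['123', '456']
imgs = ['/a.jpg', '/b.jpg']

# zip pairs the i-th element of each list, stopping at the shortest list,
# so exactly one dictionary is built per article.
for title, pageview, img in zip(titles, pageviews, imgs):
    data = {'title': title, 'pageview': pageview, 'img': img}
    print(data)
```

If one list were shorter (say, an article without a thumbnail), zip would silently drop the extra entries from the longer lists, which is worth keeping in mind when the counts don't match.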

After running the program, you get the expected results (the woshipm homepage loads 10 articles by default).

Finally, complete code is provided:

from bs4 import BeautifulSoup
import requests

url = 'http://www.woshipm.com'
web_data = requests.get(url)
soup = BeautifulSoup(web_data.text, 'lxml')
titles = soup.select('h2.stream-list-title > a')
pageviews = soup.select('footer > span.post-views')
imgs = soup.select('div.stream-list-image > a > img')

for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)

I recommend using a similar method to crawl other data on the page, such as author information or bookmark counts, or to try another webpage. Learn by doing.

Operating environment: Python 3.6; PyCharm 2016.2; macOS.

----- End -----

Author: Du Wangdan, public account: Du Wangdan, Internet product manager.
