Python crawler: crawling article data from Woshipm ("Everyone Is a Product Manager")
What to crawl:
The title, page views, and thumbnail of each of the latest articles on the left side of the Woshipm homepage (www.woshipm.com).
Approach:
1. Parse the webpage with BeautifulSoup:
soup = BeautifulSoup(page_text, 'lxml')
2. Describe where the target content lives on the page:
results = soup.select('css selector')
3. Extract the content we want from those results.
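The three steps above can be sketched on a tiny inline HTML snippet, so the pattern runs without any network request. The snippet and its markup are illustrative stand-ins, not the real woshipm.com page structure:

```python
from bs4 import BeautifulSoup

# A tiny stand-in page, so the three-step pattern can run offline
html = """
<ul>
  <li><h2 class="stream-list-title"><a href="/p/1">First post</a></h2></li>
  <li><h2 class="stream-list-title"><a href="/p/2">Second post</a></h2></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')         # step 1: parse the page
links = soup.select('h2.stream-list-title > a')   # step 2: locate the content
titles = [a.get_text() for a in links]            # step 3: extract what we want
print(titles)
```

The same three lines of logic reappear in the real crawler below, with 'lxml' as the parser and requests supplying the page text.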
Next, let's look at the specific implementation.
1. First, install the libraries we will use: BeautifulSoup, requests, and lxml. For the installation method, see my previous article, "Python Getting Started: how to use a third-party library?". BeautifulSoup and lxml are common third-party libraries for parsing webpages. Then import BeautifulSoup and requests.
from bs4 import BeautifulSoup
import requests
2. After importing the third-party libraries, describe where the information we want to crawl is located.
1 url = 'http://www.woshipm.com'
2 web_data = requests.get(url)
3 soup = BeautifulSoup(web_data.text, 'lxml')
4 titles = soup.select('h2.stream-list-title > a')
5 pageviews = soup.select('footer > span.post-views')
6 imgs = soup.select('div.stream-list-image > a > img')
Let's look at it one by one.
Line 1: specify the URL of the webpage to fetch;
Line 2: web_data holds the response returned by the requests library's get() for that URL;
Line 3: soup holds the parsed page: BeautifulSoup parses web_data.text (the HTML text of the page) using the lxml parser;
Line 4: locate the titles using the select() method, passing the CSS selector of the title elements in parentheses. The selector is obtained as follows:
Open the page in Chrome, hover over a title, right-click and choose Inspect; the panel on the right highlights the code corresponding to the title. Right-click that code and choose Copy > Copy selector. Note that BeautifulSoup supports CSS selectors but not XPath.
This gives us the title's path. The copied selector is usually quite long; I deleted the front part and kept only the last two or three levels before the title, which is enough to identify it.
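Trimming the copied selector works because CSS matching only requires the levels you keep, and the shorter form is less brittle when ancestor elements change. A small demonstration on made-up markup (the structure here is illustrative, not the real woshipm.com page):

```python
from bs4 import BeautifulSoup

html = ('<div id="main"><div class="stream-list">'
        '<h2 class="stream-list-title"><a href="/p/1">Hello</a></h2>'
        '</div></div>')
soup = BeautifulSoup(html, 'html.parser')

# Full path, as Chrome's "Copy selector" might produce it
full = soup.select('#main > div.stream-list > h2.stream-list-title > a')
# Trimmed to the last two levels, as the article does
short = soup.select('h2.stream-list-title > a')

print(full == short)  # both selectors match the same element
```

If a site redesign moves the list into a different container, the trimmed selector keeps working while the full path breaks.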
Lines 5 (pageviews) and 6 (imgs) are obtained the same way.
3. After completing the two steps above, we extract the target information and load it into a dictionary. Dictionary usage is covered in the article "Python Getting Started: four basic data structures".
for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)
Here we crawl three lists of data: titles (titles), page views (pageviews), and images (imgs). zip() pairs the three lists up element by element so they can be processed together. The for loop is covered in the article "Python Getting Started: for loops and while loops".
After running the program, you get the expected results (the woshipm homepage loads 10 articles by default).
Finally, complete code is provided:
from bs4 import BeautifulSoup
import requests

url = 'http://www.woshipm.com'
web_data = requests.get(url)
soup = BeautifulSoup(web_data.text, 'lxml')
titles = soup.select('h2.stream-list-title > a')
pageviews = soup.select('footer > span.post-views')
imgs = soup.select('div.stream-list-image > a > img')

for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)
As an exercise, try the same method to crawl other fields, such as author information, or try another webpage. Learning by doing.
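To attempt the exercise, the only new piece is finding the right selector with Chrome's Inspect and Copy selector, exactly as before. The snippet below shows the shape of such an extension on made-up markup; the class name post-author is hypothetical, so you must inspect the real page and substitute the actual selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real class names on woshipm.com may differ,
# so inspect the page and copy the actual selector as described above.
html = ('<footer><span class="post-author">Alice</span>'
        '<span class="post-views">1200</span></footer>')
soup = BeautifulSoup(html, 'html.parser')

authors = soup.select('footer > span.post-author')
print([a.get_text() for a in authors])
```

Once the selector is right, the new list can simply be added as a fourth argument to zip() and a fourth key in the dictionary.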
Operating environment: Python 3.6; PyCharm 2016.2; computer: Mac
----- End -----
Author: du wangdan, Public Account: du wangdan, Internet product manager.