Python crawler: crawling article data from Woshipm ("Everyone Is a Product Manager")
What to crawl:
The title, page views, and thumbnail of each of the latest articles on the left side of the Woshipm homepage (www.woshipm.com).
Approach:
1. Parse the webpage with BeautifulSoup:
soup = BeautifulSoup(page_text, 'lxml')
2. Describe where the target content lives on the page:
results = soup.select('css selector')
3. Extract the content we want from those results.
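The three steps above can be sketched on a tiny inline HTML snippet, so the pattern runs without any network request. The snippet and its markup are illustrative stand-ins, not the real woshipm.com page structure:

```python
from bs4 import BeautifulSoup

# A tiny stand-in page, so the three-step pattern can run offline
html = """
<ul>
  <li><h2 class="stream-list-title"><a href="/p/1">First post</a></h2></li>
  <li><h2 class="stream-list-title"><a href="/p/2">Second post</a></h2></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')         # step 1: parse the page
links = soup.select('h2.stream-list-title > a')   # step 2: locate the content
titles = [a.get_text() for a in links]            # step 3: extract what we want
print(titles)
```

The same three lines of logic reappear in the real crawler below, with 'lxml' as the parser and requests supplying the page text.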
Next, let's look at the specific implementation.
1. First, install the libraries we will use: BeautifulSoup, requests, and lxml. For the installation method, see my previous article, "Python Getting Started: how to use a third-party library?". BeautifulSoup and lxml are common third-party libraries for parsing webpages. Then import BeautifulSoup and requests.
from bs4 import BeautifulSoup
import requests
2. After importing the third-party libraries, describe where the information we want to crawl is located.
1 url = 'http://www.woshipm.com'
2 web_data = requests.get(url)
3 soup = BeautifulSoup(web_data.text, 'lxml')
4 titles = soup.select('h2.stream-list-title > a')
5 pageviews = soup.select('footer > span.post-views')
6 imgs = soup.select('div.stream-list-image > a > img')
Let's look at it one by one.
Line 1: specify the URL of the webpage to fetch;
Line 2: web_data holds the response returned by the requests library's get() for that URL;
Line 3: soup holds the parsed page: BeautifulSoup parses web_data.text (the HTML text of the page) using the lxml parser;
Line 4: locate the titles using the select() method, passing the CSS selector of the title elements in parentheses. The selector is obtained as follows:
Open the page in Chrome, hover over a title, right-click and choose Inspect; the panel on the right highlights the code corresponding to the title. Right-click that code and choose Copy > Copy selector. Note that BeautifulSoup supports CSS selectors but not XPath.
This gives us the title's path. The copied selector is usually quite long; I deleted the front part and kept only the last two or three levels before the title, which is enough to identify it.
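Trimming the copied selector works because CSS matching only requires the levels you keep, and the shorter form is less brittle when ancestor elements change. A small demonstration on made-up markup (the structure here is illustrative, not the real woshipm.com page):

```python
from bs4 import BeautifulSoup

html = ('<div id="main"><div class="stream-list">'
        '<h2 class="stream-list-title"><a href="/p/1">Hello</a></h2>'
        '</div></div>')
soup = BeautifulSoup(html, 'html.parser')

# Full path, as Chrome's "Copy selector" might produce it
full = soup.select('#main > div.stream-list > h2.stream-list-title > a')
# Trimmed to the last two levels, as the article does
short = soup.select('h2.stream-list-title > a')

print(full == short)  # both selectors match the same element
```

If a site redesign moves the list into a different container, the trimmed selector keeps working while the full path breaks.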
Lines 5 (pageviews) and 6 (imgs) are obtained the same way.
3. After completing the two steps above, we extract the target information and load it into a dictionary. Dictionary usage is covered in the article "Python Getting Started: four basic data structures".
for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)
Here we crawl three lists of data: titles (titles), page views (pageviews), and images (imgs). zip() pairs the three lists up element by element so they can be processed together. The for loop is covered in the article "Python Getting Started: for loops and while loops".
After running the program, you get the expected results (the woshipm homepage loads 10 articles by default).
Finally, complete code is provided:
from bs4 import BeautifulSoup
import requests

url = 'http://www.woshipm.com'
web_data = requests.get(url)
soup = BeautifulSoup(web_data.text, 'lxml')
titles = soup.select('h2.stream-list-title > a')
pageviews = soup.select('footer > span.post-views')
imgs = soup.select('div.stream-list-image > a > img')

for title, pageview, img in zip(titles, pageviews, imgs):
    data = {
        'title': title.get_text(),
        'pageview': pageview.get_text(),
        'img': img.get('src')
    }
    print(data)
As an exercise, try the same method to crawl other fields, such as author information, or try another webpage. Learning by doing.
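To attempt the exercise, the only new piece is finding the right selector with Chrome's Inspect and Copy selector, exactly as before. The snippet below shows the shape of such an extension on made-up markup; the class name post-author is hypothetical, so you must inspect the real page and substitute the actual selector:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real class names on woshipm.com may differ,
# so inspect the page and copy the actual selector as described above.
html = ('<footer><span class="post-author">Alice</span>'
        '<span class="post-views">1200</span></footer>')
soup = BeautifulSoup(html, 'html.parser')

authors = soup.select('footer > span.post-author')
print([a.get_text() for a in authors])
```

Once the selector is right, the new list can simply be added as a fourth argument to zip() and a fourth key in the dictionary.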
Operating environment: Python 3.6; PyCharm 2016.2; computer: Mac
----- End -----
Author: du wangdan, Public Account: du wangdan, Internet product manager.