Crawler Lesson Two: parsing elements in a Web page

Source: Internet
Author: User
Tags: xpath

First, the basic steps

Now that we understand how tags are nested in a Web page and what a Web page is made of, we can start learning to filter out the data we want using the third-party Python library BeautifulSoup.

Next, let's take a look at the steps for crawling information from a page.

Three steps to getting the data we need

Step one: Parse the Web page with BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

Step two: Describe where the information you want to crawl is located

info = soup.select('???')

That is, we need to know what the element is called and how to locate it.

Step three: Get the information you want from the tag

<p>Something</p>

We take the information we need out of the tag, strip away the structure we don't use, and put what we get into a data container in a consistent format so that we can query it later.
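Before we look at each step in detail, here is a minimal sketch that puts the three steps together. The file name local.html and the selector div.main-content > p are placeholders for illustration, not taken from any real page:

from bs4 import BeautifulSoup

# Step one: parse the Web page with BeautifulSoup
with open('local.html', 'r', encoding='utf-8') as f:  # placeholder file name
    soup = BeautifulSoup(f, 'lxml')

# Step two: describe where the information lives (a placeholder selector)
info = soup.select('div.main-content > p')

# Step three: take the text out of each tag and put it into a data container
data = [tag.get_text() for tag in info]
print(data)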

Second, two ways to obtain a detailed description of an element's path

Let's start with the first step: how to parse the Web page using BeautifulSoup.

soup = BeautifulSoup(html, 'lxml')

What we are really constructing here is a parse object, which needs two things: a Web page file and a parsing library. Think of it as making soup: the soup on the left is the finished soup, the html is the ingredients, and lxml is the recipe.
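The ingredients can come from the network as well as from a local file. A small sketch, assuming the page at the address used later in this lesson is still reachable:

import requests
from bs4 import BeautifulSoup

# Ingredients fetched over HTTP instead of from a local file
wb_data = requests.get('https://www.cnblogs.com/liudi2017/p/7614919.html')
soup = BeautifulSoup(wb_data.text, 'lxml')  # recipe: the lxml parsing library

print(soup.title)  # a first taste of the finished soup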

Next we need to tell BeautifulSoup the exact location of an element, so that it knows which information we want to crawl.

In the browser, find the element you want, right-click it, and inspect its code. From there, there are two ways to copy a detailed description of the tag's location:

1. Using Copy selector

2. Using Copy XPath

What is the difference between the two kinds of copies? Let's take a look.

The path copied by right-clicking the tag and choosing Copy selector:

body > div.body-wrapper > div.content-wrapper > div > div.main-content > div:nth-child(14) > a

The path copied by right-clicking the tag and choosing Copy XPath:

/html/body/div[4]/div[2]/div/div[2]/div[14]/a

These are two different ways of describing a path: the path copied with Copy selector is called a CSS selector, and the path copied with Copy XPath is called an XPath.

We will use both kinds of path descriptions in later lessons, but the BeautifulSoup we are learning today only recognizes the first one, the CSS selector.

Still, to make later study easier and to better understand how the different elements of a page are structured, we will talk about XPath first. Once you have learned it, CSS selectors will also be easier to understand, and some of the libraries we will learn later need XPath to describe the location of elements.

XPath

1. What is XPath

XPath uses path expressions to navigate through an XML document and select the element (or elements) that the path points to.

2. XPath path expressions

A path expression is the input to XPath: XPath uses the path expression to locate a node (or a set of nodes) in the XML document.

A path expression looks like this: /html/body/div[4]/div[2]/div/div[2]/div[14]/a, or /html/body/div[@class="content"]. The [@class="content"] in the second path picks out one particular tag from several tags of the same name.
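BeautifulSoup itself does not evaluate XPath, but the lxml library we use as its parser can. As an aside, here is a minimal sketch of evaluating both kinds of path expression with lxml directly; the paths are the ones shown above, so they only match a page with that structure, and against any other file the result is simply an empty list:

from lxml import html

# Parse a saved copy of a page (any local HTML file will do; ddw.html is used later in this lesson)
with open('./ddw.html', 'r', encoding='utf-8') as f:
    tree = html.fromstring(f.read())

# An absolute path: who, where, and which one (by index)
links = tree.xpath('/html/body/div[4]/div[2]/div/div[2]/div[14]/a')

# A predicate path: pick out the div whose class attribute equals "content"
divs = tree.xpath('/html/body/div[@class="content"]')

print(links, divs)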

Look again at the XPath we just copied, /html/body/div[4]/div[2]/div/div[2]/div[14]/a. This string is the full path to the element, called an absolute path, and each segment between the '/' characters is a node. We can get a rough feel for this from the structure diagram below.

To understand the relationships between nodes more clearly, the picture below shows them more intuitively.

Relative to the nodes below it, html is the parent node; body and head are child nodes of html; and the div tags further down are descendant nodes of html.

Nodes that share the same parent, such as body and head, or the div tags under the same body, are sibling nodes. That is the basic structure and hierarchy.
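These relationships map directly onto BeautifulSoup's navigation attributes. A quick sketch on a hand-made snippet (the snippet itself is made up for illustration):

from bs4 import BeautifulSoup

snippet = '<html><head></head><body><div>a</div><div>b</div></body></html>'
soup = BeautifulSoup(snippet, 'lxml')

print(soup.body.parent.name)                    # html is the parent of body
print([c.name for c in soup.body.children])     # the div tags are children of body
print(soup.div.find_next_sibling().get_text())  # the two divs are siblings: prints 'b'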

CSS Selector

1. What is CSS Selector

CSS selector positioning actually locates an HTML tag by the tag's CSS selector.

As the name implies, a CSS selector is a way of choosing a tag based on its style.

2. CSS selector path expressions

In a CSS selector path, the first node is body, whereas in XPath the first node is html. Look again at the path we just copied:

body > div.body-wrapper > div.content-wrapper > div > div.main-content > div:nth-child(14) > a

In this path, .body-wrapper is appended to the first div. This is the tag's style (its class attribute), which means the tag is being selected by its style.

If XPath chooses an element by: who it is, where it is, and which one it is,

then a CSS selector chooses an element by: who it is, where it is, which one it is, and what it looks like.
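In BeautifulSoup, the "what it looks like" part is what makes CSS selectors convenient: a class name often pins an element down without counting indexes. A hedged sketch, reusing the class name from the copied path above (so it only matches a page with that structure; against any other file both lists simply come back shorter or empty):

from bs4 import BeautifulSoup

with open('./ddw.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# Who, where, which one: index-based selection, fragile if the layout changes
by_index = soup.select('body > div:nth-of-type(2) > div > a')

# Who, where, what it looks like: style-based selection, usually more robust
by_style = soup.select('div.main-content a')

print(len(by_index), len(by_style))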

Third, crawling Web page information using Python code

Once we can describe the path to the various elements in a Web page, we can use Python's BeautifulSoup library to implement the crawl in code. Enough talk; let's get straight to the code.

The page we crawl here is one I have worked with before; its source address is https://www.cnblogs.com/liudi2017/p/7614919.html.

To open a local copy of the page, I recommend creating a new HTML file in PyCharm and pasting the source code into it. If you use Notepad instead, you need to save the file in UTF-8 format.
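If you would rather automate that step than copy and paste by hand, here is a small sketch that downloads the page and saves it in UTF-8 (assuming the address above is still reachable; the file name ./ddw.html matches the one used in the code below):

import requests

resp = requests.get('https://www.cnblogs.com/liudi2017/p/7614919.html')
resp.encoding = 'utf-8'  # force UTF-8 so the saved file matches the encoding we open it with

with open('./ddw.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)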

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests  # not needed for a local file, but useful when fetching pages over HTTP
from bs4 import BeautifulSoup as BS

# 1. Parse the Web page with BeautifulSoup
with open('./ddw.html', 'r', encoding='utf-8') as wb_data:
    # I am using a local file, so the open function opens the page at the local path
    soup = BS(wb_data, 'lxml')
    # The parse object is constructed here: wb_data is the page we want to parse, lxml is the parse library

    # images = soup.select('body > div:nth-child(2) > div.body > div.body_moth > div:nth-child(6) > div:nth-child(1) > img')
    # Here the copied path is pasted straight into select's brackets to get the picture information,
    # but that line raises an error message; we only need to modify it the way the message suggests:
    images = soup.select('body > div:nth-of-type(2) > div.body > div.body_moth > div > div:nth-of-type(1) > img')
    # The line above gets the correct picture information. We also delete the CSS style of the
    # second-to-last div so we don't stop at a single image but sift out all images of the same type:
    # the sibling tags of that div are the parent tags of the other picture tags.

    titles = soup.select('body > div:nth-of-type(2) > div.head_top > div.head_top_ee > ul > li > a')
    # Then get the title information on the page; likewise, to get all the title tags of the same
    # type, remove the positional information from the li tag.

    # print(images, titles, sep='\n-----------------\n')
    for image, title in zip(images, titles):
        data = {
            'image': image.get('src'),
            'title': title.get_text(),
        }
        print(data)
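Printing is fine for checking, but step three also said to put the data into a container we can query. One option is to write the same dictionaries out to a CSV file; a minimal sketch, repeating the selects from the code above (the output file name ddw_data.csv is my own choice):

import csv
from bs4 import BeautifulSoup

with open('./ddw.html', 'r', encoding='utf-8') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')

images = soup.select('body > div:nth-of-type(2) > div.body > div.body_moth > div > div:nth-of-type(1) > img')
titles = soup.select('body > div:nth-of-type(2) > div.head_top > div.head_top_ee > ul > li > a')

# Build the same dictionaries as above, then persist them so they can be queried later
rows = [{'image': image.get('src'), 'title': title.get_text()} for image, title in zip(images, titles)]

with open('ddw_data.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['image', 'title'])
    writer.writeheader()
    writer.writerows(rows)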

With that, the simple crawl of page information is complete. Through this code we obtained the addresses of the product images on the page as well as the page's category labels. This time we only did a simple information crawl; more examples will follow for everyone to study.
