We use Python 2.7 for development. Pay attention to configuring the environment variables.
We use PyCharm for development; like the well-known Android Studio, it comes from JetBrains and is built on the IntelliJ IDEA platform.
The Requests module is a third-party HTTP request library that replaces Python's built-in urllib2.
However, for various reasons the direct download can be slow or fail, so we often rely on a mirror site to download it for us:
The mirror collects almost all third-party Python libraries. Search for Requests and click Download.
After the download completes, change the file extension to .zip, extract it, and put the decompressed requests folder under Python's Lib directory.
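As a quick sanity check (assuming the copy into Lib succeeded), importing the module should just work:

```python
# Quick sanity check: if the requests folder is on the path, this prints its version
import requests
print requests.__version__
```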
Fetching webpage source code with Requests when no anti-crawler mechanism is in place
Directly use the get method of the Requests library to obtain the webpage source code:
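A minimal sketch of this step (the URL here is just a placeholder for any page that does not block crawlers):

```python
import requests

# Fetch the page and print its source code
html = requests.get('http://www.hujiang.com/')
print html.text
```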
In the terminal, we can see the source code of the requested webpage printed out.
However, many websites do not readily let crawlers fetch their pages. In that case, we need to modify the HTTP header information of our request.
For example, if we run the same code against my blog http://blog.csdn.net/eclipsexys, the terminal shows a 403 response. At this point, we need to modify the crawler code.
First, right-click the page and choose Inspect (review element), switch to the Network tab, refresh the page, select any request, and find the User-Agent at the bottom of its request headers:
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36
This is our HTTP request header. Now let's modify the code:
```python
import requests

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
html = requests.get('http://blog.csdn.net/eclipsexys', headers=head)
print(html.text.encode('utf-8'))
```
We add the request headers and encode the output as UTF-8. (The default encoding on Windows is GBK, so change the coding to UTF-8 first.)
PS: If you want to include Chinese characters in a Python source file, you need to declare the file's character set:
# coding=utf-8
See https://www.python.org/dev/peps/pep-0263/ for details
Now we can get the source code normally.
Requests with regular expressions
What we obtain this way is the entire source code of the webpage, which is certainly more than we need. We can therefore extract the content we want with regular expressions.
For example, suppose we want to extract all the hyperlinks in the webpage. Let's see how to implement that:
The re module
First, we need to import the re module. It is Python's regular expression module, and its expressions work the same way as regular expressions on the web side:
```python
import requests
import re

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
html = requests.get('http://www.hujiang.com/', headers=head)
html.encoding = "utf-8"
href = re.findall('<a target="_blank" href="(.*?)"', html.text, re.S)
for each in href:
    print each
```
Submitting data to a webpage
Data is submitted to a webpage with GET or POST. Their differences are as follows (a short sketch follows the list):
- GET is used to obtain data from the server.
- POST is used to transmit data to the server.
- GET passes its parameters by building them into the URL.
- POST submits its data in the request body.
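A small sketch of the difference in Requests (the URL and field names here are placeholders, not a real site):

```python
import requests

# GET: the parameters are built into the URL, e.g. .../search?q=python&page=2
r_get = requests.get('http://example.com/search', params={'q': 'python', 'page': '2'})
print r_get.url

# POST: the data travels in the request body instead of the URL
r_post = requests.post('http://example.com/login', data={'user': 'demo', 'password': 'demo'})
print r_post.status_code
```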
Web analysis tools
Chrome's debugging features: the Network panel.
In the Network panel, locate the request where the POST is submitted and find its Form Data section; that is the submitted form data.
Now construct a POST form of our own:
```python
import requests
import re

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '2'
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print each
```
XPath
XPath is the XML Path Language. If finding an element with a regular expression is like saying "we are looking for a woman with long hair who is 180 cm tall", then expressing it with XPath is like saying "the front-end developer of XX department at XX company".
lxml
To use XPath in Python, we need the third-party module lxml, which is installed the same way as Requests.
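As before, a quick import serves as a sanity check (assuming the install succeeded):

```python
# If lxml is installed correctly, this prints its version tuple
from lxml import etree
print etree.LXML_VERSION
```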
Obtaining an element's XPath from the HTML
Open Chrome's Inspect (review element) panel, right-click any element, and select Copy XPath.
Of course, you can also write it by hand. Its basic syntax is as follows (a small sketch follows the list):
- // locates the root node
- / searches one level down
- /text() extracts text content
- //@xxxx extracts attribute content
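Here is a tiny, self-contained illustration of these rules (the HTML snippet is made up for the demo):

```python
from lxml import etree

# A made-up fragment just to show the syntax rules above
demo = '<html><body><div class="box"><a href="http://example.com">link text</a></div></body></html>'
selector = etree.HTML(demo)

print selector.xpath('//a/text()')       # /text() extracts the text content
print selector.xpath('//a/@href')        # //@xxxx extracts an attribute value
print selector.xpath('//div/a/text()')   # / walks one level down, from div to a
```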
For example, let's take this address: http://www.imooc.com/course/list?c=android&page=2
Open the Inspect (review element) panel:
In this way, we can easily obtain the element's XPath, and we can also manually modify it according to the rules.
Crawling the content
To use XPath, follow these steps:
- from lxml import etree
- selector = etree.HTML(html_source)
- selector.xpath(xpath_expression)
We also use the previous website as an example to capture the title of the selected course:
```python
# coding=utf-8
import requests
from lxml import etree

html = requests.get("http://www.imooc.com/course/list?c=android&page=2")
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li[1]/a/h5/span/text()')
for each in content:
    print each
```
In this way, we get the content we want. The search method is really the same as locating someone by address: China - Shanghai - Pudong New Area - Zhangjiang Hi-Tech - Hujiang - Xu Yisheng (when one part of the path is unique, the levels before it can be skipped).
What if we need to crawl all of the course information? Notice the li[1] in the generated XPath, which corresponds to the list in the source code. It is 1 because we selected one specific item; if we remove that index, the result becomes a list of all the matching elements. We won't go into detail here.
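For instance, building on the code above, dropping the index so the path covers every li should return all of the titles on the page rather than one (a sketch only, since the exact path depends on the page's current structure):

```python
# Remove the [1] after li so the XPath matches every list item
all_titles = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li/a/h5/span/text()')
for title in all_titles:
    print title
```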
Advanced XPath usage
Attributes that start with the same string but differ afterwards. For example:
```html
<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>
```
We need to extract all of this content, but the attributes differ. Let's first look at the XPath of the first element:
//*[@id="test-1"]/text()
You can see that the id determines which element is matched, so we need to use XPath's starts-with(@attribute name, common starting characters) function:
//*[starts-with(@id,"test")]/text()
With the starts-with function, we can match all the elements whose id shares that prefix.
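A small, self-contained sketch of starts-with using the three sample divs above:

```python
from lxml import etree

# The sample divs from above, parsed directly as a string
html_source = '''
<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>
'''
selector = etree.HTML(html_source)
contents = selector.xpath('//*[starts-with(@id, "test")]/text()')
for each in contents:
    print each
```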
Nested tags
For example:
```html
<div id="class">text1 <font color=red>text2</font> text3</div>
```
With nested tags like this, if we use XPath to get the text of the outer level, we only get text1 and text3. To also get text2, we need the string(.) method.
```python
from lxml import etree

# Build the selector from the nested-tag sample above
selector = etree.HTML('<div id="class">text1 <font color=red>text2</font> text3</div>')
data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')
content = info.replace('\n', '').replace(' ', '')
print content
```
Using string(.), we can get the text of nested tags; it is equivalent to traversing the child tags and collecting their text.
Finally
The purpose of this blog is to prepare for an upcoming lucky draw. As you all know, my new book "Android group Legend" has been officially released. To thank you for your support, I am going to hold a lucky draw on my CSDN blog, and this post covers the preparatory knowledge you will need for it. Please warm up~~~
Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.