Python Crawler: Using Fiddler, Postman and Python's requests Module to Crawl National Flags

Source: Internet
Author: User

Introduction

This blog will introduce a Python crawler that downloads national flags; the main goal is to show how to use the POST method of Python's requests module to crawl Web content.
In order to learn which HTTP request headers and what request body the POST method needs, we can use Fiddler to capture packets and pick out the POST request issued while browsing. To verify the POST request that Fiddler captured, we can test it with Postman. Once the Postman test passes, we can write our crawler with Python's requests.post() method.

Process

As a demonstration of the above process, we will use the URL http://country.911cha.com/. The page looks as follows:

Enter Germany in the form, and the page jumps to the following:

We can see that Germany appears among the search results. Clicking the search result leads to the following page:


This page contains the flag of Germany that we need. But how do we know the specific URL of this page? In other words, how do we get http://country.911cha.com/GER.html? Don't worry: in the German search results that just came up, look at the page's source code. It is not hard to find what we want in the HTML:

In the source code we can see "GER.html". That means, as long as we have the search results, we can parse the HTML source to get the URL of each result, and then fetch the country's flag from that URL. So the hardest part of this crawler is getting the search results, that is, the response returned after the form is submitted with the POST method. We use Fiddler to capture that POST request.
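As a sketch of this parsing step, the snippet below extracts the result link with BeautifulSoup. The HTML string here is hand-written to mimic the structure of the search-result page (the class name "mcon" and the nested ul/li/a layout are assumptions based on the source code shown above), so only the extraction logic, not the exact markup, should be taken as authoritative:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet mimicking the search-result markup of the site;
# the real page may differ, only the extraction pattern matters here.
html = '''
<div class="mcon"><ul><li><a href="news.html">news</a></li></ul></div>
<div class="mcon"><ul><li><a href="GER.html">Germany</a></li></ul></div>
'''

soup = BeautifulSoup(html, 'html.parser')
# The search results live in the second div with class "mcon";
# calling a tag like result_div('ul') is shorthand for find_all('ul').
result = soup.find_all('div', class_='mcon')[1]('ul')[0]('li')[0]('a')[0]
link = result['href']
print(link)  # GER.html
```

With the real page, this `link` is exactly the "GER.html" we saw in the source code, ready to be appended to the base URL.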
We open Fiddler and repeat the operation above, and we can capture the HTTP requests of the whole process, such as:

Fiddler found a POST request issued while the form was being submitted. Analyzing that POST request, its header is as follows:

The request body is as follows:

To verify the POST request that Fiddler captured, we test it with Postman. Before testing, we should ask: is all the data in the request header necessary? The answer is no; in fact, we only need user-agent and content-type. In Postman, enter the request headers first, as follows:

Then enter the request body as follows:

Click the "SEND" button to get the result after the response, as follows:

OK, that completes the Postman test.
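The same POST request can be reproduced in Python. A minimal sketch, using the two headers that proved necessary and the form field q captured above, builds the request with requests.Request so we can inspect what would go on the wire without actually sending it (sending is just requests.post with the same arguments):

```python
import requests

# Build (but do not send) the POST request captured by Fiddler,
# so we can inspect exactly what would be transmitted.
url = 'http://country.911cha.com/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
data = {'q': 'Germany'}  # the search term goes in the form field 'q'

req = requests.Request('POST', url, data=data, headers=headers)
prep = req.prepare()
print(prep.method)  # POST
print(prep.body)    # q=Germany
```

Note that requests url-encodes the data dict into the body for us, which is why content-type is application/x-www-form-urlencoded.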

Crawler

So, with this information we can build the requests.post() submission, parse the returned page with BeautifulSoup, locate the national flag, and download it. The complete Python code is as follows:

# -*- coding: utf-8 -*-
import urllib.request
import requests
from bs4 import BeautifulSoup

# Function: download the flag of the specified country
# Parameter: country: name of the country
def download_flag(country):
    # request headers
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    }
    # POST data
    data = {'q': country}
    url = 'http://country.911cha.com/'
    # submit the POST request
    r = requests.post(url=url, data=data, headers=headers)

    # use BeautifulSoup to parse the web page
    content = BeautifulSoup(r.text, 'lxml')
    # get the page address of the search result (the country)
    result = content.find_all('div', class_='mcon')[1]('ul')[0]('li')[0]('a')[0]
    link = result['href']

    # use the GET method to fetch the searched country's page
    r2 = requests.get(url='%s/%s' % (url, link))
    # use BeautifulSoup to parse the web page
    content = BeautifulSoup(r2.text, 'lxml')
    # get the images in the page
    images = content.find_all('img')
    # find the flag of the specified country and download it
    for image in images:
        if 'alt' in image.attrs:
            if 'flag' in image['alt']:
                name = image['alt'].replace('flag', '')
                link = image['src']
                # download the flag picture
                urllib.request.urlretrieve('%s/%s' % (url, link), 'E://flag/%s.gif' % name)

def main():
    # countries.txt stores the name of each country
    file = 'E://flag/countries.txt'
    with open(file, 'r') as f:
        countries = [_.strip() for _ in f.readlines()]

    # traverse the countries and download each flag
    for country in countries:
        try:
            download_flag(country)
            print("%s's flag downloaded successfully!" % country)
        except Exception:
            print("%s's flag download failed ~" % country)

main()

Some of the contents of countries.txt are as follows:

Running the Python code above, we find that the national flags have been downloaded into the flag folder on the E drive, as follows:

And with that, the crawler's task is complete!

Summary

This crawler uses the POST method of Python's requests module to simulate the form submission in a web page. To obtain the HTTP request involved in the form submission, that is, the request header and the request body, we used the packet-capturing tool Fiddler; Postman's role was to verify that the POST request Fiddler captured is exactly the POST request we need, and to confirm the request header and request body.
Although the whole crawler may feel difficult to write at first, the underlying idea should be clear, and practice makes perfect: go through it a few more times and you will be familiar with the whole process. This crawler only serves as a simple demonstration of the workflow; readers can build more complex crawlers on this basis. I hope this sharing helps. Thank you for reading this far, and you are welcome to get in touch ~

Note: I have now opened two public accounts: Because Python (ID: Python_math) and Easy Python Crawlers (ID: Easy_web_scrape). You are welcome to follow them ~

