How to Use Python to Implement Web Scraping


  [Editor's note] Shaumik Daityari, co-founder of Blog Bowl, describes the basic principles and methods of web scraping. The text below was compiled and presented by OneAPM, a domestic ITOM management platform.

  

With the rapid development of e-commerce, I have become more and more fascinated by price comparison applications in recent years. Every purchase I make online (or even offline) is the result of in-depth research across the websites of the major e-commerce companies.

My most frequently used price comparison applications include RedLaser, ShopSavvy, and BuyHatke. These applications greatly improve price transparency and save consumers a considerable amount of time.

However, have you ever wondered how these applications obtain that important data? Generally, they rely on web scraping technology to accomplish this task.

What is web scraping?

Web scraping is the process of extracting data from the web. With the appropriate tools, any data you can see can be extracted. In this article, we will focus on programs that automate the extraction process and help you collect large amounts of data in a short period of time. Besides the use case mentioned above, scraping also applies to SEO tracking, job tracking, news analysis, and my favorite: social media sentiment analysis!

A word of caution

Before starting your web scraping adventure, make sure you understand the relevant legal issues. Many websites explicitly prohibit content scraping in their terms of service. For example, the Medium website states: "Crawling in accordance with the rules in the site's robots.txt file is acceptable, but we do not permit scraping." Scraping websites that do not allow it may get you put on their blacklist! Like any tool, web scraping can also be misused, for instance to copy website content, and more than a few lawsuits have been triggered by it.
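
One practical way to respect those rules is to check a site's robots.txt programmatically before scraping it. Here is a minimal sketch using the standard library's robotparser module (called urllib.robotparser in Python 3); the blog URL is the one used later in this article:

import robotparser  # in Python 3: from urllib import robotparser

# Download and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://dada.theblogbowl.in/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch the front page
if rp.can_fetch("*", "http://dada.theblogbowl.in/"):
    print("robots.txt allows fetching this URL")
else:
    print("robots.txt disallows fetching this URL")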

Setting up

Now that we understand the need for caution, let's start learning about web scraping. Web scraping can be implemented in any programming language, and we covered doing it with Node not long ago. In this article, given Python's simplicity and rich package support, we will use it to write our scraping program.

The basic web scraping process

When you open a website, its HTML code is downloaded, then parsed and displayed by your web browser. That HTML contains all the information you see on the page, so the information you need (such as a price) can be obtained by analyzing the HTML code. You can either search the sea of data with regular expressions, or use a library that parses HTML to get at the required data.
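
To make the regular-expression approach concrete, here is a minimal sketch; the <span class="price"> markup is purely hypothetical, and real pages are rarely this tidy, which is why a dedicated HTML parser is usually the better choice:

import re

# Hypothetical fragment of downloaded HTML containing a price
html = '<div class="product"><span class="price">499</span></div>'

# Pull out whatever sits between the opening and closing span tags
match = re.search(r'<span class="price">([^<]+)</span>', html)
if match:
    print(match.group(1))  # prints: 499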

In Python, we will use a module named BeautifulSoup to analyze the HTML data. You can install it with pip or another installer by running the following command:

pip install beautifulsoup4

Alternatively, you can build it from source; the module's documentation page lists detailed installation steps.

After the installation is complete, we will follow these steps to implement web scraping:

  • Send a request to the URL

  • Receive the response

  • Parse the response to find the required data

As a demonstration, we will use the author's blog, http://dada.theblogbowl.in/, as the target URL.

The first two steps are relatively simple and can be completed as follows:

from urllib import urlopen

# Sending the http request
webpage = urlopen('http://my_website.com/').read()

Next, feed the response to the module we installed earlier:

from bs4 import BeautifulSoup

# making the soup! Yummy ;)
soup = BeautifulSoup(webpage, "html5lib")

Note that we chose html5lib as the parser; according to the BeautifulSoup documentation, you can also choose a different parser.
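
The snippets above use the Python 2 standard library (urllib.urlopen). If you are working in Python 3, a roughly equivalent sketch would look like the following; note that the html5lib parser is a separate package (pip install html5lib):

from urllib.request import urlopen  # Python 3 location of urlopen

from bs4 import BeautifulSoup

# Download the raw HTML and hand it to BeautifulSoup
webpage = urlopen('http://my_website.com/').read()
soup = BeautifulSoup(webpage, "html5lib")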

Parsing the HTML

After passing the HTML to BeautifulSoup, we can try out a few commands. For example, to check that the HTML markup is correct, you can verify the title of the page (in the Python interpreter):

>>> soup.title
<title>Transcendental Tech Talk</title>
>>> soup.title.text
u'Transcendental Tech Talk'
>>>

Next, let's extract specific elements from the page. For example, I want to extract the list of post titles on my blog. To do that, I need to analyze the HTML structure, which I can do with the Chrome inspector; other browsers provide similar tools.

  

Using the Chrome inspector to check the HTML structure of a page

As you can see, all the post titles are wrapped in h3 tags with two classes: post-title and entry-title. Searching for all h3 elements with the post-title class will therefore give us the list of post titles on the page. Here we use the find_all function provided by BeautifulSoup, with the class_ argument to specify the required class:

>>> titles = soup.find_all('h3', class_='post-title')  # Getting all titles
>>> titles[0].text
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>

You should get the same results by searching with only the post-title class:

>>> titles = soup.find_all(class_='post-title')  # Getting all items with class post-title
>>> titles[0].text
u'\nKolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips\n'
>>>
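
BeautifulSoup also understands CSS selectors via its select() method, so the same query can be written more compactly; a small sketch:

# Equivalent query expressed as a CSS selector
titles = soup.select("h3.post-title")  # returns a list of matching Tag objects
print(titles[0].text)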

To follow the link behind each entry, you can loop over the titles and read the href of the nested <a> tag, for example:

>>> for title in titles:
...     # Each title is in the form of <h3 ...><a href="...">Post title</a></h3>
...     print title.find("a")["href"]

BeautifulSoup comes with many built-in methods to help you navigate the HTML. A few of them are shown below:

>>> titles[0].contents
[u'\n', <a href="http://dada.theblogbowl.in/2015/09/kolkata-bergerxp-indiblogger-meet.html">Kolkata #BergerXP IndiBlogger meet, Marketing Insights, and some Blogging Tips</a>, u'\n']
>>>

Note that you can also use the children attribute, but it behaves like a generator:

>>> titles[0].parent
<div class="post hentry uncustomized-post-template">\n<a name="6501973351448547458"></a>\n...
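
For reference, a few more of those built-in navigation and extraction helpers, sketched against the same titles list (the printed values obviously depend on the page):

first = titles[0]

# The nested <a> tag and its attributes
link = first.find("a")
print(link.get("href"))            # URL of the post
print(link.get_text(strip=True))   # title text without surrounding whitespace

# Walking the tree
print(first.parent.name)           # name of the enclosing tag, e.g. 'div'
siblings = first.find_next_siblings()  # tags that follow first within the same parent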

You can also use regular expressions to search for CSS classes; the documentation covers this in detail.
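
As a quick illustration, find_all() accepts a compiled regular expression wherever it accepts a string, so elements whose class starts with post- can be matched like this (a small sketch):

import re

# Match elements that carry a class beginning with "post-"
posts = soup.find_all(class_=re.compile("^post-"))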

Simulating a login with mechanize

So far, we have only downloaded a page and analyzed its content. However, web developers may block requests that do not come from a browser, and some website content may only be readable after logging in. How should we deal with these situations?

In the first case, we need to pretend to be a browser when sending requests to the page. Every HTTP request carries a set of headers, including information such as the visitor's browser, operating system, and screen size. We can change these headers so that our requests look like they come from a browser.
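
Using only the Python 2 standard library, a minimal sketch of that idea looks like this (the User-Agent string is just an example of a desktop browser signature):

import urllib2

# Override the User-Agent header so the request looks like it comes from a browser
request = urllib2.Request(
    'http://my_website.com/',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0'}
)
webpage = urllib2.urlopen(request).read()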

In the second case, to access visitor-restricted content, we need to log in to the website and use cookies to maintain the session. Let's look at how to accomplish this while also disguising ourselves as a browser.

We will use the cookielib module to manage the session with cookies. We will also use mechanize, which can be installed with pip or another installer.

We will log in through the Blog Bowl login page and then visit the notifications page. The code below is explained with inline comments:

import mechanize
import cookielib
from urllib import urlopen
from bs4 import BeautifulSoup

# Cookie Jar
cj = cookielib.LWPCookieJar()

browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_handle_robots(False)
browser.set_handle_redirect(True)

# Solving issue #1 by emulating a browser through added HTTP headers
browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open login page
browser.open("http://theblogbowl.in/login/")

# Select login form (1st form of the page)
browser.select_form(nr=0)
# Alternate syntax - browser.select_form(name="form_name")

# The first <input> tag of the form is a CSRF token
# Setting the 2nd and 3rd tags to email and password
browser.form.set_value("email@example.com", nr=1)
browser.form.set_value("password", nr=2)

# Logging in
response = browser.submit()

# Opening the notifications page after login
soup = BeautifulSoup(browser.open('http://theblogbowl.in/notifications/').read(), "html5lib")

  

Structure of the notifications page

# Print notifications
print soup.find(class_="search_results").text

  

The result after logging in to the notifications page

Conclusion

Many developers will tell you that any information you can see on the web can be scraped. From this article, you have learned that even content that is only visible after logging in can be extracted easily. In addition, if your IP address gets blocked, you can mask it (or switch to another one). At the same time, to look more like a human visitor, you should keep a certain interval between requests.
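
A simple way to keep that interval is to pause between requests; here is a minimal sketch (the one-second delay and the second page URL are arbitrary examples):

import time
from urllib import urlopen

urls = [
    'http://dada.theblogbowl.in/',
    'http://dada.theblogbowl.in/?page=2',  # hypothetical second page
]

for url in urls:
    page = urlopen(url).read()
    # ... parse the page with BeautifulSoup here ...
    time.sleep(1)  # wait a second before the next request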

As the demand for data keeps growing, web scraping (for good reasons or bad) will only be used more widely in the future. It is therefore important to understand how the technique works, whether you want to use it effectively or protect yourself from it.

OneAPM can help you monitor every aspect of your Python applications. It monitors not only the end-user experience but also server performance, supports tracing problems in databases, third-party APIs, and web servers, and offers cloud-based load testing tools. For more technical articles, visit the OneAPM official technical blog.

  
