Charming Python: Easy Web data collection with mechanize and Beautiful Soup

Using only Python's standard library, you can write scripts that interact with Web sites, but you don't want to if you don't have to. The modules urllib and urllib2 in Python 2.x, and the unified urllib.* subpackages in Python 3.0, do a passable job of fetching the resource at the end of a URL. However, when you want to do any moderately sophisticated interaction with what you find on a Web page, you really need the mechanize library.
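As a baseline, here is a minimal sketch of what the standard library alone gives you, shown in Python 2.x style to match the modules just named and with a placeholder URL (in Python 3 the same call lives in urllib.request):

    import urllib2

    # Fetch the resource at the end of a URL (www.example.com is a placeholder)
    response = urllib2.urlopen('http://www.example.com/')
    html = response.read()            # the raw page body
    print(response.geturl())          # the final URL, after any redirects
    print(response.info().gettype())  # the content type, e.g. 'text/html'

That is fine for grabbing a static page, but it knows nothing about forms, sessions, or navigation.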

One of the big difficulties with automated Web scraping, or any other simulation of the interaction between a user and a Web site, is that servers use cookies to track session progress. Cookies, of course, are part of the HTTP headers, and they are exposed naturally when urllib opens a resource. Moreover, the standard modules Cookie (http.cookies in Python 3) and cookielib (http.cookiejar in Python 3) help handle those headers at a higher level than raw text processing. Even so, doing this handling at that level is cumbersome. The mechanize library raises the handling to a higher level of abstraction and makes your script, or your interactive Python shell, act very much like an actual Web browser.
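As a small sketch of what that abstraction buys you, consider a login-protected page; the URLs and form field names below are invented for illustration, but the cookie handling shown is what mechanize does for you automatically:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)              # skip robots.txt for this experiment
    br.open('http://www.example.com/login')  # placeholder URL

    # Fill in and submit the first form on the page; any session cookie
    # the server sets is kept in the Browser's cookie jar.
    br.select_form(nr=0)
    br['username'] = 'me'
    br['password'] = 'secret'
    br.submit()

    # Later requests from the same Browser carry the cookie along,
    # just as a real Web browser would.
    resp = br.open('http://www.example.com/members-only')
    print(resp.read()[:200])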

Python's mechanize is inspired by Perl's WWW::Mechanize, which has a similar range of features. Of course, as a long-time Python advocate, I think mechanize is more robust, which seems to follow the general pattern of the two languages.

A close companion of mechanize is the equally excellent library Beautiful Soup. It is a wonderfully forgiving "sloppy parser" for the approximately valid HTML actually found on real-world Web pages. You do not need Beautiful Soup in order to use mechanize, nor vice versa, but more often than not you will want to use the two tools together as you interact with the real-world Web.
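As a tiny illustration of that forgiveness, here is Beautiful Soup shrugging off hopelessly unclosed tags (using the BeautifulSoup 3 import of this article's era; in the modern bs4 package the import is from bs4 import BeautifulSoup):

    from BeautifulSoup import BeautifulSoup

    broken = '<html><p>An unclosed paragraph<p>Another one<td>a stray cell'
    soup = BeautifulSoup(broken)
    print(soup.prettify())       # the tags are closed and a proper tree is built
    for p in soup.findAll('p'):
        print(p)                 # each repaired <p>...</p> element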

A practical example

I have used mechanize in several programming projects. The most recent was a project to gather a list of names matching certain criteria from a popular Web site. The site provides some search facilities, but no official API for performing such searches. While readers might be able to guess more specifically what I was doing, I will change the specifics of the code I present to avoid giving away too much about the scraped site or my client. In its general form, the code I give is generic for similar tasks.

Tools to get started

In the actual development of Web scraping/parsing code, I find it invaluable to be able to view, poke at, and examine the content of Web pages interactively in order to figure out what actually occurs on them. Usually, some pages within a site are either dynamically generated by queries (but nonetheless follow consistent patterns) or are built from quite rigid templates.

One important way of getting at this interactive experience is to use mechanize itself within a Python shell, particularly within an enhanced shell such as IPython. Working that way, you can request various linked resources, submit forms, maintain or manipulate site cookies, and so on, all before writing the final script that performs the interaction you want to use in production.
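The kind of poking around I mean looks something like this session, typed a line at a time at the IPython prompt (the URL is a placeholder):

    import mechanize

    br = mechanize.Browser()
    resp = br.open('http://www.example.com/search')  # placeholder URL
    print(resp.geturl())     # where any redirects actually landed us
    print(br.title())        # quick sanity check that this is the right page

    for form in br.forms():  # every form on the page, with fields and defaults
        print(form)
    for link in br.links():  # every link we could follow from here
        print('%s -> %s' % (link.url, link.text))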

However, I find that many of my experimental interactions with a Web site are better carried out in an actual modern Web browser. Seeing a page conveniently rendered gives a much quicker sense of what is going on with a given page or form. The problem is that rendering a page is only half the story, perhaps less than half. Getting the "page source" takes you a bit further, but to really understand what lies behind a given Web page, or behind a given series of interactions with a Web server, you need more.

To get at those guts, I usually use the Firebug or Web Developer plug-ins for Firefox (or the built-in, optional Develop menu in recent versions of Safari, though that is aimed at a different audience). All of these tools can do things such as revealing form fields, showing passwords, examining the DOM of a page, viewing or running JavaScript, and observing Ajax traffic. Comparing the pros and cons of these tools would take another article, but do familiarize yourself with them if you do any Web-oriented programming.

Whichever specific tool you use to experiment with a Web site you intend to automate interaction with, expect to spend many more hours figuring out what the site actually does than you will spend writing the amazingly compact mechanize code that performs your task.

The search results scraper

Given the intentions of the project described above, I divide my script of roughly 100 lines into two functions: one that retrieves all the search-result pages of interest, and one that pulls the desired information out of those retrieved pages.
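A skeleton of that split might look like the sketch below; every site-specific detail (the URL, the query values, the 'result-name' class on the cells of interest) is an invented placeholder rather than the real site's markup:

    import time
    import mechanize
    from BeautifulSoup import BeautifulSoup

    SEARCH_URL = 'http://www.example.com/search?q=%s'  # placeholder

    def fetch_results(queries):
        """Retrieve the raw HTML of every search-results page of interest."""
        br = mechanize.Browser()
        pages = []
        for q in queries:
            resp = br.open(SEARCH_URL % q)
            pages.append(resp.read())
            time.sleep(1)    # be polite to the server between requests
        return pages

    def extract_names(pages):
        """Pull the names of interest out of the retrieved pages."""
        names = []
        for html in pages:
            soup = BeautifulSoup(html)
            # 'result-name' is a hypothetical class on the target site
            for cell in soup.findAll('td', {'class': 'result-name'}):
                names.append(cell.string)
        return names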
