You can write scripts to interact with Web sites using only the basic Python modules, but you do not want to if you do not have to. The modules urllib and urllib2 in Python 2.x, and the unified urllib.* sub-packages in Python 3.0, do a passable job of getting resources at the end of a URL. However, when you want any moderately sophisticated interaction with the content you find on a Web page, you really need the mechanize library.
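For the simple case of just fetching whatever lives at a URL, the standard library is enough on its own. Here is a minimal sketch in Python 2.x (the URL is only a placeholder):

# Fetching a resource with only the standard library (Python 2.x).
# In Python 3 the same call lives at urllib.request.urlopen().
import urllib2

resp = urllib2.urlopen('http://www.example.com/')  # placeholder URL
html = resp.read()      # the raw bytes of the page
print html[:200]        # peek at the start of the document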
One of the biggest difficulties in automating Web scraping, or otherwise simulating interaction with a Web site, is that servers use cookies to track session progress. Cookies are, of course, part of the HTTP headers, and they are visible whenever urllib opens a resource.
Even so, handling them at that level is quite cumbersome. The mechanize library raises this handling to a higher level of abstraction and lets your script, or an interactive Python shell, behave very much like an actual Web browser.
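For contrast, here is roughly what the lower-level plumbing looks like if you wire up cookie handling yourself with the standard library. The URLs are placeholders, and only a real site would actually populate the jar; a mechanize Browser does the equivalent bookkeeping for you behind the scenes:

# Hand-rolled cookie handling with the Python 2.x standard library.
# The URLs are placeholders; a mechanize Browser does all of this for you.
import cookielib, urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.open('http://www.example.com/login')   # any Set-Cookie headers land in the jar
opener.open('http://www.example.com/search')  # stored cookies are sent back automatically
for cookie in jar:                            # inspect what the server handed us
    print cookie.name, cookie.value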
Python's mechanize is inspired by Perl's WWW::Mechanize, which has a similar set of capabilities. Of course, as a long-time Python supporter, I think mechanize is the more robust of the two, which seems to follow the general pattern of the two languages.
A close companion of mechanize is the equally excellent BeautifulSoup library. It is a wonderful "sloppy parser" for the more-or-less valid HTML you find on actual Web pages. You do not need to use BeautifulSoup with mechanize, nor vice versa, but more often than not you will want to use the two tools together as you interact with the "actually existing Web."
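Just to give a flavor of that sloppy parsing, here is a tiny sketch using the BeautifulSoup 3 API of the Python 2.x era; the markup and the class name are invented and deliberately imperfect:

# A taste of BeautifulSoup 3 (Python 2.x era). The markup is made up and
# deliberately sloppy: unclosed tags and an unquoted attribute value.
from BeautifulSoup import BeautifulSoup

html = '<table><tr><td class=name>Graham Chapman<td class=name>John Cleese'
soup = BeautifulSoup(html)
for cell in soup.findAll('td', {'class': 'name'}):
    print cell.string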
An actual example
I have used mechanize in several programming projects. The most recent was a project to gather a list of names matching certain criteria from a popular Web site. The site provides some search facilities, but no official API for performing such searches. Although readers might be able to guess what I was doing, I will change the specifics of the code I present to avoid giving away too much about the scraped site or my client. In its general form, the code I present is typical for similar tasks.
Tools to start with
In the process of actually developing Web scraping and analysis code, I find it invaluable to view, process, and analyze the content of Web pages interactively in order to understand what really goes on with the pages involved. Generally, some pages on a site are generated dynamically by queries (but follow a consistent pattern), or are pre-generated from fairly rigid templates.
One important way to get this interactive experience is to use mechanize itself within a Python shell, particularly an enhanced shell such as IPython. That way, you can request the various linked resources, submit forms, and maintain or manipulate site cookies before writing the final script you intend to use for the interaction in production.
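An exploratory session might look something like the sketch below; the URL is a placeholder, and the forms and links you would actually see depend entirely on the site being examined:

# Exploratory poking at a site, suitable for IPython or a plain shell.
# The URL is a placeholder; real output depends on the site in question.
from mechanize import Browser

br = Browser()
br.open('http://www.example.com/search?food=spam')
print br.title()                       # which page did we land on?
print [f.name for f in br.forms()]     # which forms does it offer?
print [l.url for l in br.links()][:5]  # where do its first few links point?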
However, I have found that much of my experimental interaction with Web sites is better performed in an actual modern Web browser. Seeing a page conveniently rendered gives a much quicker sense of what is happening on a given page or form. The problem is that rendering a page gives you only half the story, maybe less than half. Getting the "page source" takes you a bit further. To really understand what lies behind a given Web page, or a given series of interactions with a Web server, you need more.
To get at that, I generally use the Firebug or Web Developer plug-ins for Firefox. These tools can do things such as revealing form fields, showing passwords, examining the DOM of a page, viewing or running JavaScript, and watching Ajax traffic. Comparing the strengths and weaknesses of these tools would take another article, but do familiarize yourself with them if you intend to do any Web-oriented programming.
Whichever tool you use to experiment with a Web site you intend to automate, expect to spend more time figuring out how the site actually behaves than writing the comparatively compact code needed to perform your task.
The search results scraper
Given the intent of the project mentioned above, I divided my script of roughly 100 lines into two functions:
◆ Retrieve all the results of interest
◆ Pull the information I want out of the retrieved pages
Organizing the script this way is a convenience for development: when I started the task, I knew I needed to figure out how to do those two things. I had a sense that the information I wanted lived in a common collection of pages, but I had not yet examined the specific layout of those pages.
By first retrieving a batch of pages and saving them to disk, I could come back to the second task of pulling the information I wanted out of those saved files. Of course, if your task involves using retrieved information to form new interactions within the same session, you will need a slightly different sequence of development steps. So first, let's look at my fetch() function:
Listing 1. Retrieving page content

import sys, time, os
import mechanize                 # needed for the exception class below
from mechanize import Browser

LOGIN_URL = 'http://www.example.com/login'
USERNAME = 'David Mertz'
PASSWORD = 'thespanishinquisition'
SEARCH_URL = 'http://www.example.com/search?'
FIXED_QUERY = 'food=spam&' 'utensil=spork&' 'date=the_future&'
VARIABLE_QUERY = ['actor=%s' % actor for actor in
        ('Graham Chapman',
         'John Cleese',
         'Terry Gilliam',
         'Eric Idle',
         'Terry Jones',
         'Michael Palin')]

def fetch():
    result_no = 0                 # Number the output files
    br = Browser()                # Create a browser
    br.open(LOGIN_URL)            # Open the login page
    br.select_form(name="login")  # Find the login form
    br['username'] = USERNAME     # Set the form values
    br['password'] = PASSWORD
    resp = br.submit()            # Submit the form

    # Automatic redirect sometimes fails, follow manually when needed
    if 'Redirecting' in br.title():
        resp = br.follow_link(text_regex='click here')

    # Loop through the searches, keeping fixed query parameters
    for actor in VARIABLE_QUERY:
        # I like to watch what's happening in the console
        print >> sys.stderr, '***', actor
        # Let's do the actual query now
        br.open(SEARCH_URL + FIXED_QUERY + actor)
        # The query actually gives us links to the content pages we like,
        # but there are some other links on the page that we ignore
        nice_links = [l for l in br.links()
                      if 'good_path' in l.url
                      and 'credential' in l.url]
        if not nice_links:        # Maybe the relevant results are empty
            break
        for link in nice_links:
            try:
                response = br.follow_link(link)
                # More console reporting on title of followed link page
                print >> sys.stderr, br.title()
                # Increment output filenames, open and write the file
                result_no += 1
                out = open('result_%04d' % result_no, 'w')
                print >> out, response.read()
                out.close()
            # Nothing ever goes perfectly, ignore it if we do not get a page
            except mechanize._response.httperror_seek_wrapper:
                print >> sys.stderr, "Response error (probably 404)"
            # Let's not hammer the site too much between fetches
            time.sleep(1)
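Not part of the original listing, but the way I typically run this pass is simply to invoke the module as a script, roughly like this:

# Hypothetical driver: run the fetch pass; extraction happens in a later step.
if __name__ == '__main__':
    fetch()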
Having done my interactive research into the site of interest, I found that the queries I wanted to perform had some fixed elements and some variable elements. I simply concatenate those elements into one big GET request and take a look at the "results" page. In turn, that list of results contains links to the resources I actually want.
So I follow those links (with a try/except block in case some of them fail along the way) and save whatever content I find on those pages. Pretty simple, isn't it? This short example shows you a broad sweep of what mechanize can do.