You can write scripts to interact with Web sites using only the basic Python modules, but you do not want to if you do not have to. The modules urllib and urllib2 in Python 2.x, and the unified urllib.* sub-packages in Python 3.0, do a passable job of getting resources at the end of a URL. However, when you want any moderately sophisticated interaction with the content you find on a Web page, you really need the mechanize library.
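For the simple case of just fetching whatever lives at a URL, the standard library is enough on its own. Here is a minimal sketch in Python 2.x (the URL is only a placeholder):

# Fetching a resource with only the standard library (Python 2.x).
# In Python 3 the same call lives at urllib.request.urlopen().
import urllib2

resp = urllib2.urlopen('http://www.example.com/')  # placeholder URL
html = resp.read()      # the raw bytes of the page
print html[:200]        # peek at the start of the document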
One of the biggest difficulties in automating Web scraping, or otherwise simulating interaction with a Web site, is that servers use cookies to track session progress. Cookies are, of course, part of the HTTP headers, and they are visible whenever urllib opens a resource.
Even so, handling them at that level is quite cumbersome. The mechanize library raises this handling to a higher level of abstraction and lets your script, or an interactive Python shell, behave very much like an actual Web browser.
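For contrast, here is roughly what the lower-level plumbing looks like if you wire up cookie handling yourself with the standard library. The URLs are placeholders, and only a real site would actually populate the jar; a mechanize Browser does the equivalent bookkeeping for you behind the scenes:

# Hand-rolled cookie handling with the Python 2.x standard library.
# The URLs are placeholders; a mechanize Browser does all of this for you.
import cookielib, urllib2

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.open('http://www.example.com/login')   # any Set-Cookie headers land in the jar
opener.open('http://www.example.com/search')  # stored cookies are sent back automatically
for cookie in jar:                            # inspect what the server handed us
    print cookie.name, cookie.value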
Python's mechanize is inspired by Perl's WWW::Mechanize, which has a similar set of capabilities. Of course, as a long-time Python supporter, I think mechanize is the more robust of the two, which seems to follow the general pattern of the two languages.
A close companion of mechanize is the equally excellent BeautifulSoup library. It is a wonderful "sloppy parser" for the more-or-less valid HTML you find on actual Web pages. You do not need to use BeautifulSoup with mechanize, nor vice versa, but more often than not you will want to use the two tools together as you interact with the "actually existing Web."
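Just to give a flavor of that sloppy parsing, here is a tiny sketch using the BeautifulSoup 3 API of the Python 2.x era; the markup and the class name are invented and deliberately imperfect:

# A taste of BeautifulSoup 3 (Python 2.x era). The markup is made up and
# deliberately sloppy: unclosed tags and an unquoted attribute value.
from BeautifulSoup import BeautifulSoup

html = '<table><tr><td class=name>Graham Chapman<td class=name>John Cleese'
soup = BeautifulSoup(html)
for cell in soup.findAll('td', {'class': 'name'}):
    print cell.string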
An actual example
I have used mechanize in several programming projects. The most recent was a project to gather a list of names matching certain criteria from a popular Web site. The site provides some search facilities, but no official API for performing such searches. Although readers might be able to guess what I was doing, I will change the specifics of the code I present to avoid giving away too much about the scraped site or my client. In its general form, the code I present is typical for similar tasks.
Tools to start with
In the process of actually developing Web scraping and analysis code, I find it invaluable to view, process, and analyze the content of Web pages interactively in order to understand what really goes on with the pages involved. Generally, some pages on a site are generated dynamically by queries (but follow a consistent pattern), or are pre-generated from fairly rigid templates.
One important way to get this interactive experience is to use mechanize itself within a Python shell, particularly an enhanced shell such as IPython. That way, you can request the various linked resources, submit forms, and maintain or manipulate site cookies before writing the final script you intend to use for the interaction in production.
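An exploratory session might look something like the sketch below; the URL is a placeholder, and the forms and links you would actually see depend entirely on the site being examined:

# Exploratory poking at a site, suitable for IPython or a plain shell.
# The URL is a placeholder; real output depends on the site in question.
from mechanize import Browser

br = Browser()
br.open('http://www.example.com/search?food=spam')
print br.title()                       # which page did we land on?
print [f.name for f in br.forms()]     # which forms does it offer?
print [l.url for l in br.links()][:5]  # where do its first few links point?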
However, I have found that much of my experimental interaction with Web sites is better performed in an actual modern Web browser. Seeing a page conveniently rendered gives a much quicker sense of what is happening on a given page or form. The problem is that rendering a page gives you only half the story, maybe less than half. Getting the "page source" takes you a bit further. To really understand what lies behind a given Web page, or a given series of interactions with a Web server, you need more.
To get at that, I generally use the Firebug or Web Developer plug-ins for Firefox. These tools can do things such as revealing form fields, showing passwords, examining the DOM of a page, viewing or running JavaScript, and watching Ajax traffic. Comparing the strengths and weaknesses of these tools would take another article, but do familiarize yourself with them if you intend to do any Web-oriented programming.
Whichever tool you use to experiment with a Web site you intend to automate, expect to spend more time figuring out how the site actually behaves than writing the comparatively compact code needed to perform your task.
The search results scraper
Given the intent of the project mentioned above, I divided my script of roughly 100 lines into two functions:
◆ Retrieve all the results of interest
◆ Pull the information I want out of the retrieved pages
Organizing the script this way is a convenience for development: when I started the task, I knew I needed to figure out how to do those two things. I had a sense that the information I wanted lived in a common collection of pages, but I had not yet examined the specific layout of those pages.
By first retrieving a batch of pages and saving them to disk, I could come back to the second task of pulling the information I wanted out of those saved files. Of course, if your task involves using retrieved information to form new interactions within the same session, you will need a slightly different sequence of development steps. So first, let's look at my fetch() function:
Listing 1. Retrieving page content

import sys, time, os
import mechanize                 # needed for the exception class below
from mechanize import Browser

LOGIN_URL = 'http://www.example.com/login'
USERNAME = 'David Mertz'
PASSWORD = 'thespanishinquisition'
SEARCH_URL = 'http://www.example.com/search?'
FIXED_QUERY = 'food=spam&' 'utensil=spork&' 'date=the_future&'
VARIABLE_QUERY = ['actor=%s' % actor for actor in
        ('Graham Chapman',
         'John Cleese',
         'Terry Gilliam',
         'Eric Idle',
         'Terry Jones',
         'Michael Palin')]

def fetch():
    result_no = 0                 # Number the output files
    br = Browser()                # Create a browser
    br.open(LOGIN_URL)            # Open the login page
    br.select_form(name="login")  # Find the login form
    br['username'] = USERNAME     # Set the form values
    br['password'] = PASSWORD
    resp = br.submit()            # Submit the form

    # Automatic redirect sometimes fails, follow manually when needed
    if 'Redirecting' in br.title():
        resp = br.follow_link(text_regex='click here')

    # Loop through the searches, keeping fixed query parameters
    for actor in VARIABLE_QUERY:
        # I like to watch what's happening in the console
        print >> sys.stderr, '***', actor
        # Let's do the actual query now
        br.open(SEARCH_URL + FIXED_QUERY + actor)
        # The query actually gives us links to the content pages we like,
        # but there are some other links on the page that we ignore
        nice_links = [l for l in br.links()
                      if 'good_path' in l.url
                      and 'credential' in l.url]
        if not nice_links:        # Maybe the relevant results are empty
            break
        for link in nice_links:
            try:
                response = br.follow_link(link)
                # More console reporting on title of followed link page
                print >> sys.stderr, br.title()
                # Increment output filenames, open and write the file
                result_no += 1
                out = open('result_%04d' % result_no, 'w')
                print >> out, response.read()
                out.close()
            # Nothing ever goes perfectly, ignore it if we do not get a page
            except mechanize._response.httperror_seek_wrapper:
                print >> sys.stderr, "Response error (probably 404)"
            # Let's not hammer the site too much between fetches
            time.sleep(1)
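Not part of the original listing, but the way I typically run this pass is simply to invoke the module as a script, roughly like this:

# Hypothetical driver: run the fetch pass; extraction happens in a later step.
if __name__ == '__main__':
    fetch()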
Having done my interactive research into the site of interest, I found that the queries I wanted to perform had some fixed elements and some variable elements. I simply concatenate those elements into one big GET request and take a look at the "results" page. In turn, that list of results contains links to the resources I actually want.
So I follow those links (with a try/except block in case some of them fail along the way) and save whatever content I find on those pages. Pretty simple, isn't it? This short example shows you a broad sweep of what mechanize can do.