Python simple crawler and nested data types

Source: Internet
Author: User

One: Motivation

(0) A crawler is a web spider: it fetches the HTML content of a specified URL. That requires the urllib2 package for opening pages, and since string operations are unavoidable, the string-matching package re as well.

(1) Nested types in Python are rarely covered in introductory tutorials, but more advanced Python applications certainly involve them. Limited by my own ability, I will not go deep here, and I look forward to studying them further in the future.

(2) Speaking of nested types, the idea comes from nested types in Java or C++. Whenever you process data, the usual add/delete/search operations generally involve nested types. Since those languages have them, it is natural to expect that Python supports nested types as well.

(3) Below is a simple introduction to nested types through an example. Combined with the previous blog post, this is basically enough to process text data.

Two: Code in practice

(1) Examples of Python dictionary nesting (hash nesting) and list nesting

#!/usr/bin/python
# encoding=utf-8
import itertools

print "**********test multi_dic_hash (hash table, auto-ordered by key)**********"
data = {1: {1: 'a', 2: 'b', 3: 'c'}, 2: {4: 'd', 5: 'e', 6: 'f'}}
print data
del data[2][4]          # delete a nested entry
print data
data[2][5] = 'w'        # change a nested value
print data

# NOTE: the numeric scores below were garbled in the original post;
# the values here are illustrative placeholders.
tjudata = {"cs":  {"091": 90, "093": 88, "092": 85},
           "ccs": {"081": 92, "083": 87, "082": 91},
           "is":  {"071": 89, "073": 86, "072": 91}}
print tjudata
# delete
del tjudata["cs"]["091"]
print tjudata
# change / add ("094" is a placeholder key; the original key was lost)
tjudata["cs"]["094"] = 85
tjudata["cs"]["092"] = 101
# traverse
s_keys = tjudata.keys()
for s_key in s_keys:
    print s_key, ":"
    s_data = tjudata.get(s_key)
    c_keys = s_data.keys()
    for c_key in c_keys:
        c_value = s_data.get(c_key)
        print c_key, "--", c_value

print "**********test multi_list**********"
# When the first items of the inner lists are equal, their second items
# (the colors) are stitched together, ultimately producing
# [['a', 'Blue Green Yellow'], ['b', 'Red'], ['c', 'Red White']]
lst_all = [['a', 'Blue'], ['a', 'Green'], ['a', 'Yellow'],
           ['b', 'Red'], ['c', 'Red'], ['c', 'White']]
collector = []
for k, lstgroup in itertools.groupby(sorted(lst_all), lambda x: x[0]):
    collector.append([k, ' '.join([c[1] for c in lstgroup])])
print collector
# delete
print lst_all
del lst_all[0][0]
print lst_all
# change / add
lst = lst_all[0]
lst.insert(0, 'd')
lst_all[1][1] = 'RDS'
# traverse
for lst in lst_all:
    for ele in lst:
        print ele

(2) A simple web crawler

# coding=utf-8
import urllib

def gethtml(url):
    page = urllib.urlopen(url)   # open the page at the given URL
    html = page.read()           # read the raw HTML
    return html

html = gethtml("https://www.baidu.com/")
print html
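The motivation section also mentions pairing the fetch with the string-matching package re. A minimal sketch of how the two fit together, assuming a naive href pattern and an illustrative URL (neither is from the original post):

# coding=utf-8
# Sketch: fetch a page with urllib2 and pull out link targets with re.
# The regular expression and the URL are illustrative assumptions.
import re
import urllib2

def getlinks(url):
    page = urllib2.urlopen(url)          # urllib2.urlopen works like urllib.urlopen
    html = page.read()
    # naive pattern: capture the target of every href="..." attribute
    return re.findall(r'href="([^"]+)"', html)

for link in getlinks("https://www.baidu.com/"):
    print link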

(3) An overview of urllib* (reprinted from http://www.cnblogs.com/yuxc/archive/2011/08/01/2123995.html)

1). urllib:
The official description is: open an arbitrary resource by URL. From that introduction, this module was originally designed by analogy with the file module, simply replacing a local file path with a remote Internet URL. Common operations (a short usage sketch follows the list):
urlopen(url[, data]) -- opens the web page at the given URL, issuing a POST or a GET depending on whether the data parameter is supplied
urlretrieve() -- copies the web page content at the specified URL into a specified local file
quote() -- percent-encodes special characters in a URL
unquote() -- decodes percent-encoded characters in a URL
For details see: http://docs.python.org/2/library/urllib.html
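A minimal sketch of the four operations above (the URLs and the local filename are illustrative assumptions):

# coding=utf-8
# Sketch of the common urllib operations listed above (Python 2).
# The URLs and the filename are illustrative assumptions.
import urllib

# GET request (no data argument); passing data would make it a POST
page = urllib.urlopen("http://example.com/")
print page.read()[:200]

# copy the resource at the URL into a local file
urllib.urlretrieve("http://example.com/", "example.html")

# percent-encode / decode special characters
encoded = urllib.quote("a b/c")        # -> 'a%20b/c'
print encoded
print urllib.unquote(encoded)          # -> 'a b/c'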
2). urllib2:
The one-sentence description on the official site is quite general: an extensible library for opening URLs. Basically, it covers the more complex operations involved in opening URLs, such as authentication, redirection, and cookie handling. This further confirms that the urllib module was designed by imitating file operations: these "complex operations" are unique to opening URLs and have no counterpart in file handling. Note that urllib2 has no quote or unquote (those exist only in urllib), and no urlretrieve. Common operations (a short sketch follows the list):
urlopen(url[, data][, timeout]) -- adds a request timeout, which in urllib would require the socket module to achieve; this is more convenient and more efficient
Request -- adds a Request class, which makes it easy to manipulate the contents of the request headers.
For details see: http://docs.python.org/2/library/urllib2.html
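A minimal sketch of the two additions above (the URL, timeout value, and header are illustrative assumptions):

# coding=utf-8
# Sketch of urllib2's additions (Python 2): a timeout on urlopen and a
# Request object with custom headers. URL and header values are assumptions.
import urllib2

# timeout (in seconds) is accepted directly; no socket tricks needed
page = urllib2.urlopen("http://example.com/", timeout=5)
print page.read()[:200]

# a Request object lets you set headers before opening the URL
req = urllib2.Request("http://example.com/")
req.add_header("User-Agent", "my-simple-crawler/0.1")
print urllib2.urlopen(req).read()[:200]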
3). urllib3:
First of all, this is not a standard library; it is a third-party extension library. Its introduction focuses on what it adds: connection pooling and file POST support, neither of which is available in urllib or urllib2. Because it is not a standard library, it must be downloaded and installed separately; see https://pypi.python.org/pypi/urllib3. Although the project started out to fill in features missing from urllib and urllib2 in the Python 2.* releases, it now also supports development on Python 3.3. Since I personally have no Python 3 experience, there is nothing more to say here. A short sketch follows.
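A minimal sketch of urllib3's connection-pooling API, assuming the library has been installed separately (the URL is an illustrative assumption):

# Sketch of urllib3's connection pooling (works on Python 2 and 3).
# The URL is an illustrative assumption.
import urllib3

http = urllib3.PoolManager()                   # reuses connections per host
r = http.request("GET", "http://example.com/")
print(r.status)                                # e.g. 200
print(r.data[:200])                            # response body as bytes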
