Python simple crawler and nested data types

Source: Internet
Author: User

One: Motivation

(0) A crawler is a web spider: it fetches the HTML content of a specified URL. That requires the urllib2 package for opening pages, and since string operations are unavoidable, the string-matching package re as well.

(1) Nested types in Python are rarely covered in introductory tutorials, but more advanced Python applications certainly involve them. Limited by my own ability, I will not go deep here, and I look forward to studying them further in the future.

(2) Speaking of nested types, the idea comes from nested types in Java or C++. Whenever you process data, the usual add/delete/search operations generally involve nested types. Since those languages have them, it is natural to expect that Python supports nested types as well.

(3) Below is a simple introduction to nested types through an example. Combined with the previous blog post, this is basically enough to process text data.

Two: Code in practice

(1) Examples of Python dictionary nesting (hash nesting) and list nesting

#!/usr/bin/python
# encoding=utf-8
import itertools

print "**********test multi_dic_hash (hash table, auto-ordered by key)**********"
data = {1: {1: 'a', 2: 'b', 3: 'c'}, 2: {4: 'd', 5: 'e', 6: 'f'}}
print data
del data[2][4]          # delete a nested entry
print data
data[2][5] = 'w'        # change a nested value
print data

# NOTE: the numeric scores below were garbled in the original post;
# the values here are illustrative placeholders.
tjudata = {"cs":  {"091": 90, "093": 88, "092": 85},
           "ccs": {"081": 92, "083": 87, "082": 91},
           "is":  {"071": 89, "073": 86, "072": 91}}
print tjudata
# delete
del tjudata["cs"]["091"]
print tjudata
# change / add ("094" is a placeholder key; the original key was lost)
tjudata["cs"]["094"] = 85
tjudata["cs"]["092"] = 101
# traverse
s_keys = tjudata.keys()
for s_key in s_keys:
    print s_key, ":"
    s_data = tjudata.get(s_key)
    c_keys = s_data.keys()
    for c_key in c_keys:
        c_value = s_data.get(c_key)
        print c_key, "--", c_value

print "**********test multi_list**********"
# When the first items of the inner lists are equal, their second items
# (the colors) are stitched together, ultimately producing
# [['a', 'Blue Green Yellow'], ['b', 'Red'], ['c', 'Red White']]
lst_all = [['a', 'Blue'], ['a', 'Green'], ['a', 'Yellow'],
           ['b', 'Red'], ['c', 'Red'], ['c', 'White']]
collector = []
for k, lstgroup in itertools.groupby(sorted(lst_all), lambda x: x[0]):
    collector.append([k, ' '.join([c[1] for c in lstgroup])])
print collector
# delete
print lst_all
del lst_all[0][0]
print lst_all
# change / add
lst = lst_all[0]
lst.insert(0, 'd')
lst_all[1][1] = 'RDS'
# traverse
for lst in lst_all:
    for ele in lst:
        print ele

(2) A simple web crawler

# coding=utf-8
import urllib

def gethtml(url):
    page = urllib.urlopen(url)   # open the page at the given URL
    html = page.read()           # read the raw HTML
    return html

html = gethtml("https://www.baidu.com/")
print html
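The motivation section also mentions pairing the fetch with the string-matching package re. A minimal sketch of how the two fit together, assuming a naive href pattern and an illustrative URL (neither is from the original post):

# coding=utf-8
# Sketch: fetch a page with urllib2 and pull out link targets with re.
# The regular expression and the URL are illustrative assumptions.
import re
import urllib2

def getlinks(url):
    page = urllib2.urlopen(url)          # urllib2.urlopen works like urllib.urlopen
    html = page.read()
    # naive pattern: capture the target of every href="..." attribute
    return re.findall(r'href="([^"]+)"', html)

for link in getlinks("https://www.baidu.com/"):
    print link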

(3) An overview of urllib* (reprinted from http://www.cnblogs.com/yuxc/archive/2011/08/01/2123995.html)

1). urllib:
The official description is: open an arbitrary resource by URL. From that introduction, this module was originally designed by analogy with the file module, simply replacing a local file path with a remote Internet URL. Common operations (a short usage sketch follows the list):
urlopen(url[, data]) -- opens the web page at the given URL, issuing a POST or a GET depending on whether the data parameter is supplied
urlretrieve() -- copies the web page content at the specified URL into a specified local file
quote() -- percent-encodes special characters in a URL
unquote() -- decodes percent-encoded characters in a URL
For details see: http://docs.python.org/2/library/urllib.html
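A minimal sketch of the four operations above (the URLs and the local filename are illustrative assumptions):

# coding=utf-8
# Sketch of the common urllib operations listed above (Python 2).
# The URLs and the filename are illustrative assumptions.
import urllib

# GET request (no data argument); passing data would make it a POST
page = urllib.urlopen("http://example.com/")
print page.read()[:200]

# copy the resource at the URL into a local file
urllib.urlretrieve("http://example.com/", "example.html")

# percent-encode / decode special characters
encoded = urllib.quote("a b/c")        # -> 'a%20b/c'
print encoded
print urllib.unquote(encoded)          # -> 'a b/c'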
2). urllib2:
The one-sentence description on the official site is quite general: an extensible library for opening URLs. Basically, it covers the more complex operations involved in opening URLs, such as authentication, redirection, and cookie handling. This further confirms that the urllib module was designed by imitating file operations: these "complex operations" are unique to opening URLs and have no counterpart in file handling. Note that urllib2 has no quote or unquote (those exist only in urllib), and no urlretrieve. Common operations (a short sketch follows the list):
urlopen(url[, data][, timeout]) -- adds a request timeout, which in urllib would require the socket module to achieve; this is more convenient and more efficient
Request -- adds a Request class, which makes it easy to manipulate the contents of the request headers.
For details see: http://docs.python.org/2/library/urllib2.html
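A minimal sketch of the two additions above (the URL, timeout value, and header are illustrative assumptions):

# coding=utf-8
# Sketch of urllib2's additions (Python 2): a timeout on urlopen and a
# Request object with custom headers. URL and header values are assumptions.
import urllib2

# timeout (in seconds) is accepted directly; no socket tricks needed
page = urllib2.urlopen("http://example.com/", timeout=5)
print page.read()[:200]

# a Request object lets you set headers before opening the URL
req = urllib2.Request("http://example.com/")
req.add_header("User-Agent", "my-simple-crawler/0.1")
print urllib2.urlopen(req).read()[:200]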
3). urllib3:
First of all, this is not a standard library; it is a third-party extension library. Its introduction focuses on what it adds: connection pooling and file POST support, neither of which is available in urllib or urllib2. Because it is not a standard library, it must be downloaded and installed separately; see https://pypi.python.org/pypi/urllib3. Although the project started out to fill in features missing from urllib and urllib2 in the Python 2.* releases, it now also supports development on Python 3.3. Since I personally have no Python 3 experience, there is nothing more to say here. A short sketch follows.
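A minimal sketch of urllib3's connection-pooling API, assuming the library has been installed separately (the URL is an illustrative assumption):

# Sketch of urllib3's connection pooling (works on Python 2 and 3).
# The URL is an illustrative assumption.
import urllib3

http = urllib3.PoolManager()                   # reuses connections per host
r = http.request("GET", "http://example.com/")
print(r.status)                                # e.g. 200
print(r.data[:200])                            # response body as bytes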
