Python 3 Example: Crawling the List of Taobao Products

Source: Internet
Author: User
This example crawls product data from Taobao; the original post is http://www.cnblogs.com/nima/p/5324490.html
Because I am mainly interested in the networking part, I have pruned the original article heavily. The focus is on understanding two modules: urllib.request and http.cookiejar.
How the data is saved to Excel and how it is laid out is beside the point; this is not a real production environment, so making the output pretty adds nothing.

Several new modules and concepts are used this time:
Pillow, the imaging library.
The original post wrote the image URLs into the Excel file; I changed that and simply download the images instead.
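For example, a downloaded product photo can be opened and shrunk to a thumbnail with just a few Pillow calls. A minimal sketch; the file names are made up:

import os
from PIL import Image

# open a previously downloaded product photo (hypothetical file name)
img = Image.open('./image/example-product.jpeg')
w, h = img.size
# shrink to roughly one sixth of the original size, keeping the aspect ratio
img.thumbnail((w // 6, h // 6), Image.LANCZOS)
img.save('./image/example-product-s.jpeg', 'JPEG')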

Mozilla:
A word you see all the time. It turns out to refer to the Mozilla Foundation, a non-profit organization set up to support and lead the open-source Mozilla project.

cx_Freeze: a packaging tool that turns Python programs into EXE files.
Packaging will be looked at later.
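For reference, a minimal cx_Freeze setup script looks roughly like this. It is only a sketch and is not used in this example; the script name taobao_crawler.py is made up:

# setup.py -- build the executable with:  python setup.py build
from cx_Freeze import setup, Executable

setup(
    name='taobao_crawler',
    version='0.1',
    description='Taobao product list crawler',
    executables=[Executable('taobao_crawler.py')],
)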

Cookies are handled through the http.cookiejar module; here its subclass MozillaCookieJar is used.
This class differs from its parent in that it can save cookies to, and load them from, a file (the Mozilla cookies.txt format).
Here is what gets saved. The fields are separated by \t; for readability each value is wrapped in quotes:
domain: "s.m.taobao.com", the domain the cookie belongs to
domain_initial_dot: "False", whether the domain starts with a ".", which gets special handling
path: "/", much like a file directory path
secure: "False", whether the cookie may only be sent over a secure connection
expires: "", the expiration time
name: "JSESSIONID"
value: "770326e8f4997185c7db2714d7569ff1"

urllib.request:
Not a new module, but this time it is used in more detail.

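The main feature used here is attaching custom headers, either per request via urllib.request.Request or once on an opener. A minimal sketch; the User-Agent string is only an example:

import urllib.request

# per-request headers
req = urllib.request.Request(
    'http://s.m.taobao.com',
    headers={'User-Agent': 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X)'}
)
html = urllib.request.urlopen(req).read()

# or set the headers once on an opener and install it globally
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X)')]
urllib.request.install_opener(opener)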


Now for the specifics of the code.

1. This targets the mobile Taobao page; surprisingly, the information it returns is JSON.

2. The core of the code is downloading the pages. Because Taobao probably has anti-crawler measures, cookies are used and request headers are set so the script disguises itself as a browser.

3. Write the content to Excel. How it is written hardly matters and there is not much to see there; an exercise like this always stops at downloading data (see the small sketch below).

The step from data to information is the real business secret, so the data is not turned into information here; in the end this is just practice.
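As a rough picture of steps 1 and 3: the JSON that comes back contains a listItem array, and each entry becomes one row in the workbook. A minimal sketch using the field names that appear in the full code below; error handling omitted:

import json
from openpyxl import Workbook

# one page of results saved earlier by the crawler (hypothetical file name)
with open('./data/0.json', encoding='utf-8') as f:
    product = json.load(f)

wb = Workbook()
ws = wb.active
ws.append(['Shop name', 'Title', 'Discount price'])
for item in product['listItem']:
    ws.append([item['nick'], item['title'], item['price']])
wb.save('preview.xlsx')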

OK, let's look at the code.

# -*- coding: utf-8 -*-
import urllib.request, urllib.parse, http.cookiejar
import os, time, re
import json
from PIL import Image
from openpyxl import Workbook


# List the files directly under a folder that end with the given suffix
def listfiles(rootdir, prefix='.xml'):
    file = []
    for parent, _, filenames in os.walk(rootdir):
        if parent == rootdir:
            for filename in filenames:
                if filename.endswith(prefix):
                    file.append(rootdir + '/' + filename)
            return file
        else:
            pass


def writeexcel(path, dealcontent):
    workbook = Workbook()                      # create a workbook object
    worksheet = workbook.create_sheet('1', 0)  # create a sheet; cell coordinates start at 1
    for i in range(0, len(dealcontent)):
        for j in range(0, len(dealcontent[i])):
            if i != 0 and j == len(dealcontent[i]) - 1:
                # the last column holds the local image path and is written as-is
                if dealcontent[i][j] != '':
                    try:
                        worksheet.cell(row=i + 1, column=j + 1).value = dealcontent[i][j]
                    except:
                        pass
            else:
                if dealcontent[i][j]:
                    worksheet.cell(row=i + 1, column=j + 1).value = dealcontent[i][j].replace(' ', '')
    workbook.save(path)


# This is the core of the code
def gethtml(url, myproxy='', postdata=None):
    """Fetch a web page with cookie support. url is the address, postdata is the POST data."""
    # path of the cookie file
    filename = 'cookie.txt'
    # declare a MozillaCookieJar instance bound to that file
    cj = http.cookiejar.MozillaCookieJar(filename)
    # read cookies from the file into the jar, if the file already exists
    # ignore_discard: keep cookies even if they are marked to be discarded
    # ignore_expires: keep cookies even if they have expired
    if os.path.exists(filename):
        cj.load(filename, ignore_discard=True, ignore_expires=True)
    # build a handler that carries the cookies
    cookiehandler = urllib.request.HTTPCookieProcessor(cj)
    if myproxy:
        # enable proxy support: a proxy requires a ProxyHandler
        proxyhandler = urllib.request.ProxyHandler({'http': 'http://' + myproxy})
        print('Proxy ' + myproxy + ' enabled')
        opener = urllib.request.build_opener(proxyhandler, cookiehandler)
    else:
        opener = urllib.request.build_opener(cookiehandler)
    # add headers to the opener so the script looks like a mobile browser
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) '
                          'AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'),
                         ('Referer', 'http://s.m.taobao.com'),
                         ('Host', 'h5.m.taobao.com')]
    urllib.request.install_opener(opener)
    # if there is data, send it as a POST request
    if postdata:
        postdata = urllib.parse.urlencode(postdata)
        html_bytes = urllib.request.urlopen(url, postdata.encode()).read()
    else:
        html_bytes = urllib.request.urlopen(url).read()
    # save the cookies back to the file
    cj.save(ignore_discard=True, ignore_expires=True)
    return html_bytes


# Remove characters that are illegal in Windows file names
def validatetitle(title):
    rstr = r"[\/\\\:\*\?\"\<\>\|]"  # / \ : * ? " < > |
    new_title = re.sub(rstr, "", title)
    return new_title


# Create a folder (recursively)
def makefolder(path):
    try:
        os.makedirs(path)
    except:
        print('Directory already exists: ' + path)


if __name__ == '__main__':
    # working directories
    datadir = './data'
    imagedir = './image'
    makefolder(datadir)

    # form parameters
    keyword = r''      # search keyword; the value was lost in the copied text, fill in your own
    ordertype = 1      # 1. by sales  2. price low to high  3. price high to low  4. by rating  5. default order
    pagenum = 1        # number of pages to crawl; the original value was lost in the copied text
    waitseconds = 4    # pause after each request
    isgetimage = int(input('Download images? Press 1 for yes, 2 for no: '))  # prompt reconstructed from the original

    # build the form
    postdata = {}
    postdata['event_submit_do_new_search_auction'] = 1
    postdata['search'] = 'submit query'
    postdata['_input_charset'] = 'utf-8'
    postdata['topSearch'] = 1
    postdata['atype'] = 'b'
    postdata['searchfrom'] = 1
    postdata['action'] = 'home:redirect_app_action'
    postdata['from'] = 1
    postdata['q'] = keyword
    postdata['sst'] = 1
    postdata['n'] = 20            # page size; the original value was lost in the copied text
    postdata['buying'] = 'buyitnow'
    postdata['m'] = 'api4h5'
    postdata['abtest'] = ''       # value lost in the copied text
    postdata['wlsort'] = ''       # value lost in the copied text
    postdata['style'] = ''        # value lost in the copied text
    postdata['closeModues'] = 'nav,selecthot,onesearch'

    if ordertype == 1:
        postdata['sort'] = '_sale'
    elif ordertype == 2:
        postdata['sort'] = 'bid'
    elif ordertype == 3:
        postdata['sort'] = '_bid'
    elif ordertype == 4:
        postdata['sort'] = '_ratesum'

    # fetch the data
    for page in range(0, pagenum):
        postdata['page'] = page
        taobaourl = 'http://s.m.taobao.com/search?'
        try:
            content1 = gethtml(taobaourl, '', postdata)
            # this is mobile Taobao, so what we get back is a JSON file
            file = open(datadir + '/' + str(page) + '.json', 'wb')
            file.write(content1)
            file.close()
        except Exception as e:
            if hasattr(e, 'code'):
                print('The page does not exist or the request took too long.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print('Unable to reach the host.')
                print('Reason:', e.reason)
            else:
                print(e)
        time.sleep(waitseconds)
        print('Pausing ' + str(waitseconds) + ' seconds')

    files = listfiles(datadir, '.json')
    total = [['Page', 'Shop name', 'Title', 'Discount price', 'Shipping address', 'Comments',
              'Original price', 'Number sold', 'Promotion', 'Number of payers',
              'Gold coin discount', 'URL', 'Image URL', 'Image'], ]
    for filename in files:
        try:
            doc = open(filename, 'rb')
            doccontent = doc.read().decode('utf-8', 'ignore')
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except:
            print('Could not parse ' + filename)
            continue
        for item in onefile:
            itemlist = [filename, item['nick'], item['title'], item['price'],
                        item['location'], item['commentCount']]
            itemlist.append(item['originalPrice'])
            itemlist.append(item['sold'])
            itemlist.append(item['zkType'])
            itemlist.append(item['act'])
            itemlist.append(item['coinLimit'])
            itemlist.append('http:' + item['url'])
            picpath = item['pic_path'].replace('60x60', '720x720')
            itemlist.append(picpath)
            if isgetimage == 1:
                if not os.path.exists(imagedir):
                    makefolder(imagedir)
                url = urllib.parse.quote(picpath).replace('%3A', ':')
                urllib.request.urlcleanup()
                try:
                    pic = urllib.request.urlopen(url)
                    picno = time.strftime('%H%M%S', time.localtime())
                    filenamep = imagedir + '/' + picno + validatetitle(item['nick'] + '-' + item['title'])
                    filenamepp = filenamep + '.jpeg'
                    sfilename = filenamep + 's.jpeg'
                    # save the downloaded image to disk
                    filess = open(filenamepp, 'wb')
                    filess.write(pic.read())
                    filess.close()
                    img = Image.open(filenamepp)   # open the file as an image
                    w, h = img.size
                    size = w // 6, h // 6
                    img.thumbnail(size, Image.LANCZOS)   # Image.ANTIALIAS in older Pillow versions
                    img.save(sfilename, 'jpeg')
                    itemlist.append(sfilename)
                    print('Saved image: ' + sfilename)
                except Exception as e:
                    if hasattr(e, 'code'):
                        print('The page does not exist or the request took too long.')
                        print('Error code:', e.code)
                    elif hasattr(e, 'reason'):
                        print('Unable to reach the host.')
                        print('Reason:', e.reason)
                    else:
                        print(e)
                    itemlist.append('')
            else:
                itemlist.append('')
            total.append(itemlist)
    if len(total) > 1:
        writeexcel(keyword + ' Taobao mobile products.xlsx', total)
    else:
        print('Nothing was crawled')



