A Preliminary Study on Python Crawlers (1)



Background: I have only learned basic Python syntax and a little about regular expressions, so my starting point is roughly zero --. This should become a series of notes recording the crawler-related techniques I pick up. It is very basic ~~

Programming Language: Python 3.6

Environment: Win7

How a crawler works, as I understand it: a crawler can be imagined as a spider crawling for information on the Internet. It starts from an initial webpage portal; the program fetches the content of that page and identifies the links to useful information (for example, on the Netease News rankings homepage the useful information is the news in each category). Those links are stored in a queue, and the crawler then fetches the linked pages one by one until all of the pages involved have been captured to the server.
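To make that queue idea concrete, here is a minimal sketch of my own (not from the original learning materials) of such a loop using requests; the link pattern and the 50-page cap are arbitrary assumptions that just keep the example small:

from collections import deque
import re
import requests

start_url = "http://news.163.com/rank/"
queue = deque([start_url])      # the webpage portal that initializes the crawl
seen = {start_url}

while queue:
    url = queue.popleft()
    # assume GBK like the rankings page; ignore stray bytes on other pages
    page = requests.get(url).content.decode("gbk", errors="ignore")
    # identify links to useful information and push unseen ones onto the queue
    for link in re.findall(r'href="(http://news\.163\.com/.*?\.html)"', page):
        if link not in seen:
            seen.add(link)
            queue.append(link)
    if len(seen) > 50:          # small cap so this sketch stops quickly
        break

print(len(seen), "pages discovered")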

The technical knowledge of crawlers is as follows:

1. Webpage crawling: fetching pages with GET/POST, logging in with cookies, handling anti-crawler measures, distributed crawling, and improving crawling efficiency (a small GET/POST and cookie sketch follows this list);

2. Extraction and storage of content;
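Here is a minimal sketch of the GET/POST and cookie ideas from point 1; the GET part uses the real rankings page, while the login URL and form field names are made-up placeholders, not any real site's API:

import requests

session = requests.Session()            # a Session keeps cookies between requests

# GET: fetch a page, sending a browser-like User-Agent to soften simple anti-crawler checks
headers = {"User-Agent": "Mozilla/5.0"}
page = session.get("http://news.163.com/rank/", headers=headers)

# POST: submit a login form; afterwards the session's cookies carry the login state
login = session.post("http://example.com/login",     # hypothetical login endpoint
                     data={"username": "user", "password": "pass"})

print(page.status_code, login.status_code)
print(session.cookies.get_dict())       # the cookies accumulated so far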

Well, most of this I have not started learning yet. For now, let me record my first crawler, which fetches a single static webpage.

#!/usr/bin/python3

import requests

start_url = "http://news.163.com/rank/"
contents = requests.get(start_url).content
fp = open(".txt", "w+")
fp.write(contents.decode("gbk"))

The requests library is required here. Loading a third-party library into Python under Win7 works as follows, using requests as an example:

Download the package from https://pypi.python.org/pypi/requests/#downloads and change the .whl suffix to .zip.

Unzip it and place the first folder shown in the figure into the Lib folder under the Python installation path. I will not use the Python IDLE tool for coding here; it is very difficult to use --
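If the manual copy worked, a quick check of my own in the Python interpreter confirms the library can be imported:

import requests
print(requests.__version__)     # prints the installed version if the copy succeeded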

fp.write(contents.decode("gbk"))

This line transcodes the crawled Netease News rankings page before writing it to the file, because the webpage contains Chinese characters and is served in GBK encoding.
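For reference, here is a slightly more explicit sketch of the same decode-then-write step; the output file name is my own placeholder, and requests can also guess the page encoding itself via apparent_encoding:

import requests

start_url = "http://news.163.com/rank/"
resp = requests.get(start_url)
print(resp.apparent_encoding)                    # requests' guess at the page encoding
text = resp.content.decode("gbk")                # the rankings page is served as GBK
with open("163_rank.txt", "w", encoding="utf-8") as fp:   # "163_rank.txt" is a placeholder name
    fp.write(text)                               # store the page as UTF-8 on disk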

 

The next step is to capture all of the news sub-links on the page. This code comes entirely from the crawler source code in the learning materials; I have only added comments.

#!/usr/bin/python3

import os
import re
import requests
from lxml import etree

def Page_Info(myPage):
    '''Regex'''
    # re.findall returns a list of tuples whose contents are the (.*?) groups:
    # here they extract the title and link of each category
    myPage_Info = re.findall(
        r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div>',
        myPage, re.S)
    return myPage_Info

def StringListSave(save_path, filename, slist):
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path + "/" + filename + ".txt"
    with open(path, "w+") as fp:
        for s in slist:
            # transcode to utf8, a code system the terminal can recognize
            fp.write("%s\t%s\n" % (s[0].encode("utf8"), s[1].encode("utf8")))
            # fp.write("%s\t%s\n" % (s[0], s[1]))

# Test new_page content
def testNewPage(save_path, filename, new_page):
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path + "/" + filename + ".txt"
    fp = open(path, "w+")
    fp.write(new_page)

def New_Page_Info(new_page):
    '''Regex (slow) or Xpath (fast)'''
    # new_page_info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html.*?>(.*?)</a></td>', new_page, re.S)
    # results = []
    # for url, item in new_page_info:
    #     results.append((item, url + ".html"))
    # return results

    # convert the content of new_page into an HTML tree
    dom = etree.HTML(new_page)
    # extract the text of <tr><td><a>
    new_items = dom.xpath('//tr/td/a/text()')
    # extract the link in <tr><td><a>; @href is an attribute
    new_urls = dom.xpath('//tr/td/a/@href')
    assert(len(new_items) == len(new_urls))
    return zip(new_items, new_urls)


def Spider(url):
    i = 0
    print("downloading", url)
    myPage = requests.get(url).content.decode("gbk")
    myPageResults = Page_Info(myPage)
    save_path = "Netease news capture"
    filename = str(i) + "_Netease News rankings"
    StringListSave(save_path, filename, myPageResults)
    i += 1
    for item, url in myPageResults:
        print("downloading", url)
        new_page = requests.get(url).content.decode("gbk")
        testNewPage(save_path, item, new_page)
        newPageResults = New_Page_Info(new_page)
        filename = str(i) + "_" + item
        StringListSave(save_path, filename, newPageResults)
        i += 1


if __name__ == '__main__':
    print("start")
    start_url = "http://news.163.com/rank/"
    Spider(start_url)
    print("end")

It is worth noting that using XPath here is much more efficient than using re.findall.
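To see the difference concretely, here is a small side-by-side sketch of my own, run on a hand-written HTML snippet rather than the real Netease page, of the two extraction approaches used in the code above:

import re
from lxml import etree

html = ('<table><tr><td><a href="/news/1.html">first item</a></td></tr>'
        '<tr><td><a href="/news/2.html">second item</a></td></tr></table>')

# Regex: easy to write for simple patterns, but slower and brittle against markup changes
pairs_re = re.findall(r'<a href="(.*?)">(.*?)</a>', html, re.S)

# XPath: lxml parses the page into a tree once, then the queries are fast and robust
dom = etree.HTML(html)
pairs_xpath = list(zip(dom.xpath('//tr/td/a/@href'), dom.xpath('//tr/td/a/text()')))

print(pairs_re)      # [('/news/1.html', 'first item'), ('/news/2.html', 'second item')]
print(pairs_xpath)   # the same pairs, extracted from the element tree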

 

Learning and reading materials:

Crawler open source code: https://github.com/lining0806

Python basics: http://www.runoob.com/python/python-tutorial.html

Regular expression syntax basics: http://www.runoob.com/regexp/regexp-syntax.html

Lxml learning document: http://lxml.de/tutorial.html

XPath: http://blog.csdn.net/raptor/article/details/451644

XPath documentation: https://www.w3.org/TR/xpath/
