How do I use Python to crawl campus-beauty pictures and save them locally?
1. What data is available?
The girl's name (name)
Her school (school)
The URL of her picture (img_url)
2. How do I get it?
Open the web page http://www.xiaohuar.com/hua/, open the browser's developer tools, study the HTML for each picture, and find the pattern
Crawl in bulk with Python's Scrapy framework
Environment:
Python 3.5.0
Scrapy library
What are the problems?
1. How do I de-duplicate URLs?
MD5-hash each URL that is crawled and keep the hashes in a dictionary; a URL whose hash is already present has been visited and is skipped.
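The de-duplication idea can be sketched on its own, outside the spider. This is a minimal illustration, not the spider's exact code; the function name is_new and the set name seen are made up for the example:

```python
import hashlib

# MD5 hashes of every URL seen so far (hypothetical name for this sketch)
seen = set()

def is_new(url):
    """Return True the first time a URL is seen, False on repeats."""
    url_id = hashlib.md5(url.encode('utf-8')).hexdigest()
    if url_id in seen:
        return False
    seen.add(url_id)
    return True
```

A URL is hashed once and only its hash is stored, so the set stays small even when the crawl visits many long URLs.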
2. How do I get all the beauty information on the site?
First collect every a tag on each page, filter out the useless links (empty, javascript:, #), and recursively follow the rest.
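The link-harvesting step can be illustrated with the standard library alone (the real spider uses Scrapy's selectors instead; LinkCollector and the sample HTML here are invented for the sketch):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    self.links.append(value)

# A made-up page fragment with one real link and two junk links
page = '<a href="/p-1.html">1</a><a href="#">top</a><a href="javascript:void(0)">x</a>'
collector = LinkCollector()
collector.feed(page)
# Drop empty, fragment-only, and javascript: pseudo-links before crawling
links = [u for u in collector.links if u and not u.startswith(('javascript', '#'))]
```

After filtering, only the real page link survives, which is what gets fed back into the crawl queue.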
3. How is the content persisted?
It could be written to files, a database, and so on; for this picture crawler I chose to write to files.
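The file-based persistence can be sketched as a small helper: one metadata line appended to a text file, plus the raw image bytes written next to it. This mirrors the pipeline's layout but the function name save_record and its arguments are assumptions for the example:

```python
import os

def save_record(directory, name, school, img_url, img_bytes):
    """Append one 'name_school:url' line and save the image bytes beside it."""
    os.makedirs(directory, exist_ok=True)
    # One metadata line per girl, same layout as the pipeline below
    with open(os.path.join(directory, 'message_girls.text'), 'a+',
              encoding='utf-8') as f:
        f.write('%s_%s:%s\n' % (name, school, img_url))
    # Write the raw image bytes to name_school.jpg
    with open(os.path.join(directory, '%s_%s.jpg' % (name, school)), 'bw') as f_img:
        f_img.write(img_bytes)
```

Opening the text file in 'a+' mode means repeated runs append rather than overwrite, so the metadata file accumulates one line per saved picture.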
How do I create the project and organize its logic?
1. Create a new crawler project
scrapy startproject pa_girls (run on the command line)
2. In the spiders directory, create a file called school_girls.py
Write the following in school_girls.py:
#!/usr/bin/python3
import hashlib

import scrapy
# HtmlXPathSelector comes from older Scrapy versions; newer ones use response.xpath
from scrapy.selector import HtmlXPathSelector

# The items module is made importable by the __init__.py fix described below
from items import Pa1Item

# MD5 hashes of every URL seen so far, used for de-duplication
all_urls = {}


class SchoolGirls(scrapy.Spider):
    name = 'school_girls'
    # Initial URL
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # Main crawler logic
        try:
            hxs = HtmlXPathSelector(response)
            girls = Pa1Item()
            # Extract the target fields from the page
            school = hxs.select('//div[@class="img"]/div[@class="btns"]/a/text()').extract()
            name = hxs.select('//div[@class="img"]/span[@class="price"]/text()').extract()
            img_url = hxs.select('//div[@class="img"]/a/img/@src').extract()
            if school and name and img_url:
                girls['school'] = school
                girls['name'] = name
                girls['img_url'] = img_url
                yield girls

            # Collect every link on the page
            page_urls = hxs.select('//a/@href').extract()
            page_urls.append('http://www.xiaohuar.com/hua/')

            # De-duplicate URLs by MD5 hash
            url_list = {}
            for url in page_urls:
                if not url or url.startswith('javascript') or url.startswith('#'):
                    continue
                m = hashlib.md5()
                m.update(bytes(url, encoding='utf-8'))
                img_id = m.hexdigest()
                # Skip URLs that have already been visited
                if img_id in all_urls:
                    continue
                all_urls[img_id] = url
                url_list[img_id] = url

            # Recursively crawl every new URL found on the page
            for url in url_list.values():
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as e:
            print(e)
3. Write the following in items.py:
import scrapy


class Pa1Item(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    img_url = scrapy.Field()
4. Write the following in pipelines.py:
import os

import requests


class GirlsMessage(object):
    """Keep valid data: write the metadata to a text file and download each image."""

    def process_item(self, item, spider):
        for i in range(len(item['name'])):
            if (item['name'][i].strip() and item['school'][i].strip()
                    and item['img_url'][i].strip()):
                # Append this girl's info to the text file
                message_girls = (item['name'][i] + '_' + item['school'][i] + ':'
                                 + 'http://www.xiaohuar.com/' + item['img_url'][i])
                with open('E:\\scrapy_new\\img\\message_girls.text', 'a+',
                          encoding='utf-8') as f_girls:
                    f_girls.write(message_girls)
                # Download the image
                img_path = os.path.join('E:\\scrapy_new\\img',
                                        item['name'][i] + '_' + item['school'][i] + '.jpg')
                img_url = 'http://www.xiaohuar.com/' + item['img_url'][i]
                try:
                    img_date = requests.get(img_url).content
                    with open(img_path, 'bw') as f_img:
                        f_img.write(img_date)
                        f_img.flush()
                except Exception as e:
                    print(e)
        return item
5. Add the following to the settings.py file:
# Set the crawl depth (the original value was lost in translation; 1 is a placeholder)
DEPTH_LIMIT = 1

# Activate the pipeline class
ITEM_PIPELINES = {
    'pa_1.pipelines.GirlsMessage': 200,
}
What problems may arise?
1. The items module cannot be imported; how do I solve it?
Add the following to the __init__.py file in the spiders directory:
import os
import sys

# Put the project root on the path so the items module can be imported
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
How do I start the project?
scrapy crawl school_girls (run the command from inside the project directory)