Python: Crawling Campus Beauty Pictures with Scrapy


How do I use Python to crawl campus beauty pictures and save them locally?

1. What data is available?

Name of the beauty queen: name

Her school: school

URL of her picture: img_url

2. How do I get it?

Open the web page http://www.xiaohuar.com/hua/, open the browser's developer tools, study the HTML that corresponds to each picture, and find the pattern.

Then bulk-crawl with Python's Scrapy framework.
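For example, you can test candidate XPath patterns interactively before writing the spider (a sketch; these are the same XPaths the spider below relies on):

# Run at the command line: scrapy shell http://www.xiaohuar.com/hua/
# Inside the shell, a `response` object for the page is available:
names = response.xpath('//div[@class="img"]/span[@class="price"]/text()').extract()
schools = response.xpath('//div[@class="img"]/div[@class="btns"]/a/text()').extract()
img_urls = response.xpath('//div[@class="img"]/a/img/@src').extract()
print(names[:3], schools[:3], img_urls[:3])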

Environment:

Python 3.5

Scrapy library

What problems need to be solved?

1. How do I remove duplicate URLs?

Hash each acquired URL with MD5 and keep the digests in a dict; if a URL's digest has been seen before, skip it instead of visiting it again.
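A minimal standalone sketch of that de-duplication step (the seen dict plays the role of the all_urls dict in the spider below):

import hashlib

seen = {}  # digest -> URL, mirrors the all_urls dict used in the spider

def is_new(url):
    # Identical URLs always hash to the same MD5 digest
    url_id = hashlib.md5(url.encode('utf-8')).hexdigest()
    if url_id in seen:
        return False
    seen[url_id] = url
    return True

print(is_new('http://www.xiaohuar.com/hua/'))  # True, first sighting
print(is_new('http://www.xiaohuar.com/hua/'))  # False, duplicate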

2. How do I get all the beauty information on the site?

First collect the href of every <a> tag on each page, filter out the irrelevant ones, and recursively follow the rest (a sketch of this step follows below).
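The link-collection step, using the same XPath and filtering as the spider below:

def collect_links(response):
    # Grab the href of every <a> tag on the page
    hrefs = response.xpath('//a/@href').extract()
    # Drop empty values, javascript pseudo-links, and in-page anchors
    return [h for h in hrefs
            if h and not h.startswith('javascript') and not h.startswith('#')]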

3. How is the content persisted?

You can write to a file, a database, and so on; for this picture crawler I chose to save to files.

How do I create the project and implement the logic?

1. Create a new crawler project

scrapy startproject pa_girls (run at the command line)
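That generates roughly the following layout (Scrapy 1.x; newer versions also add a middlewares.py):

pa_girls/
    scrapy.cfg
    pa_girls/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py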

2. In the spiders directory, create a file named school_girls.py

Write the following in school_girls.py:

#!/usr/bin/python3
import hashlib

import scrapy

# items is importable thanks to the sys.path fix in spiders/__init__.py (see the end of this article)
from items import Pa1Item

# Digests of every URL seen so far, used for de-duplication
all_urls = {}


class SchoolGirls(scrapy.Spider):
    name = 'school_girls'
    # Initial URL
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # Main crawler logic
        try:
            girls = Pa1Item()
            # Extract the target data from the page
            school = response.xpath('//div[@class="img"]/div[@class="btns"]/a/text()').extract()
            name = response.xpath('//div[@class="img"]/span[@class="price"]/text()').extract()
            img_url = response.xpath('//div[@class="img"]/a/img/@src').extract()
            if school and name and img_url:
                girls['school'] = school
                girls['name'] = name
                girls['img_url'] = img_url
                yield girls

            # Get all the links on the page
            page_urls = response.xpath('//a/@href').extract()
            page_urls.append('http://www.xiaohuar.com/hua/')

            # De-duplicate URLs via their MD5 digests
            url_list = {}
            for url in page_urls:
                if not url or url.startswith('javascript') or url.startswith('#'):
                    continue
                m = hashlib.md5()
                m.update(bytes(url, encoding='utf-8'))
                url_id = m.hexdigest()
                # A digest already in all_urls means the URL was visited before
                if url_id in all_urls:
                    continue
                all_urls[url_id] = url
                url_list[url_id] = url

            # Recursively crawl every new URL found on this page
            for url in url_list.values():
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as e:
            print(e)

3. Write the following in items.py:

import scrapy


class Pa1Item(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    img_url = scrapy.Field()

4. Write the following in pipelines.py:

import os

import requests


class GirlsMessage(object):
    """Persist valid items: log the info to a text file and download each image."""

    def process_item(self, item, spider):
        for i in range(len(item['name'])):
            if (item['name'][i].strip() and item['school'][i].strip()
                    and item['img_url'][i].strip()):
                # Append this girl's info to the text file
                message_girls = (item['name'][i] + '_' + item['school'][i] + ':'
                                 + 'http://www.xiaohuar.com/' + item['img_url'][i] + '\n')
                with open(r'E:\scrapy_new\img\message_girls.txt', 'a+',
                          encoding='utf-8') as f_girls:
                    f_girls.write(message_girls)
                # Download the image
                img_path = os.path.join(r'E:\scrapy_new\img',
                                        item['name'][i] + '_' + item['school'][i] + '.jpg')
                img_url = 'http://www.xiaohuar.com/' + item['img_url'][i]
                try:
                    img_data = requests.get(img_url).content
                    with open(img_path, 'bw') as f_img:
                        f_img.write(img_data)
                        f_img.flush()
                except Exception as e:
                    print(e)
        return item

5. Add the following to settings.py:

# Set the crawl depth (the value was omitted in the source; 1 is an assumed example)
DEPTH_LIMIT = 1
# Activate the pipeline class
ITEM_PIPELINES = {
    'pa_girls.pipelines.GirlsMessage': 200,
}

What problems may arise?

1. The items module cannot be imported. How do I solve this?

Add the following to the __init__.py file in the spiders directory:

import os
import sys

# Put the directory containing items.py (two levels up from this file) on
# sys.path, so that "from items import Pa1Item" works inside the spider
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
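Alternatively, assuming the standard layout shown above (where the inner pa_girls package contains items.py), you can skip the sys.path patch and import through the package instead:

# In school_girls.py, instead of "from items import Pa1Item":
from pa_girls.items import Pa1Item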

How do I start the crawler?

scrapy crawl school_girls (type the command from inside the project directory)
