How do I use Python to crawl campus-beauty pictures and save them locally?
1. What data is available?
The girl's name (name)
Her school (school)
The URL of her picture (img_url)
2. How do I get it?
Open the web page http://www.xiaohuar.com/hua/, open the browser's developer tools, study the HTML for each picture, and find the pattern
Crawl in bulk with Python's Scrapy framework
Environment:
Python 3.5.0
Scrapy library
What are the problems?
1. How do I de-duplicate URLs?
MD5-hash each URL that is crawled and keep the hashes in a dictionary; a URL whose hash is already present has been visited and is skipped.
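The de-duplication idea can be sketched on its own, outside the spider. This is a minimal illustration, not the spider's exact code; the function name is_new and the set name seen are made up for the example:

```python
import hashlib

# MD5 hashes of every URL seen so far (hypothetical name for this sketch)
seen = set()

def is_new(url):
    """Return True the first time a URL is seen, False on repeats."""
    url_id = hashlib.md5(url.encode('utf-8')).hexdigest()
    if url_id in seen:
        return False
    seen.add(url_id)
    return True
```

A URL is hashed once and only its hash is stored, so the set stays small even when the crawl visits many long URLs.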
2. How do I get all the beauty information on the site?
First collect every a tag on each page, filter out the useless links (empty, javascript:, #), and recursively follow the rest.
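The link-harvesting step can be illustrated with the standard library alone (the real spider uses Scrapy's selectors instead; LinkCollector and the sample HTML here are invented for the sketch):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    self.links.append(value)

# A made-up page fragment with one real link and two junk links
page = '<a href="/p-1.html">1</a><a href="#">top</a><a href="javascript:void(0)">x</a>'
collector = LinkCollector()
collector.feed(page)
# Drop empty, fragment-only, and javascript: pseudo-links before crawling
links = [u for u in collector.links if u and not u.startswith(('javascript', '#'))]
```

After filtering, only the real page link survives, which is what gets fed back into the crawl queue.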
3. How is the content persisted?
It could be written to files, a database, and so on; for this picture crawler I chose to write to files.
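The file-based persistence can be sketched as a small helper: one metadata line appended to a text file, plus the raw image bytes written next to it. This mirrors the pipeline's layout but the function name save_record and its arguments are assumptions for the example:

```python
import os

def save_record(directory, name, school, img_url, img_bytes):
    """Append one 'name_school:url' line and save the image bytes beside it."""
    os.makedirs(directory, exist_ok=True)
    # One metadata line per girl, same layout as the pipeline below
    with open(os.path.join(directory, 'message_girls.text'), 'a+',
              encoding='utf-8') as f:
        f.write('%s_%s:%s\n' % (name, school, img_url))
    # Write the raw image bytes to name_school.jpg
    with open(os.path.join(directory, '%s_%s.jpg' % (name, school)), 'bw') as f_img:
        f_img.write(img_bytes)
```

Opening the text file in 'a+' mode means repeated runs append rather than overwrite, so the metadata file accumulates one line per saved picture.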
How do I create the project and organize its logic?
1. Create a new crawler project
scrapy startproject pa_girls (run on the command line)
2. In the spiders directory, create a file called school_girls.py
Write the following in school_girls.py:
#!/usr/bin/python3
import hashlib

import scrapy
# HtmlXPathSelector comes from older Scrapy versions; newer ones use response.xpath
from scrapy.selector import HtmlXPathSelector

# The items module is made importable by the __init__.py fix described below
from items import Pa1Item

# MD5 hashes of every URL seen so far, used for de-duplication
all_urls = {}


class SchoolGirls(scrapy.Spider):
    name = 'school_girls'
    # Initial URL
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # Main crawler logic
        try:
            hxs = HtmlXPathSelector(response)
            girls = Pa1Item()
            # Extract the target fields from the page
            school = hxs.select('//div[@class="img"]/div[@class="btns"]/a/text()').extract()
            name = hxs.select('//div[@class="img"]/span[@class="price"]/text()').extract()
            img_url = hxs.select('//div[@class="img"]/a/img/@src').extract()
            if school and name and img_url:
                girls['school'] = school
                girls['name'] = name
                girls['img_url'] = img_url
                yield girls

            # Collect every link on the page
            page_urls = hxs.select('//a/@href').extract()
            page_urls.append('http://www.xiaohuar.com/hua/')

            # De-duplicate URLs by MD5 hash
            url_list = {}
            for url in page_urls:
                if not url or url.startswith('javascript') or url.startswith('#'):
                    continue
                m = hashlib.md5()
                m.update(bytes(url, encoding='utf-8'))
                img_id = m.hexdigest()
                # Skip URLs that have already been visited
                if img_id in all_urls:
                    continue
                all_urls[img_id] = url
                url_list[img_id] = url

            # Recursively crawl every new URL found on the page
            for url in url_list.values():
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as e:
            print(e)
3. Write the following in items.py:
import scrapy


class Pa1Item(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    img_url = scrapy.Field()
4. Write the following in pipelines.py:
import os

import requests


class GirlsMessage(object):
    """Keep valid data: write the metadata to a text file and download each image."""

    def process_item(self, item, spider):
        for i in range(len(item['name'])):
            if (item['name'][i].strip() and item['school'][i].strip()
                    and item['img_url'][i].strip()):
                # Append this girl's info to the text file
                message_girls = (item['name'][i] + '_' + item['school'][i] + ':'
                                 + 'http://www.xiaohuar.com/' + item['img_url'][i])
                with open('E:\\scrapy_new\\img\\message_girls.text', 'a+',
                          encoding='utf-8') as f_girls:
                    f_girls.write(message_girls)
                # Download the image
                img_path = os.path.join('E:\\scrapy_new\\img',
                                        item['name'][i] + '_' + item['school'][i] + '.jpg')
                img_url = 'http://www.xiaohuar.com/' + item['img_url'][i]
                try:
                    img_date = requests.get(img_url).content
                    with open(img_path, 'bw') as f_img:
                        f_img.write(img_date)
                        f_img.flush()
                except Exception as e:
                    print(e)
        return item
5. Add the following to the settings.py file:
# Set the crawl depth (the original value was lost in translation; 1 is a placeholder)
DEPTH_LIMIT = 1

# Activate the pipeline class
ITEM_PIPELINES = {
    'pa_1.pipelines.GirlsMessage': 200,
}
What problems may arise?
1. The items module cannot be imported; how do I solve it?
Add the following to the __init__.py file in the spiders directory:
import os
import sys

# Put the project root on the path so the items module can be imported
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
How do I start the project?
scrapy crawl school_girls (run the command from inside the project directory)