Crawling the web with Scrapy: storing all IMOOC course data in a MySQL database

Source: Internet
Author: User
Tags: xpath

Crawl target: use Scrapy to crawl all IMOOC course data, specifically:

1. Course Name 2. Course Description 3. Course Level 4. Number of learners

and store it in a MySQL database (target URL: http://www.imooc.com/course/list).

I. Exporting the data to a local file

1. Create a new imooc project

scrapy startproject imooc

2. Modify items.py and define the project's Item

# -*- coding: utf-8 -*-
from scrapy import Item, Field


class ImoocItem(Item):
    course_name = Field()        # Course name
    course_content = Field()     # Course description
    course_level = Field()       # Course level
    course_attendance = Field()  # Number of learners

3. Create the spider in the spiders directory

vi imooc_spider.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector

from imooc.items import ImoocItem


class Imooc(CrawlSpider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    # Build the start URLs for the 30 list pages up front
    start_urls = ['http://www.imooc.com/course/list?page=%s' % pn
                  for pn in range(1, 31)]

    # Note: the original subclasses CrawlSpider but overrides parse() and
    # defines no rules, so it behaves like a plain Spider here.
    def parse(self, response):
        selector = Selector(response)
        courses = selector.xpath('//a[@class="course-card"]')

        for course in courses:
            item = ImoocItem()
            item['course_name'] = course.xpath(
                'div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()[0]
            item['course_content'] = ';'.join(course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/p[@class="course-card-desc"]/text()').extract())
            item['course_level'] = course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[0]
            item['course_attendance'] = course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[1]
            yield item

4. Now you can run the crawler and export the data; here we first test with the CSV format

scrapy crawl imooc -o data.csv -t csv

View the exported file.
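A quick way to sanity-check the export is to peek at the first few lines of the file (a minimal sketch; the column order Scrapy writes in the header row may vary by version):

head -n 3 data.csv

The header should contain the field names defined in ImoocItem: course_attendance, course_content, course_level, course_name.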

II. Crawling the data and storing it in the MySQL database

1. To store the data in MySQL you need the MySQLdb package, so make sure it is installed.
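If it is not installed yet, MySQLdb is published on PyPI under the name MySQL-python (Python 2 only); assuming pip is available, something like this should work:

pip install MySQL-python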

First create the database and table:

-- Create the database
CREATE DATABASE imooc DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Create the table
CREATE TABLE imooc_info2 (
    title   VARCHAR(255) NOT NULL COMMENT 'Course name',
    content VARCHAR(255) NOT NULL COMMENT 'Course description',
    level   VARCHAR(255) NOT NULL COMMENT 'Course level',
    sums    INT          NOT NULL COMMENT 'Number of learners'
);

2. Modify pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy import log


class ImoocPipeline(object):
    """Write every item to a local JSON-lines file."""

    def __init__(self):
        self.file = codecs.open('imooc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()


class MySQLPipeline(object):
    """Insert every item into the imooc_info2 table via a Twisted connection pool."""

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb",
            db="imooc",          # Database name
            user="root",         # Database user
            passwd="hwfx1234",   # Password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset="utf8",
            use_unicode=True,
        )

    def process_item(self, item, spider):
        # Run the insert asynchronously on the connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tb, item):
        tb.execute(
            "INSERT INTO imooc_info2 (title, content, level, sums) "
            "VALUES (%s, %s, %s, %s)",
            (item['course_name'], item['course_content'],
             item['course_level'], item['course_attendance']))
        log.msg("Item data in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

3. Modify settings.py

Add the MySQL config and register the new pipeline classes from pipelines.py:

# Start of MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'imooc'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'hwfx1234'
# End of MySQL database configure setting

ITEM_PIPELINES = {
    # Lower values run first; both pipelines share 300 here as in the
    # original, so their relative order is not guaranteed.
    'imooc.pipelines.ImoocPipeline': 300,
    'imooc.pipelines.MySQLPipeline': 300,
}

4. Start the crawler

scrapy crawl imooc

Check the table in the database: the data has been stored.
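To double-check from the MySQL client, a couple of simple queries against the imooc_info2 table created above should show the crawled rows and their count:

SELECT title, level, sums FROM imooc_info2 LIMIT 10;
SELECT COUNT(*) FROM imooc_info2;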

Summary: this was a simple application of Scrapy; anti-crawler measures, distributed crawling, and similar problems were not considered, and more practice is still needed.
