Crawling Qidian with Python and the Scrapy framework


Let's go over the details and the overall idea first, and then write the code together.

1. In MongoDB, create a database called Qidian,
and inside it a collection called Novelclass (the fiction-category table).
Novelclass holds both levels of categories, e.g. Fantasy (first-level category) and Oriental Fantasy (second-level category).

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Novelclass

2. In the parse callback, extract the first-level categories and loop over them (watch out for the URL-stitching problem: the hrefs are protocol-relative, so prepend "https:").
Insert each first-level category into MongoDB (note that a first-level category's pid is None at this point). First-level links do not need to go into Redis; they are only there to reach the second-level categories.
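
A condensed sketch of this step (the complete spider is the first py file at the end of this article; the XPath and the old-style Scrapy/pymongo calls are taken from there):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # first-level category links on https://www.qidian.com/all
    hx = hxs.select('//div[@class="work-filter type-filter"]/ul[@type="category"]/li[@class=""]/a')
    for secItem in hx:
        href = secItem.select("@href").extract()
        c = "https:" + href[0]                      # the hrefs are protocol-relative, so stitch "https:" on
        name = secItem.select("text()").extract()
        classid = self.insertMongo(name[0], None)   # first-level category: pid is None
        # no Redis push here -- first-level links only exist to reach the second level
        yield Request(c, callback=lambda response, pid=str(classid): self.parse_subclass(response, pid))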

3. Now fetch the second-level categories (e.g. Oriental Fantasy); the spider's callback is invoked again.
Extract each second-level category (name + link). The second-level category names go into the same MongoDB collection (Novelclass) as the first-level names, while the link is pushed into Redis (classid = self.insertMongo(name[0], pid); self.pushRedis(classid, url, pid)):

def insertMongo(self, classname, pid):
    classid = collection.insert({'classname': classname, 'pid': pid})
    return classid

def pushRedis(self, classid, url, pid):
    novelurl = '%s,%s,%s' % (classid, url, pid)
    r.lpush('Novelurl', novelurl)
At this point the first step is complete: both the first-level and the second-level categories have been collected.
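
Each Redis record is just a comma-joined string "classid,url,pid". As a small sketch (this is essentially what the second py file's __init__ does), the next spider reads those records back like this:

import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=0)

for item in r.lrange('Novelurl', 0, -1):   # every record pushed by the first spider
    novelurl = bytes.decode(item)          # redis returns bytes, so decode to str
    arr = novelurl.split(',')              # -> [classid, url, pid]
    classid, url, pid = arr[0], arr[1], arr[2]
    print(classid, url, pid)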

4. With the second-level links (e.g. Oriental Fantasy) in hand, next fetch the name and link of every novel under each second-level category.
(As before, the novel names go into MongoDB, into the Novelname collection, and the links go into Redis under Novelnameurl.)
Notice that we define a dictionary here so that we can get back to arr[0], the _id of the second-level category (Oriental Fantasy's _id),
because we need to know which category each novel belongs to (Oriental Fantasy, Western Fantasy, and so on).
dict = {}

# inside __init__, for each record read back from the Redis 'Novelurl' list:
novelurl = bytes.decode(item)
arr = novelurl.split(',')                      # split the "classid,url,pid" string
qidianNovelSpider.start_urls.append(arr[1])
pid = arr[0]   # the _id of the second-level category (Oriental Fantasy), i.e. its own document id --
               # take care not to grab its pid field instead (that would be the first-level Fantasy's id)
url = arr[1]   # the second-level (Oriental Fantasy) link
self.dict[url] = {"pid": pid, "num": 0}
num is there to control how many pages we crawl.
The same parse callback is reused for the next page, so we need to be able to look the URL up again:

classInfo = self.dict[response.url]   # response.url is the lookup key -- that's just how it's written
pid = classInfo['pid']                # so pid == arr[0]
num = classInfo['num']

if num > 3:
    return None

This is where num earns its keep: I only crawl the first 4 pages of each second-level link (the counter is checked after a page has already been processed and looped back, so it is 4 pages, not 3).
Also pay attention to the link-stitching problem here: self.dict is keyed by the exact URL string, so if the stored next-page link doesn't match response.url you get a KeyError.
The names and links go to MongoDB and Redis respectively:

classid = collection.insert({'novelname': name, 'pid': pid})   # pid here is the Oriental Fantasy _id (the unique id), not the Fantasy _id
print(name)
self.pushRedis(classid, c, pid)   # classid is the novel's own _id, c is the stitched link, pid is the Oriental Fantasy _id
That takes care of the first page of each category; now for the next page:
hxs = HtmlXPathSelector(response)
hx = hxs.select('//li[@class="lbf-pagination-item"]/a[@class="lbf-pagination-next"]')
urls = hx.select("@href").extract()
d = "https:" + urls[0]
classInfo['num'] += 1        # one more page crawled
self.dict[d] = classInfo     # register the next-page URL so the callback can look it up
print(d)
request = Request(d, callback=self.parse)   # call the same callback that extracts the names and links
yield request
This code grabs the next-page link and then calls the same method again, so the names and links on the next page are collected as well.
After that comes the same storage step as above:
MongoDB -- Novelname, Redis -- Novelnameurl.

5. The next thing to do is to fill in the book information (that is, to update the Novelname collection).
Right now the MongoDB Novelname collection only holds the book names and no details (so we update in the author, whether it is signed, serialized or finished, free or VIP).
Note: I am updating rather than creating a new collection. So the links are still read from Novelnameurl, but the scraped information is not inserted into a new collection; it is written back into the existing Novelname documents.
client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Novelname   # the same collection as in the previous py file
As before, I still need the _id -- note that here arr[0] is the novel's own _id (the first field of each Novelnameurl record), which is exactly what the update below will match on:

pid = arr[0]
url = arr[1]
self.dict[url] = {"pid": pid}
But this time it is used to update the MongoDB database:

nameInfo = self.dict[response.url]
pid1 = nameInfo['pid']
pid = ObjectId(pid1)   # the string that came back through Redis has to be converted so it matches the "_id" ObjectId when we update
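
A minimal sketch of why this conversion matters, using the same pymongo/bson APIs as the rest of the article (the _id value here is made up for illustration): MongoDB stores _id as an ObjectId, but what travels through Redis is its string form, and the two do not match in a query.

from bson.objectid import ObjectId

pid1 = "5a2b3c4d5e6f7a8b9c0d1e2f"                      # hypothetical string _id read back from Redis
# db.Novelname.find_one({"_id": pid1})                 # matches nothing: a str is not an ObjectId
pid = ObjectId(pid1)                                   # convert it back first
# db.Novelname.update({"_id": pid}, {"$set": {...}})   # now the update finds the document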
After that, extract the information:

hx = hxs.select('//div[@class="book-info"]/h1/span/a[@class="writer"]')
hx1 = hxs.select('//p[@class="tag"]/span[@class="blue"]')
hx2 = hxs.select('//p[@class="tag"]/a[@class="red"]')
for secItem in hx:
    writer = secItem.select("text()").extract()
    print(writer)
for secItem1 in hx1:
    state = secItem1.select("text()").extract()
    print(state)
for secItem2 in hx2:
    classes = secItem2.select("text()").extract()
    print(classes)
Each field is extracted and printed in its own loop, in order, so one novel's writer never ends up mixed into the state output.
Update the MongoDB Novelname collection:

db.Novelname.update({"_id": pid}, {"$set": {"writer": writer, "state": state, "classes": classes}})

This is where the pid prepared above does its job; the update is complete.

6. The next step is to crawl the chapter names and links of each novel. (It works like fetching the novel names above, only simpler, since there is no next page; see the sketch below.)
Here you still need to watch the id problem: the chapter names and links must be keyed to each novel's own _id (not the second-level category's pid).
Insert into MongoDB -- Chaptername, and into Redis -- Chapterurl.
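
A short sketch of the insert for this step (it mirrors the fourth py file at the end of the article; pid here is arr[0] read back from Novelnameurl, i.e. the novel's own _id rather than the category's):

hxs = HtmlXPathSelector(response)
hx = hxs.select('//div[@class="volume-wrap"]/div[@class="volume"]/ul[@class="cf"]/li/a[@target="_blank"]')
for secItem in hx:
    urls = secItem.select("@href").extract()
    url = "https:" + urls[0]
    chapter = secItem.select("text()").extract()
    classid = collection.insert({'chaptername': chapter, 'pid': pid})   # pid = the novel's _id
    self.pushRedis(classid, url, pid)   # pushes "chapter _id, chapter link, novel _id" onto the 'Chapterurl' list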

7. The final step: fetch the novel content from the chapter links. Since we already have every chapter link of every novel, there is no need to worry about following a "next chapter" link.
As we fetch the content we also update it into the chapter collection. The thing to note is that the content comes back as a series of p tags (each one a string),
so inserting them into MongoDB one by one would be a problem: every insert gets its own _id, which is not what we want -- each chapter should keep the single _id it already has.
So string concatenation is used instead.
ii = ""                              # start with an empty string
hx = hxs.select('//div[@class="read-content j_readcontent"]/p')

for secItem in hx:
    contents = secItem.select("text()").extract()
    content1 = contents[0]           # the text of one p tag
    # print(content1)
    ii = content1 + ii               # concatenate it in
print(ii)                            # the result we want

Finally, update it into Chaptername:

db.Chaptername.update({"_id": pid}, {"$set": {"content": ii}})


MongoDB: Novelclass (first- and second-level categories -- Fantasy, Oriental Fantasy); Novelname (the novel names, later updated with the author, serialization status, etc.); Chaptername (the chapter names, later updated with the chapter content).
Redis: Novelurl (the second-level category links, e.g. Oriental Fantasy); Novelnameurl (the novel links); Chapterurl (the chapter links).
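
Because every document stores its parent's _id in pid (as a string, which is how the spiders above insert it), the whole hierarchy can be walked back with ordinary queries. A minimal sketch, assuming the collections were populated as described above and using the same old pymongo API; the loop variables are illustrative only:

import pymongo

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian

for top in db.Novelclass.find({'pid': None}):                            # first-level categories
    for sub in db.Novelclass.find({'pid': str(top['_id'])}):             # second-level categories under it
        for novel in db.Novelname.find({'pid': str(sub['_id'])}):        # novels in that second-level category
            chapters = db.Chaptername.find({'pid': str(novel['_id'])})   # chapters of that novel
            print(top['classname'], sub['classname'], novel['novelname'], chapters.count())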

First py file

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
# from urllib.request import Request
from bs4 import BeautifulSoup
from lxml import etree
import pymongo
import scrapy
from scrapy.selector import HtmlXPathSelector

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Novelclass   # the category (classification) collection

import redis   # import the redis database
r = redis.Redis(host='127.0.0.1', port=6379, db=0)


class QidianClassSpider(scrapy.Spider):
    name = "QidianClass2"
    allowed_domains = ["qidian.com"]              # allowed domains
    start_urls = ["https://www.qidian.com/all", ]

    # parse is called back after every page is crawled
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="work-filter type-filter"]/ul[@type="category"]/li[@class=""]/a')
        for secItem in hx:
            url = secItem.select("@href").extract()
            c = "https:" + url[0]
            name = secItem.select("text()").extract()
            classid = self.insertMongo(name[0], None)
            print(c)
            # a = db.Novelclass.find()
            # for item in a:
            #     print(item.get('_id'))
            #     b = item.get('_id')
            #     novelurl = '%s,%s' % (item.get('_id'), c)
            #     r.lpush('Novelurl', novelurl)
            request = Request(c, callback=lambda response, pid=str(classid): self.parse_subclass(response, pid))
            yield request

    def parse_subclass(self, response, pid):
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="sub-type"]/dl[@class=""]/dd[@class=""]/a')
        for secItem in hx:
            urls = secItem.select("@href").extract()
            url = "https:" + urls[0]
            name = secItem.select("text()").extract()
            classid = self.insertMongo(name[0], pid)
            self.pushRedis(classid, url, pid)

    def insertMongo(self, classname, pid):
        classid = collection.insert({'classname': classname, 'pid': pid})
        return classid

    def pushRedis(self, classid, url, pid):
        novelurl = '%s,%s,%s' % (classid, url, pid)
        r.lpush('Novelurl', novelurl)

Second py file

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
import pymongo
import scrapy
from time import sleep
from scrapy.selector import HtmlXPathSelector

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Novelname

import redis   # import the redis database
r = redis.Redis(host='127.0.0.1', port=6379, db=0)

ii = 0


class qidianNovelSpider(scrapy.Spider):
    name = "qidianClass3"
    allowed_domains = ["qidian.com"]
    dict = {}
    start_urls = []

    def __init__(self):   # defines a method
        a = r.lrange('Novelurl', 0, -1)
        # ii = 0
        for item in a:
            novelurl = bytes.decode(item)
            arr = novelurl.split(',')   # split the string
            qidianNovelSpider.start_urls.append(arr[1])
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid, "num": 0}
            # ii += 1
            # if ii > 3:
            #     break
        # qidianNovelSpider.start_urls = start_urls
        # print(start_urls)

    def parse(self, response):
        classInfo = self.dict[response.url]
        pid = classInfo['pid']
        num = classInfo['num']
        # print(self.dict)
        if num > 3:
            return None
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="book-mid-info"]/h4/a')
        for secItem in hx:
            url = secItem.select("@href").extract()
            c = "https:" + url[0]
            name = secItem.select("text()").extract()
            classid = collection.insert({'novelname': name, 'pid': pid})
            print(name)
            self.pushRedis(classid, c, pid)
        print('-----------recursion--------------')
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//li[@class="lbf-pagination-item"]/a[@class="lbf-pagination-next"]')
        urls = hx.select("@href").extract()
        d = "https:" + urls[0]
        classInfo['num'] += 1
        self.dict[d] = classInfo
        print(d)
        request = Request(d, callback=self.parse)
        yield request
        print('--------end--------------')

    def pushRedis(self, classid, c, pid):
        novelnameurl = '%s,%s,%s' % (classid, c, pid)
        r.lpush('Novelnameurl', novelnameurl)

Third py file

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
import pymongo
import scrapy
from time import sleep
from scrapy.selector import HtmlXPathSelector
from bson.objectid import ObjectId

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Novelname

import redis   # import the redis database
r = redis.Redis(host='127.0.0.1', port=6379, db=0)
# ii = 0


class qidianNovelSpider1(scrapy.Spider):
    name = "qidianClass4"
    allowed_domains = ["qidian.com"]
    dict = {}
    start_urls = []

    def __init__(self):   # defines a method
        a = r.lrange('Novelnameurl', 0, -1)
        # ii = 0
        for item in a:
            novelnameurl = bytes.decode(item)
            arr = novelnameurl.split(',')   # split the string
            qidianNovelSpider1.start_urls.append(arr[1])
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid}

    def parse(self, response):
        nameInfo = self.dict[response.url]
        pid1 = nameInfo['pid']
        pid = ObjectId(pid1)
        print(pid)
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="book-info"]/h1/span/a[@class="writer"]')
        hx1 = hxs.select('//p[@class="tag"]/span[@class="blue"]')
        hx2 = hxs.select('//p[@class="tag"]/a[@class="red"]')
        for secItem in hx:
            writer = secItem.select("text()").extract()
            print(writer)
        for secItem1 in hx1:
            state = secItem1.select("text()").extract()
            print(state)
        for secItem2 in hx2:
            classes = secItem2.select("text()").extract()
            print(classes)
        # for item in a:
        #     b = item.get('_id')
        #     print(b)
        db.Novelname.update({"_id": pid}, {"$set": {"writer": writer, "state": state, "classes": classes}})
        print('------------------------------------------')
        # classid = collection.insert({'novelname': name, 'pid': pid})
        # print(name)
        # self.pushRedis(classid, c, pid)

Fourth py file

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
import pymongo
import scrapy
from time import sleep
from scrapy.selector import HtmlXPathSelector
from bson.objectid import ObjectId

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Chaptername

import redis   # import the redis database
r = redis.Redis(host='127.0.0.1', port=6379, db=0)


class qidianNovelSpider1(scrapy.Spider):
    name = "qidianClass5"
    allowed_domains = ["qidian.com"]
    dict = {}
    start_urls = []

    def __init__(self):   # defines a method
        a = r.lrange('Novelnameurl', 0, -1)
        # ii = 0
        for item in a:
            novelnameurl = bytes.decode(item)
            arr = novelnameurl.split(',')   # split the string
            qidianNovelSpider1.start_urls.append(arr[1])
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid}
            print(url)

    def parse(self, response):
        nameInfo = self.dict[response.url]
        pid = nameInfo['pid']
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="volume-wrap"]/div[@class="volume"]/ul[@class="cf"]/li/a[@target="_blank"]')
        for secItem in hx:
            urls = secItem.select("@href").extract()
            url = "https:" + urls[0]
            chapter = secItem.select("text()").extract()
            print(chapter)
            print(url)
            classid = collection.insert({'chaptername': chapter, 'pid': pid})
            self.pushRedis(classid, url, pid)

    def pushRedis(self, classid, url, pid):
        chapterurl = '%s,%s,%s' % (classid, url, pid)
        r.lpush('Chapterurl', chapterurl)

Fifth py file

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
import pymongo
import scrapy
from time import sleep
from scrapy.selector import HtmlXPathSelector
from bson.objectid import ObjectId

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Qidian
collection = db.Chaptername

import redis   # import the redis database
r = redis.Redis(host='127.0.0.1', port=6379, db=0)


class qidianNovelSpider1(scrapy.Spider):
    name = "qidianClass6"
    allowed_domains = ["qidian.com"]
    dict = {}
    start_urls = []

    def __init__(self):   # defines a method
        a = r.lrange('Chapterurl', 0, -1)
        # ii = 0
        for item in a:
            chapterurl = bytes.decode(item)
            arr = chapterurl.split(',')   # split the string
            qidianNovelSpider1.start_urls.append(arr[1])
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid}
            # print(url)

    def parse(self, response):
        nameInfo = self.dict[response.url]
        pid1 = nameInfo['pid']
        pid = ObjectId(pid1)
        hxs = HtmlXPathSelector(response)
        ii = ""
        hx = hxs.select('//div[@class="read-content j_readcontent"]/p')
        for secItem in hx:
            contents = secItem.select("text()").extract()
            content1 = contents[0]
            # print(content1)
            ii = content1 + ii
            # content = bytes(content1, 'GBK')
        # classid = collection.insert({'content': ii, 'pid': pid1})
        db.Chaptername.update({"_id": pid}, {"$set": {"content": ii}})
        # print(content)
        # f = open('1.txt', 'wb')
        # f.write(content)
        # f.close()

All right, it's done.
