I have crawled the URLs of all the courses on Coursera and saved them in the file All_url.txt, about 2000 lines.
Now I want to use these URLs to crawl the other information I need and combine it into a .csv so it can be imported into a database.
In the code below I have only written a few of the fields I want, to test the approach (the course schedule and about five other pieces of information I need are not in the code yet). After building with Ctrl+B in Sublime there is no error, but the script only creates the CSV file without writing any useful data into it.
Besides asking you to find the bug, I have another question: am I using too many loops? The outer loop runs about 2000 times, and inside each iteration there are several inner for loops of roughly 10 items each. How should this be optimized?
Thanks in advance for any advice!
The code is as follows:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import urllib
import requests
import csv
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

f = open("All_url.txt", "r")
lines = f.readlines()
for line in lines:
    html = urllib.urlopen(line)
    content = html.read()
    html.close()
    soup = BeautifulSoup(content)

    all_coursename = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
    coursename = []
    for course_name in all_coursename:
        coursename.append(course_name)

    all_courseins = soup.find_all(class_="text-light offering-partner-names")
    courseinstitution = []
    for courseins in all_courseins:
        courseinstitution.append(courseins)

    all_courseurl = soup.find_all('a', class_="rc-OfferingCard nostyle")
    courseurl = []
    for course_url in all_courseurl:
        courseurl.append(course_url)

    csvfile = file('all_info.csv', 'wb')
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    for i in range(0, len(coursename)):
        data = [(coursename[i], courseinstitution[i], courseurl[i])]
        writer.writerows(data)
    csvfile.close()
```
Reply content:
Fetch the URL pages in the outer layer with the threading module; in the inner layer, just extend a list directly instead of appending item by item in a for loop; and finally, do not open and close the file on every iteration — collect all the results first and write them to the file once at the end.
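The advice above can be sketched in Python 3 with concurrent.futures as the thread pool. Here fetch() and parse_course_rows() are hypothetical stand-ins (no real network access or HTML parsing), so treat this as a structural sketch, not Coursera-specific code:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real download (urllib.request.urlopen(url).read());
    # it fakes one comma-separated "page" per URL so the sketch needs no network.
    return "Course for %s,Some University,%s" % (url, url)

def parse_course_rows(page):
    # Stand-in for the real BeautifulSoup parsing; returns a list of
    # (course_name, course_institution, course_url) tuples.
    return [tuple(page.split(','))]

def scrape_all(urls, workers=10):
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order; extend the list directly
        # instead of appending element by element in a nested loop.
        for page in pool.map(fetch, urls):
            rows.extend(parse_course_rows(page))
    return rows

# Collect everything first, then open the CSV exactly once and write at the end.
rows = scrape_all(['u1', 'u2', 'u3'])
with open('all_info.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    writer.writerows(rows)
```

Because the file is opened once, after all pages have been fetched, nothing written earlier can be overwritten by a later iteration.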
Check how the open mode 'wb' is defined.
With 'w', if the file already exists it is emptied first and then (re)created — and since all_info.csv is opened with 'wb' inside the loop, every iteration wipes out what the previous iterations wrote.
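A quick way to see what mode 'w' does to an existing file (demo.txt is just a scratch file for this example):

```python
# First write creates demo.txt with one line.
with open('demo.txt', 'w') as f:
    f.write('first run\n')

# Reopening with 'w' truncates the file before anything is written,
# so "first run" is gone even before the next write happens.
with open('demo.txt', 'w') as f:
    f.write('second run\n')

with open('demo.txt') as f:
    print(f.read())  # only "second run" is left

# Mode 'a' appends instead of truncating.
with open('demo.txt', 'a') as f:
    f.write('appended\n')
```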
Debug step by step to see where the problem is. Most likely the filters that extract the information you want from the HTML are wrong, or the selectors simply match nothing.
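One way to run that check is to feed the script's find_all filters a small HTML snippet whose structure you control and print the match counts. This assumes bs4 is installed; the class names are copied from the question and may not match Coursera's live pages:

```python
from bs4 import BeautifulSoup

# A tiny page imitating the structure the script expects.
html = '''
<div>
  <h2 class="color-primary-text headline-1-text flex-1">Machine Learning</h2>
  <span class="text-light offering-partner-names">Stanford</span>
  <a class="rc-OfferingCard nostyle" href="/learn/ml">link</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Print how many nodes each filter matches; a count of 0 means the
# selector is wrong for the page you actually downloaded.
names = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
partners = soup.find_all(class_="text-light offering-partner-names")
links = soup.find_all('a', class_="rc-OfferingCard nostyle")
print(len(names), len(partners), len(links))
print(names[0].get_text(strip=True))
```

The same counts printed against the real downloaded pages will show immediately which of the three filters is failing.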