I have crawled the URLs of all the courses on Coursera and saved them in the file All_url.txt, about 2000 lines.
Now I want to use these URLs to crawl the other information I need and combine it into a .csv so it can be imported into a database.
In the code below I have only written a few of the fields I want, to test the approach (the course schedule and about five other pieces of information I need are not in the code yet). After building with Ctrl+B in Sublime there is no error, but the script only creates the CSV file without writing any useful data into it.
Besides asking you to find the bug, I have another question: am I using too many loops? The outer loop runs about 2000 times, and inside each iteration there are several inner for loops of roughly 10 items each. How should this be optimized?
Thanks in advance for any advice!
The code is as follows:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import urllib
import requests
import csv
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

f = open("All_url.txt", "r")
lines = f.readlines()
for line in lines:
    html = urllib.urlopen(line)
    content = html.read()
    html.close()
    soup = BeautifulSoup(content)

    all_coursename = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
    coursename = []
    for course_name in all_coursename:
        coursename.append(course_name)

    all_courseins = soup.find_all(class_="text-light offering-partner-names")
    courseinstitution = []
    for courseins in all_courseins:
        courseinstitution.append(courseins)

    all_courseurl = soup.find_all('a', class_="rc-OfferingCard nostyle")
    courseurl = []
    for course_url in all_courseurl:
        courseurl.append(course_url)

    csvfile = file('all_info.csv', 'wb')
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    for i in range(0, len(coursename)):
        data = [(coursename[i], courseinstitution[i], courseurl[i])]
        writer.writerows(data)
    csvfile.close()
```
Reply content:
Fetch the URL pages in the outer layer with the threading module; in the inner layer, just extend a list directly instead of appending item by item in a for loop; and finally, do not open and close the file on every iteration — collect all the results first and write them to the file once at the end.
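The advice above can be sketched in Python 3 with concurrent.futures as the thread pool. Here fetch() and parse_course_rows() are hypothetical stand-ins (no real network access or HTML parsing), so treat this as a structural sketch, not Coursera-specific code:

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real download (urllib.request.urlopen(url).read());
    # it fakes one comma-separated "page" per URL so the sketch needs no network.
    return "Course for %s,Some University,%s" % (url, url)

def parse_course_rows(page):
    # Stand-in for the real BeautifulSoup parsing; returns a list of
    # (course_name, course_institution, course_url) tuples.
    return [tuple(page.split(','))]

def scrape_all(urls, workers=10):
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the input order; extend the list directly
        # instead of appending element by element in a nested loop.
        for page in pool.map(fetch, urls):
            rows.extend(parse_course_rows(page))
    return rows

# Collect everything first, then open the CSV exactly once and write at the end.
rows = scrape_all(['u1', 'u2', 'u3'])
with open('all_info.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    writer.writerows(rows)
```

Because the file is opened once, after all pages have been fetched, nothing written earlier can be overwritten by a later iteration.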
Check how the open mode 'wb' is defined.
With 'w', if the file already exists it is emptied first and then (re)created — and since all_info.csv is opened with 'wb' inside the loop, every iteration wipes out what the previous iterations wrote.
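A quick way to see what mode 'w' does to an existing file (demo.txt is just a scratch file for this example):

```python
# First write creates demo.txt with one line.
with open('demo.txt', 'w') as f:
    f.write('first run\n')

# Reopening with 'w' truncates the file before anything is written,
# so "first run" is gone even before the next write happens.
with open('demo.txt', 'w') as f:
    f.write('second run\n')

with open('demo.txt') as f:
    print(f.read())  # only "second run" is left

# Mode 'a' appends instead of truncating.
with open('demo.txt', 'a') as f:
    f.write('appended\n')
```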
Debug step by step to see where the problem is. Most likely the filters that extract the information you want from the HTML are wrong, or the selectors simply match nothing.
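One way to run that check is to feed the script's find_all filters a small HTML snippet whose structure you control and print the match counts. This assumes bs4 is installed; the class names are copied from the question and may not match Coursera's live pages:

```python
from bs4 import BeautifulSoup

# A tiny page imitating the structure the script expects.
html = '''
<div>
  <h2 class="color-primary-text headline-1-text flex-1">Machine Learning</h2>
  <span class="text-light offering-partner-names">Stanford</span>
  <a class="rc-OfferingCard nostyle" href="/learn/ml">link</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Print how many nodes each filter matches; a count of 0 means the
# selector is wrong for the page you actually downloaded.
names = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
partners = soup.find_all(class_="text-light offering-partner-names")
links = soup.find_all('a', class_="rc-OfferingCard nostyle")
print(len(names), len(partners), len(links))
print(names[0].get_text(strip=True))
```

The same counts printed against the real downloaded pages will show immediately which of the three filters is failing.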