Python crawler problem! Waiting online for answers!

Source: Internet
Author: User
Keywords: Python, PHP, MySQL, SQL, SQL Server
I have crawled the URLs of all the courses on Coursera and saved them in all_url.txt, about 2000 lines.
Now I want to use these URLs to crawl the other information I need and combine everything into a .csv file so it can easily be imported into a database.
In the code below I have only written a few of the fields I want, as a test (for example the course schedule; five other pieces of information I need are not in the code yet). Running it with Ctrl+B in Sublime Text gives no error, but all it ends up doing is creating the CSV file.

Besides finding the bug, I also have a second question: is the number of loop iterations too large? The outer loop runs about 2000 times, and each of the inner for loops runs roughly 10 times. How should this be optimized?

Any advice would be appreciated. Thanks!

The code is as follows:

#!usr/bin/python
# -*- coding: utf-8 -*-
import sys
import os
import urllib
import requests
import csv
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")

f = open("all_url.txt", "r")
lines = f.readlines()
for line in lines:
    html = urllib.urlopen(line)
    content = html.read()
    html.close()
    soup = BeautifulSoup(content)

    all_coursename = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
    coursename = []
    for course_name in all_coursename:
        coursename.append(course_name)

    all_courseins = soup.find_all(class_="text-light offering-partner-names")
    courseinstitution = []
    for courseins in all_courseins:
        courseinstitution.append(courseins)

    all_courseurl = soup.find_all('a', class_="rc-OfferingCard nostyle")
    courseurl = []
    for course_url in all_courseurl:
        courseurl.append(course_url)

    csvfile = file('all_info.csv', 'wb')
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    for i in range(0, len(coursename)):
        data = [(coursename[i], courseinstitution[i], courseurl[i])]
        writer.writerows(data)
    csvfile.close()

Replies:


For the first layer, fetch the URL pages with the threading module; for the second layer, just extend a list directly instead of looping to append; and finally, don't open, write, and close the file on every iteration. Collect the results first and write them to the file once at the end, as sketched below.
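A rough sketch of that restructuring (my own untested example, not the poster's code: it assumes the CSS class names from the question are still what Coursera serves, and it uses multiprocessing.dummy as a simple thread pool with an arbitrarily chosen 10 workers):

import csv
import urllib
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool
from bs4 import BeautifulSoup

def parse_course_page(url):
    # Fetch one URL in a worker thread, parse it, and return the rows for that page.
    html = urllib.urlopen(url.strip())
    soup = BeautifulSoup(html.read())
    html.close()
    names = soup.find_all('h2', class_="color-primary-text headline-1-text flex-1")
    insts = soup.find_all(class_="text-light offering-partner-names")
    links = soup.find_all('a', class_="rc-OfferingCard nostyle")
    # get_text()/get('href') pull plain strings out of the tags instead of storing tag objects
    return [(n.get_text(), i.get_text(), a.get('href', ''))
            for n, i, a in zip(names, insts, links)]

with open("all_url.txt") as f:
    urls = [line for line in f if line.strip()]

pool = ThreadPool(10)                 # first layer: 10 threads work through the ~2000 pages
rows = []
for page_rows in pool.map(parse_course_page, urls):
    rows.extend(page_rows)            # second layer: just extend one list
pool.close()
pool.join()

# Write the CSV once, after all pages have been processed.
with open('all_info.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['course_name', 'course_institution', 'course_url'])
    writer.writerows(rows)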

Check how open() with mode 'wb' is defined.

With 'w', if the file already exists it is first truncated (emptied) and then (re)created, so re-opening all_info.csv with 'wb' inside the loop throws away whatever was written in earlier iterations.
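A minimal, self-contained illustration of that behaviour (demo.txt is just a throwaway file name):

for i in range(3):
    f = open('demo.txt', 'w')        # 'w'/'wb' truncates the file on every open()
    f.write('iteration %d\n' % i)
    f.close()
print(open('demo.txt').read())       # only "iteration 2" is left

Opening the CSV once before the loop (or using append mode 'a') avoids this.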

Debug step by step to see where the problem is. It is quite possible that the information you want is being extracted from the HTML incorrectly, or that the filters simply don't match anything; a quick check is sketched below.
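One way to do that check on a single URL (a sketch reusing the selectors from the question; if any of these prints 0, that filter is not matching anything in the HTML that urllib actually receives):

import urllib
from bs4 import BeautifulSoup

url = open("all_url.txt").readline().strip()
soup = BeautifulSoup(urllib.urlopen(url).read())
print len(soup.find_all('h2', class_="color-primary-text headline-1-text flex-1"))
print len(soup.find_all(class_="text-light offering-partner-names"))
print len(soup.find_all('a', class_="rc-OfferingCard nostyle"))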
