Python: Learning notes for web crawlers

Source: Internet
Author: User

If the content you want to crawl is embedded in the HTML source of the web page, you can download the source directly and then pull the content out with a regular expression. Here's a simple example:

    import urllib.request

    html = urllib.request.urlopen('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93536')
    html = html.read().decode('utf-8')

Note that the decode method can sometimes raise an error, for example:

    html = urllib.request.urlopen('http://china.nba.com/')
    html = html.read().decode('utf-8')

    Traceback (most recent call last):
      File "<ipython-input-6-fc582e316612>", line 1, in <module>
        html = html.read().decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 85: invalid continuation byte

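I don't know the exact cause (the error only says that the bytes are not valid UTF-8, so the page is probably in some other encoding), but you can get around it by passing a second argument to decode, as follows: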

    html = html.read().decode('utf-8', 'replace')

    html = urllib.request.urlopen('http://china.nba.com/')
    html = html.read().decode('utf-8', 'replace')

    html
    Out[9]: '<!DOCTYPE html>\r\n
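As a small offline illustration of the error and of the second argument to decode, the sketch below uses four hypothetical bytes (the GBK encoding of two Chinese characters) in place of a downloaded page. That the NBA page itself is GBK-encoded is only an assumption; these notes do not confirm it.

```python
# Hypothetical sample bytes: two Chinese characters encoded as GBK,
# standing in for a downloaded page that is not really UTF-8.
raw = b'\xd6\xd0\xce\xc4'

# Strict decoding (the default) raises UnicodeDecodeError on invalid input.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'replace' never raises; undecodable bytes become U+FFFD instead.
lossy = raw.decode('utf-8', 'replace')

# Decoding with the matching codec recovers the original text.
exact = raw.decode('gbk')

print(strict_ok, repr(lossy), exact)
```

If the real encoding of a page is known, decoding with that codec keeps the text intact, whereas 'replace' silently loses the characters it could not decode.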

'replace' means that any bytes that cannot be decoded are replaced with a placeholder (U+FFFD, the Unicode replacement character, usually displayed as a question mark). It is a compromise. Back to the main topic: say we want to crawl the course name from the Massey page mentioned above.

View the page source. I use Google Chrome: right-click the page and choose 'View page source'.

Then press Ctrl+F on the source page to find the text you want to crawl:

This is the corresponding code (to read page source you need to learn some HTML; http://www.w3school.com.cn/html/index.asp is a very good site for that)

The next step is to use a regular expression to pull the string out:

    re.findall('', html)
    Out[35]: ['  ']
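Since the pattern itself is not shown above, here is a minimal sketch of the same idea on a made-up snippet. The `<h1>` tag and the sample markup are assumptions for illustration, not the actual Massey page source.

```python
import re

# Hypothetical markup; the real page source is not reproduced in these notes.
sample_html = (
    '<div><h1>Master of Advanced Leadership Practice '
    '(<span>MALP</span>) </h1></div>'
)

# The parentheses form a capture group, so findall returns only the text
# between the tags. '.*?' is non-greedy and stops at the first '</h1>'.
matches = re.findall(r'<h1>(.*?)</h1>', sample_html)
print(matches)
```

With a capture group in the pattern, the surrounding tags do not need to be stripped afterwards with replace.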

All that's left is to trim the string:

    course = re.findall('', html)
    course = str(course[0])
    course = course.replace('', '')
    course = course.replace('(<span>MALP</span>) ', '')

Results:

    course = re.findall('', html)
    course = str(course[0])
    course = course.replace('', '')
    course = course.replace('(<span>MALP</span>) ', '')

    course
    Out[40]: 'Master of Advanced Leadership Practice'

Wrap it in a function:

    def get_course(url):
        html = urllib.request.urlopen(url)
        html = html.read().decode('utf-8')
        course = re.findall('', html)
        course = str(course[0])
        course = course.replace('', '')
        course = course.replace('(<span>MALP</span>) ', '')
        return course

In this way, the course name can also be extracted from the pages of the school's other courses (forgive my wording):

    get_course('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93059')
    'Master of Counselling Studies (<span>MCounsStuds</span>) '

This is embarrassing: the second replace call hard-codes 'MALP', which only matches the first course, so it has to be changed to use a regular expression:

    def get_course(url):
        html = urllib.request.urlopen(url)
        html = html.read().decode('utf-8')
        course = re.findall('', html)
        course = str(course[0])
        course = course.replace('', '')
        # Raw string so that \( and \) reach the regex engine unchanged.
        repl = str(re.findall(r'\(<span>.*?</span>\) ', course)[0])
        course = course.replace(repl, '')
        return course
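The find-then-replace step can also be written as a single substitution. The following is only a sketch of that alternative, not the approach used in these notes; `strip_abbreviation` is a hypothetical helper name, and the sample titles mirror the two courses discussed here.

```python
import re

def strip_abbreviation(title):
    """Remove a '(<span>...</span>)' group and tidy surrounding whitespace."""
    # re.sub leaves the string unchanged when nothing matches, so a title
    # without an abbreviation does not raise the IndexError that
    # re.findall(...)[0] would raise.
    return re.sub(r'\s*\(<span>.*?</span>\)\s*', '', title).strip()

print(strip_abbreviation('Master of Advanced Leadership Practice (<span>MALP</span>) '))
print(strip_abbreviation('A title with no abbreviation'))
```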

Try again:

    get_course('http://www.massey.ac.nz/massey/learning/programme-course/programme.cfm?prog_id=93059')
    'Master of Counselling Studies'

Got it!

In fact, the page source can be parsed directly with BeautifulSoup, which makes locating elements faster. I'll cover that in the next post.
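BeautifulSoup is a third-party package and is left for that post. As a rough, standard-library-only sketch of what "parsing instead of pattern matching" means, the example below uses `html.parser.HTMLParser` (a swapped-in stand-in, not BeautifulSoup) to collect the text of the first `<h1>`. The class name, the helper name, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class FirstHeadingText(HTMLParser):
    """Collect the text inside the first <h1>, including nested tags' text."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1' and self.in_h1:
            self.in_h1 = False
            self.done = True  # ignore any later <h1> elements

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)

def first_heading(html):
    parser = FirstHeadingText()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()

# Hypothetical markup, not the actual page source.
sample = '<h1>Master of Counselling Studies (<span>MCounsStuds</span>) </h1>'
print(first_heading(sample))
```

The parser drops the tags itself, so only the abbreviation in parentheses would still need trimming.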

This was actually a task at my first job in Guangzhou: checking whether each course website still existed and still showed the original course. My supervisor wanted it checked manually ... more than 1,000 URLs. He said he had checked them by hand himself, haha, and I did not want to do that job. At the time I also tried to crawl the course names with R, struggled with it for a long time, found it too much trouble, and then learned Python. Now I estimate those 1,000-plus URLs could be handled in about 10 minutes. Just showing off a little; feel free to ignore this.
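A batch check like the one described could be sketched roughly as follows. `check_courses` and its report format are hypothetical; the `fetch_name` argument is assumed to be a function such as `get_course` above that returns a course name for a URL, and the demo uses a fake fetcher with made-up URLs so that it runs without network access.

```python
def check_courses(expected, fetch_name):
    """Compare fetched course names with the expected ones.

    expected:   dict mapping URL -> course name we expect to find there
    fetch_name: callable taking a URL and returning the course name
    """
    report = {}
    for url, name in expected.items():
        try:
            found = fetch_name(url)
        except (OSError, IndexError, UnicodeError) as exc:
            # OSError covers urllib's URLError/HTTPError; IndexError covers
            # a regex that matched nothing; UnicodeError covers bad decoding.
            report[url] = 'error: ' + type(exc).__name__
            continue
        report[url] = 'ok' if found == name else 'changed: ' + found
    return report

# Fake fetcher and made-up URLs, standing in for real downloads.
def fake_fetch(url):
    if url == 'u3':
        raise IndexError('no match')
    return {'u1': 'Course A', 'u2': 'Course B (renamed)'}[url]

report = check_courses(
    {'u1': 'Course A', 'u2': 'Course B', 'u3': 'Course C'}, fake_fetch
)
print(report)
```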
