Getting Started with Python Crawlers

Source: Internet
Author: User

My graduation project is crawler-related. I originally planned to write it in Java, and I had already written a few crawlers that way, including one that crawled NetEase Cloud Music user data and collected information on more than a million users, but I was not too satisfied with the results. I had heard that Python was strong for this kind of work, and I had never used Python before, so I decided to try it and learn the language while crawling. Enough preamble; let's get to the point.

1. The first step is to fetch the target page, which is very easy in Python.

#encoding=utf8
import urllib

res = urllib.urlopen("http://www.baidu.com")
print res.read()

Run it and compare with the source of the Baidu page in a browser: the output is the same source code. A few notes on Python syntax:

a). import brings a module into your program. Java also uses import, and C/C++ uses include; they serve the same purpose.

b). urllib is a module that ships with Python. Later on, if you need functionality that Python's built-in modules do not provide, look for a third-party module online. For example, Python has no built-in MySQL support, but you can find MySQLdb online, install it, and import it.

c). res is a variable. Unlike Java or C, there is no declaration; just write it when you use it.

d). Punctuation. In languages like Java and C, every statement ends with a semicolon or a similar terminator; Python does not need one. However, punctuation does appear in other places, such as the colon at the end of an if or for header.

e). About print: Python 2.7 has both the print() function and the print statement, and they do basically the same thing (see the Python 3 note after this list).

f). # starts a comment.

g). encoding=utf8 declares that the source file is UTF-8 encoded, which is especially important when your code contains Chinese.
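The snippets in this article are Python 2.7. As a side note (a minimal sketch, not from the original article), the same fetch in Python 3 uses urllib.request and the print() function:

# encoding=utf8
# Python 3 equivalent of the snippet above (not the original article's code);
# assumes the page is UTF-8 encoded.
import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode("utf-8"))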

2. Parse the retrieved page and extract what you want. Take Douban as an example: we want to get all the book titles on this page (for learning and exchange only):

http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book

First get the page code:

#encoding=utf8
import urllib

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
print res.read()

Run it to get the page source. By analyzing the source (Firefox is recommended: press F12 to view it), you can locate the relevant markup.

(The screenshot of the markup from the original post is missing; as the steps below show, the structure of interest is a div with id="book" containing a tags with class="title".)

Now we start parsing (with BeautifulSoup; download and install it yourself). The basic process:

a). Narrow the scope: get the block containing all the books via id="book".

b). Then iterate over all the titles via class="title".

The code is as follows:

#encoding=utf8
import urllib
import BeautifulSoup

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup.BeautifulSoup(res)
book_div = soup.find(attrs={"id": "book"})
book_a = book_div.findAll(attrs={"class": "title"})
for book in book_a:
    print book.string

Code Description:

a). book_div gets the div tag via id="book".

b). book_a gets all the book a tags via class="title".

c). The for loop iterates over all the a tags in book_a.

d). book.string outputs the text inside each a tag.
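The code above uses the old BeautifulSoup 3 module. As a side note (a minimal sketch assuming you install the newer beautifulsoup4 package instead; this is not the original article's code), the same parse would look like this:

#encoding=utf8
# Same parse using beautifulsoup4 (pip install beautifulsoup4) on Python 2;
# an alternative to the BeautifulSoup 3 module used above, not the original code.
import urllib
from bs4 import BeautifulSoup

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup(res, "html.parser")
book_div = soup.find(id="book")
for book in book_div.find_all("a", class_="title"):
    print book.get_text()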

The result is the list of book titles printed to the console (the screenshot of the output from the original post is missing).

3. Store the data you obtained, for example by writing it to a database. My database is MySQL, so I'll use it as the example (download and install the MySQLdb module yourself; I won't cover that here) and show only how to execute an SQL statement.

The code is as follows:

import MySQLdb

connection = MySQLdb.connect(host="***", user="***", passwd="***", db="***", port=3306, charset="utf8")
cursor = connection.cursor()
sql = "*******"
sql_res = cursor.execute(sql)
connection.commit()
cursor.close()
connection.close()

Description:

a). This code shows the general flow of executing an SQL statement; different kinds of statements need different handling. For example, after a SELECT you have to fetch the result set, and after an UPDATE you have to deal with failure. Work these out yourself (a sketch of the SELECT case follows this list).

b). When creating the database, pay close attention to the encoding; UTF-8 is recommended.
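Here is a minimal sketch of the SELECT case from point a). The connection placeholders are as above, and the books table and title column are made-up names for illustration:

import MySQLdb

# Run a SELECT and read the result set; the "books" table and "title" column
# are hypothetical examples, not from the original article.
connection = MySQLdb.connect(host="***", user="***", passwd="***", db="***", port=3306, charset="utf8")
cursor = connection.cursor()
cursor.execute("SELECT title FROM books")
for row in cursor.fetchall():  # each row is a tuple of column values
    print row[0]
cursor.close()
connection.close()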

4. At this point, a simple crawler is complete. After that come strategies for dealing with anti-crawler measures, such as using proxies to get around per-IP rate limits.
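As a rough illustration of the proxy idea (a minimal sketch using Python 2's urllib2; the proxy address is a made-up placeholder, not part of the original article):

# encoding=utf8
# Fetch a page through an HTTP proxy with urllib2; the address 1.2.3.4:8080
# is a placeholder assumption, not a real proxy.
import urllib2

proxy = urllib2.ProxyHandler({"http": "http://1.2.3.4:8080"})
opener = urllib2.build_opener(proxy)
res = opener.open("http://www.baidu.com")
print res.read()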
