Getting started with Python crawlers (1)

Last Update:2016-01-23 Source: Internet

Author: User


My current project involves crawlers. I originally planned to write them in Java and did write a few. One of them collected user information from Yiyun Music, more than 1 million records, but the results were not satisfactory. I have heard that Python is strong in this area, so I want to try it. I have never used Python before, so I am learning as I go. Without further ado, on to the topic.

1. First, fetch the target page. This is very simple in Python.

#encoding=utf8
import urllib
res = urllib.urlopen("http://www.baidu.com")
print res.read()

The output is the same as what you see when you open the Baidu page and view its source. Some notes on the Python syntax used here:

A). import brings a module into the program. Java also uses import, and C/C++ use #include.

B). urllib is a module that ships with Python. If you need functionality that the bundled modules do not provide, you can usually find a third-party module online. For example, to work with a MySQL database, you can find MySQLdb, install it, and import it.

C). res is a variable. Unlike in Java or C, it does not need to be declared first; you simply assign to it when you need it.

D). Punctuation. In languages like Java and C, each statement ends with a semicolon. Python does not need one: the end of the line ends the statement (a trailing semicolon is allowed but unnecessary). Some punctuation is still required, such as the colon that opens a block.

E). print. In Python 2.7, print is a statement, and writing print(...) with a single argument has almost the same effect.

F). # starts a comment.

G). #encoding=utf8 declares that the source file is encoded as UTF-8, which matters when the code contains Chinese characters.
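The snippet above targets Python 2.7. On Python 3, urlopen lives in urllib.request and print is a function. The following is a minimal sketch of the same fetch step under that assumption; the helper name fetch_page and the 10-second timeout are illustrative choices, not part of the original code.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper, Python 3)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling print(fetch_page("http://www.baidu.com")) would then play the role of print res.read() in the Python 2 version.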

2. Parse the fetched page and extract what you want. Take Douban as an example: we want to get the titles of all the books on this page (for learning and exchange only):

http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book

First, fetch the page source:

#encoding=utf8
import urllib
res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
print res.read()

This returns the page source. By analyzing it (we recommend opening the page in Firefox and pressing F12 to inspect the source), you can locate the relevant markup as follows:

Next, let's start parsing (here we use BeautifulSoup, which you need to download and install yourself). The basic process is as follows:

A). Narrow down the scope. Here we select the element that holds all the books through id="book".

B). Then use class="title" to find all the titles inside it.

The code is as follows:

#encoding=utf8
import urllib
import BeautifulSoup
res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup.BeautifulSoup(res)
book_div = soup.find(attrs={"id":"book"})
book_a = book_div.findAll(attrs={"class":"title"})
for book in book_a:
    print book.string

Code description:

A). book_div holds the div tag found through id="book".

B). book_a holds all the book a tags found inside it through class="title".

C). The for loop iterates over every tag in book_a.

D). book.string is the text inside the a tag, which is what gets printed.
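The same narrow-then-traverse idea can be sketched in Python 3 without any third-party package. This is not the BeautifulSoup code above: it swaps in the standard library's html.parser so that it runs as-is. The class name TitleCollector, the function extract_titles, and the sample markup are all made up for illustration; the real Douban markup is not reproduced here.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of <a class="title"> tags inside the id="book" element."""

    def __init__(self):
        super().__init__()
        self.container_tag = None  # tag name of the id="book" element, once seen
        self.depth = 0             # open tags of that name inside the container
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.container_tag is None:
            # Step A: narrow the scope to the element with id="book".
            if attrs.get("id") == "book":
                self.container_tag = tag
                self.depth = 1
            return
        if self.depth == 0:
            return  # the container has already been closed
        if tag == self.container_tag:
            self.depth += 1
        # Step B: inside the container, pick up every a tag with class "title".
        if tag == "a" and "title" in (attrs.get("class") or "").split():
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == "a":
            self.in_title = False
        elif tag == self.container_tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def extract_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Made-up sample markup, only shaped like the structure described above.
SAMPLE_HTML = """
<div id="other"><a class="title" href="#">Not a book</a></div>
<div id="book">
  <div class="item"><a class="title" href="#">Book One</a></div>
  <div class="item"><a class="title" href="#">Book Two</a></div>
</div>
<div id="movie"><a class="title" href="#">A Movie</a></div>
"""

for title in extract_titles(SAMPLE_HTML):
    print(title)
```

If BeautifulSoup 4 (the bs4 package) is installed, the same two steps would be soup.find(id="book") followed by find_all(class_="title"), which is the newer spelling of the findAll call used above.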

The result is as follows:

3. Store the obtained data, for example by writing it to a database. I use MySQL, so MySQL is the example here (download and install the MySQLdb module yourself; that is not covered here). We will only show how to execute one SQL statement.

The code is as follows:

import MySQLdb
# Fill in your own connection parameters in place of "***".
connection = MySQLdb.connect(host="***",user="***",passwd="***",db="***",port=3306,charset="utf8")
cursor = connection.cursor()
sql = "*******"
sql_res = cursor.execute(sql)
connection.commit()
cursor.close()
connection.close()

Note:

A). This code executes a single SQL statement. Different kinds of statements are handled differently, for example how to read the results after a select statement or how to run an update statement. Working out those cases is left to you.

B). When creating the database, pay attention to the encoding. We recommend utf8.
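To illustrate note A, here is a minimal sketch of writing rows and then reading them back after a select. It deliberately swaps MySQLdb for the standard library's sqlite3 with an in-memory database so that it runs without a server; both follow the Python DB-API (connect, cursor, execute, commit, fetchall). The table name book, the helper names, and the sample titles are assumptions made for the example. One difference to keep in mind is that sqlite3 uses ? as the parameter placeholder, while MySQLdb uses %s.

```python
import sqlite3


def save_titles(connection, titles):
    """Insert each title as one row (hypothetical table `book`)."""
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE IF NOT EXISTS book (title TEXT)")
    # Pass values as parameters instead of building the SQL string by hand.
    cursor.executemany("INSERT INTO book (title) VALUES (?)",
                       [(title,) for title in titles])
    connection.commit()  # writes are only persisted after commit
    cursor.close()


def load_titles(connection):
    """Run a select and read its result set with fetchall()."""
    cursor = connection.cursor()
    cursor.execute("SELECT title FROM book ORDER BY rowid")
    rows = cursor.fetchall()  # list of tuples, one per row
    cursor.close()
    return [row[0] for row in rows]


connection = sqlite3.connect(":memory:")
save_titles(connection, ["Book One", "Book Two"])
stored = load_titles(connection)
print(stored)
connection.close()
```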

4. A simple crawler is now complete. The next thing to deal with is anti-crawler measures. For example, you can use a proxy to get around per-IP access limits.
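As a rough sketch of the proxy idea, Python 3's urllib.request can route requests through a proxy with ProxyHandler. The proxy address below is a placeholder rather than a working proxy, and build_proxy_opener is a name invented for this example.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


PROXY_URL = "http://127.0.0.1:8080"  # placeholder address, replace with a real proxy
opener = build_proxy_opener(PROXY_URL)
# With a reachable proxy, a page could then be fetched like this:
# html = opener.open("http://www.baidu.com", timeout=10).read()
```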

Statement:

The code is for learning and exchange only and must not be used for malicious data collection, sabotage, or other harmful purposes. I am not responsible for any problems that result.

If you find any mistakes, please point them out.

Please credit the source when reprinting.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
