Introduction to Python crawlers (1): getting started
These notes are about web crawlers. Originally I wanted to write them in Java, and I did write a few crawlers that way. One of them collected user information from Yiyun music and crawled a little over 1 million records, but the result was not satisfactory. I had heard that Python is strong in this area, so I want to try it with Python, even though I have never used Python before. So I am learning as I crawl. Enough preamble; on to the topic.
1. First, fetch the target page. This is very simple in Python.
#encoding=utf8
import urllib
res = urllib.urlopen("http://www.baidu.com")
print res.read()
The output is the same as opening the Baidu page in a browser and viewing its source. The Python syntax used here is explained below.
A). import brings a module into the program. Java also uses import, and C/C++ uses #include.
B). urllib is a module that ships with Python. If a feature you need later is not covered by the bundled modules, you can look for a third-party one online; for example, to work with a MySQL database you can find MySQLdb, install it, and import it.
C). res is a variable. Unlike in Java or C, it does not need to be declared first; you simply assign to it when you need it.
D). Punctuation. In languages such as Java and C, each statement ends with a semicolon. Python does not need one: a line break ends the statement (a trailing semicolon is tolerated, but it is not idiomatic). Some punctuation is still required, such as the colon that ends a for line, as later examples show.
E). print. In Python 2.7, print is a statement; writing print(x) with parentheses also works and has almost the same effect for a single value.
F). # starts a comment.
G). #encoding=utf8 declares that the source file is UTF-8 encoded, which is needed when the code contains Chinese characters.
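The code above is Python 2. In Python 3, urlopen lives in urllib.request and print is a function, so a rough equivalent might look like the sketch below. The helper name fetch, the 10-second timeout, and the fallback to UTF-8 when the server does not state a charset are assumptions made for illustration, not part of the original code.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the body of `url` as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as res:
        raw = res.read()  # bytes in Python 3
        # Use the charset from the response headers if present; assume UTF-8 otherwise.
        charset = res.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example usage (needs network access):
#     print(fetch("http://www.baidu.com"))
```

Wrapping the call in a function keeps the decoding step in one place, which matters in Python 3 because read() returns bytes rather than a string.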
2. Parse the elements in the fetched page and extract what you want. Take Douban as an example: we want the names of all the books on this page (for learning and discussion only).
http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book
First, fetch the page source:
#encoding=utf8
import urllib
res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
print res.read()
Once you have the result, analyze the page source (Firefox is recommended; press F12 to inspect the source) to locate the relevant markup, as follows:
Next, let's start parsing. We use BeautifulSoup here (download and install it yourself). The basic process is as follows:
A). Narrow down the scope. Here we get the block containing all the books through id="book".
B). Then use class="title" to traverse all the titles.
The code is as follows:
#encoding=utf8
import urllib
import BeautifulSoup
res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup.BeautifulSoup(res)
book_div = soup.find(attrs={"id":"book"})
book_a = book_div.findAll(attrs={"class":"title"})
for book in book_a:
    print book.string
Code description:
A). book_div gets the div tag through id="book".
B). book_a gets all the book a tags inside it through class="title".
C). The for loop traverses all the a tags in book_a.
D). book.string is the text inside each a tag, which is printed.
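To see the same two-step idea (narrow to id="book", then collect every a tag with class="title") without installing anything or touching the network, here is a minimal Python 3 sketch that uses the standard-library html.parser in place of BeautifulSoup. The SAMPLE_HTML snippet and the TitleParser class are made up for illustration; the real Douban markup may well differ.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div id="nav"><a class="title" href="#">Not a book</a></div>
<div id="book">
  <dl><dd><a class="title" href="#">Book One</a></dd></dl>
  <dl><dd><a class="title" href="#">Book Two</a></dd></dl>
</div>
"""


class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside <div id="book">
        self.in_title = False  # True while inside a matching <a>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the book block
            elif attrs.get("id") == "book":
                self.div_depth = 1   # step A: narrow the scope
        elif tag == "a" and self.div_depth > 0:
            # step B: only <a class="title"> inside the book block
            if "title" in (attrs.get("class") or "").split():
                self.in_title = True
                self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
for title in titles:
    print(title)
```

The link inside the nav div is skipped because it sits outside the id="book" block, which is the point of narrowing the scope first.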
The result is as follows:
3. Store the obtained data, for example by writing it to a database. I use MySQL, so MySQL is the example here (download and install the MySQLdb module yourself; that is not covered here). Only how to execute a single SQL statement is shown.
The code is as follows:
import MySQLdb
connection = MySQLdb.connect(host="***",user="***",passwd="***",db="***",port=3306,charset="utf8")
cursor = connection.cursor()
sql = "*******"
sql_res = cursor.execute(sql)
connection.commit()
cursor.close()
connection.close()
Note:
A). This code executes one SQL statement. Different kinds of statements need different handling, for example fetching the result set after a select, or running an update. Work those out yourself.
B). When creating the database, pay attention to the encoding. utf8 is recommended.
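As a hedged illustration of note A, the sketch below runs an insert, a select, and an update through the same connect / cursor / execute / commit pattern. It uses the standard-library sqlite3 module with an in-memory database instead of MySQLdb so that it runs without a server; the book table and its contents are invented for the example. Both modules follow the Python DB-API, but the placeholder differs: sqlite3 uses ? while MySQLdb uses %s.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL server (assumption).
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table for crawled book titles.
cursor.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)")

# Insert: pass values as parameters rather than building the SQL string by hand.
crawled = ["Book One", "Book Two"]
cursor.executemany("INSERT INTO book (title) VALUES (?)", [(t,) for t in crawled])
connection.commit()

# Select: after execute(), read the rows with fetchall() (or fetchone()).
cursor.execute("SELECT title FROM book ORDER BY id")
rows = cursor.fetchall()

# Update: rowcount reports how many rows were changed.
cursor.execute("UPDATE book SET title = ? WHERE title = ?", ("Book 1", "Book One"))
updated = cursor.rowcount
connection.commit()

cursor.close()
connection.close()

print(rows)
print(updated)
```

Passing values as parameters also avoids quoting problems when a crawled title contains quotation marks.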
4. At this point, a simple crawler is complete. What remains is dealing with anti-crawler measures; for example, you can use a proxy to get around per-IP access limits.
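One possible way to send requests through a proxy in Python 3 is urllib.request.ProxyHandler, sketched below. The proxy address 127.0.0.1:8080 and the helper name build_proxy_opener are placeholders for illustration; substitute a proxy you are actually allowed to use.

```python
import urllib.request

# Hypothetical proxy address; replace with a real one.
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def build_proxy_opener(proxies):
    """Return an opener whose requests go through the given proxies."""
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXIES)

# Example usage (needs a reachable proxy and network access):
#     html = opener.open("http://www.baidu.com", timeout=10).read()
```

Building a separate opener, rather than installing it globally, makes it easy to keep several openers around and switch between proxies per request.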
Disclaimer:
The code is for learning and discussion only and must not be used for malicious collection, damage, or other harmful purposes. I am not responsible for any problems that result.
If you find any mistakes, please point them out.
Please credit the source when reposting.