Getting Started with Python Crawlers

Source: Internet
Author: User

My graduation project is crawler-related. I originally planned to write it in Java, and I had already written a few crawlers that way, including one that crawled NetEase Cloud Music user data and collected information on more than a million users, but I was not too satisfied with the results. I had heard that Python was strong for this kind of work, and I had never used Python before, so I decided to try it and learn the language while crawling. Enough preamble; let's get to the point.

1. The first step is to fetch the target page, which is very easy in Python.

#encoding=utf8
import urllib

res = urllib.urlopen("http://www.baidu.com")
print res.read()

Run it and compare with the source of the Baidu page in a browser: the output is the same source code. A few notes on Python syntax:

a). import brings a module into your program. Java also uses import, and C/C++ uses include; they serve the same purpose.

b). urllib is a module that ships with Python. Later on, if you need functionality that Python's built-in modules do not provide, look for a third-party module online. For example, Python has no built-in MySQL support, but you can find MySQLdb online, install it, and import it.

c). res is a variable. Unlike Java or C, there is no declaration; just write it when you use it.

d). Punctuation. In languages like Java and C, every statement ends with a semicolon or a similar terminator; Python does not need one. However, punctuation does appear in other places, such as the colon at the end of an if or for header.

e). About print: Python 2.7 has both the print() function and the print statement, and they do basically the same thing (see the Python 3 note after this list).

f). # starts a comment.

g). encoding=utf8 declares that the source file is UTF-8 encoded, which is especially important when your code contains Chinese.
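The snippets in this article are Python 2.7. As a side note (a minimal sketch, not from the original article), the same fetch in Python 3 uses urllib.request and the print() function:

# encoding=utf8
# Python 3 equivalent of the snippet above (not the original article's code);
# assumes the page is UTF-8 encoded.
import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode("utf-8"))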

2. Parse the retrieved page and extract what you want. Take Douban as an example: we want to get all the book titles on this page (for learning and exchange only):

http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book

First get the page code:

#encoding=utf8
import urllib

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
print res.read()

Run it to get the page source. By analyzing the source (Firefox is recommended: press F12 to view it), you can locate the relevant markup.

(The screenshot of the markup from the original post is missing; as the steps below show, the structure of interest is a div with id="book" containing a tags with class="title".)

Now we start parsing (with BeautifulSoup; download and install it yourself). The basic process:

a). Narrow the scope: get the block containing all the books via id="book".

b). Then iterate over all the titles via class="title".

The code is as follows:

#encoding=utf8
import urllib
import BeautifulSoup

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup.BeautifulSoup(res)
book_div = soup.find(attrs={"id": "book"})
book_a = book_div.findAll(attrs={"class": "title"})
for book in book_a:
    print book.string

Code Description:

a). book_div gets the div tag via id="book".

b). book_a gets all the book a tags via class="title".

c). The for loop iterates over all the a tags in book_a.

d). book.string outputs the text inside each a tag.
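The code above uses the old BeautifulSoup 3 module. As a side note (a minimal sketch assuming you install the newer beautifulsoup4 package instead; this is not the original article's code), the same parse would look like this:

#encoding=utf8
# Same parse using beautifulsoup4 (pip install beautifulsoup4) on Python 2;
# an alternative to the BeautifulSoup 3 module used above, not the original code.
import urllib
from bs4 import BeautifulSoup

res = urllib.urlopen("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
soup = BeautifulSoup(res, "html.parser")
book_div = soup.find(id="book")
for book in book_div.find_all("a", class_="title"):
    print book.get_text()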

The result is the list of book titles printed to the console (the screenshot of the output from the original post is missing).

3. Store the data you obtained, for example by writing it to a database. My database is MySQL, so I'll use it as the example (download and install the MySQLdb module yourself; I won't cover that here) and show only how to execute an SQL statement.

The code is as follows:

import MySQLdb

connection = MySQLdb.connect(host="***", user="***", passwd="***", db="***", port=3306, charset="utf8")
cursor = connection.cursor()
sql = "*******"
sql_res = cursor.execute(sql)
connection.commit()
cursor.close()
connection.close()

Description:

a). This code shows the general flow of executing an SQL statement; different kinds of statements need different handling. For example, after a SELECT you have to fetch the result set, and after an UPDATE you have to deal with failure. Work these out yourself (a sketch of the SELECT case follows this list).

b). When creating the database, pay close attention to the encoding; UTF-8 is recommended.
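Here is a minimal sketch of the SELECT case from point a). The connection placeholders are as above, and the books table and title column are made-up names for illustration:

import MySQLdb

# Run a SELECT and read the result set; the "books" table and "title" column
# are hypothetical examples, not from the original article.
connection = MySQLdb.connect(host="***", user="***", passwd="***", db="***", port=3306, charset="utf8")
cursor = connection.cursor()
cursor.execute("SELECT title FROM books")
for row in cursor.fetchall():  # each row is a tuple of column values
    print row[0]
cursor.close()
connection.close()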

4. At this point, a simple crawler is complete. After that come strategies for dealing with anti-crawler measures, such as using proxies to get around per-IP rate limits.
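As a rough illustration of the proxy idea (a minimal sketch using Python 2's urllib2; the proxy address is a made-up placeholder, not part of the original article):

# encoding=utf8
# Fetch a page through an HTTP proxy with urllib2; the address 1.2.3.4:8080
# is a placeholder assumption, not a real proxy.
import urllib2

proxy = urllib2.ProxyHandler({"http": "http://1.2.3.4:8080"})
opener = urllib2.build_opener(proxy)
res = opener.open("http://www.baidu.com")
print res.read()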
