How did you start to write Python crawlers?

Source: Internet
Author: User
Tags: xpath, test, wordpress, blog
After reading the concise Python tutorial (A Byte of Python) and Learn Python the Hard Way, I want to write a crawler but don't know where to start. What books should I read next, and how should I practice?

Reply content:

Let me talk about my own experience.

The first site I crawled was Xiami Music. I wanted to know which songs got played the most, so I crawled the play counts for the whole site and compiled statistics.
Python Crawler Learning Record (1) -- Xiami site-wide play counts
Statistics on the rating distribution of Douban anime
Page source for 2,100 Douban anime entries (including ratings, directors, genres, synopses and other information, with the crawling code)
Crawled Baidu Music lyrics and ran LDA on them
Python Crawler Learning Record (2) -- Processing lyrics with LDA
Lyrics data from Baidu Music with band tags, composers, singers and categories
Crawled every market from football lottery sites to look for a winning algorithm
Python Crawler Learning Record (4) -- The legendary football lottery "reduction" method. It is not as reliable as advertised.
Worldwide football match results from 2011 to May 2013, plus the bookmakers' football lottery odds
At the start, sites that do not require login are relatively easy. Learn how HTTP GET and POST work and how to simulate them with urllib, learn a parser library such as lxml or BeautifulSoup, and use Firefox's Firebug or Chrome's debugging tools to watch how the browser sends its requests. That is enough for everything above that needs neither login nor file downloads.
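To make that flow concrete, here is a minimal sketch, assuming Python 2.7 with the requests and lxml packages installed; the URL is only a placeholder:

    # -*- coding: utf-8 -*-
    # Minimal sketch: send a GET request, then parse the page with lxml.
    import requests
    from lxml import html

    url = "http://example.com/"                      # placeholder URL
    response = requests.get(url, timeout=10)         # plain GET, no login needed
    response.encoding = response.apparent_encoding   # guess the right charset

    tree = html.fromstring(response.text)
    for link in tree.xpath("//a"):                   # every link on the page
        print("%s -> %s" % (link.text_content(), link.get("href")))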

After that you may want to download files (pictures, music, videos and so on). For this you can try crawling Xiami songs:
Python Crawler Learning Record (3) -- Getting MP3 songs and their download addresses with Python
Crawling Wallbase wallpapers
Recently I built an AcFun video ranking: a scheduled job crawls AcFun several times a day and downloads the videos to a server cache.
Python Crawler Learning Record (5) -- An AcFun video leaderboard with Python, MongoDB, a crawler and web.py
202.120.39.152:8888
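Downloading a file is mostly just saving the response body to disk in chunks. A small sketch, assuming requests is installed; the URL is a placeholder:

    # -*- coding: utf-8 -*-
    # Sketch: stream a binary file (MP3, image, video ...) to disk in chunks,
    # so large files never have to fit in memory.
    import requests

    file_url = "http://example.com/some_song.mp3"    # placeholder URL
    response = requests.get(file_url, stream=True, timeout=30)
    response.raise_for_status()

    with open("some_song.mp3", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:                                # skip keep-alive chunks
                f.write(chunk)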

Then you may need to simulate a user login to crawl sites that require it (such as Renren or Sina Weibo). For a small crawler, I recommend reusing the browser's cookies to fake the login:
Python Crawler Learning Record (0) -- Python crawler capture notes (Xiami, Baidu, Douban, Sina Weibo)
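The simplest cookie trick is to copy the Cookie string out of the browser's F12 tools and send it back with every request. A rough sketch; the cookie value and URL are placeholders:

    # -*- coding: utf-8 -*-
    # Sketch: reuse a cookie copied from the browser to fetch a page that
    # normally requires login.
    import requests

    headers = {
        "User-Agent": "Mozilla/5.0",          # pretend to be a normal browser
        "Cookie": "sessionid=PASTE_YOUR_BROWSER_COOKIE_HERE",
    }
    response = requests.get("http://example.com/protected_page", headers=headers)
    print(response.status_code)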

===========================
In short, learn by doing: look at operations you used to find tedious and ask whether a crawler could simplify them. And data you crawl only becomes valuable once you sort, filter and analyze it.

2015-08-31: I updated the expired Baidu Space links, which now point to CSDN. Some code may no longer work because the sites have since been redesigned; the point here is mainly to share application ideas. Reading most of the other answers, I cannot help sighing: many experts answer "how do I get started with crawlers" the way a genius explains a problem, skipping countless steps and ending with "isn't it just like this", leaving the beginners completely bewildered. I started from zero (I did not even know Python), and only now have I mastered the basics and begun advancing from rookie level, so I know it is not easy. In this answer I will share, as completely and in as much detail as I can, every step of learning crawlers from zero. If it helps you, please upvote ~

-------------------------------------------------------------------------------------------------
# I want to write a crawler!
# Ver. 1.2
# Based on: Python 2.7
# Author: Kano

# Original content; please credit the source when reposting

First of all! You need a clear understanding of what a crawler is, and here I will borrow Chairman Mao's idea:


Despise it strategically:
  • "Every site can be crawled": everything on the Internet is written by people, and people are lazy (they will not name the first page "a" and the next page "8" at random), so there must be patterns. That is what makes crawling possible; you could say there is no site in the world that cannot be crawled.
  • "The framework doesn't change": sites differ, but the principles are similar. Most crawlers follow the same flow (send a request, get the page, parse the page, download the content), just with different tools.

Take it seriously tactically:
  • Be persistent, avoid rashness: as a beginner, do not get complacent and think that crawling a bit of content means you can crawl anything. Crawling is a relatively simple technique, but going deep has no ceiling (search engines, for instance)! Keep trying and keep studying; that is the only way! (Why does this read like a primary-school essay?)
||
||
V

Then you need a grand goal to keep you motivated (without a real project, it is really hard to push yourself to learn):
I'm going to crawl all of Douban! ...
I'm going to crawl the entire Caoliu community!
I'm going to crawl every girl's contact info *&^#%^$#
||
||
V

Next, you need to ask yourself: are your Python fundamentals up to scratch?
They are? -- OK, happily start learning crawlers!
Not yet? Then you still have some studying to do! Hurry back to Liao Xuefeng's tutorial,
the Python 2.7 one. At minimum you should have a basic grasp of these features and bits of syntax (a toy sketch follows the list):
  • list, dict: to hold and serialize the things you crawl
  • slices: to cut up the crawled content and pull out the parts you need
  • conditionals (if, etc.): to decide what to keep and what to skip while crawling
  • loops and iteration (for, while): to repeat the crawling steps
  • file reading and writing: to read parameters and save the crawled content
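Here is that toy sketch, showing roughly where each of these features appears in a crawler script (no networking, the data is made up):

    # -*- coding: utf-8 -*-
    # Toy sketch of where these language features show up in a crawler.
    songs = []                                    # list: holds the records we crawl
    for line in ["Song A|1000", "Song B|52"]:     # loop: one item per crawled entry
        name, plays = line.split("|")             # split the raw text
        name = name.strip()[:30]                  # slice: keep at most 30 characters
        if int(plays) > 100:                      # conditional: keep only what we want
            songs.append({"name": name, "plays": int(plays)})  # dict per record

    with open("songs.txt", "w") as f:             # file I/O: save the result
        for song in songs:
            f.write("%s,%d\n" % (song["name"], song["plays"]))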
||
||
V

Then you need to add some of the following as your knowledge reserve:
(Note: this does not mean "mastering" it. For the two points below you only need a rough understanding at first, and then keep practicing through concrete projects until you do master them.)

1. Basic knowledge of the web:
Basic HTML (href and the other things taught in university-level computer literacy courses)
The idea of how a site sends and receives data (POST and GET)
A little JavaScript, to understand dynamic pages (of course, the more you know the better)

2. Some parsing languages, to prepare for analyzing page content:
No.1, regular expressions: the old standby; they will always be the most fundamental skill:
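For example, a tiny sketch of pulling a play count out of an HTML snippet with the standard re module (the HTML string is made up for illustration):

    # -*- coding: utf-8 -*-
    # Sketch: use a regular expression to extract a number from HTML.
    import re

    html_snippet = '<span class="play-count">12345</span>'
    match = re.search(r'class="play-count">(\d+)<', html_snippet)
    if match:
        print(match.group(1))   # -> 12345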


No.2, XPath: an efficient parsing language with clear and simple expressions. Once you have the basics down you can often skip regular expressions entirely.
Reference: XPath Tutorial
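A small sketch of the same kind of extraction done with XPath via lxml (the HTML is made up):

    # -*- coding: utf-8 -*-
    # Sketch: extract titles and counts from a made-up list page with XPath.
    from lxml import html

    page = html.fromstring("""
    <ul>
      <li><a href="/song/1">Song One</a><span>12345</span></li>
      <li><a href="/song/2">Song Two</a><span>678</span></li>
    </ul>
    """)
    titles = page.xpath("//li/a/text()")          # -> ['Song One', 'Song Two']
    counts = page.xpath("//li/span/text()")       # -> ['12345', '678']
    print(list(zip(titles, counts)))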

No.3 BeautifulSoup:
Beautiful Soup is a superb module for parsing web pages. If you are not using a crawler framework (such as Scrapy, described later), then together with requests, urllib and similar modules (also detailed later) you can write all kinds of small but capable crawler scripts.
Official documentation: Beautiful Soup 4.2.0 Documentation
Reference case:
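For instance, a small sketch parsing the same made-up list page as in the XPath sketch above, this time with BeautifulSoup 4:

    # -*- coding: utf-8 -*-
    # Sketch: parse a made-up page with BeautifulSoup 4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("""
    <ul>
      <li><a href="/song/1">Song One</a><span>12345</span></li>
      <li><a href="/song/2">Song Two</a><span>678</span></li>
    </ul>
    """, "html.parser")

    for li in soup.find_all("li"):
        print("%s %s" % (li.a.get_text(), li.span.get_text()))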

||
||
V
Next, you need some efficient tools to help:
(Again, just get to know these first; become familiar with them when you work on concrete projects.)
No.1, the F12 developer tools:
  • View the source: quickly locate the element you care about
  • Analyze the XPath: with a Chromium-family browser you can right-click an element in the source view and copy its XPath directly


No.2, a packet-capture tool:
  • HttpFox is recommended: a Firefox plugin that is better than the F12 tools built into Chrome or Firefox, and makes it easy to inspect the packets a site sends and receives


No.3, XPath Checker (Firefox plugin):
A very good XPath testing tool, but it has a few pitfalls that everyone steps in, so a word of warning:
1. XPath Checker generates absolute paths. With dynamically generated elements (the buttons on list pages are a common case), an unstable absolute path easily causes errors, so during real analysis treat its output only as a reference.
2. Remember to strip the "x:" prefix from the XPath it produces. It seems to be an older XPath syntax and is incompatible with some modules (such as Scrapy); delete it to avoid errors.


No.4, a regular-expression testing tool:
An online regex tester: use it to practice and to help with your analysis! There are also plenty of ready-made regular expressions out there you can borrow for reference.
||
||
V
OK! Now that you have a basic grasp of all that, it is crawling time: bring on the modules! A big reason Python is so popular is its wealth of useful modules, and these are the must-haves for crawling sites wherever you are:
urllib
urllib2
requests
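As a quick taste, a sketch of a GET and a POST with requests, sending a browser-like User-Agent (the URLs and form fields are placeholders):

    # -*- coding: utf-8 -*-
    # Sketch: GET with query parameters and POST with form data via requests.
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}

    # GET with query-string parameters
    r = requests.get("http://example.com/search",
                     params={"q": "python"}, headers=headers)

    # POST with form data (e.g. a simple login form)
    r = requests.post("http://example.com/login",
                      data={"user": "me", "password": "secret"}, headers=headers)
    print(r.status_code)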
||
||
V
Don't want to reinvent the wheel? Isn't there a ready-made framework?
The gorgeous Scrapy (I will focus on this part; it is my favorite)
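To give a feel for it, a minimal Scrapy spider might look roughly like this (domain, URLs and selectors are placeholders; after installing Scrapy you could run it with "scrapy runspider song_spider.py"):

    # -*- coding: utf-8 -*-
    # Sketch: a minimal Scrapy spider that follows "next page" links.
    import scrapy

    class SongSpider(scrapy.Spider):
        name = "songs"
        start_urls = ["http://example.com/songs?page=1"]

        def parse(self, response):
            for row in response.xpath("//li"):
                yield {
                    "title": row.xpath("./a/text()").extract_first(),
                    "plays": row.xpath("./span/text()").extract_first(),
                }
            next_page = response.xpath("//a[@rel='next']/@href").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)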
||
||
V
What if I run into a dynamic page?
Selenium (paired with Scrapy it makes up for Scrapy's shortcomings, and it is another must-have artifact for crawling sites; the next version of this answer will promote it in detail, because there seem to be very few tutorials about it online right now)
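A rough sketch of the idea, assuming a matching browser driver (such as chromedriver) is installed and the URL is a placeholder: Selenium loads the page, the JavaScript runs, and the rendered HTML is handed to BeautifulSoup:

    # -*- coding: utf-8 -*-
    # Sketch: render a JavaScript-heavy page with Selenium, then parse it.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get("http://example.com/dynamic_page")
    html = driver.page_source            # HTML after JavaScript has run
    driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text() if soup.title else "no <title>")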
||
||
V
What do I do with the stuff I have crawled?
pandas (a data-analysis module built on NumPy; believe me, unless you work with terabytes of data for a living, this is enough)
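For example, a sketch of loading crawled records into a DataFrame for quick statistics (the records are made-up sample data):

    # -*- coding: utf-8 -*-
    # Sketch: quick analysis of crawled records with pandas.
    import pandas as pd

    records = [
        {"title": "Song One", "plays": 12345},
        {"title": "Song Two", "plays": 678},
    ]
    df = pd.DataFrame(records)
    print(df.describe())                                  # quick summary statistics
    print(df.sort_values("plays", ascending=False).head())
    df.to_csv("songs.csv", index=False, encoding="utf-8") # save for later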
||
||
V
Then the database. I don't think you need to go deep at the start; learn it when you need it:
MySQL
MongoDB
SQLite
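To start with, the standard-library sqlite3 module is enough; a small sketch of storing crawled items (the data is made up):

    # -*- coding: utf-8 -*-
    # Sketch: store crawled items in SQLite, no database server required.
    import sqlite3

    conn = sqlite3.connect("crawl.db")
    conn.execute("CREATE TABLE IF NOT EXISTS songs (title TEXT, plays INTEGER)")
    conn.executemany("INSERT INTO songs VALUES (?, ?)",
                     [("Song One", 12345), ("Song Two", 678)])
    conn.commit()

    for row in conn.execute("SELECT title, plays FROM songs ORDER BY plays DESC"):
        print(row)
    conn.close()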
||
||
V
Advanced techniques
Multithreading
Distributed crawling
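As a first taste of multithreading, a sketch that fetches several pages in parallel with a small thread pool (the URLs are placeholders; multiprocessing.dummy provides a thread-based Pool):

    # -*- coding: utf-8 -*-
    # Sketch: fetch several pages concurrently with a thread pool.
    from multiprocessing.dummy import Pool
    import requests

    urls = ["http://example.com/page/%d" % i for i in range(1, 6)]

    def fetch(url):
        return url, requests.get(url, timeout=10).status_code

    pool = Pool(4)                       # 4 worker threads
    for url, status in pool.map(fetch, urls):
        print("%s -> %s" % (url, status))
    pool.close()
    pool.join()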



V1.2 changelog: adjusted some details and the order of the content.
Python Web Crawler for Beginners: The Essentials

Learning web crawling in Python breaks down into three major parts: crawl, analyze, store.

In addition, the most commonly used crawler framework, Scrapy, is introduced in detail at the end.

First, I have summarized the relevant articles I wrote, which cover the basic concepts and techniques needed to get started with web crawlers: Ning Ge's little site: web crawlers

What happens in the background between typing a URL into the browser and the page coming back? For example, if you enter Ning Ge's site (fireling's data world, focused on web crawlers, data mining and machine learning), you will see its home page.

In a nutshell, four steps happen in this process (a small sketch follows the list):

    • Find the IP address that corresponds to the domain name.
    • Send the request to the server that corresponds to the IP.
    • The server responds to the request and sends back the page content.
    • The browser parses the Web page content.
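A small sketch of the first three steps, using socket for the DNS lookup and requests for the round trip (the domain is a placeholder; parsing, step four, is what the rest of a crawler does):

    # -*- coding: utf-8 -*-
    # Sketch: resolve the domain, send the request, receive the page.
    import socket
    import requests

    domain = "example.com"
    ip = socket.gethostbyname(domain)            # step 1: domain -> IP
    print("%s resolves to %s" % (domain, ip))

    response = requests.get("http://" + domain)  # steps 2-3: request + response
    print(response.status_code)
    print(response.text[:200])                   # first part of the page content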

Put simply, what a web crawler does is reproduce the browser's job: given a URL, it returns the data directly to the user, with no need to drive a browser by hand to fetch it.

Crawl

For this step, be clear about what you want to fetch: the HTML source, a JSON-formatted string, or something else.

1. The most basic crawl

Most crawling is a GET request, fetching data straight from the target server.

First of all, Python ships with the urllib and urllib2 modules, which basically cover ordinary page fetching. Beyond these, requests is a very useful package, and there are similar ones such as httplib2.

Requests:
    import requests
    response = requests.get(url)
    content = requests.get(url).content
    print "response headers:", response.headers
    print "content:", content

Urllib2:
    import urllib2
    response = urllib2.urlopen(url)
    content = urllib2.urlopen(url).read()
    print "response headers:", response.headers
    print "content:", content

Httplib2:
    import httplib2
    http = httplib2.Http()
    response_headers, content = http.request(url, 'GET')
    print "response headers:", response_headers
    print "content:", content
Motivation: I wanted to crawl the course information out of my school's academic system.

Approach: first understand the HTTP protocol, then learn to use Python's requests module, then put it into practice.

Practice: I worked directly in the terminal, trying to fetch http://www.baidu.com first, then published my own crawler to PyPI ... Then I pulled down the 2000+ courses from two semesters of the academic system, and also tried attacking a friend's website ... and took it down!

The details are in this blog post I wrote about writing a Python crawler:
https://jenny42.com/2015/02/write-a-spider-use-python/

Then I suddenly realized this is not really a crawler, at most it fetches web pages, because I had not learned XPath or CSS selectors and never crawled a whole site ...

You'll find that even though my crawling skills are terrible ... most of my attempts got feedback: the published module shows its download count (amazing that anyone downloads such a half-baked module), grabbing the academic-system data was great fun (I discovered that roughly 250 students at my school share a name with someone else), and bringing down my friend's site meant I could report bugs to them ... I found learning this way really interesting.

And I feel I learned a lot. I also became very curious about more advanced skills, like how to crawl sites protected by CAPTCHAs and how to crawl large sites. At the start I had only skimmed Python and written small programs and little tools, just enough to feel how concise and powerful it is.

Then I suddenly wanted to try writing a crawler, so I crawled my favorite music site, Luoo, grabbing all the music from the first issue up to the present, including the pictures from each issue. I also wrote a script that automatically downloads all the songs in the latest issue, and tried packaging it into an EXE with PyInstaller to share with a few friends who also like Luoo. ↓ Here is the fruit of my crawling.


As for how to learn: I am just a novice with no one guiding me, so I can only describe what I did:
    1. First, you need to understand basic Python syntax. I recommend the book "Python Basic Tutorial" (Beginning Python); it is very suitable for getting started.
    2. Second, analyze what your crawler needs to do. What is the program's concrete flow? Sketch its general framework. What other difficulties might there be?
    3. Then find out which libraries a typical crawler uses; they will solve a lot of problems for you. I recommend requests ("HTTP for Humans"). Other libraries such as urllib2 and BeautifulSoup are also worth knowing.
    4. Start writing. When you hit a problem, Google it; if Google does not have the answer, ask someone. When I got stuck I sent private messages to experts on Zhihu and got help. Along the way you will also pick up a lot of related knowledge, such as the HTTP protocol, multithreading, and so on.
Or you can use someone else's framework directly, such as the Scrapy others have mentioned, and skip reinventing the wheel. Many of the things you do repeatedly on the web can be written as Python scripts.
For example, saving good articles you want to keep, or automatically sending them to your Kindle on a schedule:

Python crawler that pushes articles to a Kindle e-book


Python brute-force cracking of a WordPress blog's back-end login password



Batch-fetching "shadow Mowgli" pictures with Python (link being repaired)
Using Python to crack the user passwords of a 211-university BBS forum (link being repaired)
It feels like doing it with a purpose makes the motivation more concrete. I am currently preparing to crawl stock information for research (and to play the market).
More from "30 days to try something new": the first time I wanted to write a crawler was to grab the high-click-rate video links on Caoliu. The code is below; you will need to get over the wall.
    # -- coding: utf-8 --
    import urllib2
    import sys
    from bs4 import BeautifulSoup

    reload(sys)
    sys.setdefaultencoding('utf8')  # fix garbled characters when writing to file
    BaseUrl = "http://t66y.com/"
    j = 1
    for i in range(1):  # set the start and end page numbers
        url = "http://t66y.com/thread0806.php?fid=22&search=&page=" + str(i)
        # str() would turn the string into unicode by default, hence the sys reset above
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, from_encoding="gb18030")  # fix BeautifulSoup Chinese encoding problems
        print("Reading page " + str(i))
        counts = soup.find_all("td", class_="tal f10 y-style")
        for count in counts:
            if int(count.string) > 15:  # pick the click rate you want
                videocontainer = count.previous_sibling.previous_sibling.previous_sibling.previous_sibling
                video = videocontainer.find("h3")
                print("Downloading link " + str(j))
                line1 = video.get_text()
                line2 = BaseUrl + video.a.get('href')
                line3 = "View *" + count.string + "*"
                print line1
                f = open('Cao.md', 'a')
                f.write("\n" + "### " + line1 + "\n" + "<" + line2 + ">" + "\n" + line3 + " page " + str(i) + "\n")
                f.close()
                j += 1
I will recommend just one library, no explanation needed:

requests ("HTTP for Humans"). I first realized how awesome Python crawlers could be after reading the SimpleCD author's blog.

Then I wrote a script to fetch the rating pages of 10,000+ Douban movies. Later the lab had a project and I wrote a script to crawl 500,000 Weibo posts; figuring out along the way how to simulate a login and fool the server was the most interesting part.

To sum up: just read that person's earlier blog posts. A simple crawler barely needs any advanced techniques; it comes down to a few things:
1. Get familiar with using urllib
2. Understand basic HTML parsing; usually the most basic regular expressions are enough. Even without a solid programming background, to write a Python crawler you basically only need to read those earlier posts and learn urllib and BeautifulSoup.

Besides that, having real working examples to read makes it much more intuitive how a crawler actually runs.
Here is a crawler I wrote, together with the crawled data, open source (if you pass by, please give it a star or a fork): MorganZhang100/zhihu-spider on GitHub
The data it crawls powers http://zhihuhot.sinaapp.com/

Put simply, it crawls a few parameters of Zhihu questions and analyzes them to find the questions most likely to become hot. By answering those questions, my rate of getting upvotes is roughly 20 times what it was before.

I am not especially familiar with Python, but the whole thing is only a few hundred lines of code, so the original poster should have no trouble reading it.

Honestly, for learning anything, reading tutorials is nowhere near as effective as writing a few lines of code yourself. Just start writing and you will find out what you need.