How did you start to write Python crawlers?

Source: Internet
Author: User
Tags: wordpress, blog, python, web crawler
After reading a concise Python tutorial and Learn Python the Hard Way, I still can't get started writing crawlers; should I keep reading books and doing exercises? -- Let me talk about my own experience.

I first crawled Xiami: I wanted to see which of the songs I listen to are the most popular, so I crawled play counts from Xiami and compiled statistics.
Python Crawler Learning Record (1) -- site-wide play counts on Xiami
Statistics on the score distribution of Douban animation
Captured the page source of some 2,100 Douban animation pages (including rating, director, genre, synopsis, and so on)
Crawled Baidu lyrics and ran LDA on them
Python Crawler Learning Record (2) -- processing lyrics with LDA
Baidu Music lyrics, including tags, composer, singer, and category
Crawled every draw on a lottery website, looking for a winning algorithm
Python Crawler Learning Record (4) -- the legendary lottery doubling method... it does not seem all that reliable.
Score and odds data for football matches worldwide from all the betting companies, 2011 to May 2013
Websites that do not require login are relatively simple at the start. Master how to simulate HTTP GET and POST requests with urllib/urllib2, master parser libraries such as lxml and BeautifulSoup, and use Firefox's Firebug or Chrome's developer tools to see how the browser sends requests. None of the steps above requires login or downloading files.
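As a minimal sketch of the above (Python 2.7, urllib2 for the request and lxml for the parsing; the URL and the XPath expression are placeholders for whatever site you are studying with Firebug/F12):

    # -*- coding: utf-8 -*-
    # Minimal GET-and-parse sketch; the URL and XPath are placeholders.
    import urllib2
    from lxml import etree

    url = 'http://example.com/list?page=1'
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(request).read()

    tree = etree.HTML(html)
    for title in tree.xpath('//h2/a/text()'):   # adjust the XPath to the page you inspected
        print title

    # a POST is the same request with a form body, e.g.:
    # import urllib
    # urllib2.urlopen(urllib2.Request(url, data=urllib.urlencode({'key': 'value'}))).read()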

Then you may want to download files (images, music, videos, and so on). You can try crawling Xiami songs (a minimal download sketch follows the examples below).
Python Crawler Learning Record (3) -- using Python to fetch Xiami collection songs and MP3s
Crawling wallbase wallpapers
Recently I made an AcFun video ranking; I crawl AcFun several times a day and cache the videos on my server.
Python Crawler Learning Record (5) -- AcFun video ranking with Python, MongoDB, a crawler, and web.py
202.120.39.152:8888
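As a minimal sketch of the download step (the MP3 URL is a placeholder; urllib.urlretrieve simply saves the bytes to disk):

    # -*- coding: utf-8 -*-
    # Minimal file-download sketch; the URL and filename are placeholders.
    import urllib

    mp3_url = 'http://example.com/some_song.mp3'
    urllib.urlretrieve(mp3_url, 'some_song.mp3')   # saved in the current directory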

Then you may need to simulate user login to crawl websites that require it (such as Renren and Sina Weibo). If it is just a small-scale crawler, I recommend reusing a browser cookie to simulate login (a small sketch follows the link below).
Python Crawler Learning Record (0) -- Python crawler site-capture record (Xiami, Baidu, Douban, Sina Weibo)
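A hedged sketch of that cookie trick (copy the cookie string from your browser's F12 or HTTPFox panel after logging in; the URL and cookie value here are placeholders):

    # -*- coding: utf-8 -*-
    # Minimal "reuse the browser cookie" sketch for pages that require login.
    import urllib2

    url = 'http://www.example.com/profile'
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Cookie': 'sessionid=PASTE_YOUR_BROWSER_COOKIE_HERE',
    }
    request = urllib2.Request(url, headers=headers)
    print urllib2.urlopen(request).read()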

======================================
What I want to say is: don't learn just for the sake of learning. Think of any operation that used to be tedious and check whether a crawler could simplify it, and whether the crawled data is worth sorting, filtering, and analyzing.

P.S. The links that previously pointed to my expired Baidu Space posts have been updated to CSDN, and some code may no longer work because of website redesigns; the point here is mainly to offer application ideas.

After reading most of the answers, I cannot help but sigh: many experts answer "how to get started with crawlers" the same way they would explain a problem they solved long ago, skipping countless steps and ending with "see, it just falls out like that". As someone who started from zero (not even knowing Python), worked through the basics, and is now going deeper, I know it is not easy. So I will answer this question and share the steps for learning crawlers from scratch as comprehensively and in as much detail as possible. If it helps you, please give it a like ~

-------------------------------------------------------------------------------------------------
# I want to write crawlers!
# Ver.1.2
# Based on: Python 2.7
# Author: Gao yuliang

# Original content. For reprinted content, please indicate the source

First! You need to have a clear understanding of crawlers. Here we reference the ideas of Chairman Mao:


Strategic contempt:
  • "All websites can be crawled ":The content on the Internet is written by people, and it is written by people. (No first page is a, and the next page is 8, this gives people the possibility of crawling. It can be said that there are no websites in the world that cannot be crawled.
  • "Framework unchanged ":Websites are different, but the principles are similar. Most crawlers useSend request -- get page -- parse page -- download content -- save contentThis process is only implemented using different tools.

Tactical emphasis:
  • Persevere, and guard against arrogance and rashness: beginners easily become complacent after crawling a little content and conclude that crawlers are a simple technology, but studying them in depth has no end (search engines, for example)! Keep trying and keep studying! (Why does this read like a primary-school essay...)
|
|
V

Then, you need an ambitious goal to keep you motivated to learn (without a practical project, it is really hard to stay motivated):
I want to crawl all of Douban!...
I want to crawl the entire caoliu community!
I want to collect the contact information of all kinds of young ladies *&^#%^$#
|
|
V

Next, you need to ask yourself: how are your basic Python skills?
Solid? -- OK, start learning crawlers happily!
Not yet? Then you need to learn them first! Go back and work through instructor Liao Xuefeng's Python 2.7 tutorial.
At the very least, you need basic knowledge of the following functions and syntax (a toy example follows this list):
  • List, dict: used to store the content you crawl
  • Slicing: used to split and trim the crawled content
  • Conditional judgment (if, etc.): used to decide which content to keep and which to skip while crawling
  • Loops and iteration (for, while): used to crawl repeatedly, page after page
  • File read/write operations: used to read parameters and save crawled content
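As a toy illustration of how those basics show up in a crawler script (the data here is made up):

    # -*- coding: utf-8 -*-
    # list/dict, slicing, if, for and file writing, crawler-style.
    songs = [
        {'title': 'Song A', 'plays': 1200},
        {'title': 'Song B', 'plays': 80},
    ]

    top = [s for s in songs if s['plays'] > 100]   # conditional judgment: keep only popular songs
    top = top[:10]                                 # slicing: at most ten results

    f = open('result.txt', 'w')
    for s in top:                                  # loop over the crawled items
        f.write('%s\t%d\n' % (s['title'], s['plays']))
    f.close()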
|
|
V

Then, you need the following as a knowledge reserve:
(Note: "mastery" is not required here. The two points below only need to be understood for now; you will practice them through concrete projects until you are proficient.)

1. Basic webpage knowledge:
Basic HTML (know what href and similar attributes are -- roughly university computer literacy level 1 content)
Understand the concept of a website sending and receiving packets (GET and POST requests)
A little JavaScript, to understand dynamic web pages (knowing more is of course even better)

2. Some parsing languages, to prepare for extracting content from webpages:
NO.1 Regular expressions: the workhorse technique, and always the most fundamental:
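A tiny sketch of regex extraction with the standard re module (the HTML fragment is made up):

    # -*- coding: utf-8 -*-
    # Pull link text out of an HTML fragment with a regular expression.
    import re

    html = '<a href="/item/1">First</a> <a href="/item/2">Second</a>'
    print re.findall(r'<a href="[^"]*">([^<]+)</a>', html)   # ['First', 'Second']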


No. 2 XPath: an efficient parsing language with clear, simple syntax. Once you understand it, you can mostly skip regular expressions.
Reference: XPath tutorial
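A tiny sketch of the same kind of extraction done with XPath via lxml (the HTML fragment is made up):

    # -*- coding: utf-8 -*-
    # XPath extraction with lxml on a made-up fragment.
    from lxml import etree

    tree = etree.HTML('<div><a href="/item/1">First</a><a href="/item/2">Second</a></div>')
    print tree.xpath('//a/text()')   # ['First', 'Second']
    print tree.xpath('//a/@href')    # ['/item/1', '/item/2']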

No. 3 Beautifulsoup:
The Beautiful Soup module is a marvel for parsing web pages, a real marvel. If you do not use a crawler framework (such as Scrapy, mentioned later), then combined with the requests and urllib modules (also detailed later) it lets you write all kinds of compact, lean crawler scripts.
Official Website documentation: Beautiful Soup 4.2.0 documentation
Reference cases:
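Separately from the reference cases above, a minimal sketch of fetching a page with requests and walking it with Beautiful Soup 4 (the URL is a placeholder):

    # -*- coding: utf-8 -*-
    # Fetch with requests, parse with bs4.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get('http://example.com/').text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        print a.get('href'), a.get_text()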

|
|
V
Next, you need some efficient tools as helpers
(just get a first impression here; you will learn how to use them in concrete projects)
NO.1 F12 developer tools:
  • View source code: quickly locate elements
  • Analyze XPath: Chrome is recommended here; you can right-click an element directly in the source panel and copy its XPath


No. 2 packet capture tool:
  • HTTPFox, a Firefox plug-in, is recommended; it is easier to use than Chrome's and Firefox's built-in F12 tools and lets you conveniently inspect the packets a website sends and receives.


No. 3 XPath Checker (Firefox plug-in):
It is a very good XPath testing tool, but it has a few pitfalls that I stepped on myself, so a word of warning:
1. XPath Checker generates absolute paths, so when a page contains dynamically generated elements (such as list paging buttons), erratic absolute paths can cause errors; treat its output as a reference only.
2. Remember to remove the "x:" prefix in the XPath box. It seems to be syntax from an older XPath version and is incompatible with some modules (such as Scrapy), so delete it to avoid errors.


No. 4 Regular Expression test tool:
Online regular expression testers also help with analysis, and many ready-made regular expressions are available to use or reference.
|
|
V
OK! With this knowledge in place, you can get your hands dirty and start crawling with the various modules! A major reason Python is so popular is precisely its many easy-to-use modules, the must-have companions for dealing with websites --
Urllib
Urllib2
Requests
|
|
V
I don't want to reinvent the wheel -- is there a ready-made framework?
Scrapy, strongly recommended (my favorite)
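A minimal Scrapy spider sketch (the domain, start URL and XPath are placeholders; run it with "scrapy runspider example_spider.py"):

    # -*- coding: utf-8 -*-
    # Minimal Scrapy spider; everything here is a placeholder target.
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://example.com/list?page=1']

        def parse(self, response):
            for title in response.xpath('//h2/a/text()').extract():
                yield {'title': title}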
|
|
V
What should I do if I run into a dynamic page?
Selenium
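A minimal Selenium sketch for a dynamic page: let a real browser execute the JavaScript, then read the rendered HTML (the URL is a placeholder; the browser driver must be installed separately):

    # -*- coding: utf-8 -*-
    # Render a JS-heavy page in a real browser, then grab the resulting HTML.
    from selenium import webdriver

    driver = webdriver.Firefox()              # or webdriver.Chrome(), with the matching driver installed
    driver.get('http://example.com/dynamic')
    print driver.page_source                  # the HTML after JavaScript has run
    driver.quit()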
|
|
V
How do I use what I have crawled?
Pandas (a data-analysis module built on NumPy. Believe me, unless your work is TB-scale data, this is enough)
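A minimal pandas sketch for looking at crawled results (the file name and columns are placeholders):

    # -*- coding: utf-8 -*-
    # Load crawled data and do some quick aggregation with pandas.
    import pandas as pd

    df = pd.read_csv('result.csv')    # e.g. columns: title, plays
    print df.describe()
    print df.sort_values('plays', ascending=False).head(10)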
|
|
V
Then there are databases. I don't think you need to go very deep at the beginning; study them as the need arises (a minimal SQLite sketch follows the list).
MySQL
MongoDB
SQLite
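A minimal SQLite sketch (sqlite3 is in the standard library, so it is the easiest place to start persisting crawled records):

    # -*- coding: utf-8 -*-
    # Store crawled records in a local SQLite file.
    import sqlite3

    conn = sqlite3.connect('crawl.db')
    conn.execute('CREATE TABLE IF NOT EXISTS songs (title TEXT, plays INTEGER)')
    conn.execute('INSERT INTO songs VALUES (?, ?)', ('Song A', 1200))
    conn.commit()
    print conn.execute('SELECT * FROM songs').fetchall()
    conn.close()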
|
|
V
Advanced topics (a small multithreading sketch follows the list):
Multithreading
Distributed
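A minimal multithreading sketch: fetch several pages in parallel with a thread pool (the URLs are placeholders):

    # -*- coding: utf-8 -*-
    # Thread-pool fetching with multiprocessing.dummy (threads, despite the module name).
    from multiprocessing.dummy import Pool
    import urllib2

    urls = ['http://example.com/page/%d' % i for i in range(1, 6)]

    def fetch(url):
        return len(urllib2.urlopen(url).read())

    pool = Pool(4)
    print pool.map(fetch, urls)      # list of page sizes, in the same order as urls
    pool.close()
    pool.join()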



V1.2 update log: adjusted some details and the order of the content
Getting started with Python web crawlers (essentials)

Learning Python web crawling breaks down into three major parts: capture, analysis, and storage.

In addition, there is Scrapy, a commonly used crawler framework.

First, let me list the articles I have written that cover the basic concepts and skills needed to get started with web crawlers: Ningge's xiaozhan -- Web Crawler

When we type a URL into the browser and press Enter, what happens in the background? For example, if you enter Ningge's little site (fireling's data world), which focuses on web crawlers, data mining, and machine learning, you will see the homepage of the site.

In short, this process involves the following four steps:

  • Find the IP address corresponding to the domain name.
  • Send a request to the server corresponding to the IP address.
  • The server responds to the request and sends back the webpage content.
  • The browser parses the webpage content.

To put it simply, a web crawler does what the browser does: given a URL, it returns the required data directly to the user, without operating the browser manually step by step.

Capture

In this step, you need to be clear about what content you want to obtain: the HTML source code, or a JSON string, and so on.

1. Basic crawling

In most cases, fetching data is a GET request, i.e., retrieving data directly from the other party's server.

First, Python's built-in urllib and urllib2 modules can basically satisfy ordinary page fetching. Beyond that, requests is a very useful package, and so are httplib2 and others.

Requests:

    import requests
    response = requests.get(url)
    content = requests.get(url).content
    print "response headers:", response.headers
    print "content:", content

Urllib2:

    import urllib2
    response = urllib2.urlopen(url)
    content = urllib2.urlopen(url).read()
    print "response headers:", response.headers
    print "content:", content

Httplib2:

    import httplib2
    http = httplib2.Http()
    response_headers, content = http.request(url, 'GET')
    print "response headers:", response_headers
    print "content:", content
Motivation: I wanted to crawl the course information from my school's educational administration system.

Approach: first understand the HTTP protocol, then learn to use Python's Requests module, then practice.

Practice: first work directly in the terminal and try fetching http://www.baidu.com, then publish the crawler you wrote to PyPI... Then I grabbed more than 2,000 courses across two semesters from the educational administration system, and tried to attack a friend's website... and knocked it over!

For details, see this blog post on writing a Python crawler:
https://jenny42.com/2015/02/write-a-spider-use-python/

Then I suddenly realized that this is not really a crawler; at best it is web page fetching, because I had not learned XPath or CSS selectors and had not really crawled across whole sites...

You will find that even though my crawler skills are still poor... most of my attempts got some kind of feedback. For example, I could watch the download counts of the modules I published (I wonder whether anyone actually downloaded them); grabbing the educational administration data was fun (I found that about 250 students at the school share a name with someone else)... breaking a friend's website and reporting the bug... I find learning this way interesting.

In addition, I am very interested in more advanced skills, such as how to crawl websites that require a captcha ~ At the beginning I had simply read some Python and written some small programs and little tools, and I could feel its simplicity and power.

Then I suddenly wanted to write a crawler and give it a try, so I crawled my favorite music website, downloading all the music from the first issue up to the present, including the pictures for each issue. I also wrote a script that automatically downloads all the songs of the latest issue, and tried packaging it into an exe with PyInstaller to share with friends who also love the site. This was the result of my crawling.


As for how to learn, I am just a newbie myself and have had no one to guide me, so let me simply describe how I did it:
  1. First, you need to know basic Python syntax. I recommend the book "Basic Python Tutorial", which is suitable for beginners.
  2. Next, analyze your crawler's requirements. What exactly should the program do? Sketch the program's overall structure. What other difficulties are there?
  3. Then, look at the libraries people generally use to write crawlers; they solve many problems for you. The powerful Requests (HTTP for Humans) is recommended; there are also libraries such as urllib2 and BeautifulSoup.
  4. Then start writing. When Google doesn't help, ask people who do know; one problem I ran into was solved by messaging someone privately. Along the way you will also pick up a lot of related knowledge, such as the HTTP protocol and multithreading.
Or you can use someone else's framework, such as the Scrapy others have mentioned, so you do not have to reinvent the wheel. Many of the tasks you repeat on the web can be handled with a short Python script.
For example, saving good zhihu articles, or automatically pushing them to your Kindle on a schedule.

Python crawlers push zhihu articles to kindle e-books


Brute-forcing a WordPress blog's backend login password with Python



Batch-downloading images from a certain group with Python (link is being fixed)
Using Python to crack a user's password on a 211-university BBS forum (link is being fixed)
I feel that when I do it for a purpose, the motivation is clearer. Currently I am preparing to crawl stock information for research (and for trading).
In the "try new things in 30 days" spirit, my first attempt was a crawler that captures high-click-count video links from a certain forum. The code is below; you need to get past the GFW to reach the site.

    # -*- coding: utf-8 -*-
    import urllib2
    import sys
    from bs4 import BeautifulSoup

    reload(sys)
    sys.setdefaultencoding('utf8')  # avoid garbled characters when writing the output file

    BaseUrl = "http://t66y.com/"
    j = 1
    for i in range(1, 100):  # set the start and end pages here
        # str() would otherwise give unicode, which is why sys.setdefaultencoding is set above
        url = "http://t66y.com/thread0806.php?fid=22&search=&page=" + str(i)
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, from_encoding="gb18030")  # avoid BeautifulSoup's Chinese garbling
        print("reading page " + str(i))
        counts = soup.find_all("td", class_="tal f10 y-style")
        for count in counts:
            if int(count.string) > 15:  # keep only links above the desired click count
                videoContainer = count.previous_sibling.previous_sibling.previous_sibling.previous_sibling
                video = videoContainer.find("h3")
                print("Downloading link " + str(j))
                line1 = video.get_text()
                line2 = BaseUrl + video.a.get('href')
                line3 = "views **" + count.string + "**"
                print line1
                f = open('cao.md', 'a')
                f.write("\n" + "### " + line1 + "\n" + "<" + line2 + ">" + "\n" + line3 + " " + "page " + str(i) + "\n")
                f.close()
                j += 1
I will recommend just one library, no explanation needed:

Requests: HTTP for Humans

I first read the blog of the person who wrote SimpleCD, and only then did I realize that Python was this good for writing crawlers.

Then I wrote a script to crawl the rating pages of over a million Douban movies. After that there were projects in the lab, and I wrote a script to crawl 50 million Weibo posts. Getting familiar with how to simulate login and fool the server was the most interesting part.

To sum up, just look at my earlier blog posts. Simple crawlers don't really use advanced techniques; there are only a few points:
1. Get familiar with using urllib.
2. Understand basic HTML parsing; usually the most basic regular expressions are enough.
Basically, if you want to write Python crawlers, reading the earlier posts to learn urllib and BeautifulSoup is enough.

In addition, working through a few real examples gives you an intuitive sense of how a crawler runs.
Last time I wrote a data crawler and open-sourced it (drop a Star or Fork if you pass by ~): MorganZhang100/zhihu-spider · GitHub
The data the crawler collects is used at: http://zhihuhot.sinaapp.com/

To put it simply, it analyzes some parameters of crawled questions to find the ones most likely to become popular. Answering such questions, the results come roughly 20 times faster than before.

I am not very good at Python, and the whole thing is only a few hundred lines of code; it should not be a problem for the original poster.

As a matter of fact, learning anything by reading alone is not as effective as writing a few lines of code yourself. As long as you start writing, you will know what you still need.
