First, let's look at how a person normally obtains webpage content:
(1) Open a browser, enter a URL, and load the target webpage.
(2) Select the desired content, such as the title, author, abstract, and body.
(3) Store it on the hard disk.
These three steps map onto the technical layer as: making network requests, capturing structured data, and storing data.
Here is a simple Python program that implements this basic crawling flow:
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2014-03-16
@author: Kris
'''
import urllib2, re, cookielib

def httpCrawler(url):
    '''
    @summary: webpage capture
    '''
    content = httpRequest(url)
    title = parseHtml(content)
    saveData(title)

def httpRequest(url):
    '''
    @summary: network request
    '''
    try:
        ret = None
        SockFile = None
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)')
        request.add_header('Pragma', 'no-cache')
        opener = urllib2.build_opener()
        SockFile = opener.open(request)
        ret = SockFile.read()
    finally:
        if SockFile:
            SockFile.close()
    return ret

def parseHtml(html):
    '''
    @summary: capture structured data
    '''
    content = None
    pattern = '<title>([^<]*?)</title>'
    temp = re.findall(pattern, html)
    if temp:
        content = temp[0]
    return content

def saveData(data):
    '''
    @summary: data storage
    '''
    f = open('test', 'wb')
    f.write(data)
    f.close()

if __name__ == '__main__':
    url = 'http://www.baidu.com'
    httpCrawler(url)
It looks very simple, and indeed it is a basic getting-started crawler: any collection process boils down to the steps above. However, to build a powerful collector, you will run into the following problems:
(1) Access that requires cookie information. Most social sites require users to log in before showing anything valuable. We can use Python's cookielib module to send the cookie granted by the source site with every request; once we successfully simulate a login, the crawler is logged in and can collect everything a logged-in user sees. Below is the httpRequest() method modified to use cookies:
ckjar = cookielib.MozillaCookieJar()
cookies = urllib2.HTTPCookieProcessor(ckjar)  # define the cookie object

def httpRequest(url):
    '''
    @summary: network request
    '''
    try:
        ret = None
        SockFile = None
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)')
        request.add_header('Pragma', 'no-cache')
        opener = urllib2.build_opener(cookies)  # pass in the cookie processor
        SockFile = opener.open(request)
        ret = SockFile.read()
    finally:
        if SockFile:
            SockFile.close()
    return ret
(2) Encoding problems. The two most common website encodings today are UTF-8 and gbk. When the source site's encoding is inconsistent with the encoding stored in our database (for example, 163.com uses gbk while we store UTF-8 data), we can convert with the encode() and decode() methods Python provides, for example:
content = content.decode('gbk', 'ignore')    # decode gbk into unicode
content = content.encode('utf-8', 'ignore')  # encode unicode into UTF-8

Unicode sits in the middle of every conversion: we first decode the source encoding into unicode, then encode unicode into gbk or UTF-8 as needed.
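The same round trip can be sketched in Python 3 syntax, where the bytes/str split makes the intermediate unicode step explicit (the sample string here is just an illustration, not from the article):

```python
# -*- coding: utf-8 -*-

# Bytes as they might arrive from a gbk-encoded site such as 163.com.
raw = u'\u7f51\u6613\u65b0\u95fb'.encode('gbk')

# Step 1: decode the source encoding into unicode (str in Python 3).
text = raw.decode('gbk', 'ignore')

# Step 2: encode unicode into the encoding our database expects.
utf8_bytes = text.encode('utf-8', 'ignore')

print(utf8_bytes.decode('utf-8'))
```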
(3) Incomplete tags in webpages. Some page source contains opening tags without matching closing tags. Incomplete HTML like this breaks structured-data capture, so we can use Python's HTMLParser module to clean the source code before analyzing and extracting content.
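A minimal sketch of this approach, using Python 3's html.parser (the Python 2 module was named HTMLParser): the event-driven parser tolerates missing closing tags, so the title text is recovered even from broken markup. The sample HTML and class name are invented for illustration.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside <title>, tolerating unclosed tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif self.in_title:
            # Another tag opened before </title>: treat the title as finished.
            self.in_title = False

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Broken source: neither <title> nor <p> is ever closed.
parser = TitleExtractor()
parser.feed('<html><head><title>Hello crawler</head><body><p>text')
print(parser.title)
```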
(4) Some websites generate page content with JavaScript. Looking directly at the source code, we find only a headache-inducing pile of JS. Toolkits built on browser engines such as mozilla or webkit can be used to execute the js and ajax and obtain the rendered page, although this is somewhat slower.
(5) Content hidden in images or flash. When the image content is made up of text or digits, it is easy to handle: OCR technology can recognize it automatically. But if the target is a flash object, we can only store the entire URL instead.
(6) A single website may use multiple webpage structures. In that case a single set of crawling rules will certainly not work, so we need to configure multiple rule sets to assist in capturing.
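One hedged way to sketch this multi-rule-set idea (the rule names and patterns are invented for illustration): keep several named rule sets and try each until one fully matches the page at hand.

```python
import re

# Hypothetical rule sets: one per page structure found on the same site.
RULE_SETS = {
    'article_v1': {'title': r'<h1 class="title">(.*?)</h1>'},
    'article_v2': {'title': r'<div id="headline">(.*?)</div>'},
}

def extract(html):
    """Try each rule set in turn; return (rule name, fields) on first full match."""
    for name, rules in RULE_SETS.items():
        fields = {}
        for field, pattern in rules.items():
            m = re.search(pattern, html, re.S)
            if not m:
                break           # this rule set does not fit the page
            fields[field] = m.group(1)
        else:
            return name, fields  # every field matched
    return None, {}

name, fields = extract('<div id="headline">A second layout</div>')
print(name, fields['title'])
```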
(7) The source website monitors crawlers. Scraping other people's content is, after all, not something site owners welcome, so websites generally place restrictions on crawler access.
A good collection system should be able to collect our target data wherever it lives: as long as it is visible to users, we can collect it, what you see is what you get, and data behind a login should not block us either. Most valuable information is only visible after logging in; for a social networking site, for example, the crawler must simulate user login before it can fetch data normally. However, social sites want to form a closed loop and keep their data inside the site, so they are not as open as news sites, and most of them impose restrictions to stop robot crawlers from harvesting their data. Typically, a single account cannot crawl for long before it is detected and its access is forbidden. Does that mean we cannot crawl these sites? Certainly not: as long as a social site does not close off webpage access, we can access whatever a normal person can access. Ultimately it comes down to simulating normal human behavior, which is professionally called "anti-monitoring".
The source website has the following restrictions:
1. The number of visits from a single IP address within a given period. A normal user will not access a website at high speed for a sustained time; unless they are clicking around at random, the burst does not last long. This problem is easy to solve: use a large pool of irregular proxy IP addresses, randomly select one from the pool for each request, and simulate access through it. Proxy IP addresses come in two kinds: transparent proxies and anonymous proxies.
2. The number of visits from a single account within a given period. A "person" who hits a data interface 24 hours a day at high speed is likely a robot. We can use a large number of accounts with normal behavior, that is, behaving the way ordinary people operate on social sites: keep the number of URLs accessed per unit time low, and pause between consecutive requests for a random interval, i.e., after each URL, sleep for a random period before accessing the next one.
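Both counter-measures above can be sketched together. The proxy addresses and URLs below are placeholders, and the sleep interval is shortened so the sketch runs quickly; a real crawler would route the request through the chosen proxy (e.g. via urllib2.ProxyHandler) and use intervals of seconds, not milliseconds.

```python
import random
import time

# Hypothetical proxy pool; in practice these would be live proxy servers.
PROXY_POOL = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080']

def pick_proxy():
    """Randomly select a proxy so successive requests come from different IPs."""
    return random.choice(PROXY_POOL)

def polite_interval(low=1.0, high=5.0):
    """Random pause length so the access rhythm does not look machine-regular."""
    return random.uniform(low, high)

urls = ['http://example.com/page/%d' % i for i in range(3)]
for url in urls:
    proxy = pick_proxy()
    # ... send the request through `proxy` here ...
    delay = polite_interval(0.01, 0.02)  # shortened for the sketch
    time.sleep(delay)
    print(url, 'via', proxy)
```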
If you can control the access policies for accounts and IP addresses, there is basically no problem. Of course, the operations team of the target website will adjust its policies too; in this back-and-forth contest, the crawler must be able to perceive when the other side's anti-crawler measures are affecting us and notify the administrator in time. Ideally, machine learning would let the system counter anti-crawling measures intelligently and crawl without interruption.
The following is a distributed crawler architecture that I recently designed, as shown in Figure 1:
Figure 1
This is a pure pipeline design, and the initial idea is still being implemented: I am building the communication between the server and the clients, mainly using Python's socket module. If you are interested, you can contact me separately to discuss a better solution.
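A minimal sketch of that server/client channel with the standard socket module. The message format, the task URL, and the single-client handling are all assumptions for illustration; a real deployment would need message framing, retries, and task serialization.

```python
import socket
import threading

HOST, PORT = '127.0.0.1', 0  # port 0: let the OS pick a free port

def serve_one(server_sock, received):
    """Accept one client, read its task request, send back an assignment."""
    conn, _ = server_sock.accept()
    received.append(conn.recv(1024))
    conn.sendall(b'TASK http://example.com/page/1')  # hypothetical protocol
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, PORT))
server.listen(1)
port = server.getsockname()[1]

received = []
t = threading.Thread(target=serve_one, args=(server, received))
t.start()

# Client side: a crawler node asks the server for its next crawl task.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((HOST, port))
client.sendall(b'GET_TASK')
reply = client.recv(1024)
client.close()
t.join()
server.close()

print(reply.decode())
```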