Web Crawler Primer: From Page Access to Data Parsing


Preface:

Web crawlers can look like magic. But once you think about them, or do a little research, you realize the crawler itself is not that advanced. The hard part comes later: how to cope when the amount of data grows large, that is, when the "graph" of pages we are traversing keeps expanding.

This article is only a starting point. It mainly explains how to fetch a web page and obtain its HTML with Java and Python, how to make Python imitate a browser to access a protected page, and how to parse the data with Python. I hope that, starting from this article, we can step by step demystify the web spider.


Reference:

1. "Write your own web crawler"

2. Use Python to write reptiles, to crawl csdn content, the perfect solution 403 Forbidden


Run result:

There is quite a lot of output, so only part of it is shown here.


Author Environment:

System: Windows 7

CentOS 6.5

Operating Environment: JDK 1.7

Python 2.6.6

IDE: Eclipse Release 4.2.0

PyCharm 4.5.1

Database: MySQL Ver 14.14 Distrib 5.1.73


Development process:

1. Crawling pages with Java

We implement page fetching in Java here, though of course you could use another language for it. Below, taking the home page of "Blog Park" (www.cnblogs.com) as an example, is the Java page-fetching process:

import java.io.IOException;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.PostMethod;

public class RetrivePageSimple {

    private static HttpClient httpClient = new HttpClient();

    // Set the proxy server
    static {
        httpClient.getHostConfiguration().setProxy("58.220.2.132", 80);
    }

    public static boolean downloadPage(String path) throws HttpException, IOException {
        PostMethod postMethod = new PostMethod(path);
        // Execute the request and get the status code back
        int statusCode = httpClient.executeMethod(postMethod);
        System.out.println(statusCode);
        // Handle the status code (for simplicity, only 200 is handled here)
        if (statusCode == HttpStatus.SC_OK) {
            String content = postMethod.getResponseBodyAsString();
            System.out.println(content);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            RetrivePageSimple.downloadPage("http://www.cnblogs.com/");
        } catch (HttpException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The resulting output is not shown here; there is simply too much of it... - -!


2. Crawling pages with Python

You might ask why, having written a Java version of the page fetcher, I am now writing one in Python. It is necessary, because of a problem I had not considered before building this demo: when handing a page fetched in Java over to Python, the page string is far too long to pass as a parameter. You might think that saving it to a file is a good alternative, but what do we do with that many HTML files? We have to abandon that kind of exhausting approach here.

Given the limitation on parameter length, the Java side only passes the page address, and the page itself is fetched in Python, as sketched below.
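Concretely, the Java side can simply invoke the Python script and pass the URL as a command-line argument. Here is a minimal sketch of the Python entry point under that assumption (the script name and argument handling are illustrative, not part of the original code):

import sys

# The Java side runs something like:
#   python crawl.py http://blog.csdn.net/mobile/index.html
# so only the page address crosses the Java/Python boundary.
if __name__ == '__main__':
    url = sys.argv[1]
    # Fetch and parse the page from here on, as shown in the following sections
    print 'crawling:', url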

In the simplest case, we would fetch a web page in Python like this:

import urllib2

result = urllib2.urlopen('http://blog.csdn.net/mobile/index.html')
html = result.read()
print html

However, the code above uses the URL of the CSDN blog channel, and CSDN applies a layer of filtering against crawler access, so we get the following error message:

403, I was turned down.
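For reference, the rejection surfaces in Python as an HTTPError; a minimal sketch of catching it (using the same CSDN address as above):

import urllib2

try:
    urllib2.urlopen('http://blog.csdn.net/mobile/index.html')
except urllib2.HTTPError, e:
    # CSDN's filter answers a bare urllib2 request with 403 Forbidden
    print e.code, e.msg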


3. Accessing the site by imitating a browser

When we tried to visit a page with protective measures, we were rejected. But if we open the same page with our own browser, it is perfectly accessible.

This means that if we make Python imitate a browser, we can access the page. Here is the Python code that mimics a browser:

import random
import socket
import urllib2
import cookielib

ERROR = {
    '0': 'Can not open the url, check you net',
    '1': 'Creat download dir error',
    '2': 'The image links is empty',
    '3': 'Download faild',
    '4': 'Build soup error, the html is empty',
    '5': 'Can not save the image to your disk',
}

class BrowserBase(object):

    def __init__(self):
        # Timeout in seconds (the exact value was lost in the original formatting)
        socket.setdefaulttimeout(20)
        self._content = None

    def speak(self, name, content):
        print '[%s] %s' % (name, content)

    def open_url(self, url):
        """Open a web page while pretending to be an ordinary browser."""
        cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        urllib2.install_opener(self.opener)
        user_agents = [
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
            'Opera/9.25 (Windows NT 5.1; U; en)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
            'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
            'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
        ]
        # Pick a random User-Agent so the request looks like a normal browser
        agent = random.choice(user_agents)
        self.opener.addheaders = [
            ('User-agent', agent),
            ('Accept', '*/*'),
            ('Referer', 'http://www.google.com'),
        ]
        try:
            res = self.opener.open(url)
            self._content = res.read()
            # print self._content
        except Exception, e:
            self.speak('open_url error', str(e) + ' ' + url)
            raise
        else:
            return res

    def get_html_content(self):
        return self._content

def get_html_response(url):
    spider = BrowserBase()
    spider.open_url(url)
    return spider.get_html_content()

With the code above we can get the response normally. Now let's look at how to parse the returned result.


4. Data parsing

Parsing the HTML with Python is exceptionally simple:

import HTMLParser

class ListWebParser(HTMLParser.HTMLParser):

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.tagDIVFlag = False
        self.tagDIVAFlag = False
        self.tagH1Flag = False
        self.tagSecondHrefFlag = False
        self._name = None
        self._address = None

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            for name, value in attrs:
                if name == 'class' and value == 'blog_list':
                    self.tagDIVFlag = True
        if tag == 'h1':
            if self.tagDIVFlag:
                self.tagH1Flag = True
                # print 'h1->', self.tagH1Flag
        if tag == 'a':
            # if self.tagDIVAFlag:
            #     print 'h1:', self.tagH1Flag
            if self.tagH1Flag:
                for name, value in attrs:
                    if name == 'target' and value == '_blank':
                        self.tagDIVAFlag = True
                    if name == 'href':
                        if self.tagSecondHrefFlag:
                            print 'url:', value
                            self._address = value
                        self.tagSecondHrefFlag = True
                    # if name == 'href' and self.tagDIVAFlag:
                    #     print 'url:', value
                    #     self._address = value

    def handle_endtag(self, tag):
        if tag == 'div':
            self.tagDIVFlag = False
        if tag == 'h1':
            self.tagH1Flag = False
            # print 'false h1.'
        if tag == 'a':
            self.tagDIVAFlag = False

    def handle_data(self, data):
        if self.tagDIVAFlag:
            print u"Name:", data.decode("utf-8")

You might say that the HTML files you find on the Internet are not this much trouble to parse. I admit it: normally, it really is easy to pull out some simple data. The complicated logic in the code above is there for filtering.

Speaking of filtering, I use a little trick here (of course, once enough people use it, it is no longer just a trick), and the method is worth keeping in mind for later coding. We lock onto a block by some distinctive attribute of its tag (e.g. id, class, etc.): when we enter the block, the corresponding flag is set to True, and when we leave the block, the flag is set back to False, as in the sketch below. You may think this is too much trouble, but if you think it through, you will see that it makes sense.
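Here is a minimal sketch of that flag pattern on its own, assuming we only want the text inside div elements whose class is 'blog_list' (the class name is taken from the parser above; everything else is illustrative, and like the full parser it ignores nested divs):

import HTMLParser

class FlagParser(HTMLParser.HTMLParser):
    """Minimal illustration of the enter/exit flag trick."""

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.inBlock = False  # True only while we are inside the target div

    def handle_starttag(self, tag, attrs):
        # Entering the block we care about: raise the flag
        if tag == 'div' and ('class', 'blog_list') in attrs:
            self.inBlock = True

    def handle_endtag(self, tag):
        # Leaving the block: lower the flag
        if tag == 'div':
            self.inBlock = False

    def handle_data(self, data):
        # Only data seen while the flag is up belongs to the target block
        if self.inBlock:
            print data

# FlagParser().feed('<div class="blog_list">hello</div><div>ignored</div>')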


Precautions:

1. When fetching pages with Java, we used a proxy server. The host and port of a free proxy can easily be found online.
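For reference, the same kind of proxy can also be plugged into the Python side; a minimal sketch with urllib2, reusing the proxy address from the Java example (treat it as a placeholder, not a recommendation):

import urllib2

# Route plain HTTP requests through the proxy (host/port are placeholders)
proxy_handler = urllib2.ProxyHandler({'http': 'http://58.220.2.132:80'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

print urllib2.urlopen('http://www.cnblogs.com/').read()[:200]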


2. You need to prepare the jar packages required by the Java code and import them into your Eclipse project. The code above uses the Commons HttpClient 3.x API, so at the very least commons-httpclient and its dependencies (commons-codec, commons-logging) are needed.



3. Modify MySQL's default encoding to UTF-8

Because some of the data here is Chinese text, we need to switch MySQL's encoding accordingly.

If you are coding under Linux, then you can refer to: http://blog.csdn.net/lemon_tree12138/article/details/46375637
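Besides the server-side setting, it also helps to ask for UTF-8 explicitly when connecting from Python; a minimal sketch using MySQLdb, where the host, credentials, database and table names are placeholders rather than the author's actual schema:

# -*- coding: utf-8 -*-
import MySQLdb

# Request a UTF-8 connection so Chinese text round-trips correctly
conn = MySQLdb.connect(host='localhost', user='root', passwd='secret',
                       db='spider', charset='utf8', use_unicode=True)
cursor = conn.cursor()
cursor.execute("INSERT INTO blog (name, address) VALUES (%s, %s)",
               (u'测试标题', 'http://www.cnblogs.com/'))
conn.commit()
cursor.close()
conn.close()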

Copyright notice: This is an original article by the blogger; please do not reproduce it without the blogger's permission. http://blog.csdn.net/lemon_tree12138
