Web Crawler Case: A Simple Crawler (2017)


So what is a web crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community often called a web chaser) is a program or script that automatically fetches information from the World Wide Web. Crawlers are widely used on the Internet: search engines use them to capture web pages, documents, and even images, audio, video, and other resources, then organize that information with indexing technology and serve it to search users. Web crawlers also give small and medium-sized websites an effective way to promote themselves, which is why optimizing websites for search engine crawlers has been popular for a while.

The basic workflow of a web crawler is as follows (a code sketch follows the list):

1. Select a set of carefully chosen seed URLs;

2. Put these URLs into the queue of URLs to be crawled;

3. Take a URL from the queue of URLs to be crawled, resolve its DNS name to obtain the host's IP address, download the web page the URL points to, and store it in the downloaded-page library. Then move the URL into the queue of crawled URLs;

4. Analyze the pages behind the URLs in the crawled queue, extract the other URLs they contain, and put those into the queue of URLs to be crawled, entering the next loop.
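
Here is a minimal sketch of that loop in Java, using jsoup (introduced below) for fetching and link extraction. The seed URL, class name, and page limit are illustrative assumptions, not part of the original code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawlLoop {
    public static void main(String[] args) throws Exception {
        Queue<String> toCrawl = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> crawled = new HashSet<>();      // URLs already fetched
        toCrawl.add("http://example.com/");         // seed URL (placeholder)

        int limit = 10;                             // stop after a few pages
        while (!toCrawl.isEmpty() && crawled.size() < limit) {
            String url = toCrawl.poll();
            if (!crawled.add(url)) continue;        // skip already-seen URLs

            Document doc = Jsoup.connect(url).get(); // DNS lookup + download + parse
            System.out.println("Fetched: " + doc.title());

            // Extract links from the page and enqueue them for the next round
            for (Element link : doc.select("a[href]")) {
                String next = link.absUrl("href");
                if (!next.isEmpty() && !crawled.contains(next)) {
                    toCrawl.add(next);
                }
            }
        }
    }
}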

Of course, I did not really understand all of the above at first. As I understand it now: we request a URL, the server returns a huge piece of text to us, and our browser parses that huge text into the gorgeous page we can see.

So we only need to treat this huge text as one sufficiently large String.
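
As a quick illustration, you can look at that raw text yourself. This is a sketch using jsoup's Connection.execute(), which returns the raw response instead of a parsed document; the URL is a placeholder:

import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class RawHtml {
    public static void main(String[] args) throws IOException {
        // execute() returns the raw HTTP response instead of a parsed Document
        Connection.Response response = Jsoup.connect("http://example.com/").execute();
        String html = response.body(); // the whole page as one big String
        System.out.println(html.length()); // how "super large" the text is
        System.out.println(html.substring(0, Math.min(200, html.length()))); // peek at the start
    }
}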

Below is my code

package main.spider;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Created by 1755790963 on 2017/3/10.
 */
public class Second {
    public static void main(String[] args) throws IOException {
        System.out.println("begin");
        // Download and parse the Baidu Tieba thread
        Document document = Jsoup.connect("http://tieba.baidu.com/p/2356694991").get();
        // Select every post body on the page by its CSS classes
        String selector = "div[class=d_post_content j_d_post_content  clearfix]";
        Elements elements = document.select(selector);
        for (Element element : elements) {
            String word = element.text();
            // Keep only posts that contain an "@", i.e. an email address
            if (word.indexOf("@") > 0) {
                // Cut the text off shortly after the last "@" (guarding against short strings)
                int end = Math.min(word.length(), word.lastIndexOf("@") + 7);
                System.out.println(word.substring(0, end));
            }
        }
    }
}

Here I use the jsoup jar. jsoup is a Java HTML parser that can parse HTML directly from a URL address or from a text string. It provides a very convenient API for extracting and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods.

In the code, we use the Jsoup class directly and call its connect() method. This method takes the URL of the website as its parameter and returns an org.jsoup.Connection object; the Connection object in turn has a get() method that returns a Document object.
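
For example, here is that chain on its own. This is a sketch; the userAgent and timeout settings are optional extras I am adding, not part of the original code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class FetchExample {
    public static void main(String[] args) throws IOException {
        // connect() builds a Connection; get() performs the HTTP GET and parses the HTML
        Document document = Jsoup.connect("http://tieba.baidu.com/p/2356694991")
                .userAgent("Mozilla/5.0") // optional: some sites reject the default agent
                .timeout(10_000)          // optional: milliseconds before giving up
                .get();
        System.out.println(document.title());
    }
}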

The select() method of the Document object returns an Elements object, and an Elements object is simply a collection of Element objects. The select() method takes a String parameter: our selector.

String selector = "div[class=d_post_content j_d_post_content  clearfix]";
Our selector syntax is similar to jQuery's: it lets you pick out elements on the HTML page. Once selected, you can conveniently get the text inside an element through its text() method.
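
A couple of selector examples, as a sketch; the HTML snippet and selectors here are mine, chosen to mirror the jQuery-style syntax:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExample {
    public static void main(String[] args) {
        // parse() works on an HTML String directly, no network needed
        Document doc = Jsoup.parse(
                "<div class=\"post\"><a href=\"/a\">first</a></div>" +
                "<div class=\"post\"><a href=\"/b\">second</a></div>");
        // "div.post" selects by class, just like in jQuery
        for (Element div : doc.select("div.post")) {
            System.out.println(div.text()); // text() strips the tags
        }
        // attribute selectors also work, e.g. all links with an href
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href"));
        }
    }
}
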
In this way, the simplest web crawler is finished.
The page I chose is a Baidu Tieba thread along the lines of "leave your email address and I will send you something"; what my crawler extracts is the email address of everyone who replied.
Attached result: (screenshot of the extracted email addresses; image not preserved)
