Web Crawler Case (2017)
So what is a web crawler?
Web crawlers (also known as web spiders or web robots, and in the FOAF community sometimes called web page chasers) are programs or scripts that automatically fetch information from the World Wide Web, and they are widely used across the Internet. Search engines use web crawlers to capture web pages, documents, and even images, audio, video, and other resources, then use indexing techniques to organize that information and serve it to search users. Web crawlers also give small and medium-sized websites an effective way to promote themselves, which is why optimizing websites for search engine crawlers has been popular for a while.
The basic workflow of a web crawler is as follows (a minimal sketch of this loop follows the list):
1. Start with a set of carefully selected seed URLs;
2. Put these URLs into the queue of URLs to be crawled;
3. Take a URL out of that queue, resolve DNS to get the host IP, download the web page the URL points to, store it in the downloaded-page library, and move the URL into the crawled-URL queue;
4. Analyze the pages behind the URLs in the crawled-URL queue, extract the other URLs they contain, put those into the to-be-crawled queue, and enter the next loop.
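Here is a minimal sketch of that loop in Java with jsoup, just to make the four steps concrete. The seed URL, the maxPages limit, and the class name are placeholders of my own, and this is not a production crawler (no politeness delay, no robots.txt handling).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {
    public static void main(String[] args) {
        Queue<String> toCrawl = new ArrayDeque<>();   // step 2: queue of URLs waiting to be crawled
        Set<String> crawled = new HashSet<>();        // URLs that have already been fetched
        toCrawl.add("https://example.com/");          // step 1: a seed URL (placeholder)

        int maxPages = 10;                            // stop condition, just for the demo
        while (!toCrawl.isEmpty() && crawled.size() < maxPages) {
            String url = toCrawl.poll();
            if (!crawled.add(url)) continue;          // skip URLs we have already seen

            Document page;
            try {
                page = Jsoup.connect(url).get();      // step 3: DNS lookup + download, handled by jsoup
            } catch (Exception e) {
                continue;                             // skip pages that fail to download or parse
            }
            System.out.println(url + " -> " + page.title());

            for (Element link : page.select("a[href]")) {    // step 4: extract new URLs
                String next = link.attr("abs:href");
                if (!next.isEmpty() && !crawled.contains(next)) {
                    toCrawl.add(next);
                }
            }
        }
    }
}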
Of course, I don't really understand everything I quoted above. The way I understand it right now: we request a URL, the server returns a huge chunk of text to us, and our browser parses that huge text into the gorgeous page we can see.
So we only need to treat that huge text as one sufficiently large String.
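If you want to look at that huge text yourself, jsoup can also hand back the raw HTML of a response as a plain String. A tiny sketch, with a placeholder URL:

import org.jsoup.Jsoup;

public class FetchAsString {
    public static void main(String[] args) throws Exception {
        // Request a URL and keep the whole response body as one big String.
        String html = Jsoup.connect("https://example.com/")
                           .execute()   // performs the HTTP request
                           .body();     // raw HTML text of the response
        System.out.println("Length: " + html.length());
        System.out.println(html.substring(0, Math.min(200, html.length())));
    }
}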
Below is my code
package main.spider;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Created by 1755790963 on 2017/3/10.
 */
public class Second {
    public static void main(String[] args) throws IOException {
        System.out.println("begin");
        // Download the page and parse it into a Document.
        Document document = Jsoup.connect("http://tieba.baidu.com/p/2356694991").get();
        // Select every post-content div in the thread.
        String selector = "div[class=d_post_content j_d_post_content clearfix]";
        Elements elements = document.select(selector);
        for (Element element : elements) {
            String word = element.text();
            // If the post contains an email address, keep the text up to a few
            // characters after the '@' before printing.
            if (word.indexOf("@") > 0) {
                word = word.substring(0, word.lastIndexOf("@") + 7);
            }
            System.out.println(word);
        }
    }
}
Here I use the jsoup jar package. jsoup is a Java HTML parser that can directly parse a URL address or a piece of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like operations.
In the code we can use the Jsoup class directly and call Jsoup's connect() method. This method takes the URL of the site as its parameter and returns an org.jsoup.Connection object, and the Connection object has a get() method that returns a Document object.
The Document object's select() method returns an Elements object, and Elements is simply a collection of Element objects. The select() method requires a String parameter: our selector.
String selector = "div[class=d_post_content j_d_post_content clearfix]";
The selector syntax is similar to jQuery's selector syntax: it lets you pick out elements on the HTML page. Once an element is selected, you can conveniently get its text content through the Element's text() method.
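A few other selector forms, for illustration; the URL and selectors below are my own examples, not taken from the post:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/").get(); // placeholder URL

        // By tag name: every <a> element on the page.
        for (Element a : doc.select("a")) {
            System.out.println("link text: " + a.text());
        }

        // By class: any div carrying this class (looser than the exact attribute match above).
        for (Element post : doc.select("div.d_post_content")) {
            System.out.println("post: " + post.text());
        }

        // By attribute: images that actually have a src attribute.
        for (Element img : doc.select("img[src]")) {
            System.out.println("image: " + img.attr("abs:src"));
        }
    }
}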
In this way, the simplest web crawler is finished.
The page I selected is a Baidu Tieba post of the "leave your email address and I'll send it to you" kind, and what I extract is everyone's email address.
Attached result: