PHP + HTML + JavaScript + CSS for simple crawler development



To develop a crawler, you first need to know what it is for. Mine is meant to search different websites for articles containing specific keywords and collect their links, so that I can read them quickly.

As is my habit, I first write the interface to clarify my ideas.

1. Go to different websites, so we need a URL input box.

2. Search for articles with specific keywords, so we need an article-title input box.

3. Get the article links, so we need a container to display the search results.

<div class="jumbotron" id="mainJumbotron">
  <div class="panel panel-default">
    <div class="panel-heading">Article URL capture</div>
    <div class="panel-body">
      <div class="form-group">
        <label for="article_title">Article title</label>
        <input type="text" class="form-control" id="article_title" placeholder="article title">
      </div>
      <div class="form-group">
        <label for="website_url">Website URL</label>
        <input type="text" class="form-control" id="website_url" placeholder="website URL">
      </div>
      <button type="submit" class="btn btn-default">Capture</button>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">Article URL</div>
    <div class="panel-body">
      <div id="article_url"><!-- the search results are rendered here by the front-end script --></div>
    </div>
  </div>
</div>

Add the code directly, plus a few style adjustments to the interface:

The next step is implementing the functionality. I will write the code in PHP. The first step is to obtain the HTML of the website. There are many ways to get the HTML; I will not introduce them one by one. Here we use curl, passing in the website URL:

private function get_html($url) {
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');  // accept gzip-compressed responses
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'); // present a normal browser user agent
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch); // release the curl handle
    return $html;
}
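One thing worth guarding against: curl_exec() returns false when the request fails, so it pays to check the result before matching against it. A minimal sketch, assuming the call happens inside the same class (the URL is a placeholder of my own, not from the original):

// inside the same crawler class; the URL is hypothetical
$html = $this->get_html('http://www.example.com');
if ($html === false) {
    die('Failed to fetch the page'); // timeout, DNS failure, etc.
}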

Although you now have the HTML, you will quickly run into an encoding problem, which can render the subsequent matching useless. Here we convert the HTML content to UTF-8:

$coding = mb_detect_encoding($html);
if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
    $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
}
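For convenience, the fetch and the conversion can be folded into one helper so everything downstream always sees UTF-8. This wrapper is my own sketch, not part of the original code:

private function get_utf8_html($url) {
    $html = $this->get_html($url); // fetch the raw HTML with the curl function above
    if ($html === false) {
        return '';
    }
    $coding = mb_detect_encoding($html);
    if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
        // many Chinese sites serve GBK, so try GBK before falling back to UTF-8/ASCII
        $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
    }
    return $html;
}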

With the website's HTML in hand, the next step is to get the article URLs by matching all the <a> tags on the page with a regular expression. After repeated testing I arrived at a fairly reliable pattern: no matter how complicated the structure of the <a> tag is, as long as it is an <a> tag we will not let it escape (this is the most critical step):

$pattern = '|<a[^>]*>(.*)</a>|isU';
preg_match_all($pattern, $html, $matches);

The matching results land in $matches, which is roughly a multidimensional array like this:

array(2) {
  [0] => array(*) {
    [0] => string(*) "complete <a> tag"
    ...
  }
  [1] => array(*) {
    [0] => string(*) "inner content of the tag at the same index above"
  }
}

As long as you can get at the data, you can do anything else with it: traverse the array, find the tag you want, and read off the tag's attributes. Here I recommend a class that makes working with the tags easier:

$dom = new DOMDocument();
@$dom->loadHTML($a); // $a is the <a> tag obtained above
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate('//a');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href'); // read the href attribute of the <a> tag
}
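Putting the pieces together: iterate over the regex matches, keep only the <a> tags whose link text contains the article-title keyword, and extract each href with DOMXPath as above. The following function is my own sketch of that glue; its name search_links() and its parameters are assumptions, not from the original:

private function search_links($html, $keyword) {
    $result  = array();
    $pattern = '|<a[^>]*>(.*)</a>|isU';
    preg_match_all($pattern, $html, $matches);
    foreach ($matches[1] as $k => $text) {
        // $matches[1][$k] is the inner text, $matches[0][$k] the complete tag
        $title = strip_tags($text);
        if ($title == '' || ($keyword != '' && mb_stripos($title, $keyword) === false)) {
            continue; // the link text does not mention the keyword
        }
        $dom = new DOMDocument();
        @$dom->loadHTML($matches[0][$k]);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate('//a');
        for ($i = 0; $i < $hrefs->length; $i++) {
            $result[] = array(
                'title' => $title,
                'url'   => $hrefs->item($i)->getAttribute('href'),
            );
        }
    }
    return $result;
}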

Of course, this is only one way; you can also use regular expressions to match the information you want and invent new tricks for handling the data.
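For instance, the href can be pulled straight out of the matched tag with a regular expression. A minimal sketch (this pattern is my own and is less robust than the DOM approach when the quoting gets unusual):

// $a is the complete <a> tag matched above
if (preg_match('|href\s*=\s*["\']?([^"\'\s>]+)|i', $a, $m)) {
    $url = $m[1]; // the href value
}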

Once you have fetched and matched the desired results, the next step is to send them back to the front end for display: expose the data through an interface, fetch it with JavaScript from the front end, and render the content dynamically with jQuery:

var website_url = 'your interface address'; // fill in the address of your own interface

$.getJSON(website_url, function (data) {
    if (data.text == '') {
        $('#article_url').html('<div><p>No link to this article</p></div>');
        return;
    }
    var string = '';
    var list = data.text;
    for (var j in list) {
        var content = list[j].url_content;
        for (var i in content) {
            if (content[i].title != '') {
                string += '<div class="item">'
                    + '<em>[<a href="http://' + list[j].website.web_url + '" target="_blank">' + list[j].website.web_name + '</a>]</em>'
                    + '<a href="' + content[i].url + '" target="_blank" class="web_url">' + content[i].title + '</a>'
                    + '</div>';
            }
        }
    }
    $('#article_url').html(string);
});
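The script above expects the interface to return JSON shaped like data.text = [ { website: { web_url, web_name }, url_content: [ { title, url }, ... ] }, ... ]. The original article does not show that endpoint, so the following PHP sketch is entirely my own assumption; it reuses the hypothetical helpers from earlier (a Crawler class with the methods made public):

// Hypothetical endpoint; the field names are inferred from the JavaScript above.
header('Content-Type: application/json');

$keyword = isset($_GET['title']) ? $_GET['title'] : ''; // from the article-title input box
$site    = isset($_GET['url'])   ? $_GET['url']   : ''; // from the website-URL input box

$crawler = new Crawler(); // the class holding get_utf8_html()/search_links(), assumed public here
$html    = $crawler->get_utf8_html('http://' . $site);
$links   = $crawler->search_links($html, $keyword);

echo json_encode(array(
    'text' => count($links) == 0 ? '' : array(
        array(
            'website'     => array('web_url' => $site, 'web_name' => $site),
            'url_content' => $links, // each entry carries 'title' and 'url' keys
        ),
    ),
));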

And with that, the crawler is complete.

That is all the content of this article; I hope it helps you in your learning.

