PHP + HTML + JavaScript + CSS for simple crawler development



To develop a crawler, you first need to know what it is for. Mine is meant to search different websites for articles containing specific keywords and collect their links, so that I can read them quickly.

As is my habit, I first write the interface to clarify my ideas.

1. Go to different websites, so we need a URL input box.

2. Search for articles with specific keywords, so we need an article-title input box.

3. Get the article links, so we need a container to display the search results.

<div class="jumbotron" id="mainJumbotron">
  <div class="panel panel-default">
    <div class="panel-heading">Article URL capture</div>
    <div class="panel-body">
      <div class="form-group">
        <label for="article_title">Article title</label>
        <input type="text" class="form-control" id="article_title" placeholder="article title">
      </div>
      <div class="form-group">
        <label for="website_url">Website URL</label>
        <input type="text" class="form-control" id="website_url" placeholder="website URL">
      </div>
      <button type="submit" class="btn btn-default">Capture</button>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">Article URL</div>
    <div class="panel-body">
      <div id="article_url"><!-- the search results are rendered here by the front-end script --></div>
    </div>
  </div>
</div>

Add the code directly, plus a few style adjustments to the interface:

The next step is implementing the functionality. I will write the code in PHP. The first step is to obtain the HTML of the website. There are many ways to get the HTML; I will not introduce them one by one. Here we use curl, passing in the website URL:

private function get_html($url) {
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');  // accept gzip-compressed responses
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'); // present a normal browser user agent
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = curl_exec($ch);
    curl_close($ch); // release the curl handle
    return $html;
}
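One thing worth guarding against: curl_exec() returns false when the request fails, so it pays to check the result before matching against it. A minimal sketch, assuming the call happens inside the same class (the URL is a placeholder of my own, not from the original):

// inside the same crawler class; the URL is hypothetical
$html = $this->get_html('http://www.example.com');
if ($html === false) {
    die('Failed to fetch the page'); // timeout, DNS failure, etc.
}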

Although you now have the HTML, you will quickly run into an encoding problem, which can render the subsequent matching useless. Here we convert the HTML content to UTF-8:

$coding = mb_detect_encoding($html);
if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
    $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
}
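For convenience, the fetch and the conversion can be folded into one helper so everything downstream always sees UTF-8. This wrapper is my own sketch, not part of the original code:

private function get_utf8_html($url) {
    $html = $this->get_html($url); // fetch the raw HTML with the curl function above
    if ($html === false) {
        return '';
    }
    $coding = mb_detect_encoding($html);
    if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
        // many Chinese sites serve GBK, so try GBK before falling back to UTF-8/ASCII
        $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
    }
    return $html;
}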

With the website's HTML in hand, the next step is to get the article URLs by matching all the <a> tags on the page with a regular expression. After repeated testing I arrived at a fairly reliable pattern: no matter how complicated the structure of the <a> tag is, as long as it is an <a> tag we will not let it escape (this is the most critical step):

$pattern = '|<a[^>]*>(.*)</a>|isU';
preg_match_all($pattern, $html, $matches);

The matching results land in $matches, which is roughly a multidimensional array like this:

array(2) {
  [0] => array(*) {
    [0] => string(*) "complete <a> tag"
    ...
  }
  [1] => array(*) {
    [0] => string(*) "inner content of the tag at the same index above"
  }
}

As long as you can get at the data, you can do anything else with it: traverse the array, find the tag you want, and read off the tag's attributes. Here I recommend a class that makes working with the tags easier:

$dom = new DOMDocument();
@$dom->loadHTML($a); // $a is the <a> tag obtained above
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate('//a');
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href'); // read the href attribute of the <a> tag
}
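Putting the pieces together: iterate over the regex matches, keep only the <a> tags whose link text contains the article-title keyword, and extract each href with DOMXPath as above. The following function is my own sketch of that glue; its name search_links() and its parameters are assumptions, not from the original:

private function search_links($html, $keyword) {
    $result  = array();
    $pattern = '|<a[^>]*>(.*)</a>|isU';
    preg_match_all($pattern, $html, $matches);
    foreach ($matches[1] as $k => $text) {
        // $matches[1][$k] is the inner text, $matches[0][$k] the complete tag
        $title = strip_tags($text);
        if ($title == '' || ($keyword != '' && mb_stripos($title, $keyword) === false)) {
            continue; // the link text does not mention the keyword
        }
        $dom = new DOMDocument();
        @$dom->loadHTML($matches[0][$k]);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate('//a');
        for ($i = 0; $i < $hrefs->length; $i++) {
            $result[] = array(
                'title' => $title,
                'url'   => $hrefs->item($i)->getAttribute('href'),
            );
        }
    }
    return $result;
}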

Of course, this is only one way; you can also use regular expressions to match the information you want and invent new tricks for handling the data.
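For instance, the href can be pulled straight out of the matched tag with a regular expression. A minimal sketch (this pattern is my own and is less robust than the DOM approach when the quoting gets unusual):

// $a is the complete <a> tag matched above
if (preg_match('|href\s*=\s*["\']?([^"\'\s>]+)|i', $a, $m)) {
    $url = $m[1]; // the href value
}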

Once you have fetched and matched the desired results, the next step is to send them back to the front end for display: expose the data through an interface, fetch it with JavaScript from the front end, and render the content dynamically with jQuery:

var website_url = 'your interface address'; // fill in the address of your own interface

$.getJSON(website_url, function (data) {
    if (data.text == '') {
        $('#article_url').html('<div><p>No link to this article</p></div>');
        return;
    }
    var string = '';
    var list = data.text;
    for (var j in list) {
        var content = list[j].url_content;
        for (var i in content) {
            if (content[i].title != '') {
                string += '<div class="item">'
                    + '<em>[<a href="http://' + list[j].website.web_url + '" target="_blank">' + list[j].website.web_name + '</a>]</em>'
                    + '<a href="' + content[i].url + '" target="_blank" class="web_url">' + content[i].title + '</a>'
                    + '</div>';
            }
        }
    }
    $('#article_url').html(string);
});
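The script above expects the interface to return JSON shaped like data.text = [ { website: { web_url, web_name }, url_content: [ { title, url }, ... ] }, ... ]. The original article does not show that endpoint, so the following PHP sketch is entirely my own assumption; it reuses the hypothetical helpers from earlier (a Crawler class with the methods made public):

// Hypothetical endpoint; the field names are inferred from the JavaScript above.
header('Content-Type: application/json');

$keyword = isset($_GET['title']) ? $_GET['title'] : ''; // from the article-title input box
$site    = isset($_GET['url'])   ? $_GET['url']   : ''; // from the website-URL input box

$crawler = new Crawler(); // the class holding get_utf8_html()/search_links(), assumed public here
$html    = $crawler->get_utf8_html('http://' . $site);
$links   = $crawler->search_links($html, $keyword);

echo json_encode(array(
    'text' => count($links) == 0 ? '' : array(
        array(
            'website'     => array('web_url' => $site, 'web_name' => $site),
            'url_content' => $links, // each entry carries 'title' and 'url' keys
        ),
    ),
));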

And with that, the crawler is complete.

That is all the content of this article; I hope it helps you in your learning.

