PHP + HTML + JavaScript + CSS for simple crawler development

Source: Internet
Author: User
This article walks through building a simple crawler with PHP, HTML, JavaScript, and CSS, and may serve as a useful reference. Before developing a crawler, you first need to know what it is going to do. I want mine to search different websites for articles matching specific keywords and collect their links, so that I can read them quickly.

As is my habit, I start by writing the interface, which helps clarify my ideas.

1. Go to different websites, so we need a URL input box.

2. Search for articles with specific keywords, so we need an article title input box.

3. Get the article links, so we need a container to display the search results.

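The interface therefore contains these elements:

Heading: Article URL capture
Input box: Article title
Input box: Website URL
Button: Capture
Result container: Article URL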

Write the markup directly and add some style adjustments to finish the interface.
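As a bridge between the form and the PHP code, here is a minimal sketch of how the two inputs might be read and validated on the server. The original does not show this step, so the parameter names `website_url` and `article_title` and the function `read_crawl_request` are hypothetical.

```php
<?php
// Hypothetical helper: the field names below are assumptions, not from the article.
function read_crawl_request(array $query): ?array
{
    $url = trim((string)($query['website_url'] ?? ''));
    $title = trim((string)($query['article_title'] ?? ''));

    // Reject anything that is not a well-formed URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    // Only crawl http and https targets.
    $scheme = strtolower((string)parse_url($url, PHP_URL_SCHEME));
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null;
    }

    return ['url' => $url, 'keyword' => $title];
}

// In a real page the argument would be $_GET (or $_POST).
$request = read_crawl_request([
    'website_url'   => 'https://example.com/blog',
    'article_title' => ' PHP ',
]);
var_dump($request);
```

Validating the URL before fetching it keeps the crawler from being pointed at local files or other unintended schemes.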

The next step is implementing the functionality, which I will write in PHP. First, fetch the HTML of the website. There are many ways to do this and I will not go through them all. Here we use curl, passing in the website URL:

    // Fetch the HTML of $url with curl; curl_exec() returns false on failure.
    private function get_html($url)
    {
        $ch = curl_init();
        $timeout = 10; // connection timeout in seconds
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_ENCODING, 'gzip');  // accept and decode gzip responses
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36');
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $html = curl_exec($ch);
        return $html;
    }
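The function above returns whatever curl_exec() gives back, including false on failure. A possible variant that checks for failure and for the HTTP status might look like the sketch below. The names `crawler_curl_options` and `fetch_html`, the redirect limit, the total timeout, and the user-agent string are all assumptions made for illustration.

```php
<?php
// Build the curl options for a crawl request (illustrative helper).
function crawler_curl_options(string $url, int $timeout = 10): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_ENCODING       => '',            // empty string: accept every encoding curl supports
        CURLOPT_CONNECTTIMEOUT => $timeout,      // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout * 3,  // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; SimpleCrawler/0.1)',
    ];
}

// Return the page body, or null when the transfer fails or the status is not 2xx.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, crawler_curl_options($url, $timeout));
    $html = curl_exec($ch);
    if ($html === false) {
        return null; // curl_error($ch) holds the reason if you want to log it
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($status >= 200 && $status < 300) ? $html : null;
}

// Usage (performs a network request):
// $html = fetch_html('https://example.com/');
```

Returning null for both transport errors and non-2xx responses lets the caller skip a site with a single check.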

Once you have the HTML, you will soon run into an encoding problem, which can make the later matching fail. Here we convert the HTML content to UTF-8:

    $coding = mb_detect_encoding($html);
    // Convert unless the content is detected as UTF-8 and really is valid UTF-8
    if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
        $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
    }
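To see the conversion work end to end, the sketch below simulates a GBK page and converts it back. It is a simplified variant, not the article's exact logic: it assumes that anything which is not valid UTF-8 is in a single fallback encoding, and `to_utf8` and its `$fallback` parameter are illustrative names.

```php
<?php
// Convert fetched HTML to UTF-8. $fallback is an assumption about the source
// sites (GBK is common on older Chinese pages); adjust it for other sites.
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    // Valid UTF-8 (which includes plain ASCII) needs no conversion.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

$utf8 = "\u{4e2d}\u{6587} crawler";                  // two Chinese characters plus ASCII
$gbk  = mb_convert_encoding($utf8, 'GBK', 'UTF-8');  // simulate a page served as GBK

var_dump(to_utf8($gbk) === $utf8);                   // round trip through GBK
var_dump(to_utf8('plain ascii') === 'plain ascii');  // already valid, returned unchanged
```

Checking validity first avoids relying on mb_detect_encoding(), whose result depends on the configured detection order.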

With the HTML in hand, the next step toward the article URLs is to match all the a tags in the page using a regular expression. After many tests, we finally arrived at a fairly reliable one: no matter how complicated the structure of an a tag is, as long as it is an a tag, we will not let it go. (This is the most critical step.)

    $pattern = '|<a[^>]*>(.*)</a>|isU';
    preg_match_all($pattern, $html, $matches);

The matches are stored in $matches, which is a multi-dimensional array roughly like this:

    array(2) {
      [0] => array(*) {
        [0] => string(*) "complete a tag"
        ...
      }
      [1] => array(*) {
        [0] => string(*) "content inside the a tag at the same index above"
      }
    }
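The structure described above can be checked on a small input. The sketch below assumes the pattern is meant to match complete `<a ...>...</a>` pairs. The sample HTML is made up, and the `\b` after `<a` is an addition so that tags such as `<abbr>` are never mistaken for a tags.

```php
<?php
$html = '<p><a href="/a.html">First <b>post</b></a> and '
      . '<A class="x" href="/b.html">Second</A> <abbr>n/a</abbr></p>';

// i: case-insensitive, s: "." also matches newlines, U: quantifiers are lazy.
$pattern = '|<a\b[^>]*>(.*)</a>|isU';
$count = preg_match_all($pattern, $html, $matches);

// $matches[0] holds the complete tags, $matches[1] the content between them.
foreach ($matches[0] as $i => $tag) {
    echo $i, ': ', $tag, ' => ', $matches[1][$i], PHP_EOL;
}
```

Because of the U modifier, `(.*)` stops at the first `</a>`, so each link is captured separately even when several appear on one line.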

Once you have this data, the rest is up to you. You can traverse the array, find the a tags you want, and then read their attributes. We recommend the DOMDocument class, which makes working with a tags easier:

    $dom = new DOMDocument();
    @$dom->loadHTML($a); // $a is one of the a tags obtained above
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate('//a');
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href'); // obtain the href attribute of the a tag here
    }
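Putting the pieces together for the stated goal (links whose text contains a keyword), one option is to let DOMDocument parse the whole page and filter the a tags in a single pass. This is a sketch under assumptions rather than the article's code: `find_links` and the `url` and `title` keys are illustrative names, and the match is a simple case-insensitive substring test.

```php
<?php
// Return [['url' => ..., 'title' => ...], ...] for links whose text contains $keyword.
function find_links(string $html, string $keyword): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    // The XML declaration prefix is a common hint that makes loadHTML treat the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Only a tags that actually carry an href attribute are considered.
    foreach ($xpath->query('//a[@href]') as $a) {
        $title = trim($a->textContent);
        if ($keyword === '' || mb_stripos($title, $keyword) !== false) {
            $links[] = ['url' => $a->getAttribute('href'), 'title' => $title];
        }
    }
    return $links;
}

$page = '<ul><li><a href="/php-crawler.html">PHP crawler basics</a></li>'
      . '<li><a href="/css.html">CSS tips</a></li>'
      . '<li><a>No href PHP</a></li></ul>';
print_r(find_links($page, 'php'));
```

Using libxml_use_internal_errors() instead of the `@` operator keeps the warnings from real-world markup out of the output while leaving other errors visible.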

Of course, this is only one way. You can also use regular expressions to match the information you want and come up with your own tricks for handling the data.
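As an illustration of the regular-expression route, the sketch below pulls the href value out of a single a tag. `extract_href` is a hypothetical helper, and it only covers the common quoted and unquoted attribute forms.

```php
<?php
// Pull the href value out of a single <a ...> tag with regular expressions.
function extract_href(string $tag): ?string
{
    // Quoted form: href="..." or href='...'; \1 refers back to the opening quote.
    if (preg_match('/\bhref\s*=\s*(["\'])(.*?)\1/is', $tag, $m)) {
        // Decode entities such as &amp; so the URL can be requested directly.
        return html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    // Unquoted form: href=value, ending at whitespace, a quote, or ">".
    if (preg_match('/\bhref\s*=\s*([^\s"\'>]+)/i', $tag, $m)) {
        return $m[1];
    }
    return null;
}

var_dump(extract_href('<a class="x" href="/post?id=1&amp;p=2">T</a>'));
var_dump(extract_href('<a name="top">x</a>'));
```

For irregular markup the DOM approach is usually more robust, but a regular expression like this is handy when you already hold the tag as a string.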

Once the desired results have been fetched and matched, the next step is to send them back to the front end: write the interface (the API endpoint), have JavaScript on the front end request the data from it, and then use jQuery to add the content to the page dynamically:

    // Note: the '#article_url' selector and the <p> wrappers are stand-ins;
    // the markup inside the original strings did not survive.
    var website_url = 'Your interface address';
    $.getJSON(website_url, function (data) {
        if (data.text == '') {
            $('#article_url').html('<p>No link to this article</p>');
            return;
        }
        var string = '';
        var list = data.text;
        for (var j in list) {
            var content = list[j].url_content;
            for (var i in content) {
                if (content[i].title != '') {
                    string += '<p>' + '[' + list[j].website.web_name + ']' + ' ' + content[i].title + '</p>';
                }
            }
        }
        $('#article_url').html(string);
    });
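On the PHP side, the interface has to return JSON in the shape this script reads: a `text` list whose entries carry `website.web_name` and a `url_content` list of items with a `title`. The sketch below builds such a response. `build_response` and the `url` key are assumptions, since the original does not show the server code or the name of the link field.

```php
<?php
// Build the JSON read by the front end: data.text[j].website.web_name and
// data.text[j].url_content[i].title. The 'url' key is an assumed name for the link.
function build_response(array $results): string
{
    $text = [];
    foreach ($results as $siteName => $links) {
        $text[] = [
            'website'     => ['web_name' => $siteName],
            'url_content' => array_values($links),
        ];
    }
    // An empty string signals "no links", matching the data.text == '' check.
    return json_encode(
        ['text' => $text === [] ? '' : $text],
        JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES
    );
}

$json = build_response([
    'Example Blog' => [['title' => 'PHP crawler basics', 'url' => '/php-crawler.html']],
]);
echo $json, PHP_EOL;
```

In a real endpoint you would send a `Content-Type: application/json` header before echoing the string.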


That is all for this article. I hope it helps with your learning.
