Use PHP to get link information on a page

Source: Internet
Author: User
Tags: xpath

In development we may need to extract the link information from a page or from a piece of content. Below I share a function I wrote for this; I hope it helps everyone.


What the function does:

1. Extracts the link information from a piece of content;

2. Extracts the link information from a URL;

3. Excludes invalid links such as anchor links;

4. Can return only the links under the current domain;

5. Can return only the links under other domains;

6. Can keep the text of each link.

Code:

/**
 * Function: get the link information inside a web page or a piece of content
 *
 * @param string $html           the content, or the URL of the page, to extract links from
 * @param bool   $isExclude      whether to filter invalid links such as "", "#", "javascript:;", "javascript:void(0);". Filtered by default
 * @param bool   $isKeepLinkText whether to keep the link text. Kept by default; the number of links returned may differ between the two modes
 * @param string $linkType       which links to get: 'all' links, 'inner' links under this domain, or 'out' links under other domains. 'all' by default
 * @return array
 */
function getlinks($html, $isExclude = true, $isKeepLinkText = true, $linkType = 'all')
{
    if (empty($html)) return false;
    set_time_limit(0); // prevent timeouts
    $removes = array('', '#', 'javascript:;', 'javascript:void(0);', 'javascript:void(0)'); // anchor links and the like to exclude
    // fetch the content to process when a URL was passed in
    $html = substr(strtolower($html), 0, 4) == 'http' ? file_get_contents($html) : $html;
    // extract the link information
    $pattern = '/<a(?:.*?)href="((?:http(?:s?):\/\/)?(?:([^"\/]+))?(?:[^"]*))"(?:[^>]*?)>([^<]*?)<\/a>/i';
    preg_match_all($pattern, $html, $_links);
    $links = array();
    if ($isKeepLinkText) {
        foreach ($_links[1] as $key => $href) {
            $links[$_links[3][$key]] = $href;
        }
    } else {
        $links = $_links[1];
    }
    unset($_links);
    foreach ($links as $text => $href) {
        // remove invalid links
        if ($isExclude && in_array($href, $removes)) {
            unset($links[$text]);
        }
        if ($linkType != 'all') {
            $host = parse_url($href);
            $host = isset($host['host']) ? $host['host'] : '';
            if ($linkType == 'inner') { // links under this domain
                if (substr($href, 0, 1) != '/' && strtolower($host) != strtolower($_SERVER['SERVER_NAME'])) {
                    unset($links[$text]);
                }
            } elseif ($linkType == 'out') { // links under other domains
                if (substr($href, 0, 1) == '/' || strtolower($host) == strtolower($_SERVER['SERVER_NAME'])) {
                    unset($links[$text]);
                }
            }
        }
    }
    return $links;
}

How to use:

$links = getlinks("http://www.sina.com.cn");
// or
$links = getlinks("http://www.sina.com.cn", 1, 0, "out");
// or
$links = getlinks("Here is the content to extract link information from");
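As a quick sanity check, the regex step can be exercised on a small inline snippet. This is a simplified sketch (the snippet, the trimmed-down pattern, and the `array_combine`/`array_diff` shortcut are mine, not the article's full function):

```php
<?php
// Simplified sketch of the regex extraction on an assumed inline snippet.
$html = '<a href="http://www.sina.com.cn/">Sina</a> <a href="#">top</a>';

// Trimmed-down version of the article's pattern: group 1 = href, group 2 = text.
$pattern = '/<a(?:.*?)href="([^"]*)"(?:[^>]*?)>([^<]*?)<\/a>/i';
preg_match_all($pattern, $html, $m);

// Build the same text => href map the function returns.
$links = array_combine($m[2], $m[1]);

// Filter anchor links the same way the function does.
$removes = array('', '#', 'javascript:;', 'javascript:void(0);');
$links = array_diff($links, $removes);

print_r($links); // only the Sina link survives the filter
```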

Special Note:

1. The function above uses file_get_contents, so fetching the content may fail; you can switch to cURL yourself;

2. The extraction uses a regular expression, so efficiency may be low.
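For the first point, a minimal cURL replacement for file_get_contents might look like the sketch below. The helper name `fetch_html` and the option choices (timeout, redirect following) are my assumptions, not part of the original function:

```php
<?php
// Hypothetical helper: fetch a page with cURL instead of file_get_contents.
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // fail instead of hanging forever
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html; // empty string on failure
}
```

You would then replace the `file_get_contents($html)` call inside getlinks with `fetch_html($html)`.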


Of course, you can also use DOMDocument and XPath. The following variant extracts the links via XPath when a URL is passed in, and falls back to the regex when raw content is passed in:

Code:

/**
 * Function: get the link information inside a web page or a piece of content
 *
 * @param string $html           the content, or the URL of the page, to extract links from
 * @param bool   $isExclude      whether to filter invalid links such as "", "#", "javascript:;", "javascript:void(0);". Filtered by default
 * @param bool   $isKeepLinkText whether to keep the link text. Kept by default; the number of links returned may differ between the two modes
 * @param string $linkType       which links to get: 'all' links, 'inner' links under this domain, or 'out' links under other domains. 'all' by default
 * @return array
 */
function getlinks($html, $isExclude = true, $isKeepLinkText = true, $linkType = 'all')
{
    if (empty($html)) return false;
    set_time_limit(0);
    $removes = array('', '#', 'javascript:;', 'javascript:void(0);', 'javascript:void(0)'); // anchor links and the like to exclude
    $isLink = substr(strtolower($html), 0, 4) == 'http' ? 1 : 0; // was a URL passed in?
    $html = $isLink ? file_get_contents($html) : $html;
    $links = array(); // the links in the page
    if ($isLink) {
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        unset($dom);
        $hrefs = $xpath->evaluate('/html/body//a'); // get the <a> nodes
        $length = $hrefs->length; // number of links
        for ($i = 0; $i < $length; $i++) {
            $href = trim($hrefs->item($i)->getAttribute('href'));
            $text = trim($hrefs->item($i)->textContent);
            $links[$text] = $href;
        }
    } else {
        $pattern = '/<a(?:.*?)href="((?:http(?:s?):\/\/)?(?:([^"\/]+))?(?:[^"]*))"(?:[^>]*?)>([^<]*?)<\/a>/i';
        preg_match_all($pattern, $html, $_links);
        if ($isKeepLinkText) {
            foreach ($_links[1] as $key => $href) {
                $links[$_links[3][$key]] = $href;
            }
        } else {
            $links = $_links[1];
        }
        unset($_links);
    }
    foreach ($links as $text => $href) {
        // remove invalid links
        if ($isExclude && in_array($href, $removes)) {
            unset($links[$text]);
        }
        if ($linkType != 'all') {
            $host = parse_url($href);
            $host = isset($host['host']) ? $host['host'] : '';
            if ($linkType == 'inner') { // links under this domain
                if (substr($href, 0, 1) != '/' && strtolower($host) != strtolower($_SERVER['SERVER_NAME'])) {
                    unset($links[$text]);
                }
            } elseif ($linkType == 'out') { // links under other domains
                if (substr($href, 0, 1) == '/' || strtolower($host) == strtolower($_SERVER['SERVER_NAME'])) {
                    unset($links[$text]);
                }
            }
        }
    }
    return $links;
}

Usage is the same as above.
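The DOMDocument/XPath step can also be tried on its own with inline HTML. This is a standalone sketch with an assumed snippet, not a call into the function above:

```php
<?php
// Standalone sketch of the DOMDocument/DOMXPath extraction step.
$html = '<html><body><a href="/a.html">A</a><a href="http://example.com/">B</a></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings on sloppy real-world markup
$xpath = new DOMXPath($dom);

$nodes = $xpath->evaluate('/html/body//a'); // all <a> nodes under <body>
$links = array();
for ($i = 0; $i < $nodes->length; $i++) {
    // same text => href map the function builds
    $links[trim($nodes->item($i)->textContent)] = trim($nodes->item($i)->getAttribute('href'));
}

print_r($links);
```

Note that, unlike the regex branch, the DOM approach parses the HTML properly, so hrefs in single quotes or spanning odd whitespace are still found.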
