Scraping Links with PHP


By Justin on August 11, 2007

From: http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content

In this tutorial you'll learn how to build a PHP script that scrapes links from any Web page.

What you'll learn:

  • How to use cURL to get the content of a website (URL).
  • How to call PHP's DOM functions to parse the HTML so you can extract links.
  • How to use XPath to grab links from specific parts of a page.
  • How to store the scraped links in a MySQL database.
  • How to put it all together into a link scraper.
  • What else you could use a scraper for.
  • Legal issues associated with scraping content.

What you'll need:

  • Basic knowledge of PHP and MySQL.
  • A Web server running PHP 5.
  • The cURL extension for PHP.
  • MySQL, if you want to store the links.

Get the Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here's the code to grab our target site's content:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "cURL error number: " . curl_errno($ch) . "\n";
    echo "cURL error: " . curl_error($ch) . "\n";
    exit;
}

If the request is successful, $html will be filled with the content of $target_url. If the call fails, we'll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL, $target_url);

This determines what URL will be requested. For example, if you wanted to scrape this site you would have $target_url = "/makebeta/". I won't go into the rest of the options that are set (except for CURLOPT_USERAGENT, covered below). You can read an in-depth tutorial on PHP and cURL here.

Tip: Fake Your User Agent

Many websites won't play nice with you if you come knocking with the wrong User Agent string. What's a User Agent string? It's part of every request to a Web server that tells it what type of agent (browser, spider, etc.) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This will set cURL's user agent to mimic Google's. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I've done a bit of the legwork for you and gathered the most common user agents:

Search Engine User Agents

  • Google: googlebot/2.1 (http://www.googlebot.com/bot.html)
  • Google Image: googlebot-image/1.0 (http://www.googlebot.com/bot.html)
  • MSN Live: msnbot-products/1.0 (+http://search.msn.com/msnbot.htm)
  • Yahoo: mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
  • Ask

Browser User Agents

  • Firefox (Windows XP): mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) gecko/20070725 firefox/2.0.0.6
  • IE 7: mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
  • IE 6: mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Safari: mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) applewebkit/522.11 (khtml, like Gecko) safari/3.0.2
  • Opera: opera/9.00 (Windows NT 5.1; U; en)

Using PHP's DOM Functions to Parse the HTML

PHP provides a really cool tool for working with HTML content: the DOM functions. The DOM functions allow you to parse HTML (or XML) into an object structure (the DOM, or Document Object Model). Let's see how we do it:

$dom = new DOMDocument();
@$dom->loadHTML($html);

Wow, is that really it? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice, clean way. I discovered this over at Russell Beattie's post on Using PHP to Scrape Sites as Feeds; thanks, Russell!

Tip: You may have noticed that I put @ in front of loadHTML(); this suppresses some annoying warnings that the HTML parser throws on many pages that have non-standards-compliant code.
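As a side note (not part of the original article): if silencing every warning with @ feels too blunt, libxml, which backs PHP's DOM extension, can buffer parser warnings so you can inspect or log them instead. The malformed HTML string below is invented purely for illustration:

```php
<?php
// Alternative to the @ operator: buffer libxml parser warnings
// instead of silencing them outright.
$badHtml = '<html><body><p>Unclosed paragraph<div>Stray div</body></html>';

libxml_use_internal_errors(true);   // collect warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($badHtml);           // no @ needed now

foreach (libxml_get_errors() as $error) {
    // Inspect or log $error->message here if malformed markup matters to you.
}
libxml_clear_errors();
```

The parser still builds a usable DOM from the broken markup; the only difference is that you get to decide what happens to the warnings.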

XPath Makes Getting the Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to get only the links that are within unordered lists. All you have to do is write a query like /html/body//ul//li//a and pass it to XPath->evaluate(). I'm not going to go into all the ways you can use XPath because I'm just learning it myself, and someone else has already made a great list of examples: XPath examples. Here's a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
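To illustrate the /html/body//ul//li//a query mentioned above, here's a small self-contained sketch; the HTML string is made up for the example:

```php
<?php
// Sketch: the /html/body//ul//li//a query grabs only the links that sit
// inside unordered lists. The sample HTML below is invented for illustration.
$sample = '<html><body>'
        . '<a href="/outside">Outside any list</a>'
        . '<ul><li><a href="/first">First</a></li>'
        . '<li><a href="/second">Second</a></li></ul>'
        . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($sample);

$xpath = new DOMXPath($dom);
$listLinks = $xpath->evaluate('/html/body//ul//li//a');

foreach ($listLinks as $link) {
    echo $link->getAttribute('href') . "\n"; // /first and /second, not /outside
}
```

The link outside the list is skipped because the query path requires a ul and li ancestor between body and the a tag.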
Iterate and Store Your Links

Next we'll iterate through all the links we've gathered using XPath and store them in a database. First, the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
}
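The loop above calls a storeLink() function that isn't shown at this point. Here's one plausible sketch using PDO; the DSN, credentials, table name (links), and column names are assumptions for illustration, not the article's actual schema:

```php
<?php
// Hypothetical storeLink(): insert a scraped link into MySQL via PDO.
// The DSN, credentials, table name, and columns below are assumptions.
function storeLink(string $url, string $gatheredFrom): void
{
    static $pdo = null;
    if ($pdo === null) {
        // Placeholder connection details -- substitute your own.
        $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'password');
        $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }

    $stmt = $pdo->prepare(
        'INSERT INTO links (url, gathered_from) VALUES (:url, :from)'
    );
    $stmt->execute([':url' => $url, ':from' => $gatheredFrom]);
}
```

A prepared statement is worth the extra line here: the URLs come from untrusted page content, so they should never be concatenated straight into the SQL.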
Full program:

<?php
$target_url = "http://www.google.com";
$userAgent = 'googlebot/2.1 (http://www.googlebot.com/bot.html)';

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "cURL error number: " . curl_errno($ch) . "\n";
    echo "cURL error: " . curl_error($ch) . "\n";
    exit;
}

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url . "\n";
}
?>
Then you can store the URLs in your database. More details here: http://www.merchantos.com/makebeta/php/scraping-links-with-php/#curl_content