Mining XML and HTML data with PHP


Data mining and its importance

Common Acronyms
  • API: Application Programming Interface
  • CDATA: Character data
  • DOM: Document Object Model
  • FTP: File Transfer Protocol
  • HTML: Hypertext Markup Language
  • HTTP: Hypertext Transfer Protocol
  • REST: Representational State Transfer
  • URL: Uniform Resource Locator
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

Wikipedia defines data mining as "a process that uses statistical and artificial intelligence methods, combined with database management, to extract patterns from large datasets." This is a very in-depth definition, and it probably goes beyond the typical use case for most people: few people actually use artificial intelligence. Generally, data mining simply means searching through and aggregating large datasets to find useful information.

With the rapid growth of the Internet and the massive amount of information it provides, it is increasingly important to be able to collect large amounts of data and make that data meaningful: to gather datasets too large for any individual to read and distill them into useful information. That kind of data mining is the focus of this article, which specifically covers how to collect and parse such data.


Practical application of data mining

Data mining has many practical applications. You might want to scan a website and collect the information it provides (such as attendance records for movies or concerts). You might retrieve more serious information, such as voter records, and extract useful data from it. Or, more commonly, you might examine social network data and try to parse it to understand a trend, such as how often your company is mentioned and whether those mentions are positive or negative.


Precautions before mining websites

Before proceeding to the following content, we assume that you will be extracting data from another website; if you already have the data to be processed, that is a completely different situation. When you extract data from a website, make sure that you comply with its terms of service, whether you are performing web scraping (detailed later) or using an API. If you are scraping, you also need to honor the site's robots.txt file, which describes which parts of the website scripts are allowed to access. Finally, make sure you respect the site's bandwidth. Your script should not access the site's data as fast as it can possibly run; otherwise, you may not only cause hosting problems but also run the risk of being banned or blocked because the script is too "aggressive".
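As a point of reference, the sketch below (an addition to this article, not part of the original) shows one polite pattern: fetching a site's robots.txt for review before crawling, and pausing between requests. The host and paths are placeholders, and a real crawler should use a proper robots.txt parser rather than this naive check.

<?php
// A minimal politeness sketch; example.com and the paths are placeholders.
// NOTE: this does not actually parse robots.txt rules; real crawlers
// should use a dedicated robots.txt library.
$host = 'example.com';
$robots = @file_get_contents("http://{$host}/robots.txt");
if ($robots !== false && stripos($robots, 'Disallow:') !== false) {
    echo "robots.txt contains Disallow rules; review them before scraping.\n";
}

foreach (array('/page1', '/page2') as $path) {
    $html = @file_get_contents("http://{$host}{$path}");
    // ... process $html here ...
    sleep(2); // pause between requests so the script is not too "aggressive"
}
?>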


Understanding XML data structure

Whatever method you use to extract the data, you will likely receive it in XML (or HTML) format. XML has become a standard language of the Internet because of how well it supports data sharing. Before considering methods for extracting data, it is important to examine the structure of XML and how XML data is handled.

The basic structure of an XML document is very intuitive, especially if you have worked with HTML before. All data in an XML document is stored in one of two ways. The primary way is to store the data inside nested tags. Here is a simple example: if you have an address, the data can be stored in the document as follows:

<address>1234 Main Street, Baltimore, MD</address>

You can nest these XML data points to create a list of multiple addresses. You can put all these addresses in another tag, which in this case is locations (see Listing 1).


Listing 1. Multiple addresses in XML

<locations>
    <address>1234 Main Street, Baltimore, MD</address>
    <address>567 1st Street, San Jose, CA</address>
    <address>901 Washington Ave, Chicago, IL</address>
</locations>


To extend this approach further, these addresses can be broken down into components: streets, cities, and states, which makes data processing easier. The resulting XML file is more typical, as shown in Listing 2.


Listing 2. Fully resolved address in XML

<locations>
    <address>
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address>
        <street>567 1st Street</street>
        <city>San Jose</city>
        <state>CA</state>
    </address>
    <address>
        <street>901 Washington Ave</street>
        <city>Chicago</city>
        <state>IL</state>
    </address>
</locations>


As mentioned earlier, there are two ways to store XML data, and we have just seen one of them. The other is to store data in attributes. You can assign any number of attributes to each tag. Although this method is less common, it can be a very useful tool. It sometimes provides additional information, such as a unique ID or an event date. A more common case is adding metadata: in the address example, a type attribute can indicate whether an address is a home or work address, as shown in Listing 3.


Listing 3. XML with attributes added

<locations>
    <address type="home">
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address type="work">
        <street>567 1st Street</street>
        <city>San Jose</city>
        <state>CA</state>
    </address>
    <address type="work">
        <street>901 Washington Ave</street>
        <city>Chicago</city>
        <state>IL</state>
    </address>
</locations>


Note that an XML document always has one parent root tag/node, and all other tags/nodes are children of that root. The beginning of an XML document can also contain other declarations and definitions, as well as more complex content such as CDATA blocks. It is highly recommended that you read the Resources section to learn more about XML.
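To make that concrete, the sketch below (an illustrative addition) shows a complete document: the XML declaration, a single root node, and a CDATA block, read with the SimpleXML functions introduced in the next section.

<?php
// A complete XML document: declaration, one root node, one CDATA block.
$doc = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<locations>
    <note><![CDATA[Addresses <verified> in 2010]]></note>
</locations>
XML;

$xml = simplexml_load_string($doc);
// Casting to (string) returns the CDATA text; passing LIBXML_NOCDATA
// as a third argument to simplexml_load_string() would also make it
// visible to print_r().
echo (string) $xml->note, "\n"; // prints: Addresses <verified> in 2010
?>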


Parsing XML data in PHP

Now that you understand the appearance and structure of XML, let's look at how to parse and programmatically access XML data in PHP. Several libraries exist for parsing XML in PHP, and each has its advantages and disadvantages. These libraries include DOM, XMLReader/Writer, XML Parser, SimpleXML, and more. For the purposes of this article, the main focus is SimpleXML, because it is one of the most commonly used libraries and one of my favorites.

SimpleXML, as the name suggests, aims to provide a very simple interface for accessing XML. It transforms an XML document into an internal PHP object format, so accessing data points becomes as easy as accessing object variables. To parse an XML document with SimpleXML, you simply use the simplexml_load_file() function (see Listing 4).


Listing 4. Parsing a document using SimpleXML

<?php
$xml = simplexml_load_file('listing3.xml'); // file name assumed; a URL also works
?>


It's that simple! Note, however, that thanks to PHP's file stream integration, you can supply either a file name or a URL here, and the stream wrapper will retrieve it automatically. If the XML is already loaded into memory, you can use simplexml_load_string() instead.
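Here is a minimal sketch of simplexml_load_string() (an illustrative addition; the hand-written XML mirrors Listing 3):

<?php
// Parse XML that is already in memory, e.g. returned by an HTTP client.
$raw = '<locations><address type="home">'
     . '<street>1234 Main Street</street>'
     . '<city>Baltimore</city><state>MD</state>'
     . '</address></locations>';
$xml = simplexml_load_string($raw);
if ($xml === false) {
    die("Failed to parse XML\n");
}
?>

If you run the code from Listing 4 on the XML in Listing 3 and use print_r() to view the rough structure of the data, you get the output shown in Listing 5.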


Listing 5. Output of parsed XML

SimpleXMLElement Object
(
     [address] => Array
         (
             [0] => SimpleXMLElement Object
                 (
                     [@attributes] => Array
                         (
                             [type] => home
                         )
                     [street] => 1234 Main Street
                     [city] => Baltimore
                     [state] => MD
                 )
              [1] => SimpleXMLElement Object
                 (
                     [@attributes] => Array
                         (
                             [type] => work
                         )
                     [street] => 567 1st Street
                     [city] => San Jose
                     [state] => CA
                 )
              [2] => SimpleXMLElement Object
                 (
                     [@attributes] => Array
                         (
                             [type] => work
                         )
                     [street] => 901 Washington Ave
                     [city] => Chicago
                     [state] => IL
                 )
          )
)

You can then access the data using standard PHP object accesses and methods. For example, to echo every state someone has lived in, you can iterate over the addresses (see Listing 6).


Listing 6. Iterating addresses

<?php
foreach ($xml->address as $address) {
    echo $address->state, "<br />\n";
}
?>

Accessing attributes is slightly different. Rather than referencing them like object properties, you access attributes as array values. You can change the previous code sample to display the type attribute, using the code shown in Listing 7.


Listing 7. Adding attributes

<?php
foreach ($xml->address as $address) {
    echo $address->state, ': ', $address['type'], "<br />\n";
}
?>

Although all of the examples so far involve iteration, you can also access data directly and use only the specific piece of information you need; for example, you can extract the street of the second address with $xml->address[1]->street.
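For instance, this brief sketch (an addition, using the Listing 3 data) pulls single values directly; values usually need to be cast to (string), because SimpleXML returns SimpleXMLElement objects rather than plain strings:

<?php
// Direct access without iteration, using the structure from Listing 3.
$street = (string) $xml->address[1]->street;  // "567 1st Street"
$type   = (string) $xml->address[1]['type'];  // "work"
echo "{$street} ({$type})\n";
?>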

You should now have some basic tools to start working with XML data. For more information, we strongly recommend that you refer to the SimpleXML documentation and other links listed in Resources.


Data mining in PHP: possible approaches

As mentioned earlier, there are several ways to access data; the two main methods are web scraping and API usage.

Web scraping

Web scraping is the practice of programmatically downloading entire web pages and extracting data from them. There are whole books devoted to this topic (see Related topics), so I will only briefly list some of the tools needed for web scraping. First, PHP makes it easy to read a web page into a string. There are many ways to do this, including calling file_get_contents() with a URL, but you then want to be able to parse the HTML in a meaningful way.
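As a trivial sketch of that first step (example.com is a placeholder):

<?php
// Read an entire web page into a string with a single call.
$html = file_get_contents('http://example.com/');
if ($html === false) {
    die("Failed to fetch the page\n");
}
echo strlen($html), " bytes fetched\n";
?>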

Because HTML is essentially an XML-like language, it is useful to convert HTML into a SimpleXML structure. However, you can't just load an HTML page with simplexml_load_file(), because even valid HTML is not XML. A good approach is to use the DOM extension, load the HTML page as a DOM document, and then convert it to SimpleXML, as shown in Listing 8.


Listing 8. Using the DOM method to get a SimpleXML version of a web page

<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$xml = simplexml_import_dom($dom);
?>

You can now traverse the HTML page just as you would any other XML document. So you can access the page title with $xml->head->title, or dig deep into the page with a reference such as $xml->body->div[0]->div[0]->div[0]->h4[0].

However, as you might expect from that last example, finding data inside an HTML page can sometimes be very inconvenient, because HTML pages are usually not as well structured as XML files. The line above looks for the first h4 inside three nested divs, taking the first div within each parent div.

Fortunately, if you only want the first h4 on the page, or other such "direct data", XPath is a much easier way to get it. XPath is a very powerful tool, one that could be the topic of an entire article series (see some of the articles listed in Resources). In short, you use '/' to describe hierarchical relationships; therefore, the previous reference can be rewritten as the XPath search shown in Listing 9.


Listing 9. Using XPath directly

<?php
$h4 = $xml->xpath('/html/body/div/div/div/h4');
?>

Alternatively, you can use the '//' option with XPath, which searches the entire document for the tag you are looking for. So you can find all h4 elements as an array, and then access the first one, with the following XPath:
'//h4'
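A short sketch (an illustrative addition): xpath() returns an array of matching SimpleXMLElement objects, so the first h4 is simply element zero.

<?php
// '//h4' matches every h4 anywhere in the document.
$headings = $xml->xpath('//h4');
if (!empty($headings)) {
    echo (string) $headings[0], "\n"; // the first h4 on the page
}
?>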

Traversing the HTML hierarchy

The main reason for discussing the above conversions and XPath is that one common, necessary task in web scraping is automatically finding the other links on a web page and following them, which lets you "traverse" the website to find as much information as possible.

XPath makes this task fairly painless. Listing 10 gathers an array of all links that have an href attribute, letting you work with them.


Listing 10. Finding all links on a page using multiple technologies

<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    echo $l['href'], "<br />\n";
}
?>

The code above finds all the links, but if you opened every possible link you found, you would quickly start "crawling" the entire web. Therefore, it is best to enhance the code to ensure that only two kinds of links are followed: valid HTML links (not FTP or JavaScript), and links that point back (via full or relative URLs) to the same website.

An easier way is to iterate over these links with PHP's built-in parse_url() function, which handles many of the qualification checks for you, as shown in Listing 11.


Listing 11. A more robust site traversal program

<?php
$host = 'example.com'; // the site being traversed
$dom = new DOMDocument();
$dom->loadHTMLFile("http://{$host}/");
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    $p = parse_url($l['href']);
    if (empty($p['scheme']) || in_array($p['scheme'], array('http', 'https'))) {
        if (empty($p['host']) || ($host == $p['host'])) {
            echo $l['href'], "<br />\n"; // Handle URL iteration here
        }
    }
}
?>

A final note on HTML parsing: this article converted HTML to SimpleXML via the DOM extension in order to have a unified interface to all XML-like languages, but the DOM library itself is very powerful and can be used directly. If you are very familiar with JavaScript and with traversing a DOM document tree using tools like getElementsByTagName, you can simply stay within the DOM library and skip SimpleXML.
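As a sketch of that alternative (an addition to this article; example.com is a placeholder), here is the link extraction from Listing 10 done purely with the DOM library:

<?php
// Find all links using only the DOM extension, no SimpleXML involved.
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
        echo $a->getAttribute('href'), "<br />\n";
    }
}
?>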

You should now have the tools you need to scrape data from a web page. Once you are familiar with the techniques detailed above, you can read any information from a web page, not just the links you can follow. We hope, though, that you never have to perform this task, because an API or some other data source already exists.

Working with XML APIs and data

At this point, you have the basic skills needed to access and work with the major XML data APIs on the Internet. They are often REST-based, so a simple HTTP request is enough to retrieve the data, which you can then parse using the methods described earlier.

Each API behaves differently in this regard. We can't walk through every API, so we will only briefly introduce some basic XML API examples. One of the most common data sources already available in XML format is the RSS feed. RSS stands for Really Simple Syndication and is a standardized format commonly used to share frequently updated data, such as blog posts, news headlines, or podcasts. To learn more about the RSS format, see Related topics. Note that an RSS feed is an XML file with a parent channel tag containing a number of item tags, each of which provides a set of data points.
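To make that structure concrete, here is a tiny hand-written feed (an illustrative addition, not a real feed) parsed with SimpleXML:

<?php
// A minimal RSS fragment: one channel parent containing repeated items.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First headline</title><link>http://example.com/1</link></item>
    <item><title>Second headline</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
echo (string) $xml->channel->item[0]->title, "\n"; // prints: First headline
?>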

As a real example, suppose we use SimpleXML to read the New York Times headline RSS feed (see Related topics for a link to the feed) and format a list of headlines that link to the corresponding stories (see Listing 12).


Listing 12. Reading the New York Times RSS feed

<?php
// The New York Times headline feed (see Related topics for the feed link):
$xml = simplexml_load_file('http://feeds.nytimes.com/nyt/rss/HomePage');
foreach ($xml->channel->item as $item) {
    echo "<p><a href=\"{$item->link}\">{$item->title}</a></p>";
}
?>


Figure 1 shows the output from the New York Times feed.


Figure 1. Output from the New York Times feed
Now let's explore a more fully functional REST-based API. The Flickr API is a good starting point because it provides a lot of data but does not require authentication. Many APIs do require authentication (using OAuth or another mechanism) before acting on behalf of a web user; that requirement may apply to an entire API or only to parts of it. Refer to each API's documentation to learn how to authenticate.

To see how the Flickr API handles an unauthenticated request, you can use its search API. As an example, let's search Flickr for all public photos of crossbows. Although authentication is not necessary, as with many APIs an API key must still be generated and used when accessing the data. Refer directly to Flickr's API documentation to learn how to accomplish that task. Once you have an API key, you can explore the search capabilities of the Flickr API, as shown in Listing 13.


Listing 13. Searching for "crossbow" using the Flickr API

<?php
// Store some basic information needed:
$apiurl = 'http://api.flickr.com/services/rest/?';
$key = 'YOUR_API_KEY_HERE'; // insert your own Flickr API key

// First, build the initial search request:
$params = array(
    'method' => 'flickr.photos.search',
    'api_key' => $key,
    'text' => 'crossbow', // Our search term
    'media' => 'photos',
    'per_page' => 20 // We only want 20 results
);

// Now make the request to Flickr:
$xml = simplexml_load_file($apiurl . http_build_query($params));

// From this, iterate over the list of photos & request more info:
foreach ($xml->photos->photo as $photo) {
    // Build a new request with this photo's ID
    $params = array(
        'method' => 'flickr.photos.getInfo',
        'api_key' => $key,
        'photo_id' => (string)$photo['id']
    );
    $info = simplexml_load_file($apiurl . http_build_query($params));

    // Now $info holds a vast amount of data about the image including
    // owner, GPS, dates, description, tags, etc ... all to be used.

    // Let's also request "sizes" to get all of the image URLs:
    $params = array(
        'method' => 'flickr.photos.getSizes',
        'api_key' => $key,
        'photo_id' => (string)$photo['id']
    );
    $sizes = simplexml_load_file($apiurl . http_build_query($params));
    // Each size element carries a 'source' attribute with the image URL:
    $small = $sizes->xpath("//size[@label='Small']");

    // For now, just going to create a simple display of the image,
    // linked back to Flickr, with title, GPS info, and more shown:
    echo <<<EOHTML
<div>
  <a href="{$info->photo->urls->url[0]}">
    <img src="{$small[0]['source']}" alt="{$info->photo->title}" />
  </a>
  <ul>
    <li>Title: {$info->photo->title}</li>
    <li>User: {$info->photo->owner['realname']}</li>
    <li>Date Taken: {$info->photo->dates['taken']}</li>
    <li>Location: {$info->photo->location->locality},
        {$info->photo->location->county},
        {$info->photo->location->region},
        {$info->photo->location->country}</li>
  </ul>
</div>
EOHTML;
}
?>

Figure 2 shows the output of this Flickr program. Your crossbow search results include some photos and information about each photo (title, user, location, and shooting date).


Figure 2. Sample output from the Flickr program in Listing 13

You have seen how powerful an API like this can be, and how to combine various calls within the same API to get the data you need. With these basic techniques, you can mine data from any website or information source.

Think about how to gain programmatic access to the data, whether through an API or through web scraping, then use the methods shown above to access and iterate over all of the target data.


Store and report extracted data

Finally, storing and reporting on the extracted data is in many ways the easiest, and often the most enjoyable, part. How you handle this aspect depends on your actual situation, and you can use your imagination freely.

Typically, you take all of the data you collect and store it in a database, with a data structure designed to match the way you plan to access the data later. Try to store as much information as possible at this stage; you may want it in the future. You can always delete data, but once the dataset has grown, retrieving an additional field after the fact can be a painful process. It is best to store more data at the beginning; after all, you never know what data might come in handy.
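As one possible sketch (purely illustrative; the SQLite file name and table layout are assumptions, not part of the original article), the parsed addresses from Listing 3 could be stored with PDO like this:

<?php
// Store parsed addresses in a local SQLite database via PDO.
$db = new PDO('sqlite:mining.db');
$db->exec('CREATE TABLE IF NOT EXISTS addresses (
    street TEXT, city TEXT, state TEXT, type TEXT)');
$stmt = $db->prepare(
    'INSERT INTO addresses (street, city, state, type) VALUES (?, ?, ?, ?)');
foreach ($xml->address as $address) {
    $stmt->execute(array(
        (string) $address->street,
        (string) $address->city,
        (string) $address->state,
        (string) $address['type'],
    ));
}
?>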

Once the data is in a database or similar data store, you can create reports. Creating a report might be as simple as running a few basic SQL queries against the database to see how many times a certain value appears, or it might be a very complex web user interface designed to let users drill down into the data and discover relationships within it.
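For example, a simple report against the hypothetical table above might count stored addresses per state:

<?php
// Count how many stored addresses fall in each state.
$db = new PDO('sqlite:mining.db');
$rows = $db->query(
    'SELECT state, COUNT(*) AS total FROM addresses GROUP BY state');
foreach ($rows as $row) {
    echo "{$row['state']}: {$row['total']}\n";
}
?>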

After you have worked hard to classify all of your data, you can dream up some innovative ways to display it.


Concluding remarks

In this article, you learned the basic structure of XML documents and a simple way to parse XML data in PHP using SimpleXML. You also added the ability to handle HTML in a similar way, and you learned how to traverse a website to get at data that is not available in XML format. With these tools, combined with the examples provided in this article, you have a good knowledge base from which to start data mining a website. There is far more to learn about this topic than a single article can cover; for more ways to expand your data mining knowledge, see Resources.
