Mining XML and HTML data with PHP

Data mining and its importance

Common abbreviations

API: Application Programming Interface
CDATA: Character data
DOM: Document Object Model
FTP: File Transfer Protocol
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
REST: Representational State Transfer
URL: Uniform Resource Locator
W3C: World Wide Web Consortium
XML: Extensible Markup Language

Wikipedia defines "data mining" as "the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management." This is a very deep definition that goes beyond the typical use cases of most people. Few people work with artificial intelligence; more typically, data mining simply means searching through and aggregating large data sets to find useful information.

The Internet is growing rapidly and provides an enormous amount of information, so it is important to be able to collect large amounts of data and make sense of it. Collecting data sets too large for any individual to read, and refining them into useful data, is an important goal. That type of data mining is the focus of this article, which specifically describes how to collect and parse such data.


Practical applications of data mining

Data mining has many practical applications. You might want to scan a Web site and collect the information it provides (movie or concert attendance records, for example). You might need to retrieve more serious information, such as voter records, and extract useful data from it. Or, more commonly, you might want to examine social network data, trying to parse it to understand a trend, such as how often your company is mentioned and whether those mentions are positive or negative.


Considerations before mining a Web site

Before going further, note that this article assumes you will extract data from another site. If you already have the data you want to work with, that's a completely different scenario. When you extract data from a Web site, you must be sure to follow its terms of service, whether you're doing Web scraping (more on this later) or using an API. If you're scraping, you also need to respect the site's robots.txt file, which describes which parts of the site scripts are allowed to access. Finally, make sure you don't swamp the site's bandwidth. Your scripts must never access the site's data as fast as they can possibly run; otherwise you risk not only causing hosting problems, but also being banned or blocked for being too "aggressive."
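As a minimal sketch of that last point, assuming a hypothetical list of pages on a site you are allowed to scrape, a script can simply pause between requests rather than hammering the server:

<?php
// Hypothetical list of pages to fetch from one site
$urls = array(
    'http://example.com/page1',
    'http://example.com/page2',
);

foreach ($urls as $url) {
    $html = file_get_contents($url); // Read one page into a string
    // ... process $html here ...
    sleep(2); // Pause between requests so the site isn't overloaded
}
?>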


Understanding XML Data Structures

No matter which method you use to extract data, chances are you will receive it in XML (or HTML) format. XML has become the standard language of the Internet for sharing data. Before considering methods of extracting data, it is important to first study the structure of XML and how XML data is processed.

The basic structure of an XML document is intuitive, especially if you have worked with HTML before. All data in an XML document is stored in one of two ways. The primary way is to store the data inside nested tags. As a simple example, suppose you have an address; that data can be stored in the document as follows:

<address>1234 Main Street, Baltimore, MD</address>

You can repeat such XML data points to create a list of multiple addresses, and you can wrap them all in another tag, in this case a locations tag (see Listing 1).


Listing 1. Multiple addresses in XML

<locations>
    <address>1234 Main Street, Baltimore, MD</address>
    <address>567 1st Street, San Jose, CA</address>
    <address>901 Washington Ave, Chicago, IL</address>
</locations>

To extend this approach further, each address can be broken down into its individual components: street, city, and state, which makes processing the data easier. The resulting XML file is then more typical, as shown in Listing 2.


Listing 2. Fully decomposed addresses in XML

<locations>
    <address>
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address>
        <street>567 1st Street</street>
        <city>San Jose</city>
        <state>CA</state>
    </address>
    <address>
        <street>901 Washington Ave</street>
        <city>Chicago</city>
        <state>IL</state>
    </address>
</locations>

As mentioned earlier, there are two ways to store XML data, and we have just seen one of them. The other is to store data in attributes. Each tag can be given a number of attributes. Although this approach is less common, it can be a very useful tool. Sometimes an attribute provides additional information, such as a unique ID or an event date. A more common scenario is adding metadata: in the address example, a type attribute can indicate whether an address is a home or work address, as shown in Listing 3.


Listing 3. Attributes added to the XML

<locations>
    <address type="home">
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
    <address type="work">
        <street>567 1st Street</street>
        <city>San Jose</city>
        <state>CA</state>
    </address>
    <address type="work">
        <street>901 Washington Ave</street>
        <city>Chicago</city>
        <state>IL</state>
    </address>
</locations>

Note that an XML document always has one parent root tag/node, and all other tags/nodes are children of that root. The beginning of an XML document can also contain other declarations and definitions, as well as more complex constructs such as CDATA blocks. It is strongly recommended that you consult the Resources section for further information about XML.
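For instance, here is a small illustrative sketch (not one of the listings above; the note text is made up) of what an XML declaration and a CDATA block can look like:

<?xml version="1.0" encoding="UTF-8"?>
<locations>
    <!-- CDATA lets raw text contain characters such as & and < safely -->
    <note><![CDATA[Data imported from "legacy" system & not yet verified]]></note>
    <address type="home">
        <street>1234 Main Street</street>
        <city>Baltimore</city>
        <state>MD</state>
    </address>
</locations>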


Parsing XML data in PHP

Now that you understand how XML looks and is structured, let's look at how to parse and programmatically access XML data in PHP. Several libraries created for PHP allow XML parsing, each with its own pros and cons. These include DOM, XMLReader/XMLWriter, XML Parser, SimpleXML, and others. For the purposes of this article, our main focus is SimpleXML, because it is one of the most commonly used libraries and one of my favorites.

SimpleXML, as its name implies, is designed to provide a very simple interface for accessing XML. It converts an XML document into an internal PHP object format, so accessing data points becomes as easy as accessing object variables. To parse an XML document with SimpleXML, simply use the simplexml_load_file() function (see Listing 4).


Listing 4. Parsing documents with SimpleXML

<?php
// Load and parse the XML document (a file path or a URL both work here)
$xml = simplexml_load_file('listing3.xml'); // Placeholder file name
?>

It's that easy! Note that thanks to PHP's file stream integration, either a file name or a URL can be used here, and the stream layer automatically retrieves it. simplexml_load_string() can also be used if you already have the XML loaded into memory. If you run this code on the XML in Listing 3 and use print_r() to view the rough structure of the data, you get the output shown in Listing 5.


Listing 5. Output of parsed XML

SimpleXMLElement Object
(
    [address] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => home
                        )
                    [street] => 1234 Main Street
                    [city] => Baltimore
                    [state] => MD
                )
            [1] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => work
                        )
                    [street] => 567 1st Street
                    [city] => San Jose
                    [state] => CA
                )
            [2] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [type] => work
                        )
                    [street] => 901 Washington Ave
                    [city] => Chicago
                    [state] => IL
                )
        )
)

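As a quick sketch of the simplexml_load_string() variant mentioned above, assuming the XML is already sitting in a PHP string:

<?php
$string = '<locations>' .
          '<address type="home"><street>1234 Main Street</street>' .
          '<city>Baltimore</city><state>MD</state></address>' .
          '</locations>';
$xml = simplexml_load_string($string); // Parse XML held in memory
echo $xml->address[0]->city;           // Prints "Baltimore"
?>
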
You can then access the data using standard PHP object access and methods. For example, to echo each state in which the person lives, you can iterate through the addresses (see Listing 6).


Listing 6. Iterating over addresses

<?php
foreach ($xml->address as $address) {
    echo $address->state, "<br />\n";
}
?>

Accessing attributes is slightly different. Unlike referencing an object property, you access an attribute as an array value. You can change the preceding code sample to display the type attribute using the code shown in Listing 7.


Listing 7. Adding the attribute

<?php
foreach ($xml->address as $address) {
    echo $address->state, ': ', $address['type'], "<br />\n";
}
?>

Although all of the examples so far involve iteration, you can also access data directly and grab just the specific piece of information you need, such as extracting the street of the second address with $xml->address[1]->street.
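Put together as a runnable fragment (the file name is just a placeholder for the XML of Listing 3):

<?php
$xml = simplexml_load_file('listing3.xml'); // Placeholder file name
echo $xml->address[1]->street;  // Prints "567 1st Street"
echo $xml->address[1]['type'];  // Prints "work" -- direct attribute access
?>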

You should now have the basic tools to start working with XML data. For more information, it is highly recommended that you refer to the SimpleXML documentation and the other links listed in Resources.


Data mining in PHP: possible methods

As mentioned earlier, data can be accessed in several ways, two of the most common being Web scraping and API usage.

Web scraping

Web scraping is the act of programmatically downloading entire Web pages and extracting data from them. There are whole books devoted to the topic (see Resources); I'll simply list some of the tools needed to do Web scraping. First, PHP makes it very easy to read a Web page into a string. There are many ways to accomplish this, including using file_get_contents() with a URL, but here you want to be able to parse the HTML in a meaningful way.

Because HTML is at heart an XML-like language, it is useful to convert HTML into a SimpleXML structure. You can't just load an HTML page with simplexml_load_file(), though, because even valid HTML is not XML. A good workaround is to use the DOM extension to load the HTML page as a DOM document and then convert it to SimpleXML, as shown in Listing 8.


Listing 8. Using the DOM method to get a SimpleXML version of a Web page

<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$xml = simplexml_import_dom($dom);
?>

You can now traverse the HTML page just as you would any other XML document. So you can access the page title with $xml->head->title, or drill down into the page with references such as $xml->body->div[0]->div[0]->div[0]->h4[0].

As you can imagine from that example, however, trying to find data within an HTML page this way can be inconvenient, because HTML pages are often not as well structured as XML files. The reference above looks for the first h4 inside three nested divs, looking in the first div within each parent div at every level.

Fortunately, if you just want the first h4 on the page, or other such "direct data," XPath offers a much easier way to get at it. XPath is a very powerful tool, one that could be the subject of an entire article series on its own (see the articles listed in Resources). In short, you use '/' to describe hierarchical relationships, so the preceding reference can be rewritten as the XPath search shown in Listing 9.


Listing 9. Using XPath directly

<?php
$h4 = $xml->xpath('/html/body/div/div/div/h4');
?>

Alternatively, you can use the '//' option with XPath, which searches the entire document for the tag you're looking for. So you can find every h4 as an array, and then access the first one, with the following XPath:
'//h4'
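A short sketch of that form, continuing with the $xml object created in Listing 8:

<?php
$headings = $xml->xpath('//h4'); // Every h4 anywhere in the document
if (!empty($headings)) {
    echo $headings[0];           // The first h4 found on the page
}
?>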

Traversing the HTML hierarchy

The main reason for discussing these conversions and XPath is that one of the most common tasks in Web scraping is to automatically find other links on a Web page and retrieve them as well, letting you "traverse" the site to find as much information as possible.

With XPath, this task is not cumbersome at all. Listing 10 produces an array of all links that have an href attribute and lets you work with them.


Listing 10. Use a combination of techniques to find all links on a page

<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://example.com/');
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    echo $l['href'], "<br />\n";
}
?>

The code above finds all the links, but if you opened every possible link you found, you would quickly start crawling the entire Web. Therefore, it is best to enhance the code to ensure you only follow two kinds of links: valid HTTP links (not FTP or JavaScript), and links that return to the same Web site (through full or relative domain links).

An easier approach is to iterate over the links using PHP's built-in parse_url() function, which handles much of the qualification checking for you, as shown in Listing 11.


Listing 11. A more robust site traversal program

<?php
$host = 'example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile("http://{$host}/");
$xml = simplexml_import_dom($dom);
$links = $xml->xpath('//a[@href]');
foreach ($links as $l) {
    $p = parse_url($l['href']);
    if (empty($p['scheme']) || in_array($p['scheme'], array('http', 'https'))) {
        if (empty($p['host']) || ($host == $p['host'])) {
            echo $l['href'], "<br />\n";
            // Handle URL iteration here
        }
    }
}
?>

One final point on HTML parsing: for the sake of a unified interface to all XML-like languages, this article has shown only how to convert HTML into SimpleXML via the DOM extension. Note that the DOM library is very powerful in its own right and can be used directly. If you are comfortable with JavaScript and with traversing a DOM document tree using tools such as getElementsByTagName, you may prefer to stay within the DOM library instead of using SimpleXML.
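Here is a minimal sketch of that DOM-only route (the URL is a placeholder; the @ suppresses the warnings that imperfect real-world HTML often triggers):

<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://example.com/'); // Placeholder URL
// The JavaScript-style approach: fetch elements directly by tag name
foreach ($dom->getElementsByTagName('a') as $anchor) {
    echo $anchor->getAttribute('href'), "\n";
}
?>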

You should now have the tools you need to scrape data from Web pages. Once you are familiar with the techniques detailed above, you can read any information from a Web page, not just the links you can follow. Ideally, though, you won't have to perform this task at all, because an API or other data source already exists.

Using XML APIs and data

At this point, you have the basic skills needed to access and work with the primary XML data APIs on the Internet. They are usually REST-based, so the data can be retrieved with a simple HTTP access and parsed using the techniques described earlier.

Every API behaves differently in this regard. We can't describe how to access each one, so we'll just briefly cover some basic XML API examples. One of the most common sources of data in XML format is the RSS feed. RSS stands for Really Simple Syndication and is a largely standardized format for sharing frequently updated data, such as blog posts, news headlines, or podcasts. To learn more about the RSS format, see Resources. Note that an RSS feed is an XML file with a parent channel tag that contains a number of item tags, each of which provides a set of data points.
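As an illustrative skeleton (element values made up), a feed looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/story-1</link>
      <description>Summary of the first story</description>
    </item>
    <!-- ... more item tags ... -->
  </channel>
</rss>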

Let's illustrate with an example: using SimpleXML to read the RSS feed of New York Times headlines (see Resources for a link to the feed) and format a list of headlines, each linked to its story (see Listing 12).


Listing 12. Reading The New York Times RSS feed

<?php
$feedURL = 'http://...'; // The New York Times RSS feed URL (see Resources)
$xml = simplexml_load_file($feedURL);
echo '<ul>';
foreach ($xml->channel->item as $item) {
    echo "<li><a href=\"{$item->link}\">{$item->title}</a></li>";
}
echo '</ul>';
?>

Figure 1 shows the output from The New York Times Digest.


Figure 1. Output from The New York Times Digest

Now, let's explore a more functional REST-based API. The Flickr API is a good one to start with because it offers a lot of data but requires no authentication. Many APIs require authentication (via OAuth or another mechanism) before acting on behalf of a Web user. That requirement may apply to the entire API or only to parts of it; check each API's documentation to learn how to authenticate.

To see how to use the Flickr API for unauthenticated requests, let's use its search API to find all of Flickr's public photos of crossbows. Although authentication is not required, as with many APIs you still need to generate an API key to use when accessing the data; refer directly to the Flickr API documentation to learn how. Once you have an API key, you can experiment with the Flickr API's search capability, as shown in Listing 13.


Listing 13. Search for "crossbow" using the Flickr API

<?php
// Store some basic information that's needed:
$apiurl = 'http://api.flickr.com/services/rest/?';
$key = '...'; // Insert your own Flickr API key here

// Build the parameters for the search request:
$params = array(
    'method'   => 'flickr.photos.search',
    'api_key'  => $key,
    'text'     => 'crossbow', // Our search term
    'media'    => 'photos',
    'per_page' => 20          // We only want 20 results
);

// Now make the request to Flickr:
$xml = simplexml_load_file($apiurl . http_build_query($params));

// From this, iterate through the list of photos & request more info:
foreach ($xml->photos->photo as $photo) {
    // Build a new request with this photo's ID:
    $params = array(
        'method'   => 'flickr.photos.getInfo',
        'api_key'  => $key,
        'photo_id' => (string)$photo['id']
    );
    $info = simplexml_load_file($apiurl . http_build_query($params));
    // Now $info holds a vast amount of data about the image, including
    // owner, GPS, dates, description, tags, etc. ... all to be used.

    // Let's also request 'sizes' to get all of the image URLs:
    $params = array(
        'method'   => 'flickr.photos.getSizes',
        'api_key'  => $key,
        'photo_id' => (string)$photo['id']
    );
    $sizes = simplexml_load_file($apiurl . http_build_query($params));
    $small = $sizes->xpath("//size[@label='Small']");

    // For now, just create a simple display of the image,
    // linked back to Flickr, with title, GPS info, and more shown:
    echo <<<EOHTML
<div>
  <a href="{$info->photo->urls->url[0]}">
    <img src="{$small[0]['source']}" alt="{$info->photo->title}" />
  </a>
  <ul>
    <li>Title: {$info->photo->title}</li>
    <li>User: {$info->photo->owner['realname']}</li>
    <li>Date taken: {$info->photo->dates['taken']}</li>
    <li>Location: {$info->photo->location->locality},
        {$info->photo->location->county},
        {$info->photo->location->region},
        {$info->photo->location->country}</li>
  </ul>
</div>
EOHTML;
}
?>

Figure 2 shows the output of this Flickr program: your crossbow search results, including the photos and information about each one (title, user, location, and the date taken).


Figure 2. Sample output from the Flickr program in Listing 13

You can see how powerful such an API is, and how calls within the same API can be combined to get exactly the data you need. With these basic techniques, you can mine data from any Web site or information source.

Just figure out how to get programmatic access to the data, whether through an API or by Web scraping, and then use the methods presented here to access and iterate over all of the target data.


Storage and reporting of extracted data

Finally, storing and reporting the data is in many ways the easiest part, and probably the most fun. How to handle this aspect depends on your own situation, and here you can use your imagination.

Typically, you take all the data you've collected and store it in a database, designing the data structure to match the way you plan to access the data later. When doing so, try to store any information you might want in the future. Although you can always delete data later, retrieving additional information can be a painful process once there is too much of it, so it's best to store a little extra at the start; after all, you never know which data might come in handy.

Once you have the data in a database or similar data store, you can create reports. Creating a report may be as simple as running a few basic SQL queries against the database to see how many times a given piece of data appears, or it may be a very complex Web user interface designed to let users drill into the data and discover the relationships within it.
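As a minimal sketch of both steps, assuming SQLite via PDO and a made-up mentions table for the social-media example from earlier:

<?php
$db = new PDO('sqlite:mining.db'); // Hypothetical database file
$db->exec('CREATE TABLE IF NOT EXISTS mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source   TEXT,
    found_at TEXT,
    body     TEXT)');

// Store one extracted data point:
$insert = $db->prepare(
    'INSERT INTO mentions (source, found_at, body) VALUES (?, ?, ?)');
$insert->execute(array('example.com', date('c'), 'Sample mention text'));

// A basic report: how many mentions came from each source?
$report = $db->query(
    'SELECT source, COUNT(*) AS total FROM mentions GROUP BY source');
foreach ($report as $row) {
    echo "{$row['source']}: {$row['total']}\n";
}
?>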

Once you've done the hard work of categorizing all of your data, you can imagine all kinds of innovative ways to visualize it.


Conclusion

In this article, you learned about the basic structure of XML documents and an easy way to parse XML data in PHP with SimpleXML. You also saw how to work with HTML in a similar way, and took a first look at traversing a Web site to get at data that is not available in XML format. With these tools, combined with the examples provided in this article, you have a good knowledge base from which to start mining data from a site. There is much more to learn about this topic than one article can cover; see Resources for additional ways to deepen your data mining knowledge.


Download

Description: Article source code
Name: Datamining_source.zip
Size: 10KB
Download method: HTTP

