When it comes to scraping, the whole flow is just: fetch remote content -> extract the parts you need -> store them by category -> read -> display.
In other words, a simple "thief program" (content scraper), slightly beefed up.
The following is the corresponding core code (don't judge it too harshly, ^_^).
The content to collect is the announcement list of a game website.
Use file_get_contents and a simple regular expression to grab the basic page information.
Then organize that information and store what we collect in the database:
<?php
include_once("conn.php");

if (!empty($_GET['id']) && $_GET['id'] <= 8) {
    $id = $_GET['id'];
    // Fetch the raw HTML of the current list page
    $conn = file_get_contents("http://www.93moli.com/news_list_4_$id.html");
    // Regex capturing each announcement's title and relative URL
    $pattern = '/<li><a title="(.*?)" target="_blank" href="(.*?)">/ius';
    // Collect every match into the $arr array
    preg_match_all($pattern, $conn, $arr);
    //print_r($arr);die;
    // $arr[1] holds the titles and $arr[2] the URLs; their keys line up
    // exactly, so $arr[2][$key] is the URL belonging to the title $value
    foreach ($arr[1] as $key => $value) {
        $url = "http://www.93moli.com/" . $arr[2][$key];
        $sql = "INSERT INTO list (title, url) VALUES ('$value', '$url')";
        mysql_query($sql);
        echo "<a href='content.php?url=$url'>$value</a><br/>";
    }
    $id++;
    echo "Collecting page $id of the URL list... please wait...";
    echo "<script>window.location='list.php?id=$id'</script>";
} else {
    echo "Data collection finished.";
}
?>
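To make the structure of $arr concrete, here is a standalone example of what preg_match_all returns for a single list item. The <li> line is a made-up sample, not taken from the real page:

<?php
// Illustration only: a fabricated <li> line that matches the pattern above.
$html = '<li><a title="Server maintenance notice" target="_blank" href="news_123.html">';
preg_match_all('/<li><a title="(.*?)" target="_blank" href="(.*?)">/ius', $html, $arr);
print_r($arr);
// $arr[0] = full matches, $arr[1] = titles, $arr[2] = hrefs;
// $arr[1][$key] and $arr[2][$key] always describe the same <li>.
?>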
conn.php is the database connection file, and list.php is the page above. To kick off collection, just open list.php in a browser with a starting page, e.g. list.php?id=1.
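The article doesn't show conn.php itself, so here is a minimal sketch of what it might contain, using the same old mysql_* API as list.php. The host, credentials, database name, and table definition are placeholder assumptions:

<?php
// Hypothetical conn.php; the original file is not shown in the article.
// Host, user, password, and database name below are placeholders.
$link = mysql_connect("localhost", "db_user", "db_password");
if (!$link) {
    die("Could not connect: " . mysql_error());
}
mysql_select_db("collect_demo", $link);
// Keep multi-byte titles intact
mysql_query("SET NAMES utf8");
// Assumed table: CREATE TABLE list (id INT AUTO_INCREMENT PRIMARY KEY,
//                title VARCHAR(255), url VARCHAR(255)) DEFAULT CHARSET=utf8;
?>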
Because the data to collect is paginated and the page URLs simply increment, I used a JS redirect to step from one page to the next, with the id parameter controlling how many pages get collected. This also avoids one oversized for loop; the sketch after this paragraph shows that alternative for contrast.
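For contrast, a minimal sketch of the single big loop that the redirect avoids. It assumes the same conn.php, regex, and list table as above, and needs set_time_limit because eight fetches in one request can exceed PHP's default execution limit:

<?php
// Sketch of the loop-based alternative (assumes conn.php and the list table above).
include_once("conn.php");
set_time_limit(0); // one long request instead of eight short ones

for ($id = 1; $id <= 8; $id++) {
    $html = file_get_contents("http://www.93moli.com/news_list_4_$id.html");
    preg_match_all('/<li><a title="(.*?)" target="_blank" href="(.*?)">/ius', $html, $arr);
    foreach ($arr[1] as $key => $title) {
        $url = "http://www.93moli.com/" . $arr[2][$key];
        mysql_query("INSERT INTO list (title, url) VALUES ('$title', '$url')");
    }
}
echo "Data collection finished.";
?>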
And just like that, the data is in the database. The next article will cover collecting the actual content from each URL.