PHP-based data collection and database insertion program (2)
The previous article in this series covered collecting the list data from the news information page. This article turns to collecting the content of each news item.
This is the final data table from the previous post:
The next step is to read each URL to be collected from the database and fetch the page.
Create a content table
Note, however, that you cannot simply increment the id to pick the next URL, because ids in the data table may have gaps. For example, if the table contains id = 9 and id = 11 but no id = 10, a request for id = 10 returns no URL, and empty fields may end up being collected.
The technique used here is a database query. After collecting one row, we check whether the database contains an id greater than the current one. If there is, we read that row next; because it comes from the query itself, it is guaranteed to exist.
The code is as follows:
<?php
include_once("conn.php");

$id = (int) $_GET['id'];
$sql = "select * from list where id = $id";
$result = mysql_query($sql);
$row = mysql_fetch_array($result);

// obtain the corresponding url address
$content = file_get_contents($row['url']);
$pattern = "/<dd class=\"dataWrap\">(.*)<\/dd>/iUs";

// extract the content to store into $info
preg_match($pattern, $content, $info);

echo $title = $row[1] . "<br/>";
echo $content = $info[0] . "<br/>";
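The "next id" lookup described above can be sketched roughly as follows. This is a minimal sketch, not the author's code: it assumes a PDO connection (the `mysql_*` functions used in the listing were removed in PHP 7), and the helper name `fetch_next_row` is hypothetical. Only the table name `list` and the `id` column come from the original.

```php
<?php
// Hypothetical helper: returns the first row of `list` whose id is greater
// than $lastId, or null when no such row remains.
// Using "id > ?" instead of "id = ? + 1" means gaps in the ids are skipped.
function fetch_next_row(PDO $pdo, int $lastId): ?array
{
    $stmt = $pdo->prepare(
        'SELECT * FROM list WHERE id > :last ORDER BY id ASC LIMIT 1'
    );
    $stmt->execute([':last' => $lastId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    // PDOStatement::fetch() returns false when there is no row.
    return $row === false ? null : $row;
}
```

With rows for id = 9 and id = 11 only, calling `fetch_next_row($pdo, 9)` would return the id = 11 row, and `fetch_next_row($pdo, 11)` would return null, which can serve as the signal to stop collecting.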
In this way, the news content we want is collected into the database. All that remains is to tidy up the formatting of the stored data.
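The extraction and storage steps might look something like the sketch below. The regular expression is the one from the listing; everything about the `content` table (its name and the columns `list_id`, `title`, `body`) is an assumption made for illustration, as are the function names and the use of PDO.

```php
<?php
// Extracts the first <dd class="dataWrap">...</dd> block from a page.
// Returns null if the page does not contain such a block.
function extract_content(string $html): ?string
{
    // i = case-insensitive, U = ungreedy, s = "." also matches newlines
    $pattern = '/<dd class="dataWrap">(.*)<\/dd>/iUs';
    if (preg_match($pattern, $html, $info) !== 1) {
        return null;
    }
    // $info[1] is the inner HTML; $info[0] would include the <dd> tags.
    return $info[1];
}

// Stores one collected item. The `content` table and its columns are
// assumptions for this sketch, not taken from the original article.
function save_content(PDO $pdo, int $listId, string $title, string $body): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO content (list_id, title, body) VALUES (:list_id, :title, :body)'
    );
    $stmt->execute([
        ':list_id' => $listId,
        ':title'   => $title,
        ':body'    => $body,
    ]);
}
```

Checking for null before saving avoids storing empty rows when a page fails to download or its markup differs from what the pattern expects.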
How can PHP programmers master data collection?
Common PHP data collection techniques:
1. Data extraction with regular expressions: the key step in extracting content
2. Character encoding conversion and analysis: compatibility handling and data validity control
3. Database insertion and organization: storing and managing collected content, including databases, files, and collection progress
4. Data mining and website crawling: analyzing the site structure to simplify crawling and improve efficiency
5. Countering anti-collection measures: techniques for target sites that try to block collection
6. Multi-server concurrent collection management: working methods that improve efficiency
7. Data collation and analysis: checking for missing data to verify correctness and validity
8. Identity protection: protecting your own information
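As one illustration of item 2, a collected page can be normalized to UTF-8 before it is parsed and stored. The sketch below assumes the mbstring extension is available and that pages which are not valid UTF-8 are GBK-encoded; that fallback is only a guess that has to be checked for each target site. The function name `to_utf8` is hypothetical.

```php
<?php
// Returns $raw as UTF-8. Bytes that are already valid UTF-8 are kept as-is;
// otherwise they are assumed to be in $fallback (GBK by default, which is an
// assumption for this sketch) and converted.
function to_utf8(string $raw, string $fallback = 'GBK'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    return mb_convert_encoding($raw, 'UTF-8', $fallback);
}
```

Converting once, right after download, means the regular expressions and the database only ever see a single encoding.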
PHP collection and database insertion problems
PHP has the implode function: $nr = implode('#', $arr);
However, this produces "content 1#content 2" with no trailing #. If the trailing # is needed, use
$nr = implode('#', $arr) . '#';
The clumsy alternative is a loop:
$nr = '';
foreach ($arr as $vl) {
    $nr .= $vl . "#";
}
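The two approaches differ in one edge case: for an empty array, implode('#', $arr) . '#' yields a lone "#", while the loop yields an empty string. A small helper along these lines (the name `join_with_trailing` is made up for this sketch) keeps the implode form and makes the empty case explicit:

```php
<?php
// Joins $items with $sep and appends one trailing $sep.
// An empty array yields '' rather than a lone separator.
function join_with_trailing(array $items, string $sep = '#'): string
{
    if ($items === []) {
        return '';
    }
    return implode($sep, $items) . $sep;
}
```

Which behavior is right for an empty array depends on how the stored string is split again later, so treat the early return as a design choice rather than a rule.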