Data acquisition is critical for some websites, and the following problems come up when developing a collection program:
1. Single-process collection is slow, and the interval between collection tasks cannot be controlled.
2. The download part and the analysis part are not separated, which hurts portability and makes debugging difficult.
3. The collection program is limited by the performance and network speed of a single machine.
4. The data source may block the collector.
And so on.
This requires the acquisition program to be smart enough to meet the following requirements:
1. It can run distributed across multiple machines, to cope with large volumes of data collection.
2. It can download with multiple concurrent processes, so that the cycle time of each collection task can be controlled.
3. The download program and the analysis program are separated, not only in code but also across machines.
4. New collection tasks can be added easily, and debugging is easy.
5. Analysis of the collected page content can use fuzzy matching (a minimal sketch follows this list).
6. A proxy can be used when downloading.
7. A list of valid proxies is maintained automatically over the long term.
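As an illustration of requirement 5, here is a minimal sketch of fuzzy matching with a regular expression. The file name, pattern, and field handling are hypothetical and not taken from the original parsers:

<?php
// Hypothetical parser fragment: pull title/link pairs out of a downloaded list page
// with a tolerant ("fuzzy") pattern rather than an exact HTML match.
$html = file_get_contents('./files/website1_list.html');
$pattern = '/<a[^>]*href\s*=\s*["\']([^"\']+)["\'][^>]*>\s*(.*?)\s*<\/a>/is';
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $url   = $m[1];
        $title = strip_tags($m[2]);
        // ... queue $url as a new 'Wait' task in the Downloadlist table
    }
}
?>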
After several major revisions, the Linux- and PHP-based collection program architecture I have now settled on is as follows:
Snatch (home directory)
|-lib (class library, function, configuration directory)
| |-config.inc.php (main program variable configuration)
| |-otherconfig.inc.php (several other configuration files)
| |-functions.inc.php (several function files)
| |-classes.inc.php (some class library files)
| |-classlocaldb.inc.php (operation class for connecting to the local database)
| |-classremotedb.inc.php (operation class for connecting to the remote database)
| |-classlog.inc.php (writes the download and analysis logs)
|-paser (Parser program directory)
| |-website1 (parser directory for website1)
| | |-website1paser1.php (parser 1 for website1)
| | |-website1paser2.php (parser 2 for website1)
| |-website2 (parser directory for website2)
| |-proxywebsite1 (parses proxy-list site 1: extracts proxy addresses and stores them in the database)
| |-proxywebsite2 (parses proxy-list site 2: extracts proxy addresses and stores them in the database)
| |-... ...
|-log (log directory)
| |-website1.log (Download and data analysis log for WebSite1)
| |-website2.log (Download and data analysis log for WebSite2)
| |-... ...
|-files (downloaded file save directory)
|-main.php (main entry program, assigning download tasks)
|-assign.php (fetches download tasks and hands them to down.php for execution)
|-down.php (downloads pages and saves the files to be analyzed)
|-delovertimedata.php (purges expired downloaded files)
|-errornotice.php (monitors the download program and notifies the responsible person when an error occurs)
|-proxy.php (verifies the proxies stored in the database, checking their validity and connection speed)
|-fork (helper program that runs downloads and analysis concurrently)
|-main.sh (wraps main.php so that running it from the shell does not hit path errors)
|-assign.sh (wrapper for assign.php)
|-delovertimedata.sh (wrapper for delovertimedata.php)
|-errornotice.sh (wrapper for errornotice.php)
|-proxy.sh (wrapper for proxy.php)
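To make the layout concrete, here is a minimal sketch of what lib/config.inc.php might contain. All of the constant names and values below are assumptions for illustration, not taken from the original program:

<?php
// Hypothetical contents of lib/config.inc.php -- names and values are assumptions.
define('LOCAL_DB_HOST', 'localhost');        // local task/queue database
define('LOCAL_DB_USER', 'snatch');
define('LOCAL_DB_PASS', 'secret');
define('LOCAL_DB_NAME', 'snatch');

define('REMOTE_DB_HOST', '192.168.0.10');    // remote database that receives the parsed data

define('FILE_SAVE_DIR', dirname(__FILE__) . '/../files/');  // downloaded pages
define('LOG_DIR',       dirname(__FILE__) . '/../log/');    // download/analysis logs

define('LOCAL_SERVER_NAME', 'collector01');  // this machine's name, matched against Downloadlist.Localservername
define('MAX_CONCURRENT_DOWNLOADS', 20);      // how many down.php processes may run at once
?>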
The local database table structure is as follows (brief description):
Downloadlist table:
`id` int(10) unsigned NOT NULL auto_increment -- auto-increment ID
`ParentID` int NOT NULL default '0' -- parent ID: the download record this task was derived from
`SiteName` char NOT NULL default '' -- name or code of the site being collected
`Localservername` char NOT NULL default '' -- which of the local machines executes this task
`URL` char(255) NOT NULL default '' -- address of the data page to be downloaded
`filename` char NOT NULL default '' -- file name the page is saved under after download
`FileSize` int NOT NULL default '0' -- size of the file after download
`Handler` char NOT NULL default '' -- path of the parser PHP file, e.g. ./paser/website1/paser1.php
`Status` enum('Wait','Download','Doing','Done','Dead') NOT NULL default 'Wait' -- state of the task
`Proxyid` int NOT NULL default '0' -- ID of the proxy used by this task; 0 means download without a proxy
`Remark` char NOT NULL default '' -- memo field
`Waitaddtime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the task was queued (entered Wait)
`Downloadaddtime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the download started
`Doingaddtime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the analysis started
`Doneaddtime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the task was completed
Proxylist table:
`id` int NOT NULL auto_increment -- auto-increment ID
`Proxy` char NOT NULL default '' -- proxy address, e.g. 127.0.0.1:8080
`Status` enum('bad','good','Perfect') NOT NULL default 'bad' -- state of the proxy
`Sockettime` float NOT NULL default '3' -- time taken to open a socket to the proxy from the local machine
`Usedcount` int NOT NULL default '0' -- number of times the proxy has been used
`Addtime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the proxy was added to the list
`Lasttime` datetime NOT NULL default '0000-00-00 00:00:00' -- time the proxy was last verified
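As an illustration of how these two tables work together, here is a minimal sketch of a helper that picks a proxy for a download task. The function name and the DB wrapper methods (getRow, query) are assumptions; the original code is not shown in this post:

<?php
// Hypothetical helper: pick the fastest non-bad proxy for a download task.
// Table and column names follow the Proxylist schema above; everything else is assumed.
function pick_proxy($db) {
    $row = $db->getRow(
        "SELECT id, Proxy FROM Proxylist
         WHERE Status != 'bad'
         ORDER BY Sockettime ASC, Usedcount ASC
         LIMIT 1"
    );
    if ($row) {
        // remember how often the proxy has been handed out
        $db->query("UPDATE Proxylist SET Usedcount = Usedcount + 1 WHERE id = " . (int)$row['id']);
        return $row;              // array('id' => ..., 'Proxy' => 'ip:port')
    }
    return null;                  // no usable proxy: download directly (Proxyid = 0)
}
?>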
Other related tables: (omitted)
A brief introduction to a few of the files (description only; the full code is omitted here):
1. main.php
The main entry program. It connects to the local database, inserts new collection tasks into the Downloadlist table (assigning each task's Localservername), and closes the connection when it is done.
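The original code is not reproduced here; the following is only a rough sketch of what main.php might look like, assuming a simple DB wrapper class (LocalDB) and hypothetical seed URLs and machine names:

<?php
// Hypothetical sketch of main.php: queue new download tasks and assign them to machines.
require_once dirname(__FILE__) . '/lib/config.inc.php';
require_once dirname(__FILE__) . '/lib/classlocaldb.inc.php';

$LocalDB = new LocalDB();                        // connect to the local task database
$servers = array('collector01', 'collector02');  // assumed list of collection machines

$entryUrls = array(                              // assumed seed pages for one site
    'http://website1.example.com/list_1.html',
    'http://website1.example.com/list_2.html',
);

foreach ($entryUrls as $i => $url) {
    $server = $servers[$i % count($servers)];    // naive round-robin assignment across machines
    $LocalDB->query(
        "INSERT INTO Downloadlist (SiteName, Localservername, URL, Handler, Status, Waitaddtime)
         VALUES ('website1', '$server', '" . addslashes($url) . "',
                 './paser/website1/website1paser1.php', 'Wait', NOW())"
    );
}

$LocalDB->close();                               // release the connection
?>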
2. assign.php
Fetches the pending download tasks assigned to this machine from the Downloadlist table, launches down.php in the background for each one (redirecting its output to /dev/null), and then closes the database connection.
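Again only a rough sketch, assuming the LocalDB wrapper and the hypothetical constants from the config sketch above; the getAll method and the exact SQL are assumptions:

<?php
// Hypothetical sketch of assign.php: pick up 'Wait' tasks belonging to this machine
// and run down.php for each of them concurrently in the background.
require_once dirname(__FILE__) . '/lib/config.inc.php';
require_once dirname(__FILE__) . '/lib/classlocaldb.inc.php';

$LocalDB = new LocalDB();
$rows = $LocalDB->getAll(
    "SELECT id FROM Downloadlist
     WHERE Status = 'Wait' AND Localservername = '" . LOCAL_SERVER_NAME . "'
     LIMIT " . MAX_CONCURRENT_DOWNLOADS
);

foreach ($rows as $row) {
    $id = (int)$row['id'];
    $LocalDB->query("UPDATE Downloadlist
                     SET Status = 'Download', Downloadaddtime = NOW()
                     WHERE id = $id");
    // one background downloader per task; its output is discarded
    shell_exec("nohup php " . dirname(__FILE__) . "/down.php $id >/dev/null 2>&1 &");
}

$LocalDB->close();
?>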
3. down.php
Downloads the page for a given task (through a proxy when the task's Proxyid is set), saves the file under the files directory for analysis, and updates the task's status and timestamps as it goes.
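A rough sketch of down.php, assuming cURL for the download, the LocalDB wrapper, and the constants from the config sketch; whether down.php itself includes the Handler parser or a separate process does so is an assumption here:

<?php
// Hypothetical sketch of down.php: download one task and save the page for analysis.
require_once dirname(__FILE__) . '/lib/config.inc.php';
require_once dirname(__FILE__) . '/lib/classlocaldb.inc.php';

$LocalDB = new LocalDB();
$id   = (int)$argv[1];                           // task ID passed in by assign.php
$task = $LocalDB->getRow("SELECT * FROM Downloadlist WHERE id = $id");

$ch = curl_init($task['URL']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
if ($task['Proxyid'] > 0) {                      // 0 means: do not use a proxy
    $proxy = $LocalDB->getRow("SELECT Proxy FROM Proxylist WHERE id = " . (int)$task['Proxyid']);
    curl_setopt($ch, CURLOPT_PROXY, $proxy['Proxy']);
}
$content = curl_exec($ch);
curl_close($ch);

if ($content === false) {
    $LocalDB->query("UPDATE Downloadlist SET Status = 'Dead' WHERE id = $id");
} else {
    $filename = $id . '.html';
    file_put_contents(FILE_SAVE_DIR . $filename, $content);
    $LocalDB->query("UPDATE Downloadlist
                     SET Status = 'Doing', filename = '$filename',
                         FileSize = " . strlen($content) . ", Doingaddtime = NOW()
                     WHERE id = $id");
    include $task['Handler'];                    // run the parser named by the Handler field
    $LocalDB->query("UPDATE Downloadlist SET Status = 'Done', Doneaddtime = NOW() WHERE id = $id");
}
$LocalDB->close();
?>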
4. proxy.php (maintains the valid proxy list)
There are two checks:
1. Open a socket connection to the proxy's address and port, with a connection timeout of 3 seconds (a proxy that cannot be connected within 3 seconds is not considered further).
If the connection succeeds, measure the connection time, update the Sockettime field of the proxy's record, and use it to decide whether the proxy's status is bad, good, or perfect.
2. For proxies that are not bad, do a test download: if downloading a file through the proxy yields the same result as downloading the same file without the proxy, the proxy is genuine and usable.
(The code is omitted; a minimal sketch of the first check follows.)
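A minimal sketch of the first check, assuming the LocalDB wrapper; the threshold used to separate good from Perfect is an assumption (only the 3-second timeout comes from the description above):

<?php
// Hypothetical sketch of the socket check in proxy.php.
require_once dirname(__FILE__) . '/lib/classlocaldb.inc.php';

$LocalDB = new LocalDB();
$proxies = $LocalDB->getAll("SELECT id, Proxy FROM Proxylist");

foreach ($proxies as $p) {
    list($host, $port) = explode(':', $p['Proxy']);
    $start = microtime(true);
    $fp = @fsockopen($host, (int)$port, $errno, $errstr, 3);   // 3-second connection timeout
    $elapsed = microtime(true) - $start;

    if ($fp === false) {
        $status = 'bad';                                       // unreachable within 3 seconds
    } else {
        fclose($fp);
        $status = ($elapsed < 0.5) ? 'Perfect' : 'good';       // assumed threshold
    }

    $LocalDB->query("UPDATE Proxylist
                     SET Status = '$status', Sockettime = $elapsed, Lasttime = NOW()
                     WHERE id = " . (int)$p['id']);
}
$LocalDB->close();
?>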
Multi-machine distributed acquisition:
Only one machine runs main.sh, once every 2 minutes.
The other machines run assign.sh once a minute; assign.php fetches and completes tasks according to the Localservername field in the Downloadlist table.
The Localservername value is assigned when main.php loads a collection task, and it can also be adjusted automatically according to the load on each collection machine.
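For example, this schedule could be expressed with cron entries like the following; the install path and the extra housekeeping intervals are assumptions:

# On the single scheduling machine: run main.sh every 2 minutes
*/2 * * * * /home/snatch/main.sh
# On every collection machine: run assign.sh every minute
* * * * * /home/snatch/assign.sh
# Housekeeping jobs (intervals are assumed, not from the original)
0 * * * * /home/snatch/proxy.sh
0 3 * * * /home/snatch/delovertimedata.sh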
Log:
The download and analysis logs are written into the log directory, so it is easy to see whether the data is being collected and whether the analysis programs are still working, and when an error occurs the likely location and time can be found quickly.
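For illustration, classlog.inc.php could look roughly like this; the class name, method names, and log line format are assumptions:

<?php
// Hypothetical logging helper for lib/classlog.inc.php:
// one log file per site under the log directory, one timestamped line per event.
class Log {
    private $fp;
    public function __construct($siteName) {
        $this->fp = fopen(dirname(__FILE__) . '/../log/' . $siteName . '.log', 'a');
    }
    public function write($taskId, $message) {
        fwrite($this->fp, date('Y-m-d H:i:s') . "\t" . $taskId . "\t" . $message . "\n");
    }
    public function close() {
        fclose($this->fp);
    }
}

// Usage example: $log = new Log('website1'); $log->write(123, 'download ok, 5321 bytes'); $log->close();
?>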
This is a little complicated, and I have only written down the general idea; the page analysis part is not covered here, although it is also very important.
Back-end management is not discussed either.
Once it is set up it works very well: as long as you add more collection machines, building something like Qihoo is no problem.
My previous company's collection system used this architecture, collecting a total of 143 channels of content from Sina, Tom, 163 and other sites.
It was also used for accurate collection and analysis of paid data from a number of websites (which, of course, requires simulated logins).
It has been running quite stably.
A stable distributed data acquisition architecture based on Linux and PHP