Looking back at a document I wrote years ago, I could not help smiling: partly at how naive the design was, and partly at how earnest and conscientious I was at the time.
This is the design of the data capture part of a system. If I were to design it today, it would certainly look completely different. The system is no longer running, so I can safely share it here, if only to raise a smile.
XXXX Site Monitoring: Detailed Information Crawler Design
Version: V0.1
Author: XX
Date: 2010-9-13
Requirements
Given a video title and page URL, visit the specific page and capture the video's properties and information.
Functional requirements
Overview
Input
Provide a WebService interface that the master calls to start a crawler task asynchronously.
Output
1. After a request is received normally, start the task and, once it has started successfully, return to the master immediately.
2. Invoke the callback interface provided by the master when the task completes or fails.
3. After a successful crawl, save the crawled data to the database.
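The accept-then-callback contract in the list above can be sketched roughly as follows. This is only an illustration in present-day Python 3, not the original ZSI service (which targeted Python 2.5). The names `start_task`, `crawl`, and `notify_master` are hypothetical, and `notify_master` is a plain function standing in for the SOAP call to the master's callback interface.

```python
import threading

ACCEPTED = 0  # synchronous return code "Accept Success" from the interface design


def start_task(session_id, crawl, notify_master):
    """Start `crawl` in a background thread and return immediately.

    `crawl` and `notify_master` are hypothetical callables: the first does
    the actual work, the second stands in for the master's callback interface.
    """
    def worker():
        try:
            crawl()
            status = 0   # success
        except Exception:
            # Placeholder failure code; the callback table defines the real ones.
            status = -1
        notify_master(session_id, status)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    # The thread is returned only so that a caller (or a test) can join it.
    return ACCEPTED, thread
```

The caller gets `0` back before the crawl has finished, and the master learns the outcome later through the callback, whether the task succeeded or failed.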
Error handling
If a crawl exception occurs, report the cause of the error to the master and write it to the log.
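One way to read this requirement is as a single wrapper that turns each kind of failure into a status code for the master and a log entry. The sketch below assumes Python 3 and uses the status values listed in the callback table of the interface design. The logger name `vinfoant`, the `MatchError` exception, and the `run_and_report` helper are hypothetical names chosen for illustration.

```python
import logging

log = logging.getLogger("vinfoant")  # hypothetical logger name

STATUS_OK = 0        # success
STATUS_NETWORK = -1  # network condition anomaly
STATUS_MATCH = -2    # regular expression match error


class MatchError(Exception):
    """Hypothetical exception raised when a required field cannot be matched."""


def run_and_report(task, report):
    """Run `task`; log any failure and pass a status code to `report`."""
    try:
        task()
        status = STATUS_OK
    except OSError as exc:
        # In Python 3, urllib.error.URLError and socket timeouts are OSError subclasses.
        log.error("crawl failed: %s", exc)
        status = STATUS_NETWORK
    except MatchError as exc:
        log.error("content matching failed: %s", exc)
        status = STATUS_MATCH
    report(status)
    return status
```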
Concurrency requirements
The module supports multithreaded concurrent calls.
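Supporting concurrent calls mostly means that any state shared between tasks has to be protected. A minimal sketch, assuming Python 3 and a bounded thread pool; `TaskRegistry`, `run_concurrently`, and the pool size of 4 are illustrative assumptions, not details from the original design.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class TaskRegistry:
    """Thread-safe record of the sessions currently being crawled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()
        self.finished = []

    def run(self, session_id, crawl):
        with self._lock:
            self._active.add(session_id)
        try:
            crawl(session_id)
        finally:
            # Always clear the session, even if the crawl raised.
            with self._lock:
                self._active.discard(session_id)
                self.finished.append(session_id)

    def active_count(self):
        with self._lock:
            return len(self._active)


def run_concurrently(registry, session_ids, crawl, workers=4):
    """Run one crawl per session on a bounded pool and wait for all of them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for sid in session_ids:
            pool.submit(registry.run, sid, crawl)
```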
Site investigation
Analysis of the HTML source of each XXXX website shows that the page interface and structure are basically the same. However, there are a number of subtle "version" differences between sites, and even between videos on the same site, so a sound crawl plan must be worked out during development to adapt to the variations in each site's video pages.
Interface design
The interface is a WebService using the SOAP protocol.
version
http://xxxxx/netvideo/bokecc/vinfoant/version?wsdl
INPUT: none
OUTPUT: the current version, such as 0.1
Crawl
http://xxxxx/netvideo/bokecc/vinfoant/run?wsdl
INPUT:
| Parameter | Description | Type | Note |
|---|---|---|---|
| Dbip | Database IP address | String | |
| Dbport | Database port | Integer | |
| dbname | Database name | String | |
| Dbuser | Database user name | String | |
| Dbpw | Database password | String | |
| VideoID | ID of the video to crawl | Integer | |
| SessionId | Task ID | Integer | |
| Timeout | Timeout | Integer | |
| Callbackaddr | Callback address | String | Address that is called after the task completes |
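The parameter table maps naturally onto a small record type. A minimal sketch, assuming Python 3 dataclasses (not available in the Python 2.5 this design targeted); the class name `CrawlRequest`, the snake_case field names, and the validation rules are illustrative assumptions rather than part of the original interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRequest:
    db_ip: str          # Dbip
    db_port: int        # Dbport
    db_name: str        # dbname
    db_user: str        # Dbuser
    db_pw: str          # Dbpw
    video_id: int       # VideoID
    session_id: int     # SessionId
    timeout: int        # Timeout (the unit is not stated in the source)
    callback_addr: str  # Callbackaddr

    def __post_init__(self):
        # Assumed sanity checks; the original document does not specify any.
        if not 0 < self.db_port < 65536:
            raise ValueError("Dbport must be a valid TCP port")
        if self.timeout <= 0:
            raise ValueError("Timeout must be positive")
```

Validating the request up front lets the service reject malformed calls before a crawler thread is started.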
OUTPUT:
Response (returned synchronously by the WebService call):
| Id | Description |
|---|---|
| 0 | Accepted successfully |
| -1 | Master identity error, access denied |
| -2 | Database connection failed |
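The three synchronous response codes suggest two checks before a task is accepted. The sketch below assumes Python 3; `accept`, the `allowed_masters` set of IP addresses, and the `connect_db` callable are hypothetical, since the source does not say how the master is authenticated or which database driver is used.

```python
ACCEPT_OK = 0      # accepted successfully
ERR_MASTER = -1    # master identity error, access denied
ERR_DATABASE = -2  # database connection failed


def accept(caller_ip, allowed_masters, connect_db):
    """Return the synchronous response code for a run request.

    `allowed_masters` (a set of IPs) and `connect_db` (a callable that raises
    on failure) are assumptions made for this illustration.
    """
    if caller_ip not in allowed_masters:
        return ERR_MASTER
    try:
        connect_db()
    except Exception:
        return ERR_DATABASE
    return ACCEPT_OK
```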
Callback (returned asynchronously through a WebService call):
| Parameter | Description | Type | Note |
|---|---|---|---|
| SessionId | Task ID | Integer | |
| Status | Task completion status | Integer | 0: success; -1: network anomaly; -2: regular expression match error |
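Status -2 comes from the content matching step. Because pages exist in several subtle "versions", one possible approach is to keep an ordered list of regular expressions per field and try them in turn. This is a sketch in Python 3 using the standard `re` module; the field names, the patterns, and the names `FIELD_PATTERNS`, `MatchError`, and `extract_fields` are all invented for illustration and do not come from the original crawler.

```python
import re

# Hypothetical patterns: each field lists one regex per known page "version",
# tried in order. Real patterns would come from analysing each site's HTML.
FIELD_PATTERNS = {
    "title": [
        re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
        re.compile(r"<title>(.*?)</title>", re.S),
    ],
    "plays": [
        re.compile(r'id="playCount">(\d+)<'),
        re.compile(r"Plays:\s*(\d+)"),
    ],
}


class MatchError(Exception):
    """Raised when no pattern version matches a required field."""


def extract_fields(html, field_patterns=FIELD_PATTERNS):
    """Return {field: value}, trying each pattern version in turn."""
    result = {}
    for field, patterns in field_patterns.items():
        for pattern in patterns:
            match = pattern.search(html)
            if match:
                result[field] = match.group(1).strip()
                break
        else:
            # No version matched: this is the case reported as status -2.
            raise MatchError(field)
    return result
```

Supporting a new page version then means appending one pattern to the relevant list rather than changing the matching logic.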
Process design
1. Overall crawl task process
(Flowchart not available.)
2. Content matching process
(Flowchart not available.)
Log design
| Log entry | Level | Recorded information |
|---|---|---|
| WebService interface invoked | Info | Caller IP and each interface parameter |
| Master authentication failed | Warn | Caller IP |
| Start building/updating database connection pool | Info | Database parameters |
| Database connection failed | Error, notify | Reason for failure |
| Database connection succeeded | Info | |
| Crawler task started | Debug | |
| Start crawling web page | Info | URL |
| Web page crawl timeout | Warn | Current number of retries |
| Web page crawl exception | Warn | Cause of the exception |
| Web page crawl failed within the retry limit | Error, notify | |
| Web page crawl succeeded | Debug | |
| Start content matching | Info | |
| Regular expression match failed | Error, notify | Failed field, reason for failure |
| Regular expression match succeeded | Debug | |
| Start updating the database | Info | |
| SQL operation | Debug | SQL statement |
| Database update complete | Debug | |
| Database write exception | Error, notify | Currently executing SQL statement, reason for the exception |
| Task succeeded | Info | |
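The page-crawl rows of the log design can be illustrated with a retry loop that logs at the listed levels. This sketch assumes Python 3 and the standard `logging` module. How "notify" was delivered is not stated in the source, so `NotifyHandler` simply collects error messages where real code might send mail or raise an alert; `fetch_with_retry` and its `fetch` argument are likewise hypothetical.

```python
import logging


class NotifyHandler(logging.Handler):
    """Collects records at Error level, standing in for a real notification."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.sent = []

    def emit(self, record):
        self.sent.append(record.getMessage())


def fetch_with_retry(fetch, url, max_retries, log):
    """Call `fetch(url)` up to `max_retries` times, logging each outcome."""
    log.info("Start crawling web page: %s", url)
    for attempt in range(1, max_retries + 1):
        try:
            page = fetch(url)
        except TimeoutError:
            log.warning("Web page crawl timeout, attempt %d", attempt)
        except Exception as exc:
            log.warning("Web page crawl exception: %s", exc)
        else:
            log.debug("Web page crawl succeeded")
            return page
    # Only this entry reaches NotifyHandler, because it is logged at Error level.
    log.error("Web page crawl failed within the retry limit")
    return None
```

Attaching the handler at Error level keeps the timeout and exception warnings in the ordinary log while only the final failure triggers a notification.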
Technology selection
Development platform: Windows XP
Deployment platform: cross-platform
Programming language: Python 2.5
IDE + plug-in: MyEclipse 7.0 + PyDev
Python technologies used:
| Function | Technology |
|---|---|
| Web crawling | urllib2 |
| Content parsing (regular expressions) | re |
| WebService | ZSI 2.0 |
| SOAP protocol | SOAPpy (ZSI dependency) |
| XML | PyXML (ZSI dependency) |
| Web server | ZSI's built-in SOAP server or Apache |
| Publishing and deployment | Windows platform: py2exe |