A design document I wrote after one year on the job

Source: Internet
Author: User

Looking back at this document, which I wrote myself, I could not help smiling: partly at how naive the design was at the time, and partly at how earnest and conscientious I was about the work back then.


This is the design of the data capture part of a system. If I were to design it today, it would certainly look completely different. The system is no longer running, so I can safely share the document here, just for a smile.



XXXX Site Monitoring: Detailed Information Crawler Design

Version: V0.1

Author: XX

Time: 2010-9-13

Requirements

Given a video title and page URL, visit the specific page and capture the video's attributes and information.

Functional requirements

Overview

Input

Provide a WebService interface that the master calls to start the crawler task asynchronously.

Output

1. Once the request has been received and the task has started successfully, return to the master immediately.

2. Invoke the callback interface provided by the master after the task completes or fails.

3. After a successful crawl, save the crawled data to the database.
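The accept-then-callback behaviour described above can be sketched as follows. This is a minimal illustration, not the original implementation: `start_crawl_task`, `crawl`, and `callback` are hypothetical names, and the failure status used here is only a placeholder.

```python
import threading

ACCEPTED = 0  # synchronous "accept success" code returned to the master


def start_crawl_task(session_id, crawl, callback):
    """Start crawl() on a worker thread and return to the caller at once.

    crawl and callback are hypothetical callables standing in for the real
    crawl-and-save logic and for the master's callback interface.
    """
    def worker():
        try:
            crawl()
            status = 0    # task succeeded
        except Exception:
            status = -1   # placeholder failure code for this sketch
        # The master is told about the outcome in both cases.
        callback(session_id, status)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return ACCEPTED, thread
```

Because each request gets its own worker thread, several crawl tasks can run at the same time; a thread pool would be a natural refinement if the number of concurrent tasks needs a limit.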

Error handling

If a crawl exception occurs, the cause of the error should be reported to the master and written to the log.

Concurrency requirements

The module supports multithreaded concurrent calls.

Investigation findings

Analysis of the HTML source of each XXXX website shows that the page layout and structure are basically the same. However, there are a number of subtle "version" differences between sites, and even between videos on the same site. A sound crawl plan therefore has to be worked out during development so that the crawler can adapt to changes in each site's video pages.

Interface design

The interface is a WebService that uses the SOAP protocol.

• version

http://xxxxx/netvideo/bokecc/vinfoant/version?wsdl

INPUT: null

OUTPUT: current version, such as 0.1

• Crawl

http://xxxxx/netvideo/bokecc/vinfoant/run?wsdl

INPUT:

Parameter      Description                         Type      Note
Dbip           Database IP to connect to           String
Dbport         Database port to connect to         Integer
dbname         Database name                       String
Dbuser         Database user name                  String
Dbpw           Database password                   String
VideoID        ID of the video to crawl            Integer
SessionId      Task ID                             Integer
Timeout        Timeout                             Integer
Callbackaddr   Callback address                    String    Address that is called after the task completes
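As an illustration, the input parameters listed above could be carried in a small typed record. This is only a sketch: the class name `RunRequest` and the checks in `problems()` are assumptions, since the document does not state any validation rules or the unit of `Timeout`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRequest:
    """Parameters of the crawl interface, named as in the list above."""
    Dbip: str
    Dbport: int
    dbname: str
    Dbuser: str
    Dbpw: str
    VideoID: int
    SessionId: int
    Timeout: int
    Callbackaddr: str

    def problems(self):
        """Return validation problems as strings (empty list if none).

        These checks are assumptions made for the sketch.
        """
        found = []
        if not 0 < self.Dbport < 65536:
            found.append("Dbport out of range")
        if self.Timeout <= 0:
            found.append("Timeout must be positive")
        if not self.Callbackaddr:
            found.append("Callbackaddr is empty")
        return found
```

Checking the request before any thread is started keeps malformed calls from ever reaching the crawler.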

OUTPUT:

Response (returned synchronously by the WebService call):

Id    Description
0     Accept success
-1    Master identity error, no access
-2    Database connection failed
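The synchronous return codes could be produced by a check like the one below. It is a sketch under stated assumptions: the document does not say how the master's identity is verified, so the IP allowlist (`allowed_masters`) is hypothetical, as are the function name and the injected `connect_db` callable.

```python
ACCEPT_SUCCESS = 0
MASTER_IDENTITY_ERROR = -1
DB_CONNECTION_FAILED = -2


def synchronous_response(caller_ip, allowed_masters, connect_db):
    """Return the code that the WebService call hands back immediately.

    allowed_masters is a hypothetical allowlist of master IPs; connect_db is
    any callable that raises an exception when the database is unreachable.
    """
    if caller_ip not in allowed_masters:
        return MASTER_IDENTITY_ERROR
    try:
        connect_db()
    except Exception:
        return DB_CONNECTION_FAILED
    return ACCEPT_SUCCESS
```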

Callback (returned asynchronously by a WebService call):

Parameter    Description               Type      Note
SessionId    Task ID                   Integer
Status       Task completion status    Integer   0 success
                                                 -1 Network condition anomaly
                                                 -2 Regular expression match error
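One way to derive the callback status is to map exception types to the codes above. The sketch below assumes a hypothetical `MatchError` exception for failed regular expression matches and uses Python 3's `urllib.error`; the original Python 2.5 code would have used `urllib2` instead. Exceptions outside these two groups are deliberately left to propagate, because the document defines no status code for them.

```python
import socket
import urllib.error

STATUS_SUCCESS = 0
STATUS_NETWORK_ERROR = -1
STATUS_REGEX_ERROR = -2


class MatchError(Exception):
    """Hypothetical exception raised when a field cannot be matched."""


def run_and_get_status(task):
    """Run task() and translate its outcome into a callback status code."""
    try:
        task()
    except MatchError:
        return STATUS_REGEX_ERROR
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        return STATUS_NETWORK_ERROR
    return STATUS_SUCCESS
```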


Process Design

1. Overall crawl task process

[Flowchart not available in the source]

2. Content matching process

[Flowchart not available in the source]

Log design

Log entry                                          Level           Recorded information
WebService interface is invoked                    Info            Caller IP and each interface parameter
Master authentication failed                       Warn            Caller IP
Start building/updating database connection pool   Info            Database parameters
Database connection failed                         Error, notify   Reason for failure
Database connection succeeded                      Info
Start the crawler task                             Debug
Start crawling web page                            Info            URL
A web page crawl timed out                         Warn            Current number of retries
A web page crawl raised an exception               Warn            Cause of the exception
Web page crawl failed within the retry limit       Error, notify
Web page crawl succeeded                           Debug
Start content matching                             Info
Regular expression match failed                    Error, notify   Failed field, reason for failure
Regular expression match succeeded                 Debug
Start updating the database                        Info
SQL operation                                      Debug           SQL statement
Database update complete                           Debug
Database write exception                           Error, notify   Currently executing SQL statement, reason for the exception
Task succeeded                                     Info
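The log levels above map directly onto Python's standard `logging` module. The sketch below is an assumption about how the "notify" entries might be handled: the document does not say what notification means, so `NotifyHandler` simply collects flagged messages where a real handler might send mail. The logger name `vinfoant` is borrowed from the interface URL.

```python
import logging


class NotifyHandler(logging.Handler):
    """Collects error records flagged with notify=True (hypothetical)."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.sent = []

    def emit(self, record):
        if getattr(record, "notify", False):
            self.sent.append(record.getMessage())


def build_logger(name="vinfoant"):
    """Create a logger with console output plus the notify handler.

    Intended to be called once per logger name.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    notifier = NotifyHandler()
    logger.addHandler(logging.StreamHandler())
    logger.addHandler(notifier)
    return logger, notifier


logger, notifier = build_logger()
logger.info("WebService interface is invoked: ip=%s", "10.0.0.1")
logger.error("Database connection failed: %s", "timeout", extra={"notify": True})
```

Passing `extra={"notify": True}` keeps the notification decision at the call site, so the same Error level can be used with or without an alert.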

Technology selection

Development platform: Windows XP

Deployment platform: cross-platform

Programming language: Python 2.5

IDE + plug-in: MyEclipse 7.0 + PyDev

The specific Python technologies used:

Function                                 Technology selection
Web crawl                                urllib2
Content parsing, regular expressions     re
WebService                               ZSI 2.0
SOAP protocol                            SOAPpy (ZSI dependency)
XML                                      PyXML (ZSI dependency)
Web server                               ZSI's built-in SOAP server, or Apache
Publishing and deployment                Windows platform: py2exe
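To show how the crawl and parsing pieces fit together, here is a minimal sketch written for Python 3, where `urllib.request` replaces the `urllib2` module named above. The patterns in `TITLE_PATTERNS`, the function names, and the default of three retries are all hypothetical; the idea of trying one pattern per known page "version" follows the investigation findings earlier in the document.

```python
import re
import urllib.request

# Hypothetical patterns, one per known page "version", tried in order.
TITLE_PATTERNS = [
    re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    re.compile(r"<title>(.*?)</title>", re.S),
]


def fetch(url, timeout):
    """Download a page and decode it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(url, timeout, retries=3, fetch_page=fetch):
    """Try fetch_page up to `retries` times, re-raising the last error."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return fetch_page(url, timeout)
        except OSError as error:  # URLError and socket timeouts are OSErrors
            last_error = error
    raise last_error


def match_first(patterns, html):
    """Return the first capture from the first pattern that matches, or None."""
    for pattern in patterns:
        found = pattern.search(html)
        if found:
            return found.group(1).strip()
    return None
```

Keeping the patterns in an ordered list means that supporting a new page variant only requires adding a pattern, without touching the matching logic.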

