Go Crawler Software: Pholcus


Pholcus

Pholcus (ghost spider) is a distributed, high-concurrency, heavyweight crawler software written in pure Go, aimed at Internet data collection. It is a powerful crawler tool focused on customizable rules, intended for users with some Go or JavaScript programming experience.

It supports three operating modes (standalone, server, and client) and three operating interfaces (Web, GUI, and command line). Rules are simple and flexible, batches of tasks can run concurrently, and output modes are rich (MySQL, MongoDB, CSV, Excel, etc.), with a large number of shared demos. In addition, it supports both horizontal and vertical crawl modes, as well as a series of advanced features such as simulated login and pausing or canceling tasks.

    • Official QQ group: Go Big Data 42731170

Crawler principle

Framework Features

    1. A full-featured, heavyweight crawler tool focused on customizable rules, for users with some Go or JavaScript programming experience;

    2. Supports three operating modes: standalone, server, and client;

    3. Three operating interfaces: GUI (Windows only), Web, and CMD, selectable via a startup parameter;

    4. Supports state control, such as pause, resume, and stop;

    5. The collection volume can be limited;

    6. The number of concurrent workers can be controlled;

    7. Supports concurrent execution of multiple collection tasks;

    8. Supports a proxy IP list, with a controllable rotation frequency;

    9. Supports random pauses during collection to simulate human behavior;

    10. Provides custom configuration input interfaces according to the needs of each rule;

    11. Five output modes: MySQL, MongoDB, CSV, Excel, and raw file download;

    12. Supports batch output, with a controllable quantity per batch;

    13. Supports both static Go and dynamic JS collection rules, both horizontal and vertical crawl modes, and includes a large number of demos;

    14. Persists success records for easy automatic deduplication;

    15. Serializes failed requests and supports deserialization for automatic retry;

    16. Uses the surfer high-concurrency downloader, supporting the GET/POST/HEAD methods and the HTTP/HTTPS protocols. It offers two modes (a fixed User-Agent with automatic cookie saving, or randomized mass User-Agents with cookies disabled) to closely simulate browser behavior and enable simulated login;

    17. Server/client mode uses the teleport high-concurrency socket API framework, with full-duplex long-connection communication and JSON as the internal data transmission format.

Download and Installation

    1. Download the third-party dependency package source code and place it in the GOPATH/src directory [click Download ZIP]

    2. Download or update the source code from the command line as follows

go get -u -v github.com/henrylee2cn/pholcus

Note: the publicly maintained spider rule library for Pholcus is at https://github.com/pholcus/spider_lib

Create a project

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/pholcus/spider_lib" // this is a library of publicly maintained spider rules
    // _ "spider_lib_pte" // you can also freely add your own rule library
)

func main() {
    // Set the default runtime operating interface and start running.
    // Before running the software, you can set the -a_ui parameter to
    // "web", "gui" or "cmd" to specify the operating interface for this
    // run; "gui" is only supported on Windows.
    exec.DefaultRun("web")
}

Compile run

Normal compilation

cd {{replace with your GOPATH}}/src/github.com/henrylee2cn/pholcus
go install    (or: go build)

Compiling with a hidden CMD window under Windows

cd {{replace with your GOPATH}}/src/github.com/henrylee2cn/pholcus
go install -ldflags="-H windowsgui"    (or: go build -ldflags="-H windowsgui")

To view the optional parameters:

pholcus -h

The Web version's operating interface looks as follows:

The GUI version's mode-selection interface looks as follows:

An example of setting run parameters in the CMD version:

$ pholcus -a_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 -a_dockercap=5000 -a_pause=300 -a_proxyminute=0 -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true

Run-time Catalog files

├─pholcus                   software
│
├─pholcus_pkg               runtime file directory
│  ├─config.ini             configuration file
│  │
│  ├─proxy.lib              proxy IP list file
│  │
│  ├─spiders                dynamic rules directory
│  │  └─xxx.pholcus.html    dynamic rule file
│  │
│  ├─phantomjs              PhantomJS program file
│  │
│  ├─text_out               text data output directory
│  │
│  ├─file_out               file results output directory
│  │
│  ├─logs                   log directory
│  │
│  ├─history                history directory
│  │
│  └─cache                  temporary cache directory

Dynamic Rule Examples

Features: rules are loaded dynamically, so there is no need to recompile the software; they are easy to write and can be added freely, making them suitable for lightweight collection projects.
xxx.pholcus.html

<Spider>
    <Name>HTML dynamic rule example</Name>
    <Description>HTML dynamic rule example [Auto Page] [http://xxx.xxx.xxx]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>false</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace>
        <Script></Script>
    </Namespace>
    <SubNamespace>
        <Script></Script>
    </SubNamespace>
    <Root>
        <Script param="ctx">
        console.log("Root");
        ctx.JsAddQueue({
            Url: "http://xxx.xxx.xxx",
            Rule: "Login page"
        });
        </Script>
    </Root>
    <Rule name="Login page">
        <AidFunc>
            <Script param="ctx,aid">
            </Script>
        </AidFunc>
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.JsAddQueue({
                Url: "http://xxx.xxx.xxx",
                Rule: "After login",
                Method: "POST",
                PostData: "username=44444444@qq.com&amp;password=44444444&amp;login_btn=login_btn&amp;submit=login_btn"
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="After login">
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            ctx.JsAddQueue({
                Url: "http://accounts.xxx.xxx/member",
                Rule: "Personal center",
                Header: {
                    "Referer": [ctx.GetUrl()]
                }
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="Personal center">
        <ParseFunc>
            <Script param="ctx">
            console.log("Personal center: " + ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

Static rule examples

Features: compiled together with the software, allowing deeper customization and higher efficiency; suitable for heavyweight collection projects.
xxx.go

func init() {
    Spider{
        Name:        "Static rule example",
        Description: "Static rule example [Auto Page] [http://xxx.xxx.xxx]",
        Pausetime:   300,
        // Limit:    LIMIT,
        // Keyin:    KEYIN,
        EnableCookie:    true,
        NotDefaultField: false,
        Namespace:       nil,
        SubNamespace:    nil,
        RuleTree: &RuleTree{
            Root: func(ctx *Context) {
                ctx.AddQueue(&request.Request{Url: "http://xxx.xxx.xxx", Rule: "Login page"})
            },
            Trunk: map[string]*Rule{
                "Login page": {
                    ParseFunc: func(ctx *Context) {
                        ctx.AddQueue(&request.Request{
                            Url:      "http://xxx.xxx.xxx",
                            Rule:     "After login",
                            Method:   "POST",
                            PostData: "username=123456@qq.com&password=123456&login_btn=login_btn&submit=login_btn",
                        })
                    },
                },
                "After login": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                        ctx.AddQueue(&request.Request{
                            Url:    "http://accounts.xxx.xxx/member",
                            Rule:   "Personal center",
                            Header: http.Header{"Referer": []string{ctx.GetUrl()}},
                        })
                    },
                },
                "Personal center": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                    },
                },
            },
        },
    }.Register()
}

FAQ

In the request queue, are duplicate URLs automatically deduplicated?

URLs are deduplicated by default, but deduplication can be skipped for a given request by setting Reloadable = true on the request.

If the page content that a URL points to has been updated, does the framework have a mechanism to detect this?

The framework cannot detect page content updates directly, but users can implement such checks themselves within their rules.

Is request success judged by the HTTP status code in the response header?

No. Success is judged not by the status code but by whether the server returned a response stream at all. This means that even a 404 page counts as a success.

What is the retry mechanism after a request fails?

Each URL is tried a specified number of times; if it still fails, the request is appended to a special deferred queue. After the current task ends normally, the deferred requests are automatically added back to the download queue and downloaded again. Any that still fail are saved to the failure history. The next time the crawler rule runs, these failed requests can be loaded back into the deferred queue by selecting "inherit historical failure records", and the cycle repeats.

List of contributors

Contributor       Contribution
henrylee2cn       software author
kas               PhantomJS kernel in the surfer downloader
wang898jian       participated in writing the full manual

Third-party dependency packages

go get github.com/pholcus/spider_lib
go get github.com/henrylee2cn/teleport
go get github.com/PuerkitoBio/goquery
go get github.com/robertkrimen/otto
go get github.com/andybalholm/cascadia
go get github.com/lxn/walk
go get github.com/lxn/win
go get github.com/go-sql-driver/mysql
go get github.com/jteeuwen/go-bindata/...
go get github.com/elazarl/go-bindata-assetfs/...
go get gopkg.in/mgo.v2

< the following may require a proxy to download from mainland China >
go get golang.org/x/net/html
go get golang.org/x/text/encoding
go get golang.org/x/text/transform

(Thanks to all of the open-source projects above for their support!)
