Go Crawler Software: Pholcus


Pholcus

Pholcus (ghost spider) is a distributed, high-concurrency, heavyweight crawler software written in pure Go, aimed at Internet data collection. It is a powerful crawler tool focused on customizable rules, intended for users with some Go or JavaScript programming experience.

It supports three operating modes (standalone, server, and client) and three operating interfaces (Web, GUI, and command line). Rules are simple and flexible, batches of tasks can run concurrently, and output modes are rich (MySQL, MongoDB, CSV, Excel, etc.), with a large number of shared demos. In addition, it supports both horizontal and vertical crawl modes, as well as a series of advanced features such as simulated login and pausing or canceling tasks.

    • Official QQ group: Go Big Data 42731170

Crawler principle

Framework Features

    1. A full-featured, heavyweight crawler tool focused on customizable rules, for users with some Go or JavaScript programming experience;

    2. Supports three operating modes: standalone, server, and client;

    3. Three operating interfaces: GUI (Windows only), Web, and CMD, selectable via a startup parameter;

    4. Supports state control, such as pause, resume, and stop;

    5. The collection volume can be limited;

    6. The number of concurrent workers can be controlled;

    7. Supports concurrent execution of multiple collection tasks;

    8. Supports a proxy IP list, with a controllable rotation frequency;

    9. Supports random pauses during collection to simulate human behavior;

    10. Provides custom configuration input interfaces according to the needs of each rule;

    11. Five output modes: MySQL, MongoDB, CSV, Excel, and raw file download;

    12. Supports batch output, with a controllable quantity per batch;

    13. Supports both static Go and dynamic JS collection rules, both horizontal and vertical crawl modes, and includes a large number of demos;

    14. Persists success records for easy automatic deduplication;

    15. Serializes failed requests and supports deserialization for automatic retry;

    16. Uses the surfer high-concurrency downloader, supporting the GET/POST/HEAD methods and the HTTP/HTTPS protocols. It offers two modes (a fixed User-Agent with automatic cookie saving, or randomized mass User-Agents with cookies disabled) to closely simulate browser behavior and enable simulated login;

    17. Server/client mode uses the teleport high-concurrency socket API framework, with full-duplex long-connection communication and JSON as the internal data transmission format.

Download and Installation

    1. Download the third-party dependency package source code and place it in the GOPATH/src directory [click Download ZIP]

    2. Download or update the source code from the command line as follows

go get -u -v github.com/henrylee2cn/pholcus

Note: the publicly maintained spider rule library for Pholcus is at https://github.com/pholcus/spider_lib

Create a project

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/pholcus/spider_lib" // this is a library of publicly maintained spider rules
    // _ "spider_lib_pte" // you can also freely add your own rule library
)

func main() {
    // Set the default runtime operating interface and start running.
    // Before running the software, you can set the -a_ui parameter to
    // "web", "gui" or "cmd" to specify the operating interface for this
    // run; "gui" is only supported on Windows.
    exec.DefaultRun("web")
}

Compile run

Normal compilation

cd {{replace with your GOPATH}}/src/github.com/henrylee2cn/pholcus
go install    (or: go build)

Compiling with a hidden CMD window under Windows

cd {{replace with your GOPATH}}/src/github.com/henrylee2cn/pholcus
go install -ldflags="-H windowsgui"    (or: go build -ldflags="-H windowsgui")

To view the optional parameters:

pholcus -h

The Web version's operating interface looks as follows:

The GUI version's mode-selection interface looks as follows:

An example of setting run parameters in the CMD version:

$ pholcus -a_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 -a_dockercap=5000 -a_pause=300 -a_proxyminute=0 -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true

Run-time Catalog files

├─pholcus                   software
│
├─pholcus_pkg               runtime file directory
│  ├─config.ini             configuration file
│  │
│  ├─proxy.lib              proxy IP list file
│  │
│  ├─spiders                dynamic rules directory
│  │  └─xxx.pholcus.html    dynamic rule file
│  │
│  ├─phantomjs              PhantomJS program file
│  │
│  ├─text_out               text data output directory
│  │
│  ├─file_out               file results output directory
│  │
│  ├─logs                   log directory
│  │
│  ├─history                history directory
│  │
│  └─cache                  temporary cache directory

Dynamic Rule Examples

Features: rules are loaded dynamically, so there is no need to recompile the software; they are easy to write and can be added freely, making them suitable for lightweight collection projects.
xxx.pholcus.html

<Spider>
    <Name>HTML dynamic rule example</Name>
    <Description>HTML dynamic rule example [Auto Page] [http://xxx.xxx.xxx]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>false</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace>
        <Script></Script>
    </Namespace>
    <SubNamespace>
        <Script></Script>
    </SubNamespace>
    <Root>
        <Script param="ctx">
        console.log("Root");
        ctx.JsAddQueue({
            Url: "http://xxx.xxx.xxx",
            Rule: "Login page"
        });
        </Script>
    </Root>
    <Rule name="Login page">
        <AidFunc>
            <Script param="ctx,aid">
            </Script>
        </AidFunc>
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.JsAddQueue({
                Url: "http://xxx.xxx.xxx",
                Rule: "After login",
                Method: "POST",
                PostData: "username=44444444@qq.com&amp;password=44444444&amp;login_btn=login_btn&amp;submit=login_btn"
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="After login">
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            ctx.JsAddQueue({
                Url: "http://accounts.xxx.xxx/member",
                Rule: "Personal center",
                Header: {
                    "Referer": [ctx.GetUrl()]
                }
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="Personal center">
        <ParseFunc>
            <Script param="ctx">
            console.log("Personal center: " + ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

Static rule examples

Features: compiled together with the software, allowing deeper customization and higher efficiency; suitable for heavyweight collection projects.
xxx.go

func init() {
    Spider{
        Name:        "Static rule example",
        Description: "Static rule example [Auto Page] [http://xxx.xxx.xxx]",
        Pausetime:   300,
        // Limit:    LIMIT,
        // Keyin:    KEYIN,
        EnableCookie:    true,
        NotDefaultField: false,
        Namespace:       nil,
        SubNamespace:    nil,
        RuleTree: &RuleTree{
            Root: func(ctx *Context) {
                ctx.AddQueue(&request.Request{Url: "http://xxx.xxx.xxx", Rule: "Login page"})
            },
            Trunk: map[string]*Rule{
                "Login page": {
                    ParseFunc: func(ctx *Context) {
                        ctx.AddQueue(&request.Request{
                            Url:      "http://xxx.xxx.xxx",
                            Rule:     "After login",
                            Method:   "POST",
                            PostData: "username=123456@qq.com&password=123456&login_btn=login_btn&submit=login_btn",
                        })
                    },
                },
                "After login": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                        ctx.AddQueue(&request.Request{
                            Url:    "http://accounts.xxx.xxx/member",
                            Rule:   "Personal center",
                            Header: http.Header{"Referer": []string{ctx.GetUrl()}},
                        })
                    },
                },
                "Personal center": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                    },
                },
            },
        },
    }.Register()
}

FAQ

In the request queue, are duplicate URLs automatically deduplicated?

URLs are deduplicated by default, but deduplication can be skipped for a given request by setting Reloadable = true on the request.

If the page content that a URL points to has been updated, does the framework have a mechanism to detect this?

The framework cannot detect page content updates directly, but users can implement such checks themselves within their rules.

Is request success judged by the HTTP status code in the response header?

No. Success is judged not by the status code but by whether the server returned a response stream at all. This means that even a 404 page counts as a success.

What is the retry mechanism after a request fails?

Each URL is tried a specified number of times; if it still fails, the request is appended to a special deferred queue. After the current task ends normally, the deferred requests are automatically added back to the download queue and downloaded again. Any that still fail are saved to the failure history. The next time the crawler rule runs, these failed requests can be loaded back into the deferred queue by selecting "inherit historical failure records", and the cycle repeats.

List of contributors

Contributor       Contribution
henrylee2cn       software author
kas               PhantomJS kernel in the surfer downloader
wang898jian       participated in writing the full manual

Third-party dependency packages

go get github.com/pholcus/spider_lib
go get github.com/henrylee2cn/teleport
go get github.com/PuerkitoBio/goquery
go get github.com/robertkrimen/otto
go get github.com/andybalholm/cascadia
go get github.com/lxn/walk
go get github.com/lxn/win
go get github.com/go-sql-driver/mysql
go get github.com/jteeuwen/go-bindata/...
go get github.com/elazarl/go-bindata-assetfs/...
go get gopkg.in/mgo.v2

< the following may require a proxy to download from mainland China >
go get golang.org/x/net/html
go get golang.org/x/text/encoding
go get golang.org/x/text/transform

(Thanks to all of the open-source projects above for their support!)
