Golang-Proxy: a high-performance, high-anonymity proxy crawler with a built-in API


Golang-Proxy v2.0


Golang-Proxy is a simple and efficient free proxy crawler: by scraping free proxies exposed on the internet, it maintains a pool of high-anonymity proxies for web crawling, resource downloading and other uses. Still writing proxy crawlers in Python? Try Golang! An out-of-the-box binary release is available, so no programming background is required.

What's new in V2.0?

    1. No longer depends on MySQL or NSQ!
    2. Previously you had to start publisher, consumer and assessor separately; now you only need to start the main program!
    3. A highly flexible API: after starting the main program, open localhost:9999/all or localhost:9999/random in a browser to get the captured proxies directly! You can even use localhost:9999/sql?query= to execute SQL statements and define custom proxy filter rules!
    4. Out-of-the-box binaries for Windows, Linux and Mac!
      Download Release v2.0

Installation

1. By compiling the source code

go get github.com/storyicon/golang-proxy

Enter the golang-proxy directory and run go build main.go to generate the binary executable.

Attention:
During go build, an error like cannot find package "github.com/gocolly/colly" in any of ... may appear when a dependency is missing. Run go get on the package path shown in the message.

# For example, if go build main.go reports:
# business\publisher.go:8:2: cannot find package "github.com/gocolly/colly" in any of:
#     F:\Go\src\github.com\gocolly\colly (from $GOROOT)
#     D:\golang\src\github.com\gocolly\colly (from $GOPATH)
# then run:
go get github.com/gocolly/colly

If this feels like too much trouble, you can use the out-of-the-box version provided on the release page.

2. Out-of-the-box version

The release page provides compressed packages for several system environments; extract one and run the executable.

Out-of-the-box version download address: Download Release v2.0

3. Tips

The ./source directory in the project root is required at run time: it stores the various website source definitions, while the other folders hold the project source code. After compilation, you can move the binary main together with the source folder anywhere, and the main file can be renamed arbitrarily.
If the program reports that the source folder cannot be found, specify its path with the -source= parameter when running the program, for example:

# replace xxx with the relative or absolute path of the source folder
main -source=xxx

API interface

After the program is running, you can retrieve the proxies stored in the database through the following endpoints in a browser.

1. Randomly get a proxy

URL: http://localhost:9999/random

Example response:

{
    // status code: 0 means success, 1 means error
    "code": 0,
    "message": [{
        "id": 124,
        "content": "http://190.2.144.133:1080",
        // number of assessments; the higher, the longer the proxy has stayed alive
        "assess_times": 13,
        // number of successful assessments; success_times/assess_times gives the success rate
        "success_times": 11,
        // average response time, in seconds
        "avg_response_time": 2.0831538461538464,
        // consecutive failed assessments, an important factor in the score calculation
        "continuous_failed_times": 0,
        // the higher the score, the better the proxy quality
        "score": 3.2747991296955083,
        // insertion timestamp (seconds)
        "insert_time": 1532324791,
        // update timestamp (seconds)
        "update_time": 1532414960
    }]
}
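As a sketch of consuming this endpoint from Go: the struct below mirrors the example response above, and the helper builds an http.Client routed through the returned proxy. The inlined response literal (normally fetched from http://localhost:9999/random) and the field subset are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// randomResponse mirrors the /random response shown above.
type randomResponse struct {
	Code    int `json:"code"`
	Message []struct {
		Content      string  `json:"content"`
		AssessTimes  int     `json:"assess_times"`
		SuccessTimes int     `json:"success_times"`
		Score        float64 `json:"score"`
	} `json:"message"`
}

// pickProxy decodes a /random response body and returns the first proxy URL.
func pickProxy(body []byte) (string, error) {
	var r randomResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	if r.Code != 0 || len(r.Message) == 0 {
		return "", fmt.Errorf("no proxy returned (code=%d)", r.Code)
	}
	return r.Message[0].Content, nil
}

// clientVia builds an http.Client whose traffic goes through the given proxy.
func clientVia(proxyAddr string) (*http.Client, error) {
	u, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(u)},
		Timeout:   10 * time.Second,
	}, nil
}

func main() {
	// In a real program this body would come from
	// http.Get("http://localhost:9999/random"); this is the trimmed
	// example response from this section.
	body := []byte(`{"code":0,"message":[{"content":"http://190.2.144.133:1080","assess_times":13,"success_times":11,"score":3.27}]}`)

	addr, err := pickProxy(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("picked proxy:", addr)

	// Any request made with this client is routed through the proxy.
	if _, err := clientVia(addr); err != nil {
		panic(err)
	}
}
```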

2. Get all available proxies

URL: http://localhost:9999/all

3. Execute SQL

URL: http://localhost:9999/sql?query=xxxx

Replace xxxx with the SQL statement to execute. The program maintains two tables:

valid_proxy — highly available proxies
crawl_proxy — a cache of crawled proxies (quality not guaranteed)

For example, http://localhost:9999/sql?query=SELECT * FROM VALID_PROXY WHERE 1 ORDER BY SCORE DESC returns all available proxies sorted by score in descending order.
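SQL statements contain spaces and other characters that are not legal in a URL, so the query should be percent-encoded before being placed in the query string. A minimal sketch in Go, building the endpoint URL with the standard library (the host and table name follow this section):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSQLEndpoint returns the /sql endpoint URL for a given statement,
// percent-encoding it so spaces, asterisks etc. survive the query string.
func buildSQLEndpoint(stmt string) string {
	return "http://localhost:9999/sql?query=" + url.QueryEscape(stmt)
}

func main() {
	fmt.Println(buildSQLEndpoint("SELECT * FROM valid_proxy ORDER BY score DESC"))
}
```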

Why use Golang-Proxy?

    1. Stable and fast.
      The capture module can reach 1000 pages/second of concurrency on a single core.
    2. Highly configurable and extensible.
      Without writing any code, you can add a new site source in two minutes by filling in a configuration file.
    3. Evaluation function.
      The Assessor module periodically tests proxy quality and produces a comprehensive score from independent factors such as test success rate, anonymity level, test count, variation and response speed. The algorithm is highly configurable, so you can adjust the factor weights to suit your project.
    4. A highly flexible API: after starting the main program, open localhost:9999/all or localhost:9999/random in a browser to get the captured proxies directly! You can even use localhost:9999/sql?query= to execute SQL statements and define custom proxy filter rules!
    5. No dependency on any service or database; one-click download, out of the box!

How to configure a new source

All files in .yml format under ./source are sources. You can add new sources; you can also prefix a file name with a . to make the program ignore that source, or simply delete the file, in which case the source is gone for good. Source parameter description:

# Page settings
page:
    entry: "https://xxx/1.html"
    template: "https://xxx/{page}.html"
    from: 2
    to: 10
# The publisher will first crawl entry, i.e. https://xxx/1.html,
# then, following template, from and to, crawl in order:
#   https://xxx/2.html
#   https://xxx/3.html
#   https://xxx/4.html
#   ...
#   https://xxx/10.html

# Selector settings
selector:
    iterator: ".table tbody tr"
    ip: "td:nth-child(1)"
    port: "td:nth-child(2)"
    scheme: "td:nth-child(3)"
    filter: ""
# The configuration above crawls the following HTML structure:
# <table class="table">
#   <tbody>
#     <tr>
#       <td>187.3.0.1</td>
#       <td>8080</td>
#       <td>http</td>
#     </tr>
#     <tr>
#       <td>164.23.1.2</td>
#       <td>80</td>
#       <td>https</td>
#     </tr>
#     <tr>
#       <td>131.9.2.3</td>
#       <td>8080</td>
#       <td>http</td>
#     </tr>
#   </tbody>
# </table>
# selector uses universal jQuery-style selectors. iterator is the loop
# object, e.g. a table row with one proxy per row; ip, port and scheme
# are looked up as sub-elements relative to the iterator selector.
# If scheme is empty, or its element cannot be found, the type defaults
# to HTTP.

category:
    # number of parallel fetches
    parallelnumber: 1
    # for this source, after each page is fetched,
    # wait a random 5~20s before fetching the next page
    delayRange: [5, 20]
    # how often to activate this source
    # @every 10s, @every 10h...
    interval: "@every 10m"
debug: true
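Putting the three sections together, a complete source file could look like the following (the site URL and selectors are hypothetical; the field names follow the sections above):

```yaml
# ./source/example-site.yml -- hypothetical source definition
page:
    entry: "https://free-proxy.example.com/list/1.html"
    template: "https://free-proxy.example.com/list/{page}.html"
    from: 2
    to: 5
selector:
    iterator: ".table tbody tr"
    ip: "td:nth-child(1)"
    port: "td:nth-child(2)"
    scheme: "td:nth-child(3)"
    filter: ""
category:
    parallelnumber: 1
    delayRange: [5, 20]
    interval: "@every 10m"
debug: false
```

Renaming this file to .example-site.yml would make the program skip it without deleting the definition.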

Request for Comments

    1. Report any questions you run into via issues
    2. If you find a useful new source, please submit it to share
    3. Leave a little Star before you go :)