How to use OCR image recognition to bypass ZR's anti-spider strategy for rental prices

Source: Internet
Author: User
Tags: decrypt

A.D. 2015 28th Autumn

On a September afternoon, a breeze blew against the window. From the 24th floor, the distant white clouds looked like puffs of cotton candy floating in the air, and the bell of the church two blocks away struck for the 13th time.

X sat at his desk. The two-tier desk was covered with all kinds of comics, and next to the computer lay the "Neon Genesis Evangelion" volumes he had recently dug out of an old box. He looked out of the window, closed his eyes, and took a deep breath.

He had to move again. Every season he changed his home and his identity. Being surrounded by strangers made him feel safe, because then no one would discover his secret.

He opened the ZR rental site, and the search results page of listings appeared before his eyes. The rooms that suited him were always on ZR, but X still had to pick the best one among them.

A spider is a great way to do that, and goquery is a good tool for writing one.

A sharp tool: goquery

goquery uses DOM (CSS) selector syntax. If you use Chrome, extracting the selector for an element is very easy:

Chrome: right-click -> Inspect -> select the DOM element you need -> right-click its code -> Copy -> Copy selector

Installation

go get github.com/PuerkitoBio/goquery

How to use

Read the page content and generate a document

res, e := http.Get(url)
if e != nil {
    // handle e
}
defer res.Body.Close()
doc, e := goquery.NewDocumentFromReader(res.Body)
if e != nil {
    // handle e
}

Use a selector to select page content

doc.Find("#houseList > li").Each(func(i int, selection *goquery.Selection) {
    // House name
    houseName := selection.Find("div.txt > h3 > a").Text()
    fmt.Println(houseName)
})

Or select an element directly

// Get the latitude and longitude
houseLat, _ := doc.Find("#mapsearchText").Attr("data-lat")
houseLng, _ := doc.Find("#mapsearchText").Attr("data-lng")

Anti-Spider strategy

On most pages a price is implemented as a plain element such as <span>$5880</span>, but ZR does it differently.
Its price is rendered by shifting a background image through the CSS style. For example, the price ¥2960 is implemented as:

<span style="background-position:1000px" class="num rmb">¥</span><span style="background-position:-90px" class="num"></span><span style="background-position:-210px" class="num"></span><span style="background-position:-0px" class="num"></span><span style="background-position:-240px" class="num"></span>

The image address is

images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png

The content of the picture is

6532148907

So for this scenario, X proposed a scheme: convert the offsets directly into the price.
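X's first scheme can be sketched in Go roughly as follows. This is a minimal sketch, not the author's code: the 30 px width per digit is an assumption inferred from the sample markup (offsets of -90, -210, -0 and -240 px), and decodeSpans is a hypothetical helper name. It divides each negative background-position offset by the digit width and uses the result as an index into the digit string read from the image.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// digitWidth is the ASSUMED width in pixels of one digit in the price image.
const digitWidth = 30

// negOffset matches negative pixel offsets such as "background-position:-90px".
// The currency span uses a positive offset (1000px) and is skipped on purpose.
var negOffset = regexp.MustCompile(`background-position:-(\d+)px`)

// decodeSpans maps every negative background-position offset found in html
// to the digit at the corresponding index of digits.
func decodeSpans(html, digits string) (string, error) {
	price := ""
	for _, m := range negOffset.FindAllStringSubmatch(html, -1) {
		px, err := strconv.Atoi(m[1])
		if err != nil {
			return "", err
		}
		idx := px / digitWidth
		if px%digitWidth != 0 || idx >= len(digits) {
			return "", fmt.Errorf("unexpected offset %dpx", px)
		}
		price += string(digits[idx])
	}
	return price, nil
}

func main() {
	html := `<span style="background-position:1000px" class="num rmb">¥</span>` +
		`<span style="background-position:-90px" class="num"></span>` +
		`<span style="background-position:-210px" class="num"></span>` +
		`<span style="background-position:-0px" class="num"></span>` +
		`<span style="background-position:-240px" class="num"></span>`
	price, err := decodeSpans(html, "6532148907")
	fmt.Println(price, err)
}
```

The digit string is passed in as a parameter rather than hard-coded, because the mapping only holds for one particular image.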

The idea is sound, but careful testing showed that the image address changes on every visit, and the order of the digits in the image changes with it.

At this point, X's mouth curled into a smile. It was obvious:

    1. The image changes on every visit
    2. The price offset data changes as well
    3. For the displayed price to stay the same on every visit, there must be a conversion rule that maps the offsets to the digits of the current image

Conclusion

    1. The offsets cannot be converted with a fixed, hard-coded mapping
    2. Find the conversion rule between the image offsets and the price

So X used Chrome's search to find the JS code that modifies the price elements.

var ROOM_PRICE = {"image":"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png","offset":[[3,7,0,8],[3,7,0,8],[2,3,0,8],[2,3,7,8],[2,0,2,8],[3,7,2,8],[2,2,7,8],[2,2,2,8],[3,6,7,8],[3,6,7,8],[2,8,2,8],[2,8,7,8],[3,7,0,8],[2,4,7,8],[2,5,7,8],[2,8,7,8],[2,5,7,8],[2,3,7,8]]};
// You can skip this part: it simply picks the characters out of the image
// by array index, following the ROOM_PRICE rules above
$('#houseList p.price').each(function(i){
    var dom = $(this);
    if(!ROOM_PRICE['offset'] || !ROOM_PRICE['offset'][i]) return ;
    var pos = ROOM_PRICE['offset'][i];
    for(i in pos){
        var inx = pos.length -i -1;
        var seg = $('<span>', {'style':'background-position:-'+(pos[inx]*offset_unit)+'px', 'class':'num'});
        dom.prepend(seg);
    }
    var seg = $('<span>', {'style':'background-position:1000px', 'class':'num rmb'}).html('¥');
    dom.prepend(seg);
});

This code exposes the entire anti-crawler strategy. The designer was, in fact, quite clever.
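For a spider, the first step is to pull the ROOM_PRICE object out of the page source. The following is a minimal sketch using only the Go standard library. The RoomPrice type and the extractRoomPrice helper are hypothetical names, and the regular expression assumes that the object is assigned in the form shown above and contains no nested braces.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// RoomPrice mirrors the ROOM_PRICE object embedded in the listing page.
type RoomPrice struct {
	Image  string  `json:"image"`
	Offset [][]int `json:"offset"`
}

// roomPriceRe captures the JSON literal assigned to ROOM_PRICE.
// ASSUMPTION: the literal contains no nested "{...}" objects.
var roomPriceRe = regexp.MustCompile(`var ROOM_PRICE\s*=\s*(\{[^}]*\})`)

// extractRoomPrice finds the ROOM_PRICE assignment in page and decodes it.
func extractRoomPrice(page string) (RoomPrice, error) {
	var rp RoomPrice
	m := roomPriceRe.FindStringSubmatch(page)
	if m == nil {
		return rp, errors.New("ROOM_PRICE not found")
	}
	err := json.Unmarshal([]byte(m[1]), &rp)
	return rp, err
}

func main() {
	page := `<script>var ROOM_PRICE = {"image":"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png","offset":[[3,7,0,8],[2,3,0,8]]};</script>`
	rp, err := extractRoomPrice(page)
	fmt.Println(rp.Image, rp.Offset, err)
}
```

Each row of Offset belongs to one listing on the page, in the same order as the #houseList p.price elements that the jQuery code iterates over.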

OCR decryption with Tesseract

X wrote the following in his notebook:

    1. We have the decryption rule.
    2. We have the image to decrypt.
    3. We need to extract the image content as a string and hand it to the program for decryption.

So X reached for another sharp tool: Tesseract

Introduction

Tesseract is an optical character recognition engine that runs on a wide range of operating systems. It is free software released under the Apache License, and its development has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open source optical character recognition engines.

The word "tesseract" itself means a hypercube.

Installation

See the wiki: https://github.com/tesseract-...

Installation on a Mac is simple

brew install tesseract

After installation, try recognizing an image

➜  go tesseract ~/Desktop/7d9a5bb074a89f93a5b4e82bea5dc872s.png stdout --psm 6
2436851907
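OCR output may contain surrounding whitespace and can occasionally include a misread character, so a small sanity check before decoding is worthwhile. The sketch below assumes, based on the two sample images in this article, that each price image contains every digit from 0 to 9 exactly once. cleanDigits is a hypothetical helper and is not part of Tesseract or gosseract.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanDigits removes all whitespace from raw OCR output and verifies that
// the result is a permutation of the ten digits 0-9 (an ASSUMPTION about
// the price images, taken from the samples in the article).
func cleanDigits(raw string) (string, error) {
	s := strings.Join(strings.Fields(raw), "")
	if len(s) != 10 {
		return "", fmt.Errorf("expected 10 digits, got %q", s)
	}
	var seen [10]bool
	for _, r := range s {
		if r < '0' || r > '9' || seen[r-'0'] {
			return "", fmt.Errorf("not a permutation of 0-9: %q", s)
		}
		seen[r-'0'] = true
	}
	return s, nil
}

func main() {
	fmt.Println(cleanDigits("2436851907\n"))
}
```

If the check fails, re-downloading the image and running OCR again is safer than decoding prices from a string that is known to be wrong.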

Install the Go Package

go get -v -t github.com/otiai10/gosseract

How to use

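Here is the code; calling it is very convenient

package ocr

import (
    "fmt"
    "io"
    "net/http"
    "os"

    "github.com/otiai10/gosseract"
)

// Parse downloads the price image and returns the text recognized by Tesseract.
func Parse(imageUrl string) string {
    // Download the image
    resp, e := http.Get(imageUrl)
    if e != nil {
        fmt.Println("error:", e)
        return ""
    }
    defer resp.Body.Close()

    // Save it to a local file so that Tesseract can read it
    f, e := os.Create("s.png")
    if e != nil {
        fmt.Println("error:", e)
        return ""
    }
    defer f.Close()
    if _, e = io.Copy(f, resp.Body); e != nil {
        fmt.Println("error:", e)
        return ""
    }

    // Run OCR on the saved image
    client := gosseract.NewClient()
    defer client.Close()
    client.SetImage("./s.png")
    text, e := client.Text()
    if e != nil {
        fmt.Println("error:", e)
    }
    return text
}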
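One detail to watch when calling Parse: the image value in the ROOM_PRICE object shown earlier is protocol-relative (it starts with //), while http.Get needs an absolute URL with a scheme. A minimal sketch of resolving it against the page URL with the standard net/url package follows. resolveImageURL is a hypothetical helper, and http://www.xxxx.com/ stands in for the real address of the listing page.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveImageURL turns a possibly protocol-relative image reference into an
// absolute URL, using the URL of the page it was found on as the base.
func resolveImageURL(pageURL, ref string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	u, err := base.Parse(ref)
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	// The page URL below is a placeholder, not the real site address.
	fmt.Println(resolveImageURL(
		"http://www.xxxx.com/",
		"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png",
	))
}
```

The resolved string can then be passed to Parse unchanged.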

Finally, OCR decrypts the image content, and the conversion rule then turns it into the real price

/* 1 */
{
    "_id": ObjectId("5b88dfa8644d03deebc6ba6a"),
    "name": "Tian Tong yuan Zhong Yuan 4 bedroom-South bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/m00/66/82/chafb1ugm-kazfn6aaq6qyhgutc084.jpg_c_264_198_q80.jpg",
    "price": "2290",
    "url": "http://www.xxxx.com/z/vr/61676366.html",
    "size": "13㎡",
    "floor": "Level 5/6",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.077562", "116.432684"],
    "toilet": 0,
    "balcony": 1
}
/* 2 */
{
    "_id": ObjectId("5b88dfa9644d03deebc6ba6f"),
    "name": "Tian Tong yuan Zhong Yuan 4 bedroom-South bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/M00/4F/6E/Chafblt9cxaay-38aaryyleu6s4611.jpg_c_264_198_q80.jpg",
    "price": "2430",
    "url": "http://www.xxxx.com/z/vr/61663763.html",
    "size": "11.8㎡",
    "floor": "Level 5/10",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.077562", "116.4326"],
    "toilet": 0,
    "balcony": 1
}
/* 3 */
{
    "_id": ObjectId("5b88dfa9644d03deebc6ba74"),
    "name": "Tian Tong Yuan Ben San Qu 4 Bedroom-South Bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/m00/5a/e5/chafblubaa-apjrgaasbblkpt0w953.jpg_c_264_198_q80.jpg",
    "price": "2790",
    "url": "http://www.xxxx.com/z/vr/61666427.html",
    "size": "25.5㎡",
    "floor": "Level 6/6",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.066064", "116.426734"],
    "toilet": 0,
    "balcony": 1
}
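The conversion step that produces the price field can be sketched as follows. This is an illustration under stated assumptions, not the author's actual code: decodePrices is a hypothetical helper, digits is the string recognized from the price image, and offsets is the offset array from ROOM_PRICE. The jQuery code on the page walks each row backwards while prepending the spans, so the digits end up displayed in array order, and the sketch therefore reads each row from left to right.

```go
package main

import "fmt"

// decodePrices converts each offset row into a price string by using every
// offset as an index into digits, the text recognized from the price image.
func decodePrices(digits string, offsets [][]int) ([]string, error) {
	prices := make([]string, 0, len(offsets))
	for _, row := range offsets {
		buf := make([]byte, 0, len(row))
		for _, idx := range row {
			if idx < 0 || idx >= len(digits) {
				return nil, fmt.Errorf("offset %d out of range", idx)
			}
			buf = append(buf, digits[idx])
		}
		prices = append(prices, string(buf))
	}
	return prices, nil
}

func main() {
	// Sample values taken from the image content and ROOM_PRICE shown earlier.
	digits := "6532148907"
	offsets := [][]int{{3, 7, 0, 8}, {2, 3, 0, 8}, {2, 0, 2, 8}}
	fmt.Println(decodePrices(digits, offsets))
}
```

Because the image and the offsets change together on every visit, both inputs must come from the same page load.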

END

If ZR wants to rely on this approach for anti-crawling, it really only raises the threshold for crawlers a little, but the design is very clever. This was also my first time using Tesseract-OCR, and it is a sharp tool.
