A.D. 2015 28th Autumn
A September afternoon. A breeze blows against the window. From the 24th floor, the distant white clouds look like blossoms of cotton candy floating in the air, and the bell of the church two street corners away rings for the 13th time.
X sits at his desk. The two-tier desk is piled with all kinds of comics; next to the computer lies "Neon Genesis Evangelion", which he recently dug out of an old box. He looks out of the window, closes his eyes, and takes a deep breath.
He has to move again. Every season he changes places and changes identity; being surrounded by strangers makes him feel safe, because then no one will discover his secret.
He opens the ZR rental site, and the search results page of listings appears before his eyes. The houses that suit him are always the self-rented ones, but X still has to pick the best among them.
A spider (web crawler) is a great fit for this, and goquery is a good tool for writing one.
Sharp tool: goquery
goquery uses DOM (CSS) selector syntax. If you use Chrome, extracting an element's selector is very easy:
Chrome: right-click -> Inspect -> select the DOM element you need -> right-click its code -> Copy -> Copy selector
Installation
go get github.com/PuerkitoBio/goquery
How to use
Read the page content and generate a document
res, e := http.Get(url)
if e != nil {
    // handle e
}
defer res.Body.Close()
doc, e := goquery.NewDocumentFromReader(res.Body)
if e != nil {
    // handle e
}
Use a selector to select page content
doc.Find("#houseList > li").Each(func(i int, selection *goquery.Selection) {
    // house name
    houseName := selection.Find("div.txt > h3 > a").Text()
    fmt.Println(houseName)
})
Or select a single element directly
// get the latitude and longitude
houseLat, _ := doc.Find("#mapsearchText").Attr("data-lat")
houseLng, _ := doc.Find("#mapsearchText").Attr("data-lng")
Anti-Spider strategy
On most pages, a price element is implemented like this:
<span>$5880</span>
ZR, however, does it differently.
Its price is rendered by shifting a background image with CSS styles. For example, the price ¥2960 is implemented as:
<span style="background-position:1000px" class="num rmb">¥</span><span style="background-position:-90px" class="num"></span><span style="background-position:-210px" class="num"></span><span style="background-position:-0px" class="num"></span><span style="background-position:-240px" class="num"></span>
The image address is
images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png
The content of the picture is
6532148907
For this scenario, X proposes a scheme: convert each offset into a digit by looking it up in the image, and so recover the price.
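X's scheme can be sketched in Go as follows. This is only an illustration under assumptions: the 30px digit width (offsetUnit) is inferred from the sample offsets 90, 210, 0 and 240, the digit string is typed in by hand, and decodePrice is a hypothetical helper name rather than anything from the site or from goquery.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// offsetUnit is the assumed pixel width of one digit in the price image.
// It is inferred from the sample markup, not taken from the site's code.
const offsetUnit = 30

// positionRe matches the background-position of digit spans (class "num")
// and skips the currency span, whose class is "num rmb".
var positionRe = regexp.MustCompile(`background-position:-?(\d+)px" class="num"`)

// decodePrice maps every digit span in html to the character found at
// index offset/offsetUnit in digits, the string shown in the price image.
func decodePrice(html, digits string) (string, error) {
	price := make([]byte, 0, 4)
	for _, m := range positionRe.FindAllStringSubmatch(html, -1) {
		px, err := strconv.Atoi(m[1])
		if err != nil {
			return "", err
		}
		idx := px / offsetUnit
		if px%offsetUnit != 0 || idx >= len(digits) {
			return "", fmt.Errorf("unexpected offset %dpx", px)
		}
		price = append(price, digits[idx])
	}
	return string(price), nil
}

func main() {
	html := `<span style="background-position:1000px" class="num rmb">¥</span>` +
		`<span style="background-position:-90px" class="num"></span>` +
		`<span style="background-position:-210px" class="num"></span>` +
		`<span style="background-position:-0px" class="num"></span>` +
		`<span style="background-position:-240px" class="num"></span>`
	price, err := decodePrice(html, "6532148907")
	fmt.Println(price, err)
}
```

With the sample spans and the digit string 6532148907, the offsets map to indexes 3, 7, 0 and 8, so the sketch prints 2960.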
The idea is right, but careful testing shows that on every visit the image address changes, and the order of the digits in the image changes with it.
At this point a smile crosses X's lips. It is obvious:
- The image changes on every visit
- The price offset data changes too
- For the displayed price to stay the same every time, there must be a rule that maps the current image to the digits
Conclusion
- The price cannot be converted directly with a fixed mapping
- Find the conversion rule between this image, the offsets, and the price
So X uses Chrome's search to find the JS code that modifies the price element.
var ROOM_PRICE = {"image":"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png","offset":[[3,7,0,8],[3,7,0,8],[2,3,0,8],[2,3,7,8],[2,0,2,8],[3,7,2,8],[2,2,7,8],[2,2,2,8],[3,6,7,8],[3,6,7,8],[2,8,2,8],[2,8,7,8],[3,7,0,8],[2,4,7,8],[2,5,7,8],[2,8,7,8],[2,5,7,8],[2,3,7,8]]};
// No need to study this part closely: it simply picks the characters out of the image by array index, following the ROOM_PRICE rule above
$('#houseList p.price').each(function(i){
    var dom = $(this);
    if(!ROOM_PRICE['offset'] || !ROOM_PRICE['offset'][i]) return ;
    var pos = ROOM_PRICE['offset'][i];
    for(i in pos){
        var inx = pos.length -i -1;
        var seg = $('<span>', {'style':'background-position:-'+(pos[inx]*offset_unit)+'px', 'class':'num'});
        dom.prepend(seg);
    }
    var seg = $('<span>', {'style':'background-position:1000px', 'class':'num rmb'}).html('¥');
    dom.prepend(seg);
});
This code exposes the entire anti-crawler strategy: each price is a row of indexes into the digit image. The designer was quite clever, in fact.
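Since the rule lives in the page source, a crawler can read it without running any JavaScript. The sketch below pulls the ROOM_PRICE literal out of the page text with a regular expression and decodes it as JSON. It assumes the literal is valid JSON with no nested objects, as in the sample; roomPrice and extractRoomPrice are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

// roomPrice mirrors the ROOM_PRICE object embedded in the page.
type roomPrice struct {
	Image  string  `json:"image"`
	Offset [][]int `json:"offset"`
}

// roomPriceRe captures the object literal assigned to ROOM_PRICE.
// Assumption: the literal contains no nested objects, so the first
// closing brace ends it.
var roomPriceRe = regexp.MustCompile(`var ROOM_PRICE = (\{[^}]*\});`)

// extractRoomPrice finds ROOM_PRICE in the page source and decodes it.
func extractRoomPrice(page string) (roomPrice, error) {
	var rp roomPrice
	m := roomPriceRe.FindStringSubmatch(page)
	if m == nil {
		return rp, errors.New("ROOM_PRICE not found in page")
	}
	err := json.Unmarshal([]byte(m[1]), &rp)
	return rp, err
}

func main() {
	// A shortened copy of the sample script, for demonstration only.
	page := `<script>var ROOM_PRICE = {"image":"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png","offset":[[3,7,0,8],[2,3,0,8]]};</script>`
	rp, err := extractRoomPrice(page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rp.Image, rp.Offset)
}
```

The i-th row of Offset belongs to the i-th price element on the page, which matches how the site's script indexes ROOM_PRICE['offset'][i].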
Decrypting with OCR: Tesseract
X writes the following in his notebook:
- We have the decryption rule.
- We have the image to decrypt.
- We need to extract the image content as a string and hand it to the program for processing.
So he reaches for another sharp tool: Tesseract.
Introduction
Tesseract is an optical character recognition engine that supports a wide range of operating systems. Tesseract is free software released under the Apache License and has been sponsored by Google since 2006. In 2006, Tesseract was considered one of the most accurate open source optical character recognition engines.
A tesseract is also the name of a four-dimensional hypercube.
Installation
The wiki is here: https://github.com/tesseract-...
Installation on a Mac is simple:
brew install tesseract
After installation, you can try recognizing an image:
➜ go tesseract ~/Desktop/7d9a5bb074a89f93a5b4e82bea5dc872s.png stdout --psm 6
2436851907
Install the Go Package
go get -v -t github.com/otiai10/gosseract
How to use
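Straight to the code; calling it is very convenient:
package ocr

import (
    "fmt"
    "io"
    "net/http"
    "os"

    "github.com/otiai10/gosseract"
)

// Parse downloads the image at imageUrl, saves it as s.png, and returns
// the text that Tesseract recognizes in it (an empty string on failure).
func Parse(imageUrl string) string {
    resp, e := http.Get(imageUrl)
    if e != nil {
        fmt.Println("error")
        return ""
    }
    defer resp.Body.Close()

    f, e := os.Create("s.png")
    if e != nil {
        fmt.Println("error")
        return ""
    }
    _, e = io.Copy(f, resp.Body)
    f.Close() // close the file before Tesseract reads it
    if e != nil {
        fmt.Println("error")
        return ""
    }

    client := gosseract.NewClient()
    defer client.Close()
    client.SetImage("./s.png")
    text, e := client.Text()
    if e != nil {
        fmt.Println("error")
    }
    return text
}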
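One practical detail: the image field in ROOM_PRICE is protocol-relative (it starts with //), and http.Get needs an absolute URL. A small sketch of resolving it against the address of the page it came from, using only the standard library; absoluteImageURL is a hypothetical helper and the page URL in main is a made-up placeholder.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteImageURL resolves image (possibly protocol-relative or
// relative) against pageURL, the address of the page it was found on.
func absoluteImageURL(pageURL, image string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	abs, err := base.Parse(image) // resolves image relative to base
	if err != nil {
		return "", err
	}
	return abs.String(), nil
}

func main() {
	// Placeholder page URL, assumed for illustration only.
	abs, err := absoluteImageURL(
		"http://www.xxxx.com/",
		"//xxxx.com/phoenix/pc/images/price/0fcc0d83409c547d3a9d038cc7808fa3s.png",
	)
	fmt.Println(abs, err)
}
```

The resolved string can then be passed to a function like Parse, so the download uses the same scheme as the listing page.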
Finally, OCR decrypts the image content, and the conversion rule turns the result into the real prices:
/* 1 */
{
    "_id": ObjectId("5b88dfa8644d03deebc6ba6a"),
    "name": "Tian Tong yuan Zhong Yuan 4 bedroom-South bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/m00/66/82/chafb1ugm-kazfn6aaq6qyhgutc084.jpg_c_264_198_q80.jpg",
    "price": "2290",
    "url": "http://www.xxxx.com/z/vr/61676366.html",
    "size": "13㎡",
    "floor": "Level 5/6",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.077562", "116.432684"],
    "toilet": 0,
    "balcony": 1
}
/* 2 */
{
    "_id": ObjectId("5b88dfa9644d03deebc6ba6f"),
    "name": "Tian Tong yuan Zhong Yuan 4 bedroom-South bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/M00/4F/6E/Chafblt9cxaay-38aaryyleu6s4611.jpg_c_264_198_q80.jpg",
    "price": "2430",
    "url": "http://www.xxxx.com/z/vr/61663763.html",
    "size": "11.8㎡",
    "floor": "Level 5/10",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.077562", "116.4326"],
    "toilet": 0,
    "balcony": 1
}
/* 3 */
{
    "_id": ObjectId("5b88dfa9644d03deebc6ba74"),
    "name": "Tian Tong Yuan Ben San Qu 4 Bedroom-South Bedroom",
    "image": "http://img.xxxx.com/pic/house_images/g2m1/m00/5a/e5/chafblubaa-apjrgaasbblkpt0w953.jpg_c_264_198_q80.jpg",
    "price": "2790",
    "url": "http://www.xxxx.com/z/vr/61666427.html",
    "size": "25.5㎡",
    "floor": "Level 6/6",
    "room": "4 Rooms 1 Halls",
    "loc": ["40.066064", "116.426734"],
    "toilet": 0,
    "balcony": 1
}
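The conversion step, from OCR text plus offset rows to real prices, might look like the sketch below. It assumes the price image always holds exactly ten digits (both digit strings in this post are permutations of 0 to 9) and that the OCR output may carry stray whitespace; cleanDigits and pricesFromOffsets are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanDigits strips everything that is not an ASCII digit from the OCR
// output (for example a trailing newline) and checks that ten digits
// remain. The ten-digit rule is an assumption based on the samples.
func cleanDigits(ocrText string) (string, error) {
	digits := strings.Map(func(r rune) rune {
		if r >= '0' && r <= '9' {
			return r
		}
		return -1 // drop the character
	}, ocrText)
	if len(digits) != 10 {
		return "", fmt.Errorf("expected 10 digits, got %q", digits)
	}
	return digits, nil
}

// pricesFromOffsets turns each offset row into a price by using every
// offset as an index into digits.
func pricesFromOffsets(digits string, offsets [][]int) ([]string, error) {
	prices := make([]string, 0, len(offsets))
	for _, row := range offsets {
		var b strings.Builder
		for _, idx := range row {
			if idx < 0 || idx >= len(digits) {
				return nil, fmt.Errorf("offset %d out of range", idx)
			}
			b.WriteByte(digits[idx])
		}
		prices = append(prices, b.String())
	}
	return prices, nil
}

func main() {
	digits, err := cleanDigits("6532148907\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	prices, err := pricesFromOffsets(digits, [][]int{{3, 7, 0, 8}, {2, 3, 0, 8}})
	fmt.Println(prices, err)
}
```

Rejecting OCR output that does not contain ten digits is a cheap guard: a misread image then produces an error instead of a silently wrong price.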
END
If ZR relies on this approach for anti-crawling, it really only raises the threshold a little for crawlers, though the design is very clever. This was also my first time using Tesseract-OCR, a sharp tool indeed.