Objective:
Take Web page snapshots from the command line on a Debian server with no X server installed.
Software:
- Xvfb (a virtual X server that performs all rendering in memory) - provides the display environment needed for rendering on a machine with no X server installed (a sketch of wiring it up manually follows this list)
- CutyCapt (a headless Qt WebKit browser that downloads the Web page, renders the HTML and CSS, executes JavaScript, and takes a snapshot of the fully rendered page) - the main tool
- Qt (the framework CutyCapt is built on)
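To make the division of labor concrete, here is a minimal sketch of the same setup without the xvfb-run wrapper used below: start Xvfb on a spare display and point CutyCapt at it through DISPLAY (display :1 and the page/output names are arbitrary choices; this assumes CutyCapt has already been built as in step 1).
# start a virtual X server on display :1 with a 1024x768, 24-bit screen
Xvfb :1 -screen 0 1024x768x24 &
# tell CutyCapt to render against that virtual display
DISPLAY=:1 ./CutyCapt --url=http://www.zol.com.cn --out=zol.png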
Practice:
1. Install CutyCapt, Qt and related packages:
sudo apt-get install subversion libqt4-webkit libqt4-dev g++
svn co https://cutycapt.svn.sourceforge.net/svnroot/cutycapt
cd cutycapt/CutyCapt
qmake
make
2. Install Xvfb:
sudo apt-get install xvfb
3. Capture test:
xvfb-run --server-args="-screen 0, 1024x768x24" ./CutyCapt --url=http://www.zol.com.cn --out=zol.png
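Once the single-page test works, the same command drops into a small loop for capturing several pages in one run. A minimal sketch, assuming a file urls.txt with one URL per line (the file name and the output naming scheme are made up for illustration):
# capture every URL listed in urls.txt into its own numbered PNG
i=0
while read -r url; do
    i=$((i + 1))
    xvfb-run --server-args="-screen 0, 1024x768x24" \
        ./CutyCapt --url="$url" --out="snapshot-$i.png"
done < urls.txt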
Found that the Chinese text in the captured zol.com.cn page came out garbled.
4. After fiddling for half a day, it turned out that no Chinese fonts were installed. Installed Chinese fonts, captured again, and this time the text rendered correctly.
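For reference, a minimal sketch of installing Chinese fonts on a Debian system of that era; the original post does not say which font packages were used, so the WenQuanYi and AR PL UMing package names below are only an assumption (names vary by release):
# install common Chinese font packages and refresh the font cache
sudo apt-get install ttf-wqy-zenhei ttf-arphic-uming
fc-cache -fv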
Summary:
This basically achieves Web page snapshot capture from the Linux command line. However, CutyCapt's JavaScript handling is still limited: for example, Flash loaded through SWFObject is not rendered in the snapshot. Later I will try driving Firefox directly to do the rendering and capture.
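As a partial workaround for script-heavy pages, CutyCapt can be asked to wait after the page load before taking the shot. The --delay and --max-wait options appear in CutyCapt's usage output, though how much they help depends on the page; a sketch:
# wait 3 s after the load finishes, and give up after 30 s in total
xvfb-run --server-args="-screen 0, 1024x768x24" \
    ./CutyCapt --url=http://www.zol.com.cn --delay=3000 --max-wait=30000 --out=zol-delayed.png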
Reference Links:
http://cutycapt.sourceforge.net/
http://www.x.org/archive/X11R6.8.2/doc/Xvfb.1.html
http://www.yeeach.com/tag/screenshot/
http://hi.baidu.com/pkubuntu/blog/item/7dcc064ff0246a3eaec3abe2.html
http://qt.nokia.com/
http://en.wikipedia.org/wiki/Xvfb
Install Chinese fonts: http://hi.baidu.com/spiritualcity/blog/item/96369c2afa8740fde6cd40d2.html
Linux Chinese encoding support (zhcon): http://zhcon.sourceforge.net/index_cn.html