Reference for this article: http://tonl.iteye.com/blog/1918245
Python version: 2.7, 64-bit Windows build.
Download Python: http://www.python.org/getit/
Install it with the Python 2.7.5 Windows x86-64 Installer (Windows AMD64 / Intel 64 / X86-64 binary [1], does not include source).
First write the following spider.py script:
# -*- coding: utf-8 -*-
from urllib import urlopen
import os
import sys


class Spider:
    """Download web sites from the given file"""

    def __init__(self, filename, downloadpath):
        """Store the list file and the download directory; exit if either is missing"""
        if not os.path.isfile(filename):
            print 'The given file does not exist, the program will exit'
            sys.exit(0)
        else:
            self.fname = filename
        if not os.path.isdir(downloadpath):
            print 'Given download path does not exist, the program will exit'
            # Exit here as the message says; otherwise download() fails later
            sys.exit(0)
        else:
            self.dpath = downloadpath

    def download(self):
        """Download the web sites listed in the given file, one per line"""
        fp = open(self.fname, 'r')
        while True:
            line = fp.readline()
            if not line:
                break
            # Build the file name from the letters and digits of the address
            if 'html' in line:
                tempname = filter(str.isalnum, line).replace('html', '.html')
            else:
                tempname = filter(str.isalnum, line) + '.html'
            self.download_html(line, os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        """Download the HTML of the given web site and save it to filename"""
        response = urlopen(website)
        data = response.read()
        fp = open(filename, 'a+')
        fp.write(data)
        fp.close()


def test():
    """Test program"""
    filename = sys.argv[1]
    downloadpath = sys.argv[2]
    spider = Spider(filename, downloadpath)
    spider.download()


if __name__ == '__main__':
    test()
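The way the script turns an address into a file name can be tried on its own. The sketch below is only an illustration: page_filename is a hypothetical helper that is not part of spider.py, and it is written so that it should behave the same on Python 2.7 and Python 3.

```python
def page_filename(url):
    """Build a file name from a URL following the rule used in spider.py.

    page_filename is a hypothetical helper, not part of the original script.
    """
    # Keep only letters and digits; this also drops the trailing newline.
    name = "".join(ch for ch in url if ch.isalnum())
    if "html" in url:
        # Every occurrence of "html" becomes ".html", as in the script.
        return name.replace("html", ".html")
    return name + ".html"
```

With this rule, http://www.baidu.com is saved as httpwwwbaiducom.html. Two different addresses that contain the same letters and digits would map to the same file, and because the script opens files in append mode their contents would end up in one file.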
The script takes two parameters. The first is a file listing the addresses of the pages to download, one per line, typically like this (websites.txt):
http://blog.csdn.net/fansy1990
http://www.baidu.com
The second parameter is the directory in which the downloaded pages are stored.
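Because the list file is read line by line, every line is treated as an address, including a blank one. A minimal sketch of cleaning the list before downloading is shown here; read_urls is a hypothetical helper, not something the original script contains.

```python
def read_urls(lines):
    """Return the non-empty lines with surrounding whitespace removed.

    read_urls is a hypothetical helper, not part of spider.py.
    """
    urls = []
    for line in lines:
        url = line.strip()  # drop the trailing newline and stray spaces
        if url:             # skip blank lines
            urls.append(url)
    return urls
```

It accepts any iterable of strings, so an open file object for websites.txt could be passed to it directly.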
You can then run it from the command line:
python d:\spider.py d:\websites.txt d:\download_tmp
Then look in d:\download_tmp for the downloaded files; if they are there, the setup is correct.
Finally, write the following Java program. You need to import the jython-*.jar package (I downloaded version 2.2):
package test;

import java.io.IOException;

public class Pytest {
    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException, InterruptedException {
        String py_path = "d:\\spider.py";
        String websites = "d:\\websites.txt";
        String outdir = "d:\\tmp";
        // Start the Python interpreter as an external process and wait for it to finish
        Process pr = Runtime.getRuntime().exec("python " + py_path + " " + websites + " " + outdir);
        pr.waitFor();
        System.out.println("Done ...");
    }
}
To run the program above from Eclipse, set an environment variable in Eclipse: add a PATH variable whose value is the Python installation directory.
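The PATH setting matters because the Java program starts the interpreter by the bare name python, so it has to be found through the search path. The sketch below is a rough approximation of that lookup, not what Java or Windows literally does; find_on_path and its extensions parameter are assumptions made for illustration.

```python
import os


def find_on_path(name, path=None, extensions=("", ".exe")):
    """Return the first file called name (plus an optional extension)
    found in the directories of path, or None if there is none.

    find_on_path is a hypothetical helper for checking a PATH value.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # ignore empty entries such as a trailing separator
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None
```

Calling find_on_path("python") in the same environment should return the interpreter's location; a result of None suggests the PATH variable is not set as described.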
After running, you will see this message:
*sys-package-mgr*: can't create package cache dir, '*jython-2.2.jar\cachedir\packages'
This message can be ignored; it does not affect the running of the program.