Java calls Python to download web pages

Source: Internet
Author: User

First, write the following spider.py script:

[python]
# -*- coding: UTF-8 -*-
# Python 2 script: it relies on urllib.urlopen, the print statement and file().
import os
import sys
from urllib import urlopen


class Spider:
    "download web site from the given file"

    def __init__(self, filename, downloadPath):
        "init the filename; exit if the file or the download path does not exist"
        if not os.path.isfile(filename):
            print 'the given file does not exist, the program will exit'
            sys.exit(1)
        else:
            self.fname = filename
        if not os.path.isdir(downloadPath):
            print 'the given download path does not exist, the program will exit'
            sys.exit(1)
        else:
            self.dpath = downloadPath

    def download(self):
        "download the web site from the given file by line"
        fp = open(self.fname, 'r')
        while True:
            line = fp.readline()
            if not line:
                break
            # Build the file name from the letters and digits of the URL.
            if 'html' in line:
                tempname = filter(str.isalnum, line).replace('html', '.html')
            else:
                tempname = filter(str.isalnum, line) + '.html'
            # Strip the trailing newline before opening the URL.
            self.download_html(line.strip(), os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        "download the html by the given web site and save to name"
        response = urlopen(website)
        data = response.read()
        fp = file(filename, 'a+')
        fp.write(data)
        fp.close()


def test():
    "test program"
    filename = sys.argv[1]
    downloadPath = sys.argv[2]
    spider = Spider(filename, downloadPath)
    spider.download()


if __name__ == '__main__':
    test()

The script takes two arguments. The first is the address file listing the web pages to download, one URL per line. Its format is generally (websites.txt):

[plain]
http://blog.csdn.net/fansy1990
http://www.baidu.com

The second argument is the directory where the downloaded web pages are stored. Then run:

[plain]
python D:\spider.py D:\websites.txt D:\download_tmp

Look for the downloaded files under download_tmp on drive D. If they are there, the configuration is correct.

Finally, write the following Java program and import the jython-*.jar package (the author downloaded 2.2):

[java]
package test;

import java.io.IOException;

public class PyTest {
    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException, InterruptedException {
        String py_path = "D:\\spider.py";
        String websites = "D:\\websites.txt";
        String outDir = "D:\\tmp";
        // Run the script as an external process: python <script> <address file> <output dir>
        Process pr = Runtime.getRuntime().exec("python " + py_path + " " + websites + " " + outDir);
        pr.waitFor();
        System.out.println("done...");
    }
}

To run the program, set the environment in Eclipse by adding a PATH variable whose value is the Python installation directory. After running it, the following message is displayed:

[plain]
*sys-package-mgr*: can't create package cache dir, '*jython-2.2.jar\cachedir\packages'

This message can be ignored; it does not affect the program.
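The spider.py script above targets Python 2 (it uses urllib2, the print statement and file()), so it will not run unchanged if the `python` on the PATH is a Python 3 interpreter. The following is a minimal Python 3 sketch of the same logic, not the author's code. The function names `url_to_filename` and `download_all` are hypothetical, and two choices differ from the original and are assumptions: blank lines in the address file are skipped, and each output file is overwritten rather than appended to.

```python
import os
import sys
from urllib.request import urlopen


def url_to_filename(url):
    """Build a file name from a URL the same way the original script does."""
    # Keep only letters and digits; this also drops the trailing newline.
    name = "".join(ch for ch in url if ch.isalnum())
    if "html" in url:
        # "...a.html" is filtered to "...ahtml", then restored to "...a.html".
        return name.replace("html", ".html")
    return name + ".html"


def download_all(list_file, out_dir):
    """Download every URL listed in list_file (one per line) into out_dir."""
    with open(list_file, "r") as urls:
        for line in urls:
            url = line.strip()
            if not url:
                continue  # assumption: ignore blank lines
            target = os.path.join(out_dir, url_to_filename(url))
            # Assumption: overwrite ("wb") instead of the original append mode.
            with urlopen(url) as response, open(target, "wb") as out:
                out.write(response.read())


# Only download when called as: python spider3.py <address file> <output dir>
if __name__ == "__main__" and len(sys.argv) == 3:
    if not os.path.isfile(sys.argv[1]) or not os.path.isdir(sys.argv[2]):
        sys.exit("the address file or the download path does not exist")
    download_all(sys.argv[1], sys.argv[2])
```

With this naming rule, `http://www.baidu.com` is saved as `httpwwwbaiducom.html`. The Java program can call such a script in the same way, as long as the interpreter found through PATH matches the Python version the script is written for.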
