How to invoke Python code from IBM InfoSphere Streams

Source: Internet
Author: User


Overview



IBM InfoSphere Streams is high-performance middleware for real-time event processing. Its distinguishing strength is the ability to ingest structured and unstructured data from a variety of sources and analyze it in real time. It does this by combining an easy-to-use application development language called SPL (Streams Processing Language) with a distributed runtime platform. The middleware also provides a flexible application development framework for integrating code written in C++ and Java into Streams applications. Beyond C++ and Java, many developers who build production IT assets also use dynamic programming languages. Because of its strength in system integration, Python is a practical choice for many companies that need to build solutions quickly. If you have existing assets written in Python, there is a way to integrate that code into a Streams application. This article walks through the details using a simple Streams application example.



This article assumes that you are familiar with InfoSphere Streams and its SPL programming model. You should also be comfortable with general programming techniques and have working knowledge of C++ and Python.



InfoSphere Streams is an important component of IBM's big data platform strategy. Many of IBM's current and prospective customers who have Python assets and skills can use them together with InfoSphere Streams. This article is intended for readers with a technical interest in big data applications, including application designers, developers, and architects.



Sample Scenario



To explain the details of invoking Python from a Streams application, we will stick to a simple example. The scenario reads a few web addresses from an input CSV file and, for each one, invokes a simple user-written Python function that returns the details listed below. We then write the result for each web address to a separate output CSV file:



The primary host name of the URL



Alternate host names for the URL



List of IP addresses for the URL



The company name specified in the URL string
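Before Streams enters the picture, the data flow of this scenario can be sketched in plain Python. The sketch below is illustrative only: process_urls, fake_lookup, company_name, and the output column order are hypothetical choices made for this example, and the lookup is a stub so that the code runs without network access.

```python
import csv
import io


def fake_lookup(url):
    # Hypothetical stand-in for the real DNS lookup; returns
    # (primary host name, alternate host names, IP addresses).
    return (url, [], ["192.0.2.1"])


def company_name(url):
    # Hypothetical helper: take the label after "www." as the company name.
    parts = url.split(".")
    return parts[1] if len(parts) > 2 and parts[0] == "www" else ""


def process_urls(in_file, out_file, lookup=fake_lookup):
    # Read one URL per input row and write one result row per URL.
    writer = csv.writer(out_file)
    for row in csv.reader(in_file):
        if not row:
            continue
        url = row[0].strip()
        host, aliases, addresses = lookup(url)
        writer.writerow([url, host, " ".join(aliases),
                         " ".join(addresses), company_name(url)])


# In-memory files keep the example self-contained.
source = io.StringIO("www.ibm.com\nwww.example.com\n")
sink = io.StringIO()
process_urls(source, sink)
result_rows = sink.getvalue().splitlines()
```

In the Streams application, the reading and writing are handled by SPL, and only the lookup step is delegated to Python.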



Prerequisites



The code snippets that follow explain the implementation details of the scenario above. You can also download the sample code and run it on your own IBM InfoSphere Streams installation. The sample code has been tested in the following environment:



RedHat Enterprise Linux 6.1 or later (or equivalent CentOS version)



GCC version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC)



Python 2.6.6 (r266:84292, Apr 11 2011, 15:50:32, released with RHEL6)



/usr/lib/libpython2.6.so



/usr/include/python2.6 directory containing Python.h and other included files



IBM InfoSphere Streams 3.x configured with a valid Streams instance



The same techniques apply to slightly different environments (such as RHEL 5.8 and Streams 2.0.0.4), provided you make minor adjustments to the code or environment settings.
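Because the C++ side needs both Python.h and the Python shared library, it can be useful to ask the interpreter where it expects them. The following is a minimal sketch that uses the standard sysconfig module; that module is available from Python 2.7 and 3.2 onward, so on Python 2.6 the older distutils.sysconfig module would have to be used instead. The reported values depend on how the interpreter was built and may be None on some platforms.

```python
import sys
import sysconfig

# Directory that should contain Python.h for this interpreter.
include_dir = sysconfig.get_paths()["include"]
# Name of the Python library recorded at build time (may be None).
library_name = sysconfig.get_config_var("LDLIBRARY")
# Directory where that library was installed (may also be None).
library_dir = sysconfig.get_config_var("LIBDIR")

print("Python version : %d.%d.%d" % sys.version_info[:3])
print("Include dir    : %s" % include_dir)
print("Library        : %s" % library_name)
print("Library dir    : %s" % library_dir)
```

Compare the output with the paths listed above and adjust the build settings of the C++ project if they differ.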



High-level application components



Our simple scenario has three major components. Because each component is written in a different programming language, each is self-contained and can live in its own project:



UrlToIpAddress Python script



StreamsToPythonLib C++ project



streams-to-python SPL project



UrlToIpAddress is a Python script containing simple logic that uses the Python API to get the IP address and host name information for a given web address. The script can be tested on its own with the Python interpreter. Short as it is, it serves to demonstrate how a function in a Python script is invoked from a Streams application.



StreamsToPythonLib is a C++ project. It contains the source code of the SPL native function. Broadly, this code uses the Python/C API to embed Python in the C++ code and run it from there. The Python documentation explains in detail how to embed Python in C++ code. The project also contains a wrapper include (.h) file, which is important because it provides the entry point through which a Streams SPL application can invoke any C++ class method. All C++ logic in this project is compiled into a shared object library (.so) file for use by SPL applications.
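Conceptually, embedding amounts to three steps: import a module by name, look up a function on it by name, and call that function with arguments. In the Python/C API these steps are commonly performed with PyImport_Import, PyObject_GetAttrString, and PyObject_CallObject; whether this project uses exactly those calls is not shown in this section. The same sequence can be sketched in Python itself, where call_python_function is a hypothetical helper name and a standard-library function stands in for the user's script.

```python
import importlib


def call_python_function(module_name, function_name, *args):
    # Step 1: import the module by its name, given as a string.
    module = importlib.import_module(module_name)
    # Step 2: look up the function object by name on the module.
    function = getattr(module, function_name)
    if not callable(function):
        raise TypeError("%s.%s is not callable" % (module_name, function_name))
    # Step 3: call it and hand the result back to the caller.
    return function(*args)


# A standard-library function stands in for the user's script here.
result = call_python_function("posixpath", "basename", "/data/urls.csv")
```

The C++ wrapper performs the equivalent lookups once per call or caches them, and additionally has to convert between C++ and Python data types at each boundary.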



streams-to-python is a Streams SPL project. It contains a basic SPL flow graph that creates the call chain (SPL <--> C++ <--> Python). The SPL code reads URLs from an input file in the data directory, invokes the C++ native function to execute the Python code, receives the result, and writes it to an output file in the data directory. In the SPL project directory, a native function model XML file lists the meta information required to invoke a C++ class method directly from SPL. This includes the name of the C++ wrapper include file, the C++ namespace that contains the wrapper function, the C++ wrapper function expressed using SPL syntax and types, the name of the shared object library created from the C++ project, the location of that library, the location of the wrapper include file, and so on.



In the following sections, we look closely at the three application components, covering the Python, C++, and SPL code in turn.



Python logic



Listing 1 shows the Python code. This is the business logic we want to invoke from Streams.



Listing 1. UrlToIpAddress.py


import re, sys, socket
    
def getCompanyNameFromUrl(url):
    # Do a regex match to get just the company/business part in the URL.
    # Example: In "www.ibm.com", it will return "ibm".
    m = re.match(r'www\.(.*)\..{3}', url)
    # Return an empty string when the URL does not match the expected pattern.
    if m is None:
        return ""
    return m.group(1)
    
def getIpAddressFromUrl(url):
    # The following python API will return a triple
    # (hostname, aliaslist, ipaddrlist)
    # hostname is the primary host name for the given URL
    # aliaslist is a (possibly empty) list of alternative host names for the same URL
    # ipaddrlist is a list of IPv4 addresses for the same interface on the same host
    #
    # aliaslist and ipaddrlist may have multiple values separated by
    # comma. We will remove such comma characters in those two lists.
    # Then, return back to the caller with the three comma separated
    # fields inside a string. This can be done using the Python 
    # list comprehension.
    return(",".join([str(i).replace(",", "") for i in socket.gethostbyname_ex(url)])) 
    
if ((__name__ == "__main__") and (len(sys.argv) >= 2)):
    url = sys.argv[1]
    # print("url=%s" % (url, ))
    # A single parenthesized argument prints the same in Python 2 and Python 3.
    print("IP address of %s=%s" % (url, getIpAddressFromUrl(url)))
    print("Company name in the URL=%s" % repr(getCompanyNameFromUrl(url)))
elif ((__name__ == "__main__") and (len(sys.argv) < 2)):
    sys.exit("Usage: python UrlToIpAddress.py www.watson.ibm.com")


As Listing 1 shows, we deliberately keep the Python code as simple as possible so that the explanation stays clear. The code contains two Python functions, followed by a snippet that runs only when the script is executed directly with the Python interpreter. To verify that the code works as expected, run the script from a shell window: python UrlToIpAddress.py www.watson.ibm.com.
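The regular expression is the least obvious part of Listing 1, so it helps to try it on a few inputs without touching the network. The sketch below reuses the pattern from the listing; extract_company is a hypothetical name, and returning an empty string for non-matching input is an assumption made for this example.

```python
import re

# Same pattern as in Listing 1.
PATTERN = re.compile(r'www\.(.*)\..{3}')


def extract_company(url):
    # re.match anchors at the start of the string only.
    m = PATTERN.match(url)
    return m.group(1) if m else ""


simple = extract_company("www.ibm.com")          # "ibm"
nested = extract_company("www.watson.ibm.com")   # "watson.ibm"
no_match = extract_company("ibm.com")            # ""
```

Note that the capture group is greedy, so for a host name with a subdomain the result includes every label between "www." and the final three-character suffix.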



At the top of the file, the Python modules for regular expressions (re), system access (sys), and sockets (socket) are imported. The first function, getCompanyNameFromUrl, accepts a web address as input, uses a regular expression match to extract the company name from it, and returns that name to the caller. The next function, getIpAddressFromUrl, also accepts a web address as input. It invokes a Python socket API to obtain the IP addresses for that address. Specifically, this API (socket.gethostbyname_ex) returns a tuple of three elements: the primary host name, a list of alternate host names (possibly empty), and a list of one or more IP addresses for the given web address. Instead of returning the tuple to the caller, the function flattens the three elements into a single Python string, separated by commas, and returns that string.
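To see what the flattened string looks like, the join expression from Listing 1 can be applied to a fixed tuple instead of a live lookup. In the sketch below, flatten_host_info is a hypothetical name and the sample tuple is made up, using addresses from the range reserved for documentation.

```python
def flatten_host_info(host_info):
    # Same expression as in Listing 1, applied to an already fetched triple.
    return ",".join([str(i).replace(",", "") for i in host_info])


# Hypothetical lookup result in the (hostname, aliaslist, ipaddrlist) shape.
sample = ("www.example.com", ["alias.example.com"], ["192.0.2.1", "192.0.2.2"])
flat = flatten_host_info(sample)

# Because commas inside the lists were removed, the string splits
# cleanly into exactly three fields.
hostname, aliases, addresses = flat.split(",")
```

The two list fields keep Python's list notation (brackets and quotes) in the flattened string, so a consumer that needs the individual values has to strip that notation itself.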


