Integrating a Java transliteration module with InfoSphere Streams through a custom Java operator
Brief introduction
A primary challenge for any solution provider in growth markets is handling data that mixes dialects and contains linguistic inconsistencies. Because growth-market regions have several official languages in addition to English, regional-language scripts are increasingly interleaved with English script. You therefore need to transliterate the data into a consistent form before you proceed with processing or text analytics.
Transliterating the data into one predetermined language gives a more uniform and consistent result. This article describes the steps for performing real-time transliteration with a custom Java operator in InfoSphere Streams and the ICU4J library. IBM InfoSphere Streams performs real-time analytics and provides a variety of toolkits and adapters for connecting to data sources, exchanging data in real time, and operating on that data. Figure 1 shows the high-level architecture of the real-time transliteration solution.
Figure 1. High-level solution diagram for real-time transliteration
Prerequisites
Business prerequisites: You should have basic skills in designing and running Streams Processing Language (SPL) application jobs in InfoSphere Streams, and intermediate Java programming skills. The source-language text must be encoded in UTF-8 or UTF-16.
Software prerequisites: InfoSphere Streams (2.0 or later) and the ICU4J library.
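The encoding requirement above can be checked before text reaches the transliteration step. The following is a minimal sketch, assuming the raw bytes of a record are available to you; the class and method names (EncodingCheck, decodeUtf8Strict) are hypothetical and are not part of the operator described in this article. It uses only the JDK's CharsetDecoder, configured to report malformed input instead of silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of InfoSphere Streams or ICU4J.
final class EncodingCheck {

    /** Returns the decoded text, or null if the bytes are not well-formed UTF-8. */
    static String decodeUtf8Strict(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null; // reject the record instead of passing garbled text on
        }
    }

    public static void main(String[] args) {
        // Two Devanagari letters, written as escapes to keep the source ASCII-only.
        byte[] good = "\u0928\u092E".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = { (byte) 0xC3, (byte) 0x28 };
        System.out.println(decodeUtf8Strict(good) != null);
        System.out.println(decodeUtf8Strict(bad) != null);
    }
}
```

For UTF-16 sources, the same pattern should apply with StandardCharsets.UTF_16 in place of UTF_8.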
Create a transliteration custom Java operator
Perform the following steps to create a transliteration custom Java operator.
Set up the Streams Studio environment for Java operator development as described in the Streams Information Center.
After the environment is set up, write the transliteration logic in the Java operator using the ICU4J library. Import the ICU4J jar file into the project workspace. Listing 1 shows the structure of a primitive Java operator in SPL.
Listing 1. Structure of a Java operator in InfoSphere Streams
public synchronized void initialize(OperatorContext context);
public void process(StreamingInput<Tuple> inputStream, Tuple tuple);
public void processPunctuation(StreamingInput<Tuple> inputStream,
        StreamingData.Punctuation marker);
public void allPortsReady();
public void shutdown();
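The division of work between these methods can be illustrated without the Streams runtime. The sketch below is a hypothetical stand-in, not the Streams operator API: the class LifecycleSketch, its simplified method signatures, and the placeholder transform are all assumptions made for illustration. It shows one common design choice, namely building reusable state once in initialize and keeping process limited to per-tuple work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in that mimics the roles of the methods in Listing 1.
// It is NOT the Streams operator API and uses no Streams classes.
final class LifecycleSketch {
    private UnaryOperator<String> transform;  // built once in initialize
    private final List<String> submitted = new ArrayList<>();
    private int setupCount = 0;

    // Called once before any tuple: read parameters, build reusable state.
    synchronized void initialize(String sourceLanguage, String destLanguage) {
        final String id = sourceLanguage + "-" + destLanguage;
        setupCount++;
        // Placeholder for the real per-tuple logic: it only tags the text.
        transform = text -> id + ":" + text;
    }

    // Called once per incoming tuple: keep this path cheap.
    void process(String value) {
        if (value == null) {
            return; // nothing to transform
        }
        submitted.add(transform.apply(value));
    }

    // Called once at the end: release whatever initialize acquired.
    void shutdown() {
        transform = null;
    }

    List<String> submitted() { return submitted; }
    int setupCount() { return setupCount; }

    public static void main(String[] args) {
        LifecycleSketch op = new LifecycleSketch();
        op.initialize("Devanagari", "Latin");
        op.process("first");
        op.process("second");
        op.shutdown();
        System.out.println(op.submitted() + " after " + op.setupCount() + " setup call");
    }
}
```

In a real operator the same idea would mean constructing expensive objects in initialize and reusing them for every tuple, assuming those objects are safe to share across the threads that call process.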
Put the operator's logic in the process method. Listing 2 shows sample code.
Listing 2. Sample code for transliteration in a Java operator
// Normalizer (with the Normalizer.NFKD mode) and Transliterator come from ICU4J (com.ibm.icu.text).
public String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // Decompose the character so that its base letter comes first
        sLetter = Normalizer.normalize(sLetter, Normalizer.NFKD);
        try {
            // Keep only the first byte of the decomposed form
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
        }
    }
    return sb.toString();
}

public final synchronized void process(final StreamingInput input, final Tuple tuple) throws Exception {
    try {
        OperatorContext ctxt = getOperatorContext();
        // Build the transliterator ID as "<source>-<destination>" from the operator parameters
        Transliterator t = Transliterator.getInstance(
                ctxt.getParameterValues("Sourcelanguage").get(0) + "-" +
                ctxt.getParameterValues("Destlanguage").get(0));
        StreamingOutput<OutputTuple> output = getOutput(0);
        OutputTuple outputTuple = output.newTuple();
        boolean reject = false;
        // Read the source tuple
        String value = tuple.getString("InP");
        if (value == null) {
            throw new Exception("Input is null");
        } else {
            outputTuple.setString("Transliteratedtext", toBaseCharacters(t.transliterate(value.toString())));
        }
        output.submit(outputTuple);
    } catch (Exception e) { }
.....
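The base-character step in Listing 2 can be tried on its own, without the Streams runtime or ICU4J. The following is a minimal sketch under stated assumptions: it swaps in the JDK's java.text.Normalizer for the ICU4J Normalizer, it assumes that "base characters" means Latin text with its diacritics removed, and it strips the combining marks explicitly rather than keeping the first UTF-8 byte as the listing does. The class name BaseCharacters is hypothetical.

```java
import java.text.Normalizer;

// Hypothetical JDK-only variant of the base-character step; not the article's code.
final class BaseCharacters {

    // Assumption: the input is Latin-script text, for example transliterator output.
    static String toBaseCharacters(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // NFKD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escapes keep the source ASCII-only: "Caf\u00E9 na\u00EFve" has an accented e and i.
        System.out.println(toBaseCharacters("Caf\u00E9 na\u00EFve")); // prints "Cafe naive"
    }
}
```

The two approaches agree for accented Latin letters. They can differ elsewhere: keeping only the first UTF-8 byte yields a meaningful character only when the base letter is ASCII, while stripping every mark would also remove vowel signs from scripts such as Devanagari, so this variant is best applied to the Latin output of the transliteration step.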