Real-time transliteration using custom Java operators and ICU4J in InfoSphere Streams

Integrating a Java transliteration module with custom Java operators for InfoSphere Streams

Brief introduction

A primary challenge for any solution provider in growth markets is data that mixes dialects and contains linguistic inconsistencies. Because growth market regions have many official languages in addition to English, characters from regional scripts are increasingly embedded in English text. You therefore need to transliterate the data into a consistent form before proceeding with processing or text analysis.

Transliterating the data into a predetermined language gives more unified and consistent results. This article describes the steps for performing real-time transliteration using InfoSphere Streams custom Java operators and the ICU4J library. IBM InfoSphere Streams performs real-time analytic processing and provides a variety of toolkits and adapters for connecting to data sources, exchanging data in real time, and performing operations on the data. The high-level architecture of the real-time transliteration solution is shown in Figure 1.

Figure 1. High-level solution diagram for real-time transliteration

Prerequisites

Business prerequisites: You should have basic skills in designing and running Streams Processing Language (SPL) application jobs with InfoSphere Streams, and intermediate Java programming skills. The source-language data must be encoded in UTF-8 or UTF-16.
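
Because the source data must be UTF-8 or UTF-16, it can help to reject malformed input before it reaches the operator. The following is a minimal sketch that uses only the JDK. The class EncodingCheck and the method decodeStrict are hypothetical names chosen for illustration; they are not part of InfoSphere Streams or ICU4J.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    // Hypothetical helper: returns the decoded text, or null when the
    // bytes are not valid in the given charset.
    static String decodeStrict(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    // REPORT makes the decoder throw instead of silently
                    // substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Devanagari sample text, written with Unicode escapes.
        byte[] valid = "\u0928\u092E\u0938\u094D\u0924\u0947".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] broken = { (byte) 0xC3, (byte) 0x28 };

        System.out.println(decodeStrict(valid, StandardCharsets.UTF_8) != null);  // prints true
        System.out.println(decodeStrict(broken, StandardCharsets.UTF_8) != null); // prints false
    }
}
```

Passing StandardCharsets.UTF_16 instead checks UTF-16 input in the same way. Whether to drop, log, or reroute rejected records is a design choice left to the application.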

Software prerequisites: InfoSphere Streams (2.0 or later) and the ICU4J library.

Create a transliteration custom Java operator

Perform the following steps to create a transliteration custom Java operator.

Set up the Streams Studio environment for Java operator development as described in the Streams Information Center.

After the environment is set up, write the transliteration logic in the Java operator using the ICU4J library. Import the ICU4J JAR file into the project workspace. The basic structure of a Java operator in SPL is shown in Listing 1.

Listing 1. Structure of a Java operator in InfoSphere Streams

public synchronized void initialize(OperatorContext context);

public void process(StreamingInput<Tuple> inputStream, Tuple tuple);

public void processPunctuation(StreamingInput<Tuple> inputStream,
            StreamingData.Punctuation marker);

public void allPortsReady();

public void shutdown();

The operator logic belongs in the process method. Listing 2 shows sample code.

Listing 2. Sample code for performing transliteration in a Java operator

import java.io.UnsupportedEncodingException;
import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Transliterator;
import com.ibm.streams.operator.*;

// Reduces each character to its base form, for example an accented letter to the plain letter.
public String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // NFKD splits a composed character into its base character plus combining marks
        sLetter = Normalizer.normalize(sLetter, Normalizer.NFKD);
        try {
            // Keep only the first UTF-8 byte, which is the base letter when that letter is ASCII
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
        }
    }
    return sb.toString();
}

public final synchronized void process(final StreamingInput input, final Tuple tuple) throws Exception {
    try {
        OperatorContext ctxt = getOperatorContext();
        // Build the ICU transliterator ID in the form "source-destination"
        Transliterator t = Transliterator.getInstance(ctxt.getParameterValues("Sourcelanguage").get(0) + "-"
                + ctxt.getParameterValues("Destlanguage").get(0));
        StreamingOutput<OutputTuple> output = getOutput(0);
        OutputTuple outputTuple = output.newTuple();
        boolean reject = false;

        // Read the source tuple
        String value = tuple.getString("InP");
        if (value == null) {
            throw new Exception("Input is null");
        } else {
            outputTuple.setString("Transliteratedtext", toBaseCharacters(t.transliterate(value.toString())));
        }
        output.submit(outputTuple);
    } catch (Exception e) {
    }
    .....
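
The base-character step in Listing 2 can be tried outside Streams. The sketch below is a JDK-only variant, not the code from the listing: it uses java.text.Normalizer in place of the ICU4J Normalizer, and it removes combining marks with a regular expression instead of keeping the first UTF-8 byte of each character. The class name BaseCharacters and the sample strings are assumptions made for illustration.

```java
import java.text.Normalizer;

final class BaseCharacters {
    // Hypothetical JDK-only variant of toBaseCharacters:
    // decompose with NFKD, then drop every combining mark.
    static String toBaseCharacters(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // NFKD turns a composed letter into its base letter followed by
        // combining marks, for example U+0113 into "e" plus U+0304.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        // \p{M} matches Unicode combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Latin text with diacritics, written with Unicode escapes.
        System.out.println(toBaseCharacters("namast\u0113"));  // prints namaste
        System.out.println(toBaseCharacters("\u015Br\u012B")); // prints sri
    }
}
```

Unlike the first-byte approach, this variant leaves characters that have no decomposition unchanged, so whether it fits depends on how strictly the output must be limited to ASCII.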
