Cloudera Search 1.0.0 Environment Building (2): Near Real-Time (NRT) Search Using Flume-ng's MorphlineSolrSink


To achieve near real-time search, there must be a mechanism that processes data in real time and feeds it into the Solr index. Flume-ng provides exactly such a mechanism: it collects data in real time, runs it through MorphlineSolrSink for ETL, and finally writes it into the Solr index, so that newly arrived data can be queried in near real time from the Solr search engine.

Build steps:

1 We only do a demo here, so we created a new file, file01, holding two records, which will be submitted to the agent listening on port 44444 via flume-ng avro-client -H localhost -p 44444 -F file01.

The two records are as follows:

{"id": "1234567890", "user_friends_count": 111, "user_location": "Palo Alto", "user_description": "desc1", "user_statuses_count": 11111, "user_followers_count": 111, "user_name": "name1", "user_screen_name": "fake_user1", "created_at": "1985-09-04T18:01:01Z", "text": "sample tweet one", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 0, "source": "href=\"http:\/\/sample.com\"", "in_reply_to_status_id": 0, "media_url_https": null, "expanded_url": null}
{"id": "2345678901", "user_friends_count": 222, "user_location": "San Francisco", "user_description": "desc2", "user_statuses_count": 222222, "user_followers_count": 222, "user_name": "name2", "user_screen_name": "fake_user2", "created_at": "1985-09-04T19:14:34Z", "text": "sample tweet", "retweet_count": 0, "retweeted": false, "in_reply_to_user_id": 0, "source": "href=\"http:\/\/sample.com\"", "in_reply_to_status_id": 0, "media_url_https": null, "expanded_url": null}

These are two JSON records, and we will use Morphlines to perform ETL on them, extracting several specified fields from the JSON data.
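Before feeding file01 to the agent, it is worth verifying that each line parses as one self-contained JSON object, since the morphline's readJson command operates on the whole event body. A small Python check (an illustration only, not part of the original setup; shown here for the first record):

```python
import json

# First record from file01 (same content as above); each line must be
# a single valid JSON object or readJson will reject the Flume event.
line1 = ('{"id": "1234567890", "user_friends_count": 111, '
         '"user_location": "Palo Alto", "user_description": "desc1", '
         '"user_statuses_count": 11111, "user_followers_count": 111, '
         '"user_name": "name1", "user_screen_name": "fake_user1", '
         '"created_at": "1985-09-04T18:01:01Z", '
         '"text": "sample tweet one", "retweet_count": 0, '
         '"retweeted": false, "in_reply_to_user_id": 0, '
         '"source": "href=\\"http://sample.com\\"", '
         '"in_reply_to_status_id": 0, "media_url_https": null, '
         '"expanded_url": null}')

obj = json.loads(line1)
print(obj["id"], obj["user_screen_name"], obj["created_at"])
# -> 1234567890 fake_user1 1985-09-04T18:01:01Z
```

The same check on the second line catches stray line wraps introduced by copy-paste.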

2 Configure the Flume-ng Solr Sink in the Flume configuration in Cloudera Manager (CM), as shown in the following figure:

The Morphlines configuration file is as follows:

# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : collection1

  # ZooKeeper ensemble
  zkHost : "master68:2181,slave69:2181,slave76:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr.
morphlines : [
  {
    # Name used to identify a morphline. For example, used if there are
    # multiple morphlines in a morphline config file.
    id : morphline1

    # Import all morphline commands in these java packages and their
    # subpackages. Other commands that may be present on the classpath are
    # not visible to this morphline.
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readJson {}
      }

      {
        extractJsonPaths {
          flatten : false
          paths : {
            id : /id
            user_name : /user_screen_name
            created_at : /created_at
            text : /text
          }
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # Convert timestamp field to native Solr timestamp format
      # such as 2012-09-06T07:14:34Z to 2012-09-06T07:14:34.000Z
      {
        convertTimestamp {
          field : created_at
          inputFormats : ["yyyy-MM-dd'T'HH:mm:ss'Z'", "yyyy-MM-dd"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a
      # document that contains a field that isn't specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # Log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # Load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

To briefly explain this morphlines configuration file: it first executes a readJson command, which converts the content of each incoming event into a JSON object. It then uses the extractJsonPaths command to extract specific field values from the JSON object and assign them to (possibly renamed) output fields (for example, user_name : /user_screen_name reads the value of user_screen_name and assigns it to user_name). Next, convertTimestamp reformats the created_at field, and sanitizeUnknownSolrFields discards every field that is not configured in Solr's schema, so after ETL the record retains only the fields configured in Solr. Finally, the record is submitted to Solr by the loadSolr command.
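The net effect of this command chain can be sketched in plain Python (an illustration only; the real transformations run inside MorphlineSolrSink. Note one simplification: the real convertTimestamp above treats the trailing 'Z' as a literal and interprets the input in America/Los_Angeles, while this sketch simply parses it as UTC):

```python
import json
from datetime import datetime, timezone

# One Flume event body, trimmed to a few fields of the first record.
event_body = ('{"id": "1234567890", "user_screen_name": "fake_user1", '
              '"created_at": "1985-09-04T18:01:01Z", '
              '"text": "sample tweet one", "user_location": "Palo Alto"}')

# readJson: parse the event body into a JSON object.
obj = json.loads(event_body)

# extractJsonPaths: copy selected paths into (renamed) output fields.
paths = {"id": "id", "user_name": "user_screen_name",
         "created_at": "created_at", "text": "text"}
record = {out: obj[src] for out, src in paths.items()}

# convertTimestamp: 1985-09-04T18:01:01Z -> 1985-09-04T18:01:01.000Z
ts = datetime.strptime(record["created_at"], "%Y-%m-%dT%H:%M:%S%z")
record["created_at"] = ts.astimezone(timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%S.000Z")

# sanitizeUnknownSolrFields: drop fields absent from the Solr schema.
schema_fields = {"id", "user_name", "created_at", "text"}
record = {k: v for k, v in record.items() if k in schema_fields}

print(record)
```

After this chain, only the four schema fields remain and the timestamp is in native Solr format; loadSolr then submits the record.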

3 Next is the configuration of the Flume agent:

tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1

tier1.sources.source1.type = avro
tier1.sources.source1.bind = 0.0.0.0
tier1.sources.source1.port = 44444
tier1.sources.source1.channels = channel1

tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000

tier1.sinks.sink1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.morphlineFile = morphlines.conf
tier1.sinks.sink1.morphlineId = morphline1

One thing to note here: because we configured the Flume-ng Solr Sink in CM, it is enough to set morphlineFile to just morphlines.conf; otherwise you must specify the absolute path, or the morphlines configuration file will not be found.

4 When the three steps above are ready, start the agent, then run flume-ng avro-client -H localhost -p 44444 -F file01 in a shell to submit the data file created in step 1 to the agent.

5 After the command finishes, if there were no errors, you can query Solr at http://slave77:8983/solr/collection1/select?q=*:* to check whether the two records have been indexed in the search engine.
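The same check can be scripted. A minimal Python sketch (assuming the slave77:8983 Solr host from this article; the request itself is left commented out since it only works on a machine that can reach the cluster):

```python
from urllib.parse import urlencode

# Build the same query as step 5, but ask Solr for a JSON response
# (wt=json) so the result is easy to inspect programmatically.
base = "http://slave77:8983/solr/collection1/select"
url = base + "?" + urlencode({"q": "*:*", "wt": "json"})
print(url)

# On a machine that can reach the Solr node, the check itself would be:
# import json
# from urllib.request import urlopen
# with urlopen(url) as resp:
#     num_found = json.load(resp)["response"]["numFound"]
#     print(num_found)  # 2 if both records were indexed
```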


If you see results like those shown below, congratulations: you have successfully completed the construction of the NRT architecture described in this article.

