Background: With the Kafka message bus in place, data from every system can now be aggregated at the Kafka nodes. The next task is to extract the maximum value from that data, in other words, to let the data speak for itself.
Environment Preparation:
Kafka server.
CDH 5.8.3 server, with the Flume, Solr, Hue, HDFS, and ZooKeeper services installed.
Flume provides a scalable, real-time data transport channel; Morphline provides lightweight ETL capabilities; SolrCloud + Hue provide a high-performance search engine and a variety of ways to present the data.
I. Environment installation (abbreviated)
II. Modify the CDH default configuration:
1. In the Flume configuration page, configure Flume's dependency on Solr.
2. In the Solr configuration page, configure Solr to store its configuration files in ZooKeeper and its index files in HDFS.
3. In the Hue configuration page, configure Hue's dependency on Solr.
4. Configure the Hue interface so that it is accessible from the external network.
III. Configure each CDH service and develop the code for the scenario.
Kafka topic: eventcount
Topic data format:
{
  "timestamp": "1481077173000",
  "accountName": "Wang Xiaobao",
  "tagNames": ["incoming"],
  "account": "wxb",
  "eventType": "phone",
  "eventTags": [
    { "value": 1, "name": "incoming" }
  ]
}
1. Create the corresponding Solr collection.
1) Log in to any CDH node and generate the collection configuration skeleton:
$ solrctl instancedir --generate $HOME/solr_configs
2) Locate the schema.xml file in that folder and modify the collection schema.
Step one: Modify the fields. Many fields are predefined in schema.xml; apart from the fields named id, _root_, and _version_, which cannot be removed, all of the others can be deleted. Each field corresponds to a field in the JSON that needs to be indexed.
(Note: the timestamp in the JSON corresponds to eventTime below, while the timestamp field below is the time at which Flume received the Kafka data; this mapping is implemented in the Morphline configuration.)
<Fieldname= "id"type= "string"indexed= "true"stored= "true"Required= "true"multivalued= "false" /> <!--points to the root document of a block of nested documents. Required for nested document support, could be removed otherwise - <Fieldname= "_root_"type= "string"indexed= "true"stored= "false"/> <Fieldname= "Account"type= "string"indexed= "true"stored= "true"/> <Fieldname= "AccountName"type= "string"indexed= "true"stored= "true"/> <Fieldname= "Subaccount"type= "string"indexed= "true"stored= "true"/> <Fieldname= "Subaccountname"type= "string"indexed= "true"stored= "true"/> <Fieldname= "Eventtime"type= "Tlong"indexed= "false"stored= "true"/> <Fieldname= "EventType"type= "string"indexed= "true"stored= "true"/> <Fieldname= "Eventtags"type= "string"indexed= "true"stored= "true"multivalued= "true"/> <Fieldname= "_attachment_body"type= "string"indexed= "false"stored= "true"/> <Fieldname= "Timestamp"type= "Tlong"indexed= "false"stored= "true"/> <Fieldname= "_version_"type= "Long"indexed= "true"stored= "true"/>
Step two: Remove all copyField declarations.
Step three: Add a dynamic field (dynamicField).
<dynamicField name="tws_*" type="text_ws" indexed="true" stored="true" multiValued="true"/>
3) Upload the configuration and create the collection:
$ solrctl instancedir --create event_count_records $HOME/solr_configs
$ solrctl collection --create event_count_records -s 3 -c event_count_records
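To confirm that the upload and the collection creation succeeded, the solrctl listing commands can be used (a quick sanity check, not part of the original steps):

$ solrctl instancedir --list
$ solrctl collection --list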
2. Flume configuration
Create a new role group kafka2solr, set the agent name to kafka2solr, and assign the server to this role group.
# Configure the names of the source, channel, and sink
kafka2solr.sources = source_from_kafka
kafka2solr.channels = mem_channel
kafka2solr.sinks = solrSink

# Configure the source type as Kafka
kafka2solr.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource
kafka2solr.sources.source_from_kafka.channels = mem_channel
kafka2solr.sources.source_from_kafka.batchSize = 100
kafka2solr.sources.source_from_kafka.kafka.bootstrap.servers = kafkanode0:9092,kafkanode1:9092,kafkanode2:9092
kafka2solr.sources.source_from_kafka.kafka.topics = eventcount
kafka2solr.sources.source_from_kafka.kafka.consumer.group.id = flume_solr_caller
kafka2solr.sources.source_from_kafka.kafka.consumer.auto.offset.reset = latest

# Configure the channel type as memory; in production this is usually set to file,
# or Kafka is used directly as the channel
kafka2solr.channels.mem_channel.type = memory
kafka2solr.channels.mem_channel.keep-alive = 60

# Other config values specific to each type of channel (sink or source) can be defined as well;
# in this case they specify the capacity of the memory channel
kafka2solr.channels.mem_channel.capacity = 10000
kafka2solr.channels.mem_channel.transactionCapacity = 3000

# Configure the sink to Solr and use Morphline to transform the data
kafka2solr.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
kafka2solr.sinks.solrSink.channel = mem_channel
kafka2solr.sinks.solrSink.morphlineFile = morphlines.conf
kafka2solr.sinks.solrSink.morphlineId = morphline1
kafka2solr.sinks.solrSink.isIgnoringRecoverableExceptions = true
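As the channel comment above notes, the memory channel is not durable. A minimal sketch of swapping in a Kafka channel instead is shown below; the channel topic name and consumer group are illustrative assumptions rather than part of the original setup, and the property names follow the same newer Kafka client naming already used by the source configuration:

# Sketch: use a Kafka topic as the Flume channel for durability
kafka2solr.channels = kafka_channel
kafka2solr.sources.source_from_kafka.channels = kafka_channel
kafka2solr.sinks.solrSink.channel = kafka_channel

kafka2solr.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
kafka2solr.channels.kafka_channel.kafka.bootstrap.servers = kafkanode0:9092,kafkanode1:9092,kafkanode2:9092
# hypothetical topic used only as the channel's buffer
kafka2solr.channels.kafka_channel.kafka.topic = flume_channel_eventcount
kafka2solr.channels.kafka_channel.kafka.consumer.group.id = flume_channel_group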
3. Flume-ng Solr sink Morphline configuration (morphlines.conf)
SOLR_LOCATOR : {
  # Name of the Solr collection
  collection : event_count_records

  # CDH-specific setting; the open source version does not support it
  zkHost : "$ZK_HOST"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        # The Kafka JSON data arrives in Flume as a binary stream and must first be parsed as JSON
        readJson {}
      }
      {
        # The parsed JSON fields must be extracted into record fields before Solr can index them
        extractJsonPaths {
          flatten : true
          paths : {
            account : /account
            accountName : /accountName
            subaccount : /subaccount
            subaccountName : /subaccountName
            eventTime : /timestamp
            eventType : /eventType
            eventTags : "/eventTags[]/name"
            # save the timestamp by minute
            eventTimeInMinute_tdt : /timestamp
            # save the timestamp by hour
            eventTimeInHour_tdt : /timestamp
            # save the timestamp by day
            eventTimeInDay_tdt : /timestamp
            # the _tdt suffix is dynamically recognized as a date-type index field;
            # indexing at different time granularities improves query performance
          }
        }
      }
      # convert the long timestamps to date format
      {
        convertTimestamp {
          field : eventTimeInMinute_tdt
          inputFormats : ["unixTimeInMillis"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z/MINUTE'"
          outputTimezone : Asia/Shanghai
        }
      }
      {
        convertTimestamp {
          field : eventTimeInHour_tdt
          inputFormats : ["unixTimeInMillis"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z/HOUR'"
          outputTimezone : Asia/Shanghai
        }
      }
      {
        convertTimestamp {
          field : eventTimeInDay_tdt
          inputFormats : ["unixTimeInMillis"]
          inputTimezone : UTC
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z/DAY'"
          outputTimezone : Asia/Shanghai
        }
      }
      # when the JSON data from Kafka reaches Flume it is placed in the _attachment_body field;
      # readJson turns it into a JsonNode object, which must be converted with toString before it can be stored
      {
        toString { field : _attachment_body }
      }
      # generate a UUID for each record
      {
        generateUUID { field : id }
      }
      # fields not defined in the Solr schema are renamed with the tws_ prefix so that they match
      # the tws_* dynamic field defined in schema.xml as type text_ws and get a word-split index
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch the Solr schema
          solrLocator : ${SOLR_LOCATOR}
          renameToPrefix : "tws_"
        }
      }
      # load the data into Solr
      {
        loadSolr { solrLocator : ${SOLR_LOCATOR} }
      }
    ]
  }
]
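To make the end-to-end effect concrete, here is a sketch of roughly how the sample record shown earlier would land in Solr. This is illustrative only: the id is whatever UUID generateUUID produces, the timestamp and _attachment_body values depend on what Flume received, and the _tdt values assume Solr applies the /MINUTE, /HOUR, and /DAY date-math rounding at index time to the epoch value 1481077173000 rendered in Asia/Shanghai local time:

{
  "id": "<generated UUID>",
  "account": "wxb",
  "accountName": "Wang Xiaobao",
  "eventType": "phone",
  "eventTags": ["incoming"],
  "eventTime": 1481077173000,
  "eventTimeInMinute_tdt": "2016-12-07T10:19:00Z",
  "eventTimeInHour_tdt": "2016-12-07T10:00:00Z",
  "eventTimeInDay_tdt": "2016-12-07T00:00:00Z",
  "timestamp": "<time Flume received the Kafka message>",
  "_attachment_body": "<the original JSON string>"
}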
Restart the affected Flume agents, and data begins to be imported into Solr.
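Before moving to Hue, a quick way to confirm that documents are arriving is to query the collection directly through Solr's HTTP API (the host name below is a placeholder for any node running a Solr server):

$ curl "http://solrnode1:8983/solr/event_count_records/select?q=*:*&rows=1&wt=json"

A non-zero numFound in the response means the Flume sink is writing into the collection.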
4. Check the data in Solr with Hue.
See the separate Solr + Hue walkthrough.
Kafka + Flume + Morphline + Solr + Hue data indexing pipeline