The build process of a statistical analysis system, completed entirely on my own, implemented with PHP + Hadoop + Hive + Thrift + MySQL.
Installation
Hadoop Installation: http://www.powerxing.com/install-hadoop/
Hadoop cluster configuration: http://www.powerxing.com/install-hadoop-cluster/
Hive Installation: https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0807.html
See the links above for the detailed installation tutorials. Local testing only needs a single-machine setup; for the cluster configuration (needed by Flume later) see the detailed links above. Since I had never touched anything Java-related before, here are a few problems I ran into:
- The 1.x and 2.x versions of Hadoop and Hive must match each other.
- Java/Hadoop environment variables must be configured; developers used to PHP can easily overlook this.
- Starting Hadoop prints `Starting namenodes on []`; the namenode list is empty because no IP or port is specified. Modify `hadoop/core-site.xml` as follows:

    <configuration>
      <property>
        <name>dfs.namenode.rpc-address</name>
        <value>127.0.0.1:9001</value>
      </property>
    </configuration>
- After installation completes, running `jps` should show NameNode, DataNode, etc.
Reporting and receiving
- Both Swoole and Workerman provide simple reference implementations of data monitoring, covering reporting, receiving, storage, and display. They mainly use UDP upload (the Swoole version has since been upgraded to a persistent TCP connection), Redis caching, file persistence, and Highcharts display, and can serve as a reference for the overall approach.
Swoole-statistics: https://github.com/smalleyes/statistics
Workerman-statistics: https://github.com/walkor/workerman-statistics
- This example uses the UDP interface provided by Swoole for transmission; since the reported data tolerates a certain amount of loss, UDP is chosen for efficiency.
- Received data is cached temporarily in Redis and flushed to files every few minutes; file names are segmented by module and time, and fields are separated by `|` (to line up with Hive later). A sketch of this report/receive flow follows below.
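The report/receive code itself is not shown in the article, so here is a minimal sketch of the flow just described, assuming the Swoole and Redis PHP extensions are installed. The port (9903), the Redis list key `stat:login`, and the five-minute flush interval are illustrative choices of mine; the output directory matches the Flume `spoolDir` used later.

```php
<?php
// Receiving side: UDP server that caches reports in Redis and periodically
// flushes them into | -delimited files, which Flume will pick up later.
$server = new Swoole\Server('0.0.0.0', 9903, SWOOLE_PROCESS, SWOOLE_SOCK_UDP);
$server->set(['worker_num' => 1]); // a single worker keeps the flush logic simple

$redis = null;

$server->on('WorkerStart', function (Swoole\Server $server, int $workerId) use (&$redis) {
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Every 5 minutes, move the cached lines into a file named by module + time.
    Swoole\Timer::tick(300 * 1000, function () use (&$redis) {
        $file = sprintf('/data/releases/stat/data/10001/login_%s.log', date('YmdHi'));
        while (($line = $redis->lPop('stat:login')) !== false) {
            file_put_contents($file, $line . "\n", FILE_APPEND);
        }
    });
});

$server->on('Packet', function (Swoole\Server $server, string $data, array $clientInfo) use (&$redis) {
    // One report per UDP packet; fields are already | -separated by the sender.
    $redis->rPush('stat:login', trim($data));
});

$server->start();
```

The reporting side can then be a fire-and-forget UDP send from the web application, for example:

```php
<?php
// Hypothetical collector host/port; one login record with the fields of the
// login table defined below: time|type|device|ip|uid|is_old.
$client = new Swoole\Client(SWOOLE_SOCK_UDP);
$client->connect('192.168.23.223', 9903, 0.5);
$client->send(implode('|', [time(), 'email', 'pc', '1.2.3.4', 10001, 1]));
```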
Data dump and Hive table creation
- The Hive table is defined according to the file data format; the `TERMINATED BY` delimiter must match the field delimiter used in the files above.
- The table is partitioned by date using `PARTITIONED BY`.
    CREATE TABLE login (
      time int COMMENT 'login time',
      type string COMMENT 'email,username,qq',
      device string COMMENT 'login device: pc,android,ios',
      ip string COMMENT 'login IP',
      uid int COMMENT 'user ID',
      is_old int COMMENT 'whether the user is an old user'
    )
    PARTITIONED BY (`date` string COMMENT 'date')
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
- Create the Hive partitions on a schedule (crontab); a sketch of how `${web_stat_day}` can be supplied follows below:

    hive -e "use web_stat; alter table login add if not exists partition (date='${web_stat_day}')"
Data transfer (Flume)
- Flume watches the file directory, transfers the data to a server that can reach the HDFS cluster, and sends it to port 7000 on the .224 machine.
    # agent3 is the agent name (login)
    agent3.sources  = source1
    agent3.sinks    = sink1
    agent3.channels = channel1

    # configure source1
    agent3.sources.source1.type       = spooldir
    agent3.sources.source1.spoolDir   = /data/releases/stat/data/10001/
    agent3.sources.source1.channels   = channel1
    agent3.sources.source1.fileHeader = false

    # configure sink1
    agent3.sinks.sink1.type     = avro
    agent3.sinks.sink1.hostname = 192.168.23.224
    agent3.sinks.sink1.port     = 7000
    agent3.sinks.sink1.channel  = channel1

    # configure channel1
    agent3.channels.channel1.type          = file
    agent3.channels.channel1.checkpointDir = /data/flume_data/checkpoint_login
    agent3.channels.channel1.dataDirs      = /data/flume_data/channeldata_login

    # add to the supervisor daemon
    /home/flume/bin/flume-ng agent -n agent3 -c /home/flume/conf/ -f /home/flume/conf/statistics/login_flume.conf -Dflume.root.logger=info,console
- The .224 machine listens on port 7000 and writes the data into the HDFS cluster.
    # agent4 is the agent name
    agent4.sources  = source1
    agent4.sinks    = sink1
    agent4.channels = channel1

    # configure source1
    agent4.sources.source1.type     = avro
    agent4.sources.source1.bind     = 192.168.23.224
    agent4.sources.source1.port     = 7000
    agent4.sources.source1.channels = channel1

    # configure sink1
    agent4.sinks.sink1.type                   = hdfs
    agent4.sinks.sink1.hdfs.path              = hdfs://umr-ubvzlf/uhiveubnhq5/warehouse/web_stat.db/login/date\=%Y-%m-%d
    agent4.sinks.sink1.hdfs.fileType          = DataStream
    agent4.sinks.sink1.hdfs.filePrefix        = buffer_census_
    agent4.sinks.sink1.hdfs.writeFormat       = Text
    agent4.sinks.sink1.hdfs.rollInterval      = 30
    agent4.sinks.sink1.hdfs.inUsePrefix       = .
    agent4.sinks.sink1.hdfs.rollSize          = 536870912
    agent4.sinks.sink1.hdfs.useLocalTimeStamp = true
    agent4.sinks.sink1.hdfs.rollCount         = 0
    agent4.sinks.sink1.channel                = channel1

    # configure channel1
    agent4.channels.channel1.type          = file
    agent4.channels.channel1.checkpointDir = /data/flume_data/login_checkpoint
    agent4.channels.channel1.dataDirs      = /data/flume_data/login_channeldata

    # add to the supervisor daemon
    /usr/local/flume/bin/flume-ng agent -n agent4 -c /usr/local/flume/conf/ -f /usr/local/flume/conf/statistics/login_flume.conf -Dflume.root.logger=info,console
Cleaning data
Call Hive from PHP through the Thrift client package: write SQL-like HQL that is converted into MapReduce jobs to read and compute the data on HDFS, then store the results in MySQL (a sketch of the MySQL step follows the connection snippet below).
php-thrift-client: https://github.com/garamon/php-thrift-hive-client
    define('THRIFT_HIVE', ROOT . '/libs/thrift');
    $GLOBALS['THRIFT_ROOT'] = THRIFT_HIVE . '/lib';
    require_once $GLOBALS['THRIFT_ROOT'] . '/packages/hive_service/ThriftHive.php';
    require_once $GLOBALS['THRIFT_ROOT'] . '/transport/TSocket.php';
    require_once $GLOBALS['THRIFT_ROOT'] . '/protocol/TBinaryProtocol.php';
    require_once THRIFT_HIVE . '/ThriftHiveClientEx.php';

    $transport = new \TSocket('127.0.0.1', 10000);
    $transport->setSendTimeout(600 * 1000);
    $transport->setRecvTimeout(600 * 1000);
    $this->client = new \ThriftHiveClientEx(new \TBinaryProtocol($transport));
    $this->client->open();
    $this->client->execute('show databases');
    $result = $this->client->fetchAll();
    var_dump($result);
    $this->client->close();
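Building on the connection code above, here is a minimal sketch (my own illustration, not code from the article) of the cleaning step itself: run an aggregate HQL so the work executes as a MapReduce job, then store the already-aggregated result in MySQL. `$client` stands for a `ThriftHiveClientEx` opened as in the snippet above; the MySQL DSN, credentials and the `stat_login_daily` table are assumptions.

```php
<?php
// $client: a ThriftHiveClientEx connected and opened as shown above (assumption).
$day = '2016-08-23';

// Aggregate inside Hive so the heavy lifting runs as a MapReduce job,
// not as row-by-row processing in PHP memory.
$client->execute('use web_stat');
$client->execute(sprintf(
    "select type, count(*) from login where date='%s' group by type",
    $day
));
$rows = $client->fetchAll();

// Persist the small, already-aggregated result set into MySQL for the report layer.
$pdo  = new PDO('mysql:host=127.0.0.1;dbname=stat;charset=utf8', 'stat_user', 'secret');
$stmt = $pdo->prepare('insert into stat_login_daily (day, type, total) values (?, ?, ?)');

foreach ($rows as $row) {
    // HiveServer1's Thrift interface returns each row as a tab-separated string.
    list($type, $total) = explode("\t", $row);
    $stmt->execute([$day, $type, (int)$total]);
}
```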
- HQL Syntax Description: https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0803.html
- Note: write HQL so that it is converted into MapReduce jobs wherever possible; otherwise you are not using Hadoop's big-data analysis at all and the whole exercise is pointless.
- For example, avoid logic like the following, where rows are pulled out and counted in PHP memory; in Hive this SQL is just a plain data fetch and is not converted into a MapReduce job.
    select * from login limit 5;

    // processed in PHP
    $count = 0;
    foreach ($queryResult as $row) {
        $count++;
    }
- Instead, do the conversion to MapReduce in one step and leverage Hadoop's computing power:

    -- this way MapReduce is actually used
    select type, count(*) from login group by type;
- When a table is partitioned with `PARTITIONED BY`, queries can take advantage of partition pruning (input pruning), but the partition predicate has to be the one closest to the `where` keyword for the pruning to take effect:

    -- the login table above is partitioned by date, so the date condition must come first
    select count(*) from login where date='2016-08-23' and is_old=1;
- Hive does not support equi-joins written with comma-separated tables in the `where` clause, such as:

    select * from dual a, dual b where a.key = b.key;

It should be written as:

    select * from dual a join dual b on a.key = b.key;
- INSERT of individual records is not supported in Hive, nor does it make sense logically: we use Hadoop for big-data analysis, not as a business database.
Data Report Presentation
This step is simple: read the data from MySQL and present it with tools such as Highcharts; you can also have crontab-scheduled PHP scripts send out daily and weekly reports. A sketch of a report endpoint follows below.
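As an illustration only (not code from the article), a tiny report endpoint could read the aggregated MySQL table from the earlier sketch and return JSON in a shape Highcharts can consume, e.g. a pie chart's data array; the DSN, credentials and `stat_login_daily` table are the same assumptions as before.

```php
<?php
// Hypothetical report endpoint: per-type login totals for one day as JSON.
$pdo  = new PDO('mysql:host=127.0.0.1;dbname=stat;charset=utf8', 'stat_user', 'secret');
$stmt = $pdo->prepare('select type, total from stat_login_daily where day = ?');
$stmt->execute([$_GET['day'] ?? date('Y-m-d')]);

$points = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $points[] = ['name' => $row['type'], 'y' => (int)$row['total']];
}

header('Content-Type: application/json');
// Feed this to a Highcharts series, e.g. {type: 'pie', data: <this array>}.
echo json_encode($points);
```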
Subsequent updates
Recently, after reading some material and talking with other people, I found that the data-cleaning step does not need PHP at all: the cleaning logic can be written entirely in HQL, with the results stored back into Hadoop, and then Sqoop can synchronize the Hadoop data into MySQL. That simplifies the flow, removes the manual MySQL inserts, makes the data more real-time, and lays the groundwork for a second round of HQL cleaning logic. A sketch of such a Sqoop export follows below.
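I have not built this variant myself; as a rough sketch (hypothetical paths, table names, and credentials), a cron job could shell out to `sqoop export` from PHP to push a cleaned Hive result table from HDFS into MySQL, assuming that result table was written with Hive's default `\001` field delimiter.

```php
<?php
// Hypothetical Sqoop export: HDFS directory of a cleaned Hive table -> MySQL table.
$cmd = implode(' ', [
    'sqoop export',
    '--connect jdbc:mysql://127.0.0.1/stat',
    '--username stat_user --password secret',
    '--table stat_login_daily',                                        // MySQL target table
    '--export-dir /user/hive/warehouse/web_stat.db/stat_login_daily',  // HDFS source dir
    "--input-fields-terminated-by '\\001'",                            // Hive default delimiter
]);

exec($cmd . ' 2>&1', $output, $status);
if ($status !== 0) {
    error_log("sqoop export failed:\n" . implode("\n", $output));
}
```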