PHP + Hadoop implementation of data statistical analysis

This article walks through the construction of a statistical analysis system, built entirely from scratch and implemented with PHP + Hadoop + Hive + Thrift + MySQL.

Installation

Hadoop Installation: http://www.powerxing.com/install-hadoop/
Hadoop cluster configuration: http://www.powerxing.com/install-hadoop-cluster/
Hive Installation: https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0807.html

For the installation itself, please follow the tutorials linked above. The local test uses only a single-machine setup; for the cluster configuration (needed by Flume later), see the detailed links above. Having had no prior contact with the Java ecosystem, I ran into a few problems:

    • Hadoop 1.x and 2.x must be paired with the corresponding Hive versions
    • Java/Hadoop environment variables need to be configured; developers coming from PHP can easily overlook this
    • Starting Hadoop prompts Starting namenodes on [] ; the namenode list is empty because no IP or port is specified, so modify hadoop/core-site.xml as follows:
<configuration>
  <property>
    <name>dfs.namenode.rpc-address</name>
    <value>127.0.0.0:9001</value>
  </property>
</configuration>
    • After the installation is complete, run jps to verify that NameNode, DataNode, etc. are running
Reporting and receiving
    • Both Swoole and Workerman have simple reference implementations of data monitoring, covering reporting, receiving, storage and display, mainly using UDP reporting (the Swoole version has since been upgraded to long-lived TCP connections), Redis caching, file persistence and Highcharts display; they can serve as a reference for the approach
      Swoole-statistics:https://github.com/smalleyes/statistics
      Workerman-statistics:https://github.com/walkor/workerman-statistics
    • This example uses the interface provided by Swoole to implement UDP transmission; since the reported data can tolerate a certain amount of loss, UDP is chosen for efficiency
    • Received data is temporarily cached in Redis and persisted to files every few minutes; file names are segmented by module and time, and fields are separated by | (this delimiter matters for Hive later). A minimal sketch of this flow follows below.
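The following is a minimal sketch of the report/receive/persist flow described above, assuming a hypothetical collector port (9502) and Redis key (stat:login:buffer); the spool directory matches the one Flume watches later. It illustrates the idea rather than reproducing the original implementation.

<?php
// --- report.php: fire-and-forget UDP packet, fields joined by '|' ---
function report_login(array $fields)
{
    $client = new swoole_client(SWOOLE_SOCK_UDP);
    $client->connect('127.0.0.1', 9502, 0.1);        // hypothetical collector address/port
    $client->send(implode('|', $fields));            // e.g. time|type|device|ip|uid|is_old
    $client->close();
}

// --- receiver.php: Swoole UDP server caches each packet in a Redis list ---
$server = new swoole_server('0.0.0.0', 9502, SWOOLE_PROCESS, SWOOLE_SOCK_UDP);
$server->on('Packet', function ($server, $data, $clientInfo) {
    $redis = new Redis();                            // a real worker would reuse this connection
    $redis->connect('127.0.0.1', 6379);
    $redis->rPush('stat:login:buffer', trim($data));
});
$server->start();

// --- flush.php (crontab, every few minutes): persist the buffer to a file named by
// --- module and time, under the directory that Flume will watch later ---
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$file = '/data/releases/stat/data/10001/login_' . date('YmdHi') . '.log';
while (($line = $redis->lPop('stat:login:buffer')) !== false) {
    file_put_contents($file, $line . "\n", FILE_APPEND);
}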
Data dump: creating the Hive data table
    • The Hive table is defined according to the file data format; the TERMINATED BY delimiter must match the field delimiter used in the files above
    • The table is partitioned by date with PARTITIONED BY
CREATE TABLE login (
  time int COMMENT 'login time',
  type string COMMENT 'email,username,qq',
  device string COMMENT 'login device: pc,android,ios',
  ip string COMMENT 'login IP',
  uid int COMMENT 'user ID',
  is_old int COMMENT 'whether an old user'
) PARTITIONED BY (date string COMMENT 'date')
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
    • Create the Hive partitions on a schedule (crontab):
hive -e "use web_stat; alter table login add if not exists partition (date='${web_stat_day}')"
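For reference, a hypothetical cron-driven PHP wrapper around the command above could look like this; it assumes the partition for the previous day is the one to create, which is not spelled out in the article.

<?php
// create_partition.php - run daily from crontab; assumes yesterday's partition is wanted
$web_stat_day = date('Y-m-d', strtotime('-1 day'));
$hql = "use web_stat; alter table login add if not exists partition (date='{$web_stat_day}')";
exec('hive -e ' . escapeshellarg($hql), $output, $status);
if ($status !== 0) {
    error_log("creating partition for {$web_stat_day} failed");
}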
Transferring data
    • Flume watches the file directory, transfers the data to a server that can access the HDFS cluster, and sends it to port 7000 on the .224 machine
# agent3 is the agent name (login)
agent3.sources = source1
agent3.sinks = sink1
agent3.channels = channel1

# configure source1
agent3.sources.source1.type = spooldir
agent3.sources.source1.spoolDir = /data/releases/stat/data/10001/
agent3.sources.source1.channels = channel1
agent3.sources.source1.fileHeader = false

# configure sink1
agent3.sinks.sink1.type = avro
agent3.sinks.sink1.hostname = 192.168.23.224
agent3.sinks.sink1.port = 7000
agent3.sinks.sink1.channel = channel1

# configure channel1
agent3.channels.channel1.type = file
agent3.channels.channel1.checkpointDir = /data/flume_data/checkpoint_login
agent3.channels.channel1.dataDirs = /data/flume_data/channeldata_login
    • Start flume
# added to the supervisor daemon
/home/flume/bin/flume-ng agent -n agent3 -c /home/flume/conf/ -f /home/flume/conf/statistics/login_flume.conf -Dflume.root.logger=info,console
    • The .224 machine listens on port 7000 and writes the data to the HDFS cluster
# agent4 is the agent name
agent4.sources = source1
agent4.sinks = sink1
agent4.channels = channel1

# configure source1
agent4.sources.source1.type = avro
agent4.sources.source1.bind = 192.168.23.224
agent4.sources.source1.port = 7000
agent4.sources.source1.channels = channel1

# configure sink1
agent4.sinks.sink1.type = hdfs
agent4.sinks.sink1.hdfs.path = hdfs://umr-ubvzlf/uhiveubnhq5/warehouse/web_stat.db/login/date=%Y-%m-%d
agent4.sinks.sink1.hdfs.fileType = DataStream
agent4.sinks.sink1.hdfs.filePrefix = buffer_census_
agent4.sinks.sink1.hdfs.writeFormat = Text
agent4.sinks.sink1.hdfs.rollInterval = 30
agent4.sinks.sink1.hdfs.inUsePrefix = .
agent4.sinks.sink1.hdfs.rollSize = 536870912
agent4.sinks.sink1.hdfs.useLocalTimeStamp = true
agent4.sinks.sink1.hdfs.rollCount = 0
agent4.sinks.sink1.channel = channel1

# configure channel1
agent4.channels.channel1.type = file
agent4.channels.channel1.checkpointDir = /data/flume_data/login_checkpoint
agent4.channels.channel1.dataDirs = /data/flume_data/login_channeldata
    • Start
# added to the supervisor daemon
/usr/local/flume/bin/flume-ng agent -n agent4 -c /usr/local/flume/conf/ -f /usr/local/flume/conf/statistics/login_flume.conf -Dflume.root.logger=info,console
Cleaning data

Call Hive from PHP through the Thrift client package, writing SQL-like HQL that is converted into MapReduce jobs which read and compute the data on HDFS; the results are then stored in MySQL.
php-thrift-client: https://github.com/garamon/php-thrift-hive-client

define('THRIFT_HIVE', ROOT . '/libs/thrift');
$GLOBALS['THRIFT_ROOT'] = THRIFT_HIVE . '/lib';
require_once $GLOBALS['THRIFT_ROOT'] . '/packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . '/transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . '/protocol/TBinaryProtocol.php';
require_once THRIFT_HIVE . '/ThriftHiveClientEx.php';

$transport = new \TSocket('127.0.0.1', 10000);
$transport->setSendTimeout(600 * 1000);
$transport->setRecvTimeout(600 * 1000);
$this->client = new \ThriftHiveClientEx(new \TBinaryProtocol($transport));
$this->client->open();
$this->client->execute('show databases');
$result = $this->client->fetchAll();
var_dump($result);
$this->client->close();
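Building on the client opened above, a sketch of the actual cleaning step might run an aggregate HQL (executed as a MapReduce job) and write the result into MySQL. The target table login_type_daily and its columns are assumptions for illustration, not taken from the article.

<?php
// Assumes $this->client is the ThriftHiveClientEx opened above
$day = date('Y-m-d', strtotime('-1 day'));

// Aggregate HQL on the day's partition; Hive turns this into a MapReduce job
$this->client->execute("select type, count(*) from login where date='{$day}' group by type");
$rows = $this->client->fetchAll();                   // each row is a tab-separated string

// Store the aggregated result in MySQL for the report layer
$pdo = new PDO('mysql:host=127.0.0.1;dbname=web_stat', 'user', 'pass');
$stmt = $pdo->prepare('insert into login_type_daily (day, type, total) values (?, ?, ?)');
foreach ($rows as $row) {
    list($type, $total) = explode("\t", $row);
    $stmt->execute([$day, $type, (int)$total]);
}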
    • HQL Syntax Description: https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0803.html
    • Note: convert HQL statements into MapReduce jobs as far as possible; otherwise you are not using Hadoop's big-data analysis capability at all, and the whole setup makes no sense
    • For example, the following logic pulls the rows out and counts them in PHP memory; avoid patterns like this, because the SQL runs in Hive as a plain data fetch and is never converted into a MapReduce job
select * from login limit 5;
// processed in PHP
$count = 0;
foreach ($queryResult as $row) { $count++; }
    • Instead, convert it into a MapReduce job in one step and leverage Hadoop's computing power
select type,count(*) from login group by type; // this way MapReduce is used
    • When a table is partitioned with PARTITIONED BY, queries can take advantage of partition pruning (input pruning), but the partition field's condition must be the one closest to the WHERE keyword for this to be exploited
// Since the login table above uses the date partition predicate, the date condition must come first
select count(*) from login where date='2016-08-23' and is_old=1;
    • Hive does not support equijoins written with comma-separated tables in the FROM clause, such as:
select * from dual a,dual b where a.key = b.key;

Should be written as:

select * from dual a join dual b on a.key = b.key;
    • Single-row INSERT is not supported in Hive, and logically should not be needed: we use Hadoop for large-scale data analysis, not as a business database
Data Report Presentation

This step is simple: read the aggregated data from MySQL and display it with tools such as Highcharts; you can also use a crontab-scheduled PHP script to send daily and weekly reports.
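As a rough sketch, the report endpoint only needs to read the aggregated MySQL table and hand Highcharts its categories and data arrays. The table and column names below are assumptions carried over from the cleaning sketch earlier.

<?php
$pdo = new PDO('mysql:host=127.0.0.1;dbname=web_stat', 'user', 'pass');
$stmt = $pdo->query(
    "select day, sum(total) as logins from login_type_daily
     where day >= date_sub(curdate(), interval 7 day)
     group by day order by day"
);

$categories = [];
$data = [];
foreach ($stmt as $row) {
    $categories[] = $row['day'];          // becomes xAxis.categories in Highcharts
    $data[] = (int)$row['logins'];        // becomes series[0].data
}

header('Content-Type: application/json');
echo json_encode(['categories' => $categories, 'data' => $data]);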

Subsequent updates

Recently, after reading more material and talking with others, I found that the data-cleaning step does not need PHP at all: the cleaning logic can be implemented entirely in HQL, with the results stored back in Hadoop, and Sqoop can then synchronize the Hadoop data into MySQL. This simplifies the process, eliminates the manual MySQL inserts, makes the data more real-time, and lays the groundwork for a second round of cleaning logic written in HQL.
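A hedged sketch of that Sqoop-based sync, wrapped in a cron-driven PHP script for consistency with the rest of the pipeline: the MySQL credentials, table name, HDFS path and field delimiter below are assumptions, not values from the article.

<?php
// sync_to_mysql.php - export yesterday's cleaned Hive result into MySQL with Sqoop
$day = date('Y-m-d', strtotime('-1 day'));

$cmd = 'sqoop export'
     . ' --connect jdbc:mysql://127.0.0.1:3306/web_stat'
     . ' --username stat --password secret'                                   // placeholder credentials
     . ' --table login_type_daily'                                            // existing MySQL table
     . ' --export-dir /uhiveubnhq5/warehouse/web_stat.db/login_type_daily/date=' . $day
     . " --input-fields-terminated-by '|'";                                   // assumes a '|'-delimited result table

exec($cmd, $output, $status);
if ($status !== 0) {
    error_log("sqoop export for {$day} failed:\n" . implode("\n", $output));
}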
