Hue installation and configuration practices
Hue is an open-source web UI for Apache Hadoop. It evolved from Cloudera Desktop and was contributed to the open-source community by Cloudera, and it is implemented on top of the Python web framework Django. With Hue, we can interact with a Hadoop cluster from a browser-based web console to analyze and process data, such as operating on HDFS data and running MapReduce jobs. I had long heard how convenient and powerful Hue is, but had never tried it myself. First, let's look at the feature set Hue supports, as described on the official website:
- By default, session data is managed in a lightweight SQLite database; user authentication and authorization can be switched to MySQL, PostgreSQL, or Oracle
- Browse HDFS through the File Browser
- Develop and run Hive queries in the Hive editor
- Build Solr-based search applications, with visual data views and dashboards
- Run interactive queries against Impala-based applications
- Use the Spark editor and dashboard
- Edit Pig scripts and submit script tasks
- Edit Oozie definitions, and submit and monitor Workflows, Coordinators, and Bundles through a dashboard
- Browse HBase to visualize data, query data, and modify HBase tables
- Browse the Metastore to access Hive metadata and HCatalog
- Track MapReduce jobs (MR1/MR2-YARN) in the Job Browser
- Create MapReduce, Streaming, and Java jobs with the Job Designer
- Use the Sqoop 2 editor and dashboard
- Browse and edit ZooKeeper
- Query MySQL, PostgreSQL, SQLite, and Oracle databases through query editors
Next, let's verify some of Hue's features through an actual installation.
Environment preparation
The base environment and software versions used here are as follows:
- CentOS-6.6 (Final)
- JDK-1.7.0_25
- Maven-3.2.1
- Git-1.7.1
- Hue-3.7.0 (branch-3.7.1)
- Hadoop-2.2.0
- Hive-0.14.0
- Python-2.6.6
Make sure all of the software above is installed and configured correctly. Note that because we use Hue to run Hive queries, the HiveServer2 service must be started first:
```bash
cd /usr/local/hive
bin/hiveserver2 &
```
Otherwise, Hive queries cannot be executed from the Hue web console.
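As an optional sanity check before involving Hue, you can verify that HiveServer2 accepts connections with Beeline. This is a sketch; the JDBC URL assumes HiveServer2 listens on its default port 10000 on this host, so adjust it to your environment:

```bash
# Connect to HiveServer2 via Beeline and list databases (host/port assumed)
cd /usr/local/hive
bin/beeline -u jdbc:hive2://10.10.4.125:10000 -n hadoop -e 'SHOW DATABASES;'
```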
Installation and configuration
I have created a hadoop user. As the hadoop user, I first use yum to install the software Hue depends on:
```bash
sudo yum install krb5-devel cyrus-sasl-gssapi cyrus-sasl-devel libxml2-devel libxslt-devel mysql mysql-devel openldap-devel python-devel python-simplejson sqlite-devel
```
Then, download and build Hue with the following commands:
```bash
cd /usr/local/
sudo git clone https://github.com/cloudera/hue.git branch-3.7.1
sudo chown -R hadoop:hadoop branch-3.7.1/
cd branch-3.7.1/
make apps
```
If the above completes without problems, Hue is installed. The Hue configuration file is /usr/local/branch-3.7.1/desktop/conf/pseudo-distributed.ini. The default configuration does not run Hue properly, so it must be modified to match the Hadoop cluster. The file is divided into segments according to the software Hue integrates with, and each segment contains sub-segments to make configuration easier to manage, as shown below (sub-segment names omitted):
- desktop
- libsaml
- libopenid
- liboauth
- librdbms
- hadoop
- filebrowser
- liboozie
- oozie
- beeswax
- impala
- pig
- sqoop
- proxy
- hbase
- search
- indexer
- jobsub
- jobbrowser
- zookeeper
- spark
- useradmin
- libsentry
We can easily configure only what we need. The following table describes the changes I made to the configuration file:
| Hue configuration section | Hue configuration item | Hue configuration value | Description |
| --- | --- | --- | --- |
| desktop | default_hdfs_superuser | hadoop | HDFS superuser for user management |
| desktop | http_host | 10.10.4.125 | Host/IP address of the Hue web server |
| desktop | http_port | 8000 | Hue web server service port |
| desktop | server_user | hadoop | User running the Hue web server process |
| desktop | server_group | hadoop | Group running the Hue web server process |
| desktop | default_user | yanjun | Hue administrator |
| hadoop/hdfs_clusters | fs_defaultfs | hdfs://hadoop6:8020 | Corresponds to the core-site.xml item fs.defaultFS |
| hadoop/hdfs_clusters | hadoop_conf_dir | /usr/local/hadoop/etc/hadoop | Hadoop configuration file directory |
| hadoop/yarn_clusters | resourcemanager_host | hadoop6 | Corresponds to the yarn-site.xml item yarn.resourcemanager.hostname |
| hadoop/yarn_clusters | resourcemanager_port | 8032 | ResourceManager service port |
| hadoop/yarn_clusters | resourcemanager_api_url | http://hadoop6:8088 | Corresponds to the yarn-site.xml item yarn.resourcemanager.webapp.address |
| hadoop/yarn_clusters | proxy_api_url | http://hadoop6:8888 | Corresponds to the yarn-site.xml item yarn.web-proxy.address |
| hadoop/yarn_clusters | history_server_api_url | http://hadoop6:19888 | Corresponds to the mapred-site.xml item mapreduce.jobhistory.webapp.address |
| beeswax | hive_server_host | 10.10.4.125 | Host name/IP address of the Hive node |
| beeswax | hive_server_port | 10000 | HiveServer2 service port |
| beeswax | hive_conf_dir | /usr/local/hive/conf | Hive configuration file directory |
The above configures everything related to the Hadoop cluster and Hive (Hive is configured in the beeswax segment; Hue interacts with Hive through HiveServer2).
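For orientation, here is a sketch of how the table's values land in pseudo-distributed.ini. The nesting follows the template shipped with Hue; substitute your own hosts and paths:

```ini
[desktop]
  http_host=10.10.4.125
  http_port=8000
  server_user=hadoop
  server_group=hadoop
  default_user=yanjun
  default_hdfs_superuser=hadoop

[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Matches fs.defaultFS in core-site.xml
      fs_defaultfs=hdfs://hadoop6:8020
      hadoop_conf_dir=/usr/local/hadoop/etc/hadoop
  [[yarn_clusters]]
    [[[default]]]
      resourcemanager_host=hadoop6
      resourcemanager_port=8032
      resourcemanager_api_url=http://hadoop6:8088
      proxy_api_url=http://hadoop6:8888
      history_server_api_url=http://hadoop6:19888

[beeswax]
  # Hue talks to Hive through HiveServer2
  hive_server_host=10.10.4.125
  hive_server_port=10000
  hive_conf_dir=/usr/local/hive/conf
```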
Finally, start the Hue service by executing the following commands:
```bash
cd /usr/local/branch-3.7.1/
build/env/bin/supervisor &
```
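To confirm the server came up, a simple check (assuming the host and port configured above) is to request the front page and look for an HTTP response:

```bash
# Expect an HTTP 200, or a redirect to the login page, if Hue is up
curl -I http://10.10.4.125:8000/
```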
Hue function verification
We mainly execute Hive queries on the Hue Web Console, so we need to prepare Hive-related tables and data.
First, create a database in Hive (grant the permission first if you do not have it):
```sql
GRANT ALL TO USER hadoop;
CREATE DATABASE user_db;
```
Here, hadoop is Hive's administrative user, so all permissions can be granted to it.
Create an example table. The table creation DDL is as follows:
```sql
CREATE TABLE user_db.daily_user_info (
  device_type int,
  version string,
  channel string,
  udid string)
PARTITIONED BY (stat_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```
The format of the prepared data file is as follows:
```
.2.1 C-gbnpk b01b8178b86cebb9fddc035bb238876d
3.0.7 A-wanglouko e2b7a3d8713d51c0215c3a4affacbc95
1.2.7 H-follower 766e7b2d2eedba2996498605fa03ed33
1.2.7 A-shiry d2924e24d9dbc887c3bea5a1682204d9
1.5.1 Z-wammer f880af48ba2567de0f3f9a6bb70fa962
1.2.7 H-clouda aa051d9e2accbae74004d761ec747110
2.2.13 H-clouda 02a32fd61c60dd2c5d9ed8a826c53be4
2.5.9 B-ywsy 04cc447ad65dcea5a131d5a993268edf
```
Each field is separated by a TAB character, and the fields correspond to the columns of the user_db.daily_user_info table created above. Next, we load the test data into the example table's partitions:
```sql
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-05.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-05');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-06.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-06');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-07.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-07');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-08.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-08');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-09.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-09');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-10.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-10');
LOAD DATA LOCAL INPATH '/home/hadoop/u2014-12-11.log' OVERWRITE INTO TABLE user_db.daily_user_info PARTITION (stat_date='2014-12-11');
```
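As a quick check (not part of the original steps), you can confirm that the partitions were created as expected:

```sql
-- List the partitions of the example table
SHOW PARTITIONS user_db.daily_user_info;
```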
You can log in through the Hive CLI to check the table data:
```sql
USE user_db;
SELECT COUNT(1) FROM daily_user_info;
```
My test data contains 241,709,545 records.
After the Hue service starts successfully, open http://10.10.4.125:8000/ in a browser to log in. On the first visit you are prompted to enter a username and password; the user created at this first login becomes the Hue administrator, with elevated permissions to add users and manage the permissions of users and their groups.
After logging in successfully, you arrive at the Hue web console home page.
On login, the system first performs some basic environment configuration checks; which checks are run depends on which applications were enabled when we modified the configuration.
Once logged in, select the Hive item under the Query Editors menu.
After submitting a query, since it may run for a long time, you can simply wait for it to finish. The final result is displayed on the Results tab of the current page, and you can also watch Hive's background execution status while the query runs.
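For example, an aggregate over all loaded partitions makes a reasonable long-running test query. This is an illustrative query against the example table, not one from the original walkthrough:

```sql
-- Daily record counts across all loaded partitions
SELECT stat_date, COUNT(1) AS records
FROM user_db.daily_user_info
GROUP BY stat_date
ORDER BY stat_date;
```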
Through the Job Browser (http://10.10.4.125:8000/jobbrowser), you can view jobs on the Hadoop cluster in their various states, including Succeeded, Running, Failed, and Killed.
To view the detailed execution status of a Job, the JobHistoryServer and WebAppProxyServer services of the Hadoop cluster must be correctly configured and started; the relevant data can then be viewed on the web page.
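If those services are not yet running, on Hadoop 2.2 they can be started with the standard daemon scripts; this sketch assumes $HADOOP_HOME points at your Hadoop installation:

```bash
# Start the MapReduce JobHistoryServer (serves history_server_api_url, port 19888)
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
# Start the YARN WebAppProxyServer (serves proxy_api_url; requires
# yarn.web-proxy.address to be set in yarn-site.xml)
$HADOOP_HOME/sbin/yarn-daemon.sh start proxyserver
```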
To see how an individual MapTask or ReduceTask of a Job executed, click the corresponding link. This is similar to Hadoop YARN's own job web UI and makes monitoring very convenient.
User management and authorization
After logging in as an authorized administrator, click the username in the upper-right corner (yanjun here); the "Manage Users" menu item appears in the drop-down list. There you can create new users and specify their access permissions.
Above, I created several users and specified the groups they belong to (Hue supports group management). In practice, we can grant different Hue applications to different groups and then assign new users to the relevant groups, thereby controlling which Hue applications each user may access. The users created and authorized above can log in to the Hue web management system with their usernames and passwords and interact with the various enabled applications (such as MySQL and Spark).
Summary
Drawing on the understanding gained above and the problems encountered during installation and configuration, here is a summary:
- Installing and configuring Hue on CentOS can be relatively involved and may not go smoothly. I first tried to configure it on CentOS-5.11 (Final) without success, probably because the Hue versions I used were too new (I tried both branch-3.0 and branch-3.7.1), or because of problems installing some of the packages that release depends on. I recommend a newer CentOS release; here I used CentOS-6.6 (Final) and built Hue from the branch-3.7.1 source, and Python 2.6+ is required.
- With Hue, we likely also care about user management and permission assignment, so consider using another relational database, such as MySQL, as needed, and back it up, to avoid losing the user data of Hue applications and thereby losing access to the Hadoop cluster. To switch, modify the Hue configuration file and change the default sqlite3 storage to a relational database you are familiar with; MySQL, PostgreSQL, and Oracle are currently supported (see the configuration sketch after this list).
- If necessary, the Hadoop cluster's underlying access control mechanisms, such as Kerberos or Hadoop service-level authorization, can be combined with Hue's user management and authorization to restrict and control access permissions more precisely.
- Based on the Hue features described earlier, we can select different Hue applications for our actual scenarios. Through this pluggable configuration, we can enable an application and interact with it through Hue, such as Oozie, Pig, Spark, and HBase.
- If you use a lower version of Hive, such as 0.12, you may encounter problems during verification; choose a Hue version compatible with your Hive version for installation and configuration.
- This installation and configuration practice did not use the CDH packages released by Cloudera; it might go more smoothly with CDH.
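As a sketch of the database switch mentioned in the summary (the connection values below are placeholders, not from the original setup), the change goes in the [[database]] sub-segment of the desktop segment in pseudo-distributed.ini:

```ini
[desktop]
  [[database]]
    # Placeholder MySQL settings; substitute your own host, credentials, and database
    engine=mysql
    host=10.10.4.125
    port=3306
    user=hue
    password=secret
    name=hue
```

Since Hue is Django-based, after switching databases the schema can be re-created with Hue's management command, for example by running build/env/bin/hue syncdb from the Hue installation directory.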