This case presents a relatively complete and realistic data-processing example, together with a detailed step-by-step analysis.
Management content of the human resource system:
1) Building the database and tables of the human resource system.
2) Loading the human resource system data.
3) Querying the human resource system data.
Basic information for Employees
Employee name, employee ID, gender, age, year of entry, position, department ID
michael,1,male,37,2001,developer,2
andy,2,female,33,2003,manager,1
justin,3,female,23,2013,recruitingspecialist,3
john,4,male,22,2014,developer,2
herry,5,male,27,2010,developer,1
brewster,6,male,37,2001,manager,2
brice,7,female,30,2003,manager,3
justin,8,male,23,2013,recruitingspecialist,3
john,9,male,22,2014,developer,1
herry,10,female,27,2010,recruitingspecialist,3
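To illustrate the comma-delimited record layout above, here is a small Python sketch (illustrative only; Python is not part of the Spark workflow) that parses employee lines into typed tuples:

```python
# Parse a comma-delimited employee record into a typed tuple.
# Field order: name, employee id, gender, age, year of entry, position, department id.
def parse_employee(line):
    name, emp_id, gender, age, year, position, depid = line.split(",")
    return (name, int(emp_id), gender, int(age), int(year), position, int(depid))

records = [
    "michael,1,male,37,2001,developer,2",
    "andy,2,female,33,2003,manager,1",
]
employees = [parse_employee(r) for r in records]
print(employees[0])  # ('michael', 1, 'male', 37, 2001, 'developer', 2)
```

This is exactly the split that Hive performs when the table is declared with fields terminated by ','.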
Department basic information
Department name, department ID
management,1
researchanddevelopment,2
humanresources,3
Staff attendance information
Employee ID, year, month, overtime hours, late hours, absenteeism hours, early-leave hours
1,2015,12,0,2,4,0
2,2015,8,5,0,5,3
3,2015,3,16,4,1,5
4,2015,3,0,0,0,0
5,2015,3,0,3,0,0
6,2015,3,32,0,0,0
7,2015,3,0,16,3,32
8,2015,19,36,0,0,0,3
9,2015,5,6,30,0,2,2
10,2015,10,6,56,40,0,32
1,2014,12,0,2,4,0
2,2014,38,5,40,5,3
3,2014,23,16,24,1,5
4,2014,23,0,20,0,0
5,2014,3,0,3,20,0
6,2014,23,32,0,0,0
7,2014,43,0,16,3,32
8,2014,49,36,0,20,0,3
9,2014,45,6,30,0,22,2
10,2014,40,6,56,40,0,22
Employee Payroll List
Employee ID, Salary
1,5000
2,10000
3,6000
4,7000
5,5000
6,11000
7,12000
8,5500
9,6500
10,4500
Building the database and tables of the human resource system
The HR system data is stored in the hrs database of the Hive warehouse, with a separate table built for each of the data sets above.
1. Start spark-shell
bin/spark-shell --executor-memory 2g --driver-memory 1g --master spark://spark01:7077
where spark01 is the master node of the current Spark cluster.
Hive is used as the data warehouse here. Its installation and configuration are not covered in detail; the preparation is simple: configure the hive-site.xml file and start the metastore service.
To remove unnecessary log information:
scala> import org.apache.log4j.{Level, Logger}
scala> Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
scala> Logger.getLogger("org.apache.spark.sql").setLevel(Level.WARN)
scala> Logger.getLogger("org.apache.hadoop.hive.ql").setLevel(Level.WARN)
When submitting as an application, you can instead set the log levels in the configuration file conf/log4j.properties as follows:
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.spark.sql=WARN
log4j.logger.org.apache.hadoop.hive.ql=WARN
2. Building and using HRS database
1) Use a CREATE DATABASE statement to create a database named hrs that stores all the data of the human resources system.
scala> sqlContext.sql("CREATE DATABASE hrs")
2) Use the hrs database of the HR system
scala> sqlContext.sql("USE hrs")
3. Data Building Table
1) Build the employee basic information table people
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS people (
  name STRING,
  id INT,
  gender STRING,
  age INT,
  year INT,
  position STRING,
  depid INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
2) Build the department basic information table department
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS department (
  name STRING,
  depid INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
3) Build the staff attendance information table attendance
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS attendance (
  id INT,
  year INT,
  month INT,
  overtime INT,
  latetime INT,
  absenteeism INT,
  leaveearlytime INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
4) Build the employee payroll table salary
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS salary (
  id INT,
  salary INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
Loading the data of the human resources system
Load the four local data files into the four tables, one file per table.
1) Load data into the employee basic information table
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/usr/local/data/people.txt' OVERWRITE INTO TABLE people")
Here OVERWRITE means the loaded data overwrites the current table contents: the table is cleared first, and the new data is then inserted.
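The difference between OVERWRITE and a plain append can be sketched in Python (a simplified model of the table as a list, not Hive's actual internals):

```python
# Simplified model of a Hive table as a list of rows.
def load_data(table, new_rows, overwrite=False):
    if overwrite:
        table.clear()       # OVERWRITE first clears the existing rows
    table.extend(new_rows)  # then the new rows are inserted
    return table

table = [("old", 1)]
load_data(table, [("new", 2)], overwrite=True)
print(table)  # [('new', 2)]
```

Without OVERWRITE, LOAD DATA appends, so the old row would remain alongside the new one.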
2) Load data into the department basic information table
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/usr/local/data/department.txt' OVERWRITE INTO TABLE department")
3) Load data into the staff attendance information table
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/usr/local/data/attendance.txt' OVERWRITE INTO TABLE attendance")
4) Load data into the employee payroll table
scala> sqlContext.sql("LOAD DATA LOCAL INPATH '/usr/local/data/salary.txt' OVERWRITE INTO TABLE salary")
Querying the data of the human resources system
Common queries on the human resources system data include: the number of employees in each department, the top-N salaries within each department, the ranking of departments by average salary, and the total salary of each department.
View each table; the schema information of each table can be seen in the REPL echo:
scala> sqlContext.sql("SELECT * FROM people")
scala> sqlContext.sql("SELECT * FROM department")
scala> sqlContext.sql("SELECT * FROM attendance")
scala> sqlContext.sql("SELECT * FROM salary")
1. Number of employees in each department
First join the people table with the department table, then group by department name, and count the id field (the unique identifier of an employee) within each group; this yields the total number of employees in each department.
scala> sqlContext.sql("SELECT b.name, count(a.id) FROM people a JOIN department b ON a.depid = b.depid GROUP BY b.name").show
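As a cross-check (illustrative only, outside the Spark workflow), plain Python can reproduce the join-and-count logic on the sample data above:

```python
from collections import Counter

# (employee id, department id) pairs taken from the people data above
people = [(1, 2), (2, 1), (3, 3), (4, 2), (5, 1),
          (6, 2), (7, 3), (8, 3), (9, 1), (10, 3)]
# department id -> department name, from the department data above
department = {1: "management", 2: "researchanddevelopment", 3: "humanresources"}

# "Join" on department id, then count employee ids per department name.
counts = Counter(department[depid] for _, depid in people)
print(dict(counts))
# {'researchanddevelopment': 3, 'management': 3, 'humanresources': 4}
```

The expected query result is therefore 3 employees in management, 3 in researchanddevelopment, and 4 in humanresources.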
2. Total and average salary of each department, ranked
First join the people table with the department table on the department ID, then join the salary table on the employee ID. Group by department name, sum (or average) the salaries within each group, and sort by the resulting value (the default sort order is ascending).
scala> sqlContext.sql("SELECT b.name, sum(c.salary) AS s FROM people a JOIN department b ON a.depid = b.depid JOIN salary c ON a.id = c.id GROUP BY b.name ORDER BY s").show
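Again as an illustrative cross-check in plain Python: two joins (people-to-department, people-to-salary), a group-by on department name, a sum, and an ascending sort, using the sample data above:

```python
# employee id -> department id (from the people data above)
people = {1: 2, 2: 1, 3: 3, 4: 2, 5: 1, 6: 2, 7: 3, 8: 3, 9: 1, 10: 3}
# department id -> department name (from the department data above)
department = {1: "management", 2: "researchanddevelopment", 3: "humanresources"}
# employee id -> salary (from the payroll list above)
salary = {1: 5000, 2: 10000, 3: 6000, 4: 7000, 5: 5000,
          6: 11000, 7: 12000, 8: 5500, 9: 6500, 10: 4500}

# Join people -> department and people -> salary, group by department name, sum.
totals = {}
for emp_id, depid in people.items():
    name = department[depid]
    totals[name] = totals.get(name, 0) + salary[emp_id]

# ORDER BY s: ascending sort on the salary total, mirroring the SQL default.
for name, s in sorted(totals.items(), key=lambda kv: kv[1]):
    print(name, s)
# management 21500
# researchanddevelopment 23000
# humanresources 28000
```

Replacing sum with the mean of each group would give the average-salary ranking mentioned above.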
Spark SQL getting-started case: human resources system data processing