Spark SQL Getting Started case human resources system data processing

Source: Internet
Author: User
Tags joins log4j

  

Through this case, a relatively complete and complex data processing case is given, and the detailed analysis of the case is given.

Organizational chart of management content of human resource system

1) The construction of database and table of human resource system.

2) Loading of data from HR system.

3) The data of the human resources system is queried.

Basic information for Employees

Employee's name, employee ID, employee's gender, employee's age, year of entry, position, Department ID

michael,1,male,37,2001,developer,2
andy,2,female,33,2003,manager,1
justin,3,female,23,2013,recruitingspecialist,3
john,4,male,22,2014,developer,2
herry,5,male,27,2010,developer,1
brewster,6,male,37,2001,manager,2
brice,7,female,30,2003,manager,3
justin,8,male,23,2013,recruitingspecialist,3
john,9,male,22,2014,developer,1
herry,10,female,27,2010,recruitingspecialist,3

Department basic information

Department name, number, data content

management,1
researchanddevelopment,2
humanresources,3

Staff attendance information

Employee ID, year, month information, staff overtime, late, absenteeism, early hours of information

1,2015,12,0,2,4,0
2,2015,8,5,0,5,3
3,2015,3,16,4,1,5
4,2015,3,0,0,0,0
5,2015,3,0,3,0,0
6,2015,3,32,0,0,0
7,2015,3,0,16,3,32
8,2015,19,36,0,0,0,3
9,2015,5,6,30,0,2,2
10,2015,10,6,56,40,0,32
1,2014,12,0,2,4,0
2,2014,38,5,40,5,3
3,2014,23,16,24,1,5
4,2014,23,0,20,0,0
5,2014,3,0,3,20,0
6,2014,23,32,0,0,0
7,2014,43,0,16,3,32
8,2014,49,36,0,20,0,3
9,2014,45,6,30,0,22,2
10,2014,40,6,56,40,0,22

Employee Payroll List

Employee ID, Salary

1,5000
2,10000
3,6000
4,7000
5,5000
6,11000
7,12000
8,5500
9,6500
10,4500

The construction of database and table of human resource system

The data of HR system is loaded into the HRS data of Hive Warehouse, and the data of human resource system are built separately.

1. Start Spark-shell

Bin/spark-shell--executor-memory 2g--driver-memory 1g--master spark://spark01:7077

Where Spark01 is the master node of the current spark cluster.

As the current use of hive as a data warehouse, as to how to install and configuration, not much to repeat, very simple, hive-site.xml file configuration and start the Metastore service and other preparation operations.

  

To remove unnecessary log information:

Scala > Import org.apache.log4j. {Level,logger}

Scala > Logger.getlogger ("Org.apache.spark"). SetLevel (Level.warn)

Scala > Logger.getlogger ("Org.apache.spark,sql"). SetLevel (Level.warn)
Scala > Logger.getlogger ("org.apache.hadoop.hive.ql"). SetLevel (Level.warn)
When submitting as an application, you can set the log level in the profile conf/log4j.properties as follows

Log4j.logger.org.apache.spark = WARN

Log4j.logger.org.apache.spark.sql= WARN

LOG4J.LOGGER.ORG.APACHE.HADOOP.HIVE.QL = WARN

2. Building and using HRS database

1) Use the CREATE DATABASE statement, which is called hrs, to store all the data in the human resources system.

Scala > Sqlcontext.sql ("CREATE DATABASE HRS")

2) use a database of HR systems hrs

Scala > Sqlcontext.sql ("Use HRS")

 

3. Data Building Table

1) Construction of basic information table for workers people

Scala > Sqlcontext.sql ("CREATE TABLE IF not EXISTS people (

Name STRING,

ID INT,

Gender STRING,

Age INT,

Year INT,

Position STRING,

Depid INT

)

ROW FORMAT delimited TERMINATED by ', ' LINES TERMINATED by ' \ n ')

2) Building Departmental basic Information Table Department

Scala > Sqlcontext.sql ("CREATE TABLE IF not EXISTS Department (

Name STRING,

Depid INT,

)

ROW FORMAT delimited TERMINATED by ', ' LINES TERMINATED by ' \ n ')

3) Construction of staff attendance information table Attendance

Scala > Sqlcontext.sql ("CREATE TABLE IF not EXISTS attendance (

ID INT,

Year INT,

Month INT,

Overtime INT,

Latetime INT,

Absenteeism INT,

Leaveearlytime INT,

)

ROW FORMAT delimited TERMINATED by ', ' LINES TERMINATED by ' \ n ')

4) Construction of Worker's salary list table salary

Scala > Sqlcontext.sql ("CREATE TABLE IF not EXISTS attendance (

ID INT,

Salary INT,

)

ROW FORMAT delimited TERMINATED by ', ' LINES TERMINATED by ' \ n ')

Loading of data from human resources systems

Load data from the local 4 files to four tables, respectively

1) Loading data operation of basic information table for employees

Scala > Sqlcontext.sql ("LOAD DATA LOCAl inpath '/usr/local/data/people.txt ' OVERWRITE into TABLE people")

Where OVERWRITE represents data that overwrites the current table, which clears the table data and then inserts the data into the table.

2) Loading data operations for departmental basic information tables

Scala > Sqlcontext.sql ("LOAD DATA LOCAl inpath '/usr/local/data/department.txt ' OVERWRITE into TABLE Department")

3) Staff Attendance basic Information table loading data operation

Scala > Sqlcontext.sql ("LOAD DATA LOCAl inpath '/usr/local/data/attendance.txt ' OVERWRITE into TABLE attendance")

4) The load data operation of the basic information table of workers ' wages

Scala > Sqlcontext.sql ("LOAD DATA LOCAl inpath '/usr/local/data/salary.txt ' OVERWRITE into TABLE salary")

Loading of data from human resources systems

 The human resources System data Common query operation has the department staff number inquiry, the Department staff salary TOPN query, the department staff average salary ranking, each department employee salary total number inquires.

View information for each table while viewing the schema information in the echo of the interface

Scala > Sqlcontext.sql ("SELECT * from people")

Scala > Sqlcontext.sql ("SELECT * from Department")

Scala > Sqlcontext.sql ("SELECT * from Attendance")

Scala > Sqlcontext.sql ("SELECT * from Salary")

1, the number of departmental staff inquiries

First, the People table data and the Department table data join operation, and then according to Department department name grouping, after grouping for the people in the unique identification of a worker's ID field for statistics, and finally get the total number of employees in each department statistics information.

Scala > Sqlcontext.sql ("Select B.name,count (a.id) from people a joins Department B on a.depid = B.depid GROUP by b.name "). Show

  

2, the total number of employees in various departments, the average value of the ranking

First, according to the department ID of the People Table data and Department table data join operation, according to the employee ID join salary table data, and then grouped according to the department name of the department, grouped after the salary of employees to sum or average, and is sorted according to the value size. (The default sort is small to large)

Scala > Sqlcontext.sql ("Select B.name,sum (c.salary) as s from people a joins Department B on a.depid = B.depid Join Sala Ry C on a.id = C.id GROUP by B.name ORDER by S "). Show

  

  

Spark SQL Getting Started case human resources system data processing

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.