Processing data and charting analysis with PySpark
Introduction to PySpark
- The official description of PySpark is: "PySpark is the Python API for Spark". In other words, PySpark is the Python programming interface that Spark provides.
- Spark uses Py4J to let Python interoperate with Java, which makes it possible to write Spark programs in Python. Spark also ships pyspark, a Spark Python shell that can be used to write Spark programs in Python interactively. For example:
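from pyspark import SparkContext

# Create a local context; pyFiles lists extra files to ship with the job
sc = SparkContext("local", "Job Name", pyFiles=['myfile.py', 'lib.zip', 'App.egg'])
words = sc.textFile("/usr/share/dict/words")
# Keep the words that start with "spar" and return the first five
words.filter(lambda w: w.startswith("spar")).take(5)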
PySpark documentation home page:
PySpark is built on top of Spark's Java API:
Processing data and charting analysis
I am going to use PySpark to process a real dataset and analyze it graphically. First I need to introduce the dataset and the data processing environment.
Data Set
The MovieLens dataset was collected by the GroupLens research project at the University of Minnesota through the movie rating site movielens.umn.edu. The dataset covers seven months, from September 19, 1997 to April 22, 1998. The data has been cleaned: users with fewer than 20 ratings or with incomplete information were removed.
The MovieLens dataset:
In the MovieLens dataset, users rated the movies they had seen with a score from 1 to 5. MovieLens includes two libraries of different sizes, suited to algorithms of different scales. The small library contains 100,000 ratings of 1682 movies by 943 independent users (this is the one I use for processing and analysis here). By analyzing the dataset, we can predict a user's ratings for movies they have not watched and recommend the movies with high predicted ratings, on the assumption that these are the movies the user will be interested in next.
Data Set Structure:
1. 943 users rated 1682 movies, for a total of 100,000 ratings; rating scale: 1 to 5.
2. Each user rated at least 20 movies.
3. Simple demographic information for each user (age, gender, occupation, zip code).
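As a rough illustration of the user information in item 3, here is a minimal plain-Python sketch that splits one "|"-delimited user record into named fields. The field order (id, age, gender, occupation, zip code) follows the indices used by the processing code in this article; the sample rows and the helper name parse_user are made up for the example.

```python
# Made-up sample rows in the "id|age|gender|occupation|zip" layout (not real data).
sample_rows = [
    "1|30|F|student|00000",
    "2|45|M|engineer|11111",
]


def parse_user(line):
    """Split one '|'-delimited user record into a dictionary of named fields."""
    user_id, age, gender, occupation, zip_code = line.split("|")
    return {
        "id": int(user_id),
        "age": int(age),
        "gender": gender,
        "occupation": occupation,
        "zip": zip_code,  # kept as a string so leading zeros survive
    }


users = [parse_user(row) for row in sample_rows]
print(users[0]["age"], users[0]["occupation"])
```

Keeping the zip code as a string is a deliberate choice here, since converting it to an integer would drop leading zeros.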
Data usage:
Intended for research institutions and R&D companies; it can be used in data mining, recommender systems, artificial intelligence, complex network research, and other fields.
Environment for data processing
- Hadoop Pseudo-distributed environment
- Spark Standalone Environment
- Anaconda environment (https://www.continuum.io/downloads)
- Anaconda is a Python distribution that bundles many popular packages for scientific computing, math, engineering, and data analysis. I mainly use a few of its packages here, which saves me the trouble of installing Python packages one by one.
Other:
Processing One (statistical analysis of user ages)
Introduction to Processing One:
The age of each user is extracted from the user information. The ages are then counted, matplotlib (a Python plotting framework) is used to generate a histogram, and finally the age distribution of the audience is analyzed from the histogram.
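To make the histogram step more concrete, the following plain-Python sketch shows roughly what a normalized histogram computes: values are placed into equal-width bins, and each count is divided by the total count times the bin width, so the bar areas sum to 1. The function name density_histogram and the sample ages are made up for illustration, and the sketch assumes the values are not all identical.

```python
def density_histogram(values, bins):
    """Return (edges, densities) for equal-width bins whose areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    densities = [c / (total * width) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, densities


sample_ages = [18, 21, 22, 25, 25, 30, 34, 41, 50, 60]  # made-up ages
edges, densities = density_histogram(sample_ages, bins=4)
for left, d in zip(edges, densities):
    print("bin starting at %.1f: density %.4f" % (left, d))
```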
All code for Processing One:
# Load the user data from HDFS
user_data = sc.textFile("hdfs:/input/ml-100k/u.user")
# Print the first record of the loaded user data
user_data.first()
# Split each row on the "|" delimiter and keep the result in user_fields
user_fields = user_data.map(lambda line: line.split("|"))
# Count the total number of users
num_users = user_fields.map(lambda fields: fields[0]).count()
# Count the distinct genders; distinct() removes duplicates
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
# Count the distinct occupations
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
# Count the distinct zip codes
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
# Print the statistics
print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d" % (num_users, num_genders, num_occupations, num_zipcodes))
# Collect the user ages
ages = user_fields.map(lambda x: int(x[1])).collect()
# Plot the ages with matplotlib
import matplotlib.pyplot as plt
# density=True replaces normed=True, which newer matplotlib releases no longer accept
plt.hist(ages, bins=20, color='lightblue', density=True)
fig = plt.gcf()
fig.set_size_inches(16, 10)
plt.show()
Change to the Spark installation directory and run the following command to start the PySpark shell:
./bin/pyspark
The user data (u.user) on HDFS is then loaded, and the first record is printed with user_data.first() to show the data format.
Summary of the user data on HDFS: 943 users in total, two genders (male and female), the distinct occupations counted by num_occupations, and 795 different zip codes.
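The distinct counts above can be mimicked without Spark. The sketch below is a plain-Python analogue in which a set plays the role of distinct(); the three rows are made up, so the printed numbers are not the real dataset statistics.

```python
# Made-up rows in the "id|age|gender|occupation|zip" layout (not real data).
rows = [
    "1|30|F|student|00000",
    "2|45|M|engineer|11111",
    "3|22|F|student|22222",
]

# Equivalent of map(lambda line: line.split("|"))
fields = [row.split("|") for row in rows]

num_users = len(fields)
# A set comprehension removes duplicates, like distinct() followed by count()
num_genders = len({f[2] for f in fields})
num_occupations = len({f[3] for f in fields})
num_zipcodes = len({f[4] for f in fields})

print("Users: %d, genders: %d, occupations: %d, ZIP codes: %d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
```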
matplotlib is a Python plotting framework. Below is the information matplotlib prints while it works:
matplotlib displays the statistics graphically:
User age distribution chart:
Conclusion:
From the generated histogram we can see that the audience of these movies skews young, with most users falling in the younger age range.
Processing Two (statistical analysis of user occupations)
Introduction to Processing Two:
First, the user data is processed to obtain the occupations and the number of users in each occupation. The occupations are then counted, matplotlib is used to generate a bar chart, and the distribution of the audience across occupations is analyzed from the chart.
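The counting step described above follows the WordCount pattern: emit (occupation, 1) pairs and add up the ones per key. A minimal plain-Python sketch of the same idea, using collections.Counter in place of reduceByKey and a made-up list of occupations, could look like this:

```python
from collections import Counter

# Made-up occupation column, standing in for fields[3] of every user record.
occupations = ["student", "engineer", "student", "educator", "student", "engineer"]

# Counter adds 1 per occurrence, like mapping to (occupation, 1) and reducing by key.
count_by_occupation = Counter(occupations)

# Sort by count in ascending order so the bars grow from left to right.
ordered = sorted(count_by_occupation.items(), key=lambda kv: kv[1])
x_axis = [name for name, _ in ordered]
y_axis = [count for _, count in ordered]

print(x_axis)
print(y_axis)
```

Sorting the (name, count) pairs together keeps each label attached to its bar, which is the same effect the argsort-based reordering aims for.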
All code for Processing Two:
# Process the occupation column, much like the classic MapReduce WordCount example
count_by_occupation = user_fields.map(lambda fields: (fields[3], 1)).reduceByKey(lambda x, y: x + y).collect()
# Import the numpy module
import numpy as np
# Occupation names, used as the x-axis data of the bar chart
x_axis1 = np.array([c[0] for c in count_by_occupation])
# Number of users in each occupation, used as the y-axis data
y_axis1 = np.array([c[1] for c in count_by_occupation])
# Order the x-axis categories by ascending user count; the y-axis uses the same order
order = np.argsort(y_axis1)
x_axis = x_axis1[order]
y_axis = y_axis1[order]
# Set the bar positions and the bar width
pos = np.arange(len(x_axis))
width = 1.0
# Generate the bar chart of the occupation statistics with matplotlib
from matplotlib import pyplot as plt
ax = plt.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
# align='edge' keeps each tick in the middle of its bar on current matplotlib
plt.bar(pos, y_axis, width, color='lightblue', align='edge')
plt.xticks(rotation=30)
fig = plt.gcf()
fig.set_size_inches(16, 10)
plt.show()
Processing the user occupation information:
Counting the occupation information and generating the bar chart:
User occupation distribution chart:
Conclusion:
From the resulting chart, we can see that the majority of movie viewers are students, educators, administrators, engineers, and programmers, and that students outnumber every other occupation by a wide margin.
Processing Three (statistical analysis of movie release information)
Introduction to Processing Three:
- First, the movie data is processed to obtain the release year of each movie that users rated. Then, taking 1998 as the reference year (the dataset was compiled in 1998), the release year is subtracted from 1998 to get the age of the movie, which is used as the x-axis. matplotlib is then used to generate a histogram, and finally the distribution of movie release times is analyzed from the histogram.
- The movie information contains some dirty data, so it needs to be preprocessed first.
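The two steps above (cleaning the release year, then turning it into an age relative to 1998) can be sketched in plain Python as follows. The date strings are made up and their "dd-Mon-yyyy" layout is an assumption for the example; 1900 is used as a sentinel for unparseable years, as in the full code.

```python
def convert_year(date_string):
    """Return the four-digit year at the end of the string, or 1900 if it is missing."""
    try:
        return int(date_string[-4:])
    except ValueError:
        # Blank or malformed dates become 1900 and are filtered out afterwards.
        return 1900


# Made-up release dates; the empty string stands for a dirty record.
release_dates = ["01-Jan-1995", "", "14-Feb-1990", "01-Jan-1998"]

years = [convert_year(d) for d in release_dates]
# Drop the sentinel year, then compute the age of each movie relative to 1998.
movie_ages = [1998 - y for y in years if y != 1900]

print(years)
print(movie_ages)
```

Using a sentinel keeps the mapping step free of special cases; the trade-off is that a movie genuinely released in 1900 would also be discarded.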
All code for Processing Three:
# Load the u.item data from HDFS
movie_data = sc.textFile("hdfs:/input/ml-100k/u.item")
# Print the first record to view the data format
print(movie_data.first())
# Total number of movies
num_movies = movie_data.count()
print("Movies: %d" % num_movies)
# Preprocessing function for the movie data: a bad year is replaced with 1900
def convert_year(x):
    try:
        return int(x[-4:])
    except ValueError:
        # There is a 'bad' data point with a blank year, which we set to 1900 and filter out later
        return 1900
# Split each row on the "|" delimiter
movie_fields = movie_data.map(lambda lines: lines.split("|"))
# Extract the release date field and preprocess the dirty data
years = movie_fields.map(lambda fields: fields[2]).map(lambda x: convert_year(x))
# Drop the movies whose year is 1900 (the dirty data)
years_filtered = years.filter(lambda x: x != 1900)
# Compute the difference between 1998 and the release year, and count movies per age
movie_ages = years_filtered.map(lambda yr: 1998 - yr).countByValue()
# Use the age as the x-axis and the number of movies as the y-axis
bins = sorted(movie_ages.keys())
values = [movie_ages[age] for age in bins]
from matplotlib import pyplot as plt1
# Weight each age by its movie count, with one unit-wide bin per year of age
plt1.hist(bins, bins=list(range(min(bins), max(bins) + 2)), weights=values, color='lightblue', density=True)
fig = plt1.gcf()
fig.set_size_inches(16, 10)
plt1.show()
Load the movie data from HDFS and print the first record to view the data format:
Printed movie data format:
Printed total number of movies:
Counting the movie ages and generating the histogram:
Movie age distribution chart (the x-axis is 1998 minus the release year of the movie):
Conclusion:
From the resulting chart, we can see that the vast majority of the movies were released between 1988 and 1998.
Processing Four (statistical analysis of user ratings)
Introduction to Processing Four:
First, the rating data is processed to obtain the scores that users gave to the movies. The number of ratings for each score from 1 to 5 is then counted, and a chart is drawn for analysis.
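As a small-scale illustration of this step, the sketch below computes the same kind of summary (minimum, maximum, mean, median, and the share of each score) in plain Python using only the standard library. The list of ratings is made up, so the numbers say nothing about the real dataset.

```python
from collections import Counter
from statistics import mean, median

ratings = [5, 3, 4, 4, 1, 5, 4, 2]  # made-up scores on the 1 to 5 scale

# Count how often each score occurs, like countByValue() on the ratings RDD.
count_by_rating = Counter(ratings)

# Divide by the total so the shares of all scores add up to 1.
total = len(ratings)
distribution = {score: count_by_rating[score] / total
                for score in sorted(count_by_rating)}

print("Min rating: %d" % min(ratings))
print("Max rating: %d" % max(ratings))
print("Average rating: %2.2f" % mean(ratings))
print("Median rating: %d" % median(ratings))
print(distribution)
```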
All code for Processing Four:
# Load the user rating data from HDFS
rating_data = sc.textFile("hdfs:/input/ml-100k/u.data")
print(rating_data.first())
# Count the total number of rating records
num_ratings = rating_data.count()
print("Ratings: %d" % num_ratings)
# Split each row on the "\t" character
rating_data = rating_data.map(lambda line: line.split("\t"))
# Extract the score from each record
ratings = rating_data.map(lambda fields: int(fields[2]))
# Maximum score
max_rating = ratings.reduce(lambda x, y: max(x, y))
# Minimum score
min_rating = ratings.reduce(lambda x, y: min(x, y))
# Average score
mean_rating = ratings.reduce(lambda x, y: x + y) / float(num_ratings)
# Median score
import numpy as np
median_rating = np.median(ratings.collect())
# Average number of ratings per user
ratings_per_user = num_ratings / float(num_users)
# Average number of ratings per movie
ratings_per_movie = num_ratings / float(num_movies)
# Print the statistics
print("Min rating: %d" % min_rating)
print("Max rating: %d" % max_rating)
print("Average rating: %2.2f" % mean_rating)
print("Median rating: %d" % median_rating)
print("Average # of ratings per user: %2.2f" % ratings_per_user)
print("Average # of ratings per movie: %2.2f" % ratings_per_movie)
# Count the records for each score
count_by_rating = ratings.countByValue()
# The x-axis shows each score (1-5)
x_axis = np.array(sorted(count_by_rating.keys()))
# The y-axis shows the share of each score; the shares sum to 1
y_axis = np.array([float(count_by_rating[k]) for k in x_axis])
y_axis_normed = y_axis / y_axis.sum()
pos = np.arange(len(x_axis))
width = 1.0
# Generate the bar chart with matplotlib
from matplotlib import pyplot as plt2
ax = plt2.axes()
ax.set_xticks(pos + (width / 2))
ax.set_xticklabels(x_axis)
# align='edge' keeps each tick in the middle of its bar on current matplotlib
plt2.bar(pos, y_axis_normed, width, color='lightblue', align='edge')
plt2.xticks(rotation=30)
fig = plt2.gcf()
fig.set_size_inches(16, 10)
plt2.show()
Loading the data from HDFS:
Total number of rating records:
Some statistics of the ratings:
Counting the rating information and generating the bar chart:
User rating distribution chart:
Conclusion:
We can see that most of the movie ratings are between 3 and 5.
Processing Five (statistical analysis of the total number of ratings per user)
Introduction to Processing Five:
First, the rating data is processed to obtain the total number of ratings each user gave (at least 20 per user, each score between 1 and 5), and then a chart is drawn for analysis.
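The grouping described above can be sketched in plain Python with a dictionary of lists, which stands in for groupByKey. The tab-separated rows are made up; the code in this article only relies on the first column being the user ID and the third being the score, so the meaning given to the other columns here is an assumption.

```python
from collections import defaultdict

# Made-up rows; assumed layout: user id, item id, score, timestamp.
rating_rows = [
    "1\t10\t5\t0",
    "1\t11\t3\t0",
    "2\t10\t4\t0",
    "1\t12\t4\t0",
]

# Collect every score under its user ID, like groupByKey() on (user, score) pairs.
ratings_by_user = defaultdict(list)
for row in rating_rows:
    fields = row.split("\t")
    ratings_by_user[int(fields[0])].append(int(fields[2]))

# The length of each list is the total number of ratings for that user.
ratings_per_user = {user: len(scores) for user, scores in ratings_by_user.items()}

print(ratings_per_user)
```

When only the counts are needed, counting directly (for example with collections.Counter over the user IDs) avoids holding all scores in memory; the grouped form is shown here because it mirrors the steps of the article.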
All code for Processing Five:
# Group the scores by user (rating_data is the split RDD from Processing Four)
user_ratings_grouped = rating_data.map(lambda fields: (int(fields[0]), int(fields[2]))).groupByKey()
# User ID and the total number of ratings for that user
user_ratings_byuser = user_ratings_grouped.map(lambda kv: (kv[0], len(kv[1])))
# Print 5 results
user_ratings_byuser.take(5)
# Generate the histogram
from matplotlib import pyplot as plt3
user_ratings_byuser_local = user_ratings_byuser.map(lambda kv: kv[1]).collect()
# density=True replaces normed=True, which newer matplotlib releases no longer accept
plt3.hist(user_ratings_byuser_local, bins=200, color='lightblue', density=True)
fig = plt3.gcf()
fig.set_size_inches(16, 10)
plt3.show()
Print 5 results after processing:
Generate the distribution chart of the total number of ratings per user:
Conclusion:
The chart shows that the total number of ratings for most users is concentrated in one range. Of course, a portion of users fall outside it.
Precautions
1. To display the matplotlib charts, the operating system must have a graphical interface.
2. The matplotlib module must be installed for Python.
3. pyspark must be started as the root user; otherwise an error is reported saying that there is no permission to connect to the X server.