I won't bother with formal jargon or a long introduction; everyone knows what PySpark and KDD-99 are, right?
If not, click here [1]
or here [2].
If you reprint this, please remember to cite the source:
http://blog.csdn.net/isinstance/article/details/51329766
Spark itself is written in Scala, and Scala is a language that grew out of Java. Although Spark also supports Python, the support is not as complete as Scala's, and there are very few books on PySpark.
So I spent the last few days studying it, and now I'd like to share what I found.
First, I used a preprocessed kddcup.data_10_percent file, with the symbolic fields replaced by numbers; for how to do the replacement, see here.
The resulting file is about 70 MB.
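For reference, here is a minimal sketch of what that replacement step might look like. It is not the exact script linked above; the mapping tables and file names are illustrative assumptions, and it simply gives each unseen symbol the next free integer code.

# Hypothetical sketch: map the three symbolic KDD-99 columns
# (protocol_type, service, flag) to integer codes so every field is numeric.
protocols = {}
services = {}
flags = {}

def encode(value, table):
    # assign the next free integer code to a symbol we have not seen before
    if value not in table:
        table[value] = len(table)
    return str(table[value])

with open("kddcup.data_10_percent", "r") as fin, open("result", "w") as fout:
    for line in fin:
        fields = line.strip().split(",")
        fields[1] = encode(fields[1], protocols)   # protocol_type
        fields[2] = encode(fields[2], services)    # service
        fields[3] = encode(fields[3], flags)       # flag
        fout.write(",".join(fields) + "\n")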
First, let's start Spark.
Open a terminal and go to the Spark home directory,
then run IPYTHON=1 ./bin/pyspark
Yes, that's right, I'm using IPython. What's IPython? Click here: IPython
Then put your preprocessed file into a new folder named 1 under the Spark home directory.
Go back to the terminal and import the modules:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
from numpy import array
from math import sqrt
Then define four functions
def parse_interaction(line):
    line_split = line.split(",")
    clean_line_split = line_split[0:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))

def distance(a, b):
    return sqrt(a.zip(b).map(lambda x: (x[0] - x[1])).map(lambda x: x * x).reduce(lambda a, b: a + b))

def dist_to_centroid(datum, clusters):
    cluster = clusters.predict(datum)
    centroid = clusters.centers[cluster]
    return sqrt(sum([x ** 2 for x in (centroid - datum)]))

def clustering_score(data, k):
    clusters = KMeans.train(data, k, maxIterations=10, runs=5, initializationMode="random")
    result = (k, clusters, data.map(lambda datum: dist_to_centroid(datum, clusters)).mean())
    print "Clustering score for k=%(k)d is %(score)f" % {"k": k, "score": result[2]}
    return result
The first function, parse_interaction, strips the label field off the end of each line.
The second function, distance(a, b), as the name suggests, computes the Euclidean distance between two points.
The third function, dist_to_centroid(datum, clusters), computes the distance from each point to the center of its cluster.
The fourth function, clustering_score(data, k), is the central one: it trains a K-means model, then calls the third function, and finally returns a tuple containing the k value used for training, the trained model (which holds the cluster centers), and the mean distance from all points to their centers.
All of the work revolves around this function.
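Just to make that return value concrete, here is a tiny toy check of these helpers, typed straight into the pyspark shell. The two-dimensional points are made up for illustration and have nothing to do with KDD-99; sc is the SparkContext the shell already provides.

toy = sc.parallelize([array([0.0, 0.0]), array([0.1, 0.2]),
                      array([5.0, 5.0]), array([5.1, 4.9])])
k, model, mean_dist = clustering_score(toy, 2)
# mean_dist is the average distance of every point to its cluster center
print mean_dist
# the trained model can then assign any new point to one of the 2 clusters
print model.predict(array([5.0, 4.8]))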
Once the functions are defined, we can start the preliminary data preparation.
Define:
max_k = 30
data_file = "1/result"
To explain: max_k is the largest k value to try when searching for a good k. If you have plenty of compute resources you can set it to 100 or 150; here it is tentatively 30.
data_file is the location of the preprocessed file. As mentioned above, create a new folder named 1 in the Spark home directory and put the preprocessed file in it.
raw_data = sc.textFile(data_file)
Loading data
labels = raw_data.map(lambda line: line.strip().split(",")[-1])
This step extracts the label text from each record of the data set,
which will be convenient later when we compare clusters against the labels.
parsed_data = raw_data.map(parse_interaction)
parsed_data_values = parsed_data.values().cache()
This step parses the data and cuts out the label field, which cannot be used in the numeric computation,
then caches the values in memory.
The data matrix also looks quite sparse, so next we standardize it, which gives us dense, comparable values.
standardizer = StandardScaler(True, True)
standardizer_model = standardizer.fit(parsed_data_values)
standardized_data_values = standardizer_model.transform(parsed_data_values)
Here Spark's StandardScaler is used to standardize the data,
which gives us the standardized data in standardized_data_values.
Now we can train on the data.
scores = map(lambda k: clustering_score(standardized_data_values, k), range(10, max_k + 1, 10))
The range function in this step takes three arguments: 10, max_k + 1, and 10.
As I said, if you have plenty of compute resources you can change the last 10 to 5, or remove it entirely.
That last 10 is the step size over the interval from 10 to max_k + 1,
so we move in steps of 10:
the first k is 10, and the second jumps straight to 20.
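In other words, with max_k = 30 as above, the candidate k values are simply:

print range(10, max_k + 1, 10)     # [10, 20, 30]
# with a step of 5 instead, the search would be finer:
print range(10, max_k + 1, 5)      # [10, 15, 20, 25, 30]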
Then print out the result:
min_k = min(scores, key=lambda x: x[2])[0]
print "Best k value is %(best_k)d" % {"best_k": min_k}
min_k is the best k value.
Remember the tuple returned by clustering_score? We pick the k whose mean point-to-center distance (the third element of that tuple) is smallest.
Now we're going to pull out the model that we've trained.
best_model = min(scores, key=lambda x: x[2])[1]
What we extract here is the trained model with the best k value.
The last step is the actual K-means cluster assignment:
cluster_assignments_sample = standardized_data_values.map(lambda datum: str(best_model.predict(datum))+","+",".join(map(str,datum)))
Then save the results of the calculation
cluster_assignments_sample.saveAsTextFile("standardized")
labels.saveAsTextFile("labels")
These are the part files generated under the standardized directory.
The files generated under the labels directory have the same names as above, just in a different directory.
Each file in standardized corresponds to a file in labels, and the lines in the two files match up one to one.
Here we rename the files in the standardized directory so they are numbered in order.
Yes, I know the names are terrible, but it keeps things simple...
Ha ha.
Then rename the files in the labels directory the same way.
After renaming, move every file except _SUCCESS into a separate folder.
Ignore the files starting with GG; those are output files produced after the run.
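For reference, a minimal sketch of that renaming step in plain Python. The directory names match the ones above, but the sequential-number scheme is just my assumption about what "in order" means here:

import os

def rename_parts(directory):
    # rename Spark's part-00000, part-00001, ... to 0, 1, 2, ... in order
    parts = sorted(f for f in os.listdir(directory) if f.startswith("part-"))
    for i, name in enumerate(parts):
        os.rename(os.path.join(directory, name), os.path.join(directory, str(i)))

rename_parts("standardized")
rename_parts("labels")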
You may also have noticed ll.py; again, the name was chosen rather casually...
That is the merge script.
You can open the file and take a look.
Here is one row taken from a file in the standardized directory:
16,-0.0677916519216,-1.23798593805,1.59687725081,-3.03993409184,-0.00306168273249,-0.0262872999946,-0.00667341089394,-0.0477201370966,-0.00257146549756,-0.0441358669765,-0.00978217473084,-0.417191282147,-0.00567867840782,-0.0105519415435,-0.00467566599809,-0.00564000635973,-0.011232070172,-0.00991896489162,-0.0276317877068,0.0,0.0,-0.0372626246721,-0.404816973483,-1.14040006348,-0.464089282412,-0.463520002343,4.06939148225,4.05898474876,-1.91027154631,0.596281475651,-0.203632862202,0.347966485214,-1.66602170748,-1.71327236727,0.449337984497,-1.25061954309,-0.15862913227,-0.464417155895,-0.463201923033,4.08467150823,4.09571132604
The first value, 16, is the id of the cluster that this row of data was assigned to by the prediction:
it belongs to cluster 16.
And here is where things get tricky.
The labels are saved in one set of files and the clustering results in another.
How do we merge them?
Anyway, I used a crude method here; if you know how to merge and output the data directly inside Spark, please contact me:
[Email protected]
So let me introduce my crude method.
The code is on GitHub; portal here.
The crude approach:
When you use the script, adjust the file names in the source to match your own.
Then just wait while it merges the two sets of files, pairing up the corresponding line in each file row by row.
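For readers who don't want to open ll.py, here is a minimal sketch of what such a line-by-line merge might look like. It is not the actual script; the number of files and the output names are illustrative assumptions, with the GG prefix borrowed from my own output files mentioned below.

# hypothetical sketch: pair line i of labels/<n> with line i of standardized/<n>
for n in range(8):                      # assuming 8 renamed part files per directory
    with open("labels/%d" % n) as lab, \
         open("standardized/%d" % n) as std, \
         open("GG%d" % n, "w") as out:
        for label, row in zip(lab, std):
            # row already starts with the predicted cluster id
            out.write(label.strip() + "," + row)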
There is one more step to extract the final results...
Copy the final merged files (mine are the GG series) into another folder,
then run the following code.
The code is on GitHub (count.py);
it collects the data results.
Finally, you can see the output in the terminal.
My output looked like this.
As you can see from the large number of comments in count.py, I went back and forth quite a bit while processing the data,
mostly on how exactly to separate and record the different types.
In the end I saved the data by cluster, with each cluster in its own file.
Finally, I used pandas' value_counts() function to output the result.
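As a rough illustration of that last counting step (this is not the actual count.py; the file name and the assumed column layout come from the merge sketch above):

import pandas as pd

# assuming each merged line looks like: label,cluster_id,feature_1,...,feature_41
merged = pd.read_csv("GG0", header=None)
labels = merged[0]        # original KDD-99 label
clusters = merged[1]      # predicted cluster id

# how many records of each label ended up in cluster 16, for example
print labels[clusters == 16].value_counts()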
Once again, if you know how to merge the data directly in Spark, you are welcome to send me an email:
[Email protected]
Forgive my poor naming...
If you want to change the code, you can fork it directly on GitHub, modify it, and then send the changes back to me.
A clustering analysis experiment on the KDD-99 data set with PySpark