A cluster analysis experiment on the KDD-99 data set with PySpark

Source: Internet
Author: User
Tags: pyspark

I'll skip the jargon and the long introduction; everyone already knows what PySpark and KDD-99 are, right?
If not... see link 1
or link 2.

If you reprint this, please indicate the source:
http://blog.csdn.net/isinstance/article/details/51329766

Spark itself is written in Scala, and Scala is essentially an evolution of Java. Although Spark also supports Python, the Python API is not supported as well as the Scala one, and there are few books on PySpark.

So I spent a few days studying it recently, and now I'd like to share what I learned with you.
First, I used the replaced KDD-99 10-percent file; for how to do the replacement, see the earlier post on replacing the file.
The resulting file is a bit over 70 MB.
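For completeness, the replacement basically just maps the three symbolic KDD-99 fields (protocol_type, service and flag, columns 2-4) to integers so that every feature column can later be parsed as a float. The sketch below is my own rough reconstruction, not the script from the linked post, and the file names are only examples:

# Rough sketch of the "replacement" preprocessing.
# Hypothetical file names; assumes the folder 1 already exists.
codes = [{}, {}, {}]        # one lookup table per symbolic column
symbolic_cols = [1, 2, 3]   # 0-based positions of protocol_type, service, flag

with open("kddcup.data_10_percent", "r") as src, open("1/result", "w") as dst:
    for line in src:
        fields = line.strip().split(",")
        for table, col in zip(codes, symbolic_cols):
            value = fields[col]
            if value not in table:
                table[value] = len(table)   # assign the next unused integer
            fields[col] = str(table[value])
        dst.write(",".join(fields) + "\n")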
Let's start Spark first.

Open a terminal and go to Spark's home directory,
then run IPYTHON=1 ./bin/pyspark
Yes, that's right, I used IPython. What is IPython? Click here: IPython

Then create a new folder named 1 in the Spark home directory and put your replaced file into it.

Go back to the terminal and import the modules:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
from numpy import array
from math import sqrt

Then define four functions

def parse_interaction(line):
    line_split = line.split(",")
    clean_line_split = line_split[0:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))

def distance(a, b):
    return sqrt(a.zip(b).map(lambda x: (x[0] - x[1])).map(lambda x: x * x).reduce(lambda a, b: a + b))

def dist_to_centroid(datum, clusters):
    cluster = clusters.predict(datum)
    centroid = clusters.centers[cluster]
    return sqrt(sum([x**2 for x in (centroid - datum)]))

def clustering_score(data, k):
    clusters = KMeans.train(data, k, maxIterations=10, runs=5, initializationMode="random")
    result = (k, clusters, data.map(lambda datum: dist_to_centroid(datum, clusters)).mean())
    print "Clustering score for k=%(k)d is %(score)f" % {"k": k, "score": result[2]}
    return result

The first function, parse_interaction, strips the label from the end of each line of the file.
The second function, distance(a, b), as the name suggests, computes the spatial distance between two points.
The third function, dist_to_centroid(datum, clusters), computes the distance from each point to the center of its cluster.
The fourth function, clustering_score(data, k), is the central one: it trains KMeans, calls the third function, and finally returns a result containing the k value used for training, the trained model with its cluster centers, and the mean distance from all points to their centers.
All of the work revolves around this function.
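To make the roles of these functions concrete, here is a toy run of clustering_score on a tiny hand-made RDD. This is purely my own illustration (not part of the original experiment); it assumes you are inside the pyspark shell, so sc is already the SparkContext, and the imports and definitions above have been executed:

# Three obvious groups of 2-D points.
toy = sc.parallelize([
    array([0.0, 0.0]), array([0.1, 0.2]),
    array([5.0, 5.0]), array([5.1, 4.9]),
    array([9.0, 0.0]), array([9.2, 0.1]),
])

# clustering_score returns (k, trained model, mean distance to assigned centroid);
# the mean distance drops sharply once k reaches the "true" number of groups.
k, model, score = clustering_score(toy, 3)
print "with k=%d the mean distance to the centroid is %f" % (k, score)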

After the functions are defined, we start the preliminary data preparation.
Define:

max_k = 30
data_file = "1/result"

To explain: max_k is the largest k value that will be tried when searching for a good k. If you have plenty of computing resources you can set it to 100 or 150; here it is tentatively 30.
data_file is the location of the replaced file. As mentioned above, create a folder named 1 in the Spark home directory and put the replaced file there.

raw_data = sc.textFile(data_file)

This loads the data.

labels = raw_data.map(lambda line: line.strip().split(",")[-1])

This step extracts the label text from the data set,
which makes the comparison later on easier.
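As a quick sanity check (my addition, not part of the original walkthrough), you can count how many rows each label value has; countByValue() is a standard RDD action:

# Count the rows per label value (whatever form the labels take after the
# replacement step) and print them from most to least frequent.
label_counts = labels.countByValue()
for label, count in sorted(label_counts.items(), key=lambda kv: -kv[1]):
    print label, count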

parsed_data = raw_data.map(parse_interaction)
parsed_data_values = parsed_data.values().cache()

This step parses the data and strips out the label field, which cannot be used in the numeric computation,
and then caches the resulting values in memory.

The data matrix looks rather sparse, so we now turn it into a denser, standardized form.

standardizer = StandardScaler(True, True)
standardizer_model = standardizer.fit(parsed_data_values)
standardized_data_values = standardizer_model.transform(parsed_data_values)

Here Spark's StandardScaler is used to standardize the data,
and the result is the standardized data set standardized_data_values.
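For reference, the two True arguments to StandardScaler are withMean and withStd, i.e. subtract each column's mean and divide by its standard deviation. If you want to verify the effect (my addition, assuming the variables above), colStats from pyspark.mllib.stat gives per-column statistics:

from pyspark.mllib.stat import Statistics

# After standardization, every column should have a mean of roughly 0 and,
# for non-constant columns, a variance of roughly 1 (constant columns stay 0).
summary = Statistics.colStats(standardized_data_values)
print summary.mean()[:5]       # first five column means, all close to 0
print summary.variance()[:5]   # first five column variances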

Then we train on the data:

scores = map(lambda k: clustering_score(standardized_data_values, k), range(10, max_k+1, 10))

The range function in this step takes three parameters: 10, max_k+1 and 10.
Again, if you have plenty of computing resources, you can change the last 10 to 5 or drop it entirely.
That last 10 is the step size over the range from 10 to max_k+1,
so k moves in steps of 10:
the first k is 10, and the second jumps straight to 20.
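Concretely (a quick illustration of my own, with max_k = 30 as above):

print range(10, max_k + 1, 10)   # [10, 20, 30] -> the k values that will be tested
print range(10, max_k + 1, 5)    # [10, 15, 20, 25, 30] -> a finer (and slower) search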

Then output the result:

min_k = min(scores, key=lambda x: x[2])[0]
print "Best k value is %(best_k)d" % {"best_k": min_k}

The best_k printed here is the best k value.

Remember the return value of clustering_score?
Now we pull the trained model out of it.

best_model = min(scores, key=lambda x: x[2])[1]

The best_model extracted here is the trained model with the best k value.

The last step is the actual cluster assignment with the trained KMeans model:

cluster_assignments_sample = standardized_data_values.map(lambda datum: str(best_model.predict(datum))+","+",".join(map(str,datum)))

Then save the results of the calculation

cluster_assignments_sample.saveAsTextFile("standardized")
labels.saveAsTextFile("labels")

These are the files generated under the standardized directory: a series of part-xxxxx files plus a _SUCCESS marker.

The files generated under the labels directory have the same names as those above, just in a different directory.

Each file in the standardized directory corresponds to a file in the labels directory, and the lines of the two files correspond one to one.
So here we rename the files in the standardized directory in order.
Well, I know the names are ugly, but they're simple...
Ha ha.
Then rename the files in the labels directory the same way.

After renaming, move every file except _SUCCESS into a separate folder.
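If you would rather not rename and move the part files by hand, a small helper like the following does the same thing (my own sketch; the folder and prefix names are made up, adjust them to taste):

import os
import shutil

def collect_parts(src_dir, dst_dir, prefix):
    # Copy every part-xxxxx file from a Spark output directory into dst_dir,
    # renaming them prefix0, prefix1, ... and skipping the _SUCCESS marker.
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    parts = sorted(f for f in os.listdir(src_dir) if f.startswith("part-"))
    for i, name in enumerate(parts):
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(dst_dir, "%s%d" % (prefix, i)))

collect_parts("standardized", "to_merge", "s")   # s0, s1, ... the cluster assignments
collect_parts("labels", "to_merge", "l")         # l0, l1, ... the labels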

Ignore the files whose names start with GG; those are the output files produced after the run.
You can also see ll.py there; yes, the name is rather casual...
That is the merge script.

You can open one of the files to have a look.
Here is one row taken from a file in the standardized directory:

16,-0.0677916519216,-1.23798593805,1.59687725081,-3.03993409184,-0.00306168273249,-0.0262872999946,-0.00667341089394,-0.0477201370966,-0.00257146549756,-0.0441358669765,-0.00978217473084,-0.417191282147,-0.00567867840782,-0.0105519415435,-0.00467566599809,-0.00564000635973,-0.011232070172,-0.00991896489162,-0.0276317877068,0.0,0.0,-0.0372626246721,-0.404816973483,-1.14040006348,-0.464089282412,-0.463520002343,4.06939148225,4.05898474876,-1.91027154631,0.596281475651,-0.203632862202,0.347966485214,-1.66602170748,-1.71327236727,0.449337984497,-1.25061954309,-0.15862913227,-0.464417155895,-0.463201923033,4.08467150823,4.09571132604

The first value, 16, is the id of the cluster that this row of data was assigned to by the prediction:
it belongs to the 16th cluster.
And now comes the tricky part.
The labels are saved in one set of files and the clustering results in another.
So how do we merge them?
Anyway, I used a rather crude method here; if you can merge and output the data directly inside Spark, please contact me:
[email protected]
Let me introduce my crude method.
The GitHub portal is here:
the crude approach.
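For what it is worth, here is a sketch of how the merge could be avoided entirely by keeping the label next to the cluster assignment inside Spark. This is only my own alternative sketch, not what the original post does; it relies on the fact that both RDDs below come from the same parent through narrow map transformations, so RDD.zip() lines their rows up correctly:

# parsed_data is an RDD of (label, raw_feature_vector) pairs, so its keys are
# the labels in the original row order; standardized_data_values was produced
# from the same rows by map-style transformations, so zip() pairs them up.
labeled_assignments = parsed_data.keys().zip(standardized_data_values).map(
    lambda (label, datum): "%d,%s" % (best_model.predict(datum), label))

# Each output line is "cluster,label", ready for counting without any merging.
labeled_assignments.saveAsTextFile("cluster_and_label")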

When you use that source code, change the file names in it to match your own.
Then just wait while it merges the two sets of files, matching up the corresponding lines of data.
There is one more step to extract the results...
Copy the final result files, in my case the GG series, into another folder.
Then execute the following code

The code is on GitHub: the script that gets the data results.
Finally, you can see the output in the terminal.

And finally, my output looks like this.

As you can see from the large number of comments in the final count.py, I went back and forth during the data processing
about exactly how to separate and record the different types.
In the end the data was saved by cluster, with each cluster in its own file,
and I used pandas' value_counts() function to output the result.
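I won't paste count.py here (it is on GitHub), but the gist of that final counting step is roughly the following. The file name and column layout are my assumptions (I assume each merged line starts with the cluster id followed by the label; adjust the indexes to whatever ll.py actually produced):

import pandas as pd

# Read the merged file into (cluster, label) pairs.
rows = []
with open("merged_result", "r") as f:        # hypothetical merged file name
    for line in f:
        fields = line.strip().split(",")
        rows.append((int(fields[0]), fields[1]))

# For every cluster, count how many rows of each label landed in it.
df = pd.DataFrame(rows, columns=["cluster", "label"])
for cluster, group in df.groupby("cluster"):
    print "cluster", cluster
    print group["label"].value_counts()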

Again, anyone who knows how to merge the data directly in Spark is welcome to send me an email:
[email protected]
Forgive me for not choosing better names...
If you want to change the code, you can fork it directly on GitHub, modify it, and then send it back to me.
