Using Scala to experiment with the gradient descent algorithm on Spark

Source: Internet
Author: User
Tags: hadoop fs

The first reference is this article: http://blog.csdn.net/sadfasdgaaaasdfa/article/details/45970185

But the functions it uses are too old, so they had to be changed. The other starting point is the gradient descent figure in my own article: http://www.cnblogs.com/charlesblc/p/6206198.html

Switching the initial weights to a random vector took quite a while, but it finally works. The code is as follows:

package com.spark.my

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import breeze.linalg.DenseVector
import breeze.numerics.exp

/** Created by Baidu on 16/11/28. */
object GradientDemo {

  // Case class holding one data point: feature vector x and label y (see the notes on case class below)
  case class DataPoint(x: DenseVector[Double], y: Double)

  def parsePoint(x: Array[Double]): DataPoint = {
    // DataPoint(Vectors.dense(x.slice(0, x.size - 2)), x(x.size - 1))
    DataPoint(DenseVector(x.slice(0, x.size - 2)), x(x.size - 1))
  }

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    println("Begin Load Gradient file")

    // Load the data set
    val text = sc.textFile("hdfs://master.Hadoop:8390/gradient_data/spam.data.txt")
    val lines = text.map { line => line.split(" ").map(_.toDouble) }
    val points = lines.map(parsePoint(_)) // lines.map(parsePoint) works the same

    var w = DenseVector.rand(lines.first().size - 2)
    val iterations = 100
    for (i <- 1 to iterations) {
      val gradient = points.map(p =>
        p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
      ).reduce(_ + _)
      w -= gradient
    }
    println("Finish data loading, W num: " + w.length + "; W: " + w)
  }
}
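The post does not say how the jar is built; assuming sbt, a minimal build.sbt sketch could look like this (the version numbers are assumptions and should match the cluster's Spark and Scala versions):

// Minimal build.sbt sketch -- version numbers are assumptions
name := "scala-demo"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.scalanlp"     %% "breeze"     % "0.11.2"
)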

Then, on the m42n05 machine, download http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data and copy the file to Hadoop:

$ hadoop fs -mkdir /gradient_data
$ hadoop fs -put spam.data.txt /gradient_data/
$ hadoop fs -ls /gradient_data/
Found 1 items
-rw-r--r--   3 work supergroup     698341 2016-12-21 17:59 /gradient_data/spam.data.txt

Then copy the jar package and run the command:

$ ./bin/spark-submit --class com.spark.my.GradientDemo --master spark://10.117.146.12:7077 myjars/scala-demo.jar

Get output (the Jetty ServletContextHandler start/stop lines for the Spark web UI are abbreviated):

16/12/21 18:17:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/21 18:17:58 INFO util.log: Logging initialized @1689ms
16/12/21 18:17:58 INFO server.Server: jetty-9.2.z-SNAPSHOT
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@...{/jobs,null,AVAILABLE}
...
16/12/21 18:17:58 INFO server.ServerConnector: Started ServerConnector@...{HTTP/1.1}{0.0.0.0:4040}
16/12/21 18:17:58 INFO server.Server: Started @1811ms
16/12/21 18:17:58 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@...{/metrics/json,null,AVAILABLE}
Begin Load Gradient file
16/12/21 18:18:00 INFO mapred.FileInputFormat: Total input paths to process : 1
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/12/21 18:18:02 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Finish data loading, W num: 56; W: DenseVector(0.5742670447735152, 0.3793477463119241, 0.9681722093411653, 0.5967720119758925, 1.513648869152009, 0.8246263930800145, 0.8513296345703405, 0.5016541916805365, 0.10371045067354999, 1.0622529560536655, 0.7333760424194737, 2.1149483032187897, 0.9299367625800867, 0.7255747859512406, 0.13008556580706143, 1.4831202765138185, 0.7729907277492736, 0.9723309264036033, 13.394753146641808, 0.5531526429090097, 2.7444722115693665, 0.11325813324181622, 0.5096129116641023, 0.7201439311127137, 0.44719912156747926, 0.8273500952621051, 0.6736417633922696, 0.046531684571481415, 0.017895929000231802, 0.4726397794671698, 0.394438566392741, 0.8438784726078483, 0.4144073806784945, 0.18873920886297268, 0.4760240368798872, 0.31604719205329873, 0.694745503752298, 0.721380820951884, 0.988535475648986, 0.13515871744899247, 0.15694652560543523, 0.6939378895510522, 0.9279201378471407, 0.3336083293555714, 0.38938263676999685, 0.17159756568171308, 0.18897754115255144, 0.7281027812135723, 0.7233165381530381, 1.1093715737790655, 0.15675561193336351, 2.059622965151493, 0.6839713282339183, 0.11528695729374866, 7.413534050555067, 23.13404922028611)
16/12/21 18:18:07 INFO server.ServerConnector: Stopped ServerConnector@...{HTTP/1.1}{0.0.0.0:4040}
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@...{/stages/stage/kill,null,UNAVAILABLE}
...
16/12/21 18:18:07 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@...{/jobs,null,UNAVAILABLE}

You can see that the data is processed properly.

To watch the intermediate values, add this line inside the iteration loop of the code:

println ("In Data loading, W num:" + w.length + "; W: "+ w")

Then copy the jar package again and rerun. This prints a lot of intermediate output, but each iteration changes w only slightly; in some components only the last few digits change:

In data loading, W num: 56; W: DenseVector(0.8387794911469437, 0.041931950643148204, 0.610593576873822, 0.775693127624059, 0.9595814255406686, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233533456, 0.8863247999385253, 0.4007587753541083, 2.0631977325748334, 0.8211289850510815, 1.2076387347473903, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.39766237687949973, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406284, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.0933173640821512, 0.10820509511118362, 1.9426418431339785, 0.2017114624971559, 0.9827542778431644, 5.224634203803431, 16.694903977208174)

In data loading, W num: 56; W: DenseVector(0.8387794911469437, 0.041931950643148204, 0.6105935768739001, 0.775693127624059, 0.9595814255414439, 0.8346753461732199, 1.3049939469403333, 0.7056665962054256, 0.4607139317388798, 0.7272237992038442, 0.658182563650663, 0.733627042229442, 0.49543528179048996, 0.43928474305383947, 0.7784540121519834, 3.3618947233534118, 0.8863247999385373, 0.4007587753541083, 2.0631977325749897, 0.8211289850510815, 1.2076387347474142, 0.43209585536401196, 0.8361371667999544, 0.3902040623717107, 0.9249800607229486, 0.9684655358995048, 0.7122113545634148, 0.7564214721597596, 0.9295754044438086, 0.0667831407627083, 0.8262226990678785, 0.9866253536733688, 0.7214690647928418, 0.5992067836236182, 0.801215365214358, 1.0206941788488395, 0.8887684894893382, 0.39696145592511084, 0.7994301499483707, 0.3976623768795117, 0.3213782652296576, 0.3959330364022269, 0.6573698429264838, 0.5725594506918451, 0.932872703406296, 0.4276515117478306, 0.8908902872993782, 0.6281143587881469, 0.5136752276267151, 1.093317364082217, 0.10820509511118362, 1.942641843152015, 0.2017114624971559, 0.982754277843168, 5.22463420411604, 16.694903977520784)

Gradient Descent principle

The principle of gradient descent is explained quite well here:

http://blog.csdn.net/woxincd/article/details/7040944

And in this article:

http://www.cnblogs.com/maybe2030/p/5089753.html?utm_source=tuicool&utm_medium=referral

Looking carefully, the formula in those articles does not seem to match the formula in the code; the code appears to use the sigmoid function.

This deserves a closer look.

The core formula used in the code above is:

p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y

Here p.x is an n-dimensional feature vector and p.y is a scalar label.

Then reduce(_ + _) adds up the contributions of all the data points; the result is again an n-dimensional vector.

w -= gradient

This is repeated for n iterations to obtain the final w.
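For reference, the per-point term above matches the gradient of the standard logistic loss for labels y in {-1, +1}. This is only a check under the assumption that the loss being minimized is the usual logistic loss (the referenced articles are not reproduced here):

L(w) = \sum_i \log\bigl(1 + \exp(-y_i\,(w \cdot x_i))\bigr)
\qquad\Rightarrow\qquad
\nabla_w L = \sum_i \Bigl(\frac{1}{1 + \exp(-y_i\,(w \cdot x_i))} - 1\Bigr)\, y_i\, x_i

Since 1 / (1 + exp(-z)) is exactly the sigmoid \sigma(z), the map step computes this per-point gradient and reduce(_ + _) sums it over all points, which is why the code looks like it "uses the sigmoid function" even though the referenced formulas are written differently.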

Case class

The difference between a case class and an ordinary class is explained here: http://www.tuicool.com/articles/yEZr6ve

Scala's case class is in essence an ordinary class, but it differs from a plain class in a few ways:

1. It can be instantiated without new (new still works); an ordinary class must use new.

2. The default toString is more readable.

3. equals and hashCode are implemented by default.

4. It is serializable by default, i.e. it implements Serializable.

5. It automatically inherits some functions from scala.Product.

6. The constructor parameters of a case class are public and can be accessed directly.

7. It supports pattern matching.
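A small self-contained illustration of these points (the Point class here is hypothetical, not from the code above):

// Small illustration of the case class features listed above (hypothetical example)
case class Point(x: Double, y: Double)

object CaseClassDemo {
  def main(args: Array[String]): Unit = {
    val p = Point(1.0, 2.0)               // 1. no `new` needed (new Point(1.0, 2.0) also works)
    println(p)                            // 2. readable toString: Point(1.0,2.0)
    println(p == Point(1.0, 2.0))         // 3. structural equals/hashCode: prints true
    println(p.x)                          // 6. constructor parameters are public fields
    p match {                             // 7. pattern matching via the generated extractor
      case Point(a, b) => println(s"x=$a, y=$b")
    }
    // 4./5. Point also extends Serializable and scala.Product automatically
  }
}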

Breeze

In addition, the DenseVector used above is actually the DenseVector class from Breeze (breeze.linalg.DenseVector), not Spark MLlib's Vector.
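A minimal Breeze sketch of the operations used above (dot product, element-wise arithmetic, random initialization); it assumes breeze.linalg is on the classpath, which it is when running on Spark:

// Minimal Breeze sketch of the vector operations used in the gradient code
import breeze.linalg.DenseVector
import breeze.numerics.exp

object BreezeDemo {
  def main(args: Array[String]): Unit = {
    val a = DenseVector(1.0, 2.0, 3.0)
    val b = DenseVector.rand(3)      // random vector, as used for the initial w above
    println(a dot b)                 // dot product (a scalar)
    println(a + b)                   // element-wise addition
    println(a * 2.0)                 // scaling by a scalar
    println(exp(a))                  // element-wise exp
  }
}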

LinearRegressionWithSGD

In addition, Spark ships its own linear regression implementation based on stochastic gradient descent. Related classes include the following:

The linear regression algorithms available in MLlib are LinearRegressionWithSGD, RidgeRegressionWithSGD and LassoWithSGD; the main classes involved in MLlib regression analysis are GeneralizedLinearAlgorithm and GradientDescent.
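As a comparison, here is a minimal sketch of MLlib's RDD-based LinearRegressionWithSGD on the same data; the HDFS path, the parsing (last column taken as the label) and the iteration count are assumptions for illustration, not the original post's code:

// Minimal sketch of MLlib's SGD-based linear regression (RDD API, Spark 1.x style)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object LinearRegressionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LinearRegressionDemo"))

    // Parse each line into a LabeledPoint: last column is the label, the rest are features
    val data = sc.textFile("hdfs://master.Hadoop:8390/gradient_data/spam.data.txt").map { line =>
      val cols = line.split(" ").map(_.toDouble)
      LabeledPoint(cols.last, Vectors.dense(cols.init))
    }.cache()

    val model = LinearRegressionWithSGD.train(data, 100)   // numIterations = 100
    println("weights: " + model.weights)
    println("intercept: " + model.intercept)
  }
}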

Scala with Java

In the end DenseVector was used, so the snippet below was not needed. Still, it shows that Java classes can be used directly from Scala:

import java.util.Random
new Random(53)
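For completeness, a small self-contained sketch of what using java.util.Random from Scala could look like, e.g. building the initial weight vector from a seeded generator instead of DenseVector.rand (this is hypothetical, not what the code above does):

// Hypothetical Scala/Java interop sketch: seed java.util.Random and build the initial weights
import java.util.Random
import breeze.linalg.DenseVector

object JavaInteropDemo {
  def main(args: Array[String]): Unit = {
    val rnd = new Random(53)                          // plain Java class, constructed with `new`
    val w0 = DenseVector.fill(56)(rnd.nextDouble())   // 56 = number of features used above
    println(w0)
  }
}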
