Http://product.dangdang.com/23829918.html
Spark has attracted wide attention as the emerging and most widely used open source framework for big data processing, drawing many programmers and developers to learn it and build on it. MLlib is the core of the Spark framework. This book is a detailed introduction to programming with Spark MLlib; its explanations are simple and its examples plentiful.
This book is divided into 12 chapters. It starts with installing and configuring Spark, then introduces the fundamentals of MLlib programming, the construction of MLlib data objects, the use of RDDs in MLlib, and various data processing methods such as classification, clustering, and regression. It closes with a complete example that reviews the earlier material and implements a full analysis process in code. The book moves from simple to advanced theory, combines examples with theory, and is comprehensive, detailed, and intuitive. It is suitable for Spark MLlib beginners and for big data analysis and data mining practitioners, and it can also serve as a teaching reference for teachers and students in related programs at universities and training schools.
Table of Contents
Chapter 1 Spark
1.1 The Big Data Era
1.2 The Era of Big Data Analysis
1.3 Simple, Elegant, Effective: This Is Spark
1.4 The Core: MLlib
1.5 A Single Spark Can Start a Prairie Fire
1.6 Summary
Chapter 2 Spark Installation and Development Environment Configuration
2.1 Installing and Configuring Spark in Standalone Mode on Windows
2.1.1 Installing Java on Windows 7
2.1.2 Installing Scala on Windows 7
2.1.3 Downloading and Installing IntelliJ IDEA
2.1.4 Installing the Scala Plugin in IntelliJ IDEA
2.1.5 Stand-Alone Spark Installation
2.2 The Classic WordCount
2.2.1 WordCount Implemented in Spark
2.2.2 WordCount Implemented in MapReduce
2.3 Summary
Chapter 3 RDD in Detail
3.1 What Is an RDD?
3.1.1 The Secret of the RDD Name
3.1.2 RDD Features
3.1.3 Differences from Other Distributed Shared Memory
3.1.4 RDD Drawbacks
3.2 How RDDs Work
3.2.1 How RDDs Work
3.2.2 RDD Dependencies
3.3 RDD Application APIs in Detail
3.3.1 The aggregate Method for Applying Methods to a Given Dataset
3.3.2 The cache Method for Computing in Advance
3.3.3 The cartesian Method for Cartesian Products
3.3.4 The coalesce Method for Partitioned Storage
3.3.5 The countByValue Method for Counting by Value
3.3.6 The countByKey Method for Counting by Key
3.3.7 The distinct Method for Removing Duplicates from a Dataset
3.3.8 The filter Method for Filtering Data
3.3.9 The flatMap Method for Operating on Data Line by Line
3.3.10 The map Method for Operating on Individual Data Items
3.3.11 The groupBy Method for Grouping Data
3.3.12 The keyBy Method for Generating Key-Value Pairs
3.3.13 The reduce Method for Processing Two Data Items at a Time
3.3.14 The sortBy Method for Reordering Data
3.3.15 The zip Method for Merging and Zipping
3.4 Summary
Chapter 4 MLlib Basic Concepts
4.1 MLlib Basic Data Types
4.1.1 Multiple Data Types
4.1.2 Starting from Local Vectors
4.1.3 Using Vector Labels
4.1.4 Using Local Matrices
4.1.5 Using Distributed Matrices
4.2 Basic Concepts of Mathematical Statistics in MLlib
4.2.1 Basic Statistics
4.2.2 Basic Statistical Data
4.2.3 Distance Calculation
4.2.4 Calculating the Correlation Coefficient of Two Data Sets
4.2.5 Stratified Sampling
4.2.6 Hypothesis Testing
4.2.7 Random Numbers
4.3 Summary
Chapter 5 Collaborative Filtering Algorithms
5.1 What Is Collaborative Filtering
5.1.1 What Is Collaborative Filtering
5.1.2 What Is User-Based Recommendation
5.1.3 What Is Item-Based Recommendation
5.1.4 Shortcomings of Collaborative Filtering Algorithms
5.2 Similarity Measurement
5.2.1 Similarity Calculation Based on Euclidean Distance
5.2.2 Similarity Calculation Based on Cosine Angle
5.2.3 Comparing Euclidean Similarity and Cosine Similarity
5.2.4 First Example: Cosine Similarity in Practice
5.3 Alternating Least Squares (ALS) in MLlib
5.3.1 Least Squares (LS) in Detail
5.3.2 Alternating Least Squares (ALS) in MLlib Explained
5.3.3 ALS in Practice
5.4 Summary
Chapter 6 MLlib Linear Regression: Theory and Practice
6.1 Stochastic Gradient Descent in Detail
6.1.1 The Story of the Taoist Descending the Mountain
6.1.2 Theoretical Basis of Stochastic Gradient Descent
6.1.3 Stochastic Gradient Descent in Practice
6.2 Overfitting in MLlib Regression
6.2.1 Causes of Overfitting
6.2.2 Lasso Regression and Ridge Regression
6.3 MLlib Linear Regression in Practice
6.3.1 Basic Preparation for MLlib Linear Regression
6.3.2 MLlib Linear Regression in Practice: The Relationship Between Commodity Price and Consumer Income
6.3.3 Validating the Fitted Curve
6.4 Summary
Chapter 7 MLlib Classification in Practice
7.1 Logistic Regression in Detail
7.1.1 Logistic Regression Is Not a Regression Algorithm
7.1.2 Mathematical Basis of Logistic Regression
7.1.3 Univariate Logistic Regression Example
7.1.4 Multivariate Logistic Regression Example
7.1.5 Validating MLlib Logistic Regression
7.1.6 MLlib Logistic Regression Example: Judging Gastric Cancer Metastasis
7.2 Support Vector Machines in Detail
7.2.1 Triangle or Circle
7.2.2 Mathematical Basis of Support Vector Machines
7.2.3 Support Vector Machine Usage Example
7.2.4 Analyzing Gastric Cancer Metastasis with Support Vector Machines
7.3 Naive Bayes in Detail
7.3.1 Boys or Girls Wearing Pants
7.3.2 Mathematical Basis and Significance of Bayes' Theorem
7.3.3 The Naive Bayes Theorem
7.3.4 MLlib Naive Bayes Usage Example
7.3.5 MLlib Naive Bayes in Practice: Identifying "Zombie Followers"
7.4 Summary
Chapter 8 Decision Trees and Isotonic Regression
8.1 Decision Trees in Detail
8.1.1 The Secret of the Crystal Ball
8.1.2 Algorithmic Basis of Decision Trees: Information Entropy
8.1.3 Algorithmic Basis of Decision Trees: The ID3 Algorithm
8.1.4 Constructing Decision Trees in MLlib
8.1.5 MLlib Decision Tree Example
8.1.6 Random Forests and the Gradient Boosting Algorithm (GBT)
8.2 Isotonic Regression in Detail
8.2.1 What Is Isotonic Regression
8.2.2 Isotonic Regression Example
8.3 Summary
Chapter 9 Clustering in MLlib
9.1 Clustering and Classification
9.1.1 What Is Classification
9.1.2 What Is Clustering
9.2 The K-means Algorithm in MLlib
9.2.1 What Is the K-means Algorithm
9.2.2 K-means Algorithm Example in MLlib
9.2.3 Discussion of K-means Algorithm Details
9.3 Gaussian Mixture Clustering
9.3.1 Starting from Gaussian Distribution Clustering
9.3.2 Gaussian Mixture Clustering
9.3.3 MLlib Gaussian Mixture Model Usage Example
9.4 Fast Iterative Clustering
9.4.1 Theoretical Basis of Fast Iterative Clustering
9.4.2 Fast Iterative Clustering Example
9.5 Summary
Chapter 10 Association Rules in MLlib
10.1 The Apriori Frequent Itemset Algorithm
10.1.1 Beer and Diapers
10.1.2 The Classic Apriori Algorithm
10.1.3 Apriori Algorithm Example
10.2 The FP-growth Algorithm
10.2.1 Limitations of the Apriori Algorithm
10.2.2 The FP-growth Algorithm
10.2.3 FP-Tree Example
10.3 Summary
Chapter 11 Data Dimensionality Reduction
11.1 Singular Value Decomposition (SVD)
11.1.1 Row Matrices (RowMatrix) in Detail
11.1.2 Algorithmic Basis of Singular Value Decomposition
11.1.3 Singular Value Decomposition Example in MLlib
11.2 Principal Component Analysis (PCA)
11.2.1 Definition of Principal Component Analysis (PCA)
11.2.2 Mathematical Basis of Principal Component Analysis (PCA)
11.2.3 Principal Component Analysis (PCA) Example in MLlib
11.3 Summary
Chapter 12 Feature Extraction and Transformation
12.1 TF-IDF
12.1.1 How to Find the News I Want
12.1.2 Mathematical Calculation of the TF-IDF Algorithm
12.1.3 TF-IDF Example in MLlib
12.2 Word Vectorization Tools
12.2.1 Basics of Word Vectorization
12.2.2 Word Vectorization Usage Example
12.3 Feature Selection Based on the Chi-Square Test
12.3.1 A "Foodie's" Distress
12.3.2 Example of Feature Selection Based on the Chi-Square Test in MLlib
12.4 Summary
Chapter 13 MLlib Hands-On Exercise: Iris Analysis
13.1 Modeling Description
13.1.1 Data Description and Analysis Goals
13.1.2 Modeling Description
13.2 Data Preprocessing and Analysis
13.2.1 Micro-Level Analysis: Comparing Means and Variances
13.2.2 Macro-Level Analysis: Calculating Attribute Lengths for Different Species
13.2.3 Removing Duplicates: Determining Correlation Coefficients
13.3 The Relationship Between Length and Width: Regression Analysis of the Data Set
13.3.1 Using Linear Regression to Analyze the Relationship Between Length and Width
13.3.2 Using Logistic Regression to Analyze the Relationship Between Length and Width
13.4 Analyzing the Iris Data Set with Classification and Clustering
13.4.1 Clustering the Data Set with Cluster Analysis
13.4.2 Classifying the Data Set with Classification Analysis
13.5 Final Judgment: Decision Tree Test
13.5.1 Determining How the Data Set Is Classified: Decision Trees
13.5.2 A Distributed Method for Determining Data Set Classification: Random Forests
13.6 Summary
Introduction and Table of Contents of Spark MLlib Machine Learning Practice