Summary of the Network Program Design Course
I. Objectives of the Course Project
Build a complete web system that identifies the age, gender, and the value of each test item from an uploaded image of a blood routine test report, and that uses machine learning algorithms such as neural networks to predict age and gender from the blood routine data.
II. Learning Experience Summary
In this network program design course, Mr. Meng did not follow the usual well-worn hands-on projects, but innovatively chose the currently hot field of machine learning, a rather new direction for a course project. What makes it even more valuable is its relevance to the increasingly important medical industry, where auxiliary diagnostic systems are expected to become more intelligent and still have much room for improvement. The project is therefore not just an experiment; it is also very practical.
In this course I experienced a project's development process from scratch, and came to fully appreciate the open-source, collaborative development model, along with the importance of clear documentation, rigorous code review, and so on. Improving a project under this model is far less simple than it looks, but it is practical and efficient.
Before this course I had only a one-sided understanding of machine learning and clustering algorithms; now I know many machine learning libraries. In particular, I carefully studied and used Spark's MLlib, and learned the common algorithms: decision trees, naive Bayes, random forests, SVM, and so on. Through the OCR recognition work I also came to know some computer vision libraries. In short, through the image, prediction, and web modules I learned the rough structure of a project; I am no longer confused by machine learning, and I have become interested in other libraries such as OpenCV.
The learning process went as follows. In A1, a neural network implements a handwritten character recognition system; this demo gave me a concrete understanding of deep learning, and the work mainly involved configuring the environment, after which I began to learn basic Python syntax. In A2, OCR of the blood routine test report image, the step was too large for me; I only fetched the latest version and learned from experienced classmates' code the ideas of preprocessing, feature extraction, and recognition based on the image's characteristics and geometry. In A3, predicting age and gender from the blood routine data, since everyone was encouraged to try different learning libraries and compare their prediction accuracy, I invested a lot of effort in Spark's MLlib and tried the different algorithms it provides, such as naive Bayes, decision trees, binary classification, and random forests. Although each algorithm needs no more than 10 lines of core code, some parameters and their differences can only be understood properly through hands-on attempts. I had hoped to submit a PR for this part, but the encapsulation provided by the Spark platform is indeed simple, and I was not familiar with the operations of the project hosting platform; very soon, some experts had uploaded the various Spark algorithms and analyzed the results in detail, and I once again felt a huge gap between myself and others. The entire project is based on Python, and as a beginner who had just started to get familiar with Python I made no real contribution. Still, starting from having no idea where to begin, I learned a lot from the code of the more experienced students; interest is the best teacher, and I hope to gradually train myself into an excellent programmer.
III. Random Forest Algorithm in Spark
Random forest algorithms are widely used in machine learning, computer vision, and other fields, for both classification and regression. A random forest consists of multiple decision trees; compared with a single decision tree, it classifies and predicts better and is less prone to over-fitting.
A random forest is a forest composed of multiple decision trees, and the classification result is obtained by voting among these trees. Random processes are added in both the row and column directions while each tree is built: in the row direction, bootstrapping is used to sample the training data; in the column direction, sampling without replacement is used to obtain the feature subset, from which the optimal split point is then found. This is the basic principle of the random forest algorithm.
In short, the random forest algorithm avoids the over-fitting problem of a single decision tree, because the final prediction or classification is determined by the vote of several decision trees. Meanwhile, to solve the efficiency problem in a distributed environment, Spark's random forest implementation applies three main optimization strategies: 1. split-point sampling statistics; 2. feature binning; 3. layer-by-layer training.
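To make the row/column randomness and the voting concrete, here is a minimal sketch of the idea, not Spark's implementation: rows are bootstrapped with replacement, a feature subset is sampled without replacement per tree, and prediction is a majority vote. scikit-learn's DecisionTreeClassifier is used here only as a stand-in for a single tree; all function names are my own.

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, num_trees=3, feature_frac=0.6):
    # Row direction: bootstrap sample (with replacement).
    # Column direction: feature subset (sampled without replacement).
    n, d = len(X), len(X[0])
    k = max(1, int(d * feature_frac))
    forest = []
    for _ in range(num_trees):
        rows = [random.randrange(n) for _ in range(n)]
        cols = random.sample(range(d), k)
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        forest.append((DecisionTreeClassifier(max_depth=4).fit(Xb, yb), cols))
    return forest

def predict_forest(forest, x):
    # Majority vote over the individual trees' predictions.
    votes = [tree.predict([[x[j] for j in cols]])[0] for tree, cols in forest]
    return Counter(votes).most_common(1)[0][0]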
The core code for calling the random forest algorithm interface in Spark is as follows:
from __future__ import print_function
import json
import sys
import math
from pyspark import SparkContext
# The key classes are org.apache.spark.mllib.tree.RandomForest (trainClassifier)
# and org.apache.spark.mllib.tree.model.RandomForestModel (predict).
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

class BloodTestReportbyNB:

    def __init__(self, sc):
        self.sc = sc
        # Read training data
        print('Loading data files!')
        self.sexData = MLUtils.loadLabeledPoints(self.sc, "LabeledPointsdata_sex.txt")
        self.ageData = MLUtils.loadLabeledPoints(self.sc, "LabeledPointsdata_age.txt")
        print('Data files have been loaded!')
        self.predict_gender = ""
        self.predict_age = ""

    def predict(self):
        sexTraining = self.sexData
        ageTraining = self.ageData
        # Train random forest classifiers; trainClassifier returns a RandomForestModel.
        # Parameters: number of classes; categoricalFeaturesInfo (an empty map means all
        # features are continuous); number of trees; feature subset sampling strategy
        # ("auto" lets the algorithm choose); impurity measure; maximum tree depth;
        # maximum number of bins. Here: numTrees = 3, maxDepth = 4, maxBins = 32,
        # impurity = Gini; gender has 2 classes, age uses 1000 classes.
        sexModel = RandomForest.trainClassifier(sexTraining, numClasses=2,
                                                categoricalFeaturesInfo={},
                                                numTrees=3, featureSubsetStrategy="auto",
                                                impurity='gini', maxDepth=4, maxBins=32)
        ageModel = RandomForest.trainClassifier(ageTraining, numClasses=1000,
                                                categoricalFeaturesInfo={},
                                                numTrees=3, featureSubsetStrategy="auto",
                                                impurity='gini', maxDepth=4, maxBins=32)
        # Read prediction data
        sexPredict = MLUtils.loadLabeledPoints(self.sc, "LabeledPointsdata_predict_gender.txt")
        agePredict = MLUtils.loadLabeledPoints(self.sc, "LabeledPointsdata_predict_age.txt")
        # Predict: zip the label RDD with the prediction RDD to form key-value pairs
        predict_genderCollect = sexPredict.map(lambda p: p.label).zip(
            sexModel.predict(sexPredict.map(lambda x: x.features))).collect()
        predict_ageCollect = agePredict.map(lambda p: p.label).zip(
            ageModel.predict(agePredict.map(lambda x: x.features))).collect()
        self.predict_gender = predict_genderCollect[0][1]
        self.predict_age = predict_ageCollect[0][1]
        if 0 == self.predict_gender:
            self.predict_gender = "male"
        if 1 == self.predict_gender:
            self.predict_gender = "female"
        print('Predicted gender: ', self.predict_gender)
        print('Predicted age: ', self.predict_age)
        predict_data = {
            "age": self.predict_age,
            "gender": self.predict_gender
        }
        json_data = json.dumps(predict_data, ensure_ascii=False)
        print('JSON data of predict_data: ', json_data)
        return json_data

if __name__ == "__main__":
    sc = SparkContext(appName="BloodTestReportPythonNaiveBayesExample")
    BloodTestReportbyNB = BloodTestReportbyNB(sc)
    BloodTestReportbyNB.predict()
The key classes for random forests are org.apache.spark.mllib.tree.RandomForest, which provides the trainClassifier function, and org.apache.spark.mllib.tree.model.RandomForestModel, which provides the predict function.
The parameters of trainClassifier are: the number of classes; categoricalFeaturesInfo (an empty map means all features are continuous variables); the number of trees; the feature subset sampling strategy ("auto" means the algorithm chooses automatically); the impurity measure; the maximum depth of the trees; and the maximum number of bins. Specifically, I set numTrees = 3, maxDepth = 4, maxBins = 32, and Gini impurity; numClasses = 2 for gender and numClasses = 1000 for age (the value depends on the impurity measure).
The accuracy of the random forest algorithm for gender prediction can reach 71%.
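For reference, this is a minimal sketch of how such accuracy can be measured, following the standard MLlib labels-and-predictions pattern; the 70/30 split and the variable names are my assumptions, not the course code itself.

# Sketch: measure gender-prediction accuracy on a held-out test set.
# Assumes `sc` is an active SparkContext; the 70/30 split is an assumption.
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

data = MLUtils.loadLabeledPoints(sc, "LabeledPointsdata_sex.txt")
trainData, testData = data.randomSplit([0.7, 0.3])
model = RandomForest.trainClassifier(trainData, numClasses=2,
                                     categoricalFeaturesInfo={}, numTrees=3,
                                     featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() \
    / float(testData.count())
print('Gender prediction accuracy: %.2f%%' % ((1.0 - testErr) * 100))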
In addition, the blood routine data must be converted into LabeledPoint data that Spark can recognize. The implementation code is as follows:
# Convert the blood routine data in JSON format to LabeledPoint data that Spark can recognize.
def predict_dataFormat(report_data):
    # The two output file names are inferred from the files loaded in predict() above.
    output1 = open('LabeledPointsdata_predict_age.txt', 'w')
    output2 = open('LabeledPointsdata_predict_gender.txt', 'w')
    outputline1 = str(report_data["profile"]["age"]) + ","
    if report_data["profile"]["gender"] == "Man":
        outputline2 = "0,"
    if report_data["profile"]["gender"] == "Woman":
        outputline2 = "1,"
    for item in report_data["bloodtest"]:
        # Skip the GRA/EO items (absolute and percentage values).
        if item["alias"] == "GRA" or item["alias"] == "EO" or \
           item["alias"] == "GRA%" or item["alias"] == "EO%":
            continue
        else:
            outputline1 += item["value"] + " "
            outputline2 += item["value"] + " "
    output1.write(outputline1)
    output2.write(outputline2)
    output1.close()
    output2.close()
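A usage example for predict_dataFormat follows; the field names mirror the code above, but the concrete values are placeholders I made up for illustration.

# Hypothetical input illustrating the expected JSON structure; values are placeholders.
report_data = {
    "profile": {"age": 24, "gender": "Man"},
    "bloodtest": [
        {"alias": "WBC", "value": "6.2"},
        {"alias": "GRA", "value": "3.1"},   # skipped by the filter above
    ],
}
predict_dataFormat(report_data)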
# Generate LabeledPoints data in the following format:
#   label,factor1 factor2 ... factorn
# The first column is the class label; features (factors) are separated by spaces.
import csv

reader = csv.reader(open('./data_set.csv', 'r'))
output1 = open('LabeledPointsdata_age.txt', 'w')
output2 = open('LabeledPointsdata_sex.txt', 'w')

flag = 0
row = 0

for line in reader:
    row = row + 1
    if 1 == row:          # skip the header row
        continue

    column = 0
    for c in line:
        column = column + 1
        if 1 == column:   # skip the first column
            continue
        if 2 == column:   # gender column: the label for the sex file
            if "male" == c:
                outputline2 = "0,"
            else:
                outputline2 = "1,"
            continue
        if 3 == column:   # age column: the label for the age file
            outputline1 = c + ","
        else:
            if "--.--" == c:    # missing value: drop the whole row
                flag = 1
                break
            else:
                outputline1 += (c + " ")
                outputline2 += (c + " ")
    if 0 == flag:
        outputline1 += '\n'
        outputline2 += '\n'
    else:
        flag = 0
        continue
    print(column)
    output1.write(outputline1)
    output2.write(outputline2)

output1.close()
output2.close()
print('Format successful!')
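A quick sanity check of the generated files (a sketch, not part of the project) is to load one back with the same MLUtils call used in the predictor and inspect the first record.

# Sketch: confirm the "label,f1 f2 ... fn" lines parse into LabeledPoints.
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="FormatCheck")
points = MLUtils.loadLabeledPoints(sc, "LabeledPointsdata_sex.txt")
print(points.first())   # expect a LabeledPoint(label, [features...])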
IV. Project Demo
1. Environment Configuration:
# Install numpy
sudo apt-get install python-numpy
# Install opencv
sudo apt-get install python-opencv
# Install the OCR tools
sudo apt-get install tesseract-ocr
sudo pip install pytesseract
sudo apt-get install python-tk
sudo pip install pillow
# Install the Flask framework
sudo pip install Flask
# Install MongoDB
sudo apt-get install mongodb
# If "no module named mongodb" is prompted, run sudo apt-get update first
sudo service mongodb start
sudo pip install pymongo
# Install JDK, Scala, and Spark (omitted), then configure the JDK- and Spark-related
# environment variables (sudo gedit /etc/profile, add the following at the end of the file):
export JAVA_HOME=/opt/jdk1.8.0_45
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:${JRE_HOME}/bin:$PATH
export SCALA_HOME=/opt/scala-2.11.6
export PATH=${SCALA_HOME}/bin:$PATH
export SPARK_HOME=/opt/spark-hadoop/
export PYTHONPATH=/opt/spark-hadoop/python
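After configuring, a quick way to verify the Python side of the setup is the sketch below (my own check, not part of the project); note that importing pyspark may additionally require the py4j library shipped with Spark to be on PYTHONPATH.

# Sanity check: each import fails loudly if the corresponding step above was missed.
import numpy
import cv2          # python-opencv
import pytesseract
import PIL          # pillow
import flask
import pymongo
import pyspark      # needs PYTHONPATH=/opt/spark-hadoop/python
print('Environment OK')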
2. Running effect:
python dataformat.py
python view.py
Then visit http://0.0.0.0:8080/ in a browser.
Main Interface:
Upload images:
Generate a report (note: you can modify and adjust the data here):
Prediction:
Appendix: Repository Address
Student ID: SA16225281
Name: Wang Chenhang