8-28: Decided to take part in the data-processing competition, since the scenario was almost the same as a regression prediction task I had done before, and began preparing the small-scale data on the first day.
## Second major revision
### Date: 2016-08-29
Raw data processing: extract the user-follower relationships, the number of Weibo retweets in each time period, and the overall retweet depth of each Weibo post.
Next goal: build the model and implement prediction based on the time series.
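The actual job runs on PySpark, but the per-period counting logic can be illustrated in plain Python. This is a minimal sketch under an assumed record format `(retweeter_id, original_poster_id, time_period)`; the competition's real schema is not given in this log, so all names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical retweet records: (retweeter_id, original_poster_id, time_period).
# The real dataset layout is assumed, not taken from the log.
records = [
    ("u2", "u1", 0),
    ("u3", "u1", 0),
    ("u4", "u1", 1),
    ("u5", "u1", 2),
    ("u4", "u3", 1),
]

def forwards_per_period(records):
    """Count how often each user's posts are retweeted in each time period."""
    counts = defaultdict(int)
    for _retweeter, original, period in records:
        counts[(original, period)] += 1
    return dict(counts)

def retweeter_relation(records):
    """Derive a crude user -> retweeters mapping from the same records."""
    rel = defaultdict(set)
    for retweeter, original, _period in records:
        rel[original].add(retweeter)
    return {u: sorted(v) for u, v in rel.items()}

print(forwards_per_period(records))
print(retweeter_relation(records))
```

In the real pipeline each function would be a `map`/`reduceByKey` over an RDD rather than a Python loop, but the key/value shapes are the same.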
## Third major revision
### Date: 2016-08-30
Moved these operations to a Linux platform, because some of the iterations completely exhaust my computer's memory.
The main purpose of this version is to compute how the retweet depth of a Weibo post changes over the time series.
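One way to compute the depth-over-time series is to rebuild the retweet tree from the edges seen up to each period and take its height. This is a pure-Python sketch under an assumed edge format `(child, parent, time_period)`; the function and field names are hypothetical.

```python
def depth_over_time(edges, root, max_period):
    """For each period 0..max_period, the depth of the retweet tree so far.

    `edges` is a list of (child, parent, period) tuples, meaning `child`
    retweeted `parent` during `period` (assumed format).
    """
    depths = []
    for period in range(max_period + 1):
        depth = {root: 0}
        active = [(c, p) for c, p, h in edges if h <= period]
        # Propagate depths until no new node can be placed in the tree.
        changed = True
        while changed:
            changed = False
            for child, parent in active:
                if parent in depth and child not in depth:
                    depth[child] = depth[parent] + 1
                    changed = True
        depths.append(max(depth.values()))
    return depths

# Example chain u1 <- u2 <- u3 <- u4 plus a direct retweet u5 of u1.
edges = [("u2", "u1", 0), ("u3", "u2", 1), ("u4", "u3", 2), ("u5", "u1", 1)]
print(depth_over_time(edges, "u1", 2))  # -> [1, 2, 3]
```

On the full data this fixed-point propagation would be expressed as an iterative join on an RDD, which is exactly the kind of iteration that motivated the move to a machine with more memory.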
## Fourth major revision
### Date: 2016-08-31
Finished testing the extraction of the depth sequence and the retweet counts over time from the raw data.
This revision had two tasks: first, split the functionality into two separate parts; second, replace the sampled data with the original test data and run the basic data processing end to end.
The main purpose of the next release is to build a prediction model from these known relationships: train on the training data, test on the test data, then tune the parameters to get the best model.
## Fifth major revision
### Date: 2016-09-01
The serious problem this morning was running out of memory: I had cached the RDDs of intermediate computations, especially the initial data, which is so large that memory was exhausted.
This change caches only the important results, such as the RDDs of the time series, the retweet counts, and the retweet depth, so that the program can run almost to completion.
The second version of the depth calculation still has some problems and needs further revision later, especially for a specific time period: who is retweeting, and which retweeter has the most followers.
The main task for this version is to save the computed results to a file, so that the regression model can load the processed data from the file for training and prediction.
The first plan is to predict for a specific time period; the overall prediction will be done later.
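The save-then-train step can be sketched without Spark: write the processed series to a CSV file, load it back, and fit a simple regression. This is a minimal sketch with hypothetical data and function names; it uses closed-form ordinary least squares on one feature rather than whatever regression algorithm the real model uses.

```python
import csv
import os
import tempfile

# Hypothetical processed series for one post: (time_period, retweets_so_far).
series = [(0, 2), (1, 5), (2, 9), (3, 14)]

def save_series(rows, path):
    """Save the processed (period, forwards) rows to a CSV file."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["period", "forwards"])
        w.writerows(rows)

def load_series(path):
    """Load the rows back for training/prediction."""
    with open(path) as f:
        r = csv.reader(f)
        next(r)  # skip header
        return [(int(a), int(b)) for a, b in r]

def fit_line(rows):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

path = os.path.join(tempfile.mkdtemp(), "series.csv")
save_series(series, path)
a, b = fit_line(load_series(path))
print(a, b)  # slope and intercept of the fitted line
```

In the real pipeline the save would be an RDD `saveAsTextFile` (or similar) and the model a proper regression from a library, but the file acts as the same decoupling point between data processing and training.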
## Sixth major revision
### Date: 2016-09-01
The biggest gain this afternoon was finally seeing the light at the end of the tunnel.
But success is still some distance from what I had imagined.
This version will complete the computation of all the required data and save it to a file; I hope to finish it today.
Prediction of Weibo retweet counts and propagation depth, based on PySpark and regression algorithms