0. Common to all projects:
http://blog.csdn.net/mmc2015/article/details/46851245 (dataset format and predictors)
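The post above covers scikit-learn's basic data convention. A minimal sketch (toy data made up here): every predictor takes an `(n_samples, n_features)` matrix `X` plus a target vector `y`, and exposes `fit`/`predict`:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 4 samples with 2 features each, plus one label per sample.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)                      # every estimator learns via fit(X, y)
pred = clf.predict([[0.9, 0.9]])   # nearest training point is [1, 1]
print(pred)  # [1]
```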
http://blog.csdn.net/mmc2015/article/details/46852755 (loading your own raw data)
(loading an entire corpus for text-classification problems)
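As a sketch of the raw-data loading described above (the throwaway temporary directory is made up here), `load_files` treats each subdirectory of a container path as one category:

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny throwaway corpus with the layout load_files expects:
# container/<category>/<document files>, one subdirectory per class.
root = tempfile.mkdtemp()
for label, text in [("pos", "good movie"), ("neg", "bad movie")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "doc1.txt"), "w") as f:
        f.write(text)

corpus = load_files(root, encoding="utf-8")
print(sorted(corpus.target_names))  # ['neg', 'pos']
print(len(corpus.data))             # 2 raw documents as strings
```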
http://blog.csdn.net/mmc2015/article/details/46906409 (5. loading the common built-in datasets)
(loaders for many common datasets; 5. Dataset loading utilities)
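For example, the built-in loaders return small bundled datasets ready for fitting (iris picked here as one of them):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)             # (150, 4): 150 samples, 4 features
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']
```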
http://blog.csdn.net/mmc2015/article/details/46705983 (choosing the right estimator (which estimator suits your problem))
(one chart tells you which estimator fits your problem, so you no longer have to try them one by one)
http://blog.csdn.net/mmc2015/article/details/46857949 (training a classifier, predicting new data, evaluating the classifier)
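The train/predict/evaluate loop from that post can be sketched like this (GaussianNB and the iris data are stand-ins, not the post's exact setup):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)  # train
pred = clf.predict(X_test)                # predict new data
acc = accuracy_score(y_test, pred)        # evaluate
print(acc)
```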
http://blog.csdn.net/mmc2015/article/details/46858009 (using "Pipeline" to chain vectorizer -> transformer -> classifier, with grid search as a helper)
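A sketch of the Pipeline-plus-grid-search idea, with a made-up six-document corpus; parameters of any pipeline step are addressed with the `<step>__<param>` syntax:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["cheap pills buy now", "meeting at noon", "buy cheap now",
        "lunch meeting today", "cheap cheap buy", "noon lunch plan"]
labels = [1, 0, 1, 0, 1, 0]  # toy labels: 1 = spam-like, 0 = ham-like

pipe = Pipeline([
    ("vect", CountVectorizer()),     # text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", SGDClassifier(random_state=0)),
])

# Grid-search over a step's parameters using "<step>__<param>" names.
grid = GridSearchCV(pipe, {"vect__ngram_range": [(1, 1), (1, 2)]}, cv=2)
grid.fit(docs, labels)
print(grid.best_params_)
print(grid.predict(["buy cheap pills"]))
```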
1. Used for text classification:
http://blog.csdn.net/mmc2015/article/details/46857887 (extracting features from text files (TF, IDF))
(CountVectorizer, TfidfTransformer)
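The two-step TF/IDF extraction can be sketched on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the cat ran", "dogs ran fast"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)  # sparse matrix of raw term frequencies
print(sorted(vect.vocabulary_))    # ['cat', 'dogs', 'fast', 'ran', 'sat', 'the']
print(counts.shape)                # (3 documents, 6 vocabulary terms)

tfidf = TfidfTransformer().fit_transform(counts)  # reweight by inverse document frequency
print(tfidf.shape)                 # same shape, tf-idf weights instead of counts
```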
http://blog.csdn.net/mmc2015/article/details/46866537 (what CountVectorizer's TF extraction actually does)
(an in-depth look at what CountVectorizer does, pointing the way to personalized preprocessing)
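CountVectorizer internally runs a preprocessor, then a tokenizer, then n-gram extraction and counting; each stage is replaceable, which is the hook for personalized preprocessing. A sketch with a made-up custom preprocessor:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Swap in a custom preprocessor (here: lowercase and split hyphens);
# token_pattern below is CountVectorizer's default (tokens of 2+ word chars).
vect = CountVectorizer(
    preprocessor=lambda doc: doc.lower().replace("-", " "),
    token_pattern=r"(?u)\b\w\w+\b",
)
vect.fit(["Tf-Idf weighting", "tf idf again"])
print(sorted(vect.vocabulary_))  # ['again', 'idf', 'tf', 'weighting']
```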
http://blog.csdn.net/mmc2015/article/details/46867773 (2.5.2. implementing LSA (latent semantic analysis) via TruncatedSVD)
(LSA and LDA analysis)
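A sketch of LSA via TruncatedSVD on a made-up corpus: tf-idf vectors are projected onto a low-dimensional "topic" space:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["graph theory and trees", "graph minors and trees",
        "user interface system", "human interface design"]

X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)              # dense (n_docs, 2) LSA coordinates
print(topics.shape)  # (4, 2)
```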
(non-scikit-learn) http://blog.csdn.net/mmc2015/article/details/46940373 (Text Analytics (1): two types of word relations: paradigmatic vs. syntagmatic)
(non-scikit-learn) http://blog.csdn.net/mmc2015/article/details/46941367 (Text Analytics (1): two types of word relations: paradigmatic vs. syntagmatic, continued)
(word-level relations: paradigmatic (substitution relation: words of the same kind can replace each other; mined with TF-IDF-based similarity) vs. syntagmatic (combination relation: words that co-occur; mined with mutual information))
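Since these two posts are not about scikit-learn, here is a plain-Python sketch of the syntagmatic side (toy sentences made up here): words that co-occur more often than chance get positive pointwise mutual information.

```python
import math
from collections import Counter
from itertools import combinations

# Toy sentences: "eat" and "apple" co-occur often, so their PMI is positive.
sentences = [["eat", "apple"], ["eat", "apple"], ["eat", "apple"],
             ["red", "car"], ["red", "car"]]

n = len(sentences)
word_count = Counter(w for s in sentences for w in set(s))
pair_count = Counter(frozenset(p) for s in sentences
                     for p in combinations(sorted(set(s)), 2))

def pmi(w1, w2):
    """log of P(w1, w2) / (P(w1) * P(w2)), over sentence co-occurrence."""
    p_xy = pair_count[frozenset((w1, w2))] / n
    return math.log(p_xy / ((word_count[w1] / n) * (word_count[w2] / n)))

print(pmi("eat", "apple"))  # positive: they co-occur more than chance
```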
(non-scikit-learn) http://blog.csdn.net/mmc2015/article/details/46771791 (feature-selection methods (TF-IDF, chi-squared, and information gain))
(an introduction to TF-IDF, the chi-squared test, and information gain for feature selection)
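In scikit-learn, the chi-squared and information-gain style rankings map to `chi2` and `mutual_info_classif` used inside `SelectKBest`; a sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features scoring highest under the chi-squared test;
# swap chi2 for mutual_info_classif to rank by mutual information instead.
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
```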
2. Used for data preprocessing (4. Dataset transformations):
http://blog.csdn.net/mmc2015/article/details/46991465 (4.1. Pipeline and FeatureUnion: combining estimators)
(combining features with predictors, and features with features)
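A sketch of the two combinators (iris and these particular transformers are stand-ins): FeatureUnion concatenates transformer outputs side by side, while Pipeline chains them into a final predictor.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# FeatureUnion: 2 PCA components + 1 univariately selected feature = 3 columns.
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("best", SelectKBest(k=1))])
print(combined.fit_transform(X, y).shape)  # (150, 3)

# Pipeline: the combined features feed a predictor.
model = Pipeline([("features", combined),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)
print(model.predict(X[:2]))
```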
http://blog.csdn.net/mmc2015/article/details/46992105 (4.2. Feature extraction (feature extraction, not feature selection))
(loading features from dicts, feature hashing, text feature extraction, image feature extraction)
http://blog.csdn.net/mmc2015/article/details/46997379 (4.2.3. Text feature extraction)
http://blog.csdn.net/mmc2015/article/details/47016313 (4.3. preprocessing data (standardization/normalization/binarization, encoding, missing values))
(standardization, i.e. mean removal and variance scaling; normalization; feature binarization; encoding categorical features; imputation of missing values)
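Each preprocessing step named above has a direct class; a sketch on made-up arrays (note that `SimpleImputer` lives in `sklearn.impute` in modern versions):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, OneHotEncoder, StandardScaler, normalize

X = np.array([[1.0, 2.0], [3.0, 6.0]])

X_std = StandardScaler().fit_transform(X)      # per-column zero mean, unit variance
X_norm = normalize(X)                          # per-row unit L2 norm
X_bin = Binarizer(threshold=2.5).transform(X)  # 0/1 by threshold
print(X_std.mean(axis=0))                      # ~[0. 0.]

# Encoding categorical features:
onehot = OneHotEncoder().fit_transform([["red"], ["blue"], ["red"]])
print(onehot.toarray())                        # one indicator column per category

# Imputing missing values with the column mean:
X_imp = SimpleImputer(strategy="mean").fit_transform([[1.0], [np.nan], [3.0]])
print(X_imp)                                   # nan replaced by 2.0
```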
http://blog.csdn.net/mmc2015/article/details/47066239 (4.4. unsupervised dimensionality reduction)
(PCA, random projections, feature agglomeration)
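All three reducers share the transformer interface; a sketch on the digits data (chosen here just for its 64 features):

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

X, _ = load_digits(return_X_y=True)  # (1797 samples, 64 pixel features)

X_pca = PCA(n_components=10).fit_transform(X)
X_rp = GaussianRandomProjection(n_components=10, random_state=0).fit_transform(X)
X_agg = FeatureAgglomeration(n_clusters=10).fit_transform(X)
print(X_pca.shape, X_rp.shape, X_agg.shape)  # each (1797, 10)
```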
http://blog.csdn.net/mmc2015/article/details/47069869 (4.8. transforming the prediction target (y))
(label binarization; label encoding (transforming non-numerical labels into numerical labels))
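Both target transforms can be sketched on made-up string labels:

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

labels = ["spam", "ham", "spam", "eggs"]  # non-numerical labels

le = LabelEncoder()
y = le.fit_transform(labels)
print(list(le.classes_))  # ['eggs', 'ham', 'spam'] (sorted classes)
print(y.tolist())         # [2, 1, 2, 0]

lb = LabelBinarizer()
print(lb.fit_transform(labels))  # one row of 0/1 indicators per label
```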
3. Other important knowledge points:
http://blog.csdn.net/mmc2015/article/details/46867597 (2.5. matrix factorization problems)
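As one concrete factorization, NMF approximates a non-negative matrix (a made-up "document-term" matrix here) as a product of two non-negative factors:

```python
import numpy as np
from sklearn.decomposition import NMF

# A made-up rank-2 matrix: rows 1-2 and rows 3-4 are scalar multiples
# of each other, so two components suffice to reconstruct it.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 2.0],
              [0.0, 0.0, 3.0, 6.0]])

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)   # (4, 2): sample-to-component weights
H = model.components_        # (2, 4): component-to-feature weights
print(np.round(W @ H, 2))    # approximately reconstructs X
```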
Copyright notice: this is the author's original article; please do not reproduce it without the author's permission.
Scikit-learn: knowledge points used in a real project (summary)