Instead of loading the common built-in data sets, here we discuss how to load your own raw data (that is, the data you actually encounter in practice).
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files
For information on how to load commonly used built-in data, refer to: http://blog.csdn.net/mmc2015/article/details/46906409
sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
load_files loads a dataset stored on disk, where the files belonging to different categories are laid out as:

    container_folder/
        category_1_folder/
            file_1.txt file_2.txt ... file_42.txt
        category_2_folder/
            file_43.txt file_44.txt ...
The sub-folder names (category_1_folder, etc.) serve as the category labels for supervised learning. The individual files inside each folder can be named however you like.
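As a minimal sketch of this layout, the snippet below builds such a folder tree in a temporary directory and loads it back; the folder and file names ("category_1_folder" and so on) are invented for illustration, not required by the API.

```python
# Build the container/category/file layout on the fly, then load it.
import os
import tempfile

from sklearn.datasets import load_files

root = tempfile.mkdtemp()  # stands in for "container_folder"
samples = {
    "category_1_folder": "1 start,\nhaha.",
    "category_2_folder": "3 start,\nwe like this.",
}
for category, text in samples.items():
    os.makedirs(os.path.join(root, category))
    with open(os.path.join(root, category, "file.txt"), "w") as f:
        f.write(text)

raw = load_files(root, encoding="utf-8")
print(raw.target_names)  # the sub-folder names become the class labels
```

Note that load_files discovers the categories itself; nothing but the directory structure tells it which label each file belongs to.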
Of the parameters above, only container_path, load_content, and encoding are explained here:
container_path: path to the "container_folder".
load_content=True: set to True if you want the contents of the files actually loaded into memory.
encoding: string or None (default None), controls whether the file contents are decoded. None is meant mainly for pictures, video, or other binary files rather than text; if a string is given and load_content=True, the contents are decoded with that encoding. Note that text files nowadays are usually encoded as "utf-8". If you do not specify an encoding (encoding=None), the contents are returned as bytes rather than Unicode, and many functions in the sklearn.feature_extraction.text module then cannot be used.
Return value: a Bunch, a dictionary-like object. The fields we care about are:
data: the raw file contents; see the example output below for the format.
filenames: the name of each file.
target: the category labels (integer indices starting at 0).
target_names: the meaning of each (numeric) category label, taken from the sub-folder names (category_1_folder, etc.).
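The four fields are aligned by index: target_names[target[i]] is the category folder that data[i] and filenames[i] came from. A self-contained sketch (the "neg"/"pos" folder names are invented for illustration):

```python
# Show how data, filenames, target, and target_names line up by index.
import os
import tempfile

from sklearn.datasets import load_files

root = tempfile.mkdtemp()
for cat, text in [("neg", "1 start,\nhaha."), ("pos", "4 start,\npretty good.")]:
    os.makedirs(os.path.join(root, cat))
    with open(os.path.join(root, cat, cat + ".txt"), "w") as f:
        f.write(text)

raw = load_files(root, encoding="utf-8")
for i in range(len(raw.data)):
    # the file's path always contains the folder named by its label
    print(raw.filenames[i], "->", raw.target_names[raw.target[i]])
```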
Example:
```python
from sklearn import datasets

rawData = datasets.load_files("data_folder")
print(rawData)
X = rawData.data
print(X[0])  # first file's content
y = rawData.target
print(y)

rawData = datasets.load_files("data_folder", encoding="utf-8")
print(rawData)
```

Output:

```
{'target_names': ['category_1_folder', 'category_2_folder', 'category_3_folder'],
 'data': ['6 start,\r\ni don\'t like.', '3 start,\r\nwe like this.', '2 start,\r\nniubi.',
          '4 start,\r\npretty good.', '1 start,\r\nhaha.', '5 start,\r\nnot so good.'],
 'target': array([2, 1, 0, 1, 0, 2]), 'DESCR': None,
 'filenames': array(['data_folder\\category_3_folder\\6.txt',
        'data_folder\\category_2_folder\\3.txt',
        'data_folder\\category_1_folder\\2.txt',
        'data_folder\\category_2_folder\\4.txt',
        'data_folder\\category_1_folder\\1.txt',
        'data_folder\\category_3_folder\\5.txt'], dtype='|S35')}
6 start,
i don't like.
[2 1 0 1 0 2]
{'target_names': ['category_1_folder', 'category_2_folder', 'category_3_folder'],
 'data': [u'6 start,\r\ni don\'t like.', u'3 start,\r\nwe like this.', u'2 start,\r\nniubi.',
          u'4 start,\r\npretty good.', u'1 start,\r\nhaha.', u'5 start,\r\nnot so good.'],
 'target': array([2, 1, 0, 1, 0, 2]), 'DESCR': None,
 'filenames': array(['data_folder\\category_3_folder\\6.txt',
        'data_folder\\category_2_folder\\3.txt',
        'data_folder\\category_1_folder\\2.txt',
        'data_folder\\category_2_folder\\4.txt',
        'data_folder\\category_1_folder\\1.txt',
        'data_folder\\category_3_folder\\5.txt'], dtype='|S35')}
```