Many examples on the web, including the one on the official website, use textFile to load a file into an RDD, along the lines of sc.textFile("hdfs://n1:8020/user/hdfs/input").
The textFile parameter is a path, which can be:
1. A path to a single file, in which case only that file is loaded
2. A path to a directory, in which case all files directly under that directory are loaded (files under subdirectories are excluded)
3. A wildcard pattern, which loads multiple files at once, or all files under multiple directories (see the sketch after this list)
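For illustration, here is a minimal Scala sketch of all three forms, assuming a spark-shell session where sc is already defined. The host n1:8020 is taken from the example above; the file name part-00000 and the directory input2 are made-up placeholders.

    // 1. a single file (part-00000 is a hypothetical file name)
    val oneFile = sc.textFile("hdfs://n1:8020/user/hdfs/input/part-00000")

    // 2. a directory: loads the files directly under it,
    //    not those inside its subdirectories
    val oneDir = sc.textFile("hdfs://n1:8020/user/hdfs/input")

    // 3a. a wildcard pattern matching several directories at once
    val matched = sc.textFile("hdfs://n1:8020/user/hdfs/input/dt=*/")

    // 3b. textFile also accepts a comma-separated list of paths
    val multi = sc.textFile("hdfs://n1:8020/user/hdfs/input,hdfs://n1:8020/user/hdfs/input2")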
The third form is a handy trick. Suppose my data is partitioned first by day and then by hour, so the directory structure on HDFS looks like:
/user/hdfs/input/dt=20130728/hr=00/
/user/hdfs/input/dt=20130728/hr=01/
...
/user/hdfs/input/dt=20130728/hr=23/
The actual data lives under the individual hr=XX directories. Now suppose we want to analyze the data for the whole day of 20130728: we need to load the data under all of that day's hr=* subdirectories into one RDD, so we can write sc.textFile("hdfs://n1:8020/user/hdfs/input/dt=20130728/hr=*/"). Note the hr=* part, which is a wildcard (glob) match.
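A short sketch of how that analysis might start, again assuming a spark-shell session; the count at the end is just an illustrative action:

    // load every hour of 20130728 into one RDD via the hr=* wildcard
    val day = sc.textFile("hdfs://n1:8020/user/hdfs/input/dt=20130728/hr=*/")

    // e.g. count the day's records to confirm the load worked
    println(day.count())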
Reprinted from: Tips on using textFile with SparkContext instances in Spark