1. Spark reading gz-compressed files on HDFS
Spark 1.5 and later support reading gz-format files directly, no differently from any other plain text file.
Start the spark-shell interactive interface and read the gz file the same way you would a plain text file:
sc.textFile("/your/path/*.gz").map { ... }
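For instance, a minimal runnable sketch (the path is a placeholder) that counts the lines across the compressed files:
// each .gz file is read as plain text lines, decompressed on the fly
val lines = sc.textFile("/your/path/*.gz")
println(lines.count())
Note that gzip is not a splittable format, so each .gz file is read as a single partition.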
The above covers reading gz-compressed files.

2. Spark reading Parquet files
Spark supports the Parquet format natively.
Again in the spark-shell interactive interface, run the following:
val parquetFile = sqlContext.parquetFile("/your/path/*.parquet")
Print the schema of the parquet file:
parquetFile.printSchema()
To view the actual contents:
parquetFile.take(2).foreach(println)
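Beyond printing raw rows, the DataFrame can also be queried with SQL. A minimal sketch, assuming Spark 1.4 or later (sqlContext.read.parquet is the newer equivalent of the deprecated parquetFile API, and the table name parquet_table is arbitrary):
// read with the newer DataFrameReader API
val df = sqlContext.read.parquet("/your/path/*.parquet")
// register a temporary table so it can be queried with SQL
df.registerTempTable("parquet_table")
sqlContext.sql("SELECT * FROM parquet_table LIMIT 2").show()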
Either way, you can inspect the contents of the file.

3. Using parquet-tools
https://github.com/apache/parquet-mr/tree/master/parquet-tools
Download the appropriate jar package first.
Then set up the following alias locally:
alias parquetview='hadoop --cluster c3prc-hadoop jar /path/to/your/downloaded/parquet-tools-1.8.1.jar'
Next, use meta to view the schema and head to view the data (pointing the tool at a single file is faster):
parquetview meta /hdfs/path/single/file
parquetview head /hdfs/path/single/file
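If you need a small Parquet file to try the tools against, a spark-shell sketch like the following can produce one (the output path /your/path/parquet_demo and the column names are hypothetical):
import sqlContext.implicits._
// build a tiny two-column DataFrame and write it out as Parquet
val demo = Seq((1, "a"), (2, "b")).toDF("id", "value")
demo.write.parquet("/your/path/parquet_demo")
The output is a directory containing part files; as noted above, point parquetview at a single part file inside it rather than at the directory.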