Reading and Writing Parquet Data in Java: A Parquet Example


import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Logger;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.GroupFactory;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetReader.Builder;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupReadSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ReadParquet {
    static Logger logger = Logger.getLogger(ReadParquet.class);

    public static void main(String[] args) throws Exception {
        // Uncomment to generate the Parquet file from input.txt first.
        // parquetWriter("test\\parquet-out2", "input.txt");
        parquetReaderV2("test\\parquet-out2");
    }

    // Reads every record with a reader obtained from ParquetReader.builder().
    static void parquetReaderV2(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        Builder<Group> builder = ParquetReader.builder(readSupport, new Path(inPath));
        ParquetReader<Group> reader = builder.build();
        Group line = null;
        while ((line = reader.read()) != null) {
            System.out.println(line.toString());
        }
        reader.close();
        System.out.println("read end");
    }

    // In newer versions all of the new ParquetReader() constructors appear to be
    // deprecated; use the builder above to construct the reader instead.
    static void parquetReader(String inPath) throws Exception {
        GroupReadSupport readSupport = new GroupReadSupport();
        ParquetReader<Group> reader = new ParquetReader<Group>(new Path(inPath), readSupport);
        Group line = null;
        while ((line = reader.read()) != null) {
            System.out.println(line.toString());
        }
        reader.close();
        System.out.println("read end");
    }

    /**
     * @param outPath  output file, written in Parquet format
     * @param inPath   input plain-text file
     * @throws IOException
     */
    static void parquetWriter(String outPath, String inPath) throws IOException {
        MessageType schema = MessageTypeParser.parseMessageType("message Pair {\n" +
                " required binary city (UTF8);\n" +
                " required binary ip (UTF8);\n" +
                " repeated group time {\n" +
                " required int32 ttl;\n" +
                " required binary ttl2;\n" +
                "}\n" +
                "}");
        GroupFactory factory = new SimpleGroupFactory(schema);
        Path path = new Path(outPath);
        Configuration configuration = new Configuration();
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        writeSupport.setSchema(schema, configuration);
        ParquetWriter<Group> writer = new ParquetWriter<Group>(path, configuration, writeSupport);

        // Read the local text file and use it to generate the Parquet file.
        Random r = new Random();
        try (BufferedReader br = new BufferedReader(new FileReader(new File(inPath)))) {
            String line = "";
            while ((line = br.readLine()) != null) {
                String[] strs = line.split("\\s+");
                // Only lines with exactly two fields (city, ip) are written.
                if (strs.length == 2) {
                    Group group = factory.newGroup()
                            .append("city", strs[0])
                            .append("ip", strs[1]);
                    Group tmpG = group.addGroup("time");
                    tmpG.append("ttl", r.nextInt(9) + 1);
                    tmpG.append("ttl2", r.nextInt(9) + "_a");
                    writer.write(group);
                }
            }
        }
        System.out.println("write end");
        writer.close();
    }
}
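The writer expects a plain-text input in which each line holds a city and an IP separated by whitespace, and it silently skips any line that does not split into exactly two fields. Here is a minimal stdlib-only sketch of that parsing rule; the `InputLines` helper class and the sample lines are hypothetical and not part of the original code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: mirrors the split-and-filter
// step that parquetWriter applies to each input line.
class InputLines {

    // Returns {city, ip} when the line has exactly two whitespace-separated
    // fields, or null when the writer would skip the line.
    static String[] parse(String line) {
        String[] fields = line.split("\\s+");
        return fields.length == 2 ? fields : null;
    }

    public static void main(String[] args) {
        // Made-up sample input; the second line has only one field.
        List<String> sample = Arrays.asList(
                "Beijing 10.0.0.1",
                "malformed",
                "Shanghai\t10.0.0.2");
        for (String line : sample) {
            String[] fields = parse(line);
            if (fields == null) {
                System.out.println("skipped: " + line);
            } else {
                System.out.println("city=" + fields[0] + ", ip=" + fields[1]);
            }
        }
    }
}
```

Note that a line with leading whitespace splits into three fields (the first one empty) and is therefore skipped as well; trimming the line before splitting would avoid that if it is not the intended behavior.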
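A note on the schema: writing Parquet data requires a schema, whereas on reading the schema is "detected automatically" because it is stored in the file itself.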
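/*
 * Each field has three attributes: repetition, data type, and field name.
 * The repetition can be one of the following three:
 *         required (appears exactly once)
 *         repeated (appears zero or more times)
 *         optional (appears zero times or once)
 * The data type of each field falls into one of two kinds:
 *         group (complex type)
 *         primitive (basic type)
 * The primitive data types are:
 * INT64, INT32, BOOLEAN, BINARY, FLOAT, DOUBLE, INT96, FIXED_LEN_BYTE_ARRAY
 */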
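The three repetition levels boil down to a minimum and maximum number of times a field may appear in a record. The following stdlib-only sketch models that rule; the `FieldRepetition` enum and `RepetitionDemo` class are hypothetical illustrations and are not the Parquet library's own types:

```java
// Hypothetical demo class, for illustration only.
class RepetitionDemo {
    public static void main(String[] args) {
        // In the example schema, "city" is required and the "time" group is repeated.
        System.out.println(FieldRepetition.REQUIRED.allows(0)); // false
        System.out.println(FieldRepetition.OPTIONAL.allows(0)); // true
        System.out.println(FieldRepetition.REPEATED.allows(3)); // true
    }
}

// Hypothetical model of the three repetition levels; not the Parquet API.
enum FieldRepetition {
    REQUIRED(1, 1),                  // exactly once
    OPTIONAL(0, 1),                  // zero times or once
    REPEATED(0, Integer.MAX_VALUE);  // zero or more times

    private final int min;
    private final int max;

    FieldRepetition(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // True when a field with this repetition may appear the given number of times.
    boolean allows(int occurrences) {
        return occurrences >= min && occurrences <= max;
    }
}
```

This is why the writer calls `append` once for each required field but could call `addGroup("time")` any number of times per record, including zero.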

Maven dependency (I used version 1.7):
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.7.0</version>
</dependency>

 
