How do I export 20 TB of data?


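Background: I have a table holding one year of data, partitioned by day, and I need to export it to HDFS. So far I have tried the two recommended approaches:

  1. EXPORT
  2. SELECT INTO OUTFILE
    With both approaches, the query was eventually killed partway through because of insufficient resources.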

Cluster information:

  1. Integrated storage and compute
  2. Version: selectdb-doris-2.1.9-rc02-78f37e2d0d
  3. 3 BEs (128 cores, 128 GB memory each), 3 FEs
  4. Memory usage is normally around 50%

2 Answers

For a large table, export it partition by partition. Exporting the whole table in one job carries a high risk of failure.

EXPORT TABLE test
PARTITION (p1,p2)
TO "hdfs://HDFS8000871/path/to/export_"
PROPERTIES (
"columns" = "k1,k2"
) WITH HDFS (
"fs.defaultFS" = "hdfs://HDFS8000871",
"hadoop.username" = "hadoop",
"dfs.nameservices" = "HDFS8000871",
"dfs.ha.namenodes.HDFS8000871" = "nn1,nn2",
"dfs.namenode.rpc-address.HDFS8000871.nn1" = "ip:port",
"dfs.namenode.rpc-address.HDFS8000871.nn2" = "ip:port",
"dfs.client.failover.proxy.provider.HDFS8000871" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);

Can the partition clause take parameters? There are too many partitions, so I would like to schedule the export with crontab.
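
If the PARTITION clause has to be a literal list, one possible workaround is to let a small script build the EXPORT statement for each day and have crontab run that script. The following is only a sketch under assumptions that do not come from this thread: partitions are assumed to be named pYYYYMMDD, each day is written to its own sub-directory, and `partition_name` and `build_export_sql` are hypothetical helper names. Check the real partition names with SHOW PARTITIONS first, and add the HA namenode properties from the answer above to `HDFS_PROPS` if your cluster needs them.

```python
from datetime import date, timedelta

# Assumption: only the two basic HDFS properties are listed here; extend this
# dict with the HA settings from the answer above if your cluster needs them.
HDFS_PROPS = {
    "fs.defaultFS": "hdfs://HDFS8000871",
    "hadoop.username": "hadoop",
}


def partition_name(day: date) -> str:
    """Map a date to a partition name. The pYYYYMMDD scheme is an assumption."""
    return day.strftime("p%Y%m%d")


def build_export_sql(table: str, day: date, base_path: str,
                     hdfs_props: dict = HDFS_PROPS) -> str:
    """Build one EXPORT statement covering a single daily partition."""
    part = partition_name(day)
    props = ",\n".join(f'"{k}" = "{v}"' for k, v in hdfs_props.items())
    return (
        f"EXPORT TABLE {table}\n"
        f"PARTITION ({part})\n"
        # One sub-directory per partition, so a failed day can be redone alone.
        f'TO "{base_path}/{part}/export_"\n'
        f"WITH HDFS (\n{props}\n);"
    )


if __name__ == "__main__":
    # Print the statement for yesterday's partition.
    yesterday = date.today() - timedelta(days=1)
    print(build_export_sql("test", yesterday, "hdfs://HDFS8000871/path/to"))
```

The printed statement can then be piped into the MySQL client that you already use to connect to the FE, from a crontab entry. EXPORT runs asynchronously, so the script should only submit one job at a time and check SHOW EXPORT before submitting the next; that keeps the load on the three BEs bounded.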