doris：JSON导入数据

本文介绍如何在 Doris 中导入 JSON 格式的数据文件。Doris 支持导入标准 JSON 格式数据，通过配置相关参数，可以灵活地处理不同的 JSON 数据结构，并支持从 JSON 数据中抽取字段、处理嵌套结构等场景。

导入方式

以下导入方式支持 JSON 格式的数据导入：

Stream Load
Broker Load
Routine Load
INSERT INTO FROM S3 TVF
INSERT INTO FROM HDFS TVF

json-格式">支持的 JSON 格式

Doris 支持以下三种 JSON 格式：

以 Array 表示的多行数据

适用于批量导入多行数据，要求：

根节点必须是数组
数组中每个元素是一个对象，表示一行数据
必须设置 strip_outer_array=true

示例数据：

[
    {"id": 123, "city": "beijing"},
    {"id": 456, "city": "shanghai"}
]

// 支持嵌套结构
[
    {"id": 123, "city": {"name": "beijing", "region": "haidian"}},
    {"id": 456, "city": {"name": "beijing", "region": "chaoyang"}}
]

以 Object 表示的单行数据

适用于单行数据导入，要求：

根节点必须是对象
整个对象表示一行数据

示例数据：

{"id": 123, "city": "beijing"}

// 支持嵌套结构
{"id": 123, "city": {"name": "beijing", "region": "haidian"}}

注意

通常用于 Routine Load 导入方式，如 Kafka 中的单条消息。

以固定分隔符分隔的多行 Object 数据

适用于批量导入多行数据，要求：

每行是一个完整的 JSON 对象
必须设置 read_json_by_line=true
可通过 line_delimiter 参数指定行分隔符，默认为 \n

示例数据：

{"id": 123, "city": "beijing"}
{"id": 456, "city": "shanghai"}

参数配置

参数支持情况

下表列出了各种导入方式支持的 JSON 格式参数：

参数	默认值	Stream Load	Broker Load	Routine Load	TVF
json paths	无	jsonpaths	properties.jsonpaths	properties.jsonpaths	jsonpaths
json root	无	json_root	properties.json_root	properties.json_root	json_root
strip outer array	false	strip_outer_array	properties.strip_outer_array	properties.strip_outer_array	strip_outer_array
read json by line	false	read_json_by_line	不支持配置，都为true	不支持	read_json_by_line, 默认为true
fuzzy parse	false	fuzzy_parse	properties.fuzzy_parse	不支持	fuzzy_parse
num as string	false	num_as_string	properties.num_as_string	properties.num_as_string	num_as_string

注意

Stream Load：参数直接通过 HTTP Header 指定，如：-H "jsonpaths: $.data"
Broker Load：参数通过 PROPERTIES 指定，如：PROPERTIES("jsonpaths"="$.data")
Routine Load：参数通过 PROPERTIES 指定，如：PROPERTIES("jsonpaths"="$.data")
TVF：参数通过 TVF 语句指定，如：S3("jsonpaths"="$.data")

参数说明

json-path">JSON Path

作用：指定如何从 JSON 数据中抽取字段
类型：字符串数组
默认值：无，默认使用列名匹配

使用示例：

-- 基本用法
["$.id", "$.city"]

-- 嵌套结构
["$.id", "$.info.city", "$.data[0].name"]

json-root">JSON Root

作用：指定 JSON 数据的解析起点
类型：字符串
默认值：无，默认从根节点开始解析

使用示例：

-- 原始数据
{
  "data": {
    "id": 123,
    "city": "beijing"
  }
}

-- 设置 json_root
json_root = $.data

Strip Outer Array

作用：指定是否去除最外层的数组结构
类型：布尔值
默认值：false

使用示例：

-- 原始数据
[
  {"id": 1, "city": "beijing"},
  {"id": 2, "city": "shanghai"}
]

-- 设置 strip_outer_array=true

json-by-line">Read JSON By Line

作用：指定是否按行读取 JSON 数据
类型：布尔值
默认值：false

使用示例：

-- 原始数据（每行一个完整的 JSON 对象）
{"id": 1, "city": "beijing"}
{"id": 2, "city": "shanghai"}

-- 设置 read_json_by_line=true

Fuzzy Parse

作用：加速 JSON 数据的导入效率
类型：布尔值
默认值：false
限制：
- Array 中每行数据的字段顺序必须完全一致
- 通常与 strip_outer_array 配合使用
性能：可提升 3-5 倍导入效率

Num As String

作用：指定是否将 JSON 中的数值类型以字符串形式解析
类型：布尔值
默认值：false
使用场景：
- 处理超出数值范围的大数
- 避免数值精度损失

使用示例：

-- 原始数据
{
  "id": "12345678901234567890",
  "price": "99999999.999999"
}
-- 设置 num_as_string=true， price 字段将以字符串形式解析

json-path-和-columns-的关系">JSON Path 和 Columns 的关系

在数据导入过程中，JSON Path 和 Columns 各自承担不同的职责：

JSON Path：定义数据抽取规则

从 JSON 数据中按指定路径抽取字段
抽取的字段按 JSON Path 中定义的顺序进行重排列

Columns：定义数据映射规则

将抽取的字段映射到目标表的列
可以进行列的重排和转换

这两个参数的处理过程是串行的：首先 JSON Path 从源数据中抽取字段并形成有序的数据集，然后 Columns 将这些数据映射到表的列中。如果不指定 Columns，抽取的字段将按照表的列顺序直接映射。

使用示例

json-path">仅使用 JSON Path

表结构和数据：

-- 表结构
CREATE TABLE example_table (
    k2 int,
    k1 int
);

-- JSON 数据
{"k1": 1, "k2": 2}

导入命令：

curl -v ... -H "format: json" \
    -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load

导入结果：

+------+------+
| k1   | k2   |
+------+------+
|    2 |    1 | 
+------+------+

json-path--columns">使用 JSON Path + Columns

使用相同的表结构和数据，添加 columns 参数：

导入命令：

curl -v ... -H "format: json" \
    -H "jsonpaths: [\"$.k2\", \"$.k1\"]" \
    -H "columns: k2, k1" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load

导入结果：

+------+------+
| k1   | k2   |
+------+------+
|    1 |    2 | 
+------+------+

字段重复使用

表结构和数据：

-- 表结构
CREATE TABLE example_table (
    k2 int,
    k1 int,
    k1_copy int
);

-- JSON 数据
{"k1": 1, "k2": 2}

导入命令：

curl -v ... -H "format: json" \
    -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k1\"]" \
    -H "columns: k2, k1, k1_copy" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load

导入结果：

+------+------+---------+
| k2   | k1   | k1_copy |
+------+------+---------+
|    2 |    1 |       1 |
+------+------+---------+

嵌套字段映射

表结构和数据：

-- 表结构
CREATE TABLE example_table (
    k2 int,
    k1 int,
    k1_nested1 int,
    k1_nested2 int
);

-- JSON 数据
{
    "k1": 1,
    "k2": 2,
    "k3": {
        "k1": 31,
        "k1_nested": {
            "k1": 32
        }
    }
}

导入命令：

curl -v ... -H "format: json" \
    -H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k3.k1\", \"$.k3.k1_nested.k1\"]" \
    -H "columns: k2, k1, k1_nested1, k1_nested2" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load

导入结果：

+------+------+------------+------------+
| k2   | k1   | k1_nested1 | k1_nested2 |
+------+------+------------+------------+
|    2 |    1 |         31 |         32 |
+------+------+------------+------------+

使用示例

本节展示了不同导入方式下的 JSON 格式使用方法。

Stream Load 导入

# 使用 JSON Path
curl --location-trusted -u <user>:<passwd> \
    -H "format: json" \
    -H "jsonpaths: [\"$.id\", \"$.city\"]" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load

# 指定 JSON root
curl --location-trusted -u <user>:<passwd> \
    -H "format: json" \
    -H "json_root: $.events" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load

# 按行读取 JSON
curl --location-trusted -u <user>:<passwd> \
    -H "format: json" \
    -H "read_json_by_line: true" \
    -T example.json \
    http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load

Broker Load 导入

-- 使用 JSON Path
LOAD LABEL example_db.example_label
(
    DATA INFILE("s3://bucket/path/example.json")
    INTO TABLE example_table
    FORMAT AS "json"
    PROPERTIES
    (
        "jsonpaths" = "[\"$.id\", \"$.city\"]"
    )
)
WITH S3 
(
    ...
);

-- 指定 JSON root
LOAD LABEL example_db.example_label
(
    DATA INFILE("s3://bucket/path/example.json")
    INTO TABLE example_table
    FORMAT AS "json"
    PROPERTIES
    (
        "json_root" = "$.events"
    )
)
WITH S3 
(
    ...
);

Routine Load 导入

-- 使用 JSON Path
CREATE ROUTINE LOAD example_db.example_job ON example_table
PROPERTIES
(
    "format" = "json",
    "jsonpaths" = "[\"$.id\", \"$.city\"]"
)
FROM KAFKA
(
    ...
);

TVF 导入

-- 使用 JSON Path
INSERT INTO example_table
SELECT *
FROM S3
(
    "uri" = "s3://bucket/example.json",
    "format" = "json",
    "jsonpaths" = "[\"$.id\", \"$.city\"]",
    ...
);

-- 指定 JSON root 
INSERT INTO example_table
SELECT *
FROM S3
(
    "uri" = "s3://bucket/example.json",
    "format" = "json",
    "json_root" = "$.events",
    ...
);