Lakehouse 字符串处理指南

概述

字符串处理是日志分析、用户行为分析和数据清洗中最常见的需求之一。云器 Lakehouse 提供完整的字符串函数支持，包括截取与拼接、正则提取与替换、模糊匹配以及字符串聚合。本指南按业务场景分类，帮助你快速掌握高效的字符串处理方法。

涉及的 SQL 命令

命令/函数	用途	适用场景
`SUBSTR()` SUBSTR() / `SUBSTRING()` SUBSTRING()	按位置截取子字符串	提取域名、截断固定长度字段
`CONCAT()` CONCAT()	拼接多个字符串	构造复合键、拼接字段
`CONCAT_WS()` CONCAT_WS()	用指定分隔符拼接多个字符串	将多列拼成一个分隔符串
`LOCATE()` LOCATE() / `INSTR()` INSTR()	查找字符或子串的位置	辅助截取，定位分隔符位置
`REPLACE()` REPLACE()	字面量字符串替换	去除固定前缀/后缀
`REGEXP_EXTRACT()` REGEXP_EXTRACT()	用正则表达式提取匹配组	从 URL/日志中提取字段
`REGEXP_REPLACE()` REGEXP_REPLACE()	用正则表达式替换匹配内容	清洗特殊字符、脱敏
`SPLIT()` SPLIT()	按分隔符切割字符串为数组	将逗号分隔标签转为数组
`LATERAL VIEW EXPLODE()` LATERAL VIEW EXPLODE()	将数组列展开为多行	标签展开、多值字段打平
`LIKE` LIKE	通配符模式匹配	关键词过滤
`RLIKE` RLIKE	正则表达式模式匹配	复杂模式过滤
`GROUP_CONCAT()` GROUP_CONCAT()	聚合多行字符串（支持 ORDER BY）	将访问 ID 按顺序合并
`WM_CONCAT()` WM_CONCAT()	聚合多行字符串	简洁的多行合并
`UPPER()` UPPER() / `LOWER()` LOWER()	大小写转换	字段标准化、统一格式
`TRIM()` TRIM() / `LTRIM()` LTRIM() / `RTRIM()` RTRIM()	去除首尾空白	清洗用户输入
`LENGTH()` LENGTH()	返回字符串字符数	长度校验、过滤空值

前置准备

以下示例使用模拟的访问日志表

doc_str_logs

doc_str_logs

：

-- 创建测试表 CREATE TABLE IF NOT EXISTS doc_str_logs ( log_id INT, user_id STRING, url STRING, tags STRING, content STRING ); -- 插入测试数据 INSERT INTO doc_str_logs VALUES (1, 'u001', 'https://www.example.com/product/detail?id=123', 'tech,mobile,review', 'Hello World! This is a test.'), (2, 'u002', 'https://shop.example.com/category/shoes?color=red', 'fashion,shoes,sale', ' spaces around '), (3, 'u003', 'https://www.example.com/blog/post?tag=python', 'tech,python,tutorial', 'Python is great!'), (4, 'u001', 'https://api.example.com/v2/users?page=1', 'api,backend,rest', 'API response: {"status":"ok"}'), (5, 'u004', 'https://www.example.com/search?q=lakehouse', 'data,lakehouse,cloud', 'Lakehouse 数据平台'), (6, 'u002', 'https://cdn.example.com/images/logo.png', 'image,cdn,static', 'CDN资源文件'), (7, 'u005', 'https://www.example.com/product/detail?id=456', 'tech,mobile,new', '手机号：138-0013-8000，请联系'), (8, 'u003', 'https://shop.example.com/checkout?order=789', 'order,payment,checkout','Order #789 confirmed!');

场景一：截取与拼接

从 URL 中提取域名，或将多个字段拼接为复合键，是日志处理中最常见的字符串操作。

用 SUBSTR + LOCATE 提取域名

SUBSTR(str, pos, len)

SUBSTR(str, pos, len)

从指定位置截取指定长度的子串。

LOCATE(substr, str, start)

LOCATE(substr, str, start)

返回子串从

start

start

位置起第一次出现的位置。组合使用可以精确截取 URL 的域名部分。

https://

https://

固定占 8 个字符，因此域名从第 9 个字符开始，到第一个斜杠结束：

SELECT log_id, url, SUBSTR(url, 9, LOCATE('/', url, 9) - 9) AS domain FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	url	domain
1	https://www.example.com/product/detail?id=123	www.example.com
2	https://shop.example.com/category/shoes?color=red	shop.example.com
3	https://www.example.com/blog/post?tag=python	www.example.com
4	https://api.example.com/v2/users?page=1	api.example.com
5	https://www.example.com/search?q=lakehouse	www.example.com
6	https://cdn.example.com/images/logo.png	cdn.example.com
7	https://www.example.com/product/detail?id=456	www.example.com
8	https://shop.example.com/checkout?order=789	shop.example.com

用 CONCAT 构造复合键

CONCAT(str1, str2, ...)

CONCAT(str1, str2, ...)

将多个字符串拼接为一个。常用于构造唯一标识或标注行来源：

SELECT log_id, CONCAT(user_id, '@', REGEXP_EXTRACT(url, 'https?://([/]+)', 1)) AS user_domain_key FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	user_domain_key
1	u001@www.example.com
2	u002@shop.example.com
3	u003@www.example.com
4	u001@api.example.com
5	u004@www.example.com
6	u002@cdn.example.com
7	u005@www.example.com
8	u003@shop.example.com

用 CONCAT_WS 拼接多个字段

CONCAT_WS(sep, str1, str2, ...)

CONCAT_WS(sep, str1, str2, ...)

用统一分隔符拼接多列，比多个

CONCAT

CONCAT

更简洁，且会自动跳过

NULL

NULL

值：

SELECT log_id, url, CONCAT_WS('|', REGEXP_EXTRACT(url, 'https?://([/]+)', 1), REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1), REGEXP_EXTRACT(url, '[?](.*)', 1) ) AS url_parts FROM doc_str_logs ORDER BY log_id LIMIT 5;

执行结果（前 5 行）：

log_id	url	url_parts
1	https://www.example.com/product/detail?id=123	www.example.com\\|/product/detail\\|id=123
2	https://shop.example.com/category/shoes?color=red	shop.example.com\|/category/shoes\|color=red
3	https://www.example.com/blog/post?tag=python	www.example.com\\|/blog/post\\|tag=python
4	https://api.example.com/v2/users?page=1	api.example.com\|/v2/users\|page=1
5	https://www.example.com/search?q=lakehouse	www.example.com\\|/search\\|q=lakehouse

场景二：正则提取

当 URL 或日志格式较复杂时，

REGEXP_EXTRACT

REGEXP_EXTRACT

比

SUBSTR + LOCATE

SUBSTR + LOCATE

的组合更直观，也更易维护。

REGEXP_EXTRACT(str, pattern, group_index)

REGEXP_EXTRACT(str, pattern, group_index)

返回第

group_index

group_index

个捕获组匹配的内容（从 1 开始）；

group_index

group_index

为 0 表示返回整个匹配串。

将 URL 拆分为域名、路径和查询参数

SELECT log_id, REGEXP_EXTRACT(url, 'https?://([/?]+)', 1) AS domain, REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1) AS path, REGEXP_EXTRACT(url, '[?](.*)', 1) AS query_string FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	domain	path	query_string
1	www.example.com	/product/detail	id=123
2	shop.example.com	/category/shoes	color=red
3	www.example.com	/blog/post	tag=python
4	api.example.com	/v2/users	page=1
5	www.example.com	/search	q=lakehouse
6	cdn.example.com	/images/logo.png	（空）
7	www.example.com	/product/detail	id=456
8	shop.example.com	/checkout	order=789

说明：

[?](.*)

[?](.*)

中将

写在字符类

[?]

[?]

内，避免正则引擎将其识别为量词。无查询参数的 URL（如 log_id=6）返回空字符串。

提取 URL 中的第一个参数键与值

SELECT log_id, REGEXP_EXTRACT(url, '[?&]([=]+)=([&]+)', 1) AS param_key, REGEXP_EXTRACT(url, '[?&][=]+=([&]+)', 1) AS param_value FROM doc_str_logs WHERE url LIKE '%?%' ORDER BY log_id;

执行结果：

log_id	param_key	param_value
1	id	123
2	color	red
3	tag	python
4	page	1
5	q	lakehouse
7	id	456
8	order	789

场景三：正则替换与清洗

REGEXP_REPLACE(str, pattern, replacement)

REGEXP_REPLACE(str, pattern, replacement)

将字符串中所有匹配

pattern

pattern

的部分替换为

replacement

replacement

，常用于去噪和脱敏。

去除特殊字符

保留汉字、英文字母、数字和空格，去除标点等其他字符：

SELECT log_id, content, REGEXP_REPLACE(content, '[a-zA-Z0-9一-龥 ]', '') AS cleaned FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	content	cleaned
1	Hello World! This is a test.	Hello World This is a test
2	spaces around	spaces around
3	Python is great!	Python is great
4	API response: {"status":"ok"}	API response statusok
5	Lakehouse 数据平台	Lakehouse 数据平台
6	CDN资源文件	CDN资源文件
7	手机号：138-0013-8000，请联系	手机号13800138000请联系
8	Order #789 confirmed!	Order 789 confirmed

说明：

一-龥

一-龥

是 Unicode 中文基本汉字范围，用于在正则中保留中文字符。

脱敏手机号

将内容中的手机号（

NNN-NNNN-NNNN

NNN-NNNN-NNNN

格式）替换为脱敏占位符：

SELECT log_id, content, REGEXP_REPLACE(content, '[0-9]{3}-[0-9]{4}-[0-9]{4}', 'PHONE_HIDDEN') AS masked FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	content	masked
1	Hello World! This is a test.	Hello World! This is a test.
2	spaces around	spaces around
3	Python is great!	Python is great!
4	API response: {"status":"ok"}	API response: {"status":"ok"}
5	Lakehouse 数据平台	Lakehouse 数据平台
6	CDN资源文件	CDN资源文件
7	手机号：138-0013-8000，请联系	手机号：PHONE_HIDDEN，请联系
8	Order #789 confirmed!	Order #789 confirmed!

说明：

REGEXP_REPLACE

REGEXP_REPLACE

会替换字符串中所有匹配位置。如果内容中包含多个手机号，每一个都会被替换。

用 REPLACE 去除固定前缀

对于简单的字面量替换，

REPLACE(str, search, replacement)

REPLACE(str, search, replacement)

比正则更高效：

SELECT log_id, REPLACE(url, 'https://', '') AS url_no_scheme FROM doc_str_logs ORDER BY log_id LIMIT 4;

执行结果（前 4 行）：

log_id	url_no_scheme
1	www.example.com/product/detail?id=123
2	shop.example.com/category/shoes?color=red
3	www.example.com/blog/post?tag=python
4	api.example.com/v2/users?page=1

场景四：字符串分割转行

当一列中存储了多个值（如逗号分隔的标签），需要先用

SPLIT

SPLIT

切割为数组，再用

LATERAL VIEW EXPLODE

LATERAL VIEW EXPLODE

将数组展开为多行，才能按标签分组统计。

展开标签列

SELECT log_id, tags, tag FROM doc_str_logs LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag ORDER BY log_id, tag;

执行结果（前 9 行）：

log_id	tags	tag
1	tech,mobile,review	mobile
1	tech,mobile,review	review
1	tech,mobile,review	tech
2	fashion,shoes,sale	fashion
2	fashion,shoes,sale	sale
2	fashion,shoes,sale	shoes
3	tech,python,tutorial	python
3	tech,python,tutorial	tech
3	tech,python,tutorial	tutorial

统计标签出现频次

展开后即可对标签做聚合分析：

SELECT tag, COUNT(*) AS log_count FROM doc_str_logs LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag GROUP BY tag ORDER BY log_count DESC, tag;

执行结果：

tag	log_count
tech	3
mobile	2
api	1
backend	1
cdn	1
checkout	1
cloud	1
data	1
fashion	1
image	1
lakehouse	1
new	1
order	1
payment	1
python	1
rest	1
review	1
sale	1
shoes	1
static	1
tutorial	1

说明：

SPLIT(tags, ',')

SPLIT(tags, ',')

返回字符串数组，

LATERAL VIEW EXPLODE(...)

LATERAL VIEW EXPLODE(...)

将数组中每个元素展开为独立一行，并通过

AS tag

AS tag

给展开列命名。

如果只需要将字符串转为数组并统计元素个数，不需要展开为行，可以使用

SIZE(SPLIT(...))

SIZE(SPLIT(...))

：

SELECT log_id, tags, SPLIT(tags, ',') AS tag_array, SIZE(SPLIT(tags, ',')) AS tag_count FROM doc_str_logs ORDER BY log_id LIMIT 4;

执行结果（前 4 行）：

log_id	tags	tag_array	tag_count
1	tech,mobile,review	tech:mobile:review	3
2	fashion,shoes,sale	fashion:shoes:sale	3
3	tech,python,tutorial	tech:python:tutorial	3
4	api,backend,rest	api:backend:rest	3

说明：

SPLIT(tags, ',')

SPLIT(tags, ',')

返回数组类型，在终端显示时为冒号分隔字符串（如

tech:mobile:review

tech:mobile:review

）。如需 JSON 数组格式，可使用

TO_JSON(SPLIT(tags, ','))

TO_JSON(SPLIT(tags, ','))

。

场景五：模糊匹配与过滤

LIKE：通配符匹配

LIKE

使用

（任意多个字符）和

（单个字符）作为通配符。适合路径前缀、关键词包含等简单模式：

-- 筛选 URL 路径包含 /product/ 的记录 SELECT log_id, url FROM doc_str_logs WHERE url LIKE '%/product/%' ORDER BY log_id;

执行结果：

log_id	url
1	https://www.example.com/product/detail?id=123
7	https://www.example.com/product/detail?id=456

RLIKE：正则表达式匹配

RLIKE

RLIKE

（等价于

REGEXP

REGEXP

）支持完整的正则语法，适合复杂模式：

-- 筛选来自 www 或 shop 子域名的访问 SELECT log_id, url FROM doc_str_logs WHERE url RLIKE 'https://(www|shop)\.example\.com' ORDER BY log_id;

执行结果：

log_id	url
1	https://www.example.com/product/detail?id=123
2	https://shop.example.com/category/shoes?color=red
3	https://www.example.com/blog/post?tag=python
5	https://www.example.com/search?q=lakehouse
7	https://www.example.com/product/detail?id=456
8	https://shop.example.com/checkout?order=789

-- 筛选 content 中包含手机号格式的记录 SELECT log_id, content FROM doc_str_logs WHERE content RLIKE '[0-9]{3}-[0-9]{4}-[0-9]{4}' ORDER BY log_id;

执行结果：

log_id	content
7	手机号：138-0013-8000，请联系

场景六：字符串聚合

将多行字符串合并为一行，常用于生成报表摘要或构造分析维度。

GROUP_CONCAT：按序聚合（支持 ORDER BY）

GROUP_CONCAT(expr ORDER BY col SEPARATOR sep)

GROUP_CONCAT(expr ORDER BY col SEPARATOR sep)

在合并时支持排序和自定义分隔符：

-- 按用户聚合访问记录 ID，按 log_id 升序排列 SELECT user_id, GROUP_CONCAT(log_id ORDER BY log_id SEPARATOR ',') AS log_ids, COUNT(*) AS visit_count FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果：

user_id	log_ids	visit_count
u001	1,4	2
u002	2,6	2
u003	3,8	2
u004	5	1
u005	7	1

-- 按用户聚合访问的 tags，用分号分隔 SELECT user_id, GROUP_CONCAT(tags ORDER BY log_id SEPARATOR '; ') AS all_tags FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果：

user_id	all_tags
u001	tech,mobile,review; api,backend,rest
u002	fashion,shoes,sale; image,cdn,static
u003	tech,python,tutorial; order,payment,checkout
u004	data,lakehouse,cloud
u005	tech,mobile,new

⚠️ 注意：

DISTINCT

DISTINCT

和

ORDER BY

ORDER BY

不能在同一个

GROUP_CONCAT

GROUP_CONCAT

调用中同时使用，否则会报错。如需对结果去重，可先在子查询中去重，再聚合。

WM_CONCAT：简洁写法

WM_CONCAT(sep, expr)

WM_CONCAT(sep, expr)

是更简洁的字符串聚合写法，不支持

ORDER BY

ORDER BY

，适合对顺序无要求的场景：

SELECT user_id, WM_CONCAT(';', tags) AS all_tags FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果：

user_id	all_tags
u001	tech,mobile,review;api,backend,rest
u002	fashion,shoes,sale;image,cdn,static
u003	tech,python,tutorial;order,payment,checkout
u004	data,lakehouse,cloud
u005	tech,mobile,new

说明：

WM_CONCAT

WM_CONCAT

不保证聚合顺序。如果需要确定性的合并顺序，请使用

GROUP_CONCAT ... ORDER BY

GROUP_CONCAT ... ORDER BY

。

GROUP_CONCAT DISTINCT：聚合唯一值

对每个用户访问过的唯一域名去重后合并：

SELECT user_id, GROUP_CONCAT(DISTINCT REGEXP_EXTRACT(url, 'https?://([/]+)', 1) SEPARATOR ', ') AS domains FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果：

user_id	domains
u001	www.example.com, api.example.com
u002	shop.example.com, cdn.example.com
u003	www.example.com, shop.example.com
u004	www.example.com
u005	www.example.com

场景七：大小写与空格处理

数据入库时常有大小写不一致或首尾空格的问题，需要在分析前做标准化处理。

大小写转换

UPPER(str)

UPPER(str)

将字符串转为大写，

LOWER(str)

LOWER(str)

转为小写：

SELECT log_id, content, UPPER(content) AS upper_content FROM doc_str_logs WHERE log_id IN (1, 3, 8) ORDER BY log_id;

执行结果：

log_id	content	upper_content
1	Hello World! This is a test.	HELLO WORLD! THIS IS A TEST.
3	Python is great!	PYTHON IS GREAT!
8	Order #789 confirmed!	ORDER #789 CONFIRMED!

去除首尾空白

TRIM(str)

TRIM(str)

去除两端空格，

LTRIM

LTRIM

只去左侧，

RTRIM

RTRIM

只去右侧：

SELECT log_id, content, TRIM(content) AS trimmed, LTRIM(content) AS ltrimmed, RTRIM(content) AS rtrimmed FROM doc_str_logs WHERE log_id = 2;

执行结果：

log_id	content	trimmed	ltrimmed	rtrimmed
2	spaces around	spaces around	spaces around	spaces around

组合标准化：LOWER + TRIM + LENGTH

在分组或 JOIN 之前，同时处理大小写和空格，避免因格式差异导致匹配失败：

SELECT log_id, user_id, LOWER(TRIM(user_id)) AS normalized_uid, LENGTH(TRIM(content)) AS content_len FROM doc_str_logs ORDER BY log_id;

执行结果：

log_id	user_id	normalized_uid	content_len
1	u001	u001	28
2	u002	u002	13
3	u003	u003	16
4	u001	u001	29
5	u004	u004	14
6	u002	u002	7
7	u005	u005	21
8	u003	u003	21

说明：

LENGTH

LENGTH

返回字符数（不是字节数），中文字符和英文字符各计为 1 个字符。

清理测试数据

完成字符串处理验证后，建议清理测试表：

DROP TABLE IF EXISTS doc_str_logs;

💡 提示：Lakehouse 支持

UNDROP TABLE

UNDROP TABLE

，误删后可在保留期内恢复。

注意事项

REGEXP_EXTRACT 的
?
?
转义：正则中
```
?
```
?
是量词，若要匹配字面量
```
?
```
?
，需要写成
```
[?]
```
[?]
或
```
\?
```
\?
。直接写
```
'?(.*)'
```
'?(.*)'
会导致
```
no argument for repetition operator
```
no argument for repetition operator
报错。
REGEXP_REPLACE 不支持反向引用：Lakehouse 的
```
REGEXP_REPLACE
```
REGEXP_REPLACE
不支持
```
$1
```
$1
、
```
\1
```
\1
这样的捕获组反向引用，替换内容只能是字面量字符串。
GROUP_CONCAT 的 DISTINCT 与 ORDER BY 不兼容：同一个
```
GROUP_CONCAT
```
GROUP_CONCAT
调用中不能同时使用
```
DISTINCT
```
DISTINCT
和
```
ORDER BY
```
ORDER BY
，否则会报
```
Distinct aggregate call and aggregate call with order by can not coexist
```
Distinct aggregate call and aggregate call with order by can not coexist
错误。
LIKE vs RLIKE 的性能差异：
```
LIKE
```
LIKE
使用通配符，比
```
RLIKE
```
RLIKE
性能更好；在不需要完整正则功能时，优先使用
```
LIKE
```
LIKE
。
LENGTH 计算字符数而非字节数：
```
LENGTH('CDN资源文件')
```
LENGTH('CDN资源文件')
返回 7（3 个英文 + 4 个汉字），不是字节长度。如需字节长度，使用
```
OCTET_LENGTH
```
OCTET_LENGTH
。
SPLIT 分隔符是正则：
```
SPLIT(str, pattern)
```
SPLIT(str, pattern)
的第二个参数是正则表达式，若分隔符是
```
.
```
.
或
```
|
```
|
等正则特殊字符，需要转义：
```
SPLIT(str, '\\.')
```
SPLIT(str, '\\.')
。

Lakehouse 字符串处理指南

概述

涉及的 SQL 命令

前置准备

场景一：截取与拼接

用 SUBSTR + LOCATE 提取域名

用 CONCAT 构造复合键

用 CONCAT_WS 拼接多个字段

场景二：正则提取

将 URL 拆分为域名、路径和查询参数

提取 URL 中的第一个参数键与值

场景三：正则替换与清洗

去除特殊字符

脱敏手机号

用 REPLACE 去除固定前缀

场景四：字符串分割转行

展开标签列

统计标签出现频次

场景五：模糊匹配与过滤

LIKE：通配符匹配

RLIKE：正则表达式匹配

场景六：字符串聚合

GROUP_CONCAT：按序聚合（支持 ORDER BY）

WM_CONCAT：简洁写法

GROUP_CONCAT DISTINCT：聚合唯一值

场景七：大小写与空格处理

大小写转换

去除首尾空白

组合标准化：LOWER + TRIM + LENGTH

清理测试数据

注意事项

相关文档