Lakehouse 字符串处理指南

概述

字符串处理是日志分析、用户行为分析和数据清洗中最常见的需求之一。云器 Lakehouse 提供完整的字符串函数支持,包括截取与拼接、正则提取与替换、模糊匹配以及字符串聚合。本指南按业务场景分类,帮助你快速掌握高效的字符串处理方法。

涉及的 SQL 命令

命令/函数用途适用场景
SUBSTR()
SUBSTR()
/
SUBSTRING()
SUBSTRING()
按位置截取子字符串提取域名、截断固定长度字段
CONCAT()
CONCAT()
拼接多个字符串构造复合键、拼接字段
CONCAT_WS()
CONCAT_WS()
用指定分隔符拼接多个字符串将多列拼成一个分隔符串
LOCATE()
LOCATE()
/
INSTR()
INSTR()
查找字符或子串的位置辅助截取,定位分隔符位置
REPLACE()
REPLACE()
字面量字符串替换去除固定前缀/后缀
REGEXP_EXTRACT()
REGEXP_EXTRACT()
用正则表达式提取匹配组从 URL/日志中提取字段
REGEXP_REPLACE()
REGEXP_REPLACE()
用正则表达式替换匹配内容清洗特殊字符、脱敏
SPLIT()
SPLIT()
按分隔符切割字符串为数组将逗号分隔标签转为数组
LATERAL VIEW EXPLODE()
LATERAL VIEW EXPLODE()
将数组列展开为多行标签展开、多值字段打平
LIKE
LIKE
通配符模式匹配关键词过滤
RLIKE
RLIKE
正则表达式模式匹配复杂模式过滤
GROUP_CONCAT()
GROUP_CONCAT()
聚合多行字符串(支持 ORDER BY)将访问 ID 按顺序合并
WM_CONCAT()
WM_CONCAT()
聚合多行字符串简洁的多行合并
UPPER()
UPPER()
/
LOWER()
LOWER()
大小写转换字段标准化、统一格式
TRIM()
TRIM()
/
LTRIM()
LTRIM()
/
RTRIM()
RTRIM()
去除首尾空白清洗用户输入
LENGTH()
LENGTH()
返回字符串字符数长度校验、过滤空值

前置准备

以下示例使用模拟的访问日志表

doc_str_logs
doc_str_logs

-- 创建测试表 CREATE TABLE IF NOT EXISTS doc_str_logs ( log_id INT, user_id STRING, url STRING, tags STRING, content STRING ); -- 插入测试数据 INSERT INTO doc_str_logs VALUES (1, 'u001', 'https://www.example.com/product/detail?id=123', 'tech,mobile,review', 'Hello World! This is a test.'), (2, 'u002', 'https://shop.example.com/category/shoes?color=red', 'fashion,shoes,sale', ' spaces around '), (3, 'u003', 'https://www.example.com/blog/post?tag=python', 'tech,python,tutorial', 'Python is great!'), (4, 'u001', 'https://api.example.com/v2/users?page=1', 'api,backend,rest', 'API response: {"status":"ok"}'), (5, 'u004', 'https://www.example.com/search?q=lakehouse', 'data,lakehouse,cloud', 'Lakehouse 数据平台'), (6, 'u002', 'https://cdn.example.com/images/logo.png', 'image,cdn,static', 'CDN资源文件'), (7, 'u005', 'https://www.example.com/product/detail?id=456', 'tech,mobile,new', '手机号:138-0013-8000,请联系'), (8, 'u003', 'https://shop.example.com/checkout?order=789', 'order,payment,checkout','Order #789 confirmed!');


场景一:截取与拼接

从 URL 中提取域名,或将多个字段拼接为复合键,是日志处理中最常见的字符串操作。

用 SUBSTR + LOCATE 提取域名

SUBSTR(str, pos, len)
SUBSTR(str, pos, len)
从指定位置截取指定长度的子串。
LOCATE(substr, str, start)
LOCATE(substr, str, start)
返回子串从
start
start
位置起第一次出现的位置。组合使用可以精确截取 URL 的域名部分。

https://
https://
固定占 8 个字符,因此域名从第 9 个字符开始,到第一个斜杠结束:

SELECT log_id, url, SUBSTR(url, 9, LOCATE('/', url, 9) - 9) AS domain FROM doc_str_logs ORDER BY log_id;

执行结果

log_idurldomain
1https://www.example.com/product/detail?id=123www.example.com
2https://shop.example.com/category/shoes?color=redshop.example.com
3https://www.example.com/blog/post?tag=pythonwww.example.com
4https://api.example.com/v2/users?page=1api.example.com
5https://www.example.com/search?q=lakehousewww.example.com
6https://cdn.example.com/images/logo.pngcdn.example.com
7https://www.example.com/product/detail?id=456www.example.com
8https://shop.example.com/checkout?order=789shop.example.com

用 CONCAT 构造复合键

CONCAT(str1, str2, ...)
CONCAT(str1, str2, ...)
将多个字符串拼接为一个。常用于构造唯一标识或标注行来源:

SELECT log_id, CONCAT(user_id, '@', REGEXP_EXTRACT(url, 'https?://([/]+)', 1)) AS user_domain_key FROM doc_str_logs ORDER BY log_id;

执行结果

log_iduser_domain_key
1u001@www.example.com
2u002@shop.example.com
3u003@www.example.com
4u001@api.example.com
5u004@www.example.com
6u002@cdn.example.com
7u005@www.example.com
8u003@shop.example.com

用 CONCAT_WS 拼接多个字段

CONCAT_WS(sep, str1, str2, ...)
CONCAT_WS(sep, str1, str2, ...)
用统一分隔符拼接多列,比多个
CONCAT
CONCAT
更简洁,且会自动跳过
NULL
NULL
值:

SELECT log_id, url, CONCAT_WS('|', REGEXP_EXTRACT(url, 'https?://([/]+)', 1), REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1), REGEXP_EXTRACT(url, '[?](.*)', 1) ) AS url_parts FROM doc_str_logs ORDER BY log_id LIMIT 5;

执行结果(前 5 行)

log_idurlurl_parts
1https://www.example.com/product/detail?id=123www.example.com\|/product/detail\|id=123
2https://shop.example.com/category/shoes?color=redshop.example.com|/category/shoes|color=red
3https://www.example.com/blog/post?tag=pythonwww.example.com\|/blog/post\|tag=python
4https://api.example.com/v2/users?page=1api.example.com|/v2/users|page=1
5https://www.example.com/search?q=lakehousewww.example.com\|/search\|q=lakehouse

场景二:正则提取

当 URL 或日志格式较复杂时,

REGEXP_EXTRACT
REGEXP_EXTRACT
SUBSTR + LOCATE
SUBSTR + LOCATE
的组合更直观,也更易维护。

REGEXP_EXTRACT(str, pattern, group_index)
REGEXP_EXTRACT(str, pattern, group_index)
返回第
group_index
group_index
个捕获组匹配的内容(从 1 开始);
group_index
group_index
为 0 表示返回整个匹配串。

将 URL 拆分为域名、路径和查询参数

SELECT log_id, REGEXP_EXTRACT(url, 'https?://([/?]+)', 1) AS domain, REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1) AS path, REGEXP_EXTRACT(url, '[?](.*)', 1) AS query_string FROM doc_str_logs ORDER BY log_id;

执行结果

log_iddomainpathquery_string
1www.example.com/product/detailid=123
2shop.example.com/category/shoescolor=red
3www.example.com/blog/posttag=python
4api.example.com/v2/userspage=1
5www.example.com/searchq=lakehouse
6cdn.example.com/images/logo.png(空)
7www.example.com/product/detailid=456
8shop.example.com/checkoutorder=789

提取 URL 中的第一个参数键与值

SELECT log_id, REGEXP_EXTRACT(url, '[?&]([=]+)=([&]+)', 1) AS param_key, REGEXP_EXTRACT(url, '[?&][=]+=([&]+)', 1) AS param_value FROM doc_str_logs WHERE url LIKE '%?%' ORDER BY log_id;

执行结果

log_idparam_keyparam_value
1id123
2colorred
3tagpython
4page1
5qlakehouse
7id456
8order789

场景三:正则替换与清洗

REGEXP_REPLACE(str, pattern, replacement)
REGEXP_REPLACE(str, pattern, replacement)
将字符串中所有匹配
pattern
pattern
的部分替换为
replacement
replacement
,常用于去噪和脱敏。

去除特殊字符

保留汉字、英文字母、数字和空格,去除标点等其他字符:

SELECT log_id, content, REGEXP_REPLACE(content, '[a-zA-Z0-9一-龥 ]', '') AS cleaned FROM doc_str_logs ORDER BY log_id;

执行结果

log_idcontentcleaned
1Hello World! This is a test.Hello World This is a test
2spaces aroundspaces around
3Python is great!Python is great
4API response: {"status":"ok"}API response statusok
5Lakehouse 数据平台Lakehouse 数据平台
6CDN资源文件CDN资源文件
7手机号:138-0013-8000,请联系手机号13800138000请联系
8Order #789 confirmed!Order 789 confirmed

脱敏手机号

将内容中的手机号(

NNN-NNNN-NNNN
NNN-NNNN-NNNN
格式)替换为脱敏占位符:

SELECT log_id, content, REGEXP_REPLACE(content, '[0-9]{3}-[0-9]{4}-[0-9]{4}', 'PHONE_HIDDEN') AS masked FROM doc_str_logs ORDER BY log_id;

执行结果

log_idcontentmasked
1Hello World! This is a test.Hello World! This is a test.
2spaces aroundspaces around
3Python is great!Python is great!
4API response: {"status":"ok"}API response: {"status":"ok"}
5Lakehouse 数据平台Lakehouse 数据平台
6CDN资源文件CDN资源文件
7手机号:138-0013-8000,请联系手机号:PHONE_HIDDEN,请联系
8Order #789 confirmed!Order #789 confirmed!

用 REPLACE 去除固定前缀

对于简单的字面量替换,

REPLACE(str, search, replacement)
REPLACE(str, search, replacement)
比正则更高效:

SELECT log_id, REPLACE(url, 'https://', '') AS url_no_scheme FROM doc_str_logs ORDER BY log_id LIMIT 4;

执行结果(前 4 行)

log_idurl_no_scheme
1www.example.com/product/detail?id=123
2shop.example.com/category/shoes?color=red
3www.example.com/blog/post?tag=python
4api.example.com/v2/users?page=1

场景四:字符串分割转行

当一列中存储了多个值(如逗号分隔的标签),需要先用

SPLIT
SPLIT
切割为数组,再用
LATERAL VIEW EXPLODE
LATERAL VIEW EXPLODE
将数组展开为多行,才能按标签分组统计。

展开标签列

SELECT log_id, tags, tag FROM doc_str_logs LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag ORDER BY log_id, tag;

执行结果(前 9 行)

log_idtagstag
1tech,mobile,reviewmobile
1tech,mobile,reviewreview
1tech,mobile,reviewtech
2fashion,shoes,salefashion
2fashion,shoes,salesale
2fashion,shoes,saleshoes
3tech,python,tutorialpython
3tech,python,tutorialtech
3tech,python,tutorialtutorial

统计标签出现频次

展开后即可对标签做聚合分析:

SELECT tag, COUNT(*) AS log_count FROM doc_str_logs LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag GROUP BY tag ORDER BY log_count DESC, tag;

执行结果

taglog_count
tech3
mobile2
api1
backend1
cdn1
checkout1
cloud1
data1
fashion1
image1
lakehouse1
new1
order1
payment1
python1
rest1
review1
sale1
shoes1
static1
tutorial1

如果只需要将字符串转为数组并统计元素个数,不需要展开为行,可以使用

SIZE(SPLIT(...))
SIZE(SPLIT(...))

SELECT log_id, tags, SPLIT(tags, ',') AS tag_array, SIZE(SPLIT(tags, ',')) AS tag_count FROM doc_str_logs ORDER BY log_id LIMIT 4;

执行结果(前 4 行)

log_idtagstag_arraytag_count
1tech,mobile,reviewtech:mobile:review3
2fashion,shoes,salefashion:shoes:sale3
3tech,python,tutorialtech:python:tutorial3
4api,backend,restapi:backend:rest3

场景五:模糊匹配与过滤

LIKE:通配符匹配

LIKE
LIKE
使用
%
%
(任意多个字符)和
_
_
(单个字符)作为通配符。适合路径前缀、关键词包含等简单模式:

-- 筛选 URL 路径包含 /product/ 的记录 SELECT log_id, url FROM doc_str_logs WHERE url LIKE '%/product/%' ORDER BY log_id;

执行结果

log_idurl
1https://www.example.com/product/detail?id=123
7https://www.example.com/product/detail?id=456

RLIKE:正则表达式匹配

RLIKE
RLIKE
(等价于
REGEXP
REGEXP
)支持完整的正则语法,适合复杂模式:

-- 筛选来自 www 或 shop 子域名的访问 SELECT log_id, url FROM doc_str_logs WHERE url RLIKE 'https://(www|shop)\.example\.com' ORDER BY log_id;

执行结果

log_idurl
1https://www.example.com/product/detail?id=123
2https://shop.example.com/category/shoes?color=red
3https://www.example.com/blog/post?tag=python
5https://www.example.com/search?q=lakehouse
7https://www.example.com/product/detail?id=456
8https://shop.example.com/checkout?order=789

-- 筛选 content 中包含手机号格式的记录 SELECT log_id, content FROM doc_str_logs WHERE content RLIKE '[0-9]{3}-[0-9]{4}-[0-9]{4}' ORDER BY log_id;

执行结果

log_idcontent
7手机号:138-0013-8000,请联系

场景六:字符串聚合

将多行字符串合并为一行,常用于生成报表摘要或构造分析维度。

GROUP_CONCAT:按序聚合(支持 ORDER BY)

GROUP_CONCAT(expr ORDER BY col SEPARATOR sep)
GROUP_CONCAT(expr ORDER BY col SEPARATOR sep)
在合并时支持排序和自定义分隔符:

-- 按用户聚合访问记录 ID,按 log_id 升序排列 SELECT user_id, GROUP_CONCAT(log_id ORDER BY log_id SEPARATOR ',') AS log_ids, COUNT(*) AS visit_count FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果

user_idlog_idsvisit_count
u0011,42
u0022,62
u0033,82
u00451
u00571

-- 按用户聚合访问的 tags,用分号分隔 SELECT user_id, GROUP_CONCAT(tags ORDER BY log_id SEPARATOR '; ') AS all_tags FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果

user_idall_tags
u001tech,mobile,review; api,backend,rest
u002fashion,shoes,sale; image,cdn,static
u003tech,python,tutorial; order,payment,checkout
u004data,lakehouse,cloud
u005tech,mobile,new

WM_CONCAT:简洁写法

WM_CONCAT(sep, expr)
WM_CONCAT(sep, expr)
是更简洁的字符串聚合写法,不支持
ORDER BY
ORDER BY
,适合对顺序无要求的场景:

SELECT user_id, WM_CONCAT(';', tags) AS all_tags FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果

user_idall_tags
u001tech,mobile,review;api,backend,rest
u002fashion,shoes,sale;image,cdn,static
u003tech,python,tutorial;order,payment,checkout
u004data,lakehouse,cloud
u005tech,mobile,new

GROUP_CONCAT DISTINCT:聚合唯一值

对每个用户访问过的唯一域名去重后合并:

SELECT user_id, GROUP_CONCAT(DISTINCT REGEXP_EXTRACT(url, 'https?://([/]+)', 1) SEPARATOR ', ') AS domains FROM doc_str_logs GROUP BY user_id ORDER BY user_id;

执行结果

user_iddomains
u001www.example.com, api.example.com
u002shop.example.com, cdn.example.com
u003www.example.com, shop.example.com
u004www.example.com
u005www.example.com

场景七:大小写与空格处理

数据入库时常有大小写不一致或首尾空格的问题,需要在分析前做标准化处理。

大小写转换

UPPER(str)
UPPER(str)
将字符串转为大写,
LOWER(str)
LOWER(str)
转为小写:

SELECT log_id, content, UPPER(content) AS upper_content FROM doc_str_logs WHERE log_id IN (1, 3, 8) ORDER BY log_id;

执行结果

log_idcontentupper_content
1Hello World! This is a test.HELLO WORLD! THIS IS A TEST.
3Python is great!PYTHON IS GREAT!
8Order #789 confirmed!ORDER #789 CONFIRMED!

去除首尾空白

TRIM(str)
TRIM(str)
去除两端空格,
LTRIM
LTRIM
只去左侧,
RTRIM
RTRIM
只去右侧:

SELECT log_id, content, TRIM(content) AS trimmed, LTRIM(content) AS ltrimmed, RTRIM(content) AS rtrimmed FROM doc_str_logs WHERE log_id = 2;

执行结果

log_idcontenttrimmedltrimmedrtrimmed
2spaces aroundspaces aroundspaces aroundspaces around

组合标准化:LOWER + TRIM + LENGTH

在分组或 JOIN 之前,同时处理大小写和空格,避免因格式差异导致匹配失败:

SELECT log_id, user_id, LOWER(TRIM(user_id)) AS normalized_uid, LENGTH(TRIM(content)) AS content_len FROM doc_str_logs ORDER BY log_id;

执行结果

log_iduser_idnormalized_uidcontent_len
1u001u00128
2u002u00213
3u003u00316
4u001u00129
5u004u00414
6u002u0027
7u005u00521
8u003u00321

清理测试数据

完成字符串处理验证后,建议清理测试表:

DROP TABLE IF EXISTS doc_str_logs;


注意事项

  1. REGEXP_EXTRACT 的
    ?
    ?
    转义
    :正则中
    ?
    ?
    是量词,若要匹配字面量
    ?
    ?
    ,需要写成
    [?]
    [?]
    \?
    \?
    。直接写
    '?(.*)'
    '?(.*)'
    会导致
    no argument for repetition operator
    no argument for repetition operator
    报错。
  2. REGEXP_REPLACE 不支持反向引用:Lakehouse 的
    REGEXP_REPLACE
    REGEXP_REPLACE
    不支持
    $1
    $1
    \1
    \1
    这样的捕获组反向引用,替换内容只能是字面量字符串。
  3. GROUP_CONCAT 的 DISTINCT 与 ORDER BY 不兼容:同一个
    GROUP_CONCAT
    GROUP_CONCAT
    调用中不能同时使用
    DISTINCT
    DISTINCT
    ORDER BY
    ORDER BY
    ,否则会报
    Distinct aggregate call and aggregate call with order by can not coexist
    Distinct aggregate call and aggregate call with order by can not coexist
    错误。
  4. LIKE vs RLIKE 的性能差异
    LIKE
    LIKE
    使用通配符,比
    RLIKE
    RLIKE
    性能更好;在不需要完整正则功能时,优先使用
    LIKE
    LIKE
  5. LENGTH 计算字符数而非字节数
    LENGTH('CDN资源文件')
    LENGTH('CDN资源文件')
    返回 7(3 个英文 + 4 个汉字),不是字节长度。如需字节长度,使用
    OCTET_LENGTH
    OCTET_LENGTH
  6. SPLIT 分隔符是正则
    SPLIT(str, pattern)
    SPLIT(str, pattern)
    的第二个参数是正则表达式,若分隔符是
    .
    .
    |
    |
    等正则特殊字符,需要转义:
    SPLIT(str, '\\.')
    SPLIT(str, '\\.')

相关文档

联系我们
预约咨询
微信咨询
电话咨询