JOIN

JOIN 用于将两个或多个表的数据按指定条件合并。Lakehouse 支持以下 JOIN 类型:

left_table_reference { [ join_type ] JOIN right_table_reference join_criteria | NATURAL JOIN right_table_reference | CROSS JOIN right_table_reference } join_type ::= { [ INNER ] | LEFT [ OUTER ] | [ LEFT ] SEMI | RIGHT [ OUTER ] | FULL [ OUTER ] | [ LEFT ] ANTI | CROSS } join_criteria ::= { ON boolean_expression | USING ( column_name [, ...] ) }

left_table_reference
left_table_reference
right_table_reference
right_table_reference
是参与 JOIN 的两个表(或子查询),
join_type
join_type
指定连接类型,
join_criteria
join_criteria
指定连接条件。

JOIN 类型

类型说明
INNER JOIN返回两表中满足连接条件的行(交集),是默认 JOIN 类型
LEFT [OUTER] JOIN返回左表所有行,右表无匹配时用 NULL 填充
RIGHT [OUTER] JOIN返回右表所有行,左表无匹配时用 NULL 填充
FULL [OUTER] JOIN返回两表所有行,无匹配的一侧用 NULL 填充
[LEFT] SEMI JOIN返回左表中在右表有匹配的行,不返回右表列
[LEFT] ANTI JOIN返回左表中在右表没有匹配的行,不返回右表列
CROSS JOIN返回两表的笛卡尔积(所有行的组合)
NATURAL JOIN按两表中所有同名列做隐式等值连接,无需指定条件

JOIN 条件

  • ON boolean_expression
    ON boolean_expression
    :指定任意布尔表达式作为连接条件,JOIN 条件不支持子查询。
  • USING (column_name [, ...])
    USING (column_name [, ...])
    :指定一个或多个列名做等值连接,这些列必须同时存在于两表中。

测试数据说明

以下示例使用

doc_test
doc_test
schema 中的两张表:

  • employees(id, name, dept, salary, hire_date, is_active)
    employees(id, name, dept, salary, hire_date, is_active)
    — 5 条员工记录
  • departments(dept_id, dept_name, manager)
    departments(dept_id, dept_name, manager)
    — 3 个部门

employees.dept
employees.dept
departments.dept_name
departments.dept_name
做关联。
employees
employees
中的
Dave
Dave
所在部门
Marketing
Marketing
departments
departments
中存在,但
departments
departments
中的
HR
HR
部门在
employees
employees
里只有 1 人,而
departments
departments
里还有一个额外的
Finance
Finance
部门(
employees
employees
中无人属于该部门)——这样可以清晰展示各 JOIN 类型对不匹配行的处理方式。

使用示例

INNER JOIN

返回

employees
employees
departments
departments
dept = dept_name
dept = dept_name
均匹配的行。

SELECT e.id, e.name, e.dept, d.manager FROM doc_test.employees e INNER JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY e.id;

执行结果(仅显示两表均有匹配的员工):

+----+-------+-------------+---------+ | id | name | dept | manager | +----+-------+-------------+---------+ | 1 | Alice | Engineering | Charlie | | 2 | Bob | Engineering | Charlie | | 3 | Carol | Marketing | Diana | | 4 | Dave | Marketing | Diana | | 5 | Eve | HR | Frank | +----+-------+-------------+---------+

LEFT [OUTER] JOIN

返回

employees
employees
的所有行,
departments
departments
中无匹配时用 NULL 填充。

SELECT e.id, e.name, e.dept, d.dept_id, d.manager FROM doc_test.employees e LEFT JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY e.id;

执行结果(所有员工都保留,若

dept
dept
departments
departments
中无对应记录则
dept_id
dept_id
/
manager
manager
为 NULL):

+----+-------+-------------+---------+---------+ | id | name | dept | dept_id | manager | +----+-------+-------------+---------+---------+ | 1 | Alice | Engineering | 1 | Charlie | | 2 | Bob | Engineering | 1 | Charlie | | 3 | Carol | Marketing | 2 | Diana | | 4 | Dave | Marketing | 2 | Diana | | 5 | Eve | HR | 3 | Frank | +----+-------+-------------+---------+---------+

RIGHT [OUTER] JOIN

返回

departments
departments
的所有行,
employees
employees
中无匹配时用 NULL 填充。

SELECT e.id, e.name, d.dept_name, d.manager FROM doc_test.employees e RIGHT JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY d.dept_id, e.id;

执行结果(所有部门都保留;若某部门无员工,则

id
id
/
name
name
为 NULL):

+------+-------+-------------+---------+ | id | name | dept_name | manager | +------+-------+-------------+---------+ | 1 | Alice | Engineering | Charlie | | 2 | Bob | Engineering | Charlie | | 3 | Carol | Marketing | Diana | | 4 | Dave | Marketing | Diana | | 5 | Eve | HR | Frank | | NULL | NULL | Finance | Grace | +------+-------+-------------+---------+

FULL [OUTER] JOIN

返回两表所有行,无匹配的一侧用 NULL 填充。

SELECT e.id, e.name, e.dept, d.dept_name, d.manager FROM doc_test.employees e FULL JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY e.id, d.dept_id;

执行结果(员工侧和部门侧都完整保留,互相无匹配时对应列为 NULL):

+------+-------+-------------+-------------+---------+ | id | name | dept | dept_name | manager | +------+-------+-------------+-------------+---------+ | 1 | Alice | Engineering | Engineering | Charlie | | 2 | Bob | Engineering | Engineering | Charlie | | 3 | Carol | Marketing | Marketing | Diana | | 4 | Dave | Marketing | Marketing | Diana | | 5 | Eve | HR | HR | Frank | | NULL | NULL | NULL | Finance | Grace | +------+-------+-------------+-------------+---------+

[LEFT] SEMI JOIN

返回左表中在右表有匹配的行,不返回右表的列。

SELECT e.id, e.name, e.dept FROM doc_test.employees e SEMI JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY e.id;

执行结果(只返回员工信息,不含部门表的列):

+----+-------+-------------+ | id | name | dept | +----+-------+-------------+ | 1 | Alice | Engineering | | 2 | Bob | Engineering | | 3 | Carol | Marketing | | 4 | Dave | Marketing | | 5 | Eve | HR | +----+-------+-------------+

[LEFT] ANTI JOIN

返回左表中在右表没有匹配的行,不返回右表的列。

SELECT e.id, e.name, e.dept FROM doc_test.employees e ANTI JOIN doc_test.departments d ON e.dept = d.dept_name ORDER BY e.id;

执行结果(无

dept
dept
不存在于
departments
departments
的员工时,结果为空):

+----+------+------+ | id | name | dept | +----+------+------+ (0 rows)

employees
employees
中存在部门不在
departments
departments
中的员工,该员工会出现在结果中。

CROSS JOIN

返回两表的笛卡尔积,每行两两组合,结果行数 = 左表行数 × 右表行数。

SELECT e.name, d.dept_name FROM doc_test.employees e CROSS JOIN doc_test.departments d ORDER BY e.id, d.dept_id LIMIT 9;

执行结果(5 名员工 × 3 个部门 = 15 行,此处截取前 9 行):

+-------+-------------+ | name | dept_name | +-------+-------------+ | Alice | Engineering | | Alice | Marketing | | Alice | HR | | Bob | Engineering | | Bob | Marketing | | Bob | HR | | Carol | Engineering | | Carol | Marketing | | Carol | HR | +-------+-------------+

NATURAL JOIN

根据两表中所有同名列做隐式等值连接,无需指定 ON 或 USING 条件。

-- NATURAL JOIN:自动按同名列连接 SELECT * FROM doc_test.employees e NATURAL JOIN doc_test.departments d;

USING 语法

当两表的连接列名相同时,可以用

USING
USING
代替
ON
ON
,结果集中该列只出现一次:

-- 假设两表都有 dept_name 列,使用 USING 简化 SELECT e.name, e.salary, dept_name, d.manager FROM doc_test.employees e JOIN doc_test.departments d USING (dept_name);

注意事项

  • JOIN 条件(
    ON
    ON
    子句)不支持子查询。
  • SEMI JOIN 和 ANTI JOIN 只返回左表的列,即使 SELECT 中引用右表列也会报错。
  • FULL OUTER JOIN 在某些分布式引擎中性能开销较大,数据量大时需关注执行计划。
  • CROSS JOIN 不需要 ON 条件;如果误写
    JOIN
    JOIN
    而忘记写
    ON
    ON
    ,优化器会将其视为 CROSS JOIN。
联系我们
预约咨询
微信咨询
电话咨询