Studio Shell Task Development Guide

Studio Shell tasks run Bash scripts in a server-side Linux environment that comes with `curl`, `wget`, `awk`, `sed`, `grep`, `python3`, and other tools preinstalled.

When to choose a Shell task and when to choose a Python task:

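Scenario	Recommended
The team already has bash scripts and wants to plug them straight into the scheduling system	Shell task
Text processing with `awk`, `sed`, or pipes, where shell expresses the logic more concisely	Shell task
The job needs to call system binaries (`ffmpeg`, `convert`, etc.)	Shell task
New data-processing logic that needs complex computation or DataFrame operations	Python task
The job needs the ZettaPark DataFrame API	Python task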

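Shell tasks and Python tasks overlap heavily in capability: a Shell task can embed `python3` code, and a Python task can call shell commands through `subprocess`. The choice therefore depends mainly on the form of your existing code and on team habits, not on functional differences.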
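As a rough illustration of that overlap, the sketch below has a shell script hand a JSON string to an inline `python3` snippet and then finish the job with `awk`. The sample data and variable names are made up for the example; it assumes only `python3` and `awk` from the preinstalled tool list.

```shell
#!/bin/bash
# Hypothetical sample input; a real task would read a file or an API response.
json='[{"name": "a", "size": 3}, {"name": "b", "size": 12}]'

# python3 flattens the JSON to CSV lines; awk then sums the second column.
total=$(printf '%s' "$json" | python3 -c '
import json, sys

for item in json.load(sys.stdin):
    print(item["name"], item["size"], sep=",")
' | awk -F, '{ sum += $2 } END { print sum }')

echo "total size: $total"   # prints: total size: 15
```

A Python task could reach the same result in the opposite direction by calling the `awk` command through `subprocess.run(...)`, which is why existing code and team habits usually decide the choice.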

Runtime environment

Shell tasks run in a Linux Pod managed by Studio. Each execution starts a new Pod, which is destroyed when the execution finishes.

Item	Description
Operating system	Linux x86_64 (kernel 5.10)
Run-as user	`system_normal`
Python version	Python 3.10.0
Preinstalled CLI tools	`python3`, `curl`, `wget`, `awk`, `sed`, `grep`, `find`, `tar`, `gzip`
Preinstalled Python packages	`clickzetta`, `clickzetta_dbutils`, `pandas`, `requests`, `boto3`, `oss2`

💡 Tip: Nothing in the environment persists after the Pod is destroyed. To use additional packages, install them at the top of the script with `pip install --target /home/system_normal <pkg>`, and add `sys.path.append('/home/system_normal')` in the Python code.
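Put together, the tip might look like the following sketch. The `install_extra_pkg` helper and the `tabulate` package name are placeholders introduced here, not part of Studio; the install call is left commented out because it needs network access from the Pod.

```shell
#!/bin/bash
# Target directory from the tip above; the override is only for trying this elsewhere.
PKG_DIR="${PKG_DIR:-/home/system_normal}"

# Hypothetical helper: install one package into PKG_DIR.
install_extra_pkg() {
    pip install --quiet --target "$PKG_DIR" "$1"
}

# In a real task you would call it here, for example:
#   install_extra_pkg tabulate    # "tabulate" is only a placeholder package name

# Pass the directory as an argument so Python can add it to its search path.
on_path=$(python3 - "$PKG_DIR" << 'PYEOF'
import sys

sys.path.append(sys.argv[1])   # same effect as the sys.path.append(...) call in the tip
# import tabulate              # would resolve now if the install above had run
print(sys.argv[1] in sys.path)
PYEOF
)
echo "package directory on sys.path: $on_path"
```

Because every run starts from a fresh Pod, the install step has to run on every execution, so it is worth keeping the list of extra packages short.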

Connecting to Lakehouse: obtain a connection through `clickzetta_dbutils`, so no credentials need to be hard-coded:

    from clickzetta_dbutils import get_active_lakehouse_engine
    from sqlalchemy import text

    # No credentials appear in the script; the helper returns a ready-to-use engine
    engine = get_active_lakehouse_engine(schema="your_schema")
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))

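Scenario: scheduling existing shell scripts

Typical case: a team has a set of shell scripts that process logs or CSV files with `awk` and `sed`, and wants to plug them directly into Studio scheduling, writing the results to Lakehouse once processing finishes.

The example below mimics a common pattern: download a data file → filter and clean it with `awk` → write it to Lakehouse with `python3`. The sample endpoint returns JSON rather than CSV, so the script first flattens it to CSV with `python3`.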

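Full script

Task parameter and data download:

    #!/bin/bash
    # Studio replaces ${biz_date} with the parameter value before the script runs
    BIZ_DATE='${biz_date}'
    echo "处理日期：$BIZ_DATE"                              # logs the processing date

    # Download the sample data
    wget -q "https://jsonplaceholder.typicode.com/posts" -O /tmp/posts.json
    echo "下载完成：$(wc -c < /tmp/posts.json) bytes"       # logs the downloaded size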

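Parse the JSON with python3, then filter with awk (posts with userId <= 3):

    # Flatten each post to "id,userId,title"; commas are stripped from the title
    # so that awk and the later split(',') see exactly three fields
    python3 -c "
    import json
    posts = json.load(open('/tmp/posts.json'))
    for p in posts:
        print(f\"{p['id']},{p['userId']},{p['title'][:30].replace(',','')}\")
    " | awk -F, '$2 <= 3 {print}' > /tmp/posts_filtered.csv
    echo "过滤后行数：$(wc -l < /tmp/posts_filtered.csv)"   # logs the row count after filtering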
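To see what the `awk` stage does in isolation, the following sketch runs the same filter over a few made-up rows in the `id,userId,title` layout. The sample rows and the `kept` variable are illustrative only.

```shell
#!/bin/bash
# Made-up rows in the same id,userId,title layout as the pipeline above.
sample='1,1,first post
11,2,second post
31,4,post from user four
21,3,third post'

# -F, splits each line on commas; the pattern keeps rows whose second field is <= 3.
filtered=$(printf '%s\n' "$sample" | awk -F, '$2 <= 3 {print}')
printf '%s\n' "$filtered"

# Count the surviving rows (the row for user 4 is dropped).
kept=$(printf '%s\n' "$filtered" | awk 'END { print NR }')
echo "rows kept: $kept"   # prints: rows kept: 3
```

Since `awk -F,` splits on every comma, a comma inside a title would shift the fields, which is why the pipeline removes commas from the title before printing.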

Write the result to Lakehouse with python3:

    python3 - << PYEOF
    from clickzetta_dbutils import get_active_lakehouse_engine
    from sqlalchemy import text

    # The heredoc delimiter is unquoted, so the shell fills in the variable below
    biz_date = '$BIZ_DATE'
    engine = get_active_lakehouse_engine(schema="doc_connector_demo")

    with engine.connect() as conn:
        conn.execute(text("CREATE SCHEMA IF NOT EXISTS doc_connector_demo"))
        conn.execute(text("""
            CREATE TABLE IF NOT EXISTS doc_connector_demo.doc_shell_posts (
                post_id INT,
                user_id INT,
                title STRING,
                load_date STRING
            )
        """))
        # Delete the day's rows first so that reruns do not create duplicates
        conn.execute(text(f"DELETE FROM doc_connector_demo.doc_shell_posts WHERE load_date = '{biz_date}'"))

        rows = 0
        with open('/tmp/posts_filtered.csv') as f:
            for line in f:
                parts = line.strip().split(',', 2)
                if len(parts) == 3:
                    post_id, user_id, title = parts
                    title = title.replace("'", "''")  # escape single quotes for SQL
                    conn.execute(text(
                        f"INSERT INTO doc_connector_demo.doc_shell_posts VALUES "
                        f"({post_id}, {user_id}, '{title}', '{biz_date}')"
                    ))
                    rows += 1
        print(f"写入 {rows} 行，load_date={biz_date}")  # logs the rows written

    # Read the data back to verify the load
    with engine.connect() as conn:
        result = conn.execute(text(
            f"SELECT COUNT(*) as cnt, COUNT(DISTINCT user_id) as users "
            f"FROM doc_connector_demo.doc_shell_posts WHERE load_date = '{biz_date}'"
        ))
        row = result.fetchone()
        print(f"验证：{row[0]} 条记录，{row[1]} 个用户")  # logs record and user counts
    PYEOF

Create and run the task

Studio UI

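1. Go to Data Development → New Task, choose the Shell type, and enter a task name.
2. Paste the script above into the editor.
3. Click the parameters button on the right. The system detects `${biz_date}` automatically; set its value to `$[yyyy-MM-dd, -1d]` (yesterday's date).
4. Click the schedule button and configure the VCluster (choose the general-purpose `DEFAULT`) and a Cron expression (for example `0 1 * * *`).
5. Click Publish, then click Run and enter `biz_date=2024-12-01` in the dialog to verify.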

cz-cli (suited to CI/CD or bulk-management scenarios; see Studio Task Development and Operations)

Create the task:

    cz-cli task create shell_etl --type shell --profile <your-profile>

Upload the script and configure the parameter:

    cz-cli task save-content shell_etl --file shell_etl.sh \
      --params '{"biz_date": "$[yyyy-MM-dd, -1d]"}' \
      --profile <your-profile>

Configure scheduling:

    cz-cli task save-config shell_etl --vcluster DEFAULT --retry-count 1 --profile <your-profile>
    cz-cli task save-cron shell_etl --cron "0 1 * * *" --profile <your-profile>

Publish, then run once ad hoc to verify:

    cz-cli task online shell_etl -y --profile <your-profile>
    cz-cli task execute shell_etl --param "biz_date=2024-12-01" --profile <your-profile>

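Execution result

    处理日期：2024-12-01
    下载完成：27520 bytes
    过滤后行数：30
    写入 30 行，load_date=2024-12-01
    验证：30 条记录，3 个用户

The five log lines report, in order, the processing date, the downloaded size, the row count after filtering, the number of rows written, and the verification result (30 records, 3 users).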

Verify the written data:

    SELECT user_id, COUNT(*) AS post_count
    FROM doc_connector_demo.doc_shell_posts
    WHERE load_date = '2024-12-01'
    GROUP BY user_id
    ORDER BY user_id;

user_id	post_count
1	10
2	10
3	10

Notes

- The task parameter `${biz_date}` is replaced as plain text at the shell level. To pass it to embedded Python, reference the shell variable as `'$BIZ_DATE'`.
- `python3 - << PYEOF ... PYEOF` is the standard heredoc form for embedding Python in the script.
- Every execution gets a brand-new Pod, so files in `/tmp` do not persist between runs.
- `cz-cli` is not supported inside a Shell task; perform Lakehouse operations through `clickzetta_dbutils` + SQLAlchemy.
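The first two notes can be combined with a small safety check. The sketch below is one possible variant, not the guide's own method: it validates the date format before the value can reach any SQL string, and it uses a quoted heredoc delimiter (`'PYEOF'`) so the shell leaves the Python source untouched, passing the value through an environment variable instead. The fixed date stands in for `'${biz_date}'` so that the snippet runs outside Studio.

```shell
#!/bin/bash
# In a real task this line would be: BIZ_DATE='${biz_date}'
BIZ_DATE='2024-12-01'

# Accept only YYYY-MM-DD so nothing unexpected is interpolated into SQL later.
case "$BIZ_DATE" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "unexpected biz_date: $BIZ_DATE" >&2; exit 1 ;;
esac

# Quoting the delimiter disables shell expansion inside the heredoc,
# so Python reads the value from its environment instead.
result=$(BIZ_DATE="$BIZ_DATE" python3 - << 'PYEOF'
import os

biz_date = os.environ["BIZ_DATE"]
print(f"load_date={biz_date}")
PYEOF
)
echo "$result"   # prints: load_date=2024-12-01
```

With an unquoted delimiter, as in the full script, any `$` or backtick in the Python source is interpreted by the shell first. The quoted form avoids that at the cost of passing each value explicitly.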
