Overview

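TPC-DS (Transaction Processing Performance Council Decision Support) is a benchmark published by the Transaction Processing Performance Council (TPC) to evaluate the performance of decision support systems (DSS). Whereas TPC-H is better suited to evaluating traditional query and reporting performance, TPC-DS covers more complex workloads such as analytical reporting, interactive queries, and data mining, and is closer to real-world data warehouse analytics.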

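This report presents the TPC-DS results for 云器Lakehouse and Spark SQL at the 10 TB scale. The conclusions are as follows:

  • On the TPC-DS 10 TB dataset, 云器Lakehouse showed a significant performance advantage over Spark: Spark's total execution time was 9.51 times that of 云器Lakehouse (17,779.636 s versus 1,869.187 s).
  • 云器Lakehouse delivers a clear speedup on the queries that run longest in Spark.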

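Test environment

  • Spark test environment
Item | Configuration
Servers | Hadoop cluster. Master node: 1 Alibaba Cloud ECS server (ecs.g8i.xlarge, 4 vCPU, 16 GiB). Core nodes: 4 Alibaba Cloud ECS servers (ecs.g7.8xlarge, 32 vCPU, 128 GiB), each with 4 × 300 GiB ESSD cloud disks
Network bandwidth | 16 Gbps
Software | Spark 3.4.2
Storage service | Alibaba Cloud OSS object storage
Data format | Parquet (Snappy compression)

  • 云器Lakehouse test environment
Item | Configuration
Compute resources | Virtual Cluster: XLarge (equivalent to 128 vCores)
Software | 云器Lakehouse service in the Alibaba Cloud Shanghai region
Storage service | Managed storage; Alibaba Cloud OSS object storage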

Test data

Table | Rows
call_center | 54
catalog_page | 40,000
catalog_returns | 1,440,033,112
catalog_sales | 14,399,964,710
customer | 65,000,000
customer_address | 32,500,000
customer_demographics | 1,920,800
date_dim | 73,049
household_demographics | 7,200
income_band | 20
inventory | 1,311,525,000
item | 402,000
promotion | 2,000
reason | 70
ship_mode | 20
store | 1,500
store_returns | 2,880,015,149
store_sales | 28,799,944,153
time_dim | 86,400
warehouse | 25
web_page | 4,002
web_returns | 720,020,485
web_sales | 7,199,963,324
web_site | 78

  • Statistics were collected for all tables with the ANALYZE command.

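Test procedure

We ran the 103 complex SQL queries of the TPC-DS benchmark (the 99 query templates, with queries 14, 23, 24, and 39 each run as a and b variants) against the 10 TB dataset. The results include each query's execution time on 云器Lakehouse and on Spark SQL, and a comparison of the two.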

Spark SQL

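The TPC-DS tables were created in the metadata service using the Parquet file format, with the same partitioning as in Lakehouse.

To ensure that both systems used identical test data, the TPC-DS 10 TB data was exported from 云器Lakehouse as data files to the object storage service. Spark then read the data files and wrote them into its own tables with INSERT INTO statements.

  • The following parameter was set in Spark when running the 103 TPC-DS queries: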

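-- One of the parameters that must be tuned for large Spark jobs in production (Spark default: 200).
-- At the TPC-DS 10 TB scale, the default of 200 shuffle partitions is too low: each task holds too much data in memory and shuffle spill is easily triggered, which slows Spark down.
-- In our tests, raising the value to 2000 greatly reduced spill, so 2000 was used for this benchmark.
set spark.sql.shuffle.partitions = 2000;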
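To see why the partition count matters, the back-of-the-envelope sketch below divides one shuffle stage's data evenly across partitions. The 1 TiB shuffle volume and the helper name `per_partition_gib` are made up for illustration; they are not measurements or settings from this test, and real shuffles are rarely perfectly even.

```python
def per_partition_gib(shuffle_gib: float, partitions: int) -> float:
    """Average shuffle data per partition, assuming a perfectly even split."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return shuffle_gib / partitions


# Hypothetical shuffle volume for a single stage (an assumption, not measured).
SHUFFLE_GIB = 1024.0

for partitions in (200, 2000):
    size = per_partition_gib(SHUFFLE_GIB, partitions)
    print(f"{partitions} partitions -> {size:.3f} GiB per task")
```

With ten times as many partitions, each task handles roughly a tenth of the data, which is consistent with the reduced spill reported at 2000.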

云器Lakehouse

Create the cluster and tables

The test used a 云器Lakehouse XLARGE Virtual Cluster on Alibaba Cloud OSS. All tables use the default storage format.

create vcluster if not exists XLARGE_CLUSTER
    vcluster_size='XLARGE'
    vcluster_type='Analytics'
    AUTO_RESUME=TRUE
    AUTO_SUSPEND_IN_SECOND=300
    min_replicas=1
    max_replicas=1;

Table creation statements

-- Drop any existing TPC-DS tables.
drop table if exists call_center;
drop table if exists catalog_page;
drop table if exists catalog_returns;
drop table if exists catalog_sales;
drop table if exists customer;
drop table if exists customer_address;
drop table if exists customer_demographics;
drop table if exists date_dim;
drop table if exists household_demographics;
drop table if exists income_band;
drop table if exists inventory;
drop table if exists item;
drop table if exists promotion;
drop table if exists reason;
drop table if exists ship_mode;
drop table if exists store;
drop table if exists store_returns;
drop table if exists store_sales;
drop table if exists time_dim;
drop table if exists warehouse;
drop table if exists web_page;
drop table if exists web_returns;
drop table if exists web_sales;
drop table if exists web_site;

-- Fact tables, each partitioned by its date surrogate key.
create table if not exists catalog_sales (
    cs_sold_date_sk int,
    cs_sold_time_sk int,
    cs_ship_date_sk int,
    cs_bill_customer_sk int,
    cs_bill_cdemo_sk int,
    cs_bill_hdemo_sk int,
    cs_bill_addr_sk int,
    cs_ship_customer_sk int,
    cs_ship_cdemo_sk int,
    cs_ship_hdemo_sk int,
    cs_ship_addr_sk int,
    cs_call_center_sk int,
    cs_catalog_page_sk int,
    cs_ship_mode_sk int,
    cs_warehouse_sk int,
    cs_item_sk int,
    cs_promo_sk int,
    cs_order_number long,
    cs_quantity int,
    cs_wholesale_cost decimal(7,2),
    cs_list_price decimal(7,2),
    cs_sales_price decimal(7,2),
    cs_ext_discount_amt decimal(7,2),
    cs_ext_sales_price decimal(7,2),
    cs_ext_wholesale_cost decimal(7,2),
    cs_ext_list_price decimal(7,2),
    cs_ext_tax decimal(7,2),
    cs_coupon_amt decimal(7,2),
    cs_ext_ship_cost decimal(7,2),
    cs_net_paid decimal(7,2),
    cs_net_paid_inc_tax decimal(7,2),
    cs_net_paid_inc_ship decimal(7,2),
    cs_net_paid_inc_ship_tax decimal(7,2),
    cs_net_profit decimal(7,2)
) partitioned by (cs_sold_date_sk);

create table if not exists catalog_returns (
    cr_returned_date_sk int,
    cr_returned_time_sk int,
    cr_item_sk int,
    cr_refunded_customer_sk int,
    cr_refunded_cdemo_sk int,
    cr_refunded_hdemo_sk int,
    cr_refunded_addr_sk int,
    cr_returning_customer_sk int,
    cr_returning_cdemo_sk int,
    cr_returning_hdemo_sk int,
    cr_returning_addr_sk int,
    cr_call_center_sk int,
    cr_catalog_page_sk int,
    cr_ship_mode_sk int,
    cr_warehouse_sk int,
    cr_reason_sk int,
    cr_order_number long,
    cr_return_quantity int,
    cr_return_amount decimal(7,2),
    cr_return_tax decimal(7,2),
    cr_return_amt_inc_tax decimal(7,2),
    cr_fee decimal(7,2),
    cr_return_ship_cost decimal(7,2),
    cr_refunded_cash decimal(7,2),
    cr_reversed_charge decimal(7,2),
    cr_store_credit decimal(7,2),
    cr_net_loss decimal(7,2)
) partitioned by (cr_returned_date_sk);

create table if not exists inventory (
    inv_date_sk int,
    inv_item_sk int,
    inv_warehouse_sk int,
    inv_quantity_on_hand int
) partitioned by (inv_date_sk);

create table if not exists store_sales (
    ss_sold_date_sk int,
    ss_sold_time_sk int,
    ss_item_sk int,
    ss_customer_sk int,
    ss_cdemo_sk int,
    ss_hdemo_sk int,
    ss_addr_sk int,
    ss_store_sk int,
    ss_promo_sk int,
    ss_ticket_number long,
    ss_quantity int,
    ss_wholesale_cost decimal(7,2),
    ss_list_price decimal(7,2),
    ss_sales_price decimal(7,2),
    ss_ext_discount_amt decimal(7,2),
    ss_ext_sales_price decimal(7,2),
    ss_ext_wholesale_cost decimal(7,2),
    ss_ext_list_price decimal(7,2),
    ss_ext_tax decimal(7,2),
    ss_coupon_amt decimal(7,2),
    ss_net_paid decimal(7,2),
    ss_net_paid_inc_tax decimal(7,2),
    ss_net_profit decimal(7,2)
) partitioned by (ss_sold_date_sk);

create table if not exists store_returns (
    sr_returned_date_sk int,
    sr_return_time_sk int,
    sr_item_sk int,
    sr_customer_sk int,
    sr_cdemo_sk int,
    sr_hdemo_sk int,
    sr_addr_sk int,
    sr_store_sk int,
    sr_reason_sk int,
    sr_ticket_number long,
    sr_return_quantity int,
    sr_return_amt decimal(7,2),
    sr_return_tax decimal(7,2),
    sr_return_amt_inc_tax decimal(7,2),
    sr_fee decimal(7,2),
    sr_return_ship_cost decimal(7,2),
    sr_refunded_cash decimal(7,2),
    sr_reversed_charge decimal(7,2),
    sr_store_credit decimal(7,2),
    sr_net_loss decimal(7,2)
) partitioned by (sr_returned_date_sk);

create table if not exists web_sales (
    ws_sold_date_sk int,
    ws_sold_time_sk int,
    ws_ship_date_sk int,
    ws_item_sk int,
    ws_bill_customer_sk int,
    ws_bill_cdemo_sk int,
    ws_bill_hdemo_sk int,
    ws_bill_addr_sk int,
    ws_ship_customer_sk int,
    ws_ship_cdemo_sk int,
    ws_ship_hdemo_sk int,
    ws_ship_addr_sk int,
    ws_web_page_sk int,
    ws_web_site_sk int,
    ws_ship_mode_sk int,
    ws_warehouse_sk int,
    ws_promo_sk int,
    ws_order_number long,
    ws_quantity int,
    ws_wholesale_cost decimal(7,2),
    ws_list_price decimal(7,2),
    ws_sales_price decimal(7,2),
    ws_ext_discount_amt decimal(7,2),
    ws_ext_sales_price decimal(7,2),
    ws_ext_wholesale_cost decimal(7,2),
    ws_ext_list_price decimal(7,2),
    ws_ext_tax decimal(7,2),
    ws_coupon_amt decimal(7,2),
    ws_ext_ship_cost decimal(7,2),
    ws_net_paid decimal(7,2),
    ws_net_paid_inc_tax decimal(7,2),
    ws_net_paid_inc_ship decimal(7,2),
    ws_net_paid_inc_ship_tax decimal(7,2),
    ws_net_profit decimal(7,2)
) partitioned by (ws_sold_date_sk);

create table if not exists web_returns (
    wr_returned_date_sk int,
    wr_returned_time_sk int,
    wr_item_sk int,
    wr_refunded_customer_sk int,
    wr_refunded_cdemo_sk int,
    wr_refunded_hdemo_sk int,
    wr_refunded_addr_sk int,
    wr_returning_customer_sk int,
    wr_returning_cdemo_sk int,
    wr_returning_hdemo_sk int,
    wr_returning_addr_sk int,
    wr_web_page_sk int,
    wr_reason_sk int,
    wr_order_number long,
    wr_return_quantity int,
    wr_return_amt decimal(7,2),
    wr_return_tax decimal(7,2),
    wr_return_amt_inc_tax decimal(7,2),
    wr_fee decimal(7,2),
    wr_return_ship_cost decimal(7,2),
    wr_refunded_cash decimal(7,2),
    wr_reversed_charge decimal(7,2),
    wr_account_credit decimal(7,2),
    wr_net_loss decimal(7,2)
) partitioned by (wr_returned_date_sk);

-- Dimension tables (not partitioned).
create table if not exists call_center (
    cc_call_center_sk int,
    cc_call_center_id string,
    cc_rec_start_date date,
    cc_rec_end_date date,
    cc_closed_date_sk int,
    cc_open_date_sk int,
    cc_name string,
    cc_class string,
    cc_employees int,
    cc_sq_ft int,
    cc_hours string,
    cc_manager string,
    cc_mkt_id int,
    cc_mkt_class string,
    cc_mkt_desc string,
    cc_market_manager string,
    cc_division int,
    cc_division_name string,
    cc_company int,
    cc_company_name string,
    cc_street_number string,
    cc_street_name string,
    cc_street_type string,
    cc_suite_number string,
    cc_city string,
    cc_county string,
    cc_state string,
    cc_zip string,
    cc_country string,
    cc_gmt_offset decimal(5,2),
    cc_tax_percentage decimal(5,2)
);

create table if not exists catalog_page (
    cp_catalog_page_sk int,
    cp_catalog_page_id string,
    cp_start_date_sk int,
    cp_end_date_sk int,
    cp_department string,
    cp_catalog_number int,
    cp_catalog_page_number int,
    cp_description string,
    cp_type string
);

create table if not exists customer (
    c_customer_sk int,
    c_customer_id string,
    c_current_cdemo_sk int,
    c_current_hdemo_sk int,
    c_current_addr_sk int,
    c_first_shipto_date_sk int,
    c_first_sales_date_sk int,
    c_salutation string,
    c_first_name string,
    c_last_name string,
    c_preferred_cust_flag string,
    c_birth_day int,
    c_birth_month int,
    c_birth_year int,
    c_birth_country string,
    c_login string,
    c_email_address string,
    c_last_review_date string
);

create table if not exists customer_address (
    ca_address_sk int,
    ca_address_id string,
    ca_street_number string,
    ca_street_name string,
    ca_street_type string,
    ca_suite_number string,
    ca_city string,
    ca_county string,
    ca_state string,
    ca_zip string,
    ca_country string,
    ca_gmt_offset decimal(5,2),
    ca_location_type string
);

create table if not exists customer_demographics (
    cd_demo_sk int,
    cd_gender string,
    cd_marital_status string,
    cd_education_status string,
    cd_purchase_estimate int,
    cd_credit_rating string,
    cd_dep_count int,
    cd_dep_employed_count int,
    cd_dep_college_count int
);

create table if not exists date_dim (
    d_date_sk int,
    d_date_id string,
    d_date date,
    d_month_seq int,
    d_week_seq int,
    d_quarter_seq int,
    d_year int,
    d_dow int,
    d_moy int,
    d_dom int,
    d_qoy int,
    d_fy_year int,
    d_fy_quarter_seq int,
    d_fy_week_seq int,
    d_day_name string,
    d_quarter_name string,
    d_holiday string,
    d_weekend string,
    d_following_holiday string,
    d_first_dom int,
    d_last_dom int,
    d_same_day_ly int,
    d_same_day_lq int,
    d_current_day string,
    d_current_week string,
    d_current_month string,
    d_current_quarter string,
    d_current_year string
);

create table if not exists household_demographics (
    hd_demo_sk int,
    hd_income_band_sk int,
    hd_buy_potential string,
    hd_dep_count int,
    hd_vehicle_count int
);

create table if not exists income_band (
    ib_income_band_sk int,
    ib_lower_bound int,
    ib_upper_bound int
) using parquet;

create table if not exists item (
    i_item_sk int,
    i_item_id string,
    i_rec_start_date date,
    i_rec_end_date date,
    i_item_desc string,
    i_current_price decimal(7,2),
    i_wholesale_cost decimal(7,2),
    i_brand_id int,
    i_brand string,
    i_class_id int,
    i_class string,
    i_category_id int,
    i_category string,
    i_manufact_id int,
    i_manufact string,
    i_size string,
    i_formulation string,
    i_color string,
    i_units string,
    i_container string,
    i_manager_id int,
    i_product_name string
);

create table if not exists promotion (
    p_promo_sk int,
    p_promo_id string,
    p_start_date_sk int,
    p_end_date_sk int,
    p_item_sk int,
    p_cost decimal(15,2),
    p_response_target int,
    p_promo_name string,
    p_channel_dmail string,
    p_channel_email string,
    p_channel_catalog string,
    p_channel_tv string,
    p_channel_radio string,
    p_channel_press string,
    p_channel_event string,
    p_channel_demo string,
    p_channel_details string,
    p_purpose string,
    p_discount_active string
);

create table if not exists reason (
    r_reason_sk int,
    r_reason_id string,
    r_reason_desc string
);

create table if not exists ship_mode (
    sm_ship_mode_sk int,
    sm_ship_mode_id string,
    sm_type string,
    sm_code string,
    sm_carrier string,
    sm_contract string
);

create table if not exists store (
    s_store_sk int,
    s_store_id string,
    s_rec_start_date date,
    s_rec_end_date date,
    s_closed_date_sk int,
    s_store_name string,
    s_number_employees int,
    s_floor_space int,
    s_hours string,
    s_manager string,
    s_market_id int,
    s_geography_class string,
    s_market_desc string,
    s_market_manager string,
    s_division_id int,
    s_division_name string,
    s_company_id int,
    s_company_name string,
    s_street_number string,
    s_street_name string,
    s_street_type string,
    s_suite_number string,
    s_city string,
    s_county string,
    s_state string,
    s_zip string,
    s_country string,
    s_gmt_offset decimal(5,2),
    s_tax_precentage decimal(5,2)
);

create table if not exists time_dim (
    t_time_sk int,
    t_time_id string,
    t_time int,
    t_hour int,
    t_minute int,
    t_second int,
    t_am_pm string,
    t_shift string,
    t_sub_shift string,
    t_meal_time string
);

create table if not exists warehouse (
    w_warehouse_sk int,
    w_warehouse_id string,
    w_warehouse_name string,
    w_warehouse_sq_ft int,
    w_street_number string,
    w_street_name string,
    w_street_type string,
    w_suite_number string,
    w_city string,
    w_county string,
    w_state string,
    w_zip string,
    w_country string,
    w_gmt_offset decimal(5,2)
);

create table if not exists web_page (
    wp_web_page_sk int,
    wp_web_page_id string,
    wp_rec_start_date date,
    wp_rec_end_date date,
    wp_creation_date_sk int,
    wp_access_date_sk int,
    wp_autogen_flag string,
    wp_customer_sk int,
    wp_url string,
    wp_type string,
    wp_char_count int,
    wp_link_count int,
    wp_image_count int,
    wp_max_ad_count int
);

create table if not exists web_site (
    web_site_sk int,
    web_site_id string,
    web_rec_start_date date,
    web_rec_end_date date,
    web_name string,
    web_open_date_sk int,
    web_close_date_sk int,
    web_class string,
    web_manager string,
    web_mkt_id int,
    web_mkt_class string,
    web_mkt_desc string,
    web_market_manager string,
    web_company_id int,
    web_company_name string,
    web_street_number string,
    web_street_name string,
    web_street_type string,
    web_suite_number string,
    web_city string,
    web_county string,
    web_state string,
    web_zip string,
    web_country string,
    web_gmt_offset decimal(5,2),
    web_tax_percentage decimal(5,2)
);

-- Collect column-level statistics for every table.
analyze table call_center compute statistics for all columns;
analyze table catalog_page compute statistics for all columns;
analyze table catalog_returns compute statistics for all columns;
analyze table catalog_sales compute statistics for all columns;
analyze table customer compute statistics for all columns;
analyze table customer_address compute statistics for all columns;
analyze table customer_demographics compute statistics for all columns;
analyze table date_dim compute statistics for all columns;
analyze table household_demographics compute statistics for all columns;
analyze table income_band compute statistics for all columns;
analyze table inventory compute statistics for all columns;
analyze table item compute statistics for all columns;
analyze table promotion compute statistics for all columns;
analyze table reason compute statistics for all columns;
analyze table ship_mode compute statistics for all columns;
analyze table store compute statistics for all columns;
analyze table store_returns compute statistics for all columns;
analyze table store_sales compute statistics for all columns;
analyze table time_dim compute statistics for all columns;
analyze table warehouse compute statistics for all columns;
analyze table web_page compute statistics for all columns;
analyze table web_returns compute statistics for all columns;
analyze table web_sales compute statistics for all columns;
analyze table web_site compute statistics for all columns;

Run the queries

The 103 TPC-DS test queries: TPC-DS-Query-SQL

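Test results

The table below lists the execution time of each of the 103 queries on 云器Lakehouse and Spark SQL, in seconds (s); lower is better. The last column is the Spark SQL time divided by the 云器Lakehouse time, so a value above 1 means 云器Lakehouse was faster.

  • All timings are taken from the first execution of each query.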
Query | 云器Lakehouse | Spark SQL | Spark vs. Lakehouse
query1 | 4.443 | 19.862 | 4.470402881
query2 | 36.636 | 150.416 | 4.105688394
query3 | 11.734 | 23.39 | 1.99335265
query4 | 92.902 | 642.398 | 6.914791931
query5 | 14.756 | 163.489 | 11.07949309
query6 | 1.892 | 6.562 | 3.468287526
query7 | 22.481 | 58.778 | 2.614563409
query8 | 4.55 | 16.04 | 3.525274725
query9 | 44.262 | 643.991 | 14.54952329
query10 | 1.999 | 50.347 | 25.18609305
query11 | 38.772 | 238.735 | 6.157407407
query12 | 1.253 | 5.334 | 4.25698324
query13 | 11.418 | 67.102 | 5.876861097
query14a | 88.878 | 490.051 | 5.513749184
query14b | 68.34 | 477.127 | 6.981665203
query15 | 2.96 | 11.923 | 4.028040541
query16 | 5.515 | 288.996 | 52.40181324
query17 | 8.452 | 66.575 | 7.876833885
query18 | 6.262 | 55.001 | 8.783296072
query19 | 2.704 | 10.693 | 3.954511834
query20 | 1.394 | 5.021 | 3.601865136
query21 | 0.68 | 3.179 | 4.675
query22 | 11.084 | 9.145 | 0.825063154
query23a | 98.722 | 1393.112 | 14.11146452
query23b | 95.845 | 1831.948 | 19.11365225
query24a | 34.641 | 925.881 | 26.72789469
query24b | 30.553 | 943.611 | 30.8843976
query25 | 19.821 | 56.483 | 2.849654407
query26 | 2.931 | 33.09 | 11.28966223
query27 | 6.185 | 54.93 | 8.881164107
query28 | 30.606 | 802.205 | 26.21071032
query29 | 13.697 | 186.704 | 13.63101409
query30 | 3.153 | 19.232 | 6.099587694
query31 | 5.057 | 56.973 | 11.26616571
query32 | 1.577 | 7.139 | 4.526949905
query33 | 2.874 | 11.521 | 4.008698678
query34 | 4.012 | 27.371 | 6.822283151
query35 | 6.341 | 79.325 | 12.50985649
query36 | 12.823 | 72.549 | 5.657724401
query37 | 3.868 | 105.236 | 27.20682523
query38 | 15.221 | 152.692 | 10.03166678
query39a | 1.321 | 10.08 | 7.630582892
query39b | 0.967 | 8.013 | 8.286452947
query40 | 6.619 | 45.539 | 6.880042302
query41 | 0.117 | 1.111 | 9.495726496
query42 | 1.329 | 6.668 | 5.017306245
query43 | 2.993 | 21.24 | 7.096558637
query44 | 13.759 | 15.644 | 1.137001236
query45 | 1.987 | 12.449 | 6.265223956
query46 | 5.432 | 43.515 | 8.010861561
query47 | 21.831 | 133.503 | 6.115294764
query48 | 5.025 | 51.675 | 10.28358209
query49 | 13.327 | 74.877 | 5.618443761
query50 | 28.519 | 789.925 | 27.6982012
query51 | 17.596 | 64.181 | 3.647476699
query52 | 1.685 | 8.68 | 5.151335312
query53 | 2.149 | 32.213 | 14.98976268
query54 | 11.418 | 20.148 | 1.764582239
query55 | 0.485 | 7.111 | 14.66185567
query56 | 1.81 | 12.856 | 7.102762431
query57 | 13.724 | 74.714 | 5.444039639
query58 | 1.425 | 7.478 | 5.247719298
query59 | 16.064 | 158.025 | 9.837213645
query60 | 2.959 | 21.837 | 7.37985806
query61 | 4.576 | 15.982 | 3.49256993
query62 | 5.258 | 33.378 | 6.34804108
query63 | 3.055 | 29.13 | 9.535188216
query64 | 32.014 | 663.722 | 20.73224214
query65 | 33.916 | 185.219 | 5.461109801
query66 | 7.19 | 48.583 | 6.757023644
query67 | 186.6 | 451.875 | 2.421623794
query68 | 2.945 | 17.482 | 5.936162988
query69 | 2.083 | 22.042 | 10.5818531
query70 | 22.752 | 59.801 | 2.628384318
query71 | 3.034 | 21.498 | 7.085695452
query72 | 11.45 | 212.094 | 18.52349345
query73 | 1.146 | 9.69 | 8.455497382
query74 | 27.236 | 224.447 | 8.240820972
query75 | 46.678 | 385.93 | 8.267920648
query76 | 17.695 | 321.027 | 18.14224357
query77 | 1.804 | 13.863 | 7.6845898
query78 | 181.223 | 669.242 | 3.692919773
query79 | 4.042 | 28.374 | 7.019792182
query80 | 11.991 | 163.789 | 13.65932783
query81 | 3.2 | 26.454 | 8.266875
query82 | 4.572 | 208.444 | 45.59142607
query83 | 0.795 | 6.253 | 7.865408805
query84 | 4.277 | 23.8 | 5.564648118
query85 | 6.826 | 46.928 | 6.874890126
query86 | 7.527 | 32.703 | 4.344758868
query87 | 15.751 | 159.444 | 10.12278585
query88 | 52.074 | 801.005 | 15.38205246
query89 | 3.389 | 34.134 | 10.07199764
query90 | 4.48 | 66.709 | 14.89040179
query91 | 0.527 | 6.78 | 12.86527514
query92 | 1.621 | 5.989 | 3.694632943
query93 | 0.03 | 0.866 | 28.86666667
query94 | 10.33 | 164.88 | 15.96127783
query95 | 49.464 | 381.528 | 7.713245997
query96 | 11.947 | 99.414 | 8.321252197
query97 | 30.497 | 178.751 | 5.861265042
query98 | 2.005 | 9.795 | 4.885286783
query99 | 9.352 | 62.952 | 6.731394354
sum | 1869.187 | 17779.636 | 9.511962153
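The ratio column can be reproduced from the two timing columns. The sketch below uses a hypothetical helper, `speedup`, and a few rows copied from the results table to recompute the per-query ratio and the overall 9.51x figure. It assumes the ratio is simply Spark SQL time divided by 云器Lakehouse time, which matches the published values.

```python
# Timings in seconds copied from the results table: (Lakehouse, Spark SQL).
# Only a handful of rows are shown; the full table has 103 queries.
timings = {
    "query1": (4.443, 19.862),
    "query16": (5.515, 288.996),
    "query22": (11.084, 9.145),
    "sum": (1869.187, 17779.636),
}


def speedup(lakehouse_s: float, spark_s: float) -> float:
    """Return Spark time divided by Lakehouse time (hypothetical helper)."""
    if lakehouse_s <= 0:
        raise ValueError("Lakehouse time must be positive")
    return spark_s / lakehouse_s


ratios = {name: speedup(lh, sp) for name, (lh, sp) in timings.items()}

for name, ratio in ratios.items():
    # A ratio below 1 means Spark SQL was faster on that query.
    print(f"{name}: {ratio:.2f}x")
```

A ratio below 1, as for query22, means Spark SQL finished that query faster. Because the headline figure is a ratio of total times, it is weighted toward the longest-running queries rather than being an average of the per-query ratios.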
