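GaussDB(DWS) Performance Tuning with Every Tool in the Kit: Using Plan Hints

Preface

When writing SQL, database users do their best, based on what they know, to produce statements that perform well. But when a large number of statements must be written, and some of them have extremely complex logic, it becomes hard to write high-performance SQL by hand.

Every database has a query optimizer module that plays a role similar to a human brain. It receives the query tree from the parser, applies logically equivalent transformations to that tree, evaluates the candidate physical execution paths, and passes the best path it selects to the executor. The query optimizer is one of the most important means of improving query efficiency.

For a classification of database query optimizers, see the related blog post.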

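Introducing plan hints

The optimizer generates plans from statistics and an estimation model. When an estimate is off, the resulting plan can be poor, and the statement can run extremely slowly.

Normally, the optimization process is transparent to database users. The previous post described how GaussDB(DWS) lets you intervene globally in path generation by setting GUC parameters. This post introduces another feature for manually intervening in plan generation: plan hints. A hint is a directive passed to the optimizer through a comment in the SQL statement, and the optimizer uses hints to choose an execution plan for that statement. In test or development environments, hints are very useful for measuring the performance of specific access paths. For example, you may know that joining certain tables first effectively reduces the size of the intermediate result sets. In that case, you can use a hint to steer the optimizer toward a better execution plan.

Plan hints are statement-level controls: they take effect only at the current query level of the current statement. During tuning, they let us intervene manually for a specific statement and choose a more efficient execution plan.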
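To make the contrast between the two kinds of intervention concrete, here is a minimal sketch using two of the TPC-DS tables from this post. It assumes the enable_nestloop GUC parameter is available in your version, and the simplified two-table query is made up for illustration; it is not one of the benchmark statements.

```sql
-- Session-level intervention through a GUC parameter (assumed: enable_nestloop).
-- Every statement run in this session is affected until the parameter is reset.
set enable_nestloop = off;

explain
select count(*)
from store_sales s
    ,date_dim d
where s.ss_sold_date_sk = d.d_date_sk
and d.d_year = 2000;

reset enable_nestloop;

-- Statement-level intervention through a plan hint.
-- The hint comment follows the select keyword and affects only this query level.
explain
select /*+ no nestloop(s d) */ count(*)
from store_sales s
    ,date_dim d
where s.ss_sold_date_sk = d.d_date_sk
and d.d_year = 2000;
```

Because a hint applies only to the query level it is written in, a hint meant for a subquery has to be placed after that subquery's own select keyword.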

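GaussDB(DWS) provides the following kinds of plan hints:

Join order hints: adjust the join order

Scan/Join method hints: specify or avoid a scan or join method

Stream method hints: specify or avoid redistribute/broadcast

Row count hints: for a given result set, specify the row count or apply an arithmetic adjustment to the original estimate

Skew hints: specify the values that need special handling during skew optimization

The sections below describe each of these plan hints and how it is used in practice. Except for skew hints, all examples use Q6 from TPC-DS. To make the effect of hints on query optimization clearly visible, we deleted the statistics of the store_sales table. The original statement and the initial plan it produces are as follows.

Example statement: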

explain performance

select a.ca_state state, count(*) cnt

from customer_address a

,customer c

,store_sales s

,date_dim d

,item i

where a.ca_address_sk = c.c_current_addr_sk

and c.c_customer_sk = s.ss_customer_sk

and s.ss_sold_date_sk = d.d_date_sk

and s.ss_item_sk = i.i_item_sk

and d.d_month_seq =

(select distinct (d_month_seq)

from date_dim

where d_year = 2000

and d_moy = 2 )

and i.i_current_price > 1.2 *

(select avg(j.i_current_price)

from item j

where j.i_category = i.i_category)

group by a.ca_state

having count(*) >= 10

order by cnt

limit 100;

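Applying plan hints

Join order hints

Syntax:

Format 1:

leading(table_list)

Specifies only the join order, not the inner/outer table order.

Format 2:

leading((table_list))

Specifies both the join order and the inner/outer table order. The inner/outer order takes effect only at the outermost level.

Description:

table_list is the list of tables whose join order is to be adjusted, with the table names separated by spaces. It can contain any number of tables (aliases) from the current query level and, when a subquery is pulled up, the hint alias of that subquery. Any tables in the list can also be grouped with parentheses to set priority.

Notes:

A table must be written as a single string, without a schema prefix.

If a table has an alias, use the alias to refer to it.

The tables in the list must be unique within the current level or the pulled-up subquery. If they are not, use different aliases to tell them apart.

A table can appear in the list only once.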
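As a sketch of how the two formats read, the hint comments below use the aliases from the example statement (c, s, d, i). Each one is an alternative hint that would go right after the select keyword, not a complete statement. The comments describe the behavior as we understand it from the product documentation, so it is worth confirming the resulting plan with explain.

```sql
-- Format 1: s, d and i are joined before the remaining tables.
-- The order among the three and the inner/outer roles are left to the optimizer.
/*+ leading(s d i) */

-- Format 2: s joins d first with d as the inner table,
-- then that result joins i with i as the inner table.
/*+ leading((s d i)) */

-- Parentheses inside the list set priority: s and d are joined first
-- (inner/outer roles not fixed), then the result is joined with c and i.
/*+ leading(c (s d) i) */
```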

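Example 1:

In the plan for the example statement, operators 17-22 join the store_sales table with the item table and then build a hash table from the result. store_sales is very large, and joining it with item filters out no rows, so both the join and the hash table build take a long time. From what we know about the data distribution in the TPC-DS tables, joining store_sales with date_dim filters out a large share of the rows. We can therefore use a hint to tell the optimizer to join store_sales and date_dim first, with store_sales as the outer table and date_dim as the inner table, which reduces the size of the intermediate result sets. The statement is rewritten as follows: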

explain performance

select /*+ leading((s d)) */ a.ca_state state, count(*) cnt

from customer_address a

,customer c

,store_sales s

,date_dim d

,item i

where a.ca_address_sk = c.c_current_addr_sk

and c.c_customer_sk = s.ss_customer_sk

and s.ss_sold_date_sk = d.d_date_sk

and s.ss_item_sk = i.i_item_sk

and d.d_month_seq =

(select distinct (d_month_seq)

from date_dim

where d_year = 2000

and d_moy = 2 )

and i.i_current_price > 1.2 *

(select avg(j.i_current_price)

from item j

where j.i_category = i.i_category)

group by a.ca_state

having count(*) >= 10

order by cnt

limit 100;

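With the join order adjusted, the intermediate result sets of all subsequent joins shrink sharply, and execution time drops from 34268.322 ms to 11095.046 ms.

Scan/Join method hints

These hints tell the optimizer which scan method or join method to use.

Syntax:

Format of join method hints:

[no] nestloop|hashjoin|mergejoin(table_list)

Format of scan method hints:

[no] tablescan|indexscan|indexonlyscan(table [index])

Description:

no tells the optimizer not to use the given method.

table is the table the hint applies to. Only one table can be specified, and if the table has an alias, the alias should be used in the hint.

index is the name of the index to use with an indexscan or indexonlyscan hint. Currently only one index can be specified.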
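The forms above can be sketched as follows, again using the aliases from the example statement. Each comment is an alternative hint to place after the select keyword. The index name item_price_idx is hypothetical and stands for whatever index actually exists on the table.

```sql
-- Join method: ask for a hash join between s and d.
/*+ hashjoin(s d) */

-- Join method, negated: ask the optimizer to avoid a merge join between s and d.
/*+ no mergejoin(s d) */

-- Scan method with an index: ask for an index scan on item (alias i).
-- item_price_idx is a hypothetical index name.
/*+ indexscan(i item_price_idx) */

-- Scan method, negated: ask the optimizer to avoid a plain table scan on date_dim (alias d).
/*+ no tablescan(d) */
```

The negated form leaves the optimizer free to choose among the remaining methods, which is often the safer option when only one method is known to be a bad fit.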

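Example 2-1:

In the execution plan obtained in Example 1, the row count of store_sales is estimated inaccurately, so store_sales and date_dim are joined with an inefficient nestloop. We now use the hint described in this section to tell the optimizer not to use nestloop for this join.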

explain performance

select /*+ leading((s d)) no nestloop(s d) */ a.ca_state state, count(*) cnt

from customer_address a

,customer c

,store_sales s

,date_dim d

,item i

where a.ca_address_sk = c.c_current_addr_sk

and c.c_customer_sk = s.ss_customer_sk

and s.ss_sold_date_sk = d.d_date_sk

and s.ss_item_sk = i.i_item_sk

and d.d_month_seq =

(select distinct (d_month_seq)

from date_dim

where d_year = 2000

and d_moy = 2 )

and i.i_current_price > 1.2 *

(select avg(j.i_current_price)

from item j

where j.i_category = i.i_category)

group by a.ca_state

having count(*) >= 10

order by cnt

limit 100;
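Instead of ruling out nestloop, the same hint family can name the desired join method directly. The statement below is a simplified two-table sketch rather than the full Q6, and it is not a measured result: whether it produces the same plan as the no nestloop variant depends on the optimizer's estimates.

```sql
-- Join s and d first (s outer, d inner) and request a hash join for that pair.
explain performance
select /*+ leading((s d)) hashjoin(s d) */ count(*)
from store_sales s
    ,date_dim d
where s.ss_sold_date_sk = d.d_date_sk
and d.d_year = 2000
and d.d_moy = 2;
```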
