Troubleshooting high "gc current grant 2-way" and "gc cr grant 2-way": a case study
2020.05.06
A long time ago I ran into a similar case, also a RAC environment showing very high gc cr/current grant 2-way waits: "Performance tuning 'gc cr&current grant 2-way' event (after the host CPUs were expanded)". That case was caused by the two hosts of the same cluster having different CPU counts, which produced different numbers of LMS processes on the two instances and hence unbalanced serving capacity. The case shared this time is a different special situation.
(Source: SMARTDBOPS)
Both nodes showed sustained, high GC-class waits:
NODE1
Top 5 Timed Foreground Events
Event                         | Waits     | Time(s) | Avg wait (ms) | % DB time | Wait Class
enq: TX - index contention    | 27,057    | 32,531  | 1202          | 25.09     | Concurrency
DB CPU                        |           | 26,320  |               | 20.30     |
enq: TX - row lock contention | 165,277   | 9,237   | 56            | 7.12      | Application
gc current grant 2-way        | 61,769    | 9,208   | 149           | 7.10      | Cluster
log file sync                 | 1,937,689 | 6,747   | 3             | 5.20      | Commit
NODE2
Top 5 Timed Foreground Events
Event                      | Waits  | Time(s) | Avg wait (ms) | % DB time | Wait Class
gc buffer busy acquire     | 84,112 | 31,365  | 373           | 35.95     | Cluster
gc current grant 2-way     | 24,906 | 11,217  | 450           | 12.86     | Cluster
gc current block 2-way     | 71,984 | 8,190   | 114           | 9.39      | Cluster
gc cr grant 2-way          | 34,748 | 6,061   | 174           | 6.95      | Cluster
enq: TX - index contention | 3,067  | 4,735   | 1544          | 5.43      | Concurrency
Looking only at node 1, you would conclude this is concurrency contention (enq: TX - index/row lock contention). But there are also gc cr/current grant 2-way waits; this event means the block is not in any instance's buffer cache, and the master grants the requester permission to read it from disk.
There are a few golden rules when diagnosing RAC problems:
1. Beware of being deceived by the top events
In RAC the root problem may be hiding on another node and never appear in the local Top 5 events; the affected (victim) wait events can rank even higher than the root-cause event itself. Instance performance is intertwined: a problem in one node's background process (such as LMS) can surface as Application-class wait events on all the other nodes.
The gc buffer busy xx waits are mainly a symptom seen before the root problem in RAC performance analysis. Simply put, one session is waiting for the BL lock on a block, and the time every other process spends waiting for that block's BL lock to become available is charged to the gc buffer busy xx events.
ASH and block dump trace files can be used to pin down GC BUFFER BUSY waits; the goal is to identify the object, the object type, and the block type (see the query sketch after the table):
Object type | Block type        | Possible issue(s)
Table       | Segment header    | Freelists, freelist groups, deletes from one node, ASSM bugs etc.
Table       | Segment blocks    | Heavy insert workload, heavy updates to few blocks, SQL performance issues scanning few objects aggressively
Index       | Leaf block        | Numerous inserts on indexed columns, sequence or sysdate generated keys
Index       | Root/branch block | Inserts + numerous index scans, concurrent inserts and deletes etc.
Undo        | Undo header block | Numerous short transactions
Undo        | Undo block        | Long pending transaction and CR block generation
Dictionary  | SEQ$ (object)     | Sequences with nocache or order set and excessive access to the sequence
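For example, a minimal ASH sketch (assuming licensed access to gv$active_session_history) that maps gc buffer busy samples back to the contended object:

-- count ASH samples on gc buffer busy waits per object
select o.owner, o.object_name, o.object_type, count(*) samples
  from gv$active_session_history ash, dba_objects o
 where ash.event like 'gc buffer busy%'
   and ash.current_obj# = o.object_id
 group by o.owner, o.object_name, o.object_type
 order by samples desc;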
2. Rule out infrastructure problems
gc cr grant 2-way: when a block is not in any instance's buffer cache, the wait time is charged to this event. The foreground process sends a request to the remote LMS, and the remote LMS replies 'read from disk'. Waits on gc 2-way/3-way events usually indicate queuing caused by high concurrency; the minimum transfer time for one block or message is normally < 2 ms (< 0.5 ms on Exadata). Look at the Wait Event Histogram section of AWR to see how the 2-way/3-way waits distribute across latency buckets, so that network problems can be ruled out.
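Outside of an AWR report, the same distribution can be checked live; a sketch against v$event_histogram:

-- latency bucket distribution for the grant events
select event, wait_time_milli, wait_count
  from v$event_histogram
 where event in ('gc cr grant 2-way', 'gc current grant 2-way')
 order by event, wait_time_milli;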
3. Identify the problem instance
SELECT instance || '->' || inst_id transfer,
       class,
       cr_block cr_blk,
       trunc(cr_block_time / cr_block / 1000, 2) avg_cr,
       current_block cur_blk,
       trunc(current_block_time / current_block / 1000, 2) avg_cur
  FROM gv$instance_cache_transfer
 WHERE cr_block > 0
   AND current_block > 0
 ORDER BY instance, inst_id, class
/
4. Check the sender's performance metrics, not just the receiver's
On the receiving side the wait is a single event, but most of its time is actually spent on the sending side:

gc cr block receive time =
    time for the foreground to send the message to the remote LMS
  + time for LMS to build the block (statistic: gc cr block build time)
  + LMS wait on LGWR latency (statistic: gc cr block flush time)
  + LMS send time (statistic: gc cr block send time)
  + wire latency
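Each of these components is also exposed as a system statistic, so the sender side can be compared across instances directly; a sketch (statistic names as in 11g):

-- sender-side GC service time components per instance
select inst_id, name, value
  from gv$sysstat
 where name in ('gc cr block build time',
                'gc cr block flush time',
                'gc cr block send time',
                'gc current block pin time',
                'gc current block flush time',
                'gc current block send time')
 order by name, inst_id;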
5. Use histograms to spot anomalies, not just averages
i.e. the Wait Event Histogram section of AWR.
Case analysis
Returning to the GC wait problem shown on the AAS chart at the beginning: this is an 11.2.0.3 two-node RAC database on HP-UX 11.31.
# instance 1
Global Cache Load Profile
                                | Per Second | Per Transaction
Global Cache blocks received:   | 1,018.74   | 1.08
Global Cache blocks served:     | 78.53      | 0.08
GCS/GES messages received:      | 904.33     | 0.96
GCS/GES messages sent:          | 2,532.56   | 2.68
DBWR Fusion writes:             | 3.07       | 0.00
Estd Interconnect traffic (KB): | 18,227.67  |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access - local cache %:  99.48
Buffer access - remote cache %:  0.39
Buffer access - disk %:          0.14
Global Cache and Enqueue Services - Workload Characteristics
Avg global enqueue get time (ms):                   2.9
Avg global cache cr block receive time (ms):        6.8
Avg global cache current block receive time (ms):   3.8
Avg global cache cr block build time (ms):          0.0
Avg global cache cr block send time (ms):           0.0
Global cache log flushes for cr blocks served %:    0.5
Avg global cache cr block flush time (ms):         29.3
Avg global cache current block pin time (ms):       1.4
Avg global cache current block send time (ms):      0.0
Global cache log flushes for current blocks served %: 2.1
Avg global cache current block flush time (ms):    30.8
Global Cache and Enqueue Services - Messaging Statistics
Avg message sent queue time (ms):          250.4   <-- issue instance
Avg message sent queue time on ksxp (ms):    6.8
Avg message received queue time (ms):        3.0
Avg GCS message process time (ms):           1.6
Avg GES message process time (ms):           0.0
% of direct sent messages:                  43.52
% of indirect sent messages:                45.21
% of flow controlled messages:              11.27
TIPS:
The "Avg message sent queue time" in AWR comes from (G)V$GES_STATISTICS. It is the time taken to place an indirect message on the message queue and send it (the GCS/GES communication layer). We also need to look at the KSXP message timings to determine whether there is a problem in the IPC layer; IPC time is normally < 1 ms, and you can query X$KSXPCLIENT for more detail (a sketch follows). When this value is very high, it may also be that the CPUs were extremely busy (100% CPU) or the interconnect was at its peak during the interval, which should be confirmed against OSWatcher data.
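A sketch of the underlying views (the X$ view requires SYS; column names are as commonly seen on 11g and should be verified on your release):

-- message queue times that feed the AWR messaging statistics
select inst_id, name, value
  from gv$ges_statistics
 where name like '%queue time%';

-- per-client traffic at the KSXP (IPC) layer
select name, bytes_sent, bytes_rcv
  from x$ksxpclient;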
# instance 2
Global Cache Load Profile
                                | Per Second | Per Transaction
Global Cache blocks received:   | 96.24      | 0.75
Global Cache blocks served:     | 1,043.19   | 8.17
GCS/GES messages received:      | 3,607.20   | 28.26
GCS/GES messages sent:          | 1,342.88   | 10.52
DBWR Fusion writes:             | 3.09       | 0.02
Estd Interconnect traffic (KB): | 19,197.66  |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access - local cache %:  99.43
Buffer access - remote cache %:  0.20
Buffer access - disk %:          0.37
Global Cache and Enqueue Services - Workload Characteristics
Avg global enqueue get time (ms):                   3.5
Avg global cache cr block receive time (ms):      509.7
Avg global cache current block receive time (ms):  96.6
Avg global cache cr block build time (ms):          0.0
Avg global cache cr block send time (ms):           0.0
Global cache log flushes for cr blocks served %:    1.0
Avg global cache cr block flush time (ms):          1.3
Avg global cache current block pin time (ms):       0.2
Avg global cache current block send time (ms):      0.0
Global cache log flushes for current blocks served %: 0.1
Avg global cache current block flush time (ms):     1.7
Global Cache and Enqueue Services - Messaging Statistics
Avg message sent queue time (ms):            0.0
Avg message sent queue time on ksxp (ms):  142.6   <-- abnormally high, see Tips below
Avg message received queue time (ms):        0.0
Avg GCS message process time (ms):           0.0
Avg GES message process time (ms):           0.0
% of direct sent messages:                  73.94
% of indirect sent messages:                18.43
% of flow controlled messages:               7.63
Tips:
Avg global cache cr block receive time (ms): the time from when the requesting instance sends the message to the mastering instance until the block arrives. It includes the time the holding instance spends building the consistent-read image. CR block receive time should not exceed about 15 ms.
Avg global cache current block receive time (ms): the time from sending the message to the mastering instance (2-way get), possibly forwarded on to the holding instance (3-way get), until the block arrives. It includes the holding instance's log flush time. Current block receive time should not exceed about 30 ms.
Avg message sent queue time on ksxp (ms): the time until the peer's IPC layer acknowledges receipt of the message (returns an ACK). It is usually a small number of milliseconds, directly reflects network latency, and is normally under 1 ms. The send time is measured when a block is handed to the IPC layer to be sent: the requests are queued in a completion queue, and the send time runs from the hand-off to the IPC layer until the system call returns a completion, i.e. a status indicating that the send request has been successfully passed on to the network and control returns to the cache layer.
If this value is large, there may be a problem with the cluster interconnect or the LMS processes.
To take an IPC diagnostic dump:
oradebug setmypid
oradebug ipc
1) If the network configuration checks out, verify that the TCP and UDP buffers are not undersized:
root# netstat -s | grep overflow    (should normally be 0)
root# netstat -i
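It can also help to confirm which interface each instance actually uses for the interconnect and where that setting came from; a minimal sketch using gv$cluster_interconnects:

-- interconnect interface, address and configuration source per instance
select inst_id, name, ip_address, is_public, source
  from gv$cluster_interconnects;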
2) Check the LMS process load: ps aux | grep lms
3) Check whether certain tables are generating excessive global CR requests:
select STATISTIC_NAME stat, OWNER, OBJECT_NAME obj, sum(value) val
  from v$segment_statistics
 where STATISTIC_NAME like 'global%'
   and value > 10000
 group by STATISTIC_NAME, OWNER, OBJECT_NAME
 order by val desc;
4) Check whether the CR request counts over the period are high:

select CR_REQUESTS cr, CURRENT_REQUESTS cur, DATA_REQUESTS data,
       UNDO_REQUESTS undo, TX_REQUESTS tx
  from v$cr_block_server;

Confirming the problem node
WAIT time                                        | NODE1 | NODE2
Avg global cache cr block receive time (ms)      |   6.8 | 509.7
Avg global cache current block receive time (ms) |   3.8 |  96.6
Note:
In a RAC environment the slow node causes gc performance problems on the other nodes. The abnormal receive times appear on node 2, which means node 1 is slow at serving blocks; by this logic, node 1 is the problem node.
# LMS process CPU usage at the time
# node 1
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms
 F   S UID     PID  PPID   C  PRI NI ADDR             SZ    WCHAN            STIME    TTY   TIME     COMD
1401 S oracle  5117 1    237  178 20 e000001c934ba340 94816 e000000792a7f050 Jul 13   ?     15776:29 ora_lms1_tyjc1
1401 S grid   23810 1      0  178 20 e000001c9de0e040 90895 e000000792a7f1d0 Jan 25   ?     1550:30  asm_lms0_+ASM1
1401 R oracle  5119 1    254  178 20 e000001ce7c1b9c0 94816 -                Jul 13   ?     16214:17 ora_lms2_tyjc1
 401 S oracle  9366 7730   0  154 20 e000001cdb304c40 30    e000000da0da6472 10:50:38 pts/4 0:00     grep lms
1401 R oracle  5115 1    252  178 20 e000001cf7042a00 94752 -                Jul 13   ?     16329:36 ora_lms0_tyjc1

# node 2
oracle@qdty2:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
 F   S UID     PID  PPID   C  PRI NI ADDR             SZ    WCHAN            STIME  TTY TIME     COMD
1401 S grid   12516 1      0  178 20 e000001b43146400 90854 e000000792d71348 Jan 25 ?   1461:04  asm_lms0_+ASM2
1401 S oracle  6446 1     24  178 20 e000001ef09b0d00 95666 e000000792d710d0 Jul 13 ?   16447:18 ora_lms2_tyjc2
1401 S oracle  6444 1     30  178 20 e000001ebe6c7d00 95666 e000000792d71248 Jul 13 ?   17583:22 ora_lms1_tyjc2
1401 S oracle  6442 1     43  178 20 e000001e7bc97c80 95666 e000000792d71240 Jul 13 ?   17764:57 ora_lms0_tyjc2
Note:
The platform is HP-UX 11.31, so on node 1 the C (CPU) column in the ps output can show values above 100; either way it indicates that the LMS processes were extremely busy on CPU. Node 2 looked normal.
# node 1
Instance Activity Stats
Statistic                                                  | Total     | per Second | per Trans
data blocks consistent reads - undo records applied        | 3.7 TB    | 1.1 GB     | 1.1 MB
rollback changes - undo records applied                    | 4,074,774 | 1,124.49   | 1.19
transaction tables consistent reads - undo records applied | 100.1 MB  | 28.3 KB    | 0 B
# node 2
Instance Activity Stats
Statistic                                                  | Total    | per Second | per Trans
data blocks consistent reads - undo records applied        | 4.6 GB   | 2.7 MB     | 21.6 KB
rollback changes - undo records applied                    | 1,043    | 0.59       | 0.00
transaction tables consistent reads - undo records applied | 112.0 KB | 0 B        | 0 B
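Node 1 is applying undo for consistent reads at roughly 1.1 GB/s, orders of magnitude more than node 2, i.e. CR block fabrication itself is the load. To see which sessions drive the undo application, a sketch against gv$sesstat (statistic name as in 11g):

-- sessions applying the most undo records for CR reads
select s.inst_id, s.sid, s.value
  from gv$sesstat s, gv$statname n
 where s.inst_id = n.inst_id
   and s.statistic# = n.statistic#
   and n.name = 'data blocks consistent reads - undo records applied'
   and s.value > 0
 order by s.value desc;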
If it is neither application SQL nor the network, why is LMS burning so much CPU? Searching MOS, it does not take long to find that we have hit yet another bug: 14672057.
The root cause: starting with 11.2.0.2, the parameter _max_cr_rollbacks was introduced to address the earlier bug 9569316. It defaults to 0 and limits the number of CR rollback operations LMS performs; there is not much published explanation of the parameter. When a large amount of time is being spent applying undo to generate CR blocks, setting it to a non-zero value enables the "light work" rule, which helps reduce the pressure on the LMS processes. The bug affects only 11.2.0.2 and 11.2.0.3; because the workaround is simply to change "_max_cr_rollbacks" dynamically, no one-off patch was provided, and the fix is included in 11.2.0.4. After the change below was made, the GC and enq waits disappeared almost instantly and the system returned to normal, so the real problem was not the index split contention suggested by node 1's top events.
SQL> alter system set "_max_cr_rollbacks"=200;

System altered.
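To verify the value of the hidden parameter before and after the change, the usual x$ksppi/x$ksppcv sketch can be used (as SYS):

-- read the current value of the hidden parameter
select p.ksppinm parameter, v.ksppstvl value, v.ksppstdf is_default
  from x$ksppi p, x$ksppcv v
 where p.indx = v.indx
   and p.ksppinm = '_max_cr_rollbacks';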
# LMS processes on node 1 after the change
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
 F   S UID     PID  PPID  C  PRI NI ADDR             SZ    WCHAN            STIME  TTY TIME     COMD
1401 S oracle  5117 1    55  178 20 e000001c934ba340 94816 e000000792a7f0d0 Jul 13 ?   15778:04 ora_lms1_tyjc1
1401 S grid   23810 1     0  178 20 e000001c9de0e040 90895 e000000792a7f350 Jan 25 ?   1550:30  asm_lms0_+ASM1
1401 S oracle  5119 1    76  178 20 e000001ce7c1b9c0 94816 e000000792a7f1c8 Jul 13 ?   16216:40 ora_lms2_tyjc1
1401 S oracle  5115 1    80  178 20 e000001cf7042a00 94752 e000000792a7f3c0 Jul 13 ?   16331:29 ora_lms0_tyjc1