FAIRYFAR-INTERNAL
 
  FAIRYFAR-INTERNAL  |  SITEMAP  |  ABOUT-ME  |  HOME  
openGauss查询超时不起作用

本文描述 openGauss 1.0.1 build 6942b898 版本中的一个bug。

验证

snippet.sql
[yz@test openGauss]$ gsql tpch1s
tpch1s=# SET statement_timeout=10;
SET
TIME: 2.079 ms
 
tpch1s=# SHOW statement_timeout;
 statement_timeout 
-------------------
 10ms
(1 ROW)
 
TIME: 1.976 ms
 
tpch1s=# \d
ERROR:  canceling statement due TO statement timeout
 
tpch1s=# SELECT COUNT(*) FROM lineitem;
  COUNT  
---------
 6001215
(1 ROW)
 
TIME: 6141.025 ms

可见,查询耗时6秒,仍然可以没有报超时。同样的用例在原生PostgreSQL上执行,查询超时会触发正常报错。

原因

查询执行时,对同一个定时器,先后多次设置了时间。导致后面的定时时间覆盖了前面的时间。

以下堆栈是同一个执行线程("gaussdb" )先后出现的3各堆栈。

(gdb) info threads
  Id   Target Id                                            Frame 
  1    Thread 0x7fbed0dd0480 (LWP 150411) "GaussMaster"     0x00007fbed1b45bed in poll () from /usr/lib64/libc.so.6
  2    Thread 0x7fbed03ff700 (LWP 150541) "jemalloc_bg_thd" 0x00007fbed1e2b9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  3    Thread 0x7fbeab5ff700 (LWP 151475) "GaussMaster"     0x00007fbed1e2f381 in sigwait () from /usr/lib64/libpthread.so.0
  ……
  25   Thread 0x7fbe8d7f5700 (LWP 153766) "WlmArbiter"      0x00007fbed1b47953 in select () from /usr/lib64/libc.so.6
* 26   Thread 0x7fbe889f4700 (LWP 1243811) "gaussdb"        gs_signal_settimer (interval=0x7fbe889d8760) at gs_signal.cpp:1017

首先,ReadCommand设置定时器(缺省600s),此为正常逻辑

(gdb) bt
#0  gs_signal_settimer (interval=0x7fbe889d88d0) at gs_signal.cpp:1017
#1  0x0000000001dedd33 in enable_session_sig_alarm (delayms=600000) at proc.cpp:2117
#2  0x0000000001710d30 in ReadCommand (inBuf=0x7fbe889d8a60) at postgres.cpp:655
#3  0x0000000001720b8f in PostgresMain (argc=1, argv=0x7fbe86ff32b0, dbname=0x7fbe86e9a590 "tpch1s", username=0x7fbe86e9a548 "yz") at postgres.cpp:7869
#4  0x000000000169b39f in BackendRun (port=0x7fbe889d8f30) at postmaster.cpp:7004
#5  0x00000000016a6c31 in GaussDbThreadMain<(knl_thread_role)1> (arg=0x7fbeb01021e0) at postmaster.cpp:10366
#6  0x00000000016a3d9a in InternalThreadFunc (args=0x7fbeb01021e0) at postmaster.cpp:10730
#7  0x0000000001fc6532 in ThreadStarterFunc (arg=0x7fbeb01021d0) at gs_thread.cpp:381
#8  0x00007fbed1e27e65 in start_thread () from /usr/lib64/libpthread.so.0
#9  0x00007fbed1b5088d in clone () from /usr/lib64/libc.so.6

然后,statement超时定时(上例为10ms),这个是我们期望的查询超时时间:

(gdb) bt
#0  gs_signal_settimer (interval=0x7fbe889d8760) at gs_signal.cpp:1017
#1  0x0000000001ded71d in enable_sig_alarm (delayms=2, is_statement_timeout=true) at proc.cpp:2042
#2  0x000000000171a39f in start_xact_command () at postgres.cpp:4974
#3  0x0000000001713d32 in exec_simple_query (query_string=0x7fbe872b7060 "select count(*) from lineitem;", messageType=QUERY_MESSAGE, msg=0x7fbe889d8a60) at postgres.cpp:1966
#4  0x0000000001720f20 in PostgresMain (argc=1, argv=0x7fbe86ff32b0, dbname=0x7fbe86e9a590 "tpch1s", username=0x7fbe86e9a548 "yz") at postgres.cpp:8049
#5  0x000000000169b39f in BackendRun (port=0x7fbe889d8f30) at postmaster.cpp:7004
#6  0x00000000016a6c31 in GaussDbThreadMain<(knl_thread_role)1> (arg=0x7fbeb01021e0) at postmaster.cpp:10366
#7  0x00000000016a3d9a in InternalThreadFunc (args=0x7fbeb01021e0) at postmaster.cpp:10730
#8  0x0000000001fc6532 in ThreadStarterFunc (arg=0x7fbeb01021d0) at gs_thread.cpp:381
#9  0x00007fbed1e27e65 in start_thread () from /usr/lib64/libpthread.so.0
#10 0x00007fbed1b5088d in clone () from /usr/lib64/libc.so.6

但是,之后WLM又设置定时器(缺省30s),把之前的statement超时时间覆盖了:

(gdb) bt
#0  gs_signal_settimer (interval=0x7fbe889d8510) at gs_signal.cpp:1017
#1  0x00000000012ea7ec in WLMSetTimer (start=669806818927860, stop=669806848927860) at statctl.cpp:3821
#2  0x00000000012f1119 in WLMSetDNodeInfoStatus (qid=0x7fbe86f28230, status=WLM_STATUS_RUNNING) at statctl.cpp:5531
#3  0x00000000012f2bc2 in WLMSetCollectInfoStatus (status=WLM_STATUS_RUNNING) at statctl.cpp:6041
#4  0x000000000132326e in dywlm_client_manager (queryDesc=0x7fbe87307060, isQueryDesc=true) at dywlm_client.cpp:2612
#5  0x000000000172a9e2 in PortalRun (portal=0x7fbe8730f060, count=9223372036854775807, is_top_level=true, dest=0x7fbe872b8a58, altdest=0x7fbe872b8a58, completion_tag=0x7fbe889d88d0 "") at pquery.cpp:1051
#6  0x00000000017154b0 in exec_simple_query (query_string=0x7fbe872b7060 "select count(*) from lineitem;", messageType=QUERY_MESSAGE, msg=0x7fbe889d8a60) at postgres.cpp:2447
#7  0x0000000001720f20 in PostgresMain (argc=1, argv=0x7fbe86ff32b0, dbname=0x7fbe86e9a590 "tpch1s", username=0x7fbe86e9a548 "yz") at postgres.cpp:8049
#8  0x000000000169b39f in BackendRun (port=0x7fbe889d8f30) at postmaster.cpp:7004
#9  0x00000000016a6c31 in GaussDbThreadMain<(knl_thread_role)1> (arg=0x7fbeb01021e0) at postmaster.cpp:10366
#10 0x00000000016a3d9a in InternalThreadFunc (args=0x7fbeb01021e0) at postmaster.cpp:10730
#11 0x0000000001fc6532 in ThreadStarterFunc (arg=0x7fbeb01021d0) at gs_thread.cpp:381
#12 0x00007fbed1e27e65 in start_thread () from /usr/lib64/libpthread.so.0
#13 0x00007fbed1b5088d in clone () from /usr/lib64/libc.so.6

这还可以解释另一种情况。当statement_timeout设置的足够小,例如1ms,那么在定时器被重新设置之前,定时器已经1ms超时,那么会正常触发查询超时报错。即,

snippet.sql
tpch1s=# SET statement_timeout=1;
SET
TIME: 2.023 ms
 
tpch1s=# SELECT COUNT(*) FROM lineitem;
ERROR:  canceling statement due TO statement timeout


打赏作者以资鼓励:
移动端扫码阅读:
目录: