Problem

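While running concurrency tests against PostgreSQL, we found that when the number of concurrent psql clients is large, errors like the following are reported: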

psql: error: could not connect to server: Resource temporarily unavailable
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5404"?

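Searching the PostgreSQL source code suggests that "Resource temporarily unavailable" is not an error message produced by PostgreSQL itself. It is the operating system's standard text for errno EAGAIN, which psql simply passes through. The likely explanation is that the operating system rejected psql's connection attempt.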

How to reproduce

The test case is as follows:

First set the PostgreSQL max_connections parameter to 500 in postgresql.conf, then restart the PostgreSQL service:

max_connections = 500
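If you want to script this step, the sketch below edits postgresql.conf in place. The function name set_pg_param is hypothetical. The location of postgresql.conf depends on your installation (typically $PGDATA/postgresql.conf), and the helper assumes GNU sed. An alternative is to run ALTER SYSTEM SET max_connections = 500; from psql, which writes postgresql.auto.conf instead. Either way, max_connections only takes effect after a restart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set "key = value" in a postgresql.conf-style file.
# Lines that already set the key (commented out or not) are rewritten;
# if no such line exists, the setting is appended.
# Assumes GNU sed (-i without a backup suffix).
set_pg_param() {
    local conf="$1" key="$2" value="$3"
    local pattern="^[#[:space:]]*${key}[[:space:]]*="
    if grep -Eq "$pattern" "$conf"; then
        sed -i -E "s|${pattern}.*|${key} = ${value}|" "$conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Usage (path and data directory are installation-specific):
#   set_pg_param "$PGDATA/postgresql.conf" max_connections 500
#   pg_ctl restart -D "$PGDATA"
```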

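Write the test script (test.sh):

#!/bin/bash
# Launch 299 psql sessions in the background so they connect almost simultaneously
for ((i=1; i<300; i++)); do
        psql -p 5404 postgres -c "select pg_sleep(2);" &
done
wait  # wait for all background psql processes to exit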
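Counting failures is more reliable than reading interleaved error messages. This variant of the script reports how many psql processes failed. count_failures is a hypothetical helper. It treats any non-zero exit status as a failure, so errors other than the one discussed here are counted too.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run N copies of a command concurrently, wait for all
# of them, and print how many exited with a non-zero status.
count_failures() {
    local n="$1"; shift
    local pids=() failures=0 i pid
    for ((i = 0; i < n; i++)); do
        "$@" >/dev/null 2>&1 &   # background: all copies start almost at once
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        # "wait PID" returns that particular child's exit status
        wait "$pid" || failures=$((failures + 1))
    done
    echo "$failures"
}

# Usage (requires a server listening on port 5404):
#   count_failures 299 psql -p 5404 postgres -c "select pg_sleep(2);"
```

Once the fix described below is applied, a rerun should report 0, provided max_connections is not itself the limit.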

Running the script may produce errors like the following:

[yz@bogon ~]$ ./test.sh
psql: error: could not connect to server: Resource temporarily unavailable
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5404"?
psql:psql: error: could not connect to server: Resource temporarily unavailable
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5404"?
...

Cause and fix

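Investigation showed that the cause was a too-small value of the kernel parameter net.core.somaxconn (128 by default on older kernels; Linux 5.4 raised the default to 4096).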

[yz@bogon ~]$ sysctl -a | grep somaxconn
net.core.somaxconn = 128

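Increase net.core.somaxconn and then restart the PostgreSQL service (note: the restart is mandatory after changing this parameter). The kernel caps the backlog a server requests in listen() at the somaxconn value in effect when listen() is called. A server that is already listening therefore keeps its old, smaller queue until it is restarted.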

[yz@bogon ~]$ sudo sysctl -w net.core.somaxconn=1024
net.core.somaxconn = 1024
[yz@bogon ~]$ sudo sysctl -a | grep somaxconn
net.core.somaxconn = 1024
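sysctl -w changes only the running kernel, so the value reverts after a reboot. A common way to persist it is a drop-in file under /etc/sysctl.d, loaded with sysctl --system. The sketch below writes such a file. The function name and the file name 90-somaxconn.conf are arbitrary choices. The directory argument exists only so the helper can be tried against a scratch directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a sysctl drop-in so net.core.somaxconn survives reboots.
# The target directory defaults to /etc/sysctl.d (writing there requires root).
persist_somaxconn() {
    local value="$1"
    local dir="${2:-/etc/sysctl.d}"
    local file="$dir/90-somaxconn.conf"
    [[ "$value" =~ ^[0-9]+$ ]] || return 1   # reject non-numeric values
    printf 'net.core.somaxconn = %s\n' "$value" > "$file" && echo "$file"
}

# Usage, as root:
#   persist_somaxconn 1024 && sysctl --system
```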

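The net.core.somaxconn kernel parameter caps the length of a listening socket's accept queue. This queue holds connections the kernel has already set up but the server has not yet taken with accept(). In the test above, every psql process connects to the same listening socket (here the Unix domain socket /tmp/.s.PGSQL.5404). Roughly 300 near-simultaneous connection attempts far exceed 128, so the queue very likely overflows. The following is quoted from the Huawei Cloud documentation: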

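`net.core.somaxconn` indicates the maximum number of half-open connections that can be backlogged in a listening queue. The default value is 128. If the queue is overloaded, you need to increase the listening queue length.

Strictly speaking, somaxconn limits the accept queue of connections waiting for accept(), not half-open connections. On Linux, the TCP half-open (SYN) queue is governed separately by net.ipv4.tcp_max_syn_backlog. The test above connects over a Unix domain socket, which has no handshake, so the accept queue is the limit that matters here.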
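The numbers can be made concrete. The kernel truncates the backlog a server passes to listen() to the value of net.core.somaxconn at that moment. Suppose PostgreSQL requests a backlog of about twice its connection limit. That figure is our assumption from reading the listen() call in the server's pqcomm.c, and the exact formula varies between versions. Under it, max_connections = 500 asks for roughly 1000 slots. The hypothetical helper below computes the queue length that actually applies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the accept-queue length the kernel actually applies is
# the backlog requested in listen(), truncated to net.core.somaxconn.
effective_backlog() {
    local requested="$1" somaxconn="$2"
    if (( requested > somaxconn )); then
        echo "$somaxconn"
    else
        echo "$requested"
    fi
}

# Usage on Linux (1000 is the assumed request for max_connections = 500):
#   effective_backlog 1000 "$(cat /proc/sys/net/core/somaxconn)"
```

With the old default, effective_backlog 1000 128 prints 128, so about 300 simultaneous psql connections can overrun the queue. After raising somaxconn to 1024, effective_backlog 1000 1024 prints 1000, which is enough for this test.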