Hi. The Standby Server Operation section of the documentation, https://www.postgresql.org/docs/13/warm-standby.html, says,

Question

Hi. The Standby Server Operation section of the documentation, https://www.postgresql.org/docs/13/warm-standby.html, says

that a failing restore_command is not a big deal:
"Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_wal directory"
but in practice, starting the server with a nonexistent script as restore_command gives:

sh: /var/lib/pgsql/logreader.script: No such file or directory
2022-02-03 19:27:51 MSK 121190 FATAL: could not restore file "00000002.history" from archive: command not found
2022-02-03 19:27:51 MSK 121187 LOG: startup process (PID 121190) exited with exit code 1
2022-02-03 19:27:51 MSK 121187 LOG: aborting startup due to startup process failure
2022-02-03 19:27:51 MSK 121187 LOG: database system is shut down

I understand this is not quite the case the documentation describes, but still. Or am I misreading it?

#backend #devops #pgsql #programming #russian

03.02.2022

10 answers

36 views

Anton Glushakov · Question author

Ilya Anfimov
I'd like to see the log as a whole. But most lik...

There is nothing before that in the log; what's more, this isn't even a standby. I'm experimenting with Patroni, which by design first starts the node as a replica, checks the leader key, and, if everything is fine, promotes it to primary. And I ran into a case where restore_command fails and PostgreSQL does not start, so Patroni cannot bring the node up.

03.02.2022

Ilya Anfimov

Anton Glushakov
There is nothing before that in the log; what's more, this isn't even a st...

Yes, Patroni is quite the troublemaker. You didn't manually drop any extra WAL files in there, did you?

03.02.2022

Ilya Anfimov

Anton Glushakov
There is nothing before that in the log; what's more, this isn't even a st...

Come to think of it, exactly this behavior can occur if pg_controldata points at one position while the WAL directory holds a bunch of files with a gap between them. Then it has to try to fetch the intermediate files from somewhere.

03.02.2022

Anton Glushakov · Question author

Ilya Anfimov
Come to think of it, exactly this behavior can o...

That's not it. The problem is simply access to the script file named in restore_command. If you, say, restrict its permissions (or delete it), the problem reproduces; restore the permissions and Postgres runs it, gets nothing back, and starts successfully, because all the WAL it needs is available locally.
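
The difference between the two cases comes down to the exit status the shell reports for the command. A minimal sketch, assuming a POSIX `sh` and a throwaway temporary directory (the `restore.sh` name is only borrowed from this thread, and nothing here touches PostgreSQL): a missing script should yield status 127, a script without execute permission 126, and an empty executable script is expected to run and exit with 0.

```shell
#!/bin/sh
# Illustration only: records the exit status sh reports in each case.
workdir=$(mktemp -d)
script="$workdir/restore.sh"

# Case 1: the script does not exist ("No such file or directory").
status_missing=0
sh -c "$script" 2>/dev/null || status_missing=$?

# Case 2: the script exists but has no execute permission.
touch "$script"
chmod a-x "$script"
status_noexec=0
sh -c "$script" 2>/dev/null || status_noexec=$?

# Case 3: the script is executable and empty, as in the reproduction.
chmod +x "$script"
status_ok=0
sh -c "$script" 2>/dev/null || status_ok=$?

echo "missing=$status_missing noexec=$status_noexec executable=$status_ok"
rm -rf "$workdir"
```

Statuses 126 and 127 are reserved by the shell itself, which is why they can be told apart from an ordinary failure of the restore script.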

03.02.2022

Ilya Anfimov

Anton Glushakov
That's not it. The problem is simply access to the s...

Wow.

03.02.2022

Ilya Anfimov

Anton Glushakov
That's not it. The problem is simply access to the s...

Which version?

03.02.2022

Anton Glushakov · Question author

Ilya Anfimov
Which version?

13, but I'm sure it reproduces on any version. Proof:
#mkdir /tmp/data
#pg_ctl init -D /tmp/data
#echo "restore_command = '/tmp/restore.sh'" >> /tmp/data/postgresql.auto.conf
#touch /tmp/data/standby.signal
#pg_ctl start -D /tmp/data
waiting for server to start....2022-02-03 21:52:15.267 MSK [311] LOG: starting PostgreSQL 13.3 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
...
sh: /tmp/restore.sh: No such file or directory
2022-02-03 21:52:15.283 MSK [311] LOG: database system is shut down
...
pg_ctl: could not start server
#touch /tmp/restore.sh
#chmod +x /tmp/restore.sh
#pg_ctl start -D /tmp/data
...
2022-02-03 21:53:39.172 MSK [445] LOG: database system is ready to accept read only connections
done
server started
...

03.02.2022

Ilya Anfimov

Anton Glushakov
13, but I'm sure it reproduces on any vers...

The source has a special check that makes startup fail if the command is not found or was killed by a signal.
"
 * Remember, we rollforward UNTIL the restore fails so failure here is
 * just part of the process... that makes it difficult to determine
 * whether the restore failed because there isn't an archive to restore,
 * or because the administrator has specified the restore program
 * incorrectly. We have to assume the former.
 *
 * However, if the failure was due to any sort of signal, it's best to
 * punt and abort recovery. (If we "return false" here, upper levels will
 * assume that recovery is complete and start up the database!) It's
 * essential to abort on child SIGINT and SIGQUIT, because per spec
 * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
 * those it's a good bet we should have gotten it too.
 *
 * On SIGTERM, assume we have received a fast shutdown request, and exit
 * cleanly. It's pure chance whether we receive the SIGTERM first, or the
 * child process. If we receive it first, the signal handler will call
 * proc_exit, otherwise we do it here. If we or the child process received
 * SIGTERM for any other reason than a fast shutdown request, postmaster
 * will perform an immediate shutdown when it sees us exiting
 * unexpectedly.
 *
 * We treat hard shell errors such as "command not found" as fatal, too.
"
Yes, the documentation is clearly lacking here.
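
The rule in that comment can be restated as a small shell sketch. The function name `classify_restore_status` is hypothetical, and the thresholds are an assumption based on the usual shell convention (126 for "cannot execute", 127 for "command not found", 128+N for death by signal N) rather than a copy of the PostgreSQL code: statuses above 125 stop recovery, and any other nonzero status is read as "nothing more in the archive".

```shell
#!/bin/sh
# Hypothetical helper: classifies the exit status of a restore command
# as reported by sh. The 125 threshold is an assumption (see above).
classify_restore_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "restored"
    elif [ "$status" -gt 125 ]; then
        # 126/127: hard shell error; 128+N: killed by signal N.
        # Recovery is not allowed to continue in these cases.
        echo "abort"
    else
        # Ordinary failure: assume the archive simply has no such file.
        echo "not-available"
    fi
}

classify_restore_status 0
classify_restore_status 1
classify_restore_status 127
classify_restore_status 143   # 128 + 15 (SIGTERM): also stops recovery
```

Under this reading, the `sh: ... No such file or directory` case from the thread (status 127) lands in the abort branch, while a script that runs and returns an ordinary nonzero status such as 1 lets recovery fall back to pg_wal, as the documentation describes.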

03.02.2022

Anton Glushakov · Question author

Ilya Anfimov
The source has a special check that makes startup fail if the c...

Yes, that's exactly it. Thanks :)

03.02.2022

Ilya Anfimov · Accepted Answer

Ilya Anfimov

I'd like to see the log as a whole. But most likely it had already determined that it needs this 00000002.history (and probably a bunch of other segments that are no longer on the server but are needed for the server to catch up).

03.02.2022
