
Replica data file corrupted after restoring database from pg dump file (on primary pod) #3923

Open
mrain32 opened this issue May 31, 2024 · 2 comments


mrain32 commented May 31, 2024

Hi, I got data file corruption errors, with this message in the PostgreSQL log file:
ERROR: could not read block xxx in file "base/xxx/xxx": read only 0 of 8192 bytes

Not all Postgres data files were corrupted, only some of them.

I tried to detect broken FSM files following the instructions in this link, by running this query:

SELECT oid::regclass AS relname, pg_relation_filepath(oid) || '_fsm' AS fsm
FROM pg_class,
     CAST(current_setting('block_size') AS bigint) AS bs
WHERE relkind IN ('r', 'i', 't', 'm')
  AND EXISTS (
        SELECT 1
        FROM generate_series(pg_relation_size(oid) / bs,
                             (pg_relation_size(oid, 'fsm') - 2*bs) / 2) AS blk
        WHERE pg_freespace(oid, blk) > 0
      );

and got results like the following:
[screenshot of the query output omitted]
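
(Note: pg_freespace() is provided by the pg_freespacemap contrib extension, so it has to be created in the target database before the query above will run. A minimal sketch; host, port, user, and database are placeholders:)

psql --host=x.x.x.x --port=xxx --username=xxx --dbname=xxx -c "CREATE EXTENSION IF NOT EXISTS pg_freespacemap;"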

Reproduction steps:

  1. Install a Postgres cluster via Helm with this config YAML:
postgresVersion: 16
pgBouncerReplicas: 1
instances:
  - name: "instance1"
    replicas: 2
    resources:
      requests:
        cpu: "0.5"
        memory: "1Gi"
      limits:
        cpu: "4.0"
        memory: "10Gi"
    dataVolumeClaimSpec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: "1Gi"
patroni:
  dynamicConfiguration:
    postgresql:
      parameters:
        shared_buffers: "1GB"
        max_connections: "300"
pgBackRestConfig: 
  global:
    repo1-retention-full: "2"
    repo1-retention-full-type: count
  repos:
  - name: repo1
    schedules:
      full: "00 18 * * *"
    volume:
      volumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: "1Gi"
pgBouncerConfig: 
  replicas: 1
  service: 
    metadata:
      labels:
        k8slens-edit-resource-version: v1
        postgres-operator.crunchydata.com/cluster: app
        postgres-operator.crunchydata.com/role: pgbouncer
    type: NodePort
    nodePort: [Port number]
  config:
    global:
      max_client_conn: "2000"
      pool_mode: "transaction"
      default_pool_size: "30"
      server_idle_timeout: "60"
service:
  metadata:
    name: app-primary-ex
    namespace: pgo
    labels:
      app-label: app-label-1
  type: NodePort
  nodePort: [Port Number]
  2. Restore a database from a pg_dump backup file (I tried different pg_dump backup files, from different databases, and got the same issue).
  3. Fail over to change the primary pod (patronictl switchover); a rough sketch of the commands for steps 1 and 3 follows this list.
  4. Get the broken FSM file error in the Postgres log file.
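
For reference, a rough sketch of the commands behind steps 1 and 3. The chart path, release name, and pod name below are placeholders/assumptions, not values from this cluster; the namespace comes from the config above:

# Step 1: install the cluster with the values above (chart location is an assumption)
helm install app ./helm/postgres -f values.yaml -n pgo

# Step 3: run the switchover from the current primary pod; patronictl prompts for the candidate
# (the "database" container name is assumed from PGO's pod layout)
kubectl exec -it -n pgo app-instance1-xxxx-0 -c database -- patronictl switchover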

Solved by:

  1. Clearing the data folder of the replica pod right after restoring the DB (from the pg_dump file) --> Patroni automatically reinitializes the replica and copies all the files (see the reinit sketch after this list).
  2. Switching over --> no error.
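
(An equivalent to step 1, if it helps anyone else: instead of clearing the data directory by hand, Patroni can be asked to reinitialize the replica. A sketch with placeholder pod/member names; the "app-ha" scope and "database" container name are assumptions based on PGO's usual naming:)

kubectl exec -it -n pgo app-instance1-xxxx-0 -c database -- patronictl reinit app-ha app-instance1-yyyy-0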

Is it a bug or something?

Environment


  • Platform: Kubernetes
  • Platform Version: 1.28.2
  • PGO Image Tag: ubi8-5.5.1-0
  • Postgres Version: 16
  • Storage: nfs
benjaminjb (Contributor) commented:

Hi @mrain32, sorry you're running into this. Just to make sure I understand: what command are you using to restore from a pg_dump backup?

I'd like to track this problem down, though it seems like you have a good workaround here. Still, instead of starting a cluster with 2 replicas and deleting the data in the replica, could you restore to a cluster that only has a master and, once that restore is done, bump the number of replicas to 2?
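
For example, a sketch of bumping the instance set back to 2 replicas after the restore. The cluster name and namespace come from the config above, while the patch path assumes the replicas live in the first entry of spec.instances:

# Restore with replicas: 1, then scale the instance set back up to 2
kubectl patch postgrescluster app -n pgo --type json -p '[{"op": "replace", "path": "/spec/instances/0/replicas", "value": 2}]'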


mrain32 commented Jun 21, 2024

Hi @benjaminjb, thanks for your reply.

Here is the command I used for restoring from a pg_dump backup:
pg_restore --verbose --host=x.x.x.x --port=xxx --username=xxx --format=t --dbname=xxx filepath

I think if I started with only 1 master, this error would not occur. But in my case I have several databases in the cluster and only restored 1 of them; scaling the number of replica pods down to 0, restoring, and scaling back up to 2 would take time :)

Thank you again,
Best Regards,
mRain
