MariaDB cluster is hitting timeouts between nodes


I am having trouble diagnosing a problem with our MariaDB Cluster and would appreciate some advice.

We are running a three-node MariaDB cluster; each node sits on a dedicated ESXi server, connected over the local network in the same datacenter. Recently we found that the nodes occasionally hit timeout errors. We have looked into many things but could not reach a conclusion.

Here are the detailed logs of the timeout error:

181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection to peer 0a520ba7 with addr tcp://192.168.[censor]:4567 timed out, no messages seen in PT3S
181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.[censor]:4567
181009 18:35:15 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') reconnecting to 0a520ba7 (tcp://192.168.[censor]:4567), attempt 0
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspecting node: 0a520ba7
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspected node without join message, declaring inactive
181009 18:35:18 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:18 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:18 [Note] WSREP: view(view_id(PRIM,b1d2aacb,148) memb {
    b1d2aacb,0
    e0d6a63b,0
} joined {
} left {
} partitioned {
    0a520ba7,0
})
181009 18:35:18 [Note] WSREP: save pc into disk
181009 18:35:18 [Note] WSREP: forgetting 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:18 [Note] WSREP: deleting entry tcp://192.168.[censor]:4567
181009 18:35:18 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
181009 18:35:18 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: sent state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 0 (CLUSTER002)
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 1 (CLUSTER003)
181009 18:35:18 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 126,
    members    = 2/2 (joined/total),
    act_id     = 781947656,
    last_appl. = 781947602,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:18 [Note] WSREP: Flow-control interval: [23, 23]
181009 18:35:18 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781947656, view# 127: Primary, number of nodes: 2, my index: 0, protocol version 3
181009 18:35:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:18 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:18 [Note] WSREP: Assign initial position for certification: 781947656, protocol version: 3
181009 18:35:18 [Note] WSREP: Service thread queue flushed.
181009 18:35:20 [Note] WSREP:  cleaning up 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection established to 0a520ba7 tcp://192.168.[censor]:4567
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
181009 18:35:25 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:27 [Note] WSREP: declaring 0a520ba7 at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:27 [Note] WSREP: view(view_id(PRIM,0a520ba7,149) memb {
    0a520ba7,0
    b1d2aacb,0
    e0d6a63b,0
} joined {
} left {
} partitioned {
})
181009 18:35:27 [Note] WSREP: save pc into disk
181009 18:35:27 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: sent state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 0 (CLUSTER004)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 2 (CLUSTER003)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 1 (CLUSTER002)
181009 18:35:27 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 127,
    members    = 2/3 (joined/total),
    act_id     = 781948414,
    last_appl. = 781948375,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:27 [Note] WSREP: Flow-control interval: [28, 28]
181009 18:35:27 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781948414, view# 128: Primary, number of nodes: 3, my index: 1, protocol version 3
181009 18:35:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:27 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:27 [Note] WSREP: Assign initial position for certification: 781948414, protocol version: 3
181009 18:35:27 [Note] WSREP: Service thread queue flushed.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) requested state transfer from '*any*'. Selected 2.0 (CLUSTER003)(SYNCED) as donor.
181009 18:35:29 [Note] WSREP: 2.0 (CLUSTER003): State transfer to 0.0 (CLUSTER004) complete.
181009 18:35:29 [Note] WSREP: Member 2.0 (CLUSTER003) synced with group.
181009 18:35:29 [Note] WSREP: 0.0 (CLUSTER004): State transfer from 2.0 (CLUSTER003) complete.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) synced with group.
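For reference, the "no messages seen in PT3S" in the first log line seems to correspond to Galera's gmcast.peer_timeout (default PT3S). These are the provider timeouts we have been reading about, shown as a my.cnf sketch; the values are illustrative assumptions, not settings we have applied:

```ini
# Sketch only -- option names are real Galera wsrep provider tunables,
# but the values here are assumptions, not tested recommendations.
[mysqld]
wsrep_provider_options="gmcast.peer_timeout=PT10S;evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S;evs.keepalive_period=PT3S"
```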

What we have done:

  1. We improved our database monitoring and compared the data with our other production environments. We found that "innodb_checkpoint_age.uncheckpointed_bytes" is high, around 4 MB to 10 MB.
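The alerting rule behind the figure in step 1 is simple; a minimal sketch (the 4 MB threshold and host names are assumptions mirroring what we observed) could be:

```python
# Sketch: flag a node when its uncheckpointed byte count exceeds a threshold.
# The metric mirrors our monitoring key innodb_checkpoint_age.uncheckpointed_bytes;
# the 4 MB threshold is an assumption (the low end of what we observed).
ALERT_BYTES = 4 * 1024 * 1024

def checkpoint_age_alert(samples, threshold=ALERT_BYTES):
    """Return the (host, bytes) samples whose value is above the threshold."""
    return [(host, age) for host, age in samples if age > threshold]

samples = [("CLUSTER002", 10 * 1024 * 1024), ("CLUSTER003", 1 * 1024 * 1024)]
print(checkpoint_age_alert(samples))  # only CLUSTER002 trips the alert
```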

  2. We ran some traceroute and ping monitoring and found that sometimes, especially when WSREP performs its connection checks, ping shoots up to over 8000 ms, and once even over 16000 ms, where it should normally be about 0.2 ms.
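The spike detection in step 2 boils down to parsing the RTT out of ping output and flagging outliers; a sketch (the 100 ms cutoff is an assumption, since our baseline is ~0.2 ms) could be:

```python
import re

# Sketch: extract the RTT from Linux `ping` output lines and flag spikes.
# The 100 ms threshold is an assumption; normal LAN latency here is ~0.2 ms.
RTT_RE = re.compile(r"time=([\d.]+) ms")
SPIKE_MS = 100.0

def rtt_spikes(lines, threshold=SPIKE_MS):
    """Return RTT values (in ms) above the threshold found in ping output."""
    hits = []
    for line in lines:
        m = RTT_RE.search(line)
        if m and float(m.group(1)) > threshold:
            hits.append(float(m.group(1)))
    return hits

sample = [
    "64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.21 ms",
    "64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=8123.0 ms",
]
print(rtt_spikes(sample))  # only the multi-second spike is reported
```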

  3. We tried raising the network adapter's rx and tx ring buffers from 256/256 to 512/512 to see if that would help. It did not.
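For reference, the ring-buffer change in step 3 was done with ethtool along these lines (the interface name eth0 is an assumption for our hosts):

```shell
# Inspect current and maximum ring sizes, then raise rx/tx (eth0 assumed).
ethtool -g eth0
ethtool -G eth0 rx 512 tx 512
```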

  4. Someone on the Internet suggested that changing the MTU from 9000 to 1500 would help. It did not; the cluster refused to start with MTU 1500.
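While testing step 4 we also verified whether the path between nodes actually carries jumbo frames, with a non-fragmenting ping along these lines (Linux ping flags; the peer address is an assumption):

```shell
# Send an 8972-byte payload (9000 minus 28 bytes of IP/ICMP headers) with the
# don't-fragment bit set; failures mean the path does not support MTU 9000.
ping -M do -s 8972 -c 3 192.168.0.2
```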

  5. There are some slow queries right before each major incident, in which all cluster nodes shut down and we have to restart them manually. However, we have no proof, and not enough experience, to confirm that this has anything to do with the slow queries.
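To correlate the slow queries in step 5 with the incidents, we enabled the slow query log; a my.cnf sketch (the 1-second threshold and file path are assumptions) looks like:

```ini
# Sketch: slow query logging for correlating queries with cluster incidents.
# The 1-second threshold and log path are assumptions, not recommendations.
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1
log_slow_verbosity  = query_plan
```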

I am not a database expert, so this may be the best I can do. If there is anything we have missed, please reply to this post. Thank you very much.

    
by Richard Leung 10.10.2018 / 06:17

0 answers