DegradedArray event after rsync, but afterwards mdadm and smartctl show no problems

I have rsync running from cron, and I started receiving e-mails like this after each rsync run:

This is an automatically generated mail message from mdadm
running on titan707

A DegradedArray event had been detected on md device /dev/md/2.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb3[1] sda3[0]
      7995840 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdb2[1](F) sda2[0]
      499712 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb4[1](F) sda4[0]
      968130304 blocks super 1.2 [2/1] [U_]

unused devices: <none>
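
In the snapshot above, a failed member is tagged `(F)` and the status brackets show `_` for the missing mirror. The following is a minimal sketch, not part of mdadm, that lists the arrays with such a status. It assumes the usual `/proc/mdstat` layout shown in the mail. The function name `degraded_arrays` is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper (not part of mdadm): print the name of every md array
# whose status brackets contain "_" (a missing or failed mirror).
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                   # remember the array named on the "mdN : ..." line
        /blocks/ && /\[[U_]*_[U_]*\]/ { print name }  # a status such as [U_] means degraded
    ' "${1:-/proc/mdstat}"
}
```

Run against the snapshot from the mail, this prints `md0` and `md2`. Run without arguments, it reads `/proc/mdstat` on whichever machine the shell is logged in to.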

Later, smartctl and mdadm show no problems; see the /proc/mdstat, smartctl and mdadm output below.

$ cat /proc/mdstat 
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] [multipath] 
md0 : active raid1 sda1[0] sdb1[1]
      33553336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0]
      524276 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdb4[1] sda4[0]
      1822442815 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
      1073740664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
$ smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-24-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda XT
Device Model:     ST33000651AS
Serial Number:    Z291E1TG
LU WWN Device Id: 5 000c50 03f2f8fbc
Firmware Version: CC45
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Mar 19 09:20:26 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  600) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 255) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x103f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       152015022
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       40795438
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20281
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   053   046   045    Old_age   Always       -       47 (Min/Max 43/54)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   047   054   000    Old_age   Always       -       47 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   021   003   000    Old_age   Always       -       152015022
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       253145372446521
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2852285811
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       811308464

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20193         -
# 2  Short offline       Completed without error       00%     20185         -
# 3  Extended offline    Completed without error       00%      5723         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

$ smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-24-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda XT
Device Model:     ST33000651AS
Serial Number:    Z2917JDM
LU WWN Device Id: 5 000c50 03f1b6146
Firmware Version: CC45
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Mar 19 09:20:53 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  609) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 255) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x103f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       144398334
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       41707682
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20281
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   057   049   045    Old_age   Always       -       43 (Min/Max 39/51)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       6
194 Temperature_Celsius     0x0022   043   051   000    Old_age   Always       -       43 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   021   003   000    Old_age   Always       -       144398334
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       38959648362297
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       162809159
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1526676264

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20218         -
# 2  Short offline       Completed without error       00%     20185         -
# 3  Extended offline    Completed without error       00%      5723         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

$ 
$ mdadm -D /dev/md0 
/dev/md0:
        Version : 1.2
  Creation Time : Fri Jul 27 13:40:57 2012
     Raid Level : raid1
     Array Size : 33553336 (32.00 GiB 34.36 GB)
  Used Dev Size : 33553336 (32.00 GiB 34.36 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Mar 17 12:24:57 2014
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:0
           UUID : 28ad38a2:f3df9bbc:2f1f4d98:2006ce16
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
$ mdadm -D /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Fri Jul 27 13:40:57 2012
     Raid Level : raid1
     Array Size : 524276 (512.07 MiB 536.86 MB)
  Used Dev Size : 524276 (512.07 MiB 536.86 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Mar 19 06:25:43 2014
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:1
           UUID : 659022e1:e93cfcb9:c7b533ae:5a81c83b
         Events : 25

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
$ mdadm -D /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jul 27 13:40:58 2012
     Raid Level : raid1
     Array Size : 1073740664 (1024.00 GiB 1099.51 GB)
  Used Dev Size : 1073740664 (1024.00 GiB 1099.51 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Mar 19 09:21:40 2014
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:2
           UUID : b79d3e48:62b55d0b:8501355c:2f905ef2
         Events : 34

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
$ mdadm -D /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Fri Jul 27 13:40:58 2012
     Raid Level : raid1
     Array Size : 1822442815 (1738.02 GiB 1866.18 GB)
  Used Dev Size : 1822442815 (1738.02 GiB 1866.18 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Mar 19 09:21:09 2014
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : rescue:3
           UUID : fdb07043:8bd52646:9f267e1b:d0a43f0e
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       8       20        1      active sync   /dev/sdb4
$ 

I can't find anything in dmesg either:

$ dmesg | grep "md"
[    1.957908] md: raid0 personality registered for level 0
[    1.959091] md: raid1 personality registered for level 1
[    2.069112] md: bind
[    2.070684] md: bind
[    2.072032] md: bind
[    2.116159] md: bind
[    2.117310] md/raid1:md3: active with 2 out of 2 mirrors
[    2.117380] md3: detected capacity change from 0 to 1866181442560
[    2.124174] md: bind
[    2.138621]  md3: unknown partition table
[    2.140113] md: bind
[    2.141326] md/raid1:md2: active with 2 out of 2 mirrors
[    2.141398] md2: detected capacity change from 0 to 1099510439936
[    2.162685]  md2: unknown partition table
[    2.230596] md: bind
[    2.231715] md/raid1:md1: active with 2 out of 2 mirrors
[    2.231786] md1: detected capacity change from 0 to 536858624
[    2.233100]  md1: unknown partition table
[    2.436160] md: bind
[    2.437387] md/raid1:md0: active with 2 out of 2 mirrors
[    2.437456] md0: detected capacity change from 0 to 34358616064
[    2.444765]  md0: unknown partition table
[    2.456675] md: raid6 personality registered for level 6
[    2.456738] md: raid5 personality registered for level 5
[    2.456797] md: raid4 personality registered for level 4
[    2.458570] md: raid10 personality registered for level 10
[    2.462736] md: linear personality registered for level -1
[    2.463538] md: multipath personality registered for level -4
[    8.213448] EXT4-fs (md2): mounted filesystem with ordered data mode. Opts: (null)
[   11.334852] Adding 33553332k swap on /dev/md0.  Priority:-1 extents:1 across:33553332k 
[   11.337379] EXT4-fs (md2): warning: checktime reached, running e2fsck is recommended
[   11.359536] EXT4-fs (md2): re-mounted. Opts: (null)
[   11.700105] EXT3-fs (md1): warning: checktime reached, running e2fsck is recommended
[   11.778306] EXT3-fs (md1): using internal journal
[   11.778310] EXT3-fs (md1): mounted filesystem with ordered data mode
[   12.155704] EXT4-fs (md3): warning: checktime reached, running e2fsck is recommended
[   12.218303] EXT4-fs (md3): mounted filesystem with ordered data mode. Opts: (null)
$ dmesg | grep "sd"
[    1.870244] sd 0:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[    1.870251] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    1.870487] sd 0:0:0:0: [sda] Write Protect is off
[    1.870637] sd 1:0:0:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)
[    1.870638] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    1.870667] sd 1:0:0:0: [sdb] Write Protect is off
[    1.870668] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.870697] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.870989] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.870999] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.916610]  sda: sda1 sda2 sda3 sda4 sda5
[    1.917195] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.928325]  sdb: sdb1 sdb2 sdb3 sdb4 sdb5
[    1.929042] sd 1:0:0:0: [sdb] Attached SCSI disk
[    2.069112] md: bind
[    2.070684] md: bind
[    2.072032] md: bind
[    2.116159] md: bind
[    2.124174] md: bind
[    2.140113] md: bind
[    2.230596] md: bind
[    2.436160] md: bind

This is the cron script I run as user mybackup to synchronize content between two servers that I manage:

#!/bin/bash
# Follow the instructions at https://blogs.oracle.com/jkini/entry/how_to_scp_scp_and to set up the mybackup account and SSH keys
rsync -a -r -u [email protected]:/tralev/images /home/tralev/backup
echo finished tralev images
sleep 2s

rsync -a -r -u [email protected]:/backup/* /home/tralev/backup/db
echo finished tralev db
sleep 2s

#backup numbeo files to tralev server
rsync -a -r -u /numbeo/* [email protected]:/numbeo/backup
echo finished numbeo files like images
sleep 2s

rsync -a -r -u /root/backup/* [email protected]:/numbeo/db_backup
echo finished numbeo db backup
sleep 2s

I can reproduce the problem only when the script runs from cron; when I run it manually on the server, the problem does not occur.

Any idea what could be going wrong?

EDIT: It turned out that I was checking the wrong server. What's more, both drives in the titan707 server had failed, so I had to replace the backup server! Human error!

    
by Mladen Adamovic 19.03.2014 / 09:46

1 answer

You are checking the wrong server. The second /proc/mdstat output (with four RAID arrays) is not from titan707, which has three RAID arrays.
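
A mix-up like this can be caught by checking the hostname and the array count before reading any diagnostics. The following is a minimal sketch under stated assumptions: the function names `on_expected_host` and `count_md_arrays` are hypothetical, and the host name to compare against is the one printed in the mdadm mail.

```bash
#!/bin/bash
# Hypothetical helpers for confirming that a shell session is on the machine
# named in the mdadm mail before trusting its output.

# Succeeds only when the short hostname (domain part stripped) equals $1.
on_expected_host() {
    local current
    current=$(uname -n)
    [ "${current%%.*}" = "$1" ]
}

# Counts the md arrays listed in mdstat-formatted text ($1, default /proc/mdstat).
count_md_arrays() {
    grep -c '^md[0-9]* :' "${1:-/proc/mdstat}"
}
```

For example, `on_expected_host titan707 && cat /proc/mdstat` prints the array status only on the machine that sent the mail. Comparing `count_md_arrays` on the live system with the count in the mailed snapshot (three arrays in the mail, four in the later output) would also have exposed the difference.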

    
answered 08.04.2014 / 10:55