I had a problem yesterday with a software RAID, where one disk had to be replaced. I removed its partitions from the arrays using:
mdadm /dev/mdx -r /dev/sdbx
After the hosting center replaced the failed drive, I copied the partition table to the new disk (sdb was the bad device):
sgdisk -R /dev/sdb /dev/sda
Then I gave the new disk fresh random GUIDs:
sgdisk -G /dev/sdb
Then I added all partitions again using:
mdadm /dev/mdx -a /dev/sdbx
This went fine for all partitions except one, whose resync bails out after a few hours at 60%.
This is the current state of the RAID:
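To make the `mdx` / `sdbx` placeholders concrete, here is a minimal sketch that only prints the add commands instead of running them. It assumes that partition N+1 of the disk belongs to /dev/mdN, which is how this particular machine is laid out; the function name `readd_commands` is made up for illustration.

```shell
# Print (do not execute) one "mdadm -a" command per array.
# Assumption: partition N+1 of the disk is a member of /dev/mdN.
readd_commands() {
    disk=$1    # replaced disk, e.g. /dev/sdb
    count=$2   # number of md arrays
    n=0
    while [ "$n" -lt "$count" ]; do
        echo "mdadm /dev/md$n -a ${disk}$((n + 1))"
        n=$((n + 1))
    done
}

readd_commands /dev/sdb 6
```

The printed lines can be reviewed and then run one at a time, so that only one array resyncs at once.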
cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sda6[0] sdb6[2](S)
2633910528 blocks super 1.2 [2/1] [U_]
md4 : active raid1 sda5[0] sdb5[2]
16768896 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sda4[0] sdb4[2]
2096064 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[2]
268304192 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[2]
523968 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[2]
8384448 blocks super 1.2 [2/2] [UU]
unused devices: <none>
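For reference, the degraded array can be picked out of this output automatically. The following is a rough sketch, not a robust parser: it assumes the two-line layout shown above (array line, then a status line ending in something like `[U_]`), and the function name `degraded_arrays` is hypothetical.

```shell
# List md arrays whose status line shows a missing member ("_").
# Usage: degraded_arrays [file]   (reads /proc/mdstat by default)
degraded_arrays() {
    awk '
        # Array line, e.g. "md5 : active raid1 sda6[0] sdb6[2](S)"
        /^md[0-9]+ :/ {
            name = $1; members = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) members = members " " $i
            next
        }
        # Status line, e.g. "2633910528 blocks super 1.2 [2/1] [U_]"
        name != "" && /blocks/ {
            if ($NF ~ /_/) print name ":" members " " $NF
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the output above, it would report only md5, with sdb6 still flagged as a spare `(S)`.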
In the syslog I can see messages like:
Jan 23 14:24:04 rescue kernel: [11163.329021] ata1.00: exception Emask 0x0 SAct 0xf00000 SErr 0x0 action 0x0
Jan 23 14:24:04 rescue kernel: [11163.376449] ata1.00: configured for UDMA/133
Jan 23 14:24:04 rescue kernel: [11163.376475] sd 0:0:0:0: [sda] Unhandled sense code
Jan 23 14:24:04 rescue kernel: [11163.376477] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376479] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 23 14:24:04 rescue kernel: [11163.376481] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376483] Sense Key : Medium Error [current] [descriptor]
Jan 23 14:24:04 rescue kernel: [11163.376486] Descriptor sense data with sense descriptors (in hex):
Jan 23 14:24:04 rescue kernel: [11163.376487] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376495] ce 1f 0d 58
Jan 23 14:24:04 rescue kernel: [11163.376498] sd 0:0:0:0: [sda]
Jan 23 14:24:04 rescue kernel: [11163.376501] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 23 14:24:04 rescue kernel: [11163.376503] sd 0:0:0:0: [sda] CDB:
Jan 23 14:24:04 rescue kernel: [11163.376504] Read(16): 88 00 00 00 00 00 ce 1f 0b 80 00 00 04 00 00 00
Jan 23 14:24:04 rescue kernel: [11163.376513] end_request: I/O error, dev sda, sector 3458141528
and
Jan 23 14:35:22 rescue kernel: [11840.396206] ata1.00: configured for UDMA/133
Jan 23 14:35:22 rescue kernel: [11840.396212] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396216] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396220] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396223] ata1.00: device reported invalid CHS sector 0
Jan 23 14:35:22 rescue kernel: [11840.396230] ata1: EH complete
Jan 23 14:35:52 rescue kernel: [11870.888343] ata1.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x6 frozen
Jan 23 14:35:52 rescue kernel: [11870.945207] ata1.00: cmd 60/00:08:80:c3:58/04:00:ce:00:00/40 tag 1 ncq 524288 in
Jan 23 14:35:52 rescue kernel: [11870.945207] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:52 rescue kernel: [11870.982487] ata1.00: cmd 60/80:10:00:c0:58/03:00:ce:00:00/40 tag 2 ncq 458752 in
Jan 23 14:35:52 rescue kernel: [11870.982487] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.019291] ata1.00: cmd 60/00:f0:80:cb:58/04:00:ce:00:00/40 tag 30 ncq 524288 in
Jan 23 14:35:53 rescue kernel: [11871.019291] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 23 14:35:53 rescue kernel: [11871.055486] ata1: hard resetting link
Jan 23 14:35:53 rescue kernel: [11871.707811] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 23 14:35:53 rescue kernel: [11871.708270] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.708279] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
Jan 23 14:35:53 rescue kernel: [11871.709174] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20131218/psargs-359)
Jan 23 14:35:53 rescue kernel: [11871.709182] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT0._GTF] (Node ffff88041d869a88), AE_NOT_FOUND (20131218/psparse-536)
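As a cross-check of the first log excerpt: the last four bytes of the sense data (`ce 1f 0d 58`) appear to be the failing LBA, and the Read(16) CDB carries the start LBA and length of the request that hit it. A small sketch of the arithmetic, with a hypothetical helper `hex_bytes_to_dec`:

```shell
# Join space-separated hex bytes (most significant first) and print them in decimal.
hex_bytes_to_dec() {
    hex=$(printf '%s' "$*" | tr -d ' ')
    printf '%d\n' "0x$hex"
}

bad_lba=$(hex_bytes_to_dec ce 1f 0d 58)     # information bytes from the sense data
read_start=$(hex_bytes_to_dec ce 1f 0b 80)  # LBA field of the Read(16) CDB
read_len=$(hex_bytes_to_dec 00 00 04 00)    # transfer length field of the CDB

echo "failing LBA:      $bad_lba"
echo "read started at:  $read_start ($read_len sectors)"
echo "offset into read: $((bad_lba - read_start))"
```

The first value comes out as 3458141528, the same sector the kernel reports in the `end_request: I/O error` line, so the unreadable sector is on sda, the disk that still holds the only good copy of md5.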
I am able to mount /dev/md5 and list the files. However, I cannot add the new partition to the array.
Is there any way to fix this without losing the data on the partition?
If not, is it possible to just format that single partition and then add the new drive again? I should have an up-to-date backup of that partition, so that would not be a problem. If possible, I would like to avoid having to wipe all partitions.
smartctl output:
/dev/sda:
smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: ST3000DM001-1CH166
Serial Number: Z1F1XJHC
LU WWN Device Id: 5 000c50 04f3fc2c7
Firmware Version: CC24
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Jan 23 16:16:32 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Error SMART Values Read failed: scsi error aborted command
Smartctl: SMART Read Values failed.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.
SMART Error Log Version: 1
ATA Error Count: 107 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 107 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 15:56:49.931 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:48.680 READ DMA EXT
ef 10 02 00 00 00 a0 00 15:56:48.644 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 15:56:48.644 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 15:56:48.644 IDENTIFY DEVICE
Error 106 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 15:56:45.363 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:44.071 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:42.789 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:42.755 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:42.722 READ DMA EXT
Error 105 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 15:56:15.716 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:12.832 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:11.540 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:10.290 READ DMA EXT
25 00 08 ff ff ff ef 00 15:56:09.448 READ DMA EXT
Error 104 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 ff ff ff ef 00 15:56:02.563 READ DMA EXT
25 00 08 ff ff ff ef 00 15:55:59.655 READ DMA EXT
25 00 08 ff ff ff ef 00 15:55:58.319 READ DMA EXT
25 00 08 ff ff ff ef 00 15:55:58.069 READ DMA EXT
25 00 08 ff ff ff ef 00 15:55:57.838 READ DMA EXT
Error 103 occurred at disk power-on lifetime: 13180 hours (549 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 ff ff ff ef 00 15:55:51.995 READ DMA EXT
25 00 08 ff ff ff ef 00 15:55:50.735 READ DMA EXT
ef 10 02 00 00 00 a0 00 15:55:50.700 SET FEATURES [Reserved for Serial ATA]
27 00 00 00 00 00 e0 00 15:55:50.700 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 15:55:50.699 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4561 -
# 2 Extended offline Completed without error 00% 2977 -
# 3 Extended offline Completed without error 00% 5 -
Device does not support Selective Self Tests/Logging
/dev/sdb:
smartctl -a /dev/sdb
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.14.27] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: ST33000650NS
Serial Number: Z295TK0G
LU WWN Device Id: 5 000c50 04f891ded
Firmware Version: 0004
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Jan 23 16:15:30 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 078 053 044 Pre-fail Always - 70825960
3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 1
7 Seek_Error_Rate 0x000f 088 060 030 Pre-fail Always - 791126750
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7155
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 11
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 090 090 000 Old_age Always - 10
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 066 043 045 Old_age Always In_the_past 34 (5 173 37 27)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 8
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 11
194 Temperature_Celsius 0x0022 034 057 000 Old_age Always - 34 (0 24 0 0)
195 Hardware_ECC_Recovered 0x001a 018 007 000 Old_age Always - 70825960
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 18 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: WP at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 00 18 ff ff ff 4f 00 26d+03:52:28.560 WRITE FPDMA QUEUED
60 00 00 ff ff ff 4f 00 26d+03:52:28.560 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:52:28.559 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:52:28.559 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:52:28.559 READ FPDMA QUEUED
Error 17 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 ff ff ff 4f 00 26d+03:52:13.471 READ FPDMA QUEUED
60 00 58 d0 57 44 43 00 26d+03:52:13.471 READ FPDMA QUEUED
61 00 02 08 90 6d 49 00 26d+03:52:13.471 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 26d+03:52:13.470 FLUSH CACHE EXT
60 00 00 e0 42 20 4e 00 26d+03:52:13.422 READ FPDMA QUEUED
Error 16 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 26d+03:51:56.176 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:51:56.176 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:51:56.175 READ FPDMA QUEUED
60 00 00 e0 0d 20 4e 00 26d+03:51:56.116 READ FPDMA QUEUED
60 00 00 e0 0c 20 4e 00 26d+03:51:56.114 READ FPDMA QUEUED
Error 15 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 50 59 cb 43 00 26d+03:51:24.077 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:51:24.077 READ FPDMA QUEUED
60 00 00 e0 c5 1c 4e 00 26d+03:51:24.076 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 26d+03:51:24.071 FLUSH CACHE EXT
60 00 08 28 46 c1 43 00 26d+03:51:22.717 READ FPDMA QUEUED
Error 14 occurred at disk power-on lifetime: 5559 hours (231 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 26d+03:51:02.317 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 26d+03:51:02.317 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 26d+03:51:02.316 FLUSH CACHE EXT
60 00 08 ff ff ff 4f 00 26d+03:51:02.303 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 26d+03:51:02.300 READ FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 7071 -
# 2 Extended offline Completed without error 00% 7060 -
# 3 Extended offline Completed without error 00% 5600 -
# 4 Short offline Completed without error 00% 2489 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.