High load average, but low CPU / IO values - how to diagnose? (dmesg output included)

We have a Hadoop cluster in which arbitrary data nodes lock up. This is usually preceded by ever-increasing load averages, while CPU and IO usage remain practically nonexistent. The affected machines are high-IO Hadoop data nodes with many large non-dedicated files, and they write many small and large files. The underlying disks run XFS on kernel 2.6.32-358.18.1.el6.x86_64. All machines have 32 GB+ of RAM and more than 8 cores.
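On Linux, the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so load can climb while CPU and IO stay low. A minimal sketch for listing such tasks, assuming three-column `ps` output (state, pid, command); the function names and the sample text are hypothetical:

```python
import subprocess


def d_state_tasks(ps_text):
    """Return (pid, command) for every task whose state starts with 'D'."""
    tasks = []
    for line in ps_text.splitlines():
        parts = line.split(None, 2)  # state, pid, command
        if len(parts) == 3 and parts[0].startswith("D"):
            tasks.append((int(parts[1]), parts[2].strip()))
    return tasks


def live_d_state_tasks():
    """Run ps on the local machine; defined here but not called automatically."""
    out = subprocess.run(
        ["ps", "-e", "-o", "state=", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return d_state_tasks(out)


# Hypothetical sample in the same three-column layout.
SAMPLE_PS = """\
S     1 init
D 22324 swh-logfiles_pr
R  4000 java
D  5856 flush-8:16
"""

print(d_state_tasks(SAMPLE_PS))
```

If the list of D-state tasks keeps growing while the load average rises, that points at blocked tasks rather than CPU pressure.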

The machine model is a Dell R720xd.

The RAID configuration is:

sudo /opt/MegaRAID/MegaCli/MegaCli64 -PdList -aAll

Adapter #0

Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5008e1f239d
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600957SS     ESF76SLAH2NQ
FDE Capable: Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e7b6bd1
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5J0NV
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 2
Device Id: 2
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e783fa9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5FE47
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 3
Device Id: 3
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e7b6ea9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5J0W4
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 4
Device Id: 4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e78e8cd
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5HPC9
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 5
Device Id: 5
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e7b6e51
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5GFW2
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 6
Device Id: 6
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e7b6ef5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5J0GC
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 7
Device Id: 7
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e78e991
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5GG86
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 8
Device Id: 8
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c50095a39799
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SLAQM3Y
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 9
Device Id: 9
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e78e7b1
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5HP5A
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 10
Device Id: 10
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e7b6ce5
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5J0MW
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device

Enclosure Device ID: 32
Slot Number: 11
Device Id: 11
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Firmware state: Online
SAS Address(0): 0x5000c5005e78e269
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST3600057SS     ES666SL5HP7Y
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device


Exit Code: 0x00

The RAID virtual drive configuration is:

 sudo /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aAll


Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:OS
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:558.375 GB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:2
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disabled
Encryption Type: None
Virtual Disk: 1 (Target Id: 1)
Name:
RAID Level: Primary-6, Secondary-0, RAID Level Qualifier-3
Size:4.362 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:10
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None

Exit Code: 0x00

Output of iostat -x

$ iostat -x
    Linux 2.6.32-358.18.1.el6.x86_64 (data1234.svx.foo.bar)     02/17/2016  _x86_64_    (32 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              17.72    0.00    3.54    0.10    0.00   78.65

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.31    27.97    0.49    3.35    18.59   250.38    69.96     0.01    2.26   0.31   0.12
sdb               0.00     1.51   26.10   47.14  4989.96 15418.12   278.65     2.58   35.25   0.50   3.64

Contents of /etc/fstab

UUID=4fe41c9b-f3f1-4c36-99a2-30e2af5c75e1 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
/dev/sdb        /data           xfs defaults,noatime,nodiratime,logbufs=8,nobarrier 1 2
/data/home      /home           none    bind            0 0

Output of xfs_info

xfs_info /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=32, agsize=36593648 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=1170996736, imaxpct=5
         =                       sunit=16     swidth=128 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Output of dmesg

INFO: task swh-logfiles_pr:22324 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swh-logfiles_ D 0000000000000000     0 22324  22300 0x00000000
 ffff881fe29cdd38 0000000000000086 ffff881fe29cdc98 ffffffff8109f641
 ffff881fe29cdcc8 ffffffff8118e05d ffff881fe29cdcc8 ffff881c2e78300a
 ffff881ded459ab8 ffff881fe29cdfd8 000000000000fb88 ffff881ded459ab8
Call Trace:
 [<ffffffff8109f641>] ? in_group_p+0x31/0x40
 [<ffffffff8118e05d>] ? acl_permission_check+0x5d/0xc0
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff81192e67>] do_filp_open+0x2d7/0xdc0
 [<ffffffff8118f541>] ? path_put+0x31/0x40
 [<ffffffff8119f922>] ? alloc_fd+0x92/0x160
 [<ffffffff8117e249>] do_sys_open+0x69/0x140
 [<ffffffff8117e360>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task swh-logfiles_pr:22345 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swh-logfiles_ D 0000000000000001     0 22345  22323 0x00000000
 ffff88201044fd38 0000000000000086 0000000000000000 ffffffff8109f641
 ffff88201044fcc8 ffffffff8118e05d ffff88201044fcc8 ffff881fc7a1500a
 ffff8819d03fe638 ffff88201044ffd8 000000000000fb88 ffff8819d03fe638
Call Trace:
 [<ffffffff8109f641>] ? in_group_p+0x31/0x40
 [<ffffffff8118e05d>] ? acl_permission_check+0x5d/0xc0
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff81192e67>] do_filp_open+0x2d7/0xdc0
 [<ffffffff811b3ffb>] ? vfs_statfs+0x1b/0xb0
 [<ffffffff811a20d0>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8119f922>] ? alloc_fd+0x92/0x160
 [<ffffffff8117e249>] do_sys_open+0x69/0x140
 [<ffffffff8117e360>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task swh-logfiles_pr:22356 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swh-logfiles_ D 0000000000000001     0 22356  22334 0x00000000
 ffff881cc4f8f698 0000000000000086 ffff881cc4f8f85c ffff880e59395038
 ffff881cc4f8f6a8 ffffffffa01a670d ffff881cc4f8f908 0000000000000000
 ffff881fdf067ab8 ffff881cc4f8ffd8 000000000000fb88 ffff881fdf067ab8
Call Trace:
 [<ffffffffa01a670d>] ? xfs_bmap_add_extent+0xad/0x3c0 [xfs]
 [<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
 [<ffffffffa01a7562>] ? xfs_bmapi+0xb42/0x1120 [xfs]
 [<ffffffff8150fec2>] __down+0x72/0xb0
 [<ffffffffa01e78e5>] ? _xfs_buf_find+0xe5/0x230 [xfs]
 [<ffffffff8109cb61>] down+0x41/0x50
 [<ffffffffa01e7751>] xfs_buf_lock+0x51/0x100 [xfs]
 [<ffffffffa01e78e5>] _xfs_buf_find+0xe5/0x230 [xfs]
 [<ffffffffa01e7a64>] xfs_buf_get+0x34/0x1b0 [xfs]
 [<ffffffffa01e80ec>] xfs_buf_read+0x2c/0x100 [xfs]
 [<ffffffffa01dd9a7>] xfs_trans_read_buf+0x1f7/0x410 [xfs]
 [<ffffffffa01c0404>] xfs_read_agi+0x74/0x100 [xfs]
 [<ffffffffa01c04be>] xfs_ialloc_read_agi+0x2e/0x90 [xfs]
 [<ffffffffa01c07a3>] xfs_ialloc_ag_select+0x133/0x270 [xfs]
 [<ffffffffa01c1e67>] xfs_dialloc+0x3d7/0x850 [xfs]
 [<ffffffffa01e6e25>] ? xfs_buf_rele+0x55/0x100 [xfs]
 [<ffffffffa01ddf98>] ? xfs_trans_brelse+0xe8/0x130 [xfs]
 [<ffffffffa01b029b>] ? xfs_da_brelse+0x7b/0xc0 [xfs]
 [<ffffffffa01c5ba0>] xfs_ialloc+0x60/0x6e0 [xfs]
 [<ffffffffa01e2eaa>] ? kmem_zone_zalloc+0x3a/0x50 [xfs]
 [<ffffffffa01de534>] xfs_dir_ialloc+0x74/0x2b0 [xfs]
 [<ffffffffa01e0610>] xfs_create+0x440/0x640 [xfs]
 [<ffffffffa01ed7bd>] xfs_vn_mknod+0xad/0x1c0 [xfs]
 [<ffffffffa01ed900>] xfs_vn_create+0x10/0x20 [xfs]
 [<ffffffff8118fbd4>] vfs_create+0xb4/0xe0
 [<ffffffff811936a0>] do_filp_open+0xb10/0xdc0
 [<ffffffff8118f541>] ? path_put+0x31/0x40
 [<ffffffff8119f922>] ? alloc_fd+0x92/0x160
 [<ffffffff8117e249>] do_sys_open+0x69/0x140
 [<ffffffff8117e360>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task swh-logfiles_pr:22386 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swh-logfiles_ D 0000000000000001     0 22386  22362 0x00000000
 ffff88200be6dd38 0000000000000082 ffff88200be6dc98 ffffffff8109f641
 ffff88200be6dcc8 ffffffff8118e05d ffff88200be6dcc8 ffff881fd395800a
 ffff881fce825af8 ffff88200be6dfd8 000000000000fb88 ffff881fce825af8
Call Trace:
 [<ffffffff8109f641>] ? in_group_p+0x31/0x40
 [<ffffffff8118e05d>] ? acl_permission_check+0x5d/0xc0
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff81192e67>] do_filp_open+0x2d7/0xdc0
 [<ffffffff8118f541>] ? path_put+0x31/0x40
 [<ffffffff8119f922>] ? alloc_fd+0x92/0x160
 [<ffffffff8117e249>] do_sys_open+0x69/0x140
 [<ffffffff8117e360>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task swh-logfiles_pr:22415 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
swh-logfiles_ D 0000000000000000     0 22415  22402 0x00000000
 ffff881cd8f6dd38 0000000000000086 0000000000000000 ffffffff8109f641
 ffff881cd8f6dcc8 ffffffff8118e05d ffff881cd8f6dcc8 ffff881f2073500a
 ffff881fd367c5f8 ffff881cd8f6dfd8 000000000000fb88 ffff881fd367c5f8
Call Trace:
 [<ffffffff8109f641>] ? in_group_p+0x31/0x40
 [<ffffffff8118e05d>] ? acl_permission_check+0x5d/0xc0
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff81192e67>] do_filp_open+0x2d7/0xdc0
 [<ffffffff811b3ffb>] ? vfs_statfs+0x1b/0xb0
 [<ffffffff811a20d0>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8119f922>] ? alloc_fd+0x92/0x160
 [<ffffffff8117e249>] do_sys_open+0x69/0x140
 [<ffffffff8117e360>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task flush-8:16:5856 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:16    D 000000000000000b     0  5856      2 0x00000000
 ffff881fd151b798 0000000000000046 0000000000000000 ffff8820129af380
 0000000000000086 ffff881fd151b720 ffff88200b648ea8 0000000000000001
 ffff881fda34f058 ffff881fd151bfd8 000000000000fb88 ffff881fda34f058
Call Trace:
 [<ffffffff8125ea61>] ? blk_queue_bio+0x121/0x5d0
 [<ffffffff81510695>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff81510826>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81283844>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff8150fd24>] ? down_read+0x24/0x30
 [<ffffffffa01c29cd>] xfs_ilock+0x9d/0xd0 [xfs]
 [<ffffffffa01e491b>] xfs_map_blocks+0x1fb/0x250 [xfs]
 [<ffffffffa01e4a83>] ? xfs_submit_ioend_bio+0x33/0x40 [xfs]
 [<ffffffffa01e5401>] xfs_vm_writepage+0x261/0x5a0 [xfs]
 [<ffffffff811198c0>] ? find_get_pages_tag+0x40/0x130
 [<ffffffff8112cbb7>] __writepage+0x17/0x40
 [<ffffffff8112de6d>] write_cache_pages+0x1fd/0x4c0
 [<ffffffff8112cba0>] ? __writepage+0x0/0x40
 [<ffffffff8112e154>] generic_writepages+0x24/0x30
 [<ffffffffa01e46dd>] xfs_vm_writepages+0x5d/0x80 [xfs]
 [<ffffffff8112e181>] do_writepages+0x21/0x40
 [<ffffffff811aca0d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811ace1e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811acf7b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811ad31b>] wb_writeback+0x29b/0x3f0
 [<ffffffff8150e130>] ? thread_return+0x4e/0x76e
 [<ffffffff81081be2>] ? del_timer_sync+0x22/0x30
 [<ffffffff811ad615>] wb_do_writeback+0x1a5/0x240
 [<ffffffff811ad713>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81096c67>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8113cca6>] bdi_start_fn+0x86/0x100
 [<ffffffff8113cc20>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
INFO: task java:1114 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java          D 0000000000000006     0  1114  31588 0x00000000
 ffff881c5bba7dd8 0000000000000086 0000000000000000 0000000000000001
 ffff881c5bba7d58 ffff881d6ccc9500 ffff881d6ccc9500 ffff881d6ccc9500
 ffff881d6ccc9ab8 ffff881c5bba7fd8 000000000000fb88 ffff881d6ccc9ab8
Call Trace:
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8118ebb0>] lookup_create+0x30/0xd0
 [<ffffffff811924ac>] sys_mkdirat+0x7c/0x130
 [<ffffffff81186f36>] ? sys_newstat+0x36/0x50
 [<ffffffff81192578>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task java:803 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java          D 0000000000000004     0   803  31612 0x00000000
 ffff881c2e7a1dd8 0000000000000082 0000000000000000 0000000000000001
 ffff881c2e7a1d58 ffff881fe5494ae0 ffff881fe5494ae0 ffff881fe5494ae0
 ffff881fe5495098 ffff881c2e7a1fd8 000000000000fb88 ffff881fe5495098
Call Trace:
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8118ebb0>] lookup_create+0x30/0xd0
 [<ffffffff811924ac>] sys_mkdirat+0x7c/0x130
 [<ffffffff81186f36>] ? sys_newstat+0x36/0x50
 [<ffffffff81192578>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task java:1171 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java          D 0000000000000000     0  1171  31636 0x00000000
 ffff881961ce9dd8 0000000000000086 0000000000000000 0000000000000001
 ffff881961ce9d58 ffff881cc26f3540 ffff881cc26f3540 ffff881cc26f3540
 ffff881cc26f3af8 ffff881961ce9fd8 000000000000fb88 ffff881cc26f3af8
Call Trace:
 [<ffffffff811a20d0>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8118ebb0>] lookup_create+0x30/0xd0
 [<ffffffff811924ac>] sys_mkdirat+0x7c/0x130
 [<ffffffff81186f36>] ? sys_newstat+0x36/0x50
 [<ffffffff81192578>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task java:950 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
java          D 0000000000000002     0   950  31666 0x00000000
 ffff88200d42bdd8 0000000000000082 0000000000000000 0000000000000001
 ffff88200d42bd58 ffff881cccccc040 ffff881cccccc040 ffff881cccccc040
 ffff881cccccc5f8 ffff88200d42bfd8 000000000000fb88 ffff881cccccc5f8
Call Trace:
 [<ffffffff811a20d0>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
 [<ffffffff8118ebb0>] lookup_create+0x30/0xd0
 [<ffffffff811924ac>] sys_mkdirat+0x7c/0x130
 [<ffffffff81186f36>] ? sys_newstat+0x36/0x50
 [<ffffffff81192578>] sys_mkdir+0x18/0x20
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
    
by David Corley, 16.02.2016 / 16:10

1 answer

As your kernel log shows, the problem is at the filesystem level or below. The bad news is that the hardware looks fine, and it appears to be more than sufficient for the current load.
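To see where the tasks are stuck without reading every trace by hand, the hung-task messages can be grouped by task name and by the first definite frame of each call trace (frames marked with `?` are unreliable guesses by the unwinder). A rough sketch, assuming the dmesg format shown in the question; the function and regex names are hypothetical:

```python
import re
from collections import Counter

TASK_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than \d+ seconds")
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x")


def summarize_hung_tasks(dmesg_text):
    """Count (task name, first definite stack frame) pairs in hung-task reports."""
    summary = Counter()
    current = None
    for line in dmesg_text.splitlines():
        task = TASK_RE.search(line)
        if task:
            current = task.group(1)  # task name; may itself contain ':'
            continue
        frame = FRAME_RE.search(line)
        if current and frame and not frame.group(1):
            summary[(current, frame.group(2))] += 1
            current = None  # only the first definite frame per report
    return summary


# Shortened excerpt of the dmesg output from the question.
SAMPLE_DMESG = """\
INFO: task flush-8:16:5856 blocked for more than 180 seconds.
flush-8:16    D 000000000000000b     0  5856      2 0x00000000
Call Trace:
 [<ffffffff8125ea61>] ? blk_queue_bio+0x121/0x5d0
 [<ffffffff81510695>] rwsem_down_failed_common+0x95/0x1d0
INFO: task java:1114 blocked for more than 180 seconds.
Call Trace:
 [<ffffffff8150f78e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150f62b>] mutex_lock+0x2b/0x50
"""

for (name, func), count in summarize_hung_tasks(SAMPLE_DMESG).items():
    print(name, func, count)
```

In the posted log, most tasks are waiting on a mutex in the open and mkdir paths, while one task and the flush thread are waiting inside XFS code, which is consistent with a filesystem-level stall.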

In my experience, although XFS is recommended as a scalable filesystem, using it brings more problems than performance gains. If migrating to EXT4 is not an option for you, you can try the following tuning at your own risk:

#increase number of requests:
echo 4096 > /sys/block/sdb/queue/nr_requests
#use aggressive mount options:
mount  -oremount,noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdb /data

You can also try remounting /data with the default options and see whether the problem persists.
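Before changing anything, it may be worth recording the values currently in effect so that the change can be compared and reverted. A small sketch, assuming the standard `/proc/mounts` layout and the sysfs path used above; the helper names and the sample text are hypothetical:

```python
from pathlib import Path


def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry shadows an earlier one
    return options


def live_mount_options(mount_point="/data"):
    """Read the options in effect on this machine; not called automatically."""
    return mount_options(Path("/proc/mounts").read_text(), mount_point)


def read_nr_requests(device="sdb"):
    """Read the current block-layer queue depth for the device; not called automatically."""
    return int(Path(f"/sys/block/{device}/queue/nr_requests").read_text())


# Hypothetical sample; real output will list whatever the kernel actually applied.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb /data xfs rw,noatime,nodiratime,nobarrier,logbufs=8 0 0
"""

print(mount_options(SAMPLE_MOUNTS, "/data"))
```

Comparing the option list before and after the remount shows which options the kernel actually accepted.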

    
answered 16.02.2016 / 16:41