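A Winter Day 2014: The Day the ioDrive Broke
One day
A MySQL slave server equipped with an ioDrive suddenly died.
Or rather, replication on it had stopped.
The database was not referenced by the service,
so its being down did not cause any real problem.
The card in use this time was an ioDrive Duo.
Checking /var/log/messages
First, I checked the system log: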
Jan 27 05:39:36 hoge-dbs kernel: fioinf HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer read had error -1024
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer error -1024 during read on eb 3408
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:- Due to simultaneous multiple device failures in EB 3408, the location of this
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:- error in the filesystem can not be easily determined. It is suggested that
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:- all data be checked to find the bad block and overwritten.
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0:- For best results do not reboot the device until this is done.
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: Groomer was unable to groom EB 3408 after 5 retries: 5 sectors were ungroomed
A lot of suspicious-looking entries had piled up in the log,
so I went through some checks while referring to this article.
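When the log is long, it can help to pull out just the fioerr lines and see which device and erase block they point at. The following is a minimal sketch, not a Fusion-io tool: the function name `summarize_fio_errors` and both regular expressions are my own assumptions, based only on the log lines shown above.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpt above:
#   "... kernel: fioerr <product name> <PCI address>: <message>"
FIO_LINE = re.compile(
    r"kernel: (?P<level>fio\w+) .*?"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):(?P<msg>.*)$"
)
# Erase blocks appear as "eb 3408" or "EB 3408" in the messages.
ERASE_BLOCK = re.compile(r"\b(?:eb|EB) (\d+)")


def summarize_fio_errors(lines):
    """Count fioerr lines per PCI address and collect the erase blocks they mention."""
    counts = Counter()
    erase_blocks = {}
    for line in lines:
        m = FIO_LINE.search(line)
        if not m or m.group("level") != "fioerr":
            continue  # skip non-driver lines and informational (fioinf) lines
        pci = m.group("pci")
        counts[pci] += 1
        for eb in ERASE_BLOCK.findall(m.group("msg")):
            erase_blocks.setdefault(pci, set()).add(int(eb))
    return counts, erase_blocks


# Three lines copied from the log above.
SAMPLE_LOG = """\
Jan 27 05:39:36 hoge-dbs kernel: fioinf HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer read had error -1024
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: groomer error -1024 during read on eb 3408
Jan 27 05:39:36 hoge-dbs kernel: fioerr HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers 0000:09:00.0: Groomer was unable to groom EB 3408 after 5 retries: 5 sectors were ungroomed
"""

counts, erase_blocks = summarize_fio_errors(SAMPLE_LOG.splitlines())
for pci, n in counts.items():
    print(pci, n, sorted(erase_blocks.get(pci, [])))
```

On a real host you would feed it the lines of /var/log/messages instead of the embedded sample.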
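Checking whether the data area is writable
The ioDrive was mounted at /data, so I checked whether data could still be written there. At that point, writes were succeeding.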
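The write check above can be sketched roughly as follows. This is only an illustration of the idea (create a small file, force it to the device, read it back); the helper name `can_write` is hypothetical, and it is not the exact procedure I used at the time.

```python
import os
import tempfile


def can_write(directory, payload=b"write test\n"):
    """Return True if a small file can be created, fsynced, and read back in `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the page cache
            f.seek(0)
            # Note: the read-back may be served from the page cache.
            return f.read() == payload
    except OSError:
        return False


# Runnable anywhere; on the affected host the argument would be "/data".
print(can_write(tempfile.gettempdir()))
```

A successful write only shows that the blocks touched by this one file are fine. As the log warned, the bad erase block could be anywhere, so this check cannot prove the device is healthy.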
Checking fio-status
# fio-status

Found 2 ioDrives in this system with 1 ioDrive Duo
Fusion-io driver version: 2.3.10 build 110

Adapter: ioDrive Duo
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    External Power: NOT connected
    PCIE Power limit threshold: 24.75W
    Sufficient power available: Unknown
    Connected ioDimm modules:
      fct0: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
      fct1: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx

fct0    Attached as 'fioa' (block device)
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    Located in slot 0 Upper of ioDrive Duo SN:xxxxx
    PCI:09:00.0
    Firmware v5.0.7, rev 107053
    322.55 GBytes block device size, 396 GBytes physical device size
    Sufficient power available: Unknown
    Internal temperature: 51.7 degC, max 56.1 degC
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    Located in slot 1 Lower of ioDrive Duo SN:xxxxx
    PCI:0a:00.0
    Firmware v5.0.7, rev 107053
    322.55 GBytes block device size, 396 GBytes physical device size
    Sufficient power available: Unknown
    Internal temperature: 52.2 degC, max 62.0 degC
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%
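Hmm, it looks healthy to me.
Note: I checked some other logs as well,
but I forgot to save them before rebooting the OS, so I will leave them out.
Rebooting the OS
Since the server was not being used by the service, I decided to try rebooting the OS.
The OS came back up without trouble.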
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       547G  4.5G  515G   1% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/sda1       504M   63M  416M  14% /boot
Farewell, dear /data...
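Spotting a missing mount by eye in df output is easy to get wrong, so here is a small sketch of the same check done against /proc/mounts-style text. The helper name `mount_points` is hypothetical, and the filesystem types and options in the sample are assumptions: the df output above does not show them.

```python
def mount_points(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text.

    Each line is expected to look like:
        <device> <mount point> <fstype> <options> <dump> <pass>
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


# Sample modeled on the df output above (fstype and options are made up).
SAMPLE_MOUNTS = """\
/dev/sda3 / ext4 rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
/dev/sda1 /boot ext4 rw 0 0
"""

expected = {"/", "/boot", "/data"}
missing = sorted(expected - mount_points(SAMPLE_MOUNTS))
print("missing mount points:", missing)
```

On a live Linux host, the text would come from reading /proc/mounts; Python's `os.path.ismount("/data")` is a shorter way to ask about a single path.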
fio-status once more
# fio-status

Found 2 ioDrives in this system with 1 ioDrive Duo
Fusion-io driver version: 2.3.10 build 110

Adapter: ioDrive Duo
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    External Power: NOT connected
    PCIE Power limit threshold: 24.75W
    Sufficient power available: Unknown
    Connected ioDimm modules:
      fct0: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
      fct1: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx

fct0    Not attached
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    Located in slot 0 Upper of ioDrive Duo SN:xxxxx
    PCI:09:00.0
    Firmware v5.0.7, rev 107053
    322.55 GBytes block device size, 396 GBytes physical device size
    Sufficient power available: Unknown
    Internal temperature: 51.7 degC, max 56.1 degC
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
    HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
    Located in slot 1 Lower of ioDrive Duo SN:xxxxx
    PCI:0a:00.0
    Firmware v5.0.7, rev 107053
    322.55 GBytes block device size, 396 GBytes physical device size
    Sufficient power available: Unknown
    Internal temperature: 52.2 degC, max 62.0 degC
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%
Farewell, dear fct0...
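The catch in this output is that fct0 still reports "Media status: Healthy" even though it is no longer attached, so the attach state has to be checked separately. Below is a rough sketch of pulling both fields out of fio-status text. The function names and regular expressions are my own, and they assume the layout of driver 2.3.10 as shown in this post (device headers starting at column 0); other driver versions may print things differently.

```python
import re

# Device sections start at column 0, e.g. "fct0    Not attached"
# or "fct1    Attached as 'fiob' (block device)".
DEVICE_HEADER = re.compile(r"^(fct\d+)\s+(?:Attached as '(\w+)'|Not attached)")
MEDIA_STATUS = re.compile(r"Media status: (\w+); Reserves: ([\d.]+)%")


def parse_fio_status(text):
    """Map each fctN device to its block device name (or None) and media status."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = DEVICE_HEADER.match(line)
        if header:
            current = header.group(1)
            devices[current] = {"attached_as": header.group(2), "media": None, "reserves": None}
            continue
        status = MEDIA_STATUS.search(line)
        if status and current:
            devices[current]["media"] = status.group(1)
            devices[current]["reserves"] = float(status.group(2))
    return devices


def detached(devices):
    """Return the names of devices that are not attached as a block device."""
    return [name for name, info in devices.items() if info["attached_as"] is None]


# Abbreviated excerpt of the fio-status output above.
SAMPLE_STATUS = """\
Adapter: ioDrive Duo
    Connected ioDimm modules:
      fct0: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx
      fct1: HP 640GB MLC PCIe ioDrive Duo for ProLiant Servers, Product Number:600282-B21 SN:xxxxx

fct0    Not attached
    PCI:09:00.0
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%

fct1    Attached as 'fiob' (block device)
    PCI:0a:00.0
    Media status: Healthy; Reserves: 100.00%, warn at 10.00%
"""

status = parse_fio_status(SAMPLE_STATUS)
print(status)
print("not attached:", detached(status))
```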
Trying to attach fct0
# fio-attach /dev/fct0
Attaching: [====================] (100%) \
Error: failed to attach /dev/fct0. (4)
No luck, as expected.
Oh, fct0... farewell forever...
[ASCII art: four figures saluting toward a starry sky, with the caption "You pushed yourself too hard..."]
We have been running a fair number of Fusion-io ioDrives of various models for over a year,
but this was the first time I ran into a failure.
Collected a fio-bugreport