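Checking disk status with smartctl on an LSI card under Ubuntu, locating disks with storcli64, and hot-swap replacement

Disk health maintenance on Ubuntu: locating a faulty disk and hot-swapping it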

Checking disk status with smartctl

Installing smartctl

$ sudo apt update
$ sudo apt install smartmontools

Finding which disk backs each mount point

$ df
Filesystem       1K-blocks        Used  Available Use% Mounted on
tmpfs              3282308       20776    3261532   1% /run
efivarfs               302         192        106  65% /sys/firmware/efi/efivars
/dev/sdi2        228554124    11376292  205495068   6% /
tmpfs             16411524           0   16411524   0% /dev/shm
tmpfs                 5120           0       5120   0% /run/lock
/dev/sdi1          1098628        6296    1092332   1% /boot/efi
/dev/sda2      15625861116  9835617292 5790243824  63% /mnt/A01
tmpfs              3282304          12    3282292   1% /run/user/1000
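
The lookup above can also be scripted. The following is a minimal sketch and not part of the original procedure: `device_for_mount` is a hypothetical helper that reads `df` output on standard input and prints the device for a given mount point. It assumes mount points contain no spaces.

```shell
# Hypothetical helper: print the device that backs a mount point.
# Reads `df` output on stdin, skips the header line, and compares the
# last column (the mount point) with the argument.
device_for_mount() {
    awk -v mnt="$1" 'NR > 1 && $NF == mnt { print $1 }'
}

# Usage on a live system:
#   df -P | device_for_mount /mnt/A01
```

With the `df` output shown above, this prints `/dev/sda2`, so the whole disk to pass to smartctl is `/dev/sda`. On systems where the partition naming is less obvious, `lsblk -no PKNAME /dev/sda2` is one way to get the parent disk name.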

Viewing the SMART information for that disk

$ sudo smartctl -a /dev/sda
[sudo] password for knightli: 
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-51-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Ultrastar DC HC550
Device Model:     WDC  WUH721816ALE6L4
Serial Number:    2BHUKDEN
LU WWN Device Id: 5 000cca 295d9b61a
Firmware Version: PCGNW232
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-4 published, ANSI INCITS 529-2018
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jan 24 15:14:57 2025 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1758) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       100
  3 Spin_Up_Time            0x0007   083   083   001    Pre-fail  Always       -       350 (Average 350)
  4 Start_Stop_Count        0x0012   097   097   000    Old_age   Always       -       1511
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       24995
 10 Spin_Retry_Count        0x0013   100   100   001    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       148
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       4832
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       4832
194 Temperature_Celsius     0x0002   058   058   000    Old_age   Always       -       36 (Min/Max 16/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       28

SMART Error Log Version: 1
ATA Error Count: 28 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 28 occurred at disk power-on lifetime: 24824 hours (1034 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 53 20 30 80 60 40  Error: ICRC, ABRT 32 sectors at LBA = 0x00608030 = 6324272

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 20 30 80 60 40 00      00:01:43.419  WRITE DMA EXT
  ef 03 46 00 00 00 00 00      00:01:43.416  SET FEATURES [Set transfer mode]
  ef 03 0c 00 00 00 00 00      00:01:43.414  SET FEATURES [Set transfer mode]
  ec 00 01 00 00 00 00 00      00:01:43.413  IDENTIFY DEVICE
  61 08 00 88 f2 4f 40 00      00:01:43.287  WRITE FPDMA QUEUED

Error 27 occurred at disk power-on lifetime: 24824 hours (1034 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 00 88 f2 4f 40 00      00:01:43.287  WRITE FPDMA QUEUED
  ec 00 01 00 00 00 a0 00      00:01:42.858  IDENTIFY DEVICE
  ec 00 01 00 00 00 a0 00      00:01:42.858  IDENTIFY DEVICE
  25 03 10 00 00 00 40 00      00:01:42.526  READ DMA EXT
  ef 03 46 30 80 60 00 00      00:01:42.524  SET FEATURES [Set transfer mode]

Error 26 occurred at disk power-on lifetime: 24824 hours (1034 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 53 20 30 80 60 40  Error: ICRC, ABRT 32 sectors at LBA = 0x00608030 = 6324272

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 20 30 80 60 40 00      00:01:42.402  WRITE DMA EXT
  ef 03 46 88 f2 4f 00 00      00:01:42.399  SET FEATURES [Set transfer mode]
  ef 03 0c 88 f2 4f 00 00      00:01:42.397  SET FEATURES [Set transfer mode]
  ec 08 00 88 f2 4f 00 00      00:01:42.396  IDENTIFY DEVICE
  61 08 00 88 f2 4f 40 00      00:01:42.272  WRITE FPDMA QUEUED

Error 25 occurred at disk power-on lifetime: 24824 hours (1034 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 00 88 f2 4f 40 00      00:01:42.272  WRITE FPDMA QUEUED
  35 03 08 a0 f2 4f 40 00      00:01:42.271  WRITE DMA EXT
  ef 03 46 a0 f2 4f 00 00      00:01:42.269  SET FEATURES [Set transfer mode]
  ef 03 0c a0 f2 4f 00 00      00:01:42.267  SET FEATURES [Set transfer mode]
  ec 03 08 a0 f2 4f 00 00      00:01:42.267  IDENTIFY DEVICE

Error 24 occurred at disk power-on lifetime: 24824 hours (1034 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 53 08 a0 f2 4f 40  Error: ICRC, ABRT 8 sectors at LBA = 0x004ff2a0 = 5239456

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  35 03 08 a0 f2 4f 40 00      00:01:42.272  WRITE DMA EXT
  ef 03 46 a0 f2 4f 00 00      00:01:42.269  SET FEATURES [Set transfer mode]
  ef 03 0c a0 f2 4f 00 00      00:01:42.267  SET FEATURES [Set transfer mode]
  ec 03 08 a0 f2 4f 00 00      00:01:42.267  IDENTIFY DEVICE
  35 03 08 a0 f2 4f 40 00      00:00:55.790  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

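After the command finishes, use the output to judge the disk's health. Note the serial number (the Serial Number field); it is used in the steps below to locate the disk physically.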
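
Picking the relevant fields out of such a long report by eye is error-prone. The sketch below shows one way to pull out the serial number and a few commonly watched attributes. `smart_serial` and `smart_key_attrs` are hypothetical helper names, the choice of attributes (5, 197, 198, 199) is an assumption about what is worth watching, and both helpers assume the `smartctl -a` layout shown above.

```shell
# Hypothetical helpers that read `smartctl -a` output on stdin.

# Print the drive serial number from the information section.
smart_serial() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Print NAME=RAW_VALUE for attributes 5 (reallocated sectors),
# 197 (pending sectors), 198 (offline uncorrectable) and 199 (CRC errors).
# The 0x check on the FLAG column skips unrelated lines that start with 5.
smart_key_attrs() {
    awk '$1 ~ /^(5|197|198|199)$/ && $3 ~ /^0x/ { print $2 "=" $10 }'
}

# Usage on a live system:
#   sudo smartctl -a /dev/sda | smart_serial
#   sudo smartctl -a /dev/sda | smart_key_attrs
```

For the drive shown above, the second helper reports `UDMA_CRC_Error_Count=28` with the other three at 0. CRC errors are usually attributed to the link (cable, connector, or backplane) rather than to the platters, which is consistent with the ICRC entries in the error log.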

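Using storcli64 to find a disk's physical location

Installing storcli64

Download storcli from the Broadcom website:
https://www.broadcom.com/support/download-search?dk=storcli and pick the version that matches your hardware.

After extracting the archive, the package is in the STORCLI_SAS3.5_P33/univ_viva_cli_rel/Unified_storcli_all_os/Ubuntu directory: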

unzip STORCLI_SAS3.5_P33.zip
cd STORCLI_SAS3.5_P33/univ_viva_cli_rel/Unified_storcli_all_os/Ubuntu
sudo dpkg -i storcli_007.3205.0000.0000_all.deb
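
After installation the binary may not be on `PATH`. The sketch below assumes the package places it under `/opt/MegaRAID/storcli`. That location is an assumption about this package and is not confirmed by the steps above; `find_storcli` and `STORCLI_DIR` are hypothetical names.

```shell
# Assumed install directory; override STORCLI_DIR if yours differs.
STORCLI_DIR="${STORCLI_DIR:-/opt/MegaRAID/storcli}"

# Hypothetical helper: print the path of storcli64, or fail if not found.
find_storcli() {
    if command -v storcli64 >/dev/null 2>&1; then
        command -v storcli64            # already reachable via PATH
    elif [ -x "$STORCLI_DIR/storcli64" ]; then
        printf '%s\n' "$STORCLI_DIR/storcli64"
    else
        echo "storcli64 not found" >&2
        return 1
    fi
}

# Usage:
#   find_storcli && sudo "$(find_storcli)" show
```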

Running storcli64

storcli64 must be run as root.

Finding the controller

# storcli64 show
CLI Version = 007.3205.0000.0000 Oct 09, 2024
Operating system = Linux 6.8.0-51-generic
Status Code = 0
Status = Success
Description = None

Number of Controllers = 1
Host Name = knightli-m3
Operating System  = Linux 6.8.0-51-generic
StoreLib IT Version = 07.3205.0200.0000
StoreLib IR3 Version = 16.16-0

IT System Overview :
==================

------------------------------------------------------------------------------
Ctl Model           AdapterType   VendId DevId SubVendId SubDevId PCI Address 
------------------------------------------------------------------------------
  0 Dell HBA330 Adp   SAS3008(C0) 0x1000  0x97    0x1028   0x1F45 00:02:00:00 
------------------------------------------------------------------------------

In the output, the Ctl column is the controller ID; here it is 0.
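
Reading the controller ID can be scripted in the same way. This is a minimal sketch, assuming the overview table keeps the layout shown above (a header row starting with `Ctl`, followed by one row per controller); `controller_ids` is a hypothetical helper.

```shell
# Hypothetical helper: print the controller IDs found in
# `storcli64 show` output read from stdin.
controller_ids() {
    awk '
        $1 == "Ctl" { in_table = 1; next }           # header row of the table
        in_table && $1 ~ /^[0-9]+$/ { print $1 }     # one row per controller
    '
}

# Usage (as root):
#   storcli64 show | controller_ids
```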

Finding the disk

# storcli64 /c0 show all > hd.txt

Look for the disk in hd.txt:

Drive /c0/e0/s7 Device attributes :
=================================
Manufacturer Id = ATA     
Model Number = WDC  WUH721816ALE6L4
NAND Vendor = NA
SN = 2BHUKDEN            
WWN = 5000CCA295D9B61A
Firmware Revision = PCGNW232
Raw size = 14.552 TB [0x746bfffff Sectors]
Coerced size = 14.552 TB [0x746bfffff Sectors]
Non Coerced size = 14.552 TB [0x746bfffff Sectors]
Device Speed = Unknown
Link Speed = 6.0Gb/s
NCQ setting = N/A
Sector Size = 512B
Config ID = NA
Number of Blocks = 31251759103
Connector Name = N/A

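In the excerpt above, first find the serial number (SN) and match it against the one you are looking for. The header of that block then gives the disk's position, here /c0/e0/s7, which is used below to locate the disk.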
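
The search through hd.txt can be automated. The sketch below remembers the most recent `Drive ... Device attributes` header and prints its path when the `SN` line matches. `drive_path_for_serial` is a hypothetical helper, and it assumes the `storcli64 /c0 show all` layout shown above.

```shell
# Hypothetical helper: print the /cX/eY/sZ path of the drive whose
# serial number equals $1, reading `storcli64 /c0 show all` output on stdin.
drive_path_for_serial() {
    awk -v sn="$1" '
        $1 == "Drive" && /Device attributes/ { path = $2 }   # remember current drive
        $1 == "SN" && $2 == "=" && $3 == sn  { print path }  # serial matches
    '
}

# Usage:
#   drive_path_for_serial 2BHUKDEN < hd.txt
```

Because awk splits on whitespace, the trailing spaces after the serial number in the storcli64 output do not affect the comparison.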

Locating the disk by making its LED blink

# storcli64 /c0/e0/s7 start locate

The disk's LED starts blinking.

# storcli64 /c0/e0/s7 stop locate

The disk's LED stops blinking.
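
It is easy to forget to turn the LED off again. The sketch below wraps the two commands above so the LED blinks for a fixed time and then stops. `blink_drive` and the `STORCLI` variable are hypothetical names, and the 60-second default is an arbitrary choice.

```shell
# Command to invoke; overridable so the function can be dry-run.
STORCLI="${STORCLI:-storcli64}"

# Hypothetical helper: blink the locate LED of one drive for a while.
#   $1  drive path as reported by storcli64, e.g. /c0/e0/s7
#   $2  seconds to keep blinking (default 60)
blink_drive() {
    drive=$1
    seconds=${2:-60}
    case $drive in
        /c[0-9]*/e[0-9]*/s[0-9]*) ;;                       # looks like /cX/eY/sZ
        *) echo "unexpected drive path: $drive" >&2; return 2 ;;
    esac
    "$STORCLI" "$drive" start locate || return 1
    sleep "$seconds"
    "$STORCLI" "$drive" stop locate
}

# Usage (as root):
#   blink_drive /c0/e0/s7 120
```

The path check is deliberately loose; it only guards against passing a device name such as `/dev/sda` by mistake.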
