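mx-qual User Manual
Overview
This document describes how to use mx-qual (Moffett Qualification), a device quality testing tool built on the SOLA Runtime API. It is mainly used to check metrics such as device availability, stability, and performance.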
Usage
mx-qual is a command-line tool. Run mx-qual -h to view its help information.
Moffett Quality Inspection Application v1.3.0
Usage: mx-qual [OPTIONS] SUBCOMMAND
Options:
-h,--help Print this help message and exit
--version Display program version information and exit
Subcommands:
list List all devices detected on the system
hardware_link Run hardware link test
pcie_bandwidth Run PCIe bandwidth test
memory_bandwidth Run memory bandwidth test
p2p Run peer to peer test
compute Run computing power test
stress Run stress test
memtest Run hardware memory test
mx-qual runs each test through a subcommand. To see how a subcommand is used, run mx-qual <sub_command> -h.
Subcommand Reference
list
List devices
Usage: mx-qual list [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
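Lists information for the specified devices. If no device is specified, information for all devices is listed.
Example commands:
# List all devices
mx-qual list
# List specific devices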
mx-qual list -i 0
mx-qual list -i 0 1 2
mx-qual list -i {0,1,2}
Example output:
Device 0: "01S30-00A"
Serial number: 2023243080074
PCI Bus ID: 0000:8a:00.0
Runtime version: 3.4.0
Driver version: 3.4.0
Firmware version: 1.0.14
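When the device list needs to be consumed by a script, the output above can be turned into structured records. The following is a minimal sketch that assumes the output always follows the layout shown in the example; parse_device_list and the "index" and "name" keys are hypothetical names chosen for illustration, not part of mx-qual.

```python
import re

# Sample text copied from the example output above.
LIST_OUTPUT = """\
Device 0: "01S30-00A"
Serial number: 2023243080074
PCI Bus ID: 0000:8a:00.0
Runtime version: 3.4.0
Driver version: 3.4.0
Firmware version: 1.0.14
"""


def parse_device_list(text):
    """Return one dict per 'Device N: "..."' block (hypothetical helper)."""
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r'Device (\d+): "(.*)"$', line)
        if header:
            # A header line starts a new device record.
            devices.append({"index": int(header.group(1)), "name": header.group(2)})
        elif ":" in line and devices:
            # Split on the first colon only, so "0000:8a:00.0" stays intact.
            key, value = line.split(":", 1)
            devices[-1][key.strip()] = value.strip()
    return devices


devices = parse_device_list(LIST_OUTPUT)
print(devices[0]["Serial number"], devices[0]["PCI Bus ID"])
```

Splitting on the first colon only matters here because the PCI bus ID itself contains colons.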
hardware_link
Run hardware link test
Usage: mx-qual hardware_link [OPTIONS]
Options:
-h,--help Print this help message and exit
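Runs the hardware link test, which checks whether the communication links between the driver and all devices are working.
Example command:
mx-qual hardware_link
Example output: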
Test driver link... ok
Test device count... ok
Test device link... ok
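In an automated check, each line of this output can be reduced to a pass/fail decision. The sketch below assumes every check is printed as "<name>... <status>" as in the example; this manual does not show the wording of a failing status, so anything other than "ok" is treated as a failure. failed_checks is a hypothetical helper name.

```python
# Sample text copied from the example output above.
LINK_OUTPUT = """\
Test driver link... ok
Test device count... ok
Test device link... ok
"""


def failed_checks(text):
    """Return the names of checks whose status is not 'ok' (hypothetical helper)."""
    failures = []
    for raw in text.splitlines():
        name, sep, status = raw.strip().rpartition("...")
        # Lines without the "..." separator are not check lines; skip them.
        if sep and status.strip() != "ok":
            failures.append(name.strip())
    return failures


print(failed_checks(LINK_OUTPUT))  # prints []
```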
pcie_bandwidth
Usage: mx-qual pcie_bandwidth [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
-s,--sn TEXT The device sn you specified.
-d,--data_size UINT:INT in [32 - 100]
The transfer size (MB) you specified. (default 100MB)
-l,--loop INT:INT in [1 - 100]
The number of test loop (default: 1)
-f,--full_duplex Enable full duplex mode
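Runs the PCIe bandwidth test. When no device is specified, all devices are tested. Use -i to select devices by index, or -s to select a device by SN. -s takes precedence over -i: if both are given, only the device specified by -s is tested.
Use -d to set the transfer size in MB (default: 100 MB) and -l to set the number of test loops (default: 1). The test runs in half-duplex mode by default; use -f to enable full-duplex mode.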
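The option rules above can be captured in a small helper that builds the argument list before the tool is launched. This is only a sketch: pcie_bandwidth_args is a hypothetical function, the flags and value ranges are taken from the help text, and the helper mirrors the documented precedence by passing only -s when both an SN and indices are supplied.

```python
def pcie_bandwidth_args(indices=None, sn=None, data_size=None, loop=None,
                        full_duplex=False):
    """Build an argv list for 'mx-qual pcie_bandwidth' (hypothetical helper)."""
    args = ["mx-qual", "pcie_bandwidth"]
    if sn is not None:
        # -s takes precedence over -i, so -i is dropped when both are given.
        args += ["-s", str(sn)]
    elif indices:
        args += ["-i", *map(str, indices)]
    if data_size is not None:
        # Range documented in the help text: [32 - 100] MB.
        if not 32 <= data_size <= 100:
            raise ValueError("data_size must be in [32, 100] MB")
        args += ["-d", str(data_size)]
    if loop is not None:
        # Range documented in the help text: [1 - 100].
        if not 1 <= loop <= 100:
            raise ValueError("loop must be in [1, 100]")
        args += ["-l", str(loop)]
    if full_duplex:
        args.append("-f")
    return args


print(" ".join(pcie_bandwidth_args(indices=[0], data_size=32, loop=10)))
# prints: mx-qual pcie_bandwidth -i 0 -d 32 -l 10
```

The returned list can be handed to subprocess.run on a machine where mx-qual is installed.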
Example commands:
mx-qual pcie_bandwidth
mx-qual pcie_bandwidth -i 0
mx-qual pcie_bandwidth -i 0 1 2
mx-qual pcie_bandwidth -s 2023243080096
mx-qual pcie_bandwidth --sn=2023243080096
mx-qual pcie_bandwidth -i 0 -d 32 -l 10
mx-qual pcie_bandwidth -f -l 10
Example output:
PCIe Bandwidth Test
Device id: [0]
Host to Device Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 11.947 GB/s
Device to Host Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 12.139 GB/s
Result = PASS
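To track bandwidth figures over time, the report can be parsed into numbers. The sketch below assumes a single-device report laid out exactly like the example, with each "... Bandwidth" label followed by one "Transfer Size" line; parse_bandwidth and its dictionary keys are hypothetical. Because the memory_bandwidth report uses the same line shapes, the same approach should apply there, though that is an assumption based on the examples in this manual.

```python
import re

# Sample text copied from the example output above.
PCIE_OUTPUT = """\
PCIe Bandwidth Test
Device id: [0]
Host to Device Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 11.947 GB/s
Device to Host Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 12.139 GB/s
Result = PASS
"""

MEASUREMENT = re.compile(r"Transfer Size: ([\d.]+) MB, Bandwidth: ([\d.]+) GB/s")


def parse_bandwidth(text):
    """Return (measurements, result) for a single-device report (hypothetical helper)."""
    measurements, result, label = {}, None, None
    for raw in text.splitlines():
        line = raw.strip()
        match = MEASUREMENT.match(line)
        if match and label:
            # Attach the figures to the most recent "... Bandwidth" label.
            measurements[label] = {
                "size_mb": float(match.group(1)),
                "gb_per_s": float(match.group(2)),
            }
        elif line.endswith("Bandwidth"):
            label = line
        elif line.startswith("Result ="):
            result = line.split("=", 1)[1].strip()
    return measurements, result


measurements, result = parse_bandwidth(PCIE_OUTPUT)
print(result, measurements["Host to Device Bandwidth"]["gb_per_s"])
```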
memory_bandwidth
Run memory bandwidth test
Usage: mx-qual memory_bandwidth [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
-s,--sn TEXT The device sn you specified.
-d,--data_size UINT:INT in [32 - 100]
The transfer size (MB) you specified. (default 100MB)
-l,--loop INT:INT in [1 - 100]
The number of test loop (default: 1)
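Runs the device memory bandwidth test. When no device is specified, all devices are tested. Use -i to select devices by index, or -s to select a device by SN. -s takes precedence over -i: if both are given, only the device specified by -s is tested.
Use -d to set the transfer size in MB (default: 100 MB) and -l to set the number of test loops (default: 1).
Example commands: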
mx-qual memory_bandwidth
mx-qual memory_bandwidth -i 0
mx-qual memory_bandwidth -i 0 1 2
mx-qual memory_bandwidth -s 2023243080096
mx-qual memory_bandwidth --sn=2023243080096
mx-qual memory_bandwidth -i 0 -d 32 -l 10
Example output:
Memory Bandwidth Test
Device id: [0]
Memory Read Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 50.187 GB/s
Memory Write Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 50.053 GB/s
Result = PASS
p2p
Run peer to peer test
Usage: mx-qual p2p [OPTIONS]
Options:
-h,--help Print this help message and exit
-d,--device UINT:INT in [0 - 31] ...
Device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {0,1,2}
-c,--card UINT:INT in [1 - 32] ...
Card index you specified. Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2}
Runs the P2P bandwidth test. Use --device to select devices or --card to select cards. If neither is specified, all devices are tested.
Example commands:
mx-qual p2p
mx-qual p2p -d 0 1 2
mx-qual p2p -c 1 2
Example output:
By device:
P2P Connectivity Matrix
D/D 0 1 2
0 0 1 1
1 1 0 1
2 1 1 0
Unidirectional P2P Bandwidth Matrix (GB/s)
D/D 0 1 2
0 0.00 10.97 10.95
1 10.93 0.00 10.94
2 10.94 10.95 0.00
Bidirectional P2P Bandwidth Matrix (GB/s)
D/D 0 1 2
0 0.00 21.96 21.89
1 21.96 0.00 21.89
2 21.89 21.89 0.00
P2P Latency Matrix (ms)
SPU 0 1 2
0 0.00 3.07 3.06
1 3.09 0.00 3.06
2 3.06 3.06 0.00
CPU 0 1 2
0 0.00 8.63 8.60
1 8.63 0.00 8.60
2 8.71 8.63 0.00
Test Result = PASS
By card:
P2P Connectivity Matrix
C/C 1 2
1 0 1
2 1 0
Unidirectional P2P Bandwidth Matrix (GB/s)
C/C 1 2
1 0.00 14.12
2 14.10 0.00
Bidirectional P2P Bandwidth Matrix (GB/s)
C/C 1 2
1 0.00 28.48
2 28.48 0.00
P2P Latency Matrix (ms)
SPU 1 2
1 0.00 7.13
2 7.14 0.00
CPU 1 2
1 0.00 21.50
2 21.48 0.00
Test Result = PASS
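One quick sanity check on these matrices is to compare the bidirectional figures with the unidirectional ones for each pair. The sketch below parses the two device-level bandwidth matrices from the example and computes that ratio; in this sample it comes out close to 2 for every pair. parse_matrix and bidir_ratios are hypothetical helpers, and the parser assumes whitespace-separated cells with a header row, as shown above.

```python
# Matrices copied from the device-level example output above.
UNIDIRECTIONAL = """\
D/D 0 1 2
0 0.00 10.97 10.95
1 10.93 0.00 10.94
2 10.94 10.95 0.00
"""

BIDIRECTIONAL = """\
D/D 0 1 2
0 0.00 21.96 21.89
1 21.96 0.00 21.89
2 21.89 21.89 0.00
"""


def parse_matrix(text):
    """Return {row_label: {column_label: value}} (hypothetical helper)."""
    rows = [line.split() for line in text.strip().splitlines()]
    columns = rows[0][1:]  # skip the "D/D" corner cell
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in rows[1:]}


def bidir_ratios(uni, bi):
    """Ratio of bidirectional to unidirectional bandwidth per off-diagonal pair."""
    return {
        (row, col): bi[row][col] / uni[row][col]
        for row in uni
        for col in uni[row]
        if row != col  # the diagonal is 0.00 and would divide by zero
    }


ratios = bidir_ratios(parse_matrix(UNIDIRECTIONAL), parse_matrix(BIDIRECTIONAL))
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```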
stress
Usage: mx-qual stress [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
-t,--time INT:INT in [2 - 100000]
Number of minutes consumed in a single stress test. (default: 2)
The deviation is subject to the influence of the machine.
-l,--load INT:{0,50,100} Pressure test load, such as 0%, 50%, 100% (default: 100%)
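Runs the stress test, which stresses the device memory and compute units. Use -i to select devices by index; all devices are tested by default.
Use -t to set the test duration in minutes (default: 2). For example, specify -t 60 to run a one-hour test.
Use -l to set the memory and compute load. Three values are accepted: 0, 50, and 100. The default load is 100%.
During the first minute the devices warm up and their status is not monitored. After that, the tool monitors device temperature, power, and utilization, refreshing once per second, and writes an mx-qual-stress.log file to the current directory for later analysis. When the stress test completes, a summary of the whole run is printed.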
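As a rough planning aid, the timing described above gives an estimate of how many monitoring refreshes a run produces. The sketch assumes that the one-minute warm-up counts toward the -t duration and that exactly one refresh happens per second afterwards; the help text notes that actual timing deviates with the machine, so treat the result as approximate. expected_samples is a hypothetical helper.

```python
WARMUP_MINUTES = 1  # the first minute is warm-up and is not monitored


def expected_samples(minutes):
    """Approximate number of one-second refreshes for a given -t value."""
    # Range documented in the help text: [2 - 100000] minutes.
    if not 2 <= minutes <= 100000:
        raise ValueError("minutes must be in [2, 100000]")
    return (minutes - WARMUP_MINUTES) * 60


print(expected_samples(2))   # 60
print(expected_samples(60))  # 3540
```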
Example commands:
# Run a 2-minute stress test
mx-qual stress
# Run a 60-minute stress test
mx-qual stress -t 60
# Run a 60-minute stress test at 50% load
mx-qual stress -t 60 -l 50
Example output:
device temp.cur temp.avg power.cur power.avg util.cur util.avg
0 68 68 65 66 99 99
1 67 66 65 65 99 100
2 69 68 67 66 99 99
===================================================================================================
Summary
===================================================================================================
device temp.min temp.max temp.avg power.min power.max power.avg util.min util.max util.avg
0 66 68 68 65 68 66 99 99 99
1 65 67 66 63 67 65 99 100 100
2 67 69 68 65 68 66 99 99 99
Test Result = PASS
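The summary table lends itself to simple post-processing, for example to find the device that ran hottest. The following sketch parses the summary rows from the example above; it assumes whitespace-separated integer columns under the header row shown, and parse_summary is a hypothetical helper. The manual does not state the units of the columns, so the code treats them as plain numbers.

```python
# Summary table copied from the example output above.
SUMMARY = """\
device temp.min temp.max temp.avg power.min power.max power.avg util.min util.max util.avg
0 66 68 68 65 68 66 99 99 99
1 65 67 66 63 67 65 99 100 100
2 67 69 68 65 68 66 99 99 99
"""


def parse_summary(text):
    """Return one dict per device row, keyed by column name (hypothetical helper)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(int, row))) for row in rows]


rows = parse_summary(SUMMARY)
hottest = max(rows, key=lambda row: row["temp.max"])
print(hottest["device"], hottest["temp.max"])  # prints: 2 69
```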
compute
Run computing power test
Usage: mx-qual compute [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
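Runs the compute performance test. When no device is specified, all devices are tested. Use -i to select devices by index.
Results are reported per card and include performance data at different sparsity ratios. Note that devices with the same serial number (SN) are treated as one card.
Example command:
mx-qual compute
Example output: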
INT8
1x sparsity:
SN: 2023243080074, actual: 87.64 TOPS, target: 88.47 TOPS, Utilization: 99.06%
SN: 2023243080096, actual: 87.64 TOPS, target: 88.47 TOPS, Utilization: 99.06%
2x sparsity:
SN: 2023243080074, actual: 173.77 TOPS, target: 176.95 TOPS, Utilization: 98.21%
SN: 2023243080096, actual: 173.62 TOPS, target: 176.95 TOPS, Utilization: 98.12%
4x sparsity:
SN: 2023243080074, actual: 347.35 TOPS, target: 353.89 TOPS, Utilization: 98.15%
SN: 2023243080096, actual: 347.55 TOPS, target: 353.89 TOPS, Utilization: 98.21%
8x sparsity:
SN: 2023243080074, actual: 695.91 TOPS, target: 707.79 TOPS, Utilization: 98.32%
SN: 2023243080096, actual: 695.91 TOPS, target: 707.79 TOPS, Utilization: 98.32%
16x sparsity:
SN: 2023243080074, actual: 1377.36 TOPS, target: 1415.58 TOPS, Utilization: 97.30%
SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
32x sparsity:
SN: 2023243080074, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
SN: 2023243080096, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
BF16
1x sparsity:
SN: 2023243080074, actual: 43.82 TOPS, target: 44.24 TOPS, Utilization: 99.05%
SN: 2023243080096, actual: 43.82 TOPS, target: 44.24 TOPS, Utilization: 99.05%
2x sparsity:
SN: 2023243080074, actual: 86.77 TOPS, target: 88.47 TOPS, Utilization: 98.08%
SN: 2023243080096, actual: 86.80 TOPS, target: 88.47 TOPS, Utilization: 98.11%
4x sparsity:
SN: 2023243080074, actual: 174.13 TOPS, target: 176.95 TOPS, Utilization: 98.41%
SN: 2023243080096, actual: 174.23 TOPS, target: 176.95 TOPS, Utilization: 98.46%
8x sparsity:
SN: 2023243080074, actual: 348.76 TOPS, target: 353.89 TOPS, Utilization: 98.55%
SN: 2023243080096, actual: 348.76 TOPS, target: 353.89 TOPS, Utilization: 98.55%
16x sparsity:
SN: 2023243080074, actual: 698.35 TOPS, target: 707.79 TOPS, Utilization: 98.67%
SN: 2023243080096, actual: 698.35 TOPS, target: 707.79 TOPS, Utilization: 98.67%
32x sparsity:
SN: 2023243080074, actual: 1371.04 TOPS, target: 1415.58 TOPS, Utilization: 96.85%
SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
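The Utilization column in this report appears to be the actual figure divided by the target figure. The sketch below recomputes it for a few lines copied from the example and compares it with the reported value. It is an illustration under that assumption, not a statement of how mx-qual computes the column; because the printed values are rounded, the recomputed percentage can differ from the reported one by about 0.01. parse_compute is a hypothetical helper.

```python
import re

# A few result lines copied from the example output above.
COMPUTE_LINES = """\
SN: 2023243080074, actual: 87.64 TOPS, target: 88.47 TOPS, Utilization: 99.06%
SN: 2023243080096, actual: 173.62 TOPS, target: 176.95 TOPS, Utilization: 98.12%
SN: 2023243080074, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
"""

RESULT = re.compile(
    r"SN: (\w+), actual: ([\d.]+) TOPS, target: ([\d.]+) TOPS, Utilization: ([\d.]+)%"
)


def parse_compute(text):
    """Return one record per result line, with utilization recomputed."""
    records = []
    for match in RESULT.finditer(text):
        actual, target = float(match.group(2)), float(match.group(3))
        records.append({
            "sn": match.group(1),
            "actual": actual,
            "target": target,
            "reported": float(match.group(4)),
            "recomputed": 100.0 * actual / target,
        })
    return records


for record in parse_compute(COMPUTE_LINES):
    print(record["sn"], record["reported"], round(record["recomputed"], 2))
```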
memtest
Run hardware memory test
Usage: mx-qual memtest [OPTIONS]
Options:
-h,--help Print this help message and exit
-i,--index UINT:INT in [0 - 31] ...
The device index you specified (default: all). Separate values with spaces.
Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
-t,--type UINT:INT in [0 - 10] ...
The memory test type you specified (default: 7)
type 0 [Walking 1 bit]
type 1 [Own address test]
type 2 [Moving inversions, ones&zeros]
type 3 [Moving inversions, 8 bit pat]
type 4 [Moving inversions, random pattern]
type 5 [Block move, 64 moves]
type 6 [Moving inversions, 32 bit pat]
type 7 [Random number sequence]
type 8 [Modulo 20, random pattern]
type 9 [Bit fade test]
type 10 [Memory stress test]
-l,--loop INT:INT in [1 - 100]
The number of test loop (default: 1)
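This test is modeled on MemTest86 and checks the stability and reliability of the hardware memory.
Use -i to select devices by index; all devices are tested by default. Use -t to select the test type; multiple types may be given, each in the range 0 to 10. See the MemTest86 documentation for the meaning of each type. The default is 7. Use -l to set the number of test loops (default: 1).
Example commands: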
mx-qual memtest
mx-qual memtest -i 0 -t 0
Example output:
Device[5] running test 4 / 4 ... passed
Device[4] running test 4 / 4 ... passed
Device[3] running test 4 / 4 ... passed
Device[2] running test 4 / 4 ... passed
Device[0] running test 4 / 4 ... passed
Device[1] running test 4 / 4 ... passed
Device[6] running test 4 / 4 ... passed
Test Result = PASS
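For scripted runs, the per-device progress lines can be collected into a status map and checked together with the final result line. The sketch below assumes the line format shown in the example; this manual does not show what a failing line looks like, so any status other than "passed" is simply reported as not passed. memtest_status is a hypothetical helper.

```python
import re

# A shortened copy of the example output above.
MEMTEST_OUTPUT = """\
Device[5] running test 4 / 4 ... passed
Device[0] running test 4 / 4 ... passed
Device[1] running test 4 / 4 ... passed
Test Result = PASS
"""

PROGRESS = re.compile(r"Device\[(\d+)\] running test (\d+) / (\d+) \.\.\. (\w+)")


def memtest_status(text):
    """Return {device_index: last status seen for that device} (hypothetical helper)."""
    status = {}
    for match in PROGRESS.finditer(text):
        # Later lines for the same device overwrite earlier ones.
        status[int(match.group(1))] = match.group(4)
    return status


status = memtest_status(MEMTEST_OUTPUT)
not_passed = sorted(dev for dev, state in status.items() if state != "passed")
print(status, not_passed)
```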