# mx-qual User Manual

## Overview

This document describes how to use `mx-qual` (MOFFETT QUALIFICATION). `mx-qual` is a device quality testing tool built on the `SOLA Runtime API`. It is mainly used to check device availability, stability, and performance.

## Usage

`mx-qual` is a command-line tool. Run `mx-qual -h` to view the help information.

```shell
Moffett Quality Inspection Application v1.3.0
Usage: mx-qual [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit
  --version                   Display program version information and exit

Subcommands:
  list                        List all devices detected on the system
  hardware_link               Run hardware link test
  pcie_bandwidth              Run PCIe bandwidth test
  memory_bandwidth            Run memory bandwidth test
  p2p                         Run peer to peer test
  compute                     Run computing power test
  stress                      Run stress test
  memtest                     Run hardware memory test
```

`mx-qual` runs each test as a subcommand. To see how to use a subcommand, run `mx-qual <subcommand> -h`.

### Subcommands

#### list

```text
List devices
Usage: mx-qual list [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
```

Lists information for the specified devices. If no device is specified, all devices are listed.

Example commands:

```shell
# List all devices
mx-qual list
# List specific devices
mx-qual list -i 0
mx-qual list -i 0 1 2
mx-qual list -i {0,1,2}
```

Example output:

```shell
Device 0: "01S30-00A"
  Serial number:      2023243080074
  PCI Bus ID:         0000:8a:00.0
  Runtime version:    3.4.0
  Driver version:     3.4.0
  Firmware version:   1.0.14
```

#### hardware_link

```text
Run hardware link test
Usage: mx-qual hardware_link [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
```

Runs the hardware link test, which checks whether the communication links between the driver and all devices are working properly.

Example command:

```shell
mx-qual hardware_link
```

Example output:

```text
Test driver link... ok
Test device count... ok
Test device link... ok
```

#### pcie_bandwidth

```text
Usage: mx-qual pcie_bandwidth [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -s,--sn TEXT                The device sn you specified.
  -d,--data_size UINT:INT in [32 - 100]
                              The transfer size (MB) you specified.
                              (default 100MB)
  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)
  -f,--full_duplex            Enable full duplex mode
```

Runs the PCIe bandwidth test. If no device is specified, all devices are tested. Use `-i` to select devices by `index`, or `-s` to select a device by `sn`. `-s` takes precedence over `-i`: if both are given, only the device specified by `-s` is tested.

Use `-d` to set the transfer size in MB (range 32 to 100, default 100) and `-l` to set the number of test loops (range 1 to 100, default 1). The test runs in half-duplex mode by default; pass `-f` to enable full-duplex mode.

Example commands:

```shell
mx-qual pcie_bandwidth
mx-qual pcie_bandwidth -i 0
mx-qual pcie_bandwidth -i 0 1 2
mx-qual pcie_bandwidth -s 2023243080096
mx-qual pcie_bandwidth --sn=2023243080096
mx-qual pcie_bandwidth -i 0 -d 32 -l 10
mx-qual pcie_bandwidth -f -l 10
```

Example output:

```text
PCIe Bandwidth Test
Device id: [0]
Host to Device Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 11.947 GB/s
Device to Host Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 12.139 GB/s
Result = PASS
```

#### memory_bandwidth

```text
Run memory bandwidth test
Usage: mx-qual memory_bandwidth [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -s,--sn TEXT                The device sn you specified.
  -d,--data_size UINT:INT in [32 - 100]
                              The transfer size (MB) you specified.
                              (default 100MB)
  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)
```

Runs the device memory bandwidth test. If no device is specified, all devices are tested. Use `-i` to select devices by `index`, or `-s` to select a device by `sn`. `-s` takes precedence over `-i`: if both are given, only the device specified by `-s` is tested.

Use `-d` to set the transfer size in MB (range 32 to 100, default 100) and `-l` to set the number of test loops (range 1 to 100, default 1).

Example commands:

```shell
mx-qual memory_bandwidth
mx-qual memory_bandwidth -i 0
mx-qual memory_bandwidth -i 0 1 2
mx-qual memory_bandwidth -s 2023243080096
mx-qual memory_bandwidth --sn=2023243080096
mx-qual memory_bandwidth -i 0 -d 32 -l 10
```

Example output:

```text
Memory Bandwidth Test
Device id: [0]
Memory Read Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 50.187 GB/s
Memory Write Bandwidth
Transfer Size: 100.000 MB, Bandwidth: 50.053 GB/s
Result = PASS
```

#### p2p

```bash
Run peer to peer test
Usage: mx-qual p2p [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -d,--device UINT:INT in [0 - 31] ...
                              Device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {0,1,2}
  -c,--card UINT:INT in [1 - 32] ...
                              Card index you specified. Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2}
```

Runs the `p2p` bandwidth test. Use `--device` to select devices or `--card` to select cards. If neither is given, all devices are tested.

Example commands:

```bash
mx-qual p2p
mx-qual p2p -d 0 1 2
mx-qual p2p -c 1 2
```

Example output:

By device:

```bash
P2P Connectivity Matrix
D/D    0      1      2
0      0      1      1
1      1      0      1
2      1      1      0
Unidirectional P2P Bandwidth Matrix (GB/s)
D/D    0      1      2
0      0.00   10.97  10.95
1      10.93  0.00   10.94
2      10.94  10.95  0.00
Bidirectional P2P Bandwidth Matrix (GB/s)
D/D    0      1      2
0      0.00   21.96  21.89
1      21.96  0.00   21.89
2      21.89  21.89  0.00
P2P Latency Matrix (ms)
SPU    0      1      2
0      0.00   3.07   3.06
1      3.09   0.00   3.06
2      3.06   3.06   0.00
CPU    0      1      2
0      0.00   8.63   8.60
1      8.63   0.00   8.60
2      8.71   8.63   0.00
Test Result = PASS
```

By card:

```text
P2P Connectivity Matrix
C/C    1      2
1      0      1
2      1      0
Unidirectional P2P Bandwidth Matrix (GB/s)
C/C    1      2
1      0.00   14.12
2      14.10  0.00
Bidirectional P2P Bandwidth Matrix (GB/s)
C/C    1      2
1      0.00   28.48
2      28.48  0.00
P2P Latency Matrix (ms)
SPU    1      2
1      0.00   7.13
2      7.14   0.00
CPU    1      2
1      0.00   21.50
2      21.48  0.00
Test Result = PASS
```

#### stress

```text
Usage: mx-qual stress [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -t,--time INT:INT in [2 - 100000]
                              Number of minutes consumed in a single stress test. (default: 2)
                              The deviation is subject to the influence of the machine.
  -l,--load INT:{0,50,100}    Pressure test load, such as 0%, 50%, 100% (default: 100%)
```

Runs the stress test, which loads the device's memory and compute units. Use `-i` to select devices by `index`; all devices are tested by default.

Use `-t` to set the test duration in minutes (default: 2). For example, pass `-t 60` for a one-hour test.

Use `-l` to set the memory and compute load. The three accepted values are `0`, `50`, and `100`; the default load is 100%.

During the first minute the device warms up and its status is not monitored. After that, the tool monitors device temperature, power, and utilization, refreshing once per second, and writes an `mx-qual-stress.log` file to the current directory for later analysis. When the stress test finishes, it prints a summary of the whole run.

Example commands:

```shell
# Stress test for 2 minutes
mx-qual stress
# Stress test for 60 minutes
mx-qual stress -t 60
# Stress test for 60 minutes at 50% load
mx-qual stress -t 60 -l 50
```

Example output:

```text
device  temp.cur  temp.avg  power.cur  power.avg  util.cur  util.avg
0       68        68        65         66         99        99
1       67        66        65         65         99        100
2       69        68        67         66         99        99
===================================================================================================
Summary
===================================================================================================
device  temp.min  temp.max  temp.avg  power.min  power.max  power.avg  util.min  util.max  util.avg
0       66        68        68        65         68         66         99        99        99
1       65        67        66        63         67         65         99        100       100
2       67        69        68        65         68         66         99        99        99
Test Result = PASS
```

#### compute

```text
Run computing power test
Usage: mx-qual compute [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
```

Runs the computing power test. If no device is specified, all devices are tested; use `-i` to select devices by `index`.

Results are reported per card and include performance figures at each sparsity ratio. Note that devices sharing the same serial number (SN) are treated as one card.

Example command:

```shell
mx-qual compute
```

Example output:

```text
INT8
  1x sparsity:
    SN: 2023243080074, actual: 87.64 TOPS, target: 88.47 TOPS, Utilization: 99.06%
    SN: 2023243080096, actual: 87.64 TOPS, target: 88.47 TOPS, Utilization: 99.06%
  2x sparsity:
    SN: 2023243080074, actual: 173.77 TOPS, target: 176.95 TOPS, Utilization: 98.21%
    SN: 2023243080096, actual: 173.62 TOPS, target: 176.95 TOPS, Utilization: 98.12%
  4x sparsity:
    SN: 2023243080074, actual: 347.35 TOPS, target: 353.89 TOPS, Utilization: 98.15%
    SN: 2023243080096, actual: 347.55 TOPS, target: 353.89 TOPS, Utilization: 98.21%
  8x sparsity:
    SN: 2023243080074, actual: 695.91 TOPS, target: 707.79 TOPS, Utilization: 98.32%
    SN: 2023243080096, actual: 695.91 TOPS, target: 707.79 TOPS, Utilization: 98.32%
  16x sparsity:
    SN: 2023243080074, actual: 1377.36 TOPS, target: 1415.58 TOPS, Utilization: 97.30%
    SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
  32x sparsity:
    SN: 2023243080074, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
    SN: 2023243080096, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
BF16
  1x sparsity:
    SN: 2023243080074, actual: 43.82 TOPS, target: 44.24 TOPS, Utilization: 99.05%
    SN: 2023243080096, actual: 43.82 TOPS, target: 44.24 TOPS, Utilization: 99.05%
  2x sparsity:
    SN: 2023243080074, actual: 86.77 TOPS, target: 88.47 TOPS, Utilization: 98.08%
    SN: 2023243080096, actual: 86.80 TOPS, target: 88.47 TOPS, Utilization: 98.11%
  4x sparsity:
    SN: 2023243080074, actual: 174.13 TOPS, target: 176.95 TOPS, Utilization: 98.41%
    SN: 2023243080096, actual: 174.23 TOPS, target: 176.95 TOPS, Utilization: 98.46%
  8x sparsity:
    SN: 2023243080074, actual: 348.76 TOPS, target: 353.89 TOPS, Utilization: 98.55%
    SN: 2023243080096, actual: 348.76 TOPS, target: 353.89 TOPS, Utilization: 98.55%
  16x sparsity:
    SN: 2023243080074, actual: 698.35 TOPS, target: 707.79 TOPS, Utilization: 98.67%
    SN: 2023243080096, actual: 698.35 TOPS, target: 707.79 TOPS, Utilization: 98.67%
  32x sparsity:
    SN: 2023243080074, actual: 1371.04 TOPS, target: 1415.58 TOPS, Utilization: 96.85%
    SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
```

#### memtest

```text
Run hardware memory test
Usage: mx-qual memtest [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -t,--type UINT:INT in [0 - 10] ...
                              The memory test type you specified (default: 7)
                              type 0 [Walking 1 bit]
                              type 1 [Own address test]
                              type 2 [Moving inversions, ones&zeros]
                              type 3 [Moving inversions, 8 bit pat]
                              type 4 [Moving inversions, random pattern]
                              type 5 [Block move, 64 moves]
                              type 6 [Moving inversions, 32 bit pat]
                              type 7 [Random number sequence]
                              type 8 [Modulo 20, random pattern]
                              type 9 [Bit fade test]
                              type 10 [Memory stress test]
  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)
```

This subcommand tests the stability and reliability of hardware memory; its implementation is modeled on `MemTest86`.

Use `-i` to select devices by `index`; all devices are tested by default. Use `-t` to select the test type. Multiple types may be given, each in the range 0 to 10; the default is 7. See the `MemTest86` documentation for what each type means. Use `-l` to set the number of test loops (range 1 to 100, default 1).

Example commands:

```shell
mx-qual memtest
mx-qual memtest -i 0 -t 0
```

Example output:

```text
Device[5] running test 4 / 4 ... passed
Device[4] running test 4 / 4 ... passed
Device[3] running test 4 / 4 ... passed
Device[2] running test 4 / 4 ... passed
Device[0] running test 4 / 4 ... passed
Device[1] running test 4 / 4 ... passed
Device[6] running test 4 / 4 ... passed
Test Result = PASS
```
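
Several example outputs in this manual end with a `Result = PASS` or `Test Result = PASS` line, and the bandwidth tests print `Bandwidth: <value> GB/s` fields. This manual does not document the exit codes of `mx-qual`, so a script that gates on test results may need to inspect the printed text instead. The following is a minimal sketch under the assumption that real output matches the examples shown here. The function names `qual_passed` and `min_bandwidth` are hypothetical helpers and are not part of `mx-qual`.

```shell
#!/bin/sh
# Hypothetical helpers for scripting around mx-qual output; not part of mx-qual.
# Assumption: output follows the formats shown in the examples of this manual.

# Succeeds (exit status 0) if stdin contains a result line reporting PASS,
# i.e. "Result = PASS" or "Test Result = PASS".
qual_passed() {
    grep -Eq '^[[:space:]]*(Test )?Result = PASS[[:space:]]*$'
}

# Prints the smallest "Bandwidth: <value> GB/s" figure found on stdin,
# or prints nothing if no such field is present.
min_bandwidth() {
    awk '
        match($0, /Bandwidth: [0-9.]+ GB\/s/) {
            # Drop the "Bandwidth: " prefix (11 chars) and " GB/s" suffix (5 chars).
            v = substr($0, RSTART + 11, RLENGTH - 16) + 0
            if (!seen || v < min) { min = v; seen = 1 }
        }
        END { if (seen) printf "%.3f\n", min }
    '
}

# Example usage (needs mx-qual and a device, so it is left commented out):
#   out=$(mx-qual pcie_bandwidth -i 0)
#   printf '%s\n' "$out" | qual_passed || echo "pcie_bandwidth did not report PASS" >&2
#   printf '%s\n' "$out" | min_bandwidth
```

Parsing text is more fragile than checking an exit status, so if your version of `mx-qual` documents its exit codes, prefer those.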