mx-qual User Manual

Overview

This document describes how to use mx-qual (MOFFETT QUALIFICATION). mx-qual is a device quality testing tool built on the SOLA Runtime API. It checks device availability, stability, and performance.

Usage

mx-qual is a command-line tool. Run mx-qual -h to print its help message.

Moffett Quality Inspection Application  v1.3.0
Usage: mx-qual [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit
  --version                   Display program version information and exit

Subcommands:
  list                        List all devices detected on the system
  hardware_link               Run hardware link test
  pcie_bandwidth              Run PCIe bandwidth test
  memory_bandwidth            Run memory bandwidth test
  p2p                         Run peer to peer test
  compute                     Run computing power test
  stress                      Run stress test
  memtest                     Run hardware memory test

mx-qual runs each test through a subcommand. Run mx-qual <sub_command> -h to see how to use a given subcommand.

Subcommand Reference

list

List devices
Usage: mx-qual list [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}

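Lists information for the specified devices. If no device is specified, all devices are listed.

Example commands:

# List all devices
mx-qual list
# List specific devices
mx-qual list -i 0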
mx-qual list -i 0 1 2
mx-qual list -i {0,1,2}

Example output:

Device 0: "01S30-00A"
  Serial number:      2023243080074
  PCI Bus ID:         0000:8a:00.0
  Runtime version:    3.4.0
  Driver version:     3.4.0
  Firmware version:   1.0.14
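
When the device list feeds into other tooling, it can help to turn this text into structured data. The following is a minimal Python sketch that assumes the output keeps exactly the layout shown above. The helper names `parse_list_output` and `group_by_serial` are hypothetical and are not part of mx-qual. The grouping step relies on the rule stated in the compute section that devices sharing an SN belong to one card.

```python
import re
from collections import defaultdict

# Sample text copied from the output example above.
SAMPLE = '''Device 0: "01S30-00A"
  Serial number:      2023243080074
  PCI Bus ID:         0000:8a:00.0
  Runtime version:    3.4.0
  Driver version:     3.4.0
  Firmware version:   1.0.14
'''

def parse_list_output(text):
    """Return one dict per 'Device N: "model"' block (assumed layout)."""
    devices = []
    for line in text.splitlines():
        header = re.match(r'Device (\d+): "(.*)"', line)
        if header:
            devices.append({"index": int(header.group(1)), "model": header.group(2)})
        elif ":" in line and devices:
            # Split on the first colon only, so PCI bus IDs stay intact.
            key, value = line.split(":", 1)
            devices[-1][key.strip()] = value.strip()
    return devices

def group_by_serial(devices):
    """Map each serial number to the device indexes that report it."""
    cards = defaultdict(list)
    for dev in devices:
        cards[dev["Serial number"]].append(dev["index"])
    return dict(cards)

devices = parse_list_output(SAMPLE)
print(devices[0]["PCI Bus ID"])
print(group_by_serial(devices))
```

In practice the text would come from capturing the standard output of `mx-qual list` rather than from a hard-coded sample.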

pcie_bandwidth

Usage: mx-qual pcie_bandwidth [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -s,--sn TEXT                The device sn you specified.
                              
  -d,--data_size UINT:INT in [32 - 100]
                              The transfer size (MB) you specified. (default 100MB)
                              
  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)
                              
  -f,--full_duplex            Enable full duplex mode

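Runs the PCIe bandwidth test. If no device is specified, all devices are tested. Use -i to select devices by index, or -s to select a device by serial number (SN). -s takes priority over -i: if both are given, only the device specified by -s is tested.

Use -d to set the transfer size in MB (default: 100 MB) and -l to set the number of test loops (default: 1). The test runs in half-duplex mode by default; pass -f to enable full-duplex mode.

Example commands: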

mx-qual pcie_bandwidth
mx-qual pcie_bandwidth -i 0
mx-qual pcie_bandwidth -i 0 1 2
mx-qual pcie_bandwidth -s 2023243080096
mx-qual pcie_bandwidth --sn=2023243080096
mx-qual pcie_bandwidth -i 0 -d 32 -l 10
mx-qual pcie_bandwidth -f -l 10
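
For scripted runs, the option rules above can be checked before the tool is launched. The sketch below is one possible way to do that in Python. The function `pcie_bandwidth_args` is a hypothetical helper, not part of mx-qual; the ranges are taken from the help text, and dropping -i when an SN is given simply mirrors the documented priority of -s.

```python
def pcie_bandwidth_args(indexes=None, sn=None, data_size_mb=100, loops=1, full_duplex=False):
    """Build an argument list for `mx-qual pcie_bandwidth` (hypothetical helper)."""
    if not 32 <= data_size_mb <= 100:
        raise ValueError("data_size must be within [32, 100] MB")
    if not 1 <= loops <= 100:
        raise ValueError("loop must be within [1, 100]")
    args = ["mx-qual", "pcie_bandwidth"]
    if sn is not None:
        # -s takes priority over -i, so -i is omitted when both are given.
        args += ["-s", str(sn)]
    elif indexes:
        if any(not 0 <= i <= 31 for i in indexes):
            raise ValueError("index must be within [0, 31]")
        args += ["-i", *map(str, indexes)]
    args += ["-d", str(data_size_mb), "-l", str(loops)]
    if full_duplex:
        args.append("-f")
    return args

print(" ".join(pcie_bandwidth_args(indexes=[0], data_size_mb=32, loops=10)))
```

The returned list can be passed to `subprocess.run` on a machine where mx-qual is installed.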

Example output:

PCIe Bandwidth Test
 Device id: [0]

 Host to Device Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 11.947 GB/s

 Device to Host Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 12.139 GB/s

Result = PASS
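
To compare runs or feed results into a report, the bandwidth figures can be extracted from this text. The Python sketch below assumes the layout shown above: a title line ending in "Bandwidth", followed by a line carrying the value in GB/s, and a final "Result =" line. `parse_bandwidth_report` is a hypothetical helper name. The memory_bandwidth report uses the same layout, so the same function should apply to it under that assumption.

```python
import re

# Sample text copied from the output example above.
SAMPLE = """PCIe Bandwidth Test
 Device id: [0]

 Host to Device Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 11.947 GB/s

 Device to Host Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 12.139 GB/s

Result = PASS
"""

def parse_bandwidth_report(text):
    """Return ({section title: GB/s}, passed) for one report (assumed layout)."""
    bandwidths = {}
    title = None
    passed = False
    for raw in text.splitlines():
        line = " ".join(raw.split())  # trim and collapse repeated spaces
        value = re.search(r"Bandwidth: ([\d.]+) GB/s", line)
        if value and title:
            bandwidths[title] = float(value.group(1))
        elif line.endswith("Bandwidth"):
            title = line
        elif line.startswith("Result ="):
            passed = line.split("=", 1)[1].strip() == "PASS"
    return bandwidths, passed

bandwidths, passed = parse_bandwidth_report(SAMPLE)
print(bandwidths, passed)
```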

memory_bandwidth

Run memory bandwidth test
Usage: mx-qual memory_bandwidth [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -s,--sn TEXT                The device sn you specified.
                              
  -d,--data_size UINT:INT in [32 - 100]
                              The transfer size (MB) you specified. (default 100MB)
                              
  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)

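Runs the device memory bandwidth test. If no device is specified, all devices are tested. Use -i to select devices by index, or -s to select a device by serial number (SN). -s takes priority over -i: if both are given, only the device specified by -s is tested.

Use -d to set the transfer size in MB (default: 100 MB) and -l to set the number of test loops (default: 1).

Example commands: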

mx-qual memory_bandwidth
mx-qual memory_bandwidth -i 0
mx-qual memory_bandwidth -i 0 1 2
mx-qual memory_bandwidth -s 2023243080096
mx-qual memory_bandwidth --sn=2023243080096
mx-qual memory_bandwidth -i 0 -d 32 -l 10

Example output:

Memory Bandwidth Test
 Device id: [0]

 Memory Read  Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 50.187 GB/s

 Memory Write Bandwidth
 Transfer Size: 100.000 MB, Bandwidth: 50.053 GB/s

Result = PASS

p2p

Run peer to peer test
Usage: mx-qual p2p [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -d,--device UINT:INT in [0 - 31] ...
                              Device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {0,1,2}
  -c,--card UINT:INT in [1 - 32] ...
                              Card index you specified. Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2}

Runs the P2P bandwidth test. Use --device to select devices or --card to select cards. If neither is given, all devices are tested.

Example commands:

mx-qual p2p
mx-qual p2p -d 0 1 2
mx-qual p2p -c 1 2

Example output:

By device:

P2P Connectivity Matrix
    D/D      0      1      2
    0        0      1      1
    1        1      0      1
    2        1      1      0

Unidirectional P2P Bandwidth Matrix (GB/s)
    D/D      0      1      2
    0     0.00  10.97  10.95
    1    10.93   0.00  10.94
    2    10.94  10.95   0.00

Bidirectional P2P Bandwidth Matrix (GB/s)
    D/D      0      1      2
    0     0.00  21.96  21.89
    1    21.96   0.00  21.89
    2    21.89  21.89   0.00

P2P Latency Matrix (ms)
    SPU      0      1      2
    0     0.00   3.07   3.06
    1     3.09   0.00   3.06
    2     3.06   3.06   0.00

    CPU      0      1      2
    0     0.00   8.63   8.60
    1     8.63   0.00   8.60
    2     8.71   8.63   0.00

Test Result = PASS

By card:

P2P Connectivity Matrix
    C/C      1      2
    1        0      1
    2        1      0

Unidirectional P2P Bandwidth Matrix (GB/s)
    C/C      1      2
    1     0.00  14.12
    2    14.10   0.00

Bidirectional P2P Bandwidth Matrix (GB/s)
    C/C      1      2
    1     0.00  28.48
    2    28.48   0.00

P2P Latency Matrix (ms)
    SPU      1      2
    1     0.00   7.13
    2     7.14   0.00

    CPU      1      2
    1     0.00  21.50
    2    21.48   0.00

Test Result = PASS
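
The matrices above are easier to inspect programmatically once they are parsed into a lookup table. The following Python sketch reads one matrix block and reports the off-diagonal pair with the lowest value. It assumes the block layout shown above (a title line, a header row, then one row per device or card). `parse_matrix` and `slowest_link` are hypothetical helper names, and the sketch makes no claim about which axis is the sender.

```python
# One matrix block copied from the device-level output example above.
SAMPLE = """Unidirectional P2P Bandwidth Matrix (GB/s)
    D/D      0      1      2
    0     0.00  10.97  10.95
    1    10.93   0.00  10.94
    2    10.94  10.95   0.00
"""

def parse_matrix(block):
    """Parse one matrix block into {(row_id, col_id): value}."""
    rows = [ln.split() for ln in block.strip().splitlines()[1:]]  # skip the title line
    col_ids = [int(c) for c in rows[0][1:]]  # header row: label, then column ids
    matrix = {}
    for row in rows[1:]:
        row_id = int(row[0])
        for col_id, cell in zip(col_ids, row[1:]):
            matrix[(row_id, col_id)] = float(cell)
    return matrix

def slowest_link(matrix):
    """Return the off-diagonal (row_id, col_id) pair with the lowest value."""
    links = {pair: v for pair, v in matrix.items() if pair[0] != pair[1]}
    return min(links, key=links.get)

uni = parse_matrix(SAMPLE)
pair = slowest_link(uni)
print(pair, uni[pair])
```

The same parser should handle the card-level blocks, since they differ only in the header label and the ids.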

stress

Usage: mx-qual stress [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -t,--time INT:INT in [2 - 100000]
                              Number of minutes consumed in a single stress test. (default: 2)
                              The deviation is subject to the influence of the machine.
  -l,--load INT:{0,50,100}    Pressure test load, such as 0%, 50%, 100% (default: 100%)

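Runs the stress test, which loads the device's memory and compute units. Use -i to select devices by index; all devices are tested by default.

Use -t to set the test duration in minutes (default: 2). For example, pass -t 60 to run for one hour.

Use -l to set the memory and compute load. Three values are accepted: 0, 50, and 100 (default: 100%).

During the first minute the device warms up and its status is not monitored. After that, the tool monitors each device's temperature, power, and utilization, refreshing once per second, and writes an mx-qual-stress.log file to the current directory for later analysis. When the stress test finishes, it prints a summary of the whole run.

Example commands: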

# Stress test for 2 minutes
mx-qual stress
# Stress test for 60 minutes
mx-qual stress -t 60
# Stress test for 60 minutes at 50% load
mx-qual stress -t 60 -l 50
mx-qual stress -t 60 -l 50
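
Because -t is given in minutes and the first minute is a warm-up, a little arithmetic helps when planning a run. The Python sketch below converts a duration to the -t value and estimates how long the monitored phase lasts. Both function names are hypothetical, and the result is nominal only, since the help text notes that the actual duration can deviate depending on the machine.

```python
def stress_minutes(hours=0, minutes=0):
    """Convert a whole-number duration to the value passed to `mx-qual stress -t`."""
    total = hours * 60 + minutes
    if not 2 <= total <= 100000:
        raise ValueError("-t must be within [2, 100000] minutes")
    return total

def monitored_seconds(total_minutes, warmup_minutes=1):
    """Nominal monitored time: the run minus the one-minute warm-up, in seconds."""
    return (total_minutes - warmup_minutes) * 60

t = stress_minutes(hours=1)
print(f"mx-qual stress -t {t}")
print(monitored_seconds(t))
```

With a one-second refresh interval, the monitored time in seconds is also a rough estimate of how many status samples a run produces.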

Example output:

device  temp.cur  temp.avg  power.cur  power.avg  util.cur  util.avg    
0       68        68        65         66         99        99        
1       67        66        65         65         99        100       
2       69        68        67         66         99        99        


===================================================================================================
Summary
===================================================================================================
device  temp.min  temp.max  temp.avg  power.min  power.max  power.avg  util.min  util.max  util.avg
0       66        68        68        65         68         66         99        99        99       
1       65        67        66        63         67         65         99        100       100      
2       67        69        68        65         68         66         99        99        99       

Test Result = PASS
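
The summary table can be loaded into a per-device dictionary for threshold checks. The Python sketch below assumes the whitespace-separated layout shown above and parses only text captured from the console; it does not assume anything about the format of mx-qual-stress.log. `parse_summary` is a hypothetical helper name.

```python
# Summary table copied from the output example above.
SAMPLE = """device  temp.min  temp.max  temp.avg  power.min  power.max  power.avg  util.min  util.max  util.avg
0       66        68        68        65         68         66         99        99        99
1       65        67        66        63         67         65         99        100       100
2       67        69        68        65         68         66         99        99        99
"""

def parse_summary(text):
    """Return {device index: {column name: value}} from the summary table."""
    rows = [ln.split() for ln in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    return {int(r[0]): dict(zip(header[1:], map(float, r[1:]))) for r in body}

summary = parse_summary(SAMPLE)
# Device with the highest peak temperature in this run.
hottest = max(summary, key=lambda dev: summary[dev]["temp.max"])
print(hottest, summary[hottest]["temp.max"])
```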

compute

Run computing power test
Usage: mx-qual compute [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}

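Runs the compute performance test. If no device is specified, all devices are tested; use -i to select devices by index. Results are reported per card and include performance figures at each sparsity ratio. Note that devices sharing the same serial number (SN) are treated as one card.

Example commands: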

mx-qual compute

Example output:

INT8
  1x sparsity:
    SN: 2023243080074, actual:   87.64 TOPS, target:   88.47 TOPS, Utilization: 99.06%
    SN: 2023243080096, actual:   87.64 TOPS, target:   88.47 TOPS, Utilization: 99.06%
  2x sparsity:
    SN: 2023243080074, actual:  173.77 TOPS, target:  176.95 TOPS, Utilization: 98.21%
    SN: 2023243080096, actual:  173.62 TOPS, target:  176.95 TOPS, Utilization: 98.12%
  4x sparsity:
    SN: 2023243080074, actual:  347.35 TOPS, target:  353.89 TOPS, Utilization: 98.15%
    SN: 2023243080096, actual:  347.55 TOPS, target:  353.89 TOPS, Utilization: 98.21%
  8x sparsity:
    SN: 2023243080074, actual:  695.91 TOPS, target:  707.79 TOPS, Utilization: 98.32%
    SN: 2023243080096, actual:  695.91 TOPS, target:  707.79 TOPS, Utilization: 98.32%
  16x sparsity:
    SN: 2023243080074, actual: 1377.36 TOPS, target: 1415.58 TOPS, Utilization: 97.30%
    SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
  32x sparsity:
    SN: 2023243080074, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
    SN: 2023243080096, actual: 2662.64 TOPS, target: 2831.16 TOPS, Utilization: 94.05%
BF16
  1x sparsity:
    SN: 2023243080074, actual:   43.82 TOPS, target:   44.24 TOPS, Utilization: 99.05%
    SN: 2023243080096, actual:   43.82 TOPS, target:   44.24 TOPS, Utilization: 99.05%
  2x sparsity:
    SN: 2023243080074, actual:   86.77 TOPS, target:   88.47 TOPS, Utilization: 98.08%
    SN: 2023243080096, actual:   86.80 TOPS, target:   88.47 TOPS, Utilization: 98.11%
  4x sparsity:
    SN: 2023243080074, actual:  174.13 TOPS, target:  176.95 TOPS, Utilization: 98.41%
    SN: 2023243080096, actual:  174.23 TOPS, target:  176.95 TOPS, Utilization: 98.46%
  8x sparsity:
    SN: 2023243080074, actual:  348.76 TOPS, target:  353.89 TOPS, Utilization: 98.55%
    SN: 2023243080096, actual:  348.76 TOPS, target:  353.89 TOPS, Utilization: 98.55%
  16x sparsity:
    SN: 2023243080074, actual:  698.35 TOPS, target:  707.79 TOPS, Utilization: 98.67%
    SN: 2023243080096, actual:  698.35 TOPS, target:  707.79 TOPS, Utilization: 98.67%
  32x sparsity:
    SN: 2023243080074, actual: 1371.04 TOPS, target: 1415.58 TOPS, Utilization: 96.85%
    SN: 2023243080096, actual: 1374.20 TOPS, target: 1415.58 TOPS, Utilization: 97.08%
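
In the sample above, the Utilization column matches the ratio of actual to target throughput. The Python sketch below parses the result lines and recomputes that ratio. It assumes the line format shown above, and `parse_compute` is a hypothetical helper name. Recomputed values can differ from the printed ones in the last digit, presumably because the printed figures are rounded, so the comparison uses a small tolerance.

```python
import re

# A few lines copied from the output example above.
SAMPLE = """INT8
  1x sparsity:
    SN: 2023243080074, actual:   87.64 TOPS, target:   88.47 TOPS, Utilization: 99.06%
  2x sparsity:
    SN: 2023243080074, actual:  173.77 TOPS, target:  176.95 TOPS, Utilization: 98.21%
"""

LINE = re.compile(
    r"SN: (\S+), actual:\s*([\d.]+) TOPS, target:\s*([\d.]+) TOPS, Utilization: ([\d.]+)%"
)

def parse_compute(text):
    """Return one dict per result line, tagged with data type and sparsity."""
    results = []
    dtype = sparsity = None
    for raw in text.splitlines():
        line = raw.strip()
        m = LINE.match(line)
        if m:
            sn, actual, target, util = m.groups()
            results.append({"dtype": dtype, "sparsity": sparsity, "sn": sn,
                            "actual": float(actual), "target": float(target),
                            "utilization": float(util)})
        elif line.endswith("sparsity:"):
            sparsity = int(line.split("x", 1)[0])  # "2x sparsity:" -> 2
        elif line:
            dtype = line  # a bare line such as INT8 or BF16

    return results

results = parse_compute(SAMPLE)
for r in results:
    recomputed = 100 * r["actual"] / r["target"]
    # Allow for rounding in the printed figures.
    print(r["dtype"], r["sparsity"], abs(recomputed - r["utilization"]) < 0.05)
```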

memtest

Run hardware memory test
Usage: mx-qual memtest [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -i,--index UINT:INT in [0 - 31] ...
                              The device index you specified (default: all). Separate values with spaces.
                              Or give a list of elements, separated by commas and enclosed in curly brackets e.g. {1,2,3}
  -t,--type UINT:INT in [0 - 10] ...
                              The memory test type you specified (default: 7)
                                type 0          [Walking 1 bit]
                                type 1          [Own address test]
                                type 2          [Moving inversions, ones&zeros]
                                type 3          [Moving inversions, 8 bit pat]
                                type 4          [Moving inversions, random pattern]
                                type 5          [Block move, 64 moves]
                                type 6          [Moving inversions, 32 bit pat]
                                type 7          [Random number sequence]
                                type 8          [Modulo 20, random pattern]
                                type 9          [Bit fade test]
                                type 10         [Memory stress test]

  -l,--loop INT:INT in [1 - 100]
                              The number of test loop (default: 1)

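This test checks the stability and reliability of the hardware memory. Its implementation is modeled on MemTest86.

Use -i to select devices by index; all devices are tested by default. Use -t to select the test type; multiple types may be given. Valid types are 0 to 10 (default: 7); see the MemTest86 documentation for what each type means. Use -l to set the number of test loops (default: 1).

Example commands: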

mx-qual memtest
mx-qual memtest -i 0 -t 0
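
When several test types are combined in a script, a lookup table keeps the numbers readable. The Python sketch below copies the type names from the help text and builds an argument list. `memtest_args` is a hypothetical helper, and passing multiple -t values separated by spaces is an assumption based on the convention documented for -i.

```python
# Type names copied from the memtest help text.
MEMTEST_TYPES = {
    0: "Walking 1 bit",
    1: "Own address test",
    2: "Moving inversions, ones&zeros",
    3: "Moving inversions, 8 bit pat",
    4: "Moving inversions, random pattern",
    5: "Block move, 64 moves",
    6: "Moving inversions, 32 bit pat",
    7: "Random number sequence",
    8: "Modulo 20, random pattern",
    9: "Bit fade test",
    10: "Memory stress test",
}

def memtest_args(types=(7,), indexes=None, loops=1):
    """Build an argument list for `mx-qual memtest` (hypothetical helper)."""
    unknown = [t for t in types if t not in MEMTEST_TYPES]
    if unknown:
        raise ValueError(f"unknown test types: {unknown}")
    if not 1 <= loops <= 100:
        raise ValueError("loop must be within [1, 100]")
    args = ["mx-qual", "memtest"]
    if indexes:
        args += ["-i", *map(str, indexes)]
    # Assumption: multiple types are space-separated, like -i values.
    args += ["-t", *map(str, types), "-l", str(loops)]
    return args

print(" ".join(memtest_args(types=[0], indexes=[0])))
print([MEMTEST_TYPES[t] for t in (0, 7)])
```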

Example output:

Device[5] running test 4 / 4 ... passed
Device[4] running test 4 / 4 ... passed
Device[3] running test 4 / 4 ... passed
Device[2] running test 4 / 4 ... passed
Device[0] running test 4 / 4 ... passed
Device[1] running test 4 / 4 ... passed
Device[6] running test 4 / 4 ... passed

Test Result = PASS