FAQ

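The system cannot recognize the device

Symptom

When you run the mx-mft status command to query the device status, the SN and PN of the device are reported as unknown. For example: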

Get device 0 info fail: ERROR, query device 0 version fail

Device 0 info:
    SN: unknown
    PN: unknown
    PCI bus: 0000:34:00.0
    Mode: Bootloader

Procedure

  1. Recover the device by using the rescan function.

    sudo /usr/local/sola/driver/mx-utils pci-rescan all
    
  2. Load the device firmware.

    mx-mft boot all /usr/local/sola/driver/firmware/moffett-antoum-<version>
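
After both steps, it can help to confirm that no device is still stuck in the state shown in the symptom. The following is a minimal sketch, not part of the product tooling: the function name unrecognized_devices is hypothetical, and it assumes that every device block in the mx-mft status output starts with "Device <N> info:" and contains "SN:" and "Mode:" lines, as in the sample output above.

```shell
# Sketch: list devices that the status output reports as unrecognized.
# Assumption: each device block begins with "Device <N> info:" and carries
# "SN:" and "Mode:" lines, as in the sample output in this section.

# Reads status text on stdin and prints the index of every device whose SN is
# "unknown" or whose Mode is "Bootloader", one index per line.
unrecognized_devices() {
    awk '
        /^Device [0-9]+ info:/ { dev = $2; next }
        dev != "" && (($1 == "SN:" && $2 == "unknown") || ($1 == "Mode:" && $2 == "Bootloader")) {
            # Report each device only once, even if both conditions match.
            if (!(dev in seen)) { seen[dev] = 1; print dev }
        }
    '
}

# Example (requires the real tool to be installed):
#   mx-mft status 2>&1 | unrecognized_devices
```

If the function prints nothing, none of the listed devices matched either condition.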
    

Uninstalling SOLA Toolkit fails because the driver module is in use

Symptom

rmmod: ERROR: Module moffett is in use
Error, uninstall moffett driver fail
ERROR: An MOFFETT kernel module 'moffett' appears to already be loaded in your kernel. This may be because it is currently in use, for example, by a SOLA program, or the MOFFETT Persistence Daemon (such as moffett-device-plugin, mx-hostengine or dcsm-exporter), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the SPU(s) before attempting to upgrade your driver. If no SPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an MOFFETT kernel module's usage count, for which the simplest remedy is to reboot your computer.

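Procedure

  1. Check whether any process is using the device, and terminate any such process.

    mx-smi list          # List devices and process information
    sudo lsof | grep mf- # Find running processes related to mf-
    sudo kill -9 <PID>   # Terminate the corresponding process
    
  2. Check how many moffett kernel modules are loaded. If the number is not 0, remove the module.

    lsmod | grep moffett # Check the loaded moffett kernel modules
    sudo rmmod moffett   # Remove the loaded moffett kernel module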
    
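  3. Check whether any inference service, DCSM, DCSM Exporter, or Device Plugin is running. If so, stop the services and disable the components. For the specific steps, see the documentation for the corresponding service or component.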
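
As a rough aid for step 2, the third column of lsmod output ("Used by") shows how many users still hold a module, and rmmod refuses to unload a module while that count is not 0. The sketch below reads that count; the function name module_use_count is hypothetical and not part of SOLA Toolkit.

```shell
# Sketch: read the "Used by" count of a kernel module from lsmod output.
# lsmod prints three columns: Module, Size, Used by.

# Reads lsmod output on stdin and prints the use count of the named module.
# Prints nothing and returns 1 if the module is not loaded.
module_use_count() {
    awk -v mod="$1" '$1 == mod { print $3; found = 1 } END { exit !found }'
}

# Example:
#   lsmod | module_use_count moffett
```

A count above 0 suggests that a process or another module still depends on the driver, so steps 1 and 3 are worth repeating before trying rmmod again.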

An error occurs when running a workload with MOFFETT Container Toolkit

Symptom

docker run --rm -ti --device moffett.ai/spu=all ubuntu:22.04 mx-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: open /usr/local/sola-3.8.1.2/lib/libsola.so: no such file or directory: unknown.

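Cause

This error occurs after the device or the SOLA Toolkit version changes. MOFFETT Container Toolkit was installed and configured before the device or driver library version changed, so it still uses the old static configuration.

Procedure

Whenever the device nodes available to containers or the SOLA Toolkit version change, run the following command to regenerate the CDI configuration file: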

sudo moffett-ctk cdi generate --output=/etc/cdi/moffett.yaml
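
A stale CDI file can often be spotted before a container is started, because it refers to host paths that no longer exist (such as the libsola.so path in the error above). The following is a minimal sketch under two assumptions: the generated file follows the standard CDI layout, with one "hostPath: <path>" entry per line, and the function name missing_cdi_host_paths is hypothetical.

```shell
# Sketch: report host paths in a CDI file that are missing on the host.
# Assumption: the file uses the standard CDI "hostPath: <path>" key,
# one entry per line, optionally quoted.

# Prints every hostPath value in the given file that does not exist.
missing_cdi_host_paths() {
    sed -n '/hostPath:/{s/^[[:space:]-]*hostPath:[[:space:]]*//;s/[[:space:]]*$//;p;}' "$1" |
        tr -d "\"'" |
        while IFS= read -r path; do
            [ -e "$path" ] || printf '%s\n' "$path"
        done
}

# Example:
#   missing_cdi_host_paths /etc/cdi/moffett.yaml
```

Any path printed here points to a file that the container runtime would fail to mount, which is a hint that the CDI configuration needs to be regenerated.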

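On KylinOS, the MOFFETT Container Toolkit verification example fails and the device cannot be loaded

Symptom

When you run the following docker command to verify that MOFFETT Container Toolkit mounts the device and prints the command output, an error is reported and the device cannot be loaded.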

docker run --rm --device moffett.ai/spu=all ubuntu:22.04 mx-smi list
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: container_linux.go:318: starting container process caused "process_linux.go:378: container init caused \"rootfs_linux.go:61: mounting \\\"/usr/bin/mx-qual\\\" to rootfs \\\"/var/lib/docker/overlay2/61383eb2d14eb420a7c845f533b192d308c97999b1c30bd058bc0f14f61931fc/merged\\\" at \\\"/var/lib/docker/overlay2/61383eb2d14eb420a7c845f533b192d308c97999b1c30bd058bc0f14f61931fc/merged/usr/bin/mx-qual\\\" caused \\\"not a directory\\\"\"": unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type.

Cause

The Podman that ships with KylinOS conflicts with the Docker engine.

Procedure

Uninstall Podman.

yum remove podman
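
Before removing the package, it may be worth confirming that Podman is actually present on the host. The sketch below only checks whether a command is on PATH; the helper name is_installed is hypothetical.

```shell
# Sketch: check whether a command is available on PATH.

# Returns 0 if the named command is found, non-zero otherwise.
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

# Example:
#   if is_installed podman; then echo "podman is installed"; fi
```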