
NVIDIA GPU SMI SSH

Pack Assets

Monitored Objects

The NVIDIA GPU Pack collects metrics for:

  • Gpu-stats

Collected Metrics

| Metric name | Description | Unit |
|---|---|---|
| devices.gpu.total.count | Number of GPU devices | |
| product_name:id#device.gpu.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which one or more kernels was executing on the GPU | % |
| product_name:id#device.gpu.memory.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which global (device) memory was being read or written | % |
| product_name:id#device.gpu.encoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video encoder was being used | % |
| product_name:id#device.gpu.decoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video decoder was being used | % |
| product_name:id#device.gpu.frame_buffer.memory.usage.bytes | On-board frame buffer memory used | B |
| product_name:id#device.gpu.frame_buffer.memory.free.bytes | On-board frame buffer memory available | B |
| product_name:id#device.gpu.frame_buffer.memory.usage.percentage | On-board frame buffer memory used, as a percentage | % |
| product_name:id#device.gpu.bar1.memory.usage.bytes | BAR1 memory used | B |
| product_name:id#device.gpu.bar1.memory.free.bytes | BAR1 memory available | B |
| product_name:id#device.gpu.bar1.memory.usage.percentage | BAR1 memory used, as a percentage | % |
| product_name:id#device.gpu.fan.speed.percentage | Fan speed | % |
| product_name:id#device.gpu.temperature.celsius | GPU temperature | C |
| product_name:id#device.gpu.power.consumption.watt | The last measured power draw for the entire board | W |
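
The usage, free and percentage metrics for a memory area are related by simple arithmetic. The following is a minimal Python sketch, assuming the percentage is computed as used / (used + free), which is consistent with the figures in the sample output further down this page. The function name `memory_usage_percent` is hypothetical and is not part of the plugin.

```python
def memory_usage_percent(used_bytes: int, free_bytes: int) -> float:
    """Return used memory as a percentage of (used + free), rounded to 2 decimals.

    Assumption: total = used + free, as the sample output suggests.
    """
    total = used_bytes + free_bytes
    if total == 0:
        return 0.0
    return round(100.0 * used_bytes / total, 2)


# Frame buffer figures of the first Quadro K6000 in the sample output.
print(memory_usage_percent(1349517312, 11449401344))  # 10.54
# BAR1 figures of the same device.
print(memory_usage_percent(13631488, 254803968))      # 5.08
```

Both results match the `usage.percentage` values reported for that device in the sample output.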

Prerequisites

The centreon-engine user opens an SSH connection to a user account on the remote system. That remote user must have sufficient privileges to run the nvidia-smi command.

Setup

  1. Install the Centreon Plugin on every Poller:
yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh
  2. On the Centreon Web interface, go to Configuration > Plugin packs > Manager and install the NVIDIA GPU SMI SSH Pack.

Host configuration

  • Add a new Host and apply the HW-Device-Nvidia-Gpu-Smi-SSH-custom Host Template

Once the template is applied, some Macros have to be configured. Three SSH backends are available to connect to the remote server: sshcli, plink and libssh. They are detailed below.

| Mandatory | Name | Description |
|---|---|---|
| X | SSHBACKEND | Name of the backend: sshcli |
| X | SSHUSERNAME | By default, the user running the centengine process on your Poller |
| | SSHPASSWORD | Cannot be used with this backend. Only SSH key authentication is supported |
| | SSHPORT | By default: 22 |
| | SSHEXTRAOPTIONS | Customize it with your own options if needed. E.g.: --ssh-priv-key=/user/.ssh/id_rsa |

Warning: With this backend, you have to validate the target server fingerprint manually (as the user set in SSHUSERNAME).
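
To make the role of each Macro concrete, the following Python sketch assembles a plugin command line from Macro values. It is illustrative only: the mapping from Macro names to `--ssh-*` options is an assumption based on the option names used in the test command on this page, not the actual command definition shipped with the Pack, and `build_command` is a hypothetical helper.

```python
import shlex

PLUGIN = "/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl"


def build_command(hostname: str, macros: dict) -> list:
    """Assemble an argv list from host Macros.

    Assumption: each Macro maps to the like-named --ssh-* option.
    """
    argv = [
        PLUGIN,
        "--plugin=hardware::devices::nvidia::gpu::smi::plugin",
        "--mode=stats",
        f"--hostname={hostname}",
        f"--ssh-backend={macros.get('SSHBACKEND', 'sshcli')}",
        f"--ssh-port={macros.get('SSHPORT', '22')}",
    ]
    # SSHUSERNAME is optional here: when unset, the SSH client falls back
    # to the user running the check.
    if macros.get("SSHUSERNAME"):
        argv.append(f"--ssh-username={macros['SSHUSERNAME']}")
    # Extra options are split the way a shell would split them.
    argv += shlex.split(macros.get("SSHEXTRAOPTIONS", ""))
    return argv


command = build_command(
    "10.30.2.81",
    {"SSHUSERNAME": "centreon", "SSHEXTRAOPTIONS": "--ssh-priv-key=/user/.ssh/id_rsa"},
)
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the command as a list rather than a single string avoids quoting problems if a Macro value contains spaces.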

How to test the Plugin and what are the main options?

Once the Plugin is installed, log into your Poller using the centreon-engine user account and test it by running the following command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--hostname=10.30.2.81 \
--ssh-username=centreon \
--ssh-password='centreon-password' \
--ssh-backend=libssh \
--verbose

Expected command output is shown below:

OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; 'Quadro K6000:00000000:08:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.bytes'=1349517312B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.free.bytes'=11449401344B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.percentage'=10.54%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.bytes'=13631488B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.free.bytes'=254803968B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.percentage'=5.08%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:08:00.0#device.gpu.power.consumption.watt'=24.16W;;;0; 'Quadro K6000:00000000:84:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.bytes'=732954624B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.free.bytes'=12065964032B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.percentage'=5.73%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.bytes'=5242880B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.free.bytes'=263192576B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.percentage'=1.95%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:84:00.0#device.gpu.power.consumption.watt'=23.86W;;;0;
checking device gpu 'Quadro K6000:00000000:08:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 1.26 GB (10.54%) free: 10.66 GB (89.46%)
bar1 memory usage total: 256.00 MB used: 13.00 MB (5.08%) free: 243.00 MB (94.92%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 24.16 W
checking device gpu 'Quadro K6000:00000000:84:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 699.00 MB (5.73%) free: 11.24 GB (94.27%)
bar1 memory usage total: 256.00 MB used: 5.00 MB (1.95%) free: 251.00 MB (98.05%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 23.86 W

The command above gets GPU statistics (--mode=stats).

It connects as the SSH user centreon (--ssh-username=centreon) with the SSH password centreon-password (--ssh-password='centreon-password'), uses the libssh SSH backend (--ssh-backend=libssh) and reaches the host 10.30.2.81 (--hostname=10.30.2.81) on the default SSH port 22 (equivalent to passing --ssh-port=22).
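
The part of the first output line after the `|` is performance data, which in the sample output above appears as `'label'=value[unit];warn;crit;min;max` items separated by spaces. The following Python sketch shows one way to read those items into a dictionary, for instance to post-process results in a script. It is a simplified parser written for this sample, not the parser used by Centreon, and the names `PERF_ITEM` and `parse_perfdata` are hypothetical.

```python
import re

# One perfdata item: 'label'=value[unit] followed by ;warn;crit;min;max fields.
PERF_ITEM = re.compile(r"'([^']+)'=([-+]?\d+(?:\.\d+)?)([^;\s]*)((?:;[^;\s]*)*)")


def _to_float(text: str):
    """Convert a perfdata field to float, or None when the field is empty."""
    return float(text) if text else None


def parse_perfdata(plugin_output: str) -> dict:
    """Map each perfdata label on the first output line to value, unit, min, max."""
    lines = plugin_output.splitlines()
    first_line = lines[0] if lines else ""
    _, _, perfdata = first_line.partition("|")
    metrics = {}
    for label, value, unit, rest in PERF_ITEM.findall(perfdata):
        fields = rest.split(";")[1:]          # warn, crit, min, max
        fields += [""] * (4 - len(fields))    # pad missing trailing fields
        metrics[label] = {
            "value": float(value),
            "unit": unit,
            "min": _to_float(fields[2]),
            "max": _to_float(fields[3]),
        }
    return metrics


sample = (
    "OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; "
    "'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;; "
    "'Quadro K6000:00000000:08:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100\n"
    "checking device gpu 'Quadro K6000:00000000:08:00.0'"
)
for label, data in parse_perfdata(sample).items():
    print(label, data["value"], data["unit"])
```

Only the first line is parsed because the lines that follow are the human-readable details printed by --verbose.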

All the options as well as all the available thresholds can be displayed by adding the --help parameter to the command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--help

Troubleshooting

See the Troubleshooting plugins section of the Centreon documentation.