NVIDIA GPU SMI SSH

Pack Assets

Monitored Objects

The NVIDIA GPU Pack collects metrics for:

  • Gpu-stats

Collected Metrics

| Metric name | Description | Unit |
|---|---|---|
| devices.gpu.total.count | Number of GPU devices | |
| product_name:id#device.gpu.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which one or more kernels was executing on the GPU | % |
| product_name:id#device.gpu.memory.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which global (device) memory was being read or written | % |
| product_name:id#device.gpu.encoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video encoder was being used | % |
| product_name:id#device.gpu.decoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video decoder was being used | % |
| product_name:id#device.gpu.frame_buffer.memory.usage.bytes | On-board frame buffer memory used | B |
| product_name:id#device.gpu.frame_buffer.memory.free.bytes | On-board frame buffer memory free | B |
| product_name:id#device.gpu.frame_buffer.memory.usage.percentage | On-board frame buffer memory usage as a percentage | % |
| product_name:id#device.gpu.bar1.memory.usage.bytes | BAR1 memory used | B |
| product_name:id#device.gpu.bar1.memory.free.bytes | BAR1 memory free | B |
| product_name:id#device.gpu.bar1.memory.usage.percentage | BAR1 memory usage as a percentage | % |
| product_name:id#device.gpu.fan.speed.percentage | Fan speed | % |
| product_name:id#device.gpu.temperature.celsius | GPU temperature | C |
| product_name:id#device.gpu.power.consumption.watt | The last measured power draw for the entire board | W |

Prerequisites

The centreon-engine user opens an SSH connection to a remote system user. This remote user must have enough privileges to run the nvidia-smi command.
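
The prerequisite above can be checked before installing anything. The snippet below is a minimal sketch, not part of the Pack: the function name check_nvidia_smi and its arguments are hypothetical, and it assumes key-based authentication is already in place for the remote user. It relies only on the standard OpenSSH client and on nvidia-smi -L, which lists the GPUs visible on the remote system.

```shell
# Hypothetical helper (not shipped with the Pack): succeeds when remote_user
# can log in without a prompt and run nvidia-smi on the target host.
# Run it as the centreon-engine user on the Poller.
check_nvidia_smi() {
    remote_user="$1"
    remote_host="$2"
    ssh_port="${3:-22}"
    # BatchMode=yes disables interactive prompts, so a missing key or an
    # unvalidated host key makes the check fail instead of hanging.
    ssh -o BatchMode=yes -p "$ssh_port" "${remote_user}@${remote_host}" nvidia-smi -L
}
```

For example, check_nvidia_smi centreon 10.30.2.81 should print one line per GPU if the connection and privileges are correct; a non-zero exit status points to an SSH or permission problem.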

Setup

  1. Install the Centreon Plugin on every Poller:
yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh
  2. On the Centreon Web interface, in Configuration > Monitoring Connector Manager, install the NVIDIA GPU SMI SSH Pack.

Host configuration

  • Add a new Host and apply the HW-Device-Nvidia-Gpu-Smi-SSH-custom Host Template

Once the template is applied, some Macros have to be configured. Three SSH backends are available to connect to the remote server: sshcli, plink and libssh. The Macros for the sshcli backend are detailed below.

| Mandatory | Name | Description |
|---|---|---|
| X | SSHBACKEND | Name of the backend: sshcli |
| X | SSHUSERNAME | By default, the user running the centengine process on your Poller |
| | SSHPASSWORD | Cannot be used with this backend. Only SSH key authentication is supported |
| | SSHPORT | By default: 22 |
| | SSHEXTRAOPTIONS | Customize it with your own options if needed. E.g.: --ssh-priv-key=/user/.ssh/id_rsa |

Warning: With that backend, you have to validate the target server fingerprint manually (with the SSHUSERNAME used).

How to test the Plugin and what are the main options?

Once the Plugin is installed, log into your Poller using the centreon-engine user account and test it by running the following command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--hostname=10.30.2.81 \
--ssh-username=centreon \
--ssh-password='centreon-password' \
--ssh-backend=libssh \
--verbose

Expected command output is shown below:

OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; 'Quadro K6000:00000000:08:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.bytes'=1349517312B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.free.bytes'=11449401344B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.percentage'=10.54%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.bytes'=13631488B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.free.bytes'=254803968B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.percentage'=5.08%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:08:00.0#device.gpu.power.consumption.watt'=24.16W;;;0; 'Quadro K6000:00000000:84:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.bytes'=732954624B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.free.bytes'=12065964032B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.percentage'=5.73%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.bytes'=5242880B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.free.bytes'=263192576B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.percentage'=1.95%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:84:00.0#device.gpu.power.consumption.watt'=23.86W;;;0;
checking device gpu 'Quadro K6000:00000000:08:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 1.26 GB (10.54%) free: 10.66 GB (89.46%)
bar1 memory usage total: 256.00 MB used: 13.00 MB (5.08%) free: 243.00 MB (94.92%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 24.16 W
checking device gpu 'Quadro K6000:00000000:84:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 699.00 MB (5.73%) free: 11.24 GB (94.27%)
bar1 memory usage total: 256.00 MB used: 5.00 MB (1.95%) free: 251.00 MB (98.05%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 23.86 W
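
In the output above, everything after the first "|" on the first line is performance data, written as 'label'=value entries whose labels combine the device name and the metric name around a "#". As a minimal sketch of how one metric could be pulled out of that line for every device, the function below reads plugin output on standard input. The name extract_metric is hypothetical, and the sketch assumes a grep that supports -o (GNU and BSD grep do); the dots in the metric name are treated as regular-expression wildcards, which is harmless for the metric names listed in this document.

```shell
# Hypothetical helper: print "<device> <value>" for each perfdata entry
# whose label ends with "#<metric>". Reads plugin output on stdin.
extract_metric() {
    metric="$1"
    # Keep only the perfdata part of the first line (text after the first "|").
    sed -n '1s/^[^|]*| *//p' |
    # Isolate each 'device#metric'=value entry (value stops at ";" or space).
    grep -o "'[^']*#${metric}'=[^; ]*" |
    # Drop the quotes and the metric name, leaving "device value".
    sed "s/^'\(.*\)#${metric}'=/\1 /"
}
```

Piping the plugin command into extract_metric device.gpu.temperature.celsius would then print one line per GPU, with the device identifier followed by the value and its unit.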

The command above gets GPU statistics (--mode=stats).

It connects to the host 10.30.2.81 (--hostname=10.30.2.81) through the libssh backend (--ssh-backend=libssh), authenticating as the SSH user centreon (--ssh-username=centreon) with the password centreon-password (--ssh-password='centreon-password'). Because --ssh-port is not set, the default SSH port 22 is used.
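
For key-based authentication with the sshcli backend, the same check could be wrapped as in the sketch below. It only reuses options that appear in this document (--ssh-backend=sshcli, and --ssh-priv-key from the SSHEXTRAOPTIONS example). The function name gpu_stats_sshcli and the CENTREON_PLUGIN override variable are hypothetical and exist only for illustration.

```shell
# Hypothetical wrapper: run the stats mode through the sshcli backend
# with SSH key authentication (no --ssh-password with this backend).
# CENTREON_PLUGIN may override the plugin path; the default is the path
# used elsewhere in this document.
gpu_stats_sshcli() {
    host="$1"
    user="$2"
    key="$3"
    "${CENTREON_PLUGIN:-/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl}" \
        --plugin=hardware::devices::nvidia::gpu::smi::plugin \
        --mode=stats \
        --hostname="$host" \
        --ssh-backend=sshcli \
        --ssh-username="$user" \
        --ssh-priv-key="$key" \
        --verbose
}
```

A call such as gpu_stats_sshcli 10.30.2.81 centreon /user/.ssh/id_rsa assumes the target server fingerprint has already been validated for the centreon-engine user, as required by the sshcli backend.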

All the options as well as all the available thresholds can be displayed by adding the --help parameter to the command:

/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--help

Troubleshooting

Troubleshooting plugins