NVIDIA GPU SMI SSH
Pack Assets
Monitored Objects
The NVIDIA GPU SMI SSH Pack collects metrics for:
- Gpu-stats
Collected Metrics
- Gpu-stats
Metric name | Description | Unit |
---|---|---|
devices.gpu.total.count | Number of GPU devices | |
product_name:id#device.gpu.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which one or more kernels was executing on the GPU | % |
product_name:id#device.gpu.memory.utilization.percentage | Percent of time over the past sample period (between 1 second and 1/6 second depending on the product) during which global (device) memory was being read or written | % |
product_name:id#device.gpu.encoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video encoder was being used | % |
product_name:id#device.gpu.decoder.utilization.percentage | Percent of time over the past sample period (sampling rate is variable) during which the GPU video decoder was being used | % |
product_name:id#device.gpu.frame_buffer.memory.usage.bytes | On-board frame buffer memory usage | B |
product_name:id#device.gpu.frame_buffer.memory.free.bytes | On-board frame buffer memory available | B |
product_name:id#device.gpu.frame_buffer.memory.usage.percentage | On-board frame buffer memory usage in percentage | % |
product_name:id#device.gpu.bar1.memory.usage.bytes | BAR1 memory usage | B |
product_name:id#device.gpu.bar1.memory.free.bytes | BAR1 memory available | B |
product_name:id#device.gpu.bar1.memory.usage.percentage | BAR1 memory usage in percentage | % |
product_name:id#device.gpu.fan.speed.percentage | Fan speed value | % |
product_name:id#device.gpu.temperature.celsius | Temperature value | C |
product_name:id#device.gpu.power.consumption.watt | The last measured power draw for the entire board | W |
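The usage percentages in this table appear to follow from the byte counters. As a rough sketch, assuming the total is the sum of the used and free bytes (an assumption that matches the sample output further down this page, not something the Pack documents), a percentage can be recomputed as shown here. The function name usage_pct is hypothetical.

```shell
# usage_pct USED_BYTES FREE_BYTES
# Prints the used share of (used + free) as a percentage with two decimals.
# Assumption: total memory = used + free.
usage_pct() {
    awk -v used="$1" -v free="$2" 'BEGIN {
        total = used + free
        if (total == 0) { print "0.00"; exit }
        printf "%.2f\n", used * 100 / total
    }'
}

# Frame buffer values taken from the sample output on this page (prints 10.54)
usage_pct 1349517312 11449401344
# BAR1 values taken from the same output (prints 5.08)
usage_pct 13631488 254803968
```

Both results agree with the frame_buffer and bar1 usage percentages reported for the first GPU in that sample output.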
Prerequisites
The centreon-engine user opens an SSH connection to a user account on the remote system. This remote user must have sufficient privileges to run the nvidia-smi command.
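Before configuring the host, it can help to confirm this prerequisite by hand. The sketch below assumes the OpenSSH client is available on the Poller and that it is run as the centreon-engine user; the function name check_nvidia_smi and the example account and address are placeholders. It only lists the GPUs with nvidia-smi -L, which is enough to show that the remote user is allowed to run the tool.

```shell
# check_nvidia_smi REMOTE_USER REMOTE_HOST [PORT]
# Opens an SSH session and lists the GPUs seen by nvidia-smi on the remote host.
# Returns the exit status of ssh (non-zero if the connection or nvidia-smi fails).
check_nvidia_smi() {
    remote_user="$1"
    remote_host="$2"
    remote_port="${3:-22}"
    ssh -p "$remote_port" "${remote_user}@${remote_host}" 'nvidia-smi -L'
}

# Example call (placeholder account and address):
# check_nvidia_smi centreon 10.30.2.81
```

A non-zero exit status suggests that either the SSH connection or the remote nvidia-smi call failed, which is worth fixing before deploying the Pack.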
Setup
Online License:
- Install the Centreon Plugin on every Poller:
yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh
- On the Centreon Web interface in Configuration > Monitoring Connector Manager, install the NVIDIA GPU SMI SSH Pack
Offline License:
- Install the Centreon Plugin on every Poller:
yum install centreon-plugin-Hardware-Devices-Nvidia-Gpu-Smi-Ssh
- On the Centreon Central server, install the Centreon Pack from the RPM:
yum install centreon-pack-hardware-devices-nvidia-gpu-smi-ssh
- On the Centreon Web interface in Configuration > Monitoring Connector Manager, install the NVIDIA GPU SMI SSH Pack
Host configuration
- Add a new Host and apply the HW-Device-Nvidia-Gpu-Smi-SSH-custom Host Template
Once the template is applied, some Macros have to be configured. Three SSH backends are available to connect to the remote server: sshcli, plink and libssh. Each is detailed below.
sshcli backend
| Mandatory | Name | Description |
|---|---|---|
| X | SSHBACKEND | Name of the backend: sshcli |
| X | SSHUSERNAME | By default, the user running the centengine process on your Poller is used |
| | SSHPASSWORD | Cannot be used with this backend. Only SSH key authentication is supported |
| | SSHPORT | By default: 22 |
| | SSHEXTRAOPTIONS | Customize it with your own options if needed. E.g.: --ssh-priv-key=/user/.ssh/id_rsa |

Warning: With this backend, you have to validate the target server fingerprint manually (with the SSHUSERNAME used).

plink backend
| Mandatory | Name | Description |
|---|---|---|
| X | SSHBACKEND | Name of the backend: plink |
| X | SSHUSERNAME | By default, the user running the centengine process on your Poller is used |
| | SSHPASSWORD | Can be used. If not set, SSH key authentication is used |
| | SSHPORT | By default: 22 |
| | SSHEXTRAOPTIONS | Customize it with your own options if needed. E.g.: --ssh-priv-key=/user/.ssh/id_rsa |

Warning: With this backend, you have to validate the target server fingerprint manually (with the SSHUSERNAME used).

libssh backend (default)
| Mandatory | Name | Description |
|---|---|---|
| X | SSHBACKEND | Name of the backend: libssh |
| X | SSHUSERNAME | By default, the user running the centengine process on your Poller is used |
| | SSHPASSWORD | Can be used. If not set, SSH key authentication is used |
| | SSHPORT | By default: 22 |
| | SSHEXTRAOPTIONS | Customize it with your own options if needed. E.g.: --ssh-priv-key=/user/.ssh/id_rsa |

With this backend, you do not have to validate the target server fingerprint manually.
How to test the Plugin and what the main options are for
Once the Plugin is installed, log into your Poller using the centreon-engine user account and test it by running the following command:
/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--hostname=10.30.2.81 \
--ssh-username=centreon \
--ssh-password='centreon-password' \
--ssh-backend=libssh \
--verbose
Expected command output is shown below:
OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; 'Quadro K6000:00000000:08:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.bytes'=1349517312B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.free.bytes'=11449401344B;;;0;12798918656 'Quadro K6000:00000000:08:00.0#device.gpu.frame_buffer.memory.usage.percentage'=10.54%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.bytes'=13631488B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.free.bytes'=254803968B;;;0;268435456 'Quadro K6000:00000000:08:00.0#device.gpu.bar1.memory.usage.percentage'=5.08%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:08:00.0#device.gpu.power.consumption.watt'=24.16W;;;0; 'Quadro K6000:00000000:84:00.0#device.gpu.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.memory.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.encoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.decoder.utilization.percentage'=0.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.bytes'=732954624B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.free.bytes'=12065964032B;;;0;12798918656 'Quadro K6000:00000000:84:00.0#device.gpu.frame_buffer.memory.usage.percentage'=5.73%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.bytes'=5242880B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.free.bytes'=263192576B;;;0;268435456 'Quadro K6000:00000000:84:00.0#device.gpu.bar1.memory.usage.percentage'=1.95%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.fan.speed.percentage'=26.00%;;;0;100 'Quadro K6000:00000000:84:00.0#device.gpu.temperature.celsius'=40C;;;; 'Quadro K6000:00000000:84:00.0#device.gpu.power.consumption.watt'=23.86W;;;0;
checking device gpu 'Quadro K6000:00000000:08:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 1.26 GB (10.54%) free: 10.66 GB (89.46%)
bar1 memory usage total: 256.00 MB used: 13.00 MB (5.08%) free: 243.00 MB (94.92%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 24.16 W
checking device gpu 'Quadro K6000:00000000:84:00.0'
utilization gpu: 0.00 %, memory: 0.00 %, encoder: 0.00 %, decoder: 0.00 %
frame buffer memory usage total: 11.92 GB used: 699.00 MB (5.73%) free: 11.24 GB (94.27%)
bar1 memory usage total: 256.00 MB used: 5.00 MB (1.95%) free: 251.00 MB (98.05%)
fan speed: 26.00 %
gpu temperature: 40 C
power consumption: 23.86 W
The command above gets GPU statistics (--mode=stats). It connects to the host 10.30.2.81 (--hostname=10.30.2.81) with the SSH username centreon (--ssh-username=centreon) and the SSH password centreon-password (--ssh-password='centreon-password'), using the libssh SSH backend (--ssh-backend=libssh). Because --ssh-port is not set, the default SSH port 22 is used (equivalent to --ssh-port=22).
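For comparison, here is a sketch of the same check using the sshcli backend with key-based authentication instead of a password. It only reuses options that appear on this page (--ssh-backend, --ssh-port and the --ssh-priv-key example given for the SSHEXTRAOPTIONS macro). The wrapper name gpu_stats_sshcli and the PLUGIN variable are hypothetical conveniences, and the key path in the example call is a placeholder.

```shell
# gpu_stats_sshcli HOST SSH_USER PRIVATE_KEY [PORT]
# Runs the stats mode through the sshcli backend with key authentication.
# PLUGIN may be overridden if the plugin is installed elsewhere.
gpu_stats_sshcli() {
    "${PLUGIN:-/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl}" \
        --plugin=hardware::devices::nvidia::gpu::smi::plugin \
        --mode=stats \
        --hostname="$1" \
        --ssh-backend=sshcli \
        --ssh-username="$2" \
        --ssh-port="${4:-22}" \
        --ssh-priv-key="$3"
}

# Example call (placeholder key path):
# gpu_stats_sshcli 10.30.2.81 centreon /user/.ssh/id_rsa
```

With this backend, the target server fingerprint must already have been validated manually, as noted in the Host configuration section.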
All the options as well as all the available thresholds can be displayed by adding the --help parameter to the command:
/usr/lib/centreon/plugins/centreon_nvidia_gpu_smi_ssh.pl \
--plugin=hardware::devices::nvidia::gpu::smi::plugin \
--mode=stats \
--help
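The performance data after the | separator in the expected output above uses the 'label'=value[unit];warning;critical;min;max layout. As a minimal sketch for ad hoc scripting (not part of the plugin), the hypothetical helper perfdata_value below extracts the numeric value of one metric from plugin output read on standard input. It assumes labels are single-quoted as in the sample and that values are non-negative.

```shell
# perfdata_value LABEL
# Reads plugin output on stdin and prints the numeric value of the metric
# whose quoted label is LABEL, without its unit or thresholds.
perfdata_value() {
    awk -v label="'$1'=" '
        {
            pos = index($0, label)          # literal match, no regex on the label
            if (pos == 0) next
            rest = substr($0, pos + length(label))
            if (match(rest, /^[0-9.]+/)) {  # leading number only
                print substr(rest, RSTART, RLENGTH)
                exit
            }
        }'
}

# Shortened excerpt of the sample output shown on this page
sample="OK: All devices are ok | 'devices.gpu.total.count'=2;;;0; 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'=40C;;;;"
printf '%s\n' "$sample" | perfdata_value 'devices.gpu.total.count'
printf '%s\n' "$sample" | perfdata_value 'Quadro K6000:00000000:08:00.0#device.gpu.temperature.celsius'
```

The two calls print 2 and 40. The label is matched literally with index() rather than as a regular expression, so the dots, colons and # in GPU labels need no escaping.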