Looking for the Perfect Dashboard: InfluxDB, Telegraf and Grafana – Part XII (Native Telegraf Plugin for vSphere) – The Blog of Jorge de la Cruz

Greetings friends, today I bring you another one of those hidden gems that you like so much. Besides being free and deployable in a few minutes, it has potential that many commercial tools would envy.

Today we are going to create four fresh Grafana Dashboards within minutes. By the end of this post, we will have some Dashboards (in plural, friends) similar to these:

vSphere Overview Dashboard

vSphere Hosts Overview Dashboard

vSphere Datastore Overview

vSphere VM Overview

Telegraf Plugin for VMware vSphere

My friend Craig told me that an official Telegraf plugin for vSphere had been released a few days ago, so the first thing I did was go to its GitHub page and check it out:

The plugin is pure joy, not only because it speaks directly with the vCenter SDK, but also because we can monitor all the following parameters:

  • Cluster Stats:
    • Cluster services: CPU, memory, failover
    • CPU: total, usage
    • Memory: consumed, total, vmmemctl
    • VM operations: # changes, clone, create, deploy, destroy, power, reboot, reconfigure, register, reset, shutdown, standby, vmotion
  • Host Stats:
    • CPU: total, usage, cost, mhz
    • Datastore: iops, latency, read/write bytes, # reads/writes
    • Disk: commands, latency, kernel reads/writes, # reads/writes, queues
    • Memory: total, usage, active, latency, swap, shared, vmmemctl
    • Network: broadcast, bytes, dropped, errors, multicast, packets, usage
    • Power: energy, usage, capacity
    • Res CPU: active, max, running
    • Storage Adapter: commands, latency, # reads/writes
    • Storage Path: commands, latency, # reads/writes
    • System Resources: cpu active, cpu max, cpu running, cpu usage, mem allocated, mem consumed, mem shared, swap
    • System: uptime
    • Flash Module: active VMDKs
  • VM Stats:
    • CPU: demand, usage, readiness, cost, mhz
    • Datastore: latency, # reads/writes
    • Disk: commands, latency, # reads/writes, provisioned, usage
    • Memory: granted, usage, active, swap, vmmemctl
    • Network: broadcast, bytes, dropped, multicast, packets, usage
    • Power: energy, usage
    • Res CPU: active, max, running
    • System: operating system uptime, uptime
    • Virtual Disk: seeks, # reads/writes, latency, load
  • Datastore Stats:
    • Disk: Capacity, provisioned, used

Impressive, right? If you do not yet have Telegraf, InfluxDB and Grafana, follow these steps (and these for Grafana). Those of you who have already followed the whole series in Spanish only need to update the system to receive the vSphere plugin for Telegraf. On a Debian or Ubuntu server like mine, that means running apt-get update followed by apt-get upgrade.

The telegraf package will show up among the pending updates, so we answer yes when asked whether to continue:

Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  linux-generic-lts-utopic linux-headers-generic-lts-utopic
  linux-image-generic-lts-utopic
The following packages will be upgraded:
  bind9-host curl dnsutils filebeat influxdb libbind9-90 libcurl3
  libcurl3-gnutls libdns100 libglib2.0-0 libglib2.0-data libisc95 libisccc90
  libisccfg90 liblwres90 telegraf tzdata
17 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 50.8 MB of archives.
After this operation, 17.6 MB of additional disk space will be used.
Do you want to continue? [Y/n] y

Once the package is installed, we only need to edit the Telegraf configuration file, /etc/telegraf/telegraf.conf, and remove the leading # from the vSphere plugin section:

[[inputs.vsphere]]
   ## List of vCenter URLs to be monitored. These three lines must be uncommented
   ## and edited for the plugin to work.
   vcenters = [ "https://YOURVCSAIP/sdk" ]
   username = "YOURUSER@vsphere.local"
   password = "YOURPASS"

Of course, we will also have to uncomment the rest of the plugin's parameters:

   ## VMs
   ## Typical VM metrics (if omitted or empty, all metrics are collected)
   vm_metric_include = [
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.run.summation",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.wait.summation",
     "mem.active.average",
     "mem.granted.average",
     "mem.latency.average",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.usage.average",
     "power.power.average",
     "virtualDisk.numberReadAveraged.average",
     "virtualDisk.numberWriteAveraged.average",
     "virtualDisk.read.average",
     "virtualDisk.readOIO.latest",
     "virtualDisk.throughput.usage.average",
     "virtualDisk.totalReadLatency.average",
     "virtualDisk.totalWriteLatency.average",
     "virtualDisk.write.average",
     "virtualDisk.writeOIO.latest",
     "sys.uptime.latest",
   ]
   # vm_metric_exclude = [] ## Nothing is excluded by default
   # vm_instances = true ## true by default
   ## Hosts
   ## Typical host metrics (if omitted or empty, all metrics are collected)
   host_metric_include = [
     "cpu.coreUtilization.average",
     "cpu.costop.summation",
     "cpu.demand.average",
     "cpu.idle.summation",
     "cpu.latency.average",
     "cpu.readiness.average",
     "cpu.ready.summation",
     "cpu.swapwait.summation",
     "cpu.usage.average",
     "cpu.usagemhz.average",
     "cpu.used.summation",
     "cpu.utilization.average",
     "cpu.wait.summation",
     "disk.deviceReadLatency.average",
     "disk.deviceWriteLatency.average",
     "disk.kernelReadLatency.average",
     "disk.kernelWriteLatency.average",
     "disk.numberReadAveraged.average",
     "disk.numberWriteAveraged.average",
     "disk.read.average",
     "disk.totalReadLatency.average",
     "disk.totalWriteLatency.average",
     "disk.write.average",
     "mem.active.average",
     "mem.latency.average",
     "mem.state.latest",
     "mem.swapin.average",
     "mem.swapinRate.average",
     "mem.swapout.average",
     "mem.swapoutRate.average",
     "mem.totalCapacity.average",
     "mem.usage.average",
     "mem.vmmemctl.average",
     "net.bytesRx.average",
     "net.bytesTx.average",
     "net.droppedRx.summation",
     "net.droppedTx.summation",
     "net.errorsRx.summation",
     "net.errorsTx.summation",
     "net.usage.average",
     "power.power.average",
     "storageAdapter.numberReadAveraged.average",
     "storageAdapter.numberWriteAveraged.average",
     "storageAdapter.read.average",
     "storageAdapter.write.average",
     "sys.uptime.latest",
   ]
   # host_metric_exclude = [] ## Nothing excluded by default
   # host_instances = true ## true by default
   ## Clusters
   # cluster_metric_include = [] ## if omitted or empty, all metrics are collected
   # cluster_metric_exclude = [] ## Nothing excluded by default
   # cluster_instances = true ## true by default
   ## Datastores
   # datastore_metric_include = [] ## if omitted or empty, all metrics are collected
   # datastore_metric_exclude = [] ## Nothing excluded by default
   # datastore_instances = false ## false by default for Datastores only
   ## Datacenters
   datacenter_metric_include = [] ## if omitted or empty, all metrics are collected
   datacenter_metric_exclude = [ "*" ] ## Datacenters are not collected by default.
   datacenter_instances = false ## false by default for Datastores only

Once that is done, if the vCenter certificate is not signed by a valid CA, or if the CA is not installed on the server running Grafana, InfluxDB and Telegraf, uncomment this line as well and set it to true so that certificate verification is skipped:

insecure_skip_verify = true

Another option is to download the SSL certificate from our vCenter to the Telegraf server, so that it is trusted:

openssl s_client -servername YOURVCENTER -connect YOURVCENTER:443 </dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' >/etc/ssl/certs/vcsa.pem
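If you go the certificate route, it can be worth checking that the file we just saved really contains a certificate before moving on. This is only a small sketch: show_cert is a hypothetical helper name (it is not part of openssl or Telegraf), and it assumes the same /etc/ssl/certs/vcsa.pem path used in the command above.

```shell
#!/bin/sh
# Sketch: confirm the saved file holds a certificate and show whose it is.
# Assumption: the default path matches the one used in the openssl command above.
show_cert() {
  cert="${1:-/etc/ssl/certs/vcsa.pem}"
  # An empty or missing file usually means the s_client connection failed.
  if [ ! -s "$cert" ]; then
    echo "No certificate found at $cert" >&2
    return 1
  fi
  # Print the subject, issuer and validity dates of the certificate.
  openssl x509 -in "$cert" -noout -subject -issuer -dates
}
```

Running show_cert with no arguments inspects the default path; pass another path as the first argument if you saved the file elsewhere.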

Let’s finally restart the telegraf service:
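As a minimal sketch, and assuming the package registered a service called telegraf and that the configuration lives in the path used above, the hypothetical helper below first asks Telegraf to run only the vSphere input once with --test, and restarts the service only if that run exits without an error. The helper name is mine. On servers without systemd it falls back to the service command.

```shell
#!/bin/sh
# Sketch: try the vSphere input once, then restart Telegraf if the check passed.
# Assumptions: config at /etc/telegraf/telegraf.conf and a service named "telegraf".
check_and_restart_telegraf() {
  conf="${1:-/etc/telegraf/telegraf.conf}"
  # --test gathers metrics once and prints them to stdout instead of
  # sending them to InfluxDB; --input-filter limits the run to vsphere.
  telegraf --config "$conf" --input-filter vsphere --test || {
    echo "vSphere input check failed, Telegraf was not restarted" >&2
    return 1
  }
  # Use systemd when it is available, otherwise the classic service wrapper.
  if command -v systemctl >/dev/null 2>&1; then
    sudo systemctl restart telegraf
  else
    sudo service telegraf restart
  fi
}
```

Calling check_and_restart_telegraf with no arguments uses /etc/telegraf/telegraf.conf. If you prefer to skip the check, the restart line on its own is enough.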

Verifying that we are ingesting information with Chronograf

By this point, if we have followed all the steps correctly, Telegraf should already be sending the information it collects to InfluxDB. If we run a query in the wonderful Chronograf, we can verify that data is arriving:

All the metrics of this new vSphere plugin for Telegraf are stored in measurements named vsphere_*, so they are really easy to find.
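The same check can be made from the command line. The sketch below assumes InfluxDB 1.x with the influx CLI installed on the server, and that Telegraf writes to a database called telegraf (its usual default; change DB if yours is different). It simply lists every measurement whose name starts with vsphere_.

```shell
#!/bin/sh
# Sketch: list the vSphere measurements that Telegraf has written so far.
# Assumptions: InfluxDB 1.x "influx" CLI, database named "telegraf".
DB="telegraf"
QUERY='SHOW MEASUREMENTS WITH MEASUREMENT =~ /^vsphere_/'

if command -v influx >/dev/null 2>&1; then
  # -execute runs a single InfluxQL statement and exits.
  influx -database "$DB" -execute "$QUERY"
else
  echo "influx CLI not found on this machine. The query to run is:"
  echo "$QUERY"
fi
```

An empty list usually means Telegraf has not completed a collection cycle yet, or that it is writing to a different database.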

Grafana Dashboards

This is where I have worked really hard: I created the Dashboards from scratch, selecting the best queries to the database, polishing the colors, and deciding which type of graph to use and how to present it. Everything is also automated so that it fits your environment without any problem and without you having to edit anything manually. You can find the Dashboards here; once you have imported all four, you can move between them with the menu at the top right. Now it’s time to download them, or at least to note their IDs:

How to easily import the Grafana Dashboards

So that you don’t have to waste hours configuring a new Dashboard and writing and debugging queries, I’ve already created four wonderful Dashboards with everything you need to monitor your environment in a very simple way. They will look like the images I showed you above.

In Grafana, go to Create – Import.

Choose the name you want and enter the IDs one at a time: 8159, 8162, 8165 and 8168, which are the unique IDs of the Dashboards, or use their URLs:

With the menu at the top right, you can switch between the Hosts, Datastores and VMs Dashboards and, of course, the main Overview one:

One of the improvements these Dashboards include is the set of variable selectors at the top left: depending on what you select, you will see only the Cluster, ESXi host or VM you are interested in. Please leave your feedback in the comments.

If you want to see them working without installing anything, here is the link to my environment:

That’s all, folks. If you want to follow the full blog series about Grafana, InfluxDB and Telegraf, please click on the following links:
