telegraf: telegraf --test shows Ceph output, but actual Telegraf run returns failed to find sockets at path '/var/run/ceph': Failed to read socket directory '/var/run/ceph': open /var/run/ceph: permission denied

Relevant telegraf.conf:

# # Collects performance metrics from the MON and OSD nodes in a Ceph storage cluster.
 [[inputs.ceph]]
#   ## This is the recommended interval to poll.  Too frequent and you will lose
#   ## data points due to timeouts during rebalancing and recovery
   interval = '1m'
#
#   ## All configuration values are optional, defaults are shown below
#
#   ## location of ceph binary
#   ceph_binary = "/usr/bin/ceph"
#
#   ## directory in which to look for socket files
#   socket_dir = "/var/run/ceph"
#
#   ## prefix of MON and OSD socket files, used to determine socket type
#   mon_prefix = "ceph-mon"
#   osd_prefix = "ceph-osd"
#
#   ## suffix used to identify socket files
#   socket_suffix = "asok"
#
#   ## Ceph user to authenticate as
#   ceph_user = "client.admin"
#
#   ## Ceph configuration to use to locate the cluster
#   ceph_config = "/etc/ceph/ceph.conf"
#
#   ## Whether to gather statistics via the admin socket
#   gather_admin_socket_stats = true
#
#   ## Whether to gather statistics via ceph commands
   gather_cluster_stats = true

System info:

Telegraf 1.10.0 (git: HEAD fe33ee8)
Proxmox 5.3 (based on Debian Stretch)
sysstat version 11.4.3

Contents of /var/run/ceph: the relevant socket files grant both read and execute to other users. (However, I noticed the directory itself isn’t open to other users?)

root@syd1:/var/run/ceph# ls -lah
total 0
drwxrwx---  2 ceph ceph  160 Mar 14 07:48 .
drwxr-xr-x 27 root root 1.3K Mar 16 05:30 ..
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:45 ceph-mgr.syd1.asok
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:45 ceph-mon.syd1.asok
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:48 ceph-osd.0.asok
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:48 ceph-osd.1.asok
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:48 ceph-osd.2.asok
srwxr-xr-x  1 ceph ceph    0 Mar 14 07:48 ceph-osd.3.asok
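
On the directory question: listing a directory needs read permission on the directory itself, and reaching the sockets inside it needs execute (search) permission on it, so the mode of /var/run/ceph matters regardless of the mode on the .asok files. With drwxrwx--- owned by ceph:ceph, a user outside the ceph group is denied, which is consistent with the error in the title. A minimal sketch for checking this from a shell follows; others_can_list is a hypothetical helper (not part of Telegraf or Ceph), and the stat -c %a call assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the "other" digit of an octal mode
# (e.g. 770 or 755) grants both read (4) and execute/search (1).
others_can_list() {
    _mode=$1
    _other=${_mode#"${_mode%?}"}   # last character = permission digit for "other"
    [ $(( _other & 5 )) -eq 5 ]
}

# Modes taken from the listing above: /var/run/ceph is 770, its parent is 755.
for m in 770 755; do
    if others_can_list "$m"; then
        echo "$m: other users can list and enter"
    else
        echo "$m: other users are denied"
    fi
done

# On a live system (assumes GNU stat); skipped when the directory is absent.
if [ -d /var/run/ceph ] && live_mode=$(stat -c %a /var/run/ceph 2>/dev/null); then
    if others_can_list "$live_mode"; then
        echo "/var/run/ceph ($live_mode) is open to other users"
    else
        echo "/var/run/ceph ($live_mode) is closed to other users"
    fi
fi
```

The helper only inspects the "other" permission digit; a user who is a member of the ceph group is covered by the group digit instead.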

Steps to reproduce:

  1. Install Telegraf from InfluxData repositories.
  2. Edit /etc/telegraf/telegraf.conf and enable the ceph input plugin with the options shown above.
  3. Attempt to restart telegraf using systemctl restart telegraf
  4. Run telegraf --test to check syntax.
  5. Run journalctl -u telegraf to check status of telegraf.

Expected behavior:

Ceph data that is shown in telegraf --test should also be populated into InfluxDB output.

No permission errors should be seen in Telegraf logs around Ceph.

Actual behavior:

telegraf --test does show ceph data in output.

However, no ceph data is populated into InfluxDB.

journalctl shows error messages around Ceph permissions:

Mar 16 02:32:00 syd1 telegraf[577778]: 2019-03-15T15:32:00Z E! [inputs.ceph]: Error in plugin: failed to find sockets at path '/var/run/ceph': Failed to read socket directory '/var/run/ceph': open /var/run/ceph: permission denied

Additional info:

I did find this earlier issue around Ceph permissions and Telegraf:

https://github.com/influxdata/telegraf/issues/1657

which mentions this Ceph PR to add a config option for socket permissions:

https://github.com/ceph/ceph/pull/11684

However, as per above, it seems like my actual socket files have public read/execute already set?

Also it’s odd that telegraf --test returns Ceph output. (Not sure if it’s something related to Proxmox, which doesn’t include sudo by default). What user does telegraf --test run under?
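
On the question of which user telegraf --test runs under: a command started by hand runs as whoever invokes it (root in the prompts shown here), while the packaged systemd service normally runs as the telegraf user. That difference would explain why only the service hits the permission error. The sketch below compares the two identities and re-runs the test as the service account. It assumes the service account is named telegraf, uses runuser from util-linux as a stand-in for sudo, and only attempts the last step when run as root.

```shell
#!/bin/sh
# User of the current shell; a manual `telegraf --test` inherits this identity.
current_user=$(id -un)
echo "interactive user: $current_user"

# User configured for the service (prints e.g. "User=telegraf" when the unit exists).
if command -v systemctl >/dev/null 2>&1; then
    systemctl show -p User telegraf 2>/dev/null || true
fi

# Re-run the test as the service account so it sees the same permissions.
# Assumption: the account is named "telegraf" and this script runs as root.
if [ "$current_user" = root ] && command -v telegraf >/dev/null 2>&1 \
        && id telegraf >/dev/null 2>&1; then
    runuser -u telegraf -- telegraf --config /etc/telegraf/telegraf.conf \
        --input-filter ceph --test
fi
```

If the run as the telegraf user reproduces the permission error while the run as root does not, the difference lies in the account's permissions rather than in the plugin configuration.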

(Interestingly, the above Ceph admin socket option is apparently not that well documented; see this Medium article.)

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

In the comment above I tried copying the keyring and ceph.conf to the /etc/telegraf/ directory and setting the permissions accordingly. It didn’t work.

To summarize my final setup…

Telegraf is in the ceph group:

root@ceph1/# groups telegraf 
telegraf : telegraf ceph

The telegraf keyring is owned by the telegraf user:

root@ceph1/# ls -las /etc/ceph/
total 48
 4 drwxr-xr-x   2 ceph     ceph      4096 Mai 24 11:49 .
12 drwxr-xr-x 114 root     root     12288 Mai 23 14:24 ..
 4 -rw-------   1 ceph     ceph       151 Mär 20 15:39 ceph.client.admin.keyring
 4 -rw-r--r--   1 root     root        64 Mai 15 12:27 ceph.client.nagios.keyring
 4 -rw-r-----   1 telegraf telegraf   132 Mai 24 11:48 ceph.client.telegraf.keyring
 4 -rw-r--r--   1 ceph     ceph      1096 Mai 23 17:42 ceph.conf

ceph.conf contains a [client.telegraf] section that points to the keyring:

root@ceph1/etc# cat /etc/ceph/ceph.conf |grep telegraf
[client.telegraf]
keyring = /etc/ceph/ceph.client.telegraf.keyring

telegraf.conf also sets the client.telegraf user and the ceph config file:

[[inputs.ceph]]
  interval = '30s'
  ceph_binary = "/usr/bin/ceph"
  socket_dir = "/var/run/ceph"
  mon_prefix = "ceph-mon"
  osd_prefix = "ceph-osd"
  socket_suffix = "asok"
  ceph_user = "client.telegraf"
  ceph_config = "/etc/ceph/ceph.conf"
  gather_admin_socket_stats = true
  gather_cluster_stats = true

Results:

root@ceph1/# sudo -u telegraf ceph df
2019-05-24 12:37:00.117 7f18f616b700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory

root@ceph1/# tail /var/log/syslog
2019-05-24T12:37:00.875204+02:00 ceph1 telegraf[672232]: 2019-05-24T10:37:00Z E! [inputs.ceph]: Error in plugin: error executing command: error running ceph df: exit status 13
root@ceph1/# sudo -u telegraf ceph --name client.telegraf df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    10 TiB     10 TiB       13 GiB          0.12

I also added --name client.telegraf to the commands in the ceph.go plugin and built it on my machine. The result is the same as without the change, even though Telegraf now runs the check with the --name client.telegraf argument. It still seems to ignore the client.telegraf user in ceph.conf and to go looking for the keyring…

root@ceph1/# tail /var/log/syslog
2019-05-24T12:43:30.875135+02:00 ceph1 telegraf[672232]: 2019-05-24T10:43:30Z E! [inputs.ceph]: Error in plugin: error executing command: error running ceph --name client.telegraf df: exit status 13

Any suggestions?
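
One way to narrow this down is to confirm, from a shell running as the telegraf user, that every file the ceph command needs is actually readable by that account. Exit status 13 equals the Linux errno value for EACCES (permission denied), though whether the ceph CLI is reporting an errno here is an assumption. The check_readable function below is a hypothetical helper, and the paths are the ones from the setup described above.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can read each given path.
# Returns non-zero if at least one path is not readable.
check_readable() {
    _rc=0
    for _path in "$@"; do
        if [ -r "$_path" ]; then
            echo "readable:     $_path"
        else
            echo "NOT readable: $_path"
            _rc=1
        fi
    done
    return "$_rc"
}

# Paths from the setup above; adjust if the cluster uses different locations.
check_readable /etc/ceph/ceph.conf /etc/ceph/ceph.client.telegraf.keyring \
    || echo "at least one required file is not readable by $(id -un)"
```

The result is only meaningful when the script runs as the service account, for example from a shell opened with su -s /bin/sh telegraf on systems without sudo. A running process keeps the group list it started with, so the telegraf service also needs a restart after the account is added to the ceph group.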