Go back to manual command for OSD stats

Using the Ceph library was a disaster here; it had no timeout or way to
force it to continue, so keepalives would become stuck and trigger fence
storms. Go back to the manual osd dump command with a 2s timeout, which
is far more reliable and can be terminated cleanly if it runs long.
Joshua Boniface 2020-08-12 22:16:56 -04:00
parent 42f2dedf6d
commit 0587bcbd67
1 changed file with 2 additions and 1 deletion

@@ -1149,7 +1149,8 @@ def collect_ceph_stats(queue):
     command = { "prefix": "osd dump", "format": "json" }
     try:
-        osd_dump_raw = json.loads(ceph_conn.mon_command(json.dumps(command), b'', timeout=1)[1])['osds']
+        retcode, stdout, stderr = common.run_os_command('ceph osd dump --format json --connect-timeout 2', timeout=2)
+        osd_dump_raw = json.loads(stdout)['osds']
     except Exception as e:
         logger.out('Failed to obtain OSD data: {}'.format(e), state='w')
         osd_dump_raw = []
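
The approach in this change, running the command as a child process that can be killed when it exceeds its deadline, can be sketched roughly as follows. This is not the project's `common.run_os_command`, whose implementation is not shown here. `run_with_timeout` and `parse_osd_dump` are hypothetical names, and the return codes used for the timeout and launch-failure cases are assumptions chosen for illustration.

```python
import json
import shlex
import subprocess


def run_with_timeout(command, timeout):
    """Run a command and return (retcode, stdout, stderr).

    Hypothetical stand-in for a helper like common.run_os_command;
    the real helper may behave differently.
    """
    # Accept either a shell-style string or an argument list.
    args = shlex.split(command) if isinstance(command, str) else list(command)
    try:
        result = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising, so a
        # hung command cannot block the caller past the deadline.
        # 128 is an arbitrary code chosen for this sketch.
        return 128, '', 'command timed out after {}s'.format(timeout)
    except OSError as e:
        # The command could not be started (for example, binary not found).
        return 127, '', str(e)
    return result.returncode, result.stdout.decode(), result.stderr.decode()


def parse_osd_dump(stdout):
    """Return the 'osds' list from JSON output, or [] if it is unusable."""
    try:
        return json.loads(stdout)['osds']
    except (ValueError, KeyError, TypeError):
        return []
```

With these helpers, the call in the diff would look something like `retcode, stdout, stderr = run_with_timeout('ceph osd dump --format json --connect-timeout 2', timeout=2)` followed by `osd_dump_raw = parse_osd_dump(stdout)`. The point of the design is that the deadline is enforced from outside the Ceph client, so a stuck monitor connection costs at most the timeout rather than stalling the keepalive.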