Go back to manual command for OSD stats
Using the Ceph library was a disaster here; it had no timeout or way to force it to continue, so keepalives would become stuck and trigger fence storms. Go back to the manual osd dump command with a 2s timeout which is far more reliable and can be adequately terminated if it runs long.
This commit is contained in:
		| @@ -1149,7 +1149,8 @@ def collect_ceph_stats(queue): | ||||
|  | ||||
|         command = { "prefix": "osd dump", "format": "json" } | ||||
|         try: | ||||
|             osd_dump_raw = json.loads(ceph_conn.mon_command(json.dumps(command), b'', timeout=1)[1])['osds'] | ||||
|             retcode, stdout, stderr = common.run_os_command('ceph osd dump --format json --connect-timeout 2', timeout=2) | ||||
|             osd_dump_raw = json.loads(stdout)['osds'] | ||||
|         except Exception as e: | ||||
|             logger.out('Failed to obtain OSD data: {}'.format(e), state='w') | ||||
|             osd_dump_raw = [] | ||||
|   | ||||
		Reference in New Issue
	
	Block a user