parallelvirtualcluster/pvc

Author	SHA1	Message	Date
Joshua M. Boniface	20b66c10e1	Move two more commands to Rados library	2021-07-10 17:28:42 -04:00
Joshua M. Boniface	cfeba50b17	Revert "Return to all command-based Ceph gathering" This reverts commit `65d14ccd92`. This was actually a bad idea. For inexplicable reasons, running these Ceph commands manually (not even via Python, but in a normal shell) takes roughly 700 times (7 * two orders of magnitude) longer than running them with the Rados module, so long in fact that some basic commands like "ceph health" would sometimes take longer than the 1 second timeout to complete. The Rados commands would however take about 1ms instead. Despite the occasional issues when monitors drop out, the Rados module is clearly far superior to the shell commands for any moderately-loaded Ceph cluster. We can look into solving timeouts another way (perhaps with Processes instead of Threads) at a later time. Rados module "ceph health", two runs: b'{"checks":{},"status":"HEALTH_OK"}' 0.001204 (s); b'{"checks":{},"status":"HEALTH_OK"}' 0.001258 (s). Command "ceph health", two runs of `time ceph health >/dev/null` as joshua@hv1.c.bonilan.net: real 0m0.772s, user 0m0.707s, sys 0m0.046s; real 0m0.796s, user 0m0.728s, sys 0m0.054s.	2021-07-10 03:47:45 -04:00
Joshua M. Boniface	0699c48d10	Fix bad schema path name [tag: v0.9.24]	2021-07-09 16:47:09 -04:00
Joshua M. Boniface	551bae2518	Bump version to 0.9.24	2021-07-09 15:58:36 -04:00
Joshua M. Boniface	4832245d9c	Handle non-RBD disks and non-RBD errors better	2021-07-09 15:48:57 -04:00
Joshua M. Boniface	2138f2f59f	Fail VM removal on disk removal failures Prevents bad states where the VM is "removed" but some of its disks remain due to e.g. stuck watchers. Rearrange the sequence so it goes stop, delete disks, then delete VM, and then return a failure if any of the disk(s) fail to remove, allowing the task to be rerun after fixing the problem.	2021-07-09 15:39:06 -04:00
Joshua M. Boniface	d1d355a96b	Avoid errors if stats data is None	2021-07-09 13:13:54 -04:00
Joshua M. Boniface	2b5dc286ab	Correct failure to get ceph_health data	2021-07-09 13:10:28 -04:00
Joshua M. Boniface	c0c9327a7d	Return an empty log if the value is None	2021-07-09 13:08:00 -04:00
Joshua M. Boniface	5ffabcfef5	Avoid failing if we can't get the future data	2021-07-09 13:05:37 -04:00
Joshua M. Boniface	330cf14638	Remove return statements in keepalive collectors These seem to bork the keepalive timer process, so just remove them and let it continue to press on.	2021-07-09 13:04:17 -04:00
Joshua M. Boniface	9d0eb20197	Mention UUID matching in vm list help	2021-07-09 11:51:20 -04:00
Joshua M. Boniface	3f5b7045a2	Allow raw listing of cluster names in CLI	2021-07-09 10:53:20 -04:00
Joshua M. Boniface	80fe96b24d	Add some additional docstrings	2021-07-07 12:28:08 -04:00
Joshua M. Boniface	80f04ce8ee	Remove connection renewal in state handler Regenerating the ZK connection was fraught with issues, including duplicate connections, strange failures to reconnect, and various other wonkiness. Instead let Kazoo handle states sensibly. Kazoo moves to SUSPENDED state when it loses connectivity, and stays there indefinitely (based on cursory tests). And Kazoo seems to always resume from this just fine on its own. Thus all that hackery did nothing but complicate reconnection. This therefore turns the listener into a purely informational function, providing logs of when/why it failed, and we also add some additional output messages during initial connection and final disconnection.	2021-07-07 11:55:12 -04:00
Joshua M. Boniface	65d14ccd92	Return to all command-based Ceph gathering Using the Rados module was very problematic, specifically because it had no sensible timeout parameters and thus would hang for many seconds. This has poor implications since it blocks further keepalives. Instead, remove the Rados usage entirely and go back completely to using manual OS commands to gather this information. While this may cause PID exhaustion more quickly it's worthwhile to avoid failure scenarios when Ceph stats time out. Closes #137	2021-07-06 11:30:45 -04:00
Joshua M. Boniface	adc022f55d	Add missing install of pvcapid-worker.sh [tag: v0.9.23]	2021-07-06 09:40:42 -04:00
Joshua M. Boniface	7082982a33	Bump version to 0.9.23	2021-07-05 23:40:32 -04:00
Joshua M. Boniface	5b6ef71909	Ensure daemon mode is updated on startup Fixes the side effect of the previous bug during deploys of 0.9.22.	2021-07-05 23:39:23 -04:00
Joshua M. Boniface	a8c28786dd	Better handle empty ipaths in schema When trying to write to sub-item paths that don't yet exist, the previous method would just blindly write to whatever the root key is, which is never what we actually want. Instead, check explicitly for a "base path" situation, and handle that. Then, if we try to get a subpath that isn't valid, return None. Finally in the various functions, if the path is None, just continue (or return false/None) and (try to) chug along.	2021-07-05 23:35:03 -04:00
Joshua M. Boniface	be7b0be8ed	Fix typo in schema path name	2021-07-05 23:23:23 -04:00
Joshua M. Boniface	c45804e8c1	Revert "Return none if a schema path is not found" This reverts commit `b1fcf6a4a5`.	2021-07-05 23:16:39 -04:00
Joshua M. Boniface	b1fcf6a4a5	Return none if a schema path is not found This can cause overwriting of unintended keys, so should not be happening. Will have to find the bugs this causes.	2021-07-05 17:15:55 -04:00
Joshua M. Boniface	47f39a1a2a	Fix ordering issue in test-cluster script [tag: v0.9.22]	2021-07-05 15:14:34 -04:00
Joshua M. Boniface	54f82a3ea0	Fix bug in VM network list with SR-IOV	2021-07-05 15:14:01 -04:00
Joshua M. Boniface	37cd278bc2	Bump version to 0.9.22	2021-07-05 14:18:51 -04:00
Joshua M. Boniface	47a522f8af	Use manual zkhandler creation in Benchmark job Like the other Celery job this does not work properly with the ZKConnection decorator due to conflicting "self", so just connect manually exactly like the provisioner task does.	2021-07-05 14:12:56 -04:00
Joshua M. Boniface	087c23859c	Adjust layout of Provisioner lists output Use the same header format as the others.	2021-07-05 14:06:22 -04:00
Joshua M. Boniface	6c21a52714	Adjust layout of Ceph/storage lists output Use the same header format as node, VM, and network lists.	2021-07-05 12:57:18 -04:00
Joshua M. Boniface	afde436cd0	Adjust layout of Network lists output Use the same header format as node and VM lists.	2021-07-05 11:48:39 -04:00
Joshua M. Boniface	1fe71969ca	Adjust layout of VM list output Matches the new node list output format with the additional header line, as well as revamps some other aspects: 1. Adjusts the UUID to be under the name in the info output. 2. Removes the UUID from the list output to save space, because this is generally not needed in day-to-day quick-list output. 3. Renames the "Node" header to "Current" to better reflect what that column actually means and avoid conflicting with the parent header.	2021-07-05 10:52:48 -04:00
Joshua M. Boniface	2b04df22a6	Add PVC version to node information output Also adjusts the layout of the node list output to avoid excessively long lines. Adds another header line with categories and spacing dashes for easier visual parsing.	2021-07-05 10:45:20 -04:00
Joshua M. Boniface	a69105569f	Add node PVC version data to Node information Allows API client to see the currently-active version of the node daemon.	2021-07-05 09:57:38 -04:00
Joshua M. Boniface	21a1a7da9e	Fix bad schema reference Not sure how this didn't cause an issue until now, but the wrong key path was used and this was getting unexpected data with the newly-added version string instead of the proper mode string.	2021-07-05 09:53:51 -04:00
Joshua M. Boniface	e44f3d623e	Remove unnecessary try/except blocks from VM reads The zkhandler read() function takes care of ensuring there is a None value returned if these fail, so these aren't required. Makes the code a fair bit more readable here.	2021-07-02 12:01:58 -04:00
Joshua M. Boniface	f0fd3d3f0e	Make extra sure VMs terminate when told When doing a stop_vm or terminate_vm, check again after 0.2 seconds and try re-terminating if it's still running. Covers cases where a VM doesn't stop if given the 'stop' state.	2021-07-02 11:40:34 -04:00
Joshua M. Boniface	f12de6727d	Adjust logo slightly and add debug state	2021-07-02 02:32:08 -04:00
Joshua M. Boniface	e94f5354e6	Update startup messages with new ASCII logo	2021-07-02 02:21:30 -04:00
Joshua M. Boniface	c51023ba81	Add profiler to keepalive function	2021-07-02 01:55:15 -04:00
Joshua M. Boniface	61465ef38f	Add profiler to several other functions in API	2021-07-02 01:53:19 -04:00
Joshua M. Boniface	64c6b04352	Ensure all edited files are restored	2021-07-02 01:50:25 -04:00
Joshua M. Boniface	20542c3653	Add profiler to cluster status function	2021-07-01 17:35:29 -04:00
Joshua M. Boniface	00b503824e	Set unstable version in API and CLI too	2021-07-01 17:35:11 -04:00
Joshua M. Boniface	43009486ae	Move Ceph pool/volume list assembly to thread pool Same reasons as the VM list, though less impactful.	2021-07-01 17:33:13 -04:00
Joshua M. Boniface	58789f1db4	Move VM list assembly to thread pool This helps parallelize the numerous Zookeeper calls a little bit, at least within the bounds of the GIL, to improve performance when getting a large list of VMs. The max_workers value is capped at 32 to avoid causing too many threads during concurrent executions, but still provides a noticeable speedup (on the order of 0.2-0.4 seconds with 75 VMs, scaling up further as counts grow).	2021-07-01 17:32:47 -04:00
Joshua M. Boniface	baf4c3fbc7	Add performance profiler function Usable anywhere that the global daemon "config" parameter can be passed in (e.g. pvcapid/helper.py, pvcnoded/Daemon.py, etc.). Stores results in a subdirectory of the PVC logdir called "profiler" if this directory can be created, or prints results. The debug config parameter ensures that the profiler can be added to functions and not run unless the server is explicitly in debug mode. Might not be useful as I don't initially plan to add this to every function (only when investigating performance problems), but this flexibility allows that to change later.	2021-07-01 14:01:33 -04:00
Joshua M. Boniface	e093efceb1	Add NoNodeError handlers in ZK locks Instead of looping 5+ times acquiring an impossible lock on a nonexistent key, just fail on a different error and return failure immediately. This is likely a major corner case that shouldn't happen, but better to be safe than 500.	2021-07-01 01:17:38 -04:00
Joshua M. Boniface	a080598781	Avoid superfluous ZK exists calls These cause a major (2x) slowdown in read calls since Zookeeper connections are expensive/slow. Instead, just try the thing and return None if there's no key there. Also wrap the children command in similar error handling since that did not exist and could likely cause some bugs at some point.	2021-07-01 01:15:51 -04:00
Joshua M. Boniface	39e82ee426	Cast base schema version to int Or all our comparisons will fail later and nodes can't start. [tag: v0.9.21]	2021-06-30 09:40:33 -04:00
Joshua M. Boniface	fe0a1d582a	Bump version to 0.9.21	2021-06-29 19:21:31 -04:00
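
Commit `58789f1db4` in the list above describes moving VM list assembly into a thread pool whose max_workers value is capped at 32. The following is a minimal Python sketch of that general pattern, not the actual PVC code: `fetch_vm_info`, `assemble_vm_list`, and `MAX_WORKERS` are hypothetical names, and the fetch function is a placeholder standing in for the per-VM Zookeeper reads.

```python
from concurrent.futures import ThreadPoolExecutor

# Cap taken from the commit message; the constant name is an assumption.
MAX_WORKERS = 32


def fetch_vm_info(name):
    """Hypothetical stand-in for the per-VM Zookeeper reads."""
    return {"name": name, "state": "start"}


def assemble_vm_list(vm_names, fetch=fetch_vm_info):
    """Gather per-VM information concurrently, preserving input order."""
    if not vm_names:
        return []
    # Never start more threads than there are VMs, and never more than the cap.
    workers = min(MAX_WORKERS, len(vm_names))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as vm_names.
        return list(executor.map(fetch, vm_names))


if __name__ == "__main__":
    vms = assemble_vm_list(["vm{}".format(i) for i in range(75)])
    print(len(vms), vms[0]["name"])
```

Because the workers are threads, any speedup comes from overlapping I/O waits rather than from parallel CPU work, which matches the commit's "within the bounds of the GIL" caveat.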


2503 Commits