When a container restarts before kubelet restarts, containerMap has
multiple entries (old exited + new running). GetContainerID() may
return the exited container, causing the running check to fail. Fixed
by checking if ANY container for the pod/name is running.
Also filter terminal pods from podresources since they no longer
consume resources, and fix test error handling to avoid exiting
Eventually immediately on transient errors.
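A minimal sketch of the first fix, using simplified stand-in types rather than the kubelet's real containerMap structures:

```go
package main

import "fmt"

// containerInfo is a simplified stand-in for a containerMap entry; the real
// structure lives in the kubelet's containermap package and differs in detail.
type containerInfo struct {
	podUID        string
	containerName string
	running       bool
}

// anyContainerRunning illustrates the fix: instead of trusting a single
// GetContainerID() lookup (which may return the old, exited container after a
// container restart), succeed if ANY entry for the pod/container name is running.
func anyContainerRunning(entries map[string]containerInfo, podUID, name string) bool {
	for _, ci := range entries {
		if ci.podUID == podUID && ci.containerName == name && ci.running {
			return true
		}
	}
	return false
}

func main() {
	entries := map[string]containerInfo{
		"old-id": {podUID: "uid-1", containerName: "app", running: false}, // exited before the kubelet restart
		"new-id": {podUID: "uid-1", containerName: "app", running: true},  // the restarted container
	}
	fmt.Println(anyContainerRunning(entries, "uid-1", "app")) // true
}
```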
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
With kubernetes/kubernetes#132028 merged, pods in terminal states are no longer
reported by the podresources API. The previous test logic accounted for the old
behavior where even failed pods appeared in the API response (tracked under k/k
issue #119423). As a result, we used to expect the failed test pod to be present in the
response but with an empty device set.
This change updates the test to reflect the new, correct behavior:
1. The failed test pod should no longer appear in the podresources API response.
2. The test now asserts absence of the failed pod rather than checking for an empty device assignment.
This simplifies the test logic and aligns expectations with the current upstream
behavior of the podresources API.
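A sketch of the updated assertion, using the podresources v1 client types; the helper name is illustrative:

```go
package e2enode

import (
	"github.com/onsi/gomega"
	kubeletpodresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectPodAbsentFromPodResources asserts the new behavior: a terminal (failed)
// pod must not appear in the podresources List() response at all, rather than
// appearing with an empty device set as the old test logic expected.
func expectPodAbsentFromPodResources(resp *kubeletpodresourcesv1.ListPodResourcesResponse, ns, name string) {
	for _, pr := range resp.GetPodResources() {
		gomega.Expect(pr.GetNamespace() == ns && pr.GetName() == name).To(gomega.BeFalse(),
			"terminal pod %s/%s should not be reported by the podresources API", ns, name)
	}
}
```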
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In the device plugin node reboot e2e test, the test previously waited a
short period for the resources exported by the sample device plugin to
appear on the local node. On slower test nodes, the plugin may take
longer to register, causing flakes where the expected devices are not
yet available.
This change increases the polling duration to 2 minutes, ensuring the test
waits long enough for the expected device capacity and allocatable resources
to appear, improving test stability.
This commit also makes the assertion message more explicit, making failures
clearer and the test easier to debug.
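A sketch of the extended wait, polling the node object gomega-style; the helper name, resource handling, and polling interval are illustrative:

```go
package e2enode

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForSampleDeviceResources polls the node until the sample device plugin's
// resource shows up in both capacity and allocatable, giving slow plugin
// registration up to 2 minutes instead of the previous short wait.
func waitForSampleDeviceResources(ctx context.Context, cs kubernetes.Interface, nodeName string, resourceName v1.ResourceName, expected int64) {
	gomega.Eventually(ctx, func(ctx context.Context) int64 {
		node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return 0
		}
		capacity := node.Status.Capacity[resourceName]
		allocatable := node.Status.Allocatable[resourceName]
		if capacity.Value() != allocatable.Value() {
			return 0
		}
		return allocatable.Value()
	}).WithTimeout(2*time.Minute).WithPolling(5*time.Second).Should(gomega.Equal(expected),
		"expected node %q to expose %d %q devices in capacity and allocatable", nodeName, expected, resourceName)
}
```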
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In the device plugin node reboot e2e test, the registration trigger
(control file deletion) was being executed immediately after pod creation.
This could create a race condition: the device plugin container might not
be fully running, causing the test to flake when devices were not reported
as available on the node.
This change explicitly waits for the sample device plugin pod to reach the
Running/Ready state before deleting the registration control file. This
ensures that the device plugin is ready to register its devices with the
kubelet, eliminating a possible source of test flakiness.
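A sketch of the added wait, using plain client-go polling; the helper name and timeouts are illustrative:

```go
package e2enode

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDevicePluginPodReady blocks until the sample device plugin pod is
// Running and Ready, so that deleting the registration control file afterwards
// cannot race with the plugin container still starting up.
func waitForDevicePluginPodReady(ctx context.Context, cs kubernetes.Interface, ns, podName string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors while polling
		}
		if pod.Status.Phase != v1.PodRunning {
			return false, nil
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == v1.PodReady {
				return cond.Status == v1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}
```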
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
With the device plugin node reboot test fixed, we can see in testgrid
[node-kubelet-containerd-flaky](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-flaky)
that the test is passing consistently and we can remove the flaky label.
With the test not flaky anymore, we can validate new PRs against it
and ensure we don't cause regressions.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
We have an e2e test which tries to ensure device plugin assignments to pods
are kept across node reboots. This test has been permafailing for many weeks
at the time of writing (xref: #128443).
The problem is that closer inspection reveals the test was well intentioned,
but puzzling:
The test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _expects the device assignment to be
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97
A reader can legitimately wonder whether this means the device will be kept busy forever.
Luckily, this is not the case. The test, however, embodied the kubelet
behavior of the time, in turn caused by #103979:
the device manager used to record the last admitted pod and forcibly add it
to the list of active pods. The retention logic had room for exactly one
pod, the last one that attempted admission.
This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported
as allocated to the failed pod.
This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and because of the chain of events above the pod failed admission yet
was still reported as owning the device.
What happened, however, was that the next pod attempting admission would
replace the previous pod in the device manager data, so the previous pod
was no longer forced into the active list, and its assignment was correctly
cleared once the cleanup code ran.
And the cleanup code runs, among other occasions, every time the device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.
Later, in PR https://github.com/kubernetes/kubernetes/pull/120661
the forced retention logic was removed from all the resource managers,
thus also from the device manager, and this is what caused the permafailure.
Because of all of the above, it should be evident that the e2e test was
actually enforcing a very specific behavior that never really worked as
intended, and which was also quite puzzling for users.
The best we can do is to fix the test to record and ensure that
pods which did fail admission _do not_ retain device assignment.
Unfortunately, we _cannot_ guarantee the desirable property that pods
which went running retain their device assignment across node reboots.
In the kubelet restart flow, all pods race to be admitted, and no order is
enforced between device plugin pods and application pods. An application
pod keeps its assignment only if it is lucky enough to _lose_ the race with
both the device plugin (which must go running before the app pod does) and
_also_ with the kubelet (which needs to mark the devices healthy before the
pod tries admission).
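A sketch of what the fixed test can assert, using the podresources v1 types; the helper name is hypothetical:

```go
package e2enode

import (
	kubeletpodresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// devicesForPod collects every device ID the podresources API reports for the
// given pod. The fixed test can use it to assert that a pod which failed
// admission holds no devices, instead of the old, accidental expectation that
// it still owned one.
func devicesForPod(resp *kubeletpodresourcesv1.ListPodResourcesResponse, ns, name string) []string {
	var devices []string
	for _, pr := range resp.GetPodResources() {
		if pr.GetNamespace() != ns || pr.GetName() != name {
			continue
		}
		for _, cnt := range pr.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				devices = append(devices, dev.GetDeviceIds()...)
			}
		}
	}
	return devices
}

// Example assertion in the fixed test (sketch):
//   gomega.Expect(devicesForPod(resp, failedPod.Namespace, failedPod.Name)).To(gomega.BeEmpty())
```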
Signed-off-by: Francesco Romani <fromani@redhat.com>
Feature:DevicePluginProbe and NodeFeature:DevicePluginProbe
are not used by any of the test-infra jobs.
This commit renames NodeFeature:DevicePluginProbe to NodeFeature:DevicePlugin
and removes Feature:DevicePlugin and Feature:DeviceManager to avoid
having both Feature and NodeFeature tags for the same feature.
NOTE: Test-infra SIG-Node jobs should focus on
NodeFeature:DevicePlugin to run generic Device Plugin tests.
This changes the test registration so that, for tags for which the framework
has a dedicated API (features, feature gates, slow, serial, etc.), those APIs
are used.
Arbitrary, custom tags are still left in place for now.
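A rough sketch of the intended registration style; the helper and label names (SIGDescribe, nodefeature.DevicePlugin, framework.WithSerial) are assumptions about the current e2e framework API rather than verified signatures:

```go
package e2enode

import (
	"context"

	"github.com/onsi/ginkgo/v2"

	"k8s.io/kubernetes/test/e2e/framework"
	"k8s.io/kubernetes/test/e2e/nodefeature"
)

// Tags with a dedicated framework API (features, node features, Serial, Slow, ...)
// are expressed through that API instead of free-form text in the spec name.
var _ = SIGDescribe("Device Plugin", nodefeature.DevicePlugin, framework.WithSerial(), func() {
	f := framework.NewDefaultFramework("device-plugin-test")
	ginkgo.It("reports allocatable sample devices", func(ctx context.Context) {
		_ = f // test body elided in this sketch
	})
})
```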
This test depends on CDI support in a runtime and doesn't work
with out-of-the-box Containerd. Marking it as a NodeSpecialFeature
should fix Containerd CI job failures.
The kubelet restarts workload pods with an exponential back-off delay,
capped at a maximum of 5 minutes. Waiting only 1 minute may therefore fall
entirely within the back-off window.
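A sketch of a wait that outlasts the maximum back-off; the helper name, timeout, and polling interval are illustrative:

```go
package e2enode

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForContainerRestart waits for the pod's first container to restart with a
// budget larger than the 5 minute back-off cap, so the wait cannot land entirely
// inside the back-off window.
func waitForContainerRestart(ctx context.Context, cs kubernetes.Interface, ns, podName string, previousRestarts int32) {
	gomega.Eventually(ctx, func(ctx context.Context) int32 {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
		if err != nil || len(pod.Status.ContainerStatuses) == 0 {
			return previousRestarts
		}
		return pod.Status.ContainerStatuses[0].RestartCount
	}).WithTimeout(6*time.Minute).WithPolling(10*time.Second).Should(gomega.BeNumerically(">", previousRestarts),
		"expected the container in pod %s/%s to restart despite back-off", ns, podName)
}
```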
Signed-off-by: Ruquan Zhao <ruquan.zhao@arm.com>
Make sure orphaned pods (pods deleted while the kubelet is down) are
handled correctly.
Outline:
1. create a pod (not static pod)
2. stop kubelet
3. while kubelet is down, force delete the pod on API server
4. restart kubelet
the pod becomes an orphaned pod and is expected to be killed by HandlePodCleanups.
There is a similar test already, but here we want to check device
assignment.
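A sketch of the outline above; stopKubelet/restartKubelet are passed in as stand-ins for the e2e_node helpers that manage the kubelet service:

```go
package e2enode

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// runOrphanedPodScenario follows the outline above. The pod is assumed to be
// already created and running (step 1) and to not be a static pod.
func runOrphanedPodScenario(ctx context.Context, cs kubernetes.Interface, ns, podName string, stopKubelet, restartKubelet func()) error {
	// 2. stop the kubelet
	stopKubelet()

	// 3. while the kubelet is down, force delete the pod on the API server
	if err := cs.CoreV1().Pods(ns).Delete(ctx, podName, metav1.DeleteOptions{
		GracePeriodSeconds: ptr.To[int64](0),
	}); err != nil {
		return err
	}

	// 4. restart the kubelet; the now-orphaned pod is expected to be killed by
	//    HandlePodCleanups, releasing its device assignment.
	restartKubelet()
	return nil
}
```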
Signed-off-by: Francesco Romani <fromani@redhat.com>
The recently added e2e device plugin test covering node reboot
works fine when it runs each time on a fresh environment (e.g. CI), but
it doesn't correctly handle a partial setup when run repeatedly on
the same instance (developer setup).
To accommodate both flows, we extend the error management, checking
more error conditions in the flow.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Fix e2e device manager tests.
Most notably, the workload pods need to survive a kubelet
restart. Update the tests to reflect that.
Signed-off-by: Francesco Romani <fromani@redhat.com>
image_list.go is one of the files included in the non-test variant of the Go build list, but its getSampleDevicePluginPod function references the readDaemonSetV1OrDie function defined in device_plugin_test.go, which is included only in the test variant of the Go build list (the file name is *_test.go).
As a result, "go build" fails with the undefined reference error.
In practice, that may not be an issue since k8s project contributors aren't meant to run go build on this package. However, tools that depend on go build to operate - e.g., gopls or govulncheck ./... - will report this as an error.
Fix this error and make the test/e2e package pass go build by making this file test-only source code as well.
Additional test cases added:
Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration)
Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart)
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Add a test suite to simulate node reboot (achieved by removing pods
using CRI API before kubelet is restarted).
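A sketch of the reboot simulation against the CRI v1 RuntimeService; how the client is obtained is left out, and the exact method signatures may differ slightly across releases:

```go
package e2enode

import (
	"context"

	internalapi "k8s.io/cri-api/pkg/apis"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// removeAllPodSandboxes simulates a node reboot: with the kubelet stopped,
// every pod sandbox is stopped and removed through the CRI API, then the
// kubelet is restarted by the caller.
func removeAllPodSandboxes(ctx context.Context, rs internalapi.RuntimeService) error {
	sandboxes, err := rs.ListPodSandbox(ctx, &runtimeapi.PodSandboxFilter{})
	if err != nil {
		return err
	}
	for _, sb := range sandboxes {
		// Stop before removing, mirroring what a reboot does to running pods.
		if err := rs.StopPodSandbox(ctx, sb.Id); err != nil {
			return err
		}
		if err := rs.RemovePodSandbox(ctx, sb.Id); err != nil {
			return err
		}
	}
	return nil
}
```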
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This test captures the scenario where, after a kubelet restart, an
application pod comes up while the device plugin pod hasn't re-registered
itself, and the pod fails with an admission error. It is worth noting that
once the device plugin pod has registered itself, another
application pod requesting devices ends up running
successfully.
For the test case where the kubelet is restarted and the device plugin
has re-registered without involving a pod restart, since the pod ends up
with an admission error after the kubelet restart, we cannot be certain
which device the second pod (pod2) would get. As long as it gets a device,
we consider the test to pass.
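A sketch of the corresponding assertion, reusing the hypothetical devicesForPod helper sketched earlier: the test only requires that pod2 holds at least one device, without pinning which one.

```go
package e2enode

import (
	"github.com/onsi/gomega"
	kubeletpodresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectAnyDeviceAssigned only checks that the pod holds at least one device,
// since the exact device pod2 receives after the kubelet restart and device
// plugin re-registration cannot be predicted.
func expectAnyDeviceAssigned(resp *kubeletpodresourcesv1.ListPodResourcesResponse, ns, podName string) {
	gomega.Expect(devicesForPod(resp, ns, podName)).ToNot(gomega.BeEmpty(),
		"expected pod %s/%s to be assigned at least one device", ns, podName)
}
```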
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Capture explicitly a test case pertaining to kubelet restart
but with no pod restart and device plugin re-registration.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The sleep interval depends on whether the test case requires a pod restart,
so we define constants to represent the two sleep intervals to be used in the
corresponding test cases.
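A sketch of such constants; the names and values here are illustrative, not the exact ones used by the tests:

```go
package e2enode

import "time"

// Illustrative names and values, not the exact ones used by the tests.
const (
	// long enough for the kubelet to restart the test pod's container
	sleepIntervalWithPodRestart = 1 * time.Minute
	// a short settle period suffices when the test does not restart the pod
	sleepIntervalNoPodRestart = 10 * time.Second
)
```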
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Explicitly state that the test involves kubelet restart and device plugin
re-registration (no pod restart)
We remove the part of the code where we wait for the pod to restart, as this
test case should no longer involve a pod restart.
In addition to that, we use `waitForNodeReady` instead of `WaitForAllNodesSchedulable`
to ensure that the node is ready for pods to be scheduled on it.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
Rather than testing both pod restart and kubelet restart,
we change the tests to handle just the pod restart scenario.
Clarify the test purpose and add an extra check to tighten the test.
We will add additional tests to cover kubelet restart scenarios
in subsequent commits.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
With this change the error messages are more helpful and test
failures are easier to troubleshoot.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>