Commit graph

95 commits

Author SHA1 Message Date
Sascha Grunert
172a65c71d
Fix device plugin admission failure after container restart
When a container restarts before kubelet restarts, containerMap has
multiple entries (old exited + new running). GetContainerID() may
return the exited container, causing the running check to fail. Fixed
by checking if ANY container for the pod/name is running.

Also filter terminal pods from podresources since they no longer
consume resources, and fix test error handling to avoid exiting
Eventually immediately on transient errors.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2026-01-12 11:55:25 +01:00
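The gist of the fix above can be sketched in a few lines of Go. This is a minimal, self-contained illustration with hypothetical names (`containerEntry`, `anyContainerRunning`); the real data structure lives in the kubelet's containermap package and has a different API.

```go
package main

import "fmt"

// containerEntry is a hypothetical, simplified view of one containerMap
// record. After a container restart, the map can hold both the old exited
// instance and the new running one for the same pod/container name.
type containerEntry struct {
	podUID        string
	containerName string
	containerID   string
	running       bool
}

// anyContainerRunning mirrors the idea of the fix: instead of resolving a
// single container ID (which may point at the old, exited instance), check
// whether ANY entry for the pod/name pair is running.
func anyContainerRunning(entries []containerEntry, podUID, name string) bool {
	for _, e := range entries {
		if e.podUID == podUID && e.containerName == name && e.running {
			return true
		}
	}
	return false
}

func main() {
	entries := []containerEntry{
		{"pod-1", "app", "old-exited", false}, // stale entry left by the restart
		{"pod-1", "app", "new-running", true}, // the restarted container
	}
	fmt.Println(anyContainerRunning(entries, "pod-1", "app")) // true
}
```

With the single-ID lookup, the stale first entry could be picked and the running check would fail; scanning all entries makes the check robust to duplicates.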
Kubernetes Prow Robot
036e624317
Merge pull request #134918 from mariafromano-25/cleanup-sidecar-feature
SidecarContainer feature to Node Conformance
2025-10-28 15:22:08 -07:00
Maria Romano Silva
a277269159 updating sidecar feature to node conformance 2025-10-27 23:43:43 +00:00
Swati Sehgal
1e3a6e18d0 node: e2e: update podresources check post fix of kubernetes#119423
With kubernetes/kubernetes#132028 merged, pods in terminal states are no longer
reported by the podresources API. The previous test logic accounted for the old
behavior where even failed pods appeared in the API response (tracked under k/k
issue #119423). As a result, we used to expect the failed test pod to be present in the
response but with an empty device set.

This change updates the test to reflect the new, correct behavior:
1. The failed test pod should no longer appear in the podresources API response.
2. The test now asserts absence of the failed pod rather than checking for an empty device assignment.

This simplifies the test logic and aligns expectations with the current upstream
behavior of the podresources API.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2025-10-23 14:12:21 +01:00
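The updated assertion described above can be sketched as follows. The types and helper names here (`podResources`, `podInResponse`) are illustrative stand-ins, not the actual podresources API types from k8s.io/kubelet.

```go
package main

import "fmt"

// podResources is a pared-down stand-in for one entry of the podresources
// List() response.
type podResources struct {
	Name      string
	Namespace string
}

// podInResponse reflects the new test logic: after kubernetes/kubernetes#132028,
// a pod in a terminal state should simply be absent from the response, so the
// test asserts absence instead of checking for an empty device set.
func podInResponse(resp []podResources, ns, name string) bool {
	for _, p := range resp {
		if p.Namespace == ns && p.Name == name {
			return true
		}
	}
	return false
}

func main() {
	resp := []podResources{{Name: "running-pod", Namespace: "test-ns"}}
	// The failed pod no longer appears in the response, so this reports false.
	fmt.Println(podInResponse(resp, "test-ns", "failed-pod"))
}
```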
Swati Sehgal
13511897bd node: e2e: extend wait for resources exported by sample device plugin
In the device plugin node reboot e2e test, the test previously waited a
short period for the resources exported by the sample device plugin to
appear on the local node. On slower test nodes, the plugin may take
longer to register, causing flakes where the expected devices are not
yet available.

This change increases the polling duration to 2 minutes, ensuring the test
waits long enough for the expected device capacity and allocatable resources
to appear, improving test stability.

This commit also updates the assertion message to be more explicit, making
failures clearer and improving test reliability.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2025-10-21 13:59:17 +01:00
Swati Sehgal
c2e1fdeb7a node: e2e: Ensure device plugin pod is Running/Ready before registration
In the device plugin node reboot e2e test, the registration trigger
(control file deletion) was being executed immediately after pod creation.
This could create a race condition: the device plugin container might not
be fully running, causing the test to flake when devices were not reported
as available on the node.

This change explicitly waits for the sample device plugin pod to reach the
Running/Ready state before deleting the registration control file. This
ensures that the device plugin is ready to register its devices with the
kubelet, eliminating a possible source of test flakiness.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2025-10-21 13:41:30 +01:00
Gavin Lam
908fb0266d
Fix gocritic issues
Signed-off-by: Gavin Lam <gavin.oss@tutamail.com>
2025-07-24 23:23:43 -04:00
carlory
bd30b0adef remove generally available feature-gate DevicePluginCDIDevices
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-07-15 16:55:12 +08:00
Kubernetes Prow Robot
c029e2715e
Merge pull request #128355 from lengrongfu/feat/add-log
add device-plugin-test e2e log
2025-03-20 20:46:31 -07:00
Swati Sehgal
82f0303f89 node: e2e: Remove flaky label as device plugin reboot test is deflaked
With the device plugin node reboot test fixed, we can see in testgrid
[node-kubelet-containerd-flaky](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-flaky)
that the test is passing consistently, so we can remove the flaky label.

With the test not flaky anymore, we can validate new PRs against it
and ensure we don't cause regressions.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2025-01-29 11:12:40 +00:00
Kubernetes Prow Robot
29bf17b6cf
Merge pull request #129168 from kannon92/drop-node-features
[KEP-3041] - remove nodefeatures from k/k repo
2025-01-23 12:07:29 -08:00
Kubernetes Prow Robot
4f979c9db8
Merge pull request #129010 from ffromani/e2e-fix-device-plugin-reboot-test
node: e2e: fix device plugin reboot test
2025-01-23 12:07:22 -08:00
Sotiris Salloumis
c5fc4193bb Fix pod delete issues in podresize tests 2025-01-21 07:25:14 +01:00
Kevin Hannon
bae4122f56 deprecate nodefeature for feature labels 2025-01-20 17:02:59 -05:00
Kubernetes Prow Robot
2d0a4f7556
Merge pull request #129166 from kannon92/move-node-features-to-features
[KEP-3041]: deprecate nodefeature for feature labels
2025-01-14 20:02:33 -08:00
Kevin Hannon
8495df64b2 deprecate nodefeature for feature labels 2024-12-17 13:58:12 -05:00
Kevin Hannon
6a608c3cdb drop NodeSpecialFeature and NodeAlphaFeature from e2e-node 2024-12-16 09:29:04 -05:00
Francesco Romani
29d26297a1 e2e: node: fix misleading device plugin test
We have an e2e test which tries to ensure device plugin assignments to pods are kept
across node reboots. This test has been permafailing for many weeks at the
time of writing (xref: #128443).

The problem is that closer inspection reveals the test was well-intentioned, but
puzzling:
The test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _ensure the device assignment is
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever.

This is not the case, luckily. The test, however, embodied the kubelet's
behavior at the time, in turn caused by #103979

The device manager used to record the last admitted pod and forcibly add it
to the list of active pods. The retention logic had space for exactly one
pod: the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
to clear the registration, so the device was still (mis)reported
allocated to the failed pod.

This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and because of the chain of events above the pod failed admission yet
was still reported as owning the device.

What happened, however, was that the next pod attempting admission would
replace the previous pod in the device manager data, so the previous
pod was no longer forced into the active list, and its
assignments were correctly cleared once the cleanup code ran.
And the cleanup code runs, among other things, every time the device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.

Later, in PR https://github.com/kubernetes/kubernetes/pull/120661,
the forced retention logic was removed from all the resource managers,
thus also from the device manager, and this is what caused the permafailure.

Given all of the above, it should be evident that the e2e test was
actually enforcing a very specific behavior that did not work as intended,
and which was also quite puzzling for users.

The best we can do is to fix the test to record and ensure that
pods which did fail admission _do not_ retain device assignment.

Unfortunately, we _cannot_ guarantee the desirable property that
pods which go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted, and there is no
order enforced between device plugin pods and application pods.
An application pod keeps its assignment only if it is lucky enough to
_lose_ the race against both the device plugin (which must go running
before the app pod does) and the kubelet (which must mark devices healthy
before the pod tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-12-04 17:06:27 +01:00
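The removed retention behavior described in the commit above can be modeled in a few lines. This is an illustrative sketch only, with hypothetical names (`deviceManager`, `admit`, `cleanup`); the real logic lived in pkg/kubelet/cm/devicemanager and was considerably more involved.

```go
package main

import "fmt"

// deviceManager models the old behavior: it remembered only the LAST pod
// that attempted admission and force-kept it in the active set, so its
// devices were still reported as allocated even after a failed admission.
type deviceManager struct {
	lastAdmitted string              // capacity for exactly one retained pod
	assignments  map[string][]string // podUID -> assigned devices
}

func (m *deviceManager) admit(podUID string, devices []string) {
	m.lastAdmitted = podUID // replaces the previously retained pod
	m.assignments[podUID] = devices
}

// cleanup clears assignments of pods that are neither active nor retained;
// per the commit, it ran on every allocation and every podresources query.
func (m *deviceManager) cleanup(activePods map[string]bool) {
	for uid := range m.assignments {
		if !activePods[uid] && uid != m.lastAdmitted {
			delete(m.assignments, uid)
		}
	}
}

func main() {
	m := &deviceManager{assignments: map[string][]string{}}

	m.admit("pod-a", []string{"dev0"}) // pod-a fails admission but is retained
	m.cleanup(map[string]bool{})
	fmt.Println(len(m.assignments["pod-a"])) // 1: still (mis)reported as allocated

	m.admit("pod-b", []string{"dev1"}) // pod-b replaces pod-a as last admitted
	m.cleanup(map[string]bool{})
	fmt.Println(len(m.assignments["pod-a"])) // 0: pod-a's assignment is cleared
}
```

Once the forced retention was removed in kubernetes/kubernetes#120661, the first `cleanup` call already clears the failed pod, which is exactly what broke the old test's expectation.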
Ed Bartosh
3aa95dafea e2e_node: refactor stopping and restarting kubelet
Moved Kubelet health checks from test cases to the stopKubelet API.
This should make the API cleaner and easier to use.
2024-11-06 11:34:48 +02:00
rongfu.leng
6f97d06377 add device-plugin-test e2e log
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-10-30 12:13:27 +00:00
Sascha Grunert
ff50da579e
Fix device plugin node ready test assertion
Introduced in d770dd695a and highly likely
the cause of the failing test:
https://github.com/kubernetes/kubernetes/issues/126915

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2024-08-26 14:56:59 +02:00
Kubernetes Prow Robot
3f306ae140
Merge pull request #126343 from SergeyKanzhelev/succeededPodReadmitted
Terminated pod should not be re-admitted
2024-08-22 16:32:09 +01:00
Sujay
223aedcf6b enhance boolean assertions 2024-07-31 15:58:15 +00:00
Sergey Kanzhelev
300128de65 succeeded pod is being re-admitted 2024-07-25 17:45:27 +00:00
Ed Bartosh
ba7a74a0be e2e_node: fix DevicePlugin feature flags
Feature:DevicePluginProbe and NodeFeature:DevicePluginProbe
are not used by any of the test-infra jobs.

This commit renames NodeFeature:DevicePluginProbe to NodeFeature:DevicePlugin
and removes Feature:DevicePlugin and Feature:DeviceManager to avoid
having both Feature and NodeFeature tags for the same feature.

NOTE: Test-infra SIG-Node jobs should focus on the
NodeFeature:DevicePlugin to run generic Device Plugins tests.
2024-05-05 23:19:50 +03:00
Kevin Hannon
43e0bd4304 mark flaky jobs as flaky and move them to a different job 2024-04-08 09:27:15 -04:00
Gunju Kim
dd890b899f
Make PodResources API include restartable init containers 2024-02-21 22:00:09 +09:00
Gunju Kim
1cd1092dd9
Remove NodeAlphaFeature label from sidecar e2e tests 2023-11-06 19:50:05 +09:00
Patrick Ohly
f2cfbf44b1 e2e: use framework labels
This changes the test registration so that, for tags for which the framework has a
dedicated API (features, feature gates, slow, serial, etc.), those APIs are
used.

Arbitrary, custom tags are still left in place for now.
2023-11-01 15:17:34 +01:00
Kubernetes Prow Robot
a5ff0324a9
Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Ed Bartosh
69b9d50f9d e2e_node: mark CDI test as NodeSpecialFeature
This test depends on CDI support in a runtime and doesn't work
with out-of-the-box Containerd. Marking it as a NodeSpecialFeature
should fix Containerd CI job failures.
2023-10-27 02:06:43 +03:00
Ed Bartosh
bbb4a88bbb e2e_node: implement DevicePluginCDIDevices test case 2023-10-24 12:35:33 +03:00
Gunju Kim
d2b803246a
Don't reuse the device allocated to the restartable init container 2023-10-17 18:28:29 +09:00
Kubernetes Prow Robot
ae9dc3330e
Merge pull request #120874 from ruquanzhao/fixDevicePluginProbeCI
fix DevicePluginProbe node-e2e: pod and kubelet restarts
2023-10-14 23:50:28 +02:00
RuquanZhao
babac47c6f fix DevicePluginProbe node-e2e: pod and kubelet restarts
The kubelet restarts failing pods with an exponential back-off delay,
capped at a maximum of 5 minutes. The previous 1-minute wait may happen to
fall within the back-off window.

Signed-off-by: Ruquan Zhao <ruquan.zhao@arm.com>
2023-10-11 10:15:32 +08:00
carlory
d5d7fb595e e2e_node: stop using deprecated framework.ExpectEqual 2023-10-09 16:42:42 +08:00
Kubernetes Prow Robot
900237fada
Merge pull request #118635 from ffromani/devmgr-check-pod-running
kubelet: devices: skip allocation for running pods
2023-07-15 05:43:16 -07:00
Francesco Romani
d78671447f e2e: node: add test to check device-requiring pods are cleaned up
Make sure orphaned pods (pods deleted while the kubelet is down) are
handled correctly.
Outline:
1. create a pod (not static pod)
2. stop kubelet
3. while kubelet is down, force delete the pod on API server
4. restart kubelet
the pod becomes an orphaned pod and is expected to be killed by HandlePodCleanups.

There is a similar test already, but here we want to check device
assignment.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 13:25:36 +02:00
Francesco Romani
5cf50105a2 e2e: node: devices: improve the node reboot test
The recently added e2e device plugin test covering node reboot
works fine when run each time in a fresh environment (e.g. CI), but
does not correctly handle partial setup when run repeatedly on
the same instance (developer setup).

To accommodate both flows, we extend the error management, checking
more error conditions in the flow.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 13:25:36 +02:00
Francesco Romani
b926aba268 e2e: node: devicemanager: update tests
Fix e2e device manager tests.
Most notably, the workload pods need to survive a kubelet
restart. Update tests to reflect that.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 13:25:36 +02:00
Stanislav Laznicka
7f532891c9
e2e tests: set all PSa labels instead of just enforcing 2023-06-21 15:05:13 +02:00
Hana (Hyang-Ah) Kim
17c17da97b e2e_node: move getSampleDevicePluginPod to device_plugin_test.go
image_list.go is one of the files included in the non-test variant Go build list, but its getSampleDevicePluginPod function references the readDaemonSetV1OrDie function defined in device_plugin_test.go, which is included in the test variant Go build list only. (The file name is *_test.go.)

As a result, "go build" fails with the undefined reference error.

In practice, that may not be an issue since k8s project contributors aren't meant to run go build on this package. However, tools that depend on go build to operate - e.g., gopls or govulncheck ./... - will report this as an error.

Fix this error and make the test/e2e package pass go build by moving this code into test-only source code as well.
2023-05-03 08:37:40 -04:00
Swati Sehgal
d727df1741 node: device-plugin: e2e: Additional test cases
Additional test cases added:
Keeps device plugin assignments across pod and kubelet restarts (no device plugin re-registration)
Keeps device plugin assignments after the device plugin has re-registered (no kubelet or pod restart)

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-28 14:45:21 +01:00
Swati Sehgal
3dbb741c97 node: device-plugin: add node reboot test scenario
Add a test suite to simulate node reboot (achieved by removing pods
using the CRI API before the kubelet is restarted).

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-28 14:45:21 +01:00
Swati Sehgal
a26f4d855d node: device-plugin: e2e: Capture pod admission failure
This test captures the scenario where, after a kubelet restart,
an application pod comes up while the device plugin pod hasn't
re-registered itself, and the pod fails with an admission error.
It is worth noting that once the device plugin pod has re-registered
itself, another application pod requesting devices ends up running
successfully.

For the test case where the kubelet is restarted and the device plugin
has re-registered without a pod restart, the pod ends up with an
admission error after the kubelet restart, so we cannot be certain
which device the second pod (pod2) would get. As long as it gets a
device, we consider the test to pass.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-28 14:44:42 +01:00
Swati Sehgal
0a58243159 node: device-plugin: e2e: Add test case for kubelet restart
Explicitly capture a test case pertaining to kubelet restart,
but with no pod restart or device plugin re-registration.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-26 15:33:00 +02:00
Swati Sehgal
0910080472 node: device-plugin: e2e: Provide sleep intervals via constants
Based on whether the test case requires a pod restart or not, the sleep
interval needs to differ, so we define constants representing the two
sleep intervals to be used in the corresponding test cases.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
2023-04-26 15:32:59 +02:00
Swati Sehgal
4a0f7c791f node: device-plugin: e2e: Update test description to make it explicit
Explicitly state that the test involves kubelet restart and device plugin
re-registration (no pod restart)

We remove the part of the code where we wait for the pod to restart as this
test case should no longer involve pod restart.

In addition, we use `waitForNodeReady` instead of `WaitForAllNodesSchedulable`
to ensure that the node is ready for pods to be scheduled on it.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
2023-04-26 15:32:57 +02:00
Swati Sehgal
fd459beeff node: device-plugin: e2e: Isolate test to pod restart scenario
Rather than testing out for both pod restart and kubelet restart,
we change the tests to just handle pod restart scenario.

Clarify the test purpose and add extra check to tighten the test.

We would be adding additional tests to cover kubelet restart scenarios
in subsequent commits.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-04-26 15:32:54 +02:00
Swati Sehgal
5ab4ba6205 node: device-plugin: e2e: Annotate device check with error message
With this change the error messages are more helpful, making test
failures easier to troubleshoot.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-26 15:32:21 +02:00