Commit graph

31 commits

Author SHA1 Message Date
Karthik Bhat
8962f08815 Remove deprecated test methods 2026-01-12 16:15:04 +05:30
Karthik Bhat
cd7d35fa3d Fix flaky TestDeviceTaintRule test by adjusting event handler status update logic
Co-authored-by: Pohly <patrick.ohly@intel.com>
2026-01-06 11:00:06 +05:30
Kubernetes Prow Robot
31fb6f64ef Merge pull request #135821 from pohly/dra-device-taints-owner
DRA device taints controller: add pohly to OWNERS
2025-12-18 19:24:38 -08:00
Patrick Ohly
9194bfe75b DRA device taints controller: add pohly to OWNERS
While the code is nominally owned by SIG Scheduling, in practice I am the one
who knows it best, so I should be a reviewer and should be able to merge simple
changes without additional approvals (will use cautiously!).
2025-12-18 12:07:52 +01:00
Patrick Ohly
b2151b1f51 DRA device taints: fix and simplify unit tests
Using `t` instead of `tCtx` is subtly wrong: the failure is attributed to the
parent test, not the sub-test. Using a separate function with tCtx as
parameter ensures that t is not in scope of the code and thus this mistake
cannot happen. The number of lines is the same; it's just a bit more code.

For TestRetry, another advantage is the reduced indentation.

It's worth calling out that the same cannot be done for benchmarks:
- They need methods (Loop) or fields (N) which are not exposed by TContext.
- The `for b.Loop()` pattern only works if the for loop is written exactly
  like that.
2025-12-05 19:13:55 +01:00
Patrick Ohly
60744fc8b9 DRA device taint eviction: track evicting rules
This avoids having to call the rule lister (which can fail in theory, though not
in practice) and having to iterate over rules which can be ignored (which might
be a small performance boost).
2025-11-05 20:03:17 +01:00
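A minimal sketch of the idea in plain Go, without client-go: a set of rule names that informer event handlers keep up to date, so the controller only looks at rules with an evicting effect instead of listing all of them. The type and method names are hypothetical, and treating only NoExecute as evicting is an assumption for illustration.

```go
package main

import (
	"sort"
	"sync"
)

// ruleTracker remembers which DeviceTaintRules currently have an evicting
// effect. Hypothetical type; the real controller's bookkeeping may differ.
type ruleTracker struct {
	mu       sync.Mutex
	evicting map[string]struct{} // keyed by rule name
}

func newRuleTracker() *ruleTracker {
	return &ruleTracker{evicting: make(map[string]struct{})}
}

// onAddOrUpdate would be called from the rule informer's add/update handlers.
func (r *ruleTracker) onAddOrUpdate(name, effect string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if effect == "NoExecute" {
		r.evicting[name] = struct{}{}
	} else {
		// An update may have changed the effect, so drop a stale entry.
		delete(r.evicting, name)
	}
}

// onDelete would be called from the rule informer's delete handler.
func (r *ruleTracker) onDelete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.evicting, name)
}

// evictingRules returns the names of rules that may cause evictions,
// sorted for a deterministic order.
func (r *ruleTracker) evictingRules() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.evicting))
	for name := range r.evicting {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

No lister call is involved, so there is no error path, and rules with other effects never have to be iterated.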
Patrick Ohly
9527987293 DRA device taint eviction: use NOP queue during simulation
It's slightly more efficient and a bit cleaner.
2025-11-05 20:03:17 +01:00
Patrick Ohly
eaee6b6bce DRA device taints: add separate feature gate for rules
Support for DeviceTaintRules depends on a significant amount of
additional code:
- ResourceSlice tracker is a NOP without it.
- Additional informers and corresponding permissions in scheduler and controller.
- Controller code for handling status.

Not all users necessarily need DeviceTaintRules, so adding a second feature
gate for that code makes it possible to limit the blast radius of bugs in that
code without having to turn off device taints and tolerations entirely.
2025-11-05 20:03:17 +01:00
Patrick Ohly
bbf8bc766e DRA device taints: DeviceTaintRule status
To update the right statuses, the controller must collect more information
about why a pod is being evicted. Updating the DeviceTaintRule statuses then is
handled by the same work queue as evicting pods.

Both operations already share the same client instance and thus QPS+server-side
throttling, so they might as well share the same work queue. Deleting pods is
not necessarily more important than informing users or vice-versa, so there is
no strong argument for having different queues.

While at it, the unit tests are switched to the same mock work queue as
in staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because
there is no time to add it properly to a staging repo, the implementation gets
copied.
2025-11-04 21:57:24 +01:00
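A hypothetical sketch of sharing one queue between both kinds of work: each item carries either a pod key or a rule name, and a single dispatcher handles both. The workItem type is invented for illustration, and a slice processed in order stands in for the real work queue.

```go
package main

// workItem is either a pod to evict or a DeviceTaintRule whose status needs
// an update. Hypothetical type; field names are illustrative only.
type workItem struct {
	podKey  string // "namespace/name" of a pod to evict, or ""
	ruleKey string // name of a DeviceTaintRule to update, or ""
}

// process dispatches a single item. Both kinds go through the same queue,
// and in the real controller through the same client, so they share limits.
func process(item workItem, evict func(string), updateStatus func(string)) {
	switch {
	case item.podKey != "":
		evict(item.podKey)
	case item.ruleKey != "":
		updateStatus(item.ruleKey)
	}
}

// drain processes items in FIFO order, standing in for a queue worker.
func drain(queue []workItem, evict func(string), updateStatus func(string)) {
	for _, item := range queue {
		process(item, evict, updateStatus)
	}
}
```

Neither kind of item gets priority over the other, which matches the reasoning that deleting pods and informing users are equally important.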
Patrick Ohly
0689b628c7 generated files 2025-11-04 21:57:24 +01:00
Patrick Ohly
f4a453389d DRA device taint eviction: configurable number of workers
It might never be necessary to change the default, but it is hard to be sure.
It's better to have the option, just in case.
2025-11-04 21:57:24 +01:00
Patrick Ohly
c69259cb71 DRA device taints: switch to workqueue in controller
The approach copied from node taint eviction was to fire off one goroutine per
pod at the intended time. This leads to the "thundering herd" problem: when a
single taint causes eviction of several pods and those all have no or the same
toleration grace period, then they all get deleted concurrently at the same
time.

For node taint eviction that is limited by the number of pods per node, which
is typically ~100. In an integration test, that already led to problems with
watchers:

   cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
   cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2KB stack per goroutine
plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an
upper limit. In the integration test, 10 workers kept the watch active.

Another advantage is that failures to evict the pod get retried with
exponential backoff per affected pod forever. Previously, evicting was tried a
few times with a fixed rate and then the controller gave up. If the apiserver
was down long enough, pods didn't get evicted.
2025-10-31 18:11:19 +01:00
Patrick Ohly
e5fcd20a26 DRA device taints: tighten controller test
We know how often the controller should get a pod, so let's check that.
The check must run before we do our own GET call.
2025-10-31 18:11:18 +01:00
Patrick Ohly
6ebd853f17 DRA: implementation of none taint effect
While at it, ensure that future unknown effects are treated like
the None effect.
2025-10-31 18:11:18 +01:00
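A sketch of how the effects might be interpreted, with unknown values falling back to the None behavior. The string values are assumed to match the API constants; the helper names are hypothetical.

```go
package main

// Taint effects as strings; values assumed to match the DRA API constants.
const (
	effectNoSchedule = "NoSchedule"
	effectNoExecute  = "NoExecute"
	effectNone       = "None"
)

// blocksScheduling reports whether a taint with this effect keeps new pods
// off the device unless tolerated. None and unknown effects do not.
func blocksScheduling(effect string) bool {
	switch effect {
	case effectNoSchedule, effectNoExecute:
		return true
	default: // effectNone and any future, unknown effect
		return false
	}
}

// evicts reports whether a taint with this effect evicts running pods.
// Everything except NoExecute, including unknown effects, is ignored here.
func evicts(effect string) bool {
	return effect == effectNoExecute
}
```

Defaulting to "no effect" means an older component that sees a newer effect neither blocks scheduling nor evicts pods because of it.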
Patrick Ohly
e4dda7b282 DRA device taints: fix DeviceTaintRule + missing slice case
When the ResourceSlice no longer exists, the ResourceSlice tracker didn't and
couldn't report the tainted devices even if they are allocated and in use. The
controller must keep track of DeviceTaintRules itself and handle this scenario.

In this scenario it is impossible to evaluate CEL expressions because the
necessary device attributes aren't available. We could:
- Copy them into the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option is chosen.

The tracker is now no longer needed by the eviction controller. Reading
directly from the informer means that we cannot assume that pointers are
consistent. We have to track ResourceSlices by their name, not their pointer.
2025-10-31 18:11:18 +01:00
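A small illustration of why ResourceSlices get keyed by name: an informer update hands out a new object, so a pointer remembered earlier no longer identifies the current version. resourceSlice and sliceIndex are hypothetical stand-ins for the real types.

```go
package main

// resourceSlice is a hypothetical stand-in for the real ResourceSlice type.
type resourceSlice struct {
	Name    string
	Devices []string
}

// sliceIndex keys slices by name. An update delivers a new object (a
// different pointer) for the same slice, so only the name is a stable key.
type sliceIndex map[string]*resourceSlice

func (idx sliceIndex) upsert(s *resourceSlice) { idx[s.Name] = s }
func (idx sliceIndex) remove(name string)      { delete(idx, name) }

// lookup returns the current version of a slice, or false if it is gone,
// which is the case the controller must still handle.
func (idx sliceIndex) lookup(name string) (*resourceSlice, bool) {
	s, ok := idx[name]
	return s, ok
}
```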
Patrick Ohly
2e543d151b DRA device taints: convert unit test to synctest
The immediate benefit is that the time required for running the package's unit
test goes down from ~10 seconds (because of required real-world delays) to ~0.5
seconds (depending on the CPU performance of the host). It can also make
writing tests easier because after a `Wait` there is no need for locking before
accessing internal state (all background goroutines are known to be blocked
waiting for the main goroutine).

What somewhat ruins the perfect determinism is the polling for informer cache
syncs: that can take an unknown number of loop iterations. This could probably
be fixed by making the waiting block on channels (requires work in client-go).

The only change required in the implementation is avoiding the sleep when
deleting a pod failed for the last time in the loop (a useful, albeit minor
improvement by itself): the test proceeds after having blocked that last Delete
call, in which case synctest expects the background goroutine to exit without
delay.
2025-10-30 17:29:58 +01:00
Patrick Ohly
6f51446802 DRA device taints: fix toleration of NoExecute
As usual, consumers of an allocated claim react to the information stored in
the status. In this case, the scheduler did not copy the tolerations into the
status and as a result a pod with a toleration for NoExecute got scheduled and
then immediately evicted.

Some additional logging gets added to make the handling easier to track in the
eviction controller. Example YAMLs allow reproducing the use case manually.
2025-10-08 13:13:47 +02:00
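A simplified sketch of the eviction-side check: the decision uses the tolerations stored in the allocation status, so an empty list there means a NoExecute taint evicts the pod even if the pod's claim requested a matching toleration. The types and matching rules are assumptions modeled on core/v1 pod tolerations, not the DRA API definitions.

```go
package main

// Hypothetical, simplified types; the real API has more fields
// (for example, a toleration period for NoExecute).
type taint struct{ Key, Value, Effect string }

type toleration struct {
	Key, Operator, Value, Effect string // Operator: "Exists" or "Equal"
}

// tolerates follows the familiar core/v1 rules: an empty effect matches every
// effect, an empty key matches every key, and "" means "Equal".
func (t toleration) tolerates(tn taint) bool {
	if t.Effect != "" && t.Effect != tn.Effect {
		return false
	}
	if t.Key != "" && t.Key != tn.Key {
		return false
	}
	switch t.Operator {
	case "Exists":
		return true
	case "Equal", "":
		return t.Value == tn.Value
	default:
		return false
	}
}

// shouldEvict is evaluated against the tolerations recorded in the claim's
// allocation status. If the scheduler never copied them there, the list is
// empty and every NoExecute taint leads to eviction.
func shouldEvict(tn taint, statusTolerations []toleration) bool {
	if tn.Effect != "NoExecute" {
		return false
	}
	for _, t := range statusTolerations {
		if t.tolerates(tn) {
			return false
		}
	}
	return true
}
```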
Sunyanan Choochotkaew
7f052afaef KEP 5075: implement scheduler
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
2025-07-30 09:52:49 +09:00
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled, but not v1, won't work.
2025-07-24 08:33:45 +02:00
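A hypothetical sketch of the version selection: prefer the newest served version among v1, v1beta2 and v1beta1. The preference order is an assumption for illustration, not necessarily what the helper packages do; the second function reflects the new requirement that the control plane itself needs v1.

```go
package main

// pickVersion chooses which resource.k8s.io API version a driver helper
// would use, given the versions the cluster serves. Assumed order: newest first.
func pickVersion(served map[string]bool) (string, bool) {
	for _, v := range []string{"v1", "v1beta2", "v1beta1"} {
		if served[v] {
			return v, true
		}
	}
	return "", false
}

// controlPlaneSupported: the control plane now requires v1, whatever else
// is enabled.
func controlPlaneSupported(served map[string]bool) bool {
	return served["v1"]
}
```

A driver can therefore still run against an older cluster that only serves a beta version, while a new control plane with v1 disabled cannot.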
Kubernetes Prow Robot
0617903e9d Merge pull request #131344 from pohly/dra-taint-unit-test-flake-minimal
DRA: work around fake.ClientSet informer deficiency in unit test
2025-07-03 02:51:25 -07:00
Davanum Srinivas
03afe6471b Add a replacement for cmp.Diff using json+go-difflib
Co-authored-by: Jordan Liggitt <jordan@liggitt.net>
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2025-06-16 17:10:42 -04:00
Patrick Ohly
ff108e72a5 DRA device taints: fix rare unit test flake
TestCancelEviction flaked at a 0.01% rate because it assumed that an event had
already been created once the pod was updated, but that was only true under
some timing conditions.
2025-04-17 17:16:23 +02:00
Patrick Ohly
ff2e6dddc8 DRA device taints: work around fake.ClientSet informer race
fake.Clientset suffers from a race condition related to informers:
it does not implement resource version support in its Watch
implementation and instead assumes that watches are set up
before further changes are made.

If a test waits for caches to be synced and then immediately
adds an object, that new object will never be seen by event handlers
if the race goes wrong and the Watch call hadn't completed yet
(can be triggered by adding a sleep before b53b9fb557/staging/src/k8s.io/client-go/tools/cache/reflector.go (L431)).

To work around this, we count all watches and only proceed when
all of them are in place. This replaces the normal watch reactor
(b53b9fb557/staging/src/k8s.io/client-go/kubernetes/fake/clientset_generated.go (L161-L173)).
2025-04-17 10:57:27 +02:00
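A minimal sketch of the workaround's synchronization, assuming the counting hook is installed as a watch reactor on the fake clientset (for example via PrependWatchReactor): each Watch call increments a counter, and the test blocks until the expected number of watches exist before creating objects. watchCounter and its methods are hypothetical names.

```go
package main

import "sync"

// watchCounter counts established watches and lets a test wait for them.
type watchCounter struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
}

func newWatchCounter() *watchCounter {
	w := &watchCounter{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// started would be called by the watch reactor each time a Watch is set up.
func (w *watchCounter) started() {
	w.mu.Lock()
	w.count++
	w.mu.Unlock()
	w.cond.Broadcast() // wake up waiters so they re-check the count
}

// waitFor blocks until at least n watches have been started.
func (w *watchCounter) waitFor(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for w.count < n {
		w.cond.Wait()
	}
}
```

Waiting for cache sync alone is not enough because the fake clientset does not replay changes based on resource versions; only after waitFor returns is every informer guaranteed to see newly created objects.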
Patrick Ohly
638abf0339 DRA device taints: more logging in test 2025-04-17 10:55:13 +02:00
Patrick Ohly
40f2085d68 DRA device taint: clean up test initialization
The creation of the shared informer factory and starting it can be done all in
the same function, which makes it a bit more obvious what happens in which
order and avoids some code duplication.
2025-04-17 10:55:13 +02:00
Patrick Ohly
56adcd06f3 DRA device eviction: fix eviction triggered by pod scheduling
Normally the scheduler shouldn't schedule when there is a taint, but perhaps it
didn't know yet.

The TestEviction/update test covered this, but only failed under the right
timing conditions. The new event handler test case covers it reliably.
2025-03-20 19:49:54 +01:00
Patrick Ohly
5856d3ee6f DRA taint eviction: fix waiting in unit test
Events get recorded in the apiserver asynchronously, so even if the test knows
that the pod has been evicted because it is deleted, it still has to check
separately for the event to be recorded.

This caused a flake in the "Consistently" check of events.
2025-03-20 17:59:48 +01:00
Patrick Ohly
ac6e47cb14 DRA taint eviction: improve error handling
There was one error path that led to a "controller has shut down" log
message. Other errors caused different log entries or were so unlikely (event
handler registration failure!) that they weren't checked at all.

It's clearer to let Run return an error in all cases and then log the
"controller has shut down" error at the call site. This also enables tests to
mark themselves as failed, should that ever happen.
2025-03-20 17:59:06 +01:00
Patrick Ohly
9f161590be metrics testing: add type aliases to avoid direct prometheus imports
In tests it is sometimes unavoidable to use the Prometheus types directly,
for example when writing a custom gatherer which needs to normalize data
before testing it. device_taint_eviction_test.go does this to strip
out unpredictable data in a histogram.

With type aliases in a package that is explicitly meant for tests we
can avoid adding exceptions for such tests to the global exception list.
2025-03-19 09:18:38 +01:00
Patrick Ohly
a027b439e5 DRA: add device taint eviction controller
The controller is derived from the node taint eviction controller.
In contrast to that controller it tracks the UID of pods to prevent
deleting the wrong pod when it got replaced.
2025-03-19 09:18:38 +01:00
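A standard-library sketch of why tracking the UID matters: an eviction scheduled for one pod instance must not delete a replacement that reuses the same namespace and name. The API server can enforce this with a UID precondition in the delete options; here a map stands in for the API server, and all names are hypothetical.

```go
package main

import "fmt"

// podRef identifies the exact pod instance selected for eviction.
type podRef struct {
	Namespace, Name string
	UID             string // UID observed when the eviction was scheduled
}

// podStore maps "namespace/name" to the UID of the pod currently using that name.
type podStore map[string]string

// deleteIfSameUID deletes the pod only if it is still the instance that was
// selected for eviction, mimicking a UID precondition on delete.
func (s podStore) deleteIfSameUID(ref podRef) (bool, error) {
	key := ref.Namespace + "/" + ref.Name
	uid, ok := s[key]
	if !ok {
		return false, nil // already gone, nothing to do
	}
	if uid != ref.UID {
		return false, fmt.Errorf("pod %s was replaced: have UID %s, want %s", key, uid, ref.UID)
	}
	delete(s, key)
	return true, nil
}
```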
Patrick Ohly
13d04d4a92 DRA device taints: copy taintseviction controller
This is a verbatim copy of the current pkg/controller/taintseviction code,
revision fc268ecd09 (v1.33.0 plus one commit),
minus the TimedWorker helper.

The intent is to modify the code such that it enforces eviction of pods which
use tainted devices.
2025-03-18 20:52:54 +01:00