Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).
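A minimal sketch of the pattern (not the Kubernetes test itself), assuming Go 1.25's testing/synctest: inside the bubble, time.Sleep advances a fake clock as soon as all goroutines block, so waiting for the assume cache no longer depends on how loaded the machine is.

```go
package example

import (
	"testing"
	"testing/synctest"
	"time"
)

// TestCacheCatchUp shows the idea only; the goroutine stands in for the
// assume cache processing an event after some delay.
func TestCacheCatchUp(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		updated := false
		go func() {
			time.Sleep(time.Second) // fake time inside the bubble
			updated = true
		}()
		// Advances the bubble's clock instantly once all goroutines block;
		// no real-world delay, no sensitivity to system load.
		time.Sleep(2 * time.Second)
		synctest.Wait() // makes the goroutine's write visible to the test
		if !updated {
			t.Error("cache did not catch up")
		}
	})
}
```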
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain timestamps and then having that test case run within a certain
real-world time window. It's surprising that this didn't flake more often,
because test execution order is random. Now the timestamp gets set right
before the test case runs. Conversion to a synctest would
be nicer, but synctests cannot have sub-tests, which are used here to track
where log output and failures come from within the larger test case.
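A purely illustrative sketch of that pattern (none of the names below come from the actual test): the time-sensitive value is computed from time.Now() right before the sub-test executes, not when the test table is built.

```go
package example

import (
	"testing"
	"time"
)

// TestBindingTimeoutSketch is hypothetical; it only demonstrates deriving the
// timestamp immediately before the case runs so the case no longer depends on
// being reached within a real-world time window.
func TestBindingTimeoutSketch(t *testing.T) {
	cases := map[string]struct {
		claimAge    time.Duration // how long ago binding supposedly started
		wantTimeout bool
	}{
		"recent-binding":  {claimAge: time.Second, wantTimeout: false},
		"expired-binding": {claimAge: 10 * time.Minute, wantTimeout: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			// Computed here, right before the case runs, not during setup.
			bindingStart := time.Now().Add(-tc.claimAge)
			timedOut := time.Since(bindingStart) > 5*time.Minute // illustrative timeout
			if timedOut != tc.wantTimeout {
				t.Errorf("timedOut = %v, want %v", timedOut, tc.wantTimeout)
			}
		})
	}
}
```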
Inside the plugin itself, some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or an error during Filter.
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602), it is necessary
to have information about pools, devices, and in-flight claims. Log calls get
extended and the config for DRA CI jobs gets updated to enable higher verbosity
for the relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
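A rough sketch of what such a verbosity-gated, structured log call can look like; the function, its parameters, and poolSummary are illustrative, only the message and key names mirror the sample output above.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// logPools is illustrative, not the allocator's code: the pool summary is only
// gathered and emitted when -v=6 (or higher) is in effect for this file.
func logPools(logger klog.Logger, pod *v1.Pod, nodeName string, poolSummary any) {
	if loggerV := logger.V(6); loggerV.Enabled() {
		loggerV.Info("Gathered pool information",
			"pod", klog.KObj(pod),
			"node", nodeName,
			"pools", poolSummary, // e.g. count, device IDs, per-pool metadata
		)
	}
}
```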
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2.
Also add some new high-level logging and simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin-specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
The test started without waiting for the ResourceSlice informer to have
synced. As a result, the "CEL-runtime-error-for-one-of-three-nodes" test case
failed randomly with a very low flake rate (less than 1% in local runs) because
CEL expressions never got evaluated when the slices were not available yet.
Other tests were also less reliable, but not known to fail.
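A minimal sketch of the missing step, assuming the slices come from a started client-go SharedIndexInformer (helper name and parameters are illustrative):

```go
package example

import (
	"context"
	"testing"

	"k8s.io/client-go/tools/cache"
)

// waitForSlices blocks until the ResourceSlice informer reports HasSynced, so
// CEL expressions get evaluated against the actual slices instead of an empty
// store.
func waitForSlices(ctx context.Context, t *testing.T, informer cache.SharedIndexInformer) {
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		t.Fatal("timed out waiting for ResourceSlice informer to sync")
	}
}
```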
Support for DeviceTaintRules depends on a significant amount of
additional code:
- ResourceSlice tracker is a NOP without it.
- Additional informers and corresponding permissions in scheduler and controller.
- Controller code for handling status.
Not all users necessarily need DeviceTaintRules, so adding a second feature
gate makes it possible to limit the blast radius of bugs in that code without
having to turn off device taints and tolerations entirely.
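A hedged sketch of the split; the gate name and function below are placeholders, not necessarily the real identifiers:

```go
package example

import (
	"k8s.io/component-base/featuregate"
)

// Placeholder gate name; the point is the separation, not the exact identifier.
const DeviceTaintRulesGate featuregate.Feature = "DRADeviceTaintRules"

// startTaintSupport sketches how the extra machinery stays off unless the
// second gate is enabled, while plain device taints/tolerations keep working.
func startTaintSupport(fg featuregate.FeatureGate) {
	if !fg.Enabled(DeviceTaintRulesGate) {
		// ResourceSlice tracker remains a NOP; no DeviceTaintRule informer,
		// no extra RBAC, no status-handling controller.
		return
	}
	// ... wire up the tracker, the additional informers, and the controller
	// code for handling status ...
}
```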
Add a new `bindingTimeout` field to DynamicResources plugin args and wire it
into PreBind.
Changes:
- API: add `bindingTimeout` to DynamicResourcesArgs (staging + internal types).
- Defaults: default to 600 seconds when BOTH DRADeviceBindingConditions and
DRAResourceClaimDeviceStatus are enabled.
- Validation: require >= 1s; forbid when either feature gate is disabled.
- Plugin: plumb the args into `pl.bindingTimeout` and use it in
`wait.PollUntilContextTimeout` for the binding-condition wait logic (see the
sketch after this list).
- Plugin: remove the legacy `BindingTimeoutDefaultSeconds`.
Tests:
- Add/adjust unit tests for validation and PreBind timeout path.
- Ensure <1s and negative values are rejected and that the field is forbidden
when the gates are disabled.
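Rough sketch of the PreBind wait, assuming the plugin's condition check is wrapped in a wait.ConditionWithContextFunc (names are illustrative):

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForBinding bounds the binding-condition wait with the configurable
// bindingTimeout (600s by default per the args above). claimBound stands in
// for the plugin's actual check of the claim's binding conditions.
func waitForBinding(ctx context.Context, bindingTimeout time.Duration, claimBound wait.ConditionWithContextFunc) error {
	// Poll every second, check immediately on entry, give up after bindingTimeout.
	return wait.PollUntilContextTimeout(ctx, time.Second, bindingTimeout, true, claimBound)
}
```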
The cache and scheduler event handlers cannot be registered separately in the
informer, because that leads to a race (the scheduler might schedule based on
an event before the cache is updated). Chaining the event handlers (cache
first, then scheduler) avoids this.
This also ensures that the cache is up-to-date before the scheduler
starts (HasSynced of the handler registration for the cache is checked).
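A minimal sketch of the chaining idea, assuming both existing handlers implement client-go's cache.ResourceEventHandler (type and field names are illustrative):

```go
package example

import (
	"k8s.io/client-go/tools/cache"
)

// chainedHandler forwards every event to the cache handler first and only
// then to the scheduler handler, so the scheduler never reacts to an event
// before the cache reflects it.
type chainedHandler struct {
	cacheHandler     cache.ResourceEventHandler
	schedulerHandler cache.ResourceEventHandler
}

func (h chainedHandler) OnAdd(obj interface{}, isInInitialList bool) {
	h.cacheHandler.OnAdd(obj, isInInitialList)
	h.schedulerHandler.OnAdd(obj, isInInitialList)
}

func (h chainedHandler) OnUpdate(oldObj, newObj interface{}) {
	h.cacheHandler.OnUpdate(oldObj, newObj)
	h.schedulerHandler.OnUpdate(oldObj, newObj)
}

func (h chainedHandler) OnDelete(obj interface{}) {
	h.cacheHandler.OnDelete(obj)
	h.schedulerHandler.OnDelete(obj)
}
```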
Other changes:
- renamed package to avoid clash with other "cache" packages
- clarified nil handling
- feature gate check before instantiating the cache
- per-test logging
- utilruntime.HandleErrorWithLogger
- simpler cache.DeletedFinalStateUnknown
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Extended resource to device class mapping
- Add a new interface "DeviceClassResolver" in the scheduler framework (see
the sketch after this list)
- Add a global cache of the mapping between an extended resource and its
device class
- The cache can be leveraged by the k8s API server and controller-manager,
along with the scheduler
- This change helps delegate requests to the dynamicresources plugin based on
the mapping during node update events, thus avoiding an extra scheduling cycle
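A hedged sketch of what such an interface and cache could look like; the method set, the returned value, and the backing map are assumptions, not the actual scheduler framework API:

```go
package example

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// DeviceClassResolver is an illustrative shape for the interface: given an
// extended resource name, return the name of the DRA device class that backs
// it, or "" if the resource is not mapped.
type DeviceClassResolver interface {
	DeviceClassForResource(name v1.ResourceName) string
}

// resolverCache is an illustrative map-backed implementation of the shared
// mapping (extended resource -> device class name).
type resolverCache struct {
	mu      sync.RWMutex
	mapping map[v1.ResourceName]string
}

func (c *resolverCache) DeviceClassForResource(name v1.ResourceName) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.mapping[name]
}
```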
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Previously, the scheduler assumed an extended resource was maintained
by a device plugin if its name was present in the node's Allocatable
map, even if its value was zero. This blocked scheduling when a device
plugin was disconnected or uninstalled, because Kubelet still reported
the resource with Allocatable=0.
This change adds a check for the actual allocatable value in addition
to a key presence check, allowing nodes with uninstalled device
plugins to be considered for scheduling.
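The tightened check boils down to something like the following sketch (the function name is illustrative):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// managedByDevicePlugin treats an extended resource as device-plugin-managed
// only if it is present in the node's Allocatable map AND its value is
// non-zero; a disconnected or uninstalled plugin leaves the key behind with a
// zero quantity.
func managedByDevicePlugin(node *v1.Node, resourceName v1.ResourceName) bool {
	quantity, ok := node.Status.Allocatable[resourceName]
	return ok && !quantity.IsZero()
}
```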
This prevents the mistake from 1.34 where the default-on
DRAResourceClaimDeviceStatus feature caused the use of the experimental
allocator implementation. The test fails without a fix for that.
In 1.34, the default feature gate selection picked the "experimental" allocator
implementation when it should have used the "incubating" allocator. No harm
came from that because the experimental allocator has all the necessary if
checks to disable the extra code and no bugs were introduced when implementing
it, but it means that our safety net wasn't there when we expected it to be.
The reason is that the "DRAResourceClaimDeviceStatus" feature gate is on by
default and was only listed as supported by the experimental implementation.
This could be fixed by also listing it as supported by the other
implementation, but that would be a bit odd because there is nothing to support
for it (which is why this was missed in 1.34!). Instead, the allocator
features are now only indirectly related to feature gates, with a single
boolean controlling the implementation of binding conditions.
Copying from feature.Features to new fields in the plugin got a bit silly with
the long list of features that we have now. Embedding feature.Features is
simpler.
Two fields in feature.Features weren't named after their feature gates; now
they are named consistently and the fields are sorted.
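An illustrative sketch of the embedding (the field names and the boolean are stand-ins, not the plugin's actual declarations):

```go
package example

// Features stands in for the plugin's feature.Features struct: one boolean
// per feature gate, named exactly after the gate and sorted.
type Features struct {
	EnableDRAAdminAccess               bool
	EnableDRADeviceBindingConditions   bool
	EnableDRAResourceClaimDeviceStatus bool
	// ... remaining gates ...
}

// DynamicResources embeds Features instead of copying each flag into its own
// field, so pl.EnableDRAAdminAccess etc. keep working without the copy code.
type DynamicResources struct {
	Features

	// A single boolean, derived from the gates, now selects the allocator
	// implementation that supports binding conditions.
	useBindingConditionsAllocator bool
}
```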
TestPlugin/multi-claims-binding-conditions-all-success/PreEnqueue
flakes because the assume cache has not been synced with the initial
store. The test now waits until the registered handler used by the
assume cache has synced before proceeding.
Moving Scheduler interfaces to staging: Move the PodInfo and NodeInfo interfaces (together with related types) to the staging repo, leaving the internal implementation in kubernetes/kubernetes/pkg/scheduler.