In kubelet admission:
- Remove extended resources from pod requirements if they are either
backed by DRA or not present in the node's allocatable resources
In scheduler (fit.go):
- Remove fallback logic that delegated all resources to DRA when
draManager is nil
These changes ensure that:
- DRA-backed extended resources are properly handled during pod admission
- DevicePlugin-backed extended resources still follow standard admission rules
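
The admission rule above can be sketched as a small filter. This is a minimal illustration, not the kubelet's code: the helper name, the plain map types and the draBacked lookup are all assumptions made for the example.

```go
package main

import "fmt"

// admissibleExtendedResources is a hypothetical helper, not kubelet code.
// It returns the extended resource requests that admission should still
// check against the node. A request is dropped if it is backed by DRA or
// if the node does not list the resource in its allocatable map.
func admissibleExtendedResources(requests, allocatable map[string]int64, draBacked map[string]bool) map[string]int64 {
	kept := make(map[string]int64)
	for name, qty := range requests {
		if draBacked[name] {
			continue // handled through DRA, not through node allocatable
		}
		if _, ok := allocatable[name]; !ok {
			continue // the node does not advertise this resource
		}
		kept[name] = qty // device-plugin-backed: standard admission applies
	}
	return kept
}

func main() {
	requests := map[string]int64{
		"example.com/gpu":     1,
		"example.com/dra-gpu": 2,
		"example.com/unknown": 1,
	}
	allocatable := map[string]int64{"example.com/gpu": 4}
	draBacked := map[string]bool{"example.com/dra-gpu": true}
	fmt.Println(admissibleExtendedResources(requests, allocatable, draBacked))
}
```

Only the device-plugin-backed request survives the filter, so it is the only one that still goes through the standard allocatable check.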
The tests assumed that instantiating a DRAManager followed by
informerFactory.WaitForCacheSync would be enough to have the manager
up-to-date, but that's not correct: the test only waits for informer *caches*
to be synced, but syncing *event handlers* like the one in the manager may
still be going on. The flake rate is low, though:
    $ GOPATH/bin/stress -p 256 ./noderesources.test
    5s: 0 runs so far, 0 failures, 256 active
    10s: 256 runs so far, 0 failures, 256 active
    15s: 256 runs so far, 0 failures, 256 active
    20s: 512 runs so far, 0 failures, 256 active
    25s: 567 runs so far, 0 failures, 256 active
    30s: 771 runs so far, 0 failures, 256 active

    /tmp/go-stress-20251226T181044-974980161
    --- FAIL: TestCalculateResourceAllocatableRequest (0.81s)
        --- FAIL: TestCalculateResourceAllocatableRequest/DRA-backed-resource-with-shared-device-allocation (0.00s)
            extendedresourcecache.go:197: I1226 18:11:14.431337] Updated extended resource cache for explicit mapping extendedResource="extended.resource.dra.io/something" deviceClass="device-class-name"
            extendedresourcecache.go:204: I1226 18:11:14.431380] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/device-class-name" deviceClass="device-class-name"
            extendedresourcecache.go:220: I1226 18:11:14.431394] Updated device class mapping deviceClass="device-class-name" extendedResource="extended.resource.dra.io/something"
            resource_allocation_test.go:595: Expected requested=2, but got requested=1
    FAIL
It becomes higher when changing WaitForCacheSync such that it doesn't poll and
therefore returns more promptly, which is where this flake was first observed.
The fix is to run the test in a synctest bubble where Wait can be used to wait
for all background activity, including event handling, to be finished before
proceeding with the test.
synctest is less forgiving about lingering goroutines. A synctest bubble must
wait for goroutines to stop, which in this case means that there has to be
a way to wait for the metric recorder shutdown. Event handlers have to be
removed.
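
The race described above can be modelled without client-go or synctest. The following is a toy sketch only: toyInformer and toyManager are hypothetical types invented for the illustration, and stop() stands in for "wait for all background activity and for goroutines to exit".

```go
package main

import (
	"fmt"
	"sync"
)

// toyInformer is a hypothetical stand-in for an informer: add updates the
// cache synchronously but delivers the event to the handler asynchronously.
type toyInformer struct {
	mu      sync.Mutex
	cache   map[string]bool
	events  chan string
	pending sync.WaitGroup // events queued but not yet handled
	stopped chan struct{}  // closed once the delivery goroutine has exited
}

func newToyInformer(handler func(key string)) *toyInformer {
	inf := &toyInformer{
		cache:   map[string]bool{},
		events:  make(chan string, 16),
		stopped: make(chan struct{}),
	}
	go func() {
		defer close(inf.stopped)
		for key := range inf.events {
			handler(key)
			inf.pending.Done()
		}
	}()
	return inf
}

func (inf *toyInformer) add(key string) {
	inf.mu.Lock()
	inf.cache[key] = true
	inf.mu.Unlock()
	inf.pending.Add(1)
	inf.events <- key
}

// cacheHas mirrors what a cache sync check can tell us: the object is stored.
func (inf *toyInformer) cacheHas(key string) bool {
	inf.mu.Lock()
	defer inf.mu.Unlock()
	return inf.cache[key]
}

// stop waits for pending events, then for the delivery goroutine to exit.
func (inf *toyInformer) stop() {
	inf.pending.Wait()
	close(inf.events)
	<-inf.stopped
}

// toyManager derives its own state from events via its handler.
type toyManager struct {
	mu    sync.Mutex
	known map[string]bool
}

func (m *toyManager) handle(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.known[key] = true
}

func (m *toyManager) knows(key string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.known[key]
}

// demo reports what is guaranteed after add and after stop.
func demo() (cached, handled bool) {
	m := &toyManager{known: map[string]bool{}}
	inf := newToyInformer(m.handle)
	inf.add("device-class-name")
	cached = inf.cacheHas("device-class-name") // true, yet m.knows may still be false here
	inf.stop()                                 // only now is the handler guaranteed to have run
	handled = m.knows("device-class-name")
	return cached, handled
}

func main() {
	cached, handled := demo()
	fmt.Println(cached, handled)
}
```

A test that inspects the manager right after the cache check would be flaky in the same way as the one described here. Waiting for the handler and for the goroutine to exit removes the race.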
This could be done with plain Go, but here test/utils/ktesting is used instead
because it offers some advantages:
- less boilerplate code
- automatic cancellation of the context (i.e. less manual context.WithCancel)
- tCtx.SyncTest is a direct substitute for t.Run, which avoids re-indenting
sub-tests. synctest itself needs another anonymous function, which makes
the line too long and forces re-indentation:
    t.Run(... func(...) {
        synctest.Test(... func() {
        })
    })
For the sake of consistency all tests get updated.
While at it, some code gets improved:
- t.Fatal(err) is not a good way to report an error because
there is no additional markup in the test output that indicates
that there was an unexpected error. It just logs err.Error(),
which might not be very informative and/or obvious.
- newTestDRAManager aborts in case of a failure instead of
returning an error.
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature
calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
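
The idea of sorting nested arrays before signing, mentioned above for node affinity, can be sketched as follows. The requirement type and the sign function are hypothetical simplifications, and JSON is used only as a convenient stand-in encoding. The actual signature format is not described in this message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// requirement is a hypothetical, simplified selector term.
type requirement struct {
	Key    string
	Values []string
}

// sign returns a []byte signature that does not depend on the order of the
// requirements or of the values inside each requirement.
func sign(reqs []requirement) ([]byte, error) {
	canon := make([]requirement, len(reqs))
	for i, r := range reqs {
		values := append([]string(nil), r.Values...) // copy, do not mutate the input
		sort.Strings(values)
		canon[i] = requirement{Key: r.Key, Values: values}
	}
	sort.Slice(canon, func(i, j int) bool {
		if canon[i].Key != canon[j].Key {
			return canon[i].Key < canon[j].Key
		}
		return strings.Join(canon[i].Values, "\x00") < strings.Join(canon[j].Values, "\x00")
	})
	return json.Marshal(canon)
}

func main() {
	a, _ := sign([]requirement{
		{Key: "zone", Values: []string{"b", "a"}},
		{Key: "arch", Values: []string{"amd64"}},
	})
	b, _ := sign([]requirement{
		{Key: "arch", Values: []string{"amd64"}},
		{Key: "zone", Values: []string{"a", "b"}},
	})
	fmt.Println(string(a) == string(b)) // prints "true"
}
```

Two pods whose terms differ only in ordering get the same signature and can therefore share a batch.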
Extend Fit and BalancedAllocation PreScore state with the
allocated state, the list of ResourceSlices and the device class
mapping. Gather these once during PreScore and pass them through
the scoring path instead of re-fetching for every scoring call.
This should speed up scoring of DRA extended resources, lowering
scheduling overhead.
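
The gather-once pattern can be illustrated with a toy model. All types, field names and the fetch counter below are assumptions made for the sketch. The real PreScore state and its types live in the scheduler plugins and look different.

```go
package main

import "fmt"

// draSnapshot holds the data that is gathered once in PreScore.
// Field names and types are hypothetical simplifications.
type draSnapshot struct {
	allocatedDevices map[string]bool     // device ID -> already allocated
	sliceDevices     map[string][]string // node name -> device IDs from ResourceSlices
	classByResource  map[string]string   // extended resource -> device class
}

// source counts how often the expensive lookups run.
type source struct{ fetches int }

func (s *source) fetch() *draSnapshot {
	s.fetches++
	return &draSnapshot{
		allocatedDevices: map[string]bool{"gpu-0": true},
		sliceDevices: map[string][]string{
			"node-a": {"gpu-0", "gpu-1"},
			"node-b": {"gpu-2"},
		},
		classByResource: map[string]string{"example.com/gpu": "gpu-class"},
	}
}

// preScoreState is what PreScore stores for the later per-node calls.
type preScoreState struct{ dra *draSnapshot }

// preScore fetches the DRA data exactly once per scheduling cycle.
func preScore(s *source) *preScoreState {
	return &preScoreState{dra: s.fetch()}
}

// freeDevices is the per-node work. It reads only from the stored state.
func freeDevices(state *preScoreState, node string) int {
	free := 0
	for _, id := range state.dra.sliceDevices[node] {
		if !state.dra.allocatedDevices[id] {
			free++
		}
	}
	return free
}

func main() {
	src := &source{}
	state := preScore(src)
	for _, node := range []string{"node-a", "node-b"} {
		fmt.Println(node, freeDevices(state, node))
	}
	fmt.Println("fetches:", src.fetches)
}
```

However many nodes are scored, the fetch counter stays at one, which is where the expected saving comes from.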
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
Co-authored-by: Maciej Skoczeń <mskoczen@google.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
class mapping
- Add a new interface "DeviceClassResolver" in the scheduler framework
- Add a global cache of mapping between the extended resource and the
device class
- The cache can be used by the k8s api-server and controller-manager as well as the scheduler
- This change helps delegate requests to the dynamicresource plugin
based on the mapping during node update events, thus avoiding an
extra scheduling cycle
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Previously, the scheduler assumed an extended resource was maintained
by a device plugin if its name was present in the node's Allocatable
map, even if its value was zero. This blocked scheduling when a device
plugin was disconnected or uninstalled, because Kubelet still reported
the resource with Allocatable=0.
This change adds a check for the actual allocatable value in addition
to a key presence check, allowing nodes with uninstalled device
plugins to be considered for scheduling.
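
The difference between the old and the new check can be sketched as below. The function names and the plain int64 map are assumptions for the illustration. The scheduler itself works with resource quantities rather than raw integers.

```go
package main

import "fmt"

// presentInAllocatable models the previous check: key presence only.
func presentInAllocatable(allocatable map[string]int64, resource string) bool {
	_, ok := allocatable[resource]
	return ok
}

// backedByDevicePlugin models the new check: the key must be present and
// its allocatable value must be greater than zero.
func backedByDevicePlugin(allocatable map[string]int64, resource string) bool {
	qty, ok := allocatable[resource]
	return ok && qty > 0
}

func main() {
	// A node whose device plugin was uninstalled still reports the resource
	// with an allocatable value of zero.
	allocatable := map[string]int64{"example.com/gpu": 0}
	fmt.Println(presentInAllocatable(allocatable, "example.com/gpu"))
	fmt.Println(backedByDevicePlugin(allocatable, "example.com/gpu"))
}
```

With the value check in place, such a node is no longer treated as device-plugin-backed for that resource.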
* Move ClusterEvent type to staging repo, leaving some functions (that contain logic internal to scheduler) in kubernetes/kubernetes
apply review comment and fix linter warning
* update-vendor.sh
* update doc comments
* run update-vendor.sh
Currently, the NodeResourcesFit plugin always returns Unschedulable when a pod's
resource requests exceed a node's available resources. However, when a pod's
requests exceed the node's total allocatable, preemption cannot help since even
an empty node would not have enough resources.
This change modifies the NodeResourcesFit plugin to return UnschedulableAndUnresolvable
when a pod's resource requests exceed the node's total allocatable. This helps
optimize the scheduling process in large clusters by:
1. Reducing the number of candidate nodes that need to be considered for preemption
2. Providing clearer feedback about unresolvable resource constraints
3. Improving scheduling performance by avoiding unnecessary preemption calculations
The change is particularly beneficial in heterogeneous clusters where node sizes
vary significantly, as it helps quickly identify nodes that are fundamentally
too small for certain pods.
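
The decision can be sketched for a single resource as follows. The Code type and fitCode are hypothetical stand-ins for the framework's status codes and the plugin's fit check, reduced to plain integers.

```go
package main

import "fmt"

// Code is a simplified stand-in for the scheduler framework's status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

func (c Code) String() string {
	switch c {
	case Success:
		return "Success"
	case Unschedulable:
		return "Unschedulable"
	default:
		return "UnschedulableAndUnresolvable"
	}
}

// fitCode is a hypothetical single-resource version of the fit check.
// request is what the pod asks for, allocatable is the node's total, and
// inUse is what pods already on the node have requested.
func fitCode(request, allocatable, inUse int64) Code {
	switch {
	case request > allocatable:
		// Even an empty node would be too small, so preemption cannot help.
		return UnschedulableAndUnresolvable
	case request > allocatable-inUse:
		// Evicting other pods could free enough room.
		return Unschedulable
	default:
		return Success
	}
}

func main() {
	fmt.Println(fitCode(8, 4, 0)) // node too small in total
	fmt.Println(fitCode(3, 4, 2)) // fits only after preemption
	fmt.Println(fitCode(2, 4, 2)) // fits
}
```

Preemption then only needs to consider nodes that report Unschedulable.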
Fixes https://github.com/kubernetes/kubernetes/issues/131310
Co-authored-by: Kensei Nakada <handbomusic@gmail.com>
- Refactored `PreScore` method in `balanced_allocation.go` to skip
best-effort pods.
- Updated unit tests in `balanced_allocation_test.go` to check for
the new status codes.
1. Use pod-level resources when the feature is enabled and resources are set at pod-level
2. Edge case handling: When a pod defines only CPU or memory limits at pod-level (but not both), and container-level requests/limits are unset, the pod-level requests stay empty for the resource without a pod-limit. The container's request for that resource is then set to the default request value from schedutil.