DeviceTaintRule is off by default because the corresponding v1beta2 API group
is off. When the feature was enabled, the potentially still disabled v1alpha3
API version was used instead of the new v1beta2, causing the scheduler to fail
while setting up informers and then to not schedule any pods.
This commit introduces metrics and improves log output for
DRA Device Binding Conditions (KEP-5007):
- scheduler_dra_bindingconditions_allocations_total
  Counts the number of per-device scheduling attempts during PreBind
  where BindingConditions are in use.
- scheduler_dra_bindingconditions_wait_duration_seconds
  Observes the time spent waiting for BindingConditions to be satisfied
  during PreBind.
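For illustration, a minimal sketch of how such metrics could be registered
with component-base/metrics (variable names and buckets here are assumptions,
not the actual implementation):

    package dynamicresources

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    var (
        // Incremented once per device handled during PreBind when
        // BindingConditions are in use.
        bindingConditionsAllocations = metrics.NewCounter(&metrics.CounterOpts{
            Subsystem:      "scheduler",
            Name:           "dra_bindingconditions_allocations_total",
            Help:           "Number of per-device scheduling attempts during PreBind where BindingConditions are in use.",
            StabilityLevel: metrics.ALPHA,
        })

        // Observes how long PreBind waited for BindingConditions.
        bindingConditionsWaitDuration = metrics.NewHistogram(&metrics.HistogramOpts{
            Subsystem:      "scheduler",
            Name:           "dra_bindingconditions_wait_duration_seconds",
            Help:           "Time spent waiting for BindingConditions to be satisfied during PreBind.",
            Buckets:        metrics.ExponentialBuckets(0.1, 2, 12), // assumed buckets
            StabilityLevel: metrics.ALPHA,
        })
    )

    func init() {
        legacyregistry.MustRegister(bindingConditionsAllocations, bindingConditionsWaitDuration)
    }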
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
The code paths for adding AllocationTimestamp were not tested well. None of
the test cases verified that an AllocationTimestamp gets added at all because
go-cmp was instructed to ignore the unpredictable field.
We can do better than that and at least check for existence by normalizing all
non-nil timestamps to the zero time. This affects all tests where the binding
conditions and thus AllocationTimestamp support is enabled.
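A minimal sketch of such a normalization, assuming the AllocationTimestamp
field on the allocation result (helper name and import aliases hypothetical):

    import (
        resourceapi "k8s.io/api/resource/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // normalizeAllocationTimestamps replaces every non-nil AllocationTimestamp
    // with the zero time, so go-cmp verifies that a timestamp was set without
    // comparing its unpredictable value.
    func normalizeAllocationTimestamps(claims []*resourceapi.ResourceClaim) {
        for _, claim := range claims {
            if claim.Status.Allocation != nil && claim.Status.Allocation.AllocationTimestamp != nil {
                claim.Status.Allocation.AllocationTimestamp = &metav1.Time{}
            }
        }
    }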
The retry loop for status updates was untested. The fake client has to return a
conflict status error to trigger it. This enables writing a test case where a
concurrent deallocation would have caused the nil panic without the previous
fix.
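A sketch of how the fake client can be made to trigger the retry loop (helper
and resource names illustrative):

    import (
        "errors"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/kubernetes/fake"
        k8stesting "k8s.io/client-go/testing"
    )

    // withOneConflict makes the fake client return a Conflict error exactly
    // once so that the status-update retry loop runs.
    func withOneConflict(client *fake.Clientset) {
        conflictReturned := false
        client.PrependReactor("update", "resourceclaims", func(action k8stesting.Action) (bool, runtime.Object, error) {
            if conflictReturned {
                return false, nil, nil // fall through to the default reactor
            }
            conflictReturned = true
            gr := schema.GroupResource{Group: "resource.k8s.io", Resource: "resourceclaims"}
            return true, nil, apierrors.NewConflict(gr, "claim", errors.New("concurrent update"))
        })
    }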
For binding conditions, one test case gets added which runs through the full
flow of allocating a claim and trying to bind it. All other test cases seem to
have started with the claim already allocated.
Altogether this increases coverage from 82.4% to 83.7%.
When DRAExtendedResource is enabled, the dynamicresources test setup
registers an event handler for DeviceClasses but was not waiting for it
to sync. This can lead to flaky tests where the cache is not fully
populated when the test starts.
This change captures the event handler registration and includes its
DoneChecker in a WaitFor call.
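Sketched with the plain client-go equivalents (the test helper actually uses
a DoneChecker with WaitFor; informer and handler names are assumptions):

    // Capture the registration returned by AddEventHandler and include its
    // HasSynced when waiting for caches to sync.
    reg, err := deviceClassInformer.Informer().AddEventHandler(handler)
    if err != nil {
        return err
    }
    if !cache.WaitForCacheSync(ctx.Done(), reg.HasSynced) {
        return fmt.Errorf("timed out waiting for DeviceClass event handler to sync")
    }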
This fixes a bug that caused log calls involving `klog.Logger` to not be
checked.
As a result we have to fix some code that is now considered faulty:
ERROR: pkg/controller/serviceaccount/tokens_controller.go:382:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (e *TokensController) generateTokenIfNeeded(ctx context.Context, logger klog.Logger, serviceAccount *v1.ServiceAccount, cachedSecret *v1.Secret) ( /* retry */ bool, error) {
ERROR: ^
ERROR: pkg/controller/storageversionmigrator/storageversionmigrator.go:299:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (svmc *SVMController) runMigration(ctx context.Context, logger klog.Logger, gvr schema.GroupVersionResource, resourceMonitor *garbagecollector.Monitor, toBeProcessedSVM *svmv1beta1.StorageVersionMigration, listResourceVersion string) (err error, failed bool) {
ERROR: ^
ERROR: pkg/proxy/node.go:121:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to exist", nodeName)
ERROR: ^
ERROR: pkg/proxy/node.go:123:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to be assigned IPs", nodeName)
ERROR: ^
ERROR: pkg/scheduler/backend/queue/scheduling_queue.go:610:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (p *PriorityQueue) runPreEnqueuePlugin(ctx context.Context, logger klog.Logger, pl fwk.PreEnqueuePlugin, pInfo *framework.QueuedPodInfo, shouldRecordMetric bool) *fwk.Status {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:286:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) deleteClaim(ctx context.Context, claim *resourceapi.ResourceClaim, logger klog.Logger) error {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:499:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) waitForExtendedClaimInAssumeCache(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:528:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) createExtendedResourceClaimInAPI(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:592:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) unreserveExtendedResourceClaim(ctx context.Context, logger klog.Logger, pod *v1.Pod, state *stateData) {
ERROR: ^
ERROR: pkg/scheduler/framework/runtime/batch.go:171:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (b *OpportunisticBatch) batchStateCompatible(ctx context.Context, logger klog.Logger, pod *v1.Pod, signature fwk.PodSignature, cycleCount int64, state fwk.CycleState, nodeInfos fwk.NodeInfoLister) bool {
ERROR: ^
ERROR: staging/src/k8s.io/component-base/featuregate/feature_gate.go:890:4: Additional arguments to Info should always be Key Value pairs. Please check if there is any key or value missing. (logcheck)
ERROR: logger.Info("Warning: SetEmulationVersionAndMinCompatibilityVersion will change already queried feature", "featureGate", feature, "oldValue", oldVal, newVal)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:108:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("pluginSocksDir: %s", pluginSocksDir)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:123:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("CDI_ENABLED: %s", cdiEnabled)
ERROR: ^
While waiting for this to merge, another call was added which also doesn't
follow conventions:
ERROR: pkg/kubelet/kubelet.go:2454:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (kl *Kubelet) deletePod(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
ERROR: ^
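The fixes follow a mechanical pattern; a sketch of the two most common ones
(function and variable names illustrative):

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        "k8s.io/klog/v2"
    )

    // Before (rejected): both a context and a logger parameter, so callers
    // must know whether the context has to contain the logger.
    func deletePodOld(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
        logger.Info("Deleting pod", "pod", klog.KObj(pod))
        return nil
    }

    // After: accept only the context and derive the logger from it.
    func deletePodNew(ctx context.Context, pod *v1.Pod) error {
        logger := klog.FromContext(ctx)
        logger.Info("Deleting pod", "pod", klog.KObj(pod))
        return nil
    }

    // Format specifiers get replaced by structured key/value pairs:
    //     logger.Info("pluginSocksDir: %s", pluginSocksDir)        // rejected
    //     logger.Info("Found plugin socks dir", "path", pluginSocksDir)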
Contextual logging has been beta and enabled by default for several releases
now. It's mostly just a matter of wrapping up and declaring it GA. Therefore
the calls which directly call WithName or WithValues (which always have an
effect) are left as-is instead of converting them to the klog
wrappers (which support disabling the effect). To allow that, the linter gets
reconfigured to not complain about this anymore, anywhere.
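For reference, the difference between the two forms (klog/v2 provides the
wrappers):

    logger = logger.WithName("dra-manager")             // always adds the name
    logger = klog.LoggerWithName(logger, "dra-manager") // no-op while contextual logging is disabled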
The calls which would have to be fixed otherwise are:
ERROR: pkg/kubelet/cm/dra/claiminfo.go:170:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-claiminfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:45:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:89:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:157:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:175:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:239:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:593:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:781:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(context.Background()).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:898:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager_test.go:1638:15: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(streamCtx).WithName(st.Name())
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:77:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:108:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:161:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker.go:695:14: function "WithValues" should be called through klogr.LoggerWithValues (logcheck)
ERROR: logger := logger.WithValues("device", deviceID)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:42:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd0URL, stopEtcd0, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd0"), "etcd_watchcache0", etcdArgs)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:47:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd1URL, stopEtcd1, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd1"), "etcd_watchcache1", etcdArgs)
ERROR: ^
ERROR: test/integration/scheduler_perf/scheduler_perf.go:1149:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName(tCtx.Name())
ERROR: ^
The assume cache logs adding assumed claims at V(4), but there was nothing
about in-flight claims in the log for a scheduling failure where the same
device was allocated twice (https://github.com/kubernetes/kubernetes/issues/133602).
Debugging that issue depends on seeing all assume cache changes (not just the
single "Assumed object" message) and the in-flight claims. We could make them
all V(4) (= "debug level"), but they seem more appropriate for V(5) (=
"trace level"), so the assume cache verbosity gets toned down to that.
The previous incubating becomes stable, and experimental becomes the new
incubating. Now experimental and incubating are identical until we merge more
experimental changes again.
Specifically, these commands were used:
rm -rf stable
mv incubating stable
mv stable/allocator_incubating.go stable/allocator_stable.go
mv stable/pools_incubating.go stable/pools_stable.go
sed -i -e 's/package incubating/package stable/' stable/*.go
cp -a experimental incubating
mv incubating/allocator_experimental.go incubating/allocator_incubating.go
mv incubating/pools_experimental.go incubating/pools_incubating.go
sed -i -e 's/package experimental/package incubating/' incubating/*.go
Some other packages then need to be adapted, in particular the
TestAllocatorSelection test.
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).
The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume
cache (which feeds the allocated devices tracker) and of the in-flight claims
lands such that GatherAllocatedState doesn't see the device in that claim as
allocated:
dra_manager.go:263: I0115 15:11:04.407714 18778] scheduler: Starting GatherAllocatedState
...
allocateddevices.go:189: I0115 15:11:04.407945 18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
dynamicresources.go:1150: I0115 15:11:04.407981 89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
dra_manager.go:201: I0115 15:11:04.408008 89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
dynamicresources.go:1157: I0115 15:11:04.408044 89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
{
"devices": {
"results": [
{
"request": "req-1",
"driver": "testdra-all-usesallresources-hvs5d.driver",
"pool": "worker-5",
"device": "worker-5-device-094"
}
]
},
"nodeSelector": {
"nodeSelectorTerms": [
{
"matchFields": [
{
"key": "metadata.name",
"operator": "In",
"values": [
"worker-5"
]
}
]
}
]
},
"allocationTimestamp": "2026-01-15T14:11:04Z"
}
>
dra_manager.go:280: I0115 15:11:04.408085 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
dra_manager.go:280: I0115 15:11:04.408137 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
default_binder.go:69: I0115 15:11:04.408175 89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
dra_manager.go:265: I0115 15:11:04.408264 18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {
Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight
=> device never seen as allocated
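The same interleaving as a self-contained sketch (data structures heavily
simplified; the real code reads the assume cache and the in-flight claims):

    package main

    import (
        "fmt"
        "sync"
    )

    // Simplified stand-ins for the two sources of allocation state.
    type trackers struct {
        mu       sync.Mutex
        cache    map[string]bool // devices known via the assume cache
        inFlight map[string]bool // devices in claims still being written
    }

    func (t *trackers) snapshot(m map[string]bool) map[string]bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        out := make(map[string]bool, len(m))
        for k := range m {
            out[k] = true
        }
        return out
    }

    func main() {
        t := &trackers{
            cache:    map[string]bool{},
            inFlight: map[string]bool{"device-094": true},
        }

        allocated := t.snapshot(t.cache) // read #1: not in the cache yet

        // Concurrent writer: the claim lands in the cache, leaves in-flight.
        done := make(chan struct{})
        go func() {
            defer close(done)
            t.mu.Lock()
            t.cache["device-094"] = true
            delete(t.inFlight, "device-094")
            t.mu.Unlock()
        }()
        <-done

        inFlight := t.snapshot(t.inFlight) // read #2: no longer in-flight

        // Neither snapshot contains the device: never seen as allocated.
        fmt.Println(allocated["device-094"] || inFlight["device-094"]) // false
    }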
This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but often enough that regressions should show up as flakes.
Test_isSchedulableAfterClaimChange was sensitive to system load because of the
arbitrary delay when waiting for the assume cache to catch up. Running inside
a synctest bubble avoids this. While at it, the unit tests get converted
to ktesting (nicer failure output, no extra indentation needed for
tCtx.SyncTest).
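A minimal sketch of the bubble pattern with the standard library's
testing/synctest (Go >= 1.25; the unit tests use ktesting's tCtx.SyncTest
wrapper instead):

    import (
        "testing"
        "testing/synctest"
        "time"
    )

    func TestWithFakeTime(t *testing.T) {
        synctest.Test(t, func(t *testing.T) {
            // Inside the bubble time is virtual: this sleep completes as soon
            // as all goroutines in the bubble are durably blocked, so waiting
            // for the assume cache to catch up no longer depends on load.
            time.Sleep(10 * time.Second)
        })
    }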
TestPlugin/prebind-fail-with-binding-timeout relied on setting up a claim with
certain timestamps and then having that test case run within a certain
real-world time window. It's surprising that this didn't flake more often,
given that test execution order is random. Now the timestamp gets set right
before the test case runs. Converting to a synctest would be nicer, but
synctests cannot have sub-tests, which are used here to track where log
output and failures come from within the larger test case.
Inside the plugin itself some log output gets added to explain why a claim is
unavailable on a node in case of a binding timeout or error during Filter.
For debugging double allocation of the same
device (https://github.com/kubernetes/kubernetes/issues/133602) it is necessary
to have information about pools, devices and in-flight claims. Log calls get
extended and the config for DRA CI jobs updated to enable higher verbosity for
relevant source files.
Log output in such a cluster at verbosity 6 looks like this:
I1215 10:28:54.166872 1 allocator_incubating.go:130] "Gathered pool information" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" pools={"count":1,"devices":["dra-8841.k8s.io/kind-worker2/device-00"],"meta":[{"InvalidReason":"","id":"dra-8841.k8s.io/kind-worker2","isIncomplete":false,"isInvalid":false}]}
I1215 10:28:54.166941 1 allocator_incubating.go:254] "Gathered information about devices" logger="FilterWithNominatedPods.Filter.DynamicResources" pod="dra-8841/tester-3" node="kind-worker2" allocatedDevices={"count":2,"devices":["dra-8841.k8s.io/kind-worker/device-00","dra-8841.k8s.io/kind-worker3/device-00"]} minDevicesToBeAllocated=1
* First version of batching w/out signatures.
* First version of pod signatures.
* Integrate batching with signatures.
* Fix merge conflicts.
* Fixes from self-review.
* Test fixes.
* Fix a bug that limited batches to size 2
Also add some new high-level logging and
simplify the pod affinity signature.
* Re-enable batching on perf tests for now.
* fwk.NewStatus(fwk.Success)
* Review feedback.
* Review feedback.
* Comment fix.
* Two plugin-specific unit tests.
* Add cycle state to the sign call, apply to topo spread.
Also add unit tests for several plugin signature calls.
* Review feedback.
* Switch to distinct stats for hint and store calls.
* Switch signature from string to []byte
* Revert cyclestate in signs. Update node affinity.
Node affinity now sorts all of the various
nested arrays in the structure. CycleState no
longer in signature; revert to signing fewer
cases for pod spread.
* hack/update-vendor.sh
* Disable signatures when extenders are configured.
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update staging/src/k8s.io/kube-scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Disable node resource signatures when extended DRA enabled.
* Review feedback.
* Update pkg/scheduler/framework/plugins/imagelocality/image_locality.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/interface.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/plugins/nodedeclaredfeatures/nodedeclaredfeatures.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Update pkg/scheduler/framework/runtime/batch.go
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
* Review feedback.
* Fixes for review suggestions.
* Add integration tests.
* Linter fixes, test fix.
* Whitespace fix.
* Remove broken test.
* Unschedulable test.
* Remove go.mod changes.
---------
Co-authored-by: Maciej Skoczeń <87243939+macsko@users.noreply.github.com>
The test started without waiting for the ResourceSlice informer to have
synced. As a result, the "CEL-runtime-error-for-one-of-three-nodes" test case
failed randomly with a very low flake rate (less than 1% in local runs) because
CEL expressions never got evaluated due to not having the slices (yet).
Other tests were also less reliable, but not known to fail.
Support for DeviceTaintRules depends on a significant amount of
additional code:
- ResourceSlice tracker is a NOP without it.
- Additional informers and corresponding permissions in scheduler and controller.
- Controller code for handling status.
Not all users necessarily need DeviceTaintRules, so adding a second feature
gate for that code makes it possible to limit the blast radius of bugs in that
code without having to turn off device taints and tolerations entirely.
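A sketch of what the split could look like with component-base feature gates
(gate names and stages here are assumptions):

    import (
        "k8s.io/component-base/featuregate"
    )

    const (
        // DRADeviceTaints covers taints and tolerations themselves.
        DRADeviceTaints featuregate.Feature = "DRADeviceTaints"
        // DRADeviceTaintRules separately gates DeviceTaintRule support
        // (tracker, extra informers and permissions, status controller).
        DRADeviceTaintRules featuregate.Feature = "DRADeviceTaintRules"
    )

    func register(fg featuregate.MutableFeatureGate) error {
        return fg.Add(map[featuregate.Feature]featuregate.FeatureSpec{
            DRADeviceTaints:     {Default: false, PreRelease: featuregate.Alpha},
            DRADeviceTaintRules: {Default: false, PreRelease: featuregate.Alpha},
        })
    }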
Add a new `bindingTimeout` field to DynamicResources plugin args and wire it
into PreBind.
Changes:
- API: add `bindingTimeout` to DynamicResourcesArgs (staging + internal types).
- Defaults: default to 600 seconds when BOTH DRADeviceBindingConditions and
DRAResourceClaimDeviceStatus are enabled.
- Validation: require >= 1s; forbid when either feature gate is disabled.
- Plugin: plumb the args into `pl.bindingTimeout` and use it in
  `wait.PollUntilContextTimeout` for the binding-condition wait logic.
- Plugin: remove the legacy `BindingTimeoutDefaultSeconds`.
Tests:
- Add/adjust unit tests for validation and the PreBind timeout path.
- Ensure that <1s and negative values are rejected and that the field is
  forbidden when either gate is disabled.
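A hedged sketch of the args shape and the wait (field type and wiring are
assumptions; the condition helper is hypothetical):

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // Sketch of the args type in the scheduler configuration API.
    type DynamicResourcesArgs struct {
        metav1.TypeMeta

        // BindingTimeout caps how long PreBind waits for BindingConditions.
        // Defaulted to 600s when both gates are enabled; must be >= 1s.
        BindingTimeout *metav1.Duration
    }

    // Sketch of the PreBind wait loop.
    func waitForBindingConditions(ctx context.Context, timeout time.Duration, satisfied wait.ConditionWithContextFunc) error {
        return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, satisfied)
    }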
The cache and scheduler event handlers cannot be registered separately in the
informer because that leads to a race (the scheduler might schedule based on
an event before the cache is updated). Chaining the event handlers (cache
first, then scheduler) avoids this.
This also ensures that the cache is up-to-date before the scheduler
starts (HasSynced of the handler registration for the cache is checked).
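Sketched, the chaining looks like this (handler and variable names are
assumptions):

    // One registration; the cache handler runs before the scheduler handler
    // for each event, so scheduling never observes an event before the cache.
    chained := cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            cacheHandler.OnAdd(obj, false)
            schedulerHandler.OnAdd(obj, false)
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            cacheHandler.OnUpdate(oldObj, newObj)
            schedulerHandler.OnUpdate(oldObj, newObj)
        },
        DeleteFunc: func(obj interface{}) {
            cacheHandler.OnDelete(obj)
            schedulerHandler.OnDelete(obj)
        },
    }
    reg, err := informer.Informer().AddEventHandler(chained)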
Other changes:
- renamed package to avoid clash with other "cache" packages
- clarified nil handling
- feature gate check before instantiating the cache
- per-test logging
- utilruntime.HandleErrorWithLogger
- simpler cache.DeletedFinalStateUnknown
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Extended resource to device class mapping:
- Add a new interface "DeviceClassResolver" in the scheduler framework
- Add a global cache of mapping between the extended resource and the
device class
- The cache can be leveraged by the kube-apiserver and controller-manager
  along with the scheduler
- This change helps delegate requests to the dynamicresources plugin based on
  the mapping during node update events, thus avoiding an extra scheduling
  cycle (see the sketch below)
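A sketch of the interface shape (the actual definition may differ):

    import (
        v1 "k8s.io/api/core/v1"
    )

    // DeviceClassResolver maps an extended resource to the device class that
    // serves it, backed by the shared cache described above.
    type DeviceClassResolver interface {
        // DeviceClassForResource returns the device class name for the given
        // extended resource and whether a mapping exists.
        DeviceClassForResource(resource v1.ResourceName) (string, bool)
    }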
Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>