DeviceTaintRule is off by default because the corresponding v1beta2 API group
is off. When the feature was enabled, the scheduler used the potentially still
disabled v1alpha3 API version instead of the new v1beta2. This caused the
scheduler to fail while setting up informers, after which it did not schedule
pods.
This commit introduces metrics and improves log outputs for
DRA Device Binding Conditions (KEP-5007):
- scheduler_dra_bindingconditions_allocations_total
  Counts the number of per-device scheduling attempts
  during PreBind where BindingConditions are in use.
- scheduler_dra_bindingconditions_wait_duration_seconds
  Observes the time spent waiting for BindingConditions
  to be satisfied during PreBind.
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
The gang scheduler adds a pod to the podGroupInfo before enqueueing it. If a
podGroupInfo contains thousands of pods, each call to AssumedPods,
AssignedPods and AllPods holds the lock while cloning the map, so new pods
have to wait a long time to be added to the podGroupInfo and, in turn, to be
enqueued.
In our test, a pod waits for seconds to be enqueued when a gang group
contains 50000 pods.
This PR avoids traversing and cloning the map by adding the AllPodsCount,
AssumedPodsCount and AssignedPodsCount methods, and it ensures that assumed
pods and assigned pods are disjoint.
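The idea of count accessors can be sketched as follows. This is a minimal
illustration, not the scheduler's actual code: the podGroupState type, its
fields, and the AddPod/AssumePod/AssignPod helpers are hypothetical names
chosen for the example. Only the three *Count method names come from the
description above.

```go
package main

import (
	"fmt"
	"sync"
)

// podGroupState is a hypothetical stand-in for the per-group bookkeeping.
type podGroupState struct {
	mu       sync.RWMutex
	all      map[string]struct{}
	assumed  map[string]struct{}
	assigned map[string]struct{}
}

func newPodGroupState() *podGroupState {
	return &podGroupState{
		all:      map[string]struct{}{},
		assumed:  map[string]struct{}{},
		assigned: map[string]struct{}{},
	}
}

// AddPod registers a pod with the group.
func (s *podGroupState) AddPod(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.all[name] = struct{}{}
}

// AssumePod marks a pod as assumed and removes it from the assigned set,
// which keeps the two sets disjoint.
func (s *podGroupState) AssumePod(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.assigned, name)
	s.assumed[name] = struct{}{}
}

// AssignPod marks a pod as assigned and removes it from the assumed set.
func (s *podGroupState) AssignPod(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.assumed, name)
	s.assigned[name] = struct{}{}
}

// The count accessors hold the read lock only for a len() call instead of
// cloning a map that may contain tens of thousands of entries.
func (s *podGroupState) AllPodsCount() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.all)
}

func (s *podGroupState) AssumedPodsCount() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.assumed)
}

func (s *podGroupState) AssignedPodsCount() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.assigned)
}

func main() {
	s := newPodGroupState()
	for i := 0; i < 3; i++ {
		s.AddPod(fmt.Sprintf("pod-%d", i))
	}
	s.AssumePod("pod-0")
	s.AssumePod("pod-1")
	s.AssignPod("pod-0") // moves pod-0 from assumed to assigned
	fmt.Println(s.AllPodsCount(), s.AssumedPodsCount(), s.AssignedPodsCount())
}
```

Because the assumed and assigned sets are disjoint, a caller can add the two
counts to get the number of scheduled pods without double counting.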
The code paths for adding AllocationTimestamp were not tested well. None of
the test cases verified that an AllocationTimestamp gets added at all because
go-cmp was instructed to ignore the unpredictable field.
We can do better than that and at least check for existence by normalizing all
non-nil timestamps to the empty time. This affects all tests where binding
conditions, and thus AllocationTimestamp support, are enabled.
The retry loop for status updates was untested. The fake client has to return a
conflict status error to trigger it. This enables writing a test case where a
concurrent deallocation would have caused the nil panic without the previous
fix.
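How a fake client can force the retry path is sketched below. Everything in
this example is hypothetical (the claim type, fakeClient, errConflict and
reserveWithRetry) and it uses only the standard library. The real test relies
on the Kubernetes fake client returning a conflict status error.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical stand-in for a conflict status error.
var errConflict = errors.New("conflict: object was modified")

type claim struct {
	ResourceVersion int
	Allocated       bool
	Reserved        bool
}

// fakeClient fails the first conflictsLeft updates with a conflict and, while
// doing so, simulates a concurrent deallocation of the stored claim.
type fakeClient struct {
	stored        claim
	conflictsLeft int
}

func (c *fakeClient) Get() claim { return c.stored }

func (c *fakeClient) UpdateStatus(updated claim) error {
	if c.conflictsLeft > 0 {
		c.conflictsLeft--
		c.stored.Allocated = false // concurrent deallocation
		c.stored.ResourceVersion++
		return errConflict
	}
	if updated.ResourceVersion != c.stored.ResourceVersion {
		return errConflict
	}
	updated.ResourceVersion++
	c.stored = updated
	return nil
}

// reserveWithRetry retries on conflict and re-checks the allocation state
// after each refresh instead of assuming that it is still present.
func reserveWithRetry(c *fakeClient, maxAttempts int) (attempts int, err error) {
	current := c.Get()
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if !current.Allocated {
			return attempts, errors.New("claim is no longer allocated")
		}
		current.Reserved = true
		err = c.UpdateStatus(current)
		if !errors.Is(err, errConflict) {
			return attempts, err
		}
		current = c.Get() // refresh and try again
	}
	return maxAttempts, err
}

func main() {
	c := &fakeClient{stored: claim{Allocated: true}, conflictsLeft: 1}
	attempts, err := reserveWithRetry(c, 3)
	fmt.Println(attempts, err)
}
```

The first attempt hits the injected conflict, and the second attempt sees the
deallocated claim and returns an error instead of using stale state.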
For binding conditions, one test case gets added which runs through the full
flow of allocating a claim and trying to bind it. All other test cases seem to
have started with the claim already allocated.
Altogether this increases coverage from 82.4% to 83.7%.
* Drop WorkloadRef field and introduce SchedulingGroup field in Pod API
* Introduce v1alpha2 Workload and PodGroup APIs, drop v1alpha1 Workload API
* Run hack/update-codegen.sh
* Adjust kube-scheduler code and integration tests to v1alpha2 API
* Drop v1alpha1 scheduling API group and run make update
---------
Co-authored-by: yongruilin <yongrlin@outlook.com>
When DRAExtendedResource is enabled, the dynamicresources test setup
registers an event handler for DeviceClasses but was not waiting for it
to sync. This can lead to flaky tests where the cache is not fully
populated when the test starts.
This change captures the event handler registration and includes its
DoneChecker in a WaitFor call.
This fixes a bug that caused log calls involving `klog.Logger` to not be
checked.
As a result we have to fix some code that is now considered faulty:
ERROR: pkg/controller/serviceaccount/tokens_controller.go:382:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (e *TokensController) generateTokenIfNeeded(ctx context.Context, logger klog.Logger, serviceAccount *v1.ServiceAccount, cachedSecret *v1.Secret) ( /* retry */ bool, error) {
ERROR: ^
ERROR: pkg/controller/storageversionmigrator/storageversionmigrator.go:299:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (svmc *SVMController) runMigration(ctx context.Context, logger klog.Logger, gvr schema.GroupVersionResource, resourceMonitor *garbagecollector.Monitor, toBeProcessedSVM *svmv1beta1.StorageVersionMigration, listResourceVersion string) (err error, failed bool) {
ERROR: ^
ERROR: pkg/proxy/node.go:121:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to exist", nodeName)
ERROR: ^
ERROR: pkg/proxy/node.go:123:3: logging function "Error" should not use format specifier "%q" (logcheck)
ERROR: klog.FromContext(ctx).Error(nil, "Timed out waiting for node %q to be assigned IPs", nodeName)
ERROR: ^
ERROR: pkg/scheduler/backend/queue/scheduling_queue.go:610:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (p *PriorityQueue) runPreEnqueuePlugin(ctx context.Context, logger klog.Logger, pl fwk.PreEnqueuePlugin, pInfo *framework.QueuedPodInfo, shouldRecordMetric bool) *fwk.Status {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:286:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) deleteClaim(ctx context.Context, claim *resourceapi.ResourceClaim, logger klog.Logger) error {
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:499:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) waitForExtendedClaimInAssumeCache(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:528:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) createExtendedResourceClaimInAPI(
ERROR: ^
ERROR: pkg/scheduler/framework/plugins/dynamicresources/extendeddynamicresources.go:592:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (pl *DynamicResources) unreserveExtendedResourceClaim(ctx context.Context, logger klog.Logger, pod *v1.Pod, state *stateData) {
ERROR: ^
ERROR: pkg/scheduler/framework/runtime/batch.go:171:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (b *OpportunisticBatch) batchStateCompatible(ctx context.Context, logger klog.Logger, pod *v1.Pod, signature fwk.PodSignature, cycleCount int64, state fwk.CycleState, nodeInfos fwk.NodeInfoLister) bool {
ERROR: ^
ERROR: staging/src/k8s.io/component-base/featuregate/feature_gate.go:890:4: Additional arguments to Info should always be Key Value pairs. Please check if there is any key or value missing. (logcheck)
ERROR: logger.Info("Warning: SetEmulationVersionAndMinCompatibilityVersion will change already queried feature", "featureGate", feature, "oldValue", oldVal, newVal)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:108:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("pluginSocksDir: %s", pluginSocksDir)
ERROR: ^
ERROR: test/images/sample-device-plugin/sampledeviceplugin.go:123:2: logging function "Info" should not use format specifier "%s" (logcheck)
ERROR: logger.Info("CDI_ENABLED: %s", cdiEnabled)
ERROR: ^
While waiting for this to merge, another call was added which also doesn't
follow conventions:
ERROR: pkg/kubelet/kubelet.go:2454:1: A function should accept either a context or a logger, but not both. Having both makes calling the function harder because it must be defined whether the context must contain the logger and callers have to follow that. (logcheck)
ERROR: func (kl *Kubelet) deletePod(ctx context.Context, logger klog.Logger, pod *v1.Pod) error {
ERROR: ^
Contextual logging has been beta and enabled by default for several releases
now. It's mostly just a matter of wrapping up and declaring it GA. Therefore
the calls which directly call WithName or WithValues (which always have an
effect) are left as-is instead of converting them to use the klog wrappers
(which support disabling the effect). To allow that, the linter gets
reconfigured to not complain about this anymore, anywhere.
The calls which would have to be fixed otherwise are:
ERROR: pkg/kubelet/cm/dra/claiminfo.go:170:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-claiminfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:45:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:89:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/healthinfo.go:157:11: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName("dra-healthinfo")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:175:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:239:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:593:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:781:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(context.Background()).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager.go:898:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-manager")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/manager_test.go:1638:15: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(streamCtx).WithName(st.Name())
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:77:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:108:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: pkg/kubelet/cm/dra/plugin/dra_plugin.go:161:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger := klog.FromContext(ctx).WithName("dra-plugin")
ERROR: ^
ERROR: staging/src/k8s.io/dynamic-resource-allocation/resourceslice/tracker/tracker.go:695:14: function "WithValues" should be called through klogr.LoggerWithValues (logcheck)
ERROR: logger := logger.WithValues("device", deviceID)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:42:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd0URL, stopEtcd0, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd0"), "etcd_watchcache0", etcdArgs)
ERROR: ^
ERROR: test/integration/apiserver/watchcache_test.go:47:54: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: etcd1URL, stopEtcd1, err := framework.RunCustomEtcd(klog.FromContext(ctx).WithName("etcd1"), "etcd_watchcache1", etcdArgs)
ERROR: ^
ERROR: test/integration/scheduler_perf/scheduler_perf.go:1149:12: function "WithName" should be called through klogr.LoggerWithName (logcheck)
ERROR: logger = logger.WithName(tCtx.Name())
ERROR: ^