The tests assumed that instantiating a DRAManager followed by
informerFactory.WaitForCacheSync would be enough to have the manager
up-to-date, but that's not correct: WaitForCacheSync only waits for the
informer *caches* to be synced, while *event handlers* like the one in
the manager may still be processing events. The flake rate is low, though:
$ GOPATH/bin/stress -p 256 ./noderesources.test
5s: 0 runs so far, 0 failures, 256 active
10s: 256 runs so far, 0 failures, 256 active
15s: 256 runs so far, 0 failures, 256 active
20s: 512 runs so far, 0 failures, 256 active
25s: 567 runs so far, 0 failures, 256 active
30s: 771 runs so far, 0 failures, 256 active
/tmp/go-stress-20251226T181044-974980161
--- FAIL: TestCalculateResourceAllocatableRequest (0.81s)
--- FAIL: TestCalculateResourceAllocatableRequest/DRA-backed-resource-with-shared-device-allocation (0.00s)
extendedresourcecache.go:197: I1226 18:11:14.431337] Updated extended resource cache for explicit mapping extendedResource="extended.resource.dra.io/something" deviceClass="device-class-name"
extendedresourcecache.go:204: I1226 18:11:14.431380] Updated extended resource cache for default mapping extendedResource="deviceclass.resource.kubernetes.io/device-class-name" deviceClass="device-class-name"
extendedresourcecache.go:220: I1226 18:11:14.431394] Updated device class mapping deviceClass="device-class-name" extendedResource="extended.resource.dra.io/something"
resource_allocation_test.go:595: Expected requested=2, but got requested=1
FAIL
The flake rate becomes higher when WaitForCacheSync is changed so that it
doesn't poll and therefore returns more promptly, which is where this flake
was first observed.
The fix is to run the test in a synctest bubble where Wait can be used to
wait for all background activity, including event handling, to finish
before proceeding with the test.
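A minimal sketch of the resulting test shape, assuming Go 1.25's
testing/synctest; the fixture setup is elided and only the synchronization
point is shown:

    package noderesources_test

    import (
        "testing"
        "testing/synctest"
    )

    func TestCalculateResourceAllocatableRequest(t *testing.T) {
        synctest.Test(t, func(t *testing.T) {
            // Create the fake client, informer factory and DRA manager
            // inside the bubble (elided).

            // Wait blocks until every goroutine started in the bubble is
            // durably blocked, so the manager's event handlers have
            // finished processing the fixture objects before any
            // assertion runs.
            synctest.Wait()

            // Assertions against the manager's state can no longer race
            // with handler goroutines.
        })
    }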
synctest is less forgiving about lingering goroutines: a synctest bubble
must wait for its goroutines to stop, which in this case means that there
has to be a way to wait for the metric recorder shutdown, and event
handlers have to be removed.
This could be done with plain Go, but here test/utils/ktesting is used instead
because it offers some advantages:
- less boilerplate code
- automatic cancellation of the context (i.e. less manual context.WithCancel)
- tCtx.SyncTest is a direct substitute for t.Run, which avoids re-indenting
sub-tests. synctest itself needs another anonymous function, which makes
the line too long and forces re-indentation:

    t.Run(... func(...) {
        synctest.Test(... func() {
        })
    })
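For comparison, a rough sketch of the ktesting shape; the exact TContext
signatures, in particular the callback type taken by SyncTest, are
assumptions here rather than quotes from the package:

    package noderesources_test

    import (
        "testing"

        "k8s.io/kubernetes/test/utils/ktesting"
    )

    func TestCalculateResourceAllocatableRequest(t *testing.T) {
        tCtx := ktesting.Init(t)
        // Assumed shape: sub-test name plus callback, with the synctest
        // bubble and context cancellation handled inside ktesting, so the
        // sub-test body keeps the same indentation depth as a plain
        // t.Run sub-test.
        tCtx.SyncTest("DRA-backed resource", func(tCtx ktesting.TContext) {
            // sub-test body
        })
    }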
For the sake of consistency all tests get updated.
While at it, some code gets improved:
- t.Fatal(err) is not a good way to report an error because there is
no additional markup in the test output indicating that an unexpected
error occurred; it just logs err.Error(), which might not be very
informative or obvious (see the sketch after this list).
- newTestDRAManager aborts in case of a failure instead of
returning an error.
- Refactored `PreScore` method in `balanced_allocation.go` to skip
best-effort pods.
- Updated unit tests in `balanced_allocation_test.go` to check for
the new status codes.
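A minimal sketch of the first two points, using only the standard testing
package; DRAManager and setupDRAManager are stand-ins for the real types
and setup code, not the actual scheduler code:

    package noderesources_test

    import "testing"

    // DRAManager and setupDRAManager stand in for the real manager type
    // and its construction.
    type DRAManager struct{}

    func setupDRAManager() (*DRAManager, error) { return &DRAManager{}, nil }

    // newTestDRAManager aborts the test on failure instead of returning an
    // error, and reports the failure with context rather than via
    // t.Fatal(err).
    func newTestDRAManager(t *testing.T) *DRAManager {
        t.Helper()
        m, err := setupDRAManager()
        if err != nil {
            // The failure output now states what was being attempted
            // instead of only the bare err.Error() string.
            t.Fatalf("creating DRA manager: %v", err)
        }
        return m
    }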
* scheduler(NodeResourcesFit): calculatePodResourceRequest in PreScore phase
* scheduler(NodeResourcesFit and NodeResourcesBalancedAllocation): calculatePodResourceRequest in PreScore phase
* modify the comments and tests.
* revert the tests.
* don't need to consider nodes.
* use list instead of map.
* add comment for podRequests.
* avoid using negative wording in variable names.
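A rough sketch of what the list above amounts to: compute the pod's
resource requests once in PreScore, keep them in a slice (podRequests) in
the cycle state, and let Score reuse them for every node. The types and
helper below are illustrative, not the plugin's actual implementation:

    package noderesources

    import v1 "k8s.io/api/core/v1"

    // preScoreState is written to the scheduling cycle state by PreScore
    // and read back by Score for every node, so the pod side is computed
    // once per cycle instead of once per node.
    type preScoreState struct {
        // podRequests holds the pod's request for each scored resource,
        // kept in the same order as the plugin's resource list (a slice
        // rather than a map, per "use list instead of map" above).
        podRequests []int64
    }

    // calculatePodResourceRequestList stands in for the real helper: it
    // sums the container requests for each scored resource name.
    func calculatePodResourceRequestList(pod *v1.Pod, resources []v1.ResourceName) []int64 {
        podRequests := make([]int64, len(resources))
        for i, name := range resources {
            for _, c := range pod.Spec.Containers {
                q := c.Resources.Requests[name]
                // MilliValue keeps fractional CPU requests exact enough
                // for a sketch; the real code distinguishes resource types.
                podRequests[i] += q.MilliValue()
            }
        }
        return podRequests
    }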
Move scheduler plugin unit tests to use the testing PodWrapper
where applicable, to reduce duplicated pod creation
code and shorten the number of lines.
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
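A minimal sketch of the PodWrapper style, assuming the wrappers in
pkg/scheduler/testing; the concrete values are only illustrative:

    package plugins_test

    import (
        v1 "k8s.io/api/core/v1"
        st "k8s.io/kubernetes/pkg/scheduler/testing"
    )

    // makeTestPod replaces a dozen-plus lines of v1.Pod literal
    // (ObjectMeta, Spec, Containers, ResourceRequirements, ...) with one
    // chained expression.
    func makeTestPod() *v1.Pod {
        return st.MakePod().Name("pod-a").Namespace("ns").Node("node-1").
            Req(map[v1.ResourceName]string{
                v1.ResourceCPU:    "500m",
                v1.ResourceMemory: "128Mi",
            }).
            Obj()
    }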
kubernetes#60525 introduced the
Balanced attached node volumes feature gate to include the volume
count when prioritizing nodes. The reason for introducing this
flag was its usefulness in the Red Hat OpenShift Online environment,
which is not being used any more. So the flag is being removed,
which helps the maintainability of the scheduler code base,
as mentioned at kubernetes#101489 (comment)
Given that we assign default CPU/memory requests to containers that don't provide any, the calculated usage can exceed the allocatable.
Change-Id: I72e249652acacfbe8cea0dd6f895dabe43ff6376
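A minimal sketch of the clamp this implies; the function and names are
illustrative, not the plugin's own code:

    package noderesources

    // requestFraction turns a requested/allocatable pair into a usage
    // fraction for the balance calculation. Because containers without
    // explicit requests are charged a default CPU/memory request, the
    // summed request can exceed the node's allocatable, so the fraction
    // is capped at 1.
    func requestFraction(requested, allocatable int64) float64 {
        if allocatable == 0 {
            return 1
        }
        f := float64(requested) / float64(allocatable)
        if f > 1 {
            f = 1
        }
        return f
    }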