Prevents EndpointSlice churn for headless services where
getEndpointPorts returns [] but existing slices from the API
have nil ports, causing different hash values.
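A minimal, self-contained sketch of the underlying mismatch (illustrative
only, not the controller's actual hashing code): a nil slice and an empty
slice format differently and therefore hash differently.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // hashPorts hashes the textual representation of the ports slice.
    // %#v distinguishes a nil slice from an empty one, so the two
    // inputs below end up with different hashes.
    func hashPorts(ports []int32) uint64 {
        h := fnv.New64()
        fmt.Fprintf(h, "%#v", ports)
        return h.Sum64()
    }

    func main() {
        var fromAPI []int32   // nil, as decoded from the API
        computed := []int32{} // empty, as returned by getEndpointPorts
        fmt.Println(hashPorts(fromAPI) == hashPorts(computed)) // false
    }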
- Use netutils.IsIPv6(ip) instead of manual nil/To4 check
- Remove unnecessary ip.To16() call since IPv6 is already 16 bytes
- Remove ipFamily from grep pattern since IP format ensures correctness
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The recent change to support importing ktesting into an E2E suite
without progress reporting was flawed:
- If a Go unit test had a deadline (the default when invoked
by `go test`!), the early return skipped initializing progress
reporting.
- When it didn't, for example when invoking a test binary directly
under stress, a test created goroutines that kept running, which broke
leak checking in e.g. an integration test's TestMain.
The revised approach uses reference counting: as long as some unit test is
running, the progress reporting and its goroutines remain active. When the
last one ends, they get cleaned up, which keeps the goleak checker happy.
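Schematically (hypothetical names, not the actual ktesting code), the
reference counting looks like this:

    package ktestingsketch

    import "sync"

    var (
        mu     sync.Mutex
        active int
        stopCh chan struct{}
        wg     sync.WaitGroup
    )

    // testStarted is called when a unit test begins; the first caller
    // starts the shared progress-reporting goroutine.
    func testStarted() {
        mu.Lock()
        defer mu.Unlock()
        active++
        if active == 1 {
            stopCh = make(chan struct{})
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-stopCh // stands in for the real progress reporting loop
            }()
        }
    }

    // testFinished is called when a unit test ends; the last caller stops
    // the goroutine and waits for it, which keeps goleak happy.
    func testFinished() {
        mu.Lock()
        defer mu.Unlock()
        active--
        if active == 0 {
            close(stopCh)
            wg.Wait()
        }
    }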
Partially reverts cb011623c8 from #135369.
Using RepoDigests[0] as image identity causes credential verification
issues because it makes identity location-dependent (registry.io/image@sha256:...)
instead of content-based (sha256:...). This defeats deduplication and
creates separate pull records for identical image content from different
registries.
ImagePulledRecord already handles per-registry credentials via its
two-level design: ImageRef identifies content, CredentialMapping tracks
registry-specific credentials.
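Roughly, the two-level design looks like this (field names taken from the
description above; everything else is an approximation rather than the
exact API):

    package pullmanager

    // ImagePulledRecord (sketch): identity is content-based, credentials
    // are tracked per registry/location underneath it.
    type ImagePulledRecord struct {
        // ImageRef identifies the image content (e.g. "sha256:..."),
        // independent of which registry it was pulled from.
        ImageRef string

        // CredentialMapping records, per image location, which
        // credentials were already verified for that content.
        CredentialMapping map[string]ImagePullCredentials
    }

    // ImagePullCredentials holds the registry-specific credential
    // details (omitted in this sketch).
    type ImagePullCredentials struct{}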
Related: #136498, #136549
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The TestCause tests were already unreliable in the CI. The others failed under
stress.
As a synctest, we have to be more careful about how the parent context for
TestCause is constructed and cleaned up (it must happen inside the bubble),
but once that's handled we can reliably measure the (fake) time and compare
it exactly against the expected results.
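For illustration, the pattern with Go's testing/synctest looks roughly like
this (a sketch under that assumption; the actual tests may use a different
wrapper):

    package example

    import (
        "context"
        "testing"
        "testing/synctest"
        "time"
    )

    func TestCauseSketch(t *testing.T) {
        synctest.Test(t, func(t *testing.T) {
            // The parent context must be created and cancelled inside
            // the bubble so its timers use the bubble's fake clock.
            parent, cancel := context.WithCancel(context.Background())
            defer cancel()

            ctx, cancelTimeout := context.WithTimeout(parent, time.Second)
            defer cancelTimeout()

            start := time.Now()
            <-ctx.Done()
            // With the fake clock, exactly one second has passed.
            if got := time.Since(start); got != time.Second {
                t.Errorf("expected exactly 1s, got %v", got)
            }
        })
    }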
Change /etc/os-release to /etc/passwd in subPath test to avoid
symlink issues with Alpine 3.21 (kitten:1.8).
Add Feature:ImageVolume tag to properly categorize tests for CI.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
The /proc/net/nf_conntrack file uses fully expanded IPv6 addresses
with leading zeros in each 16-bit group. For example:
fc00:f853:ccd:e793::3 -> fc00:f853:0ccd:e793:0000:0000:0000:0003
Add expandIPv6ForConntrack() helper function to expand IPv6 addresses
to the format used by /proc/net/nf_conntrack before using them in
the grep pattern.
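A minimal sketch of what such a helper could look like (illustrative,
assuming the input is already known to be IPv6):

    package conntracksketch

    import (
        "fmt"
        "net"
        "strings"
    )

    // expandIPv6ForConntrack expands an IPv6 address into the fully
    // padded form used by /proc/net/nf_conntrack, e.g.
    // "fc00:f853:ccd:e793::3" -> "fc00:f853:0ccd:e793:0000:0000:0000:0003".
    func expandIPv6ForConntrack(ip net.IP) string {
        ip = ip.To16()
        groups := make([]string, 0, 8)
        for i := 0; i < 16; i += 2 {
            groups = append(groups, fmt.Sprintf("%02x%02x", ip[i], ip[i+1]))
        }
        return strings.Join(groups, ":")
    }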
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Long-running tests like TestDRA/all/DeviceBindingConditions (42.50s)
should run in parallel with other tests; otherwise the overall runtime is
too high.
This in turn requires allowing more pods per node so that scheduling does
not get blocked.
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).
The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume cache
(which feeds the allocated devices tracker) and of the in-flight claims can
interleave such that GatherAllocatedState does not see the device in that
claim as allocated:
dra_manager.go:263: I0115 15:11:04.407714 18778] scheduler: Starting GatherAllocatedState
...
allocateddevices.go:189: I0115 15:11:04.407945 18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
dynamicresources.go:1150: I0115 15:11:04.407981 89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
dra_manager.go:201: I0115 15:11:04.408008 89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
dynamicresources.go:1157: I0115 15:11:04.408044 89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
{
"devices": {
"results": [
{
"request": "req-1",
"driver": "testdra-all-usesallresources-hvs5d.driver",
"pool": "worker-5",
"device": "worker-5-device-094"
}
]
},
"nodeSelector": {
"nodeSelectorTerms": [
{
"matchFields": [
{
"key": "metadata.name",
"operator": "In",
"values": [
"worker-5"
]
}
]
}
]
},
"allocationTimestamp": "2026-01-15T14:11:04Z"
}
>
dra_manager.go:280: I0115 15:11:04.408085 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
dra_manager.go:280: I0115 15:11:04.408137 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
default_binder.go:69: I0115 15:11:04.408175 89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
dra_manager.go:265: I0115 15:11:04.408264 18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {
Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight
=> device never seen as allocated
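In simplified form (a generic sketch, not the scheduler's actual data
structures), the interleaving looks like this:

    package racesketch

    import "sync"

    type tracker struct {
        mu        sync.Mutex
        allocated map[string]bool // fed by the assume cache
        inFlight  map[string]bool // claims still being allocated
    }

    // gather mirrors GatherAllocatedState: it reads the two sources in
    // separate critical sections, so a concurrent move can fall in between.
    func (t *tracker) gather() map[string]bool {
        t.mu.Lock()
        snapshot := make(map[string]bool, len(t.allocated))
        for d := range t.allocated {
            snapshot[d] = true
        }
        t.mu.Unlock()

        // <-- another goroutine may run move() right here

        t.mu.Lock()
        for d := range t.inFlight {
            snapshot[d] = true
        }
        t.mu.Unlock()
        return snapshot // a device moved in between shows up in neither read
    }

    // move mirrors the assume cache update plus the in-flight removal.
    func (t *tracker) move(device string) {
        t.mu.Lock()
        t.allocated[device] = true
        delete(t.inFlight, device)
        t.mu.Unlock()
    }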
This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but enough that regressions should show up as flakes.
The distroless-iptables image no longer includes the conntrack binary
as of v0.8.7 (removed in kubernetes/release#4223 since kube-proxy no
longer needs it after kubernetes#126847).
Update the KubeProxy CLOSE_WAIT timeout test to read /proc/net/nf_conntrack
directly instead of using the conntrack command. The file contains the
same connection tracking data and is accessible from the privileged
host-network pod.
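For illustration, picking the relevant entry out of the file could look
roughly like this (a sketch; the real test applies a grep pattern from the
privileged host-network pod):

    package conntrackread

    import (
        "fmt"
        "os"
        "strings"
    )

    // findCloseWaitEntry returns the first CLOSE_WAIT conntrack entry
    // that mentions the given (already expanded) address.
    func findCloseWaitEntry(expandedIP string) (string, error) {
        data, err := os.ReadFile("/proc/net/nf_conntrack")
        if err != nil {
            return "", err
        }
        for _, line := range strings.Split(string(data), "\n") {
            if strings.Contains(line, "CLOSE_WAIT") && strings.Contains(line, expandedIP) {
                return line, nil
            }
        }
        return "", fmt.Errorf("no CLOSE_WAIT entry for %s", expandedIP)
    }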
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
In the unlikely situation that sharedProcessor.distribute was triggered by a
resync before sharedProcessor.run had a chance to start the listeners, the
sharedProcessor deadlocked: sharedProcessor.distribute held a read/write lock
on listenersLock while being blocked on the write to the listener's
channel. The listeners that would have read from those channels never got
started because sharedProcessor.run was blocked trying to acquire a read
lock on listenersLock.
This gets fixed by releasing the read/write lock in sharedProcessor.distribute
while waiting for all listeners to be started. Because either all or no
listeners are started, the existing global listenersStarted boolean is
sufficient.
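Schematically (a simplified model, not the actual client-go code), the fix
ensures distribute never blocks while holding listenersLock:

    package informersketch

    import "sync"

    type processor struct {
        lock             sync.RWMutex
        listenersStarted bool
        started          chan struct{} // closed by run() once the listeners run
        listeners        []chan string
    }

    func (p *processor) distribute(msg string) {
        p.lock.RLock()
        started := p.listenersStarted
        listeners := append([]chan string(nil), p.listeners...)
        p.lock.RUnlock() // released before any blocking operation

        if !started {
            <-p.started // wait for run() without holding the lock
        }
        for _, ch := range listeners {
            ch <- msg // may block until a listener goroutine consumes it
        }
    }

    func (p *processor) run() {
        p.lock.Lock()
        defer p.lock.Unlock()
        for _, ch := range p.listeners {
            go func(ch chan string) {
                for range ch {
                    // handle the notification
                }
            }(ch)
        }
        p.listenersStarted = true
        close(p.started)
    }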
The TestListenerResyncPeriods test now runs twice, with and without the
artificial delay. It gets converted to a synctest, so it executes quickly
despite the time.Sleep calls, and timing is deterministic. The enhanced log
output confirms that with the delay, the initial sync completes later:
=== RUN TestListenerResyncPeriods
shared_informer_test.go:236: 0s: listener3: handle: pod1
shared_informer_test.go:236: 0s: listener3: handle: pod2
shared_informer_test.go:236: 0s: listener1: handle: pod1
shared_informer_test.go:236: 0s: listener1: handle: pod2
shared_informer_test.go:236: 0s: listener2: handle: pod1
shared_informer_test.go:236: 0s: listener2: handle: pod2
shared_informer_test.go:236: 2s: listener2: handle: pod1
shared_informer_test.go:236: 2s: listener2: handle: pod2
shared_informer_test.go:236: 3s: listener3: handle: pod1
shared_informer_test.go:236: 3s: listener3: handle: pod2
--- PASS: TestListenerResyncPeriods (0.00s)
=== RUN TestListenerResyncPeriodsDelayed
shared_informer_test.go:236: 1s: listener1: handle: pod1
shared_informer_test.go:236: 1s: listener1: handle: pod2
shared_informer_test.go:236: 1s: listener2: handle: pod1
shared_informer_test.go:236: 1s: listener2: handle: pod2
shared_informer_test.go:236: 1s: listener3: handle: pod1
shared_informer_test.go:236: 1s: listener3: handle: pod2
shared_informer_test.go:236: 2s: listener2: handle: pod1
shared_informer_test.go:236: 2s: listener2: handle: pod2
shared_informer_test.go:236: 3s: listener3: handle: pod1
shared_informer_test.go:236: 3s: listener3: handle: pod2
--- PASS: TestListenerResyncPeriodsDelayed (0.00s)