Commit graph

127 commits

Author SHA1 Message Date
Ondra Kupka
51ef94c547 controller/nodelifecycle: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:38 +01:00
xigang
574b09b7de nodelifecycle: fix ComputeZoneState method comment
Signed-off-by: xigang <wangxigang2014@gmail.com>
2025-09-28 10:56:56 +08:00
Kubernetes Prow Robot
1eb2c4182d
Merge pull request #134102 from mayank-agrwl/namespace-nodelifecycle-contextual
Replace HandleError with HandleErrorWithContext
2025-09-23 18:50:17 -07:00
Aditi Gupta
f58d1e101f refactor(controller): Use WithContext variants in cloud node controllers
This change refactors the cloud-specific versions of the node lifecycle
and node IPAM controllers to use a context.Context for cancellation and
contextual logging, replacing the legacy stopCh pattern.

This is a follow-up to PR #133985, where these controllers were
separated out due to their use in the legacy Cloud Controller Manager
(CCM).

It is a known issue that the CCM's startup logic does not pass the
controller name via the context. This change proceeds with the
refactoring to unify the cancellation logic across controllers, while
acknowledging that contextual logs will be less detailed when these
controllers are run in the CCM.

Signed-off-by: Aditi Gupta <aditigpta@google.com>
2025-09-17 00:17:38 -07:00
Mayank Agrawal
d12eeb98d0 Replace HandleError with HandleErrorWithContext 2025-09-16 23:47:23 -07:00
Quan Tian
f718096b74 NoExecute taint should be added when a Node's ready condition becomes Unknown
After a Node has stopped posting heartbeats for nodeMonitorGracePeriod,
it will be considered unreachable, its ready condition will be set to
Unknown, NoSchedule taint will be added, all Pods on it will be set to
NotReady, but there is always a delay of 5s before NoExecute taint is
added to the Node, adding 5s to the recovery time of Pods which are
supposed to be evicted by the taint and recreated on other Nodes sooner.

The delay is because processTaintBaseEviction() uses the last observed
ready condition of the Node instead of the current one to determine
whether it should add the Node to the taint queue. When a Node is set to
unreachable due to missing heartbeats, the last observed ready condition
is still true and the current ready condition is unknown, we should use
the latter for processTaintBaseEviction().

Signed-off-by: Quan Tian <qtian@vmware.com>
2025-05-10 17:22:11 +08:00
xigang
5c4948ff31 controller: factor out pod node name indexer helper function
Signed-off-by: xigang <wangxigang2014@gmail.com>
2025-03-17 20:21:30 +08:00
Kubernetes Prow Robot
71523a7db6
Merge pull request #122644 from gyuho/logs-removing-taints
chores(controller/nodelifecycle): make node taint removal logs more a…
2024-10-23 01:17:15 +01:00
guozheng-shen
686ccceba3
Update node_lifecycle_controller.go
remove 'pod-eviction-timeout' comment
2024-09-09 14:36:41 +08:00
devppratik
f8bf6b97b8 Update Node Monitor Grace Period default duration to 50s
Update description

Improve flag comment

Update Test case value to be 50s by default

Update Description

Run make update

Minor description fix
2024-07-24 22:54:44 +05:30
Alvaro Aleman
6d0ac8c561 Use the generic/typed workqueue throughout
This change makes us use the generic workqueue throughout the project in
order to improve type safety and readability of the code.
2024-05-04 14:33:12 -04:00
liyuerich
98dfaed4be drop deprecated workqueue NewNamed package
Signed-off-by: liyuerich <yue.li@daocloud.io>
2024-04-28 14:04:51 +08:00
Kubernetes Prow Robot
67a06c2056
Merge pull request #122293 from mengjiao-liu/controller-reconsider-log-verbosity
kube-controller-manager: readjust log verbosity
2024-02-29 11:55:21 -08:00
Mengjiao Liu
b584b87a94 kube-controller-manager: readjust log verbosity
- Increase the global level for broadcaster's logging to 3 so that users can ignore event messages by lowering the logging level. It reduces information noise.
- Making sure the context is properly injected into the broadcaster, this will allow the -v flag value to be used also in that broadcaster, rather than the above global value.
- test: use cancellation from ktesting
- golangci-hints: checked error return value
2024-02-26 14:51:56 +08:00
fusida
9f6b48f1e7 fix node lifecycle controller panic when conditionType ready is nil 2024-02-26 11:26:45 +08:00
Gyuho Lee
bfc2562d6c
chores(controller/nodelifecycle): make node taint removal logs more accurate
Signed-off-by: Gyuho Lee <gyuho@lepton.ai>
2024-01-08 20:54:05 +08:00
Andrea Tosatto
ccda2d6fd4 kube-controller-manager: Decouple TaintManager from NodeLifeCycleController (KEP-3902) 2023-10-30 12:23:56 +00:00
jackcui
9d8959224c add explanation for large-cluster-size-threshold arg about multiple zones cluster 2023-07-21 17:25:51 +08:00
Ziqi Zhao
dfc1838379 Migrated pkg/controller/volume|util|replicaset|nodeipam to contextual logging
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-07-06 07:39:52 +08:00
Andrea Tosatto
d09842e0ad
node-lifecycle-controller: improve monitorNodeHealth test-coverage (#116687)
* node-lifecycle-controller: refactor monitorNodeHealth tests to improve test-coverage

* address PR review comments

* dedupe test logic
2023-04-12 07:02:33 -07:00
Kubernetes Prow Robot
27e23bad7d
Merge pull request #116529 from pohly/controllers-with-name
kube-controller-manager: convert to structured logging
2023-03-14 14:12:55 -07:00
Ziqi Zhao
d1aa73312c
pkg/controller/util support contextual logging (#115049)
Signed-off-by: Ziqi Zhao <zhaoziqi9146@gmail.com>
2023-03-14 12:38:14 -07:00
Patrick Ohly
99151c39b7 kube-controller-manager: convert to structured logging
Most of the individual controllers were already converted earlier. Some log
calls were missed or added and then not updated during a rebase. Some of those
get updated here to fill those gaps.

Adding of the name to the logger used by each controller gets
consolidated in this commit. By using the name under which the
controller is registered we ensure that the names in the log
are consistent.
2023-03-14 19:16:32 +01:00
Damien Grisonnet
ac394c5c19 Cleanup deprecated metrics
Remove the following deprecated metrics:
- node_collector_evictions_number
- scheduler_e2e_scheduling_duration_seconds

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
2023-03-13 22:55:34 +01:00
Andrea Tosatto
cae19f9e85 Remove deprecated pod-eviction-timeout flag from controller-manager 2023-03-07 18:14:18 +00:00
kerthcet
e5c812bbe7 Remove CLI flag enable-taint-manager
Signed-off-by: kerthcet <kerthcet@gmail.com>
2023-03-07 18:11:49 +00:00
JunYang
780ef3afb0 use klog.InfoS instead of klog.V(0),Info 2023-03-07 15:50:01 +08:00
JunYang
29086e2b04 use klog instead of klog.V(0) 2023-01-14 21:15:50 +08:00
Christopher Broglie
3c88de52c8 controller/nodelifecycle: Make monitorNodeHealth process nodes concurrently
Marking the pods not ready on a node requires looping over them and
updating each pod's status one at a time. This is performed serially,
and can take a while if we're processing each node serially as well.

Since the time is spent waiting on io, there's an opportunity to go
faster by processing multiple nodes concurrently. This change modifies
the loop to process nodes in parallel, using the same number of workers
as doNodeProcessingPassWorker.

This change also introduces histogram metrics to better observe
monitorNodeHealth.
2023-01-11 12:34:39 -08:00
Han Kang
2bbd445f50 remove rate limiter metric as it is not in use
Change-Id: I91157653e3860eeecc3f572aee88da6ffc65faed
2022-10-13 13:07:11 -07:00
Davanum Srinivas
a9593d634c
Generate and format files
- Run hack/update-codegen.sh
- Run hack/update-generated-device-plugin.sh
- Run hack/update-generated-protobuf.sh
- Run hack/update-generated-runtime.sh
- Run hack/update-generated-swagger-docs.sh
- Run hack/update-openapi-spec.sh
- Run hack/update-gofmt.sh

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2022-07-26 13:14:05 -04:00
Kubernetes Prow Robot
a6afdf45dd
Merge pull request #110359 from MadhavJivrajani/remove-api-call-under-lock
controller/nodelifecycle: Refactor to not make API calls under lock
2022-07-25 06:20:34 -07:00
Madhav Jivrajani
3c0bc26d90 controller/nodelifecycle: Refactor to not make API calls under lock
The evictorLock only protects zonePodEvictor and zoneNoExecuteTainter.
processTaintBaseEviction showed indications of increased lock contention
among goroutines (see issue 110341 for more details).

The refactor done is to ensure that all codepaths in that function that
hold the evictorLock AND make API calls under the lock, are now making
API calls outside the lock and the lock is held only for accessing either
zonePodEvictor or zoneNoExecuteTainter or both.

Two other places where the refactor was done is the doEvictionPass and
doNoExecuteTaintingPass functions which make multiple API calls under
the evictorLock.

Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>
2022-07-25 15:16:26 +05:30
Michal Wozniak
4ec8cf08da This PR refactors taint_manager to eliminate the getPod and getNode stubs. 2022-07-14 18:00:44 +02:00
Wojciech Tyczyński
006ff4510b Clean shutdown of nodecontroller integration tests 2022-06-06 20:33:20 +02:00
wangyamei
3808e193d9 fix typo for nodelifecycle controller 2022-02-28 14:30:28 +08:00
Patrick Ohly
9eaa2dc554 avoid klog Info calls without verbosity
In the following code pattern, the log message will get logged with v=0 in JSON
output although conceptually it has a higher verbosity:

   if klog.V(5).Enabled() {
       klog.Info("hello world")
   }

Having the actual verbosity in the JSON output is relevant, for example for
filtering out only the important info messages. The solution is to use
klog.V(5).Info or something similar.

Whether the outer if is necessary at all depends on how complex the parameters
are. The return value of klog.V can be captured in a variable and be used
multiple times to avoid the overhead for that function call and to avoid
repeating the verbosity level.
2022-01-12 07:48:36 +01:00
cyclinder
b88b51c6e5 adding evictions_total metric and marking evictions_number deprecated
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-12-13 10:36:02 +08:00
Neha Lohia
fa1b6765d5
move pkg/util/node to component-helpers/node/util (#105347)
Signed-off-by: Neha Lohia <nehapithadiya444@gmail.com>
2021-11-12 07:52:27 -08:00
Mike Dame
4960d0976a Wire contexts to Core controllers 2021-11-01 10:29:00 -04:00
Author cyclinder
e61b901628 remove nodeLease feature GA
Signed-off-by: cyclinder <qifeng.guo@daocloud.io>
2021-10-05 12:23:27 +08:00
Mike Dame
af045087d9 Move GetZoneKey function to component-helpers 2021-03-04 10:32:38 -05:00
Michael Beaumont
a5a6762d33
Move pkg/kubelet/apis to k8s.io/kubelet/pkg/apis 2021-02-09 21:37:39 +01:00
Kubernetes Prow Robot
1209c59612
Merge pull request #96876 from howieyuen/no-execute-taint-missing
fix nodelifecyle controller not add NoExecute taint bug
2021-01-13 14:17:03 -08:00
pacoxu
441985afb6 set LegacyNodeRoleBehavior to false and mv ServiceNodeExclusion to GA
Signed-off-by: pacoxu <paco.xu@daocloud.io>
2021-01-05 22:34:18 +08:00
Hao Yuan
5569db4902 fix nodelifecyle controller not add NoExecute taint bug 2020-12-14 09:34:57 +08:00
Kobayashi Daisuke
4ae11dac2e Replace StartLogging(klog.Infof) with StartStructuredLogging(0) 2020-06-15 17:48:35 +09:00
Davanum Srinivas
442a69c3bd
switch over k/k to use klog v2
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2020-05-16 07:54:27 -04:00
wawa0210
54c0f8b677
Remove the 'beta' version of the node label (os and arch types) 2020-05-13 22:51:52 +08:00
Kubernetes Prow Robot
85ee5fdd90
Merge pull request #87743 from u2takey/master
log pod event when node not ready
2020-04-21 17:25:52 -07:00