Prime feature only
This feature is only available with a Prime subscription. See plans or contact sales.

Configuring kubriX for High Availability

This document explains how to configure kubriX for high availability (HA).
It outlines which Helm chart values must be adjusted and identifies components that are not designed for high availability.

High Availability vs. Restartability

High Availability (HA) ensures that your service continues to operate during common failure scenarios such as:

  • Node drains or crashes
  • Rolling updates
  • Availability Zone (AZ) outages

With only a single replica, there will always be downtime when that pod is unavailable - for example, during rescheduling, image pulling, initialization, or when the underlying node fails.

However, depending on your service level agreements (SLAs), a single replica might still be sufficient for some components, especially if:

  • The service is not required to be continuously available, but only when users actively access it.
  • The component performs background or asynchronous processing, where temporary downtime does not impact the user experience.

Taking these considerations into account, the following sections describe the recommended configuration for a highly available kubriX control plane and kubriX data plane.

Three or two replicas

There is an excellent blog article explaining why three replicas are better than two: https://sookocheff.com/post/kubernetes/why-three-replicas-are-better-than-two/
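As a generic illustration of the idea (this is not a kubriX default; "example-app" and its labels are placeholders), a highly available workload typically combines three replicas, zone spreading, and a PodDisruptionBudget so that one voluntary disruption plus one unexpected failure still leaves a running pod:

# Hypothetical example, not part of kubriX: three replicas spread across zones plus a PDB.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                      # three replicas tolerate a drain and a node failure at the same time
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-app
      containers:
        - name: app
          image: nginx:1.27        # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  minAvailable: 2                  # with only two replicas, a drain plus a node failure can drop you to zero
  selector:
    matchLabels:
      app: example-app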

Observability

Grafana

Grafana supports scaling, as long as you use an external database and configure alerting to use the unified_alerting feature.

We use CNPG (CloudNativePG) to create the external database. This assumes that the required secrets exist in Vault so that External Secrets can fetch them.


To summarize, this is a valid configuration for high availability:

grafana:
  replicas: 2

  # headless service for https://github.com/grafana/helm-charts/tree/main/charts/grafana#high-availability-for-unified-alerting
  headlessService: true

  grafana.ini:
    unified_alerting:
      enabled: true
      ha_peers: "{{ .Release.Name }}-headless:9094"
      ha_listen_address: ${POD_IP}:9094
      ha_advertise_address: ${POD_IP}:9094
      rule_version_record_limit: "5"
    alerting:
      enabled: false

  # use shared database for persistence instead of a volume
  persistence:
    enabled: false

  env:
    GF_DATABASE_TYPE: postgres

  # Tell the chart to load env vars from our Secret for grafana db
  # see https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#database
  # this will be created by an external-secret and contains:
  # GF_DATABASE_PASSWORD
  # GF_DATABASE_HOST
  # GF_DATABASE_NAME
  # GF_DATABASE_USER
  envFromSecrets:
    - name: grafana-env-secret
      optional: true
    - name: grafana-db
      optional: true

# create shared postgresql db
cluster:
  type: postgresql
  mode: standalone
  version:
    postgresql: "16"
  cluster:
    instances: 3
    monitoring:
      enabled: true
    superuserSecret: cnpg-superuser-secret
    initdb:
      database: grafana
      secret:
        name: cnpg-grafana-secret
    roles:
      - name: grafana
        ensure: present
        comment: grafana-admin-user
        login: true
        inherit: true
        superuser: true
        createdb: true
        passwordSecret:
          name: cnpg-grafana-secret
    annotations:
      argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true
      argocd.argoproj.io/sync-wave: "-1"
  backups:
    enabled: false
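For reference, the grafana-db Secret consumed via envFromSecrets above could be provided by an ExternalSecret along these lines. This is only a sketch: the secret store name "vault-backend" and the Vault key "kubrix/grafana-db" are assumptions and have to match your own Vault layout.

# Hypothetical sketch of an ExternalSecret producing the grafana-db Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # assumption: your (Cluster)SecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: grafana-db             # consumed by the grafana chart via envFromSecrets
  dataFrom:
    - extract:
        key: kubrix/grafana-db   # assumption: Vault entry holding the GF_DATABASE_* keys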

K8s-monitoring

To summarize, this is a valid high availability configuration:

k8s-monitoring:
  alloy-operator:
    replicaCount: 2

  alloy-metrics:
    controller:
      replicas: 2

  clusterMetrics:
    kube-state-metrics:
      discoveryType: service
      replicas: 2

Components not designed for multiple replicas

Loki

With Loki's simple-scalable deployment mode you can easily scale out all Loki components to be highly available.

loki:
  commonConfig:
    replication_factor: 3 # needs to be the same number as ingesters to write to and read from

  backend:
    replicas: 3
  read:
    replicas: 3
  write:
    replicas: 3

  gateway:
    replicas: 2

  resultsCache:
    replicas: 2

Mimir

  • nginx supports scaling, see example

  • distributor supports scaling; it is completely stateless. See example and documentation

  • query-frontend supports scaling. When the query-scheduler is not used, scalability is limited by the configured number of workers per querier; in our configuration the query-scheduler is active. See also the official documentation and example.

  • ruler supports scaling, see example

  • compactor should be able to scale up according to the documentation, but it is not scaled up in the example either, so we leave it at a single replica for now

  • querier supports scaling and has two replicas by default, see values

  • query_scheduler supports scaling and has two replicas by default, see values

  • ingester and store-gateway use zone-aware replication by default, see documentation

Components not designed for multiple replicas

  • Overrides-exporter: don't scale the overrides-exporter! The metrics emitted by the overrides-exporter have high cardinality. It is recommended to run only a single replica of the overrides-exporter to limit that cardinality. See documentation

  • rollout-operator is fixed to 1 replica because scaling it does not make any sense. See deployment spec

  • Alertmanager: scaling makes no sense with just one tenant, because replication is implemented via tenant sharding. See documentation

To summarize, this is a valid high availability configuration in addition to the default values:

nginx:
  replicas: 2

distributor:
  replicas: 2

query_frontend:
  replicas: 2

ruler:
  replicas: 2

Tempo

Currently there is no HA setup defined for Tempo. We are continuously working to extend our HA setup documentation.

Delivery

Kargo

Kargo supports scaling most of its components out of the box. You just need to include the Kargo values-ha-enabled-prime.yaml.

However, the controller and management-controller are hard-coded to replicas: 1. You need to switch to a Distributed Architecture to achieve overall scalability.

Crossplane

It is possible to run multiple replicas of the Crossplane core pods and RBAC manager pods, as long as leader election is turned on (it is turned on by default). Details in the Crossplane documentation.
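A minimal values sketch for this, assuming the key names of the upstream Crossplane Helm chart (verify them against the chart version you deploy):

# Sketch based on the upstream Crossplane chart defaults; key names are assumptions.
crossplane:
  replicas: 2
  leaderElection: true        # default; must stay enabled when running more than one replica
  rbacManager:
    replicas: 2
    leaderElection: true      # default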

Unfortunately PodDisruptionBudgets are not implemented yet, and currently not planned.

It is also possible to scale out Crossplane providers (out of the box we integrate Keycloak, Grafana and Vault), as long as leader election is implemented in the provider and enabled! If leader election is not implemented in the provider or not enabled, the provider pod will consume 100% CPU!

While Crossplane architects emphasize that additional replicas are not really needed and that the leader election often takes more time than restarting the pods, there are definitely also arguments for implementing HA with leader election: https://github.com/gofogo/k8s-sigs-external-dns-fork/blob/4a039d1edc2cb2b29ffd48d137ec2d53bda4e0ae/docs/proposal/001-leader-election#use-cases

So decide carefully whether multiple replicas of Crossplane and the Crossplane providers really make sense in your environment.

KubeVirt

The virt-operator runs with 2 replicas by default, with leader election. (see https://kubevirt.io/monitoring/runbooks/NoLeadingVirtOperator.html)

The KubeVirt CR also defaults to two replicas, which means it creates 2 instances of the virt-controller. The virt-api deployment is scaled based on the number of available nodes.
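If you want to pin the infra replica count explicitly, the KubeVirt CR exposes a replica setting for the infra components. The field shown here is an assumption to verify against the KubeVirt API reference for your version:

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  # assumption: spec.infra.replicas controls the replica count of the infra components
  # (virt-controller, virt-api); check the KubeVirt API reference for your version
  infra:
    replicas: 2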

CDI (Containerized-Data-Importer)

Technically it is possible to scale the CDI resources via the CDI CustomResource properties uploadProxyReplicas, apiServerReplicas and deploymentReplicas. However, we currently do not see the benefit of having multiple replicas. The cdi-operator itself is shipped with one replica out of the box by the upstream project.

See https://github.com/kubevirt/containerized-data-importer/issues/2560

KubeVirt-Manager

KubeVirt-Manager ships out of the box with a hard-coded replica count of one. Unfortunately, there is currently no evidence that it can be scaled out.

Security

External-Secrets

The External-Secrets HA configuration is implemented via leader election (the same as Crossplane). That means only one replica does the work; the others are hot standbys.

The webhook can be scaled out without leader election.

The cert-controller also has a leader-election feature flag, but there is an open issue because this flag cannot be enabled through an explicit attribute in the official Helm chart. However, it can be set via extraArgs.

Attention: if you need an active-active setup, it is still possible, but things are going to get complicated really fast. You will need to set up controller classes and make sure each secret store gets a controller class assigned to it in a round-robin manner with a webhook. And if you do it, bear in mind that any misconfiguration will cause external-secrets as a whole to stop operating.
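For the simple hot-standby setup described above, a minimal values sketch could look like the following. The cert-controller flag name is an assumption derived from the leader-election discussion; verify it against your External Secrets version before using it:

external-secrets:
  replicaCount: 2            # leader election is built in; the extra replica is a hot standby
  webhook:
    replicaCount: 2          # scales without leader election
  certController:
    replicaCount: 2
    extraArgs:
      # assumption: name of the cert-controller leader-election feature flag,
      # since the chart does not expose an explicit attribute for it yet
      enable-leader-election: true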

Kyverno

All components of Kyverno can be scaled out. However, some of them, such as the reports-controller and background-controller, are stateful and implement leader election (as in External Secrets or Crossplane). In the admission-controller and cleanup-controller, only certain functionality, such as certificate and webhook management, uses leader election, while the rest runs on all replicas.

Additional Docs: https://kyverno.io/docs/high-availability/
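As a rough sketch, the replica counts in the Kyverno Helm chart could be raised like this. The key names are assumed from the upstream kyverno chart, and the admission-controller is typically run with three replicas for HA:

kyverno:
  admissionController:
    replicas: 3        # Kyverno recommends 3 replicas for an HA admission controller
  backgroundController:
    replicas: 2        # leader election: one active replica, the other on standby
  cleanupController:
    replicas: 2
  reportsController:
    replicas: 2        # leader election: one active replica, the other on standby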

Velero

Currently it is not supported to run Velero with multiple replicas. For File System Backup, a node agent (DaemonSet) is deployed. For the etcd backup there is a Deployment with a hard-coded single replica.

There is an open issue where an HA requirement is being discussed, but it is not implemented yet.

General

Ingress-Nginx

Ingress-Nginx Controller supports scaling out without any restrictions. See values-ha-enabled-prime.yaml.
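A minimal sketch of such a configuration, using the upstream ingress-nginx chart keys (the zone spreading part is optional and an assumption about your topology):

ingress-nginx:
  controller:
    replicaCount: 2
    # optional: spread the controller pods across availability zones
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
            app.kubernetes.io/component: controller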

External-DNS

In the community there are currently some concerns about and warnings against running multiple replicas of external-dns, so it is hard-coded to 1 replica. Therefore, we do not suggest running multiple replicas. Since registering DNS entries is an asynchronous process anyway, it shouldn't hurt that the external-dns deployment is not HA.

Since there are also good arguments for enabling multiple replicas, we will keep an eye on the open discussion and support an external-dns HA configuration as soon as it is available in the upstream project.

CNPG

The CloudNativePG controller can be scaled out, but only one instance does the work (leader election).
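A minimal sketch, assuming the replica count value of the upstream cloudnative-pg operator chart (verify the key name for your chart version):

cloudnative-pg:
  replicaCount: 2    # assumption: upstream chart key; only one replica is the active leader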
