Version: next
Prime feature only
This feature is only available with a Prime subscription. See plans or contact sales.

Configuring kubriX for High Availability

This document explains how to configure kubriX for high availability (HA).
It outlines how to enable HA and identifies components that are not designed for high availability.

High Availability vs. Restartability

High Availability (HA) ensures that your service continues to operate during common failure scenarios such as:

  • Node drains or crashes
  • Rolling updates
  • Availability Zone (AZ) outages

With only a single replica, there will always be downtime when that pod is unavailable - for example, during rescheduling, image pulling, initialization, or when the underlying node fails.

However, depending on your service level agreements (SLAs), a single replica might still be sufficient for some components, especially if:

  • The service is not required to be continuously available, but only when users actively access it.
  • The component performs background or asynchronous processing, where temporary downtime does not impact the user experience.

Taking these considerations into account, the following sections describe the HA configuration for the different kubriX capabilities.

TopologySpreadConstraints and PodDisruptionBudgets

Where possible, the HA setup also adds TopologySpreadConstraints with topologyKey: kubernetes.io/hostname, so that pods are spread across nodes and a single node failure cannot take down all replicas at once. In addition, we add PodDisruptionBudgets so that a minimum number of pods stays available during voluntary disruptions such as node drains.

If you need zone-aware topologySpreadConstraints, please contact kubriX support.
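To illustrate how these two mechanisms fit together, here is a minimal sketch of a Deployment with a node-level TopologySpreadConstraint plus a matching PodDisruptionBudget. The deployment name, labels, and image are placeholders, not actual kubriX resources:

```yaml
# Hypothetical example workload; names and labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      topologySpreadConstraints:
        # Spread replicas across nodes so that one node failure
        # cannot take out all pods of this workload.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: example-app
      containers:
        - name: app
          image: example/app:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  # Keep at least one pod running during voluntary disruptions
  # such as node drains during cluster upgrades.
  minAvailable: 1
  selector:
    matchLabels:
      app: example-app
```

With two replicas, `minAvailable: 1` means a node drain waits until the replacement pod is running elsewhere before evicting the old one.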

Three or two replicas

There is an excellent blog article explaining why three replicas are better than two replicas: https://sookocheff.com/post/kubernetes/why-three-replicas-are-better-than-two/

Still, out of the box we ship values with 2 replicas for most components to keep the footprint small. See the referenced values files for details.

HA configuration settings

The following tables show which components can be run highly available. Either extend the default.valueFiles list in the target-chart with values-ha-enabled-prime.yaml, so that every component gets installed in HA mode, or extend the valueFiles list of individual entries in the applications list (applications.[].valueFiles).
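As a sketch of the two options (the surrounding structure is illustrative; check your actual target-chart values file for the exact layout):

```yaml
# Option 1: enable HA for every component via the defaults
default:
  valueFiles:
    - values.yaml
    - values-ha-enabled-prime.yaml

# Option 2: enable HA only for selected applications
applications:
  - name: grafana
    valueFiles:
      - values.yaml
      - values-ha-enabled-prime.yaml
```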

Extend kubriX HA settings

If you want to extend or change the kubriX HA settings, do not modify the provided values-ha-enabled-prime.yaml. Instead, configure them in your overlaid customer-specific values files (e.g. values-customer.yaml).
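For example, to raise a component from the default 2 replicas to 3, a customer overlay could look like the sketch below. The exact key paths vary per component chart; these keys are generic placeholders, not guaranteed kubriX values:

```yaml
# values-customer.yaml (sketch): override HA defaults here instead of
# editing values-ha-enabled-prime.yaml, so upgrades stay conflict-free.
replicaCount: 3
podDisruptionBudget:
  minAvailable: 2
```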

Observe

| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Grafana | Grafana | Active-Active | switch to external DB needed (done implicitly via ha-enabled.yaml) | HA alerting github docs, HA alerting grafana docs, Grafana HA docs |
| k8s-monitoring | alloy-operator | Active-Passive (Leader-Election) | - | Helm Docs |
| k8s-monitoring | alloy-metrics | Active-Active | - | Grafana Docs |
| k8s-monitoring | kube-state-metrics | Active-Active | as long as you set discoveryType: service (implicitly set via ha-enabled) | Helm Docs |
| k8s-monitoring | alloy-singleton | not supported | otherwise ClusterEvents would be retrieved multiple times | - |
| Loki | every component | Active-Active | loki.replication_factor needs to match the number of write replicas | Grafana Docs |
| Mimir | every component with exceptions (see comments) | Active-Active | override-exporter, rollout-operator and alertmanager don't support scale-out; compactor does, but is not scaled by default | - |
| Tempo | - | not scaled yet | - | - |
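The Loki constraint from the table can be sketched as follows. The exact key paths depend on your Loki Helm chart version; `loki.commonConfig.replication_factor` is where the upstream chart currently places the setting, so verify against your chart before applying:

```yaml
# Keep the replication factor in sync with the number of write replicas.
loki:
  commonConfig:
    replication_factor: 3
write:
  replicas: 3
```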

Deliver

| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| ArgoCD | redis | Active-Active | switched to redis-ha | - |
| ArgoCD | server | Active-Active | - | - |
| ArgoCD | repo-server | Active-Active | - | - |
| ArgoCD | dex-server | not supported | not needed when Keycloak OIDC is used | - |
| ArgoCD | application-controller | not supported | no HA possible, but sharding per cluster can be implemented for scalability reasons if required | - |
| ArgoCD | notifications-controller | not supported | - | - |
| ArgoCD | applicationset-controller | not needed | no real benefit, because ApplicationSet creation is asynchronous anyway | - |
| Argo-Rollouts | every component | Active-Active | already the upstream default | - |
| Kargo | every component with exceptions (see comments) | Active-Active | controller and manager-controller are singletons. Switching to distributed configurations to achieve overall scalability | - |
| Crossplane | every component (incl. providers) | Active-Passive (Leader-Election) | currently no PodDisruptionBudget implemented. Warning for other Crossplane providers: if leader election is not implemented or not enabled in the provider, the pod will consume 100% CPU! | Crossplane docs |
| KubeVirt | virt-operator | Active-Passive | - | - |
| KubeVirt | KubeVirt | Active-Active | - | - |
| KubeVirt | KubeVirt-Manager | not supported | - | original KubeVirt Deployment spec |
| KubeVirt | CDI | not needed | Technically it is possible to scale the CDI resources via the CDI CustomResource properties uploadProxyReplicas, apiServerReplicas and deploymentReplicas. However, we currently see no benefit in multiple replicas. The cdi-operator itself ships with one replica out of the box from the original project. | - |

Secure

| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Keycloak | keycloak-operator | not supported | one replica is hard-coded in the provided kubernetes.yaml | - |
| Keycloak | kubrix-keycloak | Active-Active | - | - |
| Vault | every component | Active-Active | - | - |
| External-Secrets | every component | Active-Passive (Leader-Election) | if you really need an active-active setup, it is still possible, but things get complicated really fast | - |
| Kyverno | every component | Active-Active, some Active-Passive | All Kyverno components can be scaled out. However, some of them, such as reports-controller and background-controller, are stateful and implement leader election (as in External-Secrets or Crossplane), and in admission-controller and cleanup-controller only parts of the functionality, such as certificate and webhook management, use leader election. | Kyverno Docs |
| Velero | - | not supported | - | - |
| cert-manager | controller and cainjector | Active-Passive (Leader-Election) | - | Certmanager HA Docs |
| cert-manager | webhook | Active-Active | - | Certmanager HA Docs |

Enable

| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Backstage | every component | Active-Active | switch to external DB needed (done implicitly via ha-enabled.yaml) | - |

General

| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Ingress-Nginx | every component | Active-Active | - | - |
| External-DNS | - | not supported | there are currently concerns and warnings about running multiple replicas of external-dns, so it is hard-coded to 1 replica | - |
| CNPG | every component | Active-Passive (Leader-Election) | - | - |