
When 2 CPUs of “Nothing” Turned Into a Deep Mimir Lesson

· 5 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

Debugging Store-Gateway CPU Spikes, GC Thrashing, and a Hidden Memory Limit


The symptom

Our Grafana Mimir store-gateway pods suddenly jumped from ~0.2 CPU to nearly 2 full cores each.

No traffic spike. No deployment. No restarts. No errors.

Just CPU.

This is exactly the kind of issue that sends engineers down rabbit holes — because nothing obvious is wrong.



Initial assumptions (all wrong)

When CPU spikes without traffic, the usual suspects are:

  • query surge
  • compactor backlog
  • missing sparse index headers
  • object storage latency
  • throttling or node pressure

All plausible.

None correct.


The turning point: profiling instead of guessing

Instead of chasing hypotheses, we captured a CPU profile directly from a running store-gateway.

Mimir exposes Go’s built-in profiling endpoint, so you can sample real CPU usage without restarting anything.

We ran:

go tool pprof -top store-gateway.cpu.pprof

The result:

runtime.gcBgMarkWorker → ~95%

That means:

The CPU wasn’t busy doing useful work. It was almost entirely doing garbage collection.

At that moment the problem category changed completely.

This was not a load issue. This was a memory behavior issue.


Confirming with a heap profile

Next step: inspect memory.

Heap profile result:

  • ~662 MB live heap
  • ~83% used by index cache structures

This told us two important things:

  1. Memory usage was expected, not a leak.
  2. The cache was working normally.

So why was GC running constantly if memory usage was healthy?


The hidden culprit: GOMEMLIMIT

The answer wasn’t in Mimir code. It was in configuration.

The Helm chart automatically sets:

GOMEMLIMIT = memory_request

Our store-gateway configuration:

resources:
  requests:
    memory: 512Mi

So Go’s runtime believed:

“I must keep heap usage under 512 MiB.”

But the real working set needed ~660 MB.

That creates a classic GC thrash loop:

heap grows → exceeds limit → GC runs aggressively → CPU spikes → repeat

Nothing was broken. The runtime was behaving exactly as instructed.


Why Kubernetes made this subtle

We hadn’t set memory limits — only requests.

So Kubernetes would happily allow the container to use more than 512Mi.

But Go didn’t know that.

To Go, GOMEMLIMIT is the limit, regardless of Kubernetes policy.

This created a hidden mismatch:

Layer        Believed limit
Go runtime   512Mi
Kubernetes   unlimited

This kind of cross-layer interaction is where many real production problems live.


The fix

Increase the memory request.

We changed:

memory: 512Mi

to:

memory: 2Gi

That automatically raised:

GOMEMLIMIT ≈ 2Gi

Result:

  • GC frequency dropped
  • CPU dropped immediately
  • system stabilized

No code changes. No scaling. No tuning.

Just correct sizing.


Why this happens specifically in store-gateway

Store-gateway is intentionally memory heavy.

It caches:

  • index entries
  • postings lists
  • series metadata

These caches reduce latency and object-store reads.

So high memory usage is expected and desirable.

Trying to force it into a tiny memory footprint simply shifts cost to CPU (via GC).


How to capture a CPU profile from Mimir store-gateway

This is safe to do in production.

1) Port-forward to a pod

kubectl -n mimir port-forward pod/mimir-store-gateway 8080:8080

(your pod names may slightly be different)

2) Download a profile

curl -o cpu.pprof \
http://localhost:8080/debug/pprof/profile?seconds=30

3) Analyze locally

go tool pprof -top cpu.pprof

Most useful commands inside pprof:

Command     Purpose
top         hottest functions
top -cum    cumulative cost
list func   inspect code path

Flame graph view:

go tool pprof -http=:0 cpu.pprof

How to capture heap profile

curl -o heap.pprof \
http://localhost:8080/debug/pprof/heap

Analyze:

go tool pprof -top -inuse_space heap.pprof

Useful modes:

Mode            Meaning
inuse_space     live memory
alloc_space     allocation churn
alloc_objects   allocation rate

Reading profiles correctly

Common CPU profile signatures:

Pattern                  Interpretation
runtime.gc* dominates    GC thrashing
syscall dominates        IO bound
crypto/tls dominates     TLS overhead
app code dominates       real workload

Profiles remove guesswork.


Key lessons

1. CPU problems are often memory problems

If GC dominates CPU, look at heap sizing first.


2. Requests matter more than limits for Go apps

When GOMEMLIMIT is tied to requests, the request effectively becomes the runtime memory ceiling.


3. High memory usage isn’t bad

Caches are supposed to use memory. Starving them just moves cost elsewhere.


4. Profiling > dashboards

Metrics tell you that something is wrong. Profiles tell you what is wrong.


5. Most production mysteries aren’t bugs

They’re interactions between layers:

  • runtime behavior
  • container scheduling
  • Helm defaults
  • caching logic

Understanding those interactions is what distinguishes platform engineers from operators.


Final takeaway

Nothing was broken.

The system behaved exactly as configured.

We just didn’t realize how those configurations interacted.

That’s the real lesson:

Production performance issues are often not failures — they’re misunderstandings.

And the fastest way to resolve them is:

Profile first. Tune second.

kubriX 6.0.0 — Our Christmas Present to Platform Engineers

· 2 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

Christmas is the time for good food, time with family, and — if you’re a platform engineer — finally having a moment to breathe while everything just works.

This year, we’re wrapping up something special under the kubriX tree: kubriX 6.0.0, a release focused on high availability, resilience, and platform maturity.

No shiny toy features for one demo.
This is the kind of present you still appreciate in February — when clusters upgrade, nodes disappear, and your platform keeps running.

🎁 What’s in the Box?

🚀 High Availability — Built In, Not Bolted On

kubriX 6.0.0 makes high availability a first-class citizen across the platform.

  • PodDisruptionBudgets to survive node drains and upgrades

  • topologySpreadConstraints to distribute workloads across failure domains

  • Dedicated HA values and topology-aware defaults

🔐 GitOps That Behaves While You’re on Holiday

Argo CD got some serious love in this release:

  • Refactored dashboards for better operational clarity

  • Safer rolling updates in HA mode

  • Clearer permission models for platform teams and application teams

  • Admin-only terminal access for controlled troubleshooting

This means GitOps workflows that stay predictable and boring — which is exactly what you want when you’re not at your desk.

🧠 Cleaner Configuration, Less Mental Overhead

One of the biggest internal gifts in 6.0.0 is the new multi-layer values structure:

  • Clear separation of defaults, environment values, and overrides

  • kubrix-default values to stay DRY and reduce unintended diffs

  • Bootstrap and installer aligned to the same structure

This makes kubriX much easier to operate across multiple cloud providers, clusters, stages, and teams — even after a long year.

🎅 Why kubriX 6.0.0 Matters

kubriX 6.0.0 is not about flashy features — it’s about sleeping better:

  • High availability by default

  • Predictable behavior during upgrades

  • Cleaner configuration at scale

  • Safer GitOps workflows

  • A platform that doesn’t need babysitting

With kubriX 6.0.0, setting up and running an internal developer platform is simpler, more secure, and more scalable than ever.

🎄 Unwrap kubriX 6.0.0

Already a kubriX Prime customer? kubriX 6.0.0 is available automatically via your Git update channel — no action needed.

New to kubriX? Let’s talk about how to build a resilient internal developer platform that doesn’t ruin your holidays.

kubriX 6.0.0 — our Christmas present to platform engineers everywhere. 🎁🚀

Introducing KubriX 5.0 - Scalable, Flexible and Team-Centric Platform Engineering

· 4 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

We’re excited to announce kubriX 5.0.0 — a release focused on simplicity, resilience, and better day-2 operations. From a brand-new installer to streamlined observability and smarter pipelines, kubriX 5.0.0 is built to give platform engineers and developers a stronger foundation with less friction.

Here’s what’s new.

What’s New in kubriX 5.0?

🚀 A Brand-New Installer

Installing kubriX just got much easier. Instead of running local scripts on your workstation, you now simply run one kubectl apply command, and the kubriX-installer takes care of the rest — directly inside your Kubernetes cluster.

  • No more fragile workstation dependencies.

  • More stable, reproducible installations.

  • Faster setup for demos, PoCs, or production clusters.

⚡ Smarter Bootstrapping

Bootstrapping kubriX (or kubriX-prime) in your GitOps repo is now part of the installer:

  • Set KUBRIX_BOOTSTRAP=true and provide your DNS provider, domain, and Git repo — and you’re ready in minutes.

  • Out-of-the-box support for AWS, Cloudflare, STACKIT, and IONOS (plus any provider supported by external-dns).

  • Works seamlessly on local Kind clusters for quickstarts or on your real production cluster.

🔒 Stronger Defaults & Security

  • All platform services now dynamically generate admin usernames and passwords — even if you don’t customize them, no one can guess the defaults.

  • Secrets handling and password rotation are documented with clear guides.

  • Default Velero backup schedules for critical Kubernetes resources are included, so you always have a safety net.

📊 Better Observability & Alerting

  • False-positive/negative alerts reduced, and alerting secrets are no longer mandatory for teams that don’t need them.

  • Matrix chat integration added as an Alertmanager receiver — a decentralized, open-source alternative to Slack.

  • Mimir cardinality dashboard integrated, so you can track metric series growth and find bottlenecks.

  • Loki topology switch: from SingleBinary to SimpleScalable, making logs more performant for both small and very large clusters.

🔧 Dependency & Service Improvements

  • Switched from the Bitnami Keycloak chart to the official Keycloak Operator - future-proof and open-source-friendly.

  • Smarter dependency detection shows which helm chart dependencies and container images would really be used.

  • Key platform updates:

    • Kargo 1.7 (with a new openPr flag for approval workflows in pipelines)

    • k8s-monitoring v3 (many improvements, new features, stability)

    • external-secrets v0.17

    • Argo CD v3.1.6

  • Nearly every other platform service has been refreshed for stability and security.

🧪 Testing & Reliability

  • Early integration of Testkube for end-to-end platform testing. We already use it internally, and plan to expand it so you can validate custom platform behaviors in every release.

  • Installer hardened with countless under-the-hood improvements to make it battle-ready for real-world environments.

📚 Documentation Upgrades

  • Full restructure of the documentation, with more details on GRC (Governance, Risk, and Compliance).

  • Expanded guides for secrets management, password rotation, and cluster operations.

  • A stronger knowledge base for platform engineers and developers, updated continuously.

Why kubriX 5.0.0 Matters

  • Faster, safer installations — the installer is cluster-native and resilient.

  • Stronger defaults — no weak passwords, built-in backups, and improved alerting.

  • More observability — dashboards, alerts, and logging at scale are just there.

  • Future-proof upgrades — by aligning with official operators and the latest upstreams.

  • Better knowledge base — documentation that empowers platform engineers, not slows them down.

With kubriX 5.0.0, setting up and running an internal developer platform is simpler, more secure, and more scalable than ever.

Get Started with KubriX 5.0

  • Already a KubriX Prime customer? You’re getting KubriX 5.0 automatically via your Git update channel — no action needed.

  • New to KubriX? Schedule a demo to see how we can accelerate your platform engineering journey.

  • Like what we’re building? ⭐ us on GitHub!

KubriX 5.0 is here — let’s build the next generation of internal platforms together.

Introducing KubriX 4.0 - Scalable, Flexible and Team-Centric Platform Engineering

· 3 min read
Philipp Achmueller
kubriX Dev, platform enthusiast

We’re thrilled to announce the release of KubriX 4.0 — our most flexible and team‑centric version yet!

This upgrade delivers major component refreshes, native vcluster integration, and fine‑grained controls that give platform and application teams more autonomy without sacrificing security.

What’s New in KubriX 4.0?

Next-Gen Core Components

KubriX 4.0 brings the heart of the platform to the latest majors:

  • Argo CD 3.0: the new UI, a faster diff engine, and improved sharding; we also implement tighter RBAC
  • Grafana 12: a major visual refresh plus query caching for lightning-fast dashboards
  • Kyverno 1.14: policy exceptions and generate controls for air-tight supply-chain guardrails
  • Backstage 1.38.1: faster catalog sync, tighter permissions, and dynamic scaffolder secrets in Vault

Keeping these giants current means less manual patching and an instant security win.

vCluster Integration & Team Self-Service (Prime)

Need ephemeral clusters for tests, proofs-of-concept, or customer demos? With the new vcluster template you can spin up fully isolated, cost-efficient virtual clusters inside any host cluster in minutes, complete with KubriX guardrails out of the box. Team members get admin rights inside their shared vcluster, while platform engineers keep global policy control.

Smarter Hub & Spoke Onboarding (Prime)

Large organisations rarely have a single prod cluster. The new destinationClusters list inside the onboarding workflow lets you declare which team may deploy to which physical or virtual cluster. No more mis-deployments or ticket ping-pong: governance and autonomy in a single YAML stanza.

Quality-of-Life Enhancements

  • ignoreDifferences everywhere – fewer false “Out-of-Sync”s after Argo CD 3.0.
  • Auto-bootstrap of KubriX core into fresh customer repos.
  • Namespace label/annotation presets in the onboarding template for better policy targeting.

Granular Permissions Separation

Building on the last version's RBAC overhaul, KubriX 4.0 provides sub-team-level scopes across Argo CD, Vault, Backstage, Kargo and Grafana. You can now:

  • Restrict dashboard editing while still allowing query exploration.
  • Delegate environment-specific Argo CD sync privileges to release engineers.
  • Separate catalog write access from Backstage entity ownership.

Breaking changes you must review

  • Argo CD 2.14 → 3.0: check for removed RBAC verbs and new diff options.
  • Grafana 11 → 12: legacy dashboard JSON v1 IDs are no longer accepted.
  • External-Secrets v0.16+: v1alpha1 resources are now unsupported; migrate or prune, see github.com. (The next release will bring another change requirement for external-secrets; we will inform you accordingly.)

Upgrade guides for each component are linked in the release notes — read them before hitting helm upgrade.

Why This Release Matters

  • Stay Ahead of Upstream – Ship on the latest Argo CD, Grafana, Kyverno, Kubevirt & Backstage without spending weeks on migration/testing.
  • Accelerate Team Autonomy – vcluster and destinationClusters unlock safe self‑service while keeping guard‑rails intact.
  • Security by Default – Updated dependencies, tighter policies, and CVE tracking reduce risk across the board.
  • Future‑Proof – 4.0 lays the groundwork for upcoming multi‑cluster rollout orchestration and delivery enhancements

Get Started with KubriX 4.0

  • Already a KubriX Prime customer? You’re getting KubriX 4.0 automatically via your Git update channel — no action needed.

  • New to KubriX? Schedule a demo to see how we can accelerate your platform engineering journey.

  • Like what we’re building? ⭐ us on GitHub!

KubriX 4.0 — Your internal developer platform for faster, smarter, and more secure application delivery.

Introducing KubriX 3.0 - Smarter, Safer, and More Secure Platform Engineering

· 3 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

We’re thrilled to announce the release of KubriX 3.0 — our most powerful and enterprise-ready version yet!

This release brings granular RBAC with OIDC integration, automated alerting for Kubernetes issues, and a host of internal security upgrades to streamline platform operations at scale.

What’s New in KubriX 3.0?

Enterprise-Grade Team Isolation with OIDC & RBAC

With KubriX 3.0, we’ve introduced centralized identity and access management across all major platform services. Teams and team members are now onboarded with roles like admin, editor, and viewer, giving them access only to what they need — nothing more.

Team-scoped access now works seamlessly across:

  • Backstage
  • ArgoCD
  • Kargo
  • Grafana
  • Vault
  • MinIO

This helps enforce least-privilege access and keeps environments clean, focused, and secure — whether you're part of a delivery team or the platform team.

Automatic Alerting: Be Informed When It Matters

Stop staring at dashboards. KubriX 3.0 brings integrated Grafana Managed Alerts so teams get notified automatically when common Kubernetes issues occur — from misconfigured workloads to resource bottlenecks.

Each team can customize how and where they want to receive alerts — email, Slack, or any alerting backend. Stay ahead of issues before they impact users.

Improved Supply Chain Security

We’ve also tightened security within the KubriX platform itself:

Our internal CI/CD pipeline now tracks CVE changes for every platform service update — helping prioritize critical patches faster.

Secrets management is smarter: more services now pull secrets securely from Vault — either user-defined or auto-generated and injected via push-secrets.

Always Up-to-Date

KubriX 3.0 ships with the latest stable versions of all core platform services, including:

ArgoCD, Backstage, CloudNative-PG, External-Secrets, Falco-Exporter, Grafana, Ingress-nginx, K8s-Monitoring, Keycloak, Velero, Cost-Analyzer, Crossplane, Loki, PGAdmin4, PostgreSQL, Tempo, Trivy-Operator, KubeVirt, KubeVirt-Manager — and more.

Keeping platform services current is hard. With KubriX, it’s effortless.

Why This Release Matters

  • Stronger Access Controls: Least-privilege principles are enforced by design — boosting security and usability for every team.

  • Proactive Operations: Built-in alerting means fewer surprises and faster recovery times.

  • Secure by Default: From CVE tracking to Vault-based secrets, KubriX 3.0 strengthens your software supply chain.

  • Future-Proof: You’re always running the latest and most secure platform stack — without the manual overhead.

Get Started with KubriX 3.0

  • Already a KubriX Prime customer? You’re getting KubriX 3.0 automatically via your Git update channel — no action needed.

  • New to KubriX? Schedule a demo to see how we can accelerate your platform engineering journey.

  • Like what we’re building? ⭐ us on GitHub!

KubriX 3.0 — Your internal developer platform for faster, smarter, and more secure application delivery.

Introducing KubriX 2.1 – Smarter Automation, Stronger Security, Seamless Scaling!

· 3 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

Just one month after our major KubriX 2.0 release, we’re back with another power-packed upgrade: KubriX IDP-Distribution 2.1 is here!

This release brings enhanced automation, improved platform stability, stronger team isolation, and security features that help your application teams move faster — with confidence.

What’s New in KubriX 2.1?

Automation, Automation, Automation

We believe in empowering teams to focus on building, not configuring. That’s why we’ve taken automation to the next level:

  • ArgoCD repo credentials are now created automatically for your team repos.
  • Spoke cluster registration in Vault is fully automated, along with SecretStore creation in each team’s namespace. Teams just need to define ExternalSecret resources — no more manual Vault configuration!

Rock-Solid Stability

We’ve tightened the bolts to ensure your GitOps flows are more robust and predictable:

  • Crossplane health checks are now fully integrated into ArgoCD’s status evaluations.
  • ArgoCD application health checks have been extended to verify complete sync status — especially useful when using sync-waves.

Stronger Team Isolation

Secure, scalable, and clean boundaries between teams are key to platform success. With 2.1, we’re one step closer to full multi-tenancy:

  • Each team now gets dedicated AppSet access tokens, eliminating the need for organization-wide tokens.
  • Vault roles and policies are team-specific, ensuring secrets stay where they belong.
  • Kargo Git credentials are scoped per team, isolating promotion pipelines to their respective repositories.

Sneak peek: KubriX 3.0 will bring even more powerful team isolation features!

Built-In Security

Security shouldn’t be optional—it should be default. KubriX 2.1 introduces:

  • A restructured Kyverno policy architecture
  • The ability to auto-generate deny-all network policies to enforce micro-segmentation

Stay tuned — more default policies are coming in future releases to lock down your platform effortlessly.

Updates Galore

We’ve refreshed the entire KubriX stack with the latest upstream Helm charts, so you’re always running the latest and greatest:

  • falco, grafana, loki, trivy-operator, kargo
  • argo-cd, cert-manager, external-dns, external-secrets
  • k8s-monitoring, cost-analyzer, and more

Why This Release Matters

  • Instant secrets access: Teams can immediately use Vault secrets from spoke clusters—no manual config needed.

  • Improved GitOps reliability: ArgoCD now waits for real readiness before marking apps as healthy.

  • Secure by default: Automated deny-all network policies and scoped permissions reduce blast radius and human error.

  • Frictionless onboarding: New teams and clusters can be onboarded and deployed without platform team intervention.

Getting Started with KubriX 2.1

  • Already a KubriX Prime customer? You’ll receive KubriX 2.1 automatically via your Git update channel — upgrade today!

  • Curious about KubriX? Reach out to us to schedule a demo.

  • Love what we’re building? Show your support with a ⭐ on our GitHub repo!

Experience faster, smarter, and more secure application delivery with KubriX 2.1 — your cloud-native developer platform, reimagined.

Announcing KubriX 2.0 – A Major Leap Forward!

· 2 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

We are thrilled to announce the release of KubriX IDP-Distribution 2.0! Following the successful launch of version 1.0 in January, our February release takes the platform to the next level with game-changing features, enhanced automation, and seamless enterprise integration.

What’s New in KubriX 2.0?

Cutting-Edge Platform Updates

KubriX 2.0 brings dozens of updates to the latest and greatest versions of our underlying platform services, including ArgoCD, Crossplane, Grafana, Mimir, Tempo, Vault, Velero, and more. Expect improved stability, performance, and security with this release.

Seamless Hub & Spoke Support for Developers

We’ve made Hub & Spoke topologies first-class citizens in KubriX, simplifying application deployment and team collaboration across different environments.

  • Out-of-the-box support for Hub & Spoke setups across team onboarding, app onboarding, and app delivery workflows.

  • Cluster label-based targeting, allowing you to select target clusters effortlessly.

  • Automatic propagation of cluster-specific information (like ingress domains) to apps, removing the need for developers to handle complex configurations.

ArgoCD SSO with Keycloak – Now Built-In

Identity management just got easier! ArgoCD SSO with Keycloak is now integrated out of the box, ensuring a seamless authentication experience for your teams.

Why Does This Matter?

For enterprises managing multiple clusters and applications, a Hub & Spoke architecture is the gold standard. In KubriX 2.0:

  • The central hub hosts core services like KubriX Delivery, KubriX Observability, and KubriX Portal.

  • The spokes run customer applications and KubriX spoke agents, providing clear separation of concerns and scalable operations.

Without KubriX, deploying applications across multiple environments can mean complex and repetitive configurations. Developers often need to manage tedious details like cluster names, API URLs, and ingress domains manually.

With KubriX 2.0, developers only define app deployment stages (e.g., test → nonprod, QA → nonprod, prod → prod) in their GitOps repo — everything else happens automatically. This removes unnecessary complexity, boosting developer productivity and streamlining delivery pipelines.

And of course, KubriX Observability and Security detect new applications automatically, providing instant insights via Grafana dashboards.

How to Get Started

  • Existing KubriX-Prime customers will receive KubriX 2.0 automatically through their Git update channel and can apply the upgrade today.

  • Interested in KubriX? Contact us to learn more and leave us a ⭐ on our GitHub repo!

Experience faster, smarter, and more efficient application delivery with KubriX 2.0 — your cloud-native developer platform, redefined!

kubriX latest oss update issues

· 3 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

At kubriX, we hold a strong belief in the transformative power of Open Source software. Our commitment goes beyond just using these tools — we actively contribute to the Open Source projects that drive our platform forward.

However, as anyone familiar with Open Source knows, new releases sometimes come with unexpected bugs. This is a natural part of the development process, and it’s up to the community to address and improve these issues. As the saying goes, "Open Source software is free as in freedom, not free as in free beer."

kubriX is designed to help platform teams manage these risks and reduce the effort required to update OSS platform services. Our goal is to make platform updates smooth and hassle-free for our customers.

Here are two recent examples of how kubriX has proactively supported our customers during Open Source updates:

ArgoCD v2.13.0 Bug Mitigation

The release of ArgoCD v2.13.0 introduced a bug where Pod Disruption Budgets (PDBs) led to degraded applications. This issue, documented in this GitHub issue, had the potential to disrupt application health. At kubriX, we took a proactive approach, ensuring our customers’ environments were not affected. Instead of updating directly to v2.13.0, we held back until the release of ArgoCD v2.13.1, which included a fix for the PDB issue. By doing so, we saved our customers from experiencing this problem and the time-consuming process of troubleshooting it.

Grafana Tempo v2.6.1 Breaking Change

Another instance occurred with the release of Grafana Tempo v2.6.1. This update introduced a breaking change in the configuration for tempo queries. While the Grafana Tempo Helm Chart v1.14.0 updated to the new tempo binary (v2.6.1), it still assumed compatibility with the previous configuration. As a result, deployments failed to start correctly, causing a complete service disruption for affected users.

kubriX identified this issue early and took swift action. We decided to withhold updates to the new Helm Chart version until a proper fix was available. In the meantime, we have been actively supporting the maintainers of Grafana Tempo to resolve the issue.

How kubriX Supports Your Platform Stack

These examples illustrate why kubriX is a vital partner in maintaining and updating your platform stack. We go beyond providing software updates — we provide proactive support, risk mitigation, and community contributions that ensure your platform runs smoothly. With kubriX, your platform teams can focus on innovation instead of firefighting unexpected issues from OSS updates.

Accelerating Internal Developer Platforms with kubriX

· 3 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

kubriX is a curated, opinionated, and highly flexible distribution for Internal Developer Platforms (IDPs). But why do platform teams need an IDP distribution in the first place?

If you recognize that platform engineering and modern platforms—often referred to as Internal Developer Platforms—can help you deliver exceptional digital products faster, more securely, reliably, and at a lower cost, the next question is: how do you get started?

Building an Internal Developer Platform (IDP) requires many building blocks. Nowadays, the foundational operational platform is typically Kubernetes or OpenShift, as it provides an excellent abstraction layer for the underlying infrastructure, allows for full automation through APIs, and is highly extensible. However, this foundation alone isn’t sufficient to fully leverage the benefits of a platform.

To address this, an entire ecosystem of projects and products has emerged under the Cloud Native Computing Foundation (CNCF), each catering to various aspects like security, observability, cost management, application delivery, and more.

As a platform or infrastructure team within an organization, you're often faced with a mountain of new—and sometimes quite complex—tasks, along with a completely new way of managing infrastructure and applications. You need to evaluate tools, understand how to install and maintain them, integrate them together, and configure everything so that it functions as intended.

This process can take years before your organization has a fully operational platform that is embraced by developers and engineers and runs smoothly.

With kubriX, we can shorten this timeline from years to weeks or just a few months. We pre-select the tools, integrate them, and configure them to work seamlessly together, creating a comprehensive solution. Our extensive expertise in software delivery, security, compliance, and observability ensures that these tools provide real value to the organization, adhering to best practices and state-of-the-art standards. In essence, kubriX makes numerous open-source projects "production and enterprise ready" for your platform.

Moreover, kubriX reduces ongoing operational costs associated with platform management. We handle component updates, quality assurance, and vulnerability management while also alerting you to breaking changes. When bugs are discovered, we often take on the communication with project maintainers, which can be a significant burden for platform teams.

Despite its comprehensive nature, kubriX remains highly flexible and customizable to meet your specific needs. If you already have an observability tool in place, for instance, we can integrate it and disable our default component.

Much like Red Hat provides a Linux distribution that saves you the hassle of compiling the kernel and selecting necessary additional tools for your operating system, kubriX offers an IDP distribution that can be quickly and easily installed on your infrastructure, allowing you to enjoy the benefits of a modern platform right away.

As for the name: we refer to these predefined, curated components of our platform as "Bricks," and that’s how we arrived at the product name "kubriX."

Reduce risks, time and costs with kubriX

· 3 min read
Johannes Kleinlercher
kubriX Dev, platform engineer, systems architect

An IDP often consists of several products or open-source projects. Updating those components is something your platform team needs to do regularly, for several reasons:

  • to get the newest features
  • to get bug fixes that increase stability
  • to get security fixes that satisfy your vulnerability management requirements

Those updates can be time-consuming and error-prone. However, kubriX keeps your platform up-to-date and our quality checks dramatically reduce your efforts and deployment risks.

A real life example

As the following real-life example shows, updates can sometimes be really hard.

Recently, version 5.5.0 of the Grafana Mimir Helm chart was released. I don't know whether you are a person who really evaluates changelogs before updates, but even if you read the changelog for 5.5.0, you won't recognize any breaking changes.

However, the first time you would probably notice a problem is when your Grafana Mimir ArgoCD application stays OutOfSync forever. If you take a closer look at the diff, you will see:

OutOfSync example

green (right side) is desired state, left is current state.

You then probably ask yourself and your teammates why the heck this app doesn't sync anymore. Are there conflicts with another app overriding this configuration? Is the Grafana Agent Operator overriding the GrafanaAgent instance? Is your GrafanaAgent CRD incompatible with the new GrafanaAgent after the update? Where is this CRD defined? Do I need to look at the new sync options ArgoCD provides (and they provide a lot of them)?

Believe me, this problem will take you hours even if you are very experienced, unless you make the right guess in the first place.

The problem was that the GrafanaAgent CRD supports the attribute topologySpreadConstraints, but the indentation in the GrafanaAgent CR was wrong in Mimir 5.5.0, so it was not compliant with the GrafanaAgent CRD spec.

However, why did ArgoCD only show an OutOfSync status? This is actually an open issue in ArgoCD.

When you apply the manifests manually, you would see:

Error from server (BadRequest): error when creating "STDIN": GrafanaAgent in version "v1alpha1" cannot be handled as a GrafanaAgent: strict decoding error: unknown field "spec.containers[1].topologySpreadConstraints"

And indeed, topologySpreadConstraints is not a valid attribute inside the containers attribute.

The benefits of kubriX

We already run these update tests for you and check many of the things your platform team would otherwise have to do. Only when our quality gates show a green light do we integrate new versions into the kubriX platform.

When we recognize problems like the one above, we open an issue on the original upstream project and help solve it together with the maintainers.

For this problem alone, you save at least 3-4 days of updating, troubleshooting, communicating with the community, and fixing the update.

You can see our real-life tests and investigations in this renovate PR.