A few weeks ago, I spent some time with our support and engineering teams helping a customer solve a problem that happened after they enabled Group Managed Service Accounts (gMSA) on Azure Kubernetes Service (AKS).
I decided to write this blog so other customers with the same issue can avoid going through it altogether. I'm writing the blog in the sequence in which I experienced the issue, but if you're just looking for the solution, feel free to skip to the end.
When that customer enabled gMSA on their cluster, a few things started to happen:
- Any gMSA-enabled deployment/container/pod entered a failed state. The events from the deployments would show the pods with the following error: Event Detail: Failed to setup the external credentials for Container '<redacted>': The RPC server is unavailable.
- Any non-gMSA deployment/container/pod using the customer's private images and running on Windows nodes also entered a failed state. The deployments were showing ErrImagePull events.
- All other deployments/containers/pods, on both Windows and Linux nodes, that were not using private images kept their healthy state.
Removing the gMSA configuration from the cluster would automatically return the entire cluster to a healthy state.
The error with the gMSA pods immediately reminded me of other cases in which I've seen customers hit similar issues because of network connectivity. The most common gMSA issues I have seen so far are:
- Blocked ports: Having a firewall between your AKS cluster and the Active Directory (AD) Domain Controllers (DCs). AD uses multiple protocols for communication between clients and DCs. I even created a simple script that validates the ports.
- Incorrect DNS configuration: AD uses DNS for service discovery. Domain Controllers have an 'SRV' entry in DNS that clients query so they can find not only all DCs, but the closest one. If either the nodes or the pods can't resolve the domain FQDN to a DC, gMSA won't work.
- Incorrect secret on Azure Key Vault (AKV): A user account is used by the Windows nodes, rather than a computer account, since the nodes are not domain-joined. The format of the secret should be <domain dns fqdn>\<user account>:<user password>.
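As a quick way to sanity-check the first two items, here is a minimal bash sketch. It is not the script mentioned above; dc01.contoso.local and contoso.local are placeholder names for your DC and domain, and the port list covers only a few of the most common AD protocols:

```shell
#!/usr/bin/env bash
# Quick connectivity sanity checks for gMSA on AKS.
# dc01.contoso.local / contoso.local are placeholders -- substitute your own DC and domain.

DC="dc01.contoso.local"

# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  local host=$1 port=$2
  timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null
}

# A few of the ports AD relies on: DNS 53, Kerberos 88, LDAP 389, SMB 445, LDAPS 636.
for port in 53 88 389 445 636; do
  if check_port "$DC" "$port"; then
    echo "Port ${port}: reachable"
  else
    echo "Port ${port}: BLOCKED or unreachable"
  fi
done

# Verify the SRV record clients query to locate DCs:
nslookup -type=SRV "_ldap._tcp.dc._msdcs.contoso.local" || echo "SRV lookup failed"
```

Run it from a Windows node (under WSL or an equivalent bash environment) or from any machine on the same vNet to see whether the basics are in place before digging deeper.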
There are other minor issues that I've seen, but these are the main ones. In the case of this customer, we reviewed the above and everything seemed to be configured properly.
At that point, I brought in other folks, and they spotted something that I knew existed but had not yet seen in combination with gMSA: AKS private clusters.
This customer had a security policy in place mandating that Azure resources use private endpoints whenever possible. That was true for the AKS cluster, and it introduced a behavior that broke the cluster.
I mentioned above that gMSA uses DNS for DC discovery. Let me explain what the default configuration is and what happened after enabling gMSA:
By default, Linux and Windows nodes on AKS use the Azure vNet DNS server for DNS queries, while Windows and Linux pods use CoreDNS. Azure DNS can't resolve AD domain FQDNs, since these tend to be private to on-premises or non-public cloud networks.
For that reason, when you enable gMSA and pass the DNS server parameter, two things change in the AKS cluster. First, the Windows nodes start using the DNS server provided. Second, the CoreDNS configuration is changed to add a forwarder, which forwards anything related to the domain FQDN to the specified DNS server. With these two configurations, Windows nodes and Windows pods can now 'find' the DCs.
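To make that CoreDNS change concrete, here is an illustrative sketch of what the forwarder entry in the coredns-custom ConfigMap looks like. AKS manages this entry for you when you enable gMSA, so you normally only need to read it (for example with `kubectl get configmap coredns-custom -n kube-system -o yaml`); contoso.local, 10.0.0.4, and the data key name are all placeholders, and the exact layout can vary by AKS version:

```yaml
# Illustrative sketch only -- AKS manages this ConfigMap when gMSA is enabled.
# contoso.local and 10.0.0.4 are placeholders for your AD domain FQDN and
# the DNS server you passed when enabling gMSA.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom   # managed by AKS; key names can vary by version
  namespace: kube-system
data:
  gmsa.server: |
    contoso.local:53 {
        forward . 10.0.0.4
    }
```

The server block tells CoreDNS to forward any query ending in the domain FQDN to the AD DNS server instead of resolving it through the default upstream.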
Azure Portal showing the CoreDNS configuration with a DNS forwarder after gMSA has been configured.
However, this introduces another issue when combined with a private AKS cluster. Private endpoints sit behind a private DNS zone. Azure DNS servers can resolve those zones, but non-Azure DNS servers can't. Since the Windows nodes and Windows pods are now using a DNS server outside of Azure, the private zone of the AKS cluster can't be resolved, so the Windows nodes and Windows pods can't reach the cluster's private API endpoint.
Not only that, but this customer also had their Azure Container Registry (ACR) behind a private endpoint. The second symptom above was also caused by this configuration: the Windows nodes can't resolve the private zone of the ACR registry and consequently can't pull their private images.
For reference, these are the container-related services and their private zones:
| Private link resource type | Subresource | Private DNS zone name | Public DNS zone forwarders |
| --- | --- | --- | --- |
| Azure Kubernetes Service – Kubernetes API (Microsoft.ContainerService/managedClusters) | management | privatelink.{regionName}.azmk8s.io | {regionName}.azmk8s.io |
| Azure Container Apps (Microsoft.App/ManagedEnvironments) | managedEnvironments | privatelink.{regionName}.azurecontainerapps.io | azurecontainerapps.io |
| Azure Container Registry (Microsoft.ContainerRegistry/registries) | registry | privatelink.azurecr.io | azurecr.io |
For a full list of zones, check out the Azure documentation.
The solution here is simple: create a DNS forwarder so that the non-Azure DNS servers can resolve the Private Endpoint zones.
This customer had a very specific implementation, but in general you need to configure a DNS forwarder for the zones of the services you are using. For example:
- For AKS clusters: create a forwarder for azmk8s.io to 168.63.129.16.
- For ACR registries: create a forwarder for azurecr.io to 168.63.129.16.
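If the AD DNS server in question is a Windows Server DNS (for example, a DC hosting DNS), conditional forwarders like these can be created with the DnsServer PowerShell module. This is a sketch, not the customer's exact implementation; adjust the zone list to the services you actually use:

```powershell
# Run on the DNS server used by the AKS Windows nodes (e.g., a DC hosting DNS).
# Forwards the public zones of the private-endpoint-backed services to the
# Azure platform DNS at 168.63.129.16, which can resolve the privatelink zones.
Add-DnsServerConditionalForwarderZone -Name "azmk8s.io"  -MasterServers 168.63.129.16
Add-DnsServerConditionalForwarderZone -Name "azurecr.io" -MasterServers 168.63.129.16
```

Note that this only works when the DNS server can route to 168.63.129.16, which is reachable only from within an Azure vNet; for a purely on-premises DNS server, you would instead forward to a resolver inside the vNet (such as Azure DNS Private Resolver).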
168.63.129.16 is the virtual IP address of the Azure platform that serves as the communication channel to platform resources. One of its services is DNS. In fact, this is the DNS service the Windows nodes and Windows pods were using before gMSA was enabled.
It's always DNS!
If you are using gMSA on AKS, keep in mind that Windows nodes and Windows pods will start using a DNS server outside of Azure (or one that has no direct visibility into the Azure platform, such as Private Endpoint zones). You might need to configure DNS forwarders once you start using gMSA on AKS, and this applies to any Azure service behind a private endpoint, not just AKS and ACR.
I hope this blog post helps you avoid this issue, or at least troubleshoot it faster. Let us know in the comments!