Networking in the cloud is becoming more and more streamlined, but that doesn't mean it's free of vendor-specific peculiarities. In this case, trying to leverage Azure's Network Security Groups (Azure's flavor of ACLs) under a particular setup surfaced a behavior that taught me a thing or two about Azure's networking.

Intended setup: NSGs to block access from the Internet

The intended setup was straightforward and one we already had in other parts of our infrastructure: limit public access (from the Internet) to our services to only certain IPs. For that, we applied NSGs to the VNET's subnets where the servers resided. It's a tried and tested method, so the NSGs were set up and the exception IPs were added. But this time it didn't work: the services were still accessible from the Internet to anyone. Why was that? What was different in this setup? Quite a few things, actually.

The initial setup: More components to take into account

The servers were spread across a VNET (let's call it Example VNET) with four subnets, which I'll reduce to two for the sake of this example: ServersA_subnet and ServersB_subnet. The Azure Application Gateways that controlled access sat on another subnet (Gateways_subnet). That makes three subnets where we could apply the NSG to prevent unauthorized direct access to either of the servers' subnets or to the AppGWs, which were meant to handle most of the traffic.
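
For context, the relevant layout boils down to something like this. Only the subnet names come from the example above; the address ranges are made-up placeholders, not our real ones:

```python
# Hypothetical layout of Example VNET; CIDRs are placeholders for illustration only.
EXAMPLE_VNET = {
    "address_space": "10.20.0.0/16",
    "subnets": {
        "ServersA_subnet": "10.20.1.0/24",  # backend servers, NSG applied here
        "ServersB_subnet": "10.20.2.0/24",  # backend servers, NSG applied here
        "Gateways_subnet": "10.20.3.0/24",  # Azure Application Gateways, NSG applied here
    },
}
```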

The NSG setup was simple enough. There were just inbound rules to add, since outbound connectivity was not to be limited (a minimal sketch of the rules follows the list):

  1. Allow inbound HTTPS traffic coming from certain public IPs
  2. Allow inbound traffic coming from the VNET (this is a default Azure NSG rule to allow intra-VNET communication)
  3. Deny all other inbound traffic
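
To make that rule set concrete, here is a minimal sketch of the three rules and a first-match evaluation in plain Python. This is not the real NSG resource or its priority-based engine, and the allowed CIDR is a placeholder:

```python
import ipaddress

# Sketch of the three inbound rules, in evaluation order. CIDRs are made up.
INBOUND_RULES = [
    # (name, allowed sources, destination port, action)
    ("AllowCustomerIPs", ["192.0.2.0/28"], 443, "Allow"),
    ("AllowVnetInBound", ["VirtualNetwork"], None, "Allow"),  # Azure default rule
    ("DenyAllInBound", ["*"], None, "Deny"),                  # catch-all
]

def evaluate(source_ip, port, vnet_prefixes):
    """Return the name and action of the first rule that matches."""
    src = ipaddress.ip_address(source_ip)
    for name, sources, rule_port, action in INBOUND_RULES:
        if rule_port is not None and rule_port != port:
            continue
        for entry in sources:
            if entry == "*":
                return name, action
            if entry == "VirtualNetwork":
                # The service tag expands to the VNET's own prefixes (and, as we
                # learned later, to peered VNETs' prefixes as well).
                if any(src in ipaddress.ip_network(p) for p in vnet_prefixes):
                    return name, action
            elif src in ipaddress.ip_network(entry):
                return name, action
    return None, "Deny"

print(evaluate("192.0.2.5", 443, ["10.20.0.0/16"]))    # ('AllowCustomerIPs', 'Allow')
print(evaluate("192.0.2.200", 443, ["10.20.0.0/16"]))  # ('DenyAllInBound', 'Deny')
```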

My IP was clearly not one of the public IPs allowed, yet I could hit the servers. And others could, too. What was wrong in the setup, then? Was the NSG not being applied correctly? Unlikely, but after reviewing and validating that the NSG setup was correct, I set out to investigate what this environment had that the others didn't. Time to trace the network flow!

The inbound flow, from the browser request until it reached the servers, was as follows (a rough trace in code follows the list):

  1. First, a CNAME like customer-service.ourdomain.com in our third-party DNS service, mapped to an A record (also hosted with the same third-party DNS vendor).
  2. That A record mapped to the public IP of one of our Azure Firewalls.
  3. A Destination NAT (DNAT) rule on the Firewall translated the public IP to the correct Application Gateway's private IP.
  4. The AppGW sent the request to one of the servers in the correct backend pool.
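
Put together, the path looked roughly like this. This is purely illustrative; the hostnames and IPs are placeholders, not the real ones:

```python
# Illustrative trace of the inbound path described above; all names and IPs are hypothetical.
HOPS = [
    ("CNAME lookup",  "customer-service.ourdomain.com -> A record at the third-party DNS vendor"),
    ("A record",      "resolves to 203.0.113.50 (Azure Firewall public IP)"),
    ("Firewall DNAT", "203.0.113.50:443 -> 10.20.3.10:443 (Application Gateway private IP)"),
    ("App Gateway",   "10.20.3.10 -> 10.20.1.4 (backend server in ServersA_subnet)"),
]

for step, translation in HOPS:
    print(f"{step:14} {translation}")
```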

This setup was different mainly because it had a Firewall that was also doing NAT; the NSG method had been used in other environments that didn't leverage Azure Firewall. Also, the CNAME was proxied, meaning that once a request reached our third-party DNS provider it would enter their network and arrive at our servers as if it came from one of their public IPs. See what's happening already?

What was really happening: this was not public traffic anymore

The issue with our approach was, first and foremost, that our Firewall was performing the DNAT translation. Our third-party vendor's public IPs were allowed through, and since the CNAME was proxied, any request for customer-service.ourdomain.com was coming from those approved third-party DNS public IPs. All good there, that is what should happen. But at that step, the original requester's IP was already meaningless. More interesting things were happening, though, that reveal a few things about the inner workings of Azure's VNETs.

When performing the translation for a DNAT rule, Azure Firewall also translates the source IP to a private IP in the range of the Firewall's own subnet. Per this Tech Community doc:

“When a new flow matches against a DNAT rule on the Azure Firewall, both the source and destination IP addresses will be translated to new values. When the destination is a private IP address in the virtual network, the source IP address will translate to one of the IP addresses in the AzureFirewallSubnet of the virtual network, while the destination IP address will translate to what has been configured in the DNAT rule as the Translated address.”
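
In code form, the effect of that paragraph looks roughly like this. It is a sketch only, with made-up addresses; it is not how Azure Firewall is actually implemented:

```python
import ipaddress
import random

# Hypothetical addresses for illustration; not our real environment.
AZURE_FIREWALL_SUBNET = ipaddress.ip_network("10.0.1.0/26")  # AzureFirewallSubnet range
DNAT_TRANSLATED_ADDRESS = "10.20.3.10"                       # AppGW private IP from the DNAT rule

def apply_dnat(original_source: str) -> tuple[str, str]:
    """Approximate the flow rewrite described in the quote above."""
    # Destination: the Firewall's public IP is replaced by the configured Translated address.
    new_destination = DNAT_TRANSLATED_ADDRESS
    # Source: the original sender's IP is replaced by an IP from AzureFirewallSubnet.
    new_source = str(random.choice(list(AZURE_FIREWALL_SUBNET.hosts())))
    return new_source, new_destination

# 198.51.100.23 stands in for one of the third-party proxy's egress IPs.
src, dst = apply_dnat("198.51.100.23")
print(f"post-DNAT flow: source={src} destination={dst}")
# Whatever the sender's IP was, the NSG downstream only ever sees a source
# inside AzureFirewallSubnet.
```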

So technically, after the DNAT rule was executed, any request coming from the outside (the Internet) for that CNAME now appeared to originate from the Firewall's subnet. Still, that didn't mean the request should reach the servers. Actually, it should be the opposite, right? In the NSG, Rule 1 allowed a series of public IPs that no longer meant anything (all requests at this point carried a private source IP in the range of the Firewall's subnet), and Rule 2 allowed everything coming from the same VNET, but the Firewall was in a whole different VNET. If anything, our original setup should have been denying all requests, even the ones coming from the allowed IPs (which at this point in the request's flow were irrelevant). But instead it was allowing everything. Why? Well… both the Firewall's VNET and our Example VNET were peered.

According to the official Microsoft docs, if the peering is configured with “Allow vnet-1 to access vnet-2” (and it is by default), the VirtualNetwork service tag used by NSGs includes both the VNET and its peered VNETs. Meaning that, for the NSG, the Firewall's VNET and the local VNET where the NSG was applied were indistinguishable.
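
A quick way to picture the consequence, with hypothetical CIDRs and a deliberately simplified model of the service tag:

```python
import ipaddress

# Hypothetical prefixes for illustration only.
EXAMPLE_VNET_PREFIXES = ["10.20.0.0/16"]   # where the NSG is applied
FIREWALL_VNET_PREFIXES = ["10.0.0.0/16"]   # peered VNET hosting AzureFirewallSubnet

def virtual_network_tag(include_peered: bool) -> list[ipaddress.IPv4Network]:
    """Approximate what the VirtualNetwork service tag expands to."""
    prefixes = list(EXAMPLE_VNET_PREFIXES)
    if include_peered:  # the default when peering allows access between the VNETs
        prefixes += FIREWALL_VNET_PREFIXES
    return [ipaddress.ip_network(p) for p in prefixes]

post_dnat_source = ipaddress.ip_address("10.0.1.7")  # an AzureFirewallSubnet IP

for include_peered in (False, True):
    matches = any(post_dnat_source in net for net in virtual_network_tag(include_peered))
    print(f"peered prefixes included={include_peered} -> Rule 2 matches: {matches}")
# With the peered prefixes included (Azure's default), Rule 2 "Allow from VirtualNetwork"
# matches the translated source, so the request sails through the NSG.
```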

So, logically, the NSG was doing its job. My outside requests were first proxied through our third-party DNS and took on that vendor's IPs, which were allowed on our Firewall but were no longer my original public IP. Then the Firewall executed a DNAT rule to translate the request's destination IP to a private IP in the correct range for the correct resource (an Application Gateway).

In the process, it also translated the source IP (again) to fall within the private IP range of its own subnet in the Firewall's VNET. Now, coming from a VNET peered with Example VNET, my original request looked exactly like any other request coming from inside Example VNET. Per Rule 2, the NSG allowed it through. And that was it, mystery solved.

The solution

We couldn't simply disallow intra-VNET traffic: our servers needed to talk to each other, as well as access other resources. A more complex setup like this simply wasn't going to work with NSGs alone. And unproxying the CNAME so requests didn't take on our vendor's public IPs was out of the question (the proxying provided many other advantages). Our best bet was to solve it as close to the edge as possible, leveraging another kind of ACL in our third-party DNS vendor.
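
Conceptually, that edge ACL performs the check the NSG no longer could, against the original client IP, before the request is proxied into Azure at all. A generic sketch, not our vendor's actual API, with placeholder CIDRs:

```python
import ipaddress

# Placeholder allowlist; the real one lives in the DNS/proxy vendor's ACL feature.
ALLOWED_CLIENT_CIDRS = [ipaddress.ip_network(c) for c in ("192.0.2.0/28",)]

def edge_allows(client_ip: str) -> bool:
    """Decide at the edge, while the true client IP is still visible."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in cidr for cidr in ALLOWED_CLIENT_CIDRS)

print(edge_allows("192.0.2.5"))    # True  -> proxied on towards Azure
print(edge_allows("192.0.2.200"))  # False -> blocked before it ever reaches our Firewall
```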

Kudos