The operational error that revealed a coding error.

I think we have a tendency to think in incident´s root causes as isolated factors (either the code is wrong, or somebody executed something wrong, or the configuration was wrong…) Even knowing the theory, if we have read dozens of articles about how usually its a combination of factors (errors), our automatic tendency is to look for a single culprit, even the concept itself root cause assumes a monolythic source from where the rest of consequences propagate until they create the problem we where trying to solve.

This next issue is proof that this is not always like that, and we should not use that mental shortcut without a little critical thinking: We had an operational error that, by itself, wouldn´t have caused the issue and I discovered a coding error in a module from the Ansible collection community.windows that, by itself, wouldn´t have caused an incident. What happened when they both concurred?

The source for this post is a bunch of notes that I always like to take when facing interesting troubleshoots. Moreover, it has been my first contribution to an open source project as big as Ansible.

The error: we are probably doing something wrong.

The error occurred in our main deployment pipeline, during the release of a new version. As always, the FQDNs, hostnames and the output in general have been edited to protect the respective clients.

failed: [host-2.domain.com] (item=CUSTOMER_WEBSITE) => 
{     
    "ansible_loop_var": "item",     
    "application_pool": "CUSTOMER-1",     
    "changed": false,    
    "physical_path": "G:\\WEBAPP\\22.2.0.0\\CUSTOMER-1" } 
    MSG: Cannot find path 'G:\\WEBAPP\\22.1.0.0\\CUSTOMER-1' because it does not exist.

The physical_path doesn´t exist? We already have a suspect, displaying another mental shortcut the first thing we thought was “If it has always worked, and now it doesn’t, we must have changed something that causes the error. We have broken that path” (and this is a mental shortcut, Certainly, just because it worked in the past does not guarantee that it will work in the present.) It was a good start.

The path for 22.2existed, but the 22.1 didn´t. Okay, operational error, during that moment in the deployment process, the previous path, since it´s the previous´version path, always exists in case its needed for a rollback. During the deployment we solved the problem by just restoring the missing path, and continued.

During execution I made a note to look into this later: What did the previous´ version path matter if what we were doing was mapping to the newest version path, which did exist? Had we really made an operational error? Or was the code imposing unnecessarily restrictive conditions? The problem went from being a suspect of operational error to something more interesting. The first thing I needed to know whas where that error came from. Ansible its Python, but in Windows hosts Ansible is Powershell and that error string looked very pwsh-ish. In another mental shortcut, I came up with an evidence-sparse theory based on instinct and what the limited data available inspired me. This, which can be called “speculation”, it Esto, que puede llamarse “especular”, its called heuristics in knowledge engineering.

The theory.

The error “Cannot find path $path” didn´t came from Ansible, nor from the pwsh module. It came from PowerShell itself. But first I had to understood a little bit better what had happened. To not stretch this post too much, I will put the facts as a list:

The path that was trying to be mapped (Version 22.2) existed at runtime.
The path of the previous version (22.1) had been deleted earlier than expected, there was an operational error here.
- This was possible because in another step of our automation we stop the IIS and delete all n older versions except n (ie we delete n-1versions to make sure there are no more version folders older than the one immediately before and that do not have value to make a rollback)
The IIS APP Pool still had the path of the previous version mapped (G:\\WEBAPP\\22.1.0.0\\CUSTOMER-1), since it was never updated.

After these facts it was pretty clear what we had done wrong from our side (deleting the latest version prematurely), but not why we got the “Cannot find path 'G:\\WEBAPP\\22.1.0.0\\CUSTOMER-1' because it does not exist.” error when in the module we where trying to map the 'G:\\WEBAPP\\22.2.0.0\\CUSTOMER-1', which did exist.

In my opinion, the greatest advantage of Open Source, and what has allowed it to become the de facto standard for many parts of our industry, is that you can take a look at the code to get a better idea of what is going on.

The bug in the code: An implicit validation.

When taking a look at the source of win_iis_webapplication I came across a very interesting line of code, previous to making a check between the current Physical Path of the webapp and the one received as a parameter ($folder.FullName), to validate if it has to be modified or not and meet the idempotency condition (do nothing if the desired state and the current state are the same).

$app_folder = Get-Item -LiteralPath $application.PhysicalPath
            $folder = Get-Item -LiteralPath $physical_path
            if ($folder.FullName -ne $app_folder.FullName) {
                Set-ItemProperty -LiteralPath "IIS:\Sites\$($site)\$($name)" -name physicalPath -value $physical_path -WhatIf:$check_mode
                $result.changed = $true
            }

There was the error. On the conditional, the module compares between the current path and the given path (if ($folder.FullName -ne $app_folder.FullName)), however, to get the current path it did something a bit strange: $app_folder = Get-Item -LiteralPath $application.PhysicalPath. If you perform a Get-Item for something that does not exist, you get the very descriptive error message Cannot find path 'X' because it does not exist. An IIS App Pool can have in its PhysicalPath any path, even one that does not exist if we´ve managed to delete it while it was stopped. Of course if will fail as soon as you start it again, but it is allowed, you can have a path that does not exist in the filesystem.

It was clear what had happened, when trying to do that Get-Item with the previous PhysicalPath which we have deleted (operational error) PowerShell threw an unrecoverable error (you are asking it to perform a Get of something that doesn´t exist, after all). Mystery solved, yay!

But, should the Ansible module perform a check on the old path, and not just the new one received? If we honor the idea that in Ansible you are defining the final state in which we want to leave our hosts, it doesn´t make much sense to validate the previous state. What if I am actually running my playbook to fix something that is currently wrong? What if somebody or something misconfigures my App Pools? The logical thing to do would be to run the playbook to fix it, and its not very helpful to have Ansible fail because my current configuration (not the one I want to get) is not correct. Moreover, in this particular case it didn´t even look like an explicit check, but an implicit check derived from the way the code was trying to get the existing PhysicalPath.

The fix: An improvement to not be unnecessarily restrictive.

The module shouldn´t care if the current PhysicalPath is valid, or if it exists. It can be ILoveYou, if should only care if the given one is valid and not the same that the current one. That is its call to action. With this approach, applying a fix is quite simple: Remove the implicit validation. And that is exactly what I did in this PR

$app_folder = Get-Item -LiteralPath $application.PhysicalPath
            if ($folder.FullName -ne $application.PhysicalPath) {
                Set-ItemProperty -LiteralPath "IIS:\Sites\$($site)\$($name)" -name physicalPath -value $physical_path -WhatIf:$check_mode
                $result.changed = $true
            }

Kudos

Official docs for the win_iis_webapplication module