Errors
Caveats
Most if not all pages are just rough notes, and these pages as a whole are far from complete. More notes will be added in time, eventually, maybe.
If, from reading these notes, you conclude that I am off my rocker, you won’t be the first, and you may even be right. These pages may simply position me as an acolyte to the late Gene Ray.
No doubt there are a dozen and one reasons why none of this would ever work, but perhaps somewhere deep down there is a tiny fragment that could be used for something.
Overview
Disagreements on error handling abound eternally. Return codes from function calls don’t work for functions where there is no spare value to use for errors (e.g. an integer return where any integer is a valid result). Error codes stored in a global variable are not thread-safe and tend to get overwritten before the program gets around to reading them, leading to the infamous “Error: The operation completed successfully” (the global error number variable was zeroed by a later function call that didn’t fail). Exceptions are widely derided for their performance cost, although this is likely to matter only when they are thrown repeatedly in tight loops.
One advantage of exceptions is that they guarantee that the program receives a structured record containing details of the fault.
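As a small illustration of that structured record, here is what Python happens to attach to a failed file open. This is only a sketch of the idea in one language; the file name settings.ini is made up for the example.

```python
import errno
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file name; it is never created, so the open must fail.
    missing = os.path.join(folder, "settings.ini")
    try:
        with open(missing, encoding="utf-8") as handle:
            handle.read()
    except OSError as exc:
        caught = exc  # keep a reference; "exc" is unbound after the block
        print(type(exc).__name__)   # the specific exception class
        print(exc.errno)            # the numeric error code
        print(exc.strerror)         # the reason, as text from the OS
        print(exc.filename)         # the object of the error
```

The record carries the object (filename), a reason (strerror) and a code (errno), but not the higher-level operation that was under way, which the caller still has to supply.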
The largest problem with error handling is that errors are frequently reduced to a bare number or an absurdly generic message such as “Access is denied” (by whom or what, and to what?) and “Path not found”. Windows Update is the worst offender of all, as most of its error codes are effectively dice rolls and completely meaningless.
All system errors must indicate:
- The object of the error (e.g. file path, hostname, URL)
- A clear description of the operation that failed
- A clear reason indicating why the operation failed
- A distinct error code
This information will be formatted as (ideally) a single sentence or possibly a very short paragraph, with no line breaks.
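The four requirements above could be captured in a record along these lines. This is a rough sketch, not a proposal for a real API: the class name FaultRecord, the field names, the code E_DISK_FULL and the file path are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical record holding the four required parts of an error."""

    code: str       # a distinct error code (made-up naming scheme)
    operation: str  # a clear description of the operation that failed
    target: str     # the object of the error: file path, hostname, URL
    reason: str     # a clear reason why the operation failed

    def message(self) -> str:
        text = f"Error {self.code}: {self.operation} {self.target}: {self.reason}"
        # Collapse all whitespace runs, including line breaks, so the
        # message is always a single line.
        return " ".join(text.split())


fault = FaultRecord(
    code="E_DISK_FULL",
    operation="writing monthly report to",
    target="/srv/reports/summary.csv",
    reason="no free space\nremains on the volume",
)
print(fault.message())
```

Making every field mandatory means a caller cannot raise an error without naming the object and the reason, which is the whole point.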
PowerShell made an attempt to achieve detailed error messages, but in most cases, the details of the operation and the failure reason are so vague as to render the error message meaningless.
There is a careful balance to be found when reporting error messages. If the chosen operation is too fine-grained (e.g. the low-level function call that failed), the message will mean nothing as neither the user nor support technician is going to know why that operation was used. If the chosen operation is too high-level, it may not indicate what was taking place to cause the failure. For example: “reading remote configuration from control server” is better than “connecting to server”, as the latter doesn’t indicate what server is being contacted or why. “Loading configuration” is too vague, as if that fails with a socket timeout, there won’t be any indication of what was taking place that involved a socket. The full message may read, “Error: reading remote configuration from control server control.example.org on port 443: timeout establishing TCP connection”. This will immediately reveal mistakes such as an outdated hostname, wrong port number, offline server or a missing firewall or port forwarding rule. It will also indicate why such a connection is being made, for example if such a server was not meant to be in use in the first place.
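One way to arrive at a message of that shape is to catch the low-level failure at the point where the purpose of the connection is still known, and wrap it. The sketch below assumes a hypothetical OperationError class and a caller-supplied connect function; neither comes from any real library, and the stub stands in for a real socket call so the example runs offline.

```python
class OperationError(Exception):
    """Hypothetical error that names the operation, its object and the reason."""

    def __init__(self, operation, target, reason):
        self.operation = operation
        self.target = target
        self.reason = reason
        super().__init__(f"Error: {operation} {target}: {reason}")


def fetch_remote_config(host, port, connect):
    """Fetch configuration; 'connect' is any callable taking (host, port)."""
    try:
        return connect(host, port)
    except TimeoutError as exc:
        # Add the mid-level context here, and keep the low-level
        # exception chained for anyone who needs the stack trace.
        raise OperationError(
            operation="reading remote configuration from control server",
            target=f"{host} on port {port}",
            reason="timeout establishing TCP connection",
        ) from exc


def unreachable(host, port):
    # Stub standing in for a real connection attempt that times out.
    raise TimeoutError("timed out")


try:
    fetch_remote_config("control.example.org", 443, unreachable)
except OperationError as exc:
    report = str(exc)
    cause = exc.__cause__
    print(report)
```

The low-level TimeoutError remains available as the chained cause, so the detailed trace can still go to a log while the user sees only the one-line message.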
The granularity conundrum is, in theory, resolved by stack traces, but most real-world software involves so many nested function calls to achieve even the most trivial task that the stack trace ends up impenetrable. A single function that performs several semi-high-level operations (several groups of instructions) in sequence won’t resolve this conundrum either, as you still need to know which of the instruction groups was the one that failed.
Stack traces are also inappropriate for error messages due to their length, complexity and line breaks, although some programs do show them in an expandable portion of a dialog box or write them to a more loosely-formatted log file or to the Windows event log (where complex multi-line log entries are accepted). Logging stack traces is beneficial for more serious errors, but not appropriate for routine errors unless there is reason to believe that the program itself is at fault (trying to do something it is not supposed to do).
Lazy errors
This was previously posted on Bug of the Moment — a particular computer could no longer open Disk Defragmenter (dfrg.msc):
(This was a test with a known-good dfrg.msc from another computer.)
The message reads:
MMC cannot open the file C:\dfrg.msc.
This may be because the file does not exist, is not an MMC console, or was created by a later version of MMC. This may also be because you do not have sufficient access rights to the file.
OK, so … The computer is able to determine for itself whether the file exists, so this cannot be a mystery. The computer is also perfectly capable of determining whether it could open the file, so this could also be reported definitively. If the version recorded in the console file’s XML was later than the running version of Microsoft Management Console (MMC), then MMC could have reported this too. MMC is also capable of ascertaining whether the file was a real MMC console, and it most definitely was.
The MSC file did exist, was perfectly accessible and was a genuine MMC console of the correct version. It was not possible to determine why Disk Defragmenter could not be loaded. It wasn’t that the underlying MMC snap-in was faulty, as it was possible to create a blank console and add the snap-in.
This message is completely lazy. MMC had complete access to the real cause, and kept it a secret. Exactly what programming technique must be adopted to avoid this in future cannot be known without observing the source code and the possible code paths that led to this error message. What is clear however is that error messages must not be suggestions. Report the exact problem, not a vague list of unrelated possibilities. Do not report any suggested cause that could have been checked and ruled out programmatically. Make good use of library routines to offload validation work, and ensure that the language framework and operating system provide as much help as possible so that error handling can be centralised. Use standard file formats with a provided schema definition so that the contents can be validated using common code. In other words, minimise the amount of custom error handling needed, so that there are fewer ways to misreport problems.
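To make the point concrete, here is a rough sketch of checking each of the causes listed in that message in turn and reporting only the one that actually applies. It is not how MMC works internally, which I cannot know; the function name, the default supported version, and the root element and attribute names (MMC_ConsoleFile, ConsoleVersion) are all assumptions for the sake of the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def diagnose_console_file(path, supported_version=(3, 0)):
    """Return one definite reason the file cannot be used, or None if it is fine."""
    if not os.path.exists(path):
        return f"the file {path} does not exist"
    try:
        with open(path, "rb") as handle:
            data = handle.read()
    except PermissionError:
        return f"access to the file {path} was denied"
    except OSError as exc:
        return f"the file {path} could not be read: {exc.strerror}"
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        return f"the file {path} is not well-formed XML ({exc})"
    if root.tag != "MMC_ConsoleFile":  # assumed root element name
        return f"the file {path} is not an MMC console (root element is {root.tag})"
    raw = root.get("ConsoleVersion", "")  # assumed attribute name
    try:
        version = tuple(int(part) for part in raw.split("."))
    except ValueError:
        return f"the file {path} has an unreadable console version {raw!r}"
    if version > supported_version:
        return f"the file {path} was created by a newer MMC (version {raw})"
    return None


with tempfile.TemporaryDirectory() as folder:
    absent = diagnose_console_file(os.path.join(folder, "dfrg.msc"))
    newer_path = os.path.join(folder, "newer.msc")
    with open(newer_path, "w", encoding="utf-8") as handle:
        handle.write('<MMC_ConsoleFile ConsoleVersion="4.0"/>')
    newer = diagnose_console_file(newer_path)
    good_path = os.path.join(folder, "good.msc")
    with open(good_path, "w", encoding="utf-8") as handle:
        handle.write('<MMC_ConsoleFile ConsoleVersion="3.0"/>')
    good = diagnose_console_file(good_path)

print(absent)
print(newer)
print(good)
```

Each branch rules a cause in or out definitively, so the message never needs to say “this may be because”. When every check passes, the function returns None and the real fault lies somewhere it has not looked, which is itself worth saying plainly.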
Error numbers
Avoid the confusion found in Microsoft’s error numbers: they are intended to be reported in hexadecimal but are widely reported instead as both unsigned and signed decimal. Nobody needs an error number that can be presented in three different ways. Hexadecimal is better for visual classification (more combinations with a shorter and fixed number of digits), but in practice these numbers are too often presented as decimal. Avoid negative numbers, as a hyphen-minus in a Google search indicates a term to be excluded (Apple in particular made use of negative numbers).
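The three presentations are easy to demonstrate with a 32-bit value. The sketch below uses 0x80070005, the familiar Windows “access denied” code; the function name presentations is my own.

```python
def presentations(code):
    """Return a 32-bit error code as hex, unsigned decimal and signed decimal."""
    unsigned = code & 0xFFFFFFFF
    # Reinterpret the same 32 bits as a two's complement signed integer.
    signed = unsigned - 0x1_0000_0000 if unsigned & 0x8000_0000 else unsigned
    return f"0x{unsigned:08X}", str(unsigned), str(signed)


as_hex, as_unsigned, as_signed = presentations(0x80070005)
print(as_hex)       # 0x80070005
print(as_unsigned)  # 2147942405
print(as_signed)    # -2147024891
```

Three different search terms for one error, and the third begins with a character that a search engine may treat as an exclusion operator.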
Error hierarchy
An error hierarchy may prove useful for technical purposes, although it has little meaning to users and administrators.
Suggestions
E_PROHIBITED: the request was rejected.
E_SOFT_PROHIBITED: sub-type of E_PROHIBITED, indicating that the request was rejected for user convenience and safety, and that the user may override the rejection.
E_HARD_PROHIBITED: sub-type of E_PROHIBITED, indicating that the user does not have the privilege to issue the request.
E_FILE_READ_ONLY: sub-type of E_SOFT_PROHIBITED (the user may override a file’s read-only status at their discretion)
E_ACL_DISALLOW: sub-type of E_HARD_PROHIBITED (the ACL on the object does not allow the user to make the specified request)
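The suggested hierarchy maps naturally onto exception classes. The following is only a sketch of how that might look: the class names mirror the codes above, while delete_file, its parameters and can_override are hypothetical helpers that simulate a deletion rather than touching any real file.

```python
class EProhibited(Exception):
    """E_PROHIBITED: the request was rejected."""


class ESoftProhibited(EProhibited):
    """E_SOFT_PROHIBITED: rejected as a safeguard; the user may override."""


class EHardProhibited(EProhibited):
    """E_HARD_PROHIBITED: the user lacks the privilege to issue the request."""


class EFileReadOnly(ESoftProhibited):
    """E_FILE_READ_ONLY: the file is marked read-only."""


class EAclDisallow(EHardProhibited):
    """E_ACL_DISALLOW: the ACL on the object does not allow the request."""


def can_override(error):
    # Only soft prohibitions may be overridden at the user's discretion.
    return isinstance(error, ESoftProhibited)


def delete_file(name, read_only, force=False):
    """Simulated deletion; no real file is touched."""
    if read_only and not force:
        raise EFileReadOnly(f"deleting file {name}: the file is read-only")
    return f"deleted {name}"


try:
    delete_file("notes.txt", read_only=True)
except EProhibited as error:
    if can_override(error):
        # A real program would ask the user before retrying.
        outcome = delete_file("notes.txt", read_only=True, force=True)
        print(outcome)
```

A caller that only cares whether the request was rejected catches EProhibited; one that wants to offer an “override?” prompt checks for the soft sub-type.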
The screenshot below shows PowerShell pretending that deleting a read-only file is impossible. Add “-Force” and somehow you can sidestep the system security.
E_SOFT_PROHIBITED errors cover this gap, indicating that the rejection of the request was not due to administrative restrictions but simply a safeguard to stop the user making a mistake. A good example of such an error is the requirement under Acorn DFS (Disk Filing System) to issue *ENABLE before certain dangerous commands such as *BACKUP (sector-level disk duplication). As seen in the screenshot below, the error “Not enabled” is issued if the user attempts such an operation without first enabling it:
In DFS 1.20, this was replaced with a Y/N prompt.