Monday, November 18, 2013

Dunwoody 2.0?


When version numbers are truly meaningful and not just a marketing tactic, a major release has some key attributes beyond feature additions or changes to the UX.

Often the new version is incompatible with data and configuration files used by the previous version. In almost every case an upgrade tool or import plugin facilitates migration to the new version, but the process is laborious and not without risk. Such incompatibilities are an inherent element of forward progress.

An important side effect of these large increments in functionality is the load placed on the existing platform. It is not at all unusual to find after a major upgrade that the system is slow, often to the point of being unusable. With each major upgrade there is an associated increase in the minimum performance and capacity required to support the newer and better version. In cases where a system upgrade is driven by the adoption of a newer version of software, it is not at all uncommon to replace the underlying platform and hardware at the same time the applications are upgraded. Experience has taught many a techie that this is ultimately what occurs, so you might as well cut to the chase.

Though some may think we have a major version upgrade, the facts do not support the assertion. What we've done is swap out one of our key apps, and it isn't clear if that is a bug fix version, a minor feature release or even a feature downgrade for the sake of system stability. It is very unlikely that the new app is a major version upgrade, and even if it is, it remains only one of seven apps.

And since initial boot up we've done virtually nothing to the underlying platform and hardware. Now some may contend that replacement of an improperly installed platform component was an upgrade, but this was really just routine maintenance, like swapping out a failed spindle. We didn't switch to SSD or cloud storage, and while we may have installed a bigger replacement, the only discernible difference is that the whole operation cost a lot of money and was only a little more inconvenient than it was avoidable. At the end of the day the total cost was much more than would have been required to replace the entire platform at the time. Had we only known.

But it does point to a circumstance all computer users know: systems are not immune to the second law of thermodynamics and move constantly towards increased entropy. Systems rot. Platforms rot. As these platforms rot, otherwise perfectly well-behaved applications become slow, produce errors and ultimately fail, often not due to any failure in the app itself but because the underlying platform is failing.

There are a couple of common ways to address platform rot. The first is to weed out the failing components and replace them with new components that are compatible with the remainder of the pre-existing platform--there is a market for EIDE drives, albeit on eBay. This approach retains whatever perceived value exists in the platform but locks the overall system into a constant decline towards obsolescence. This is why an entire platform evaluation is often done when an individual platform component fails. It has so often proven to be a false economy to scrimp by spending on a single component when the incremental cost of a platform replacement is dwarfed by the generational increases in capabilities and performance a new platform offers. In other cases the root cause of a specific component failure lies with another defective component--if you must keep replacing the fuse, then the problem isn't the fuse. Consequently, the second approach is to replace the underlying platform while retaining the existing applications, which are then upgraded only as user-facing situations warrant.

We have already had one sign of platform rot and responded in a very limited fashion by removing the defective component. Now we have a new and more disturbing sign of rot. Two of our key apps attempted to access data regarding the genesis of Project Renaissance and unrelated budget item details. Both sets of data should have been readily available, but platform errors indicated these were missing, with no indication that they had ever been stored or reliably backed up. As these are unrelated data items, this is indicative of systemic failure. It is also clear that this platform does not support ECC memory, and it appears parity checks have been disabled. What is not clear is how long our key apps have been operating with incomplete and inaccurate data. However, there is no doubt that as this situation progresses the platform will continue to degrade, ultimately resulting in errors of such consequence that the previous platform failure will pale in comparison.

Given the current state of affairs, we are not only not at Dunwoody 2.0, we don't even have the platform in place to get there.