I have been a software developer for over 30 years now.  That includes time as a kid playing with technology, time as a student, and professional and non-professional work.  Over that time I have encountered a lot of bad designs, and some good designs, in systems and software.  I also have, like every person on the engineering or operations side of technology companies, lost a lot of sleep from time to time, due to midnight emergencies or uncertainty about whether the system would hold up during peak usage.

I’ve distilled my overall experience into a list of what I consider the fundamental requirements of a system (enterprise or, more simply, multi-server) to:

  1. Ensure it is survivable (more than just the original developer can make it run)
  2. Keep its ability to evolve over time
  3. Minimize the amount of technical debt created by its design
  4. Minimize the amount of disjointed interaction, and maximize the discovery and effectiveness of community effort.

I am focusing here on the big picture: the system as a whole.  Best practices for code style, library use, and so on are only implementations of these suggestions, and those are up to you as a business to decide.  These requirements are all agnostic to any particular platform or technology.

#1: Keep configurations, to the fullest extent possible, out of local .config/.xml/.ini/.cfg files, registries, and other application/service/server-specific locations.  Centralize them, with a good user interface that configuration specialists can use to review and adjust everything.

This avoids the nightmare of relocating hardware or services and then discovering, too late, which configurations needed to change because of the migration.  There is nothing worse than a series of inconsistent tools, services, and applications written by several different developers over time, each with its own philosophy or practice for how to store configuration for various items.

Enforce this requirement: it will give your system a chance at longevity, preserve its scalability, and drastically reduce technical debt.
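
To make this concrete, here is a minimal sketch of pulling configuration from a central service at startup rather than reading a local file.  The endpoint URL, the JSON layout, and the service name "billing-exporter" are all hypothetical placeholders; substitute whatever your central configuration store actually exposes.

```python
# A minimal sketch of reading configuration from a central service instead of
# local .ini/.xml files. The endpoint, schema, and service name are assumed.
import json
import urllib.request

CONFIG_SERVICE = "http://config.internal.example/api/v1/config"  # assumed central endpoint

def load_config(service_name: str, environment: str) -> dict:
    """Fetch the full configuration block for one service in one environment."""
    url = f"{CONFIG_SERVICE}/{environment}/{service_name}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read())

if __name__ == "__main__":
    config = load_config("billing-exporter", "production")
    # Every setting lives in one reviewable place, not in a file on this box.
    db_host = config["database"]["host"]
    print(f"Connecting to {db_host} ...")
```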

#2: Log what your application is doing, not only at the level of detail set in a configuration, but to the RIGHT location.  The RIGHT location is one with broad, centralized access and search and filtering capabilities.

This avoids the nightmare of not being able to properly troubleshoot a problem, because the details of an error or step in a process are not recorded.  Personally, I am a strong advocate of chatty log entries.  In my experience, deep-detail logging is useful for the most recent 96 to 120 hours of operation.  After that time, the log entries can be pruned down to a summary level for process accounting.

By keeping the details for 96 to 120 hours, someone can return from a long weekend and still have enough information to troubleshoot a problem that occurred while they were away.  The most important details to log are information about the specific step being performed, error trace and stack info, and breadcrumb information (specific file names, URIs, or other resources).  Logging should also provide some information about the server itself (at a minimum, its system name), so the reader knows where the process ran.
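
As one possible shape for such entries, here is a minimal sketch of a structured log record carrying those fields: the step being performed, breadcrumbs such as file names and URIs, the error trace when something fails, and the server name.  The component name and the idea that a shipper forwards this JSON to the central location are assumptions, not prescriptions.

```python
# A minimal sketch of a structured log entry with the fields discussed above.
import json
import logging
import socket
import traceback

logger = logging.getLogger("order-importer")  # hypothetical component name
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(step: str, level: int = logging.INFO, **breadcrumbs) -> None:
    """Emit one self-describing entry; a log shipper would forward this JSON."""
    entry = {
        "server": socket.gethostname(),  # so the reader knows where it ran
        "component": logger.name,
        "step": step,
        **breadcrumbs,
    }
    logger.log(level, json.dumps(entry))

try:
    log_step("parse-feed", file_name="orders_2024-05-01.csv",
             uri="sftp://partner.example/outbound/")
    raise ValueError("malformed row 42")  # simulate a failure mid-step
except ValueError:
    log_step("parse-feed", level=logging.ERROR,
             error=traceback.format_exc())  # full stack trace for later triage
```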

If the logging system is well-defined and well-exposed, it becomes an excellent foundation for other systems that will be of great value: metrics and alerting.

#3: Have a way to uniquely identify your application or component, within the collection of applications and services on a server, and also within a sea of servers in an infrastructure.

This applies partially to logging and partially to configuration.  It is also intended to apply to contracts, permissions, service enabling, and other concerns on the business side of things.  Unix and Linux engineers love deep paths or dotted identifiers.  While it does not have to be that exact technique, you get the idea.
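
As an illustration only, here is a minimal sketch of a dotted identifier that stays unique within one server and across a fleet.  The segments (company, domain, service) and the host/PID suffix are an assumed convention, not a standard.

```python
# A minimal sketch of a dotted component identifier; the naming scheme is assumed.
import os
import socket

def component_id(domain: str, service: str) -> str:
    """Build an identifier such as acme.billing.invoice-export@web-03:2741."""
    return f"acme.{domain}.{service}@{socket.gethostname()}:{os.getpid()}"

if __name__ == "__main__":
    # The same string can tag log entries, configuration records, and contracts.
    print(component_id("billing", "invoice-export"))
```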

#4: Have a place to validate a business relationship for a customer consuming the process.

As software engineers and system designers, one of the critical considerations is the ability to let business administration control what is available to a client.  If accounting has not been able to get a client to pay, how is the service for that client disabled?  If the system does not have an inherent ability to check this, separate from the configuration it uses to operate the system itself, then some form of “workaround” develops that amounts to technically disabling the service rather than denying access to it.

It is not uncommon to see a service for a client disabled (as part of a bulk action) so that technical maintenance can be done, and then mistakenly re-enabled for that client even though there was a business reason to keep it off.  If there are no separate switches for a technical disable versus a business disable, it’s easy to cross the streams and get confused.
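
Here is a minimal sketch of keeping the two switches separate, so a bulk technical re-enable cannot silently override a business hold.  The storage behind these flags (a database, the central configuration service) is assumed and not shown.

```python
# A minimal sketch of separate business and technical enable switches.
from dataclasses import dataclass

@dataclass
class ClientServiceState:
    business_enabled: bool   # set by accounting / account management
    technical_enabled: bool  # set by operations for maintenance windows

    def is_available(self) -> bool:
        # Both switches must be on; flipping one never touches the other.
        return self.business_enabled and self.technical_enabled

# Maintenance ends: operations re-enables the technical switch in bulk,
# but a client on a billing hold stays off because its business switch is off.
client = ClientServiceState(business_enabled=False, technical_enabled=False)
client.technical_enabled = True
print(client.is_available())  # False -- the billing hold still applies
```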

#5: Implement Flight Tracking

While logging tells you what a process has done (right or wrong), it will not tell you whether a process ran when it should not have, did not run when it should have, or … just stopped processing.  While the latter (stopped processing) can be determined from a log, someone has to go looking to discover it.

Flight Tracking is a concept borrowed from the aviation industry.  When a pilot plans to fly his aircraft from point A to point B, he files a plan with his intended departure time, his intended route, and his destination.  The pilot can cancel the flight plan before departure, or he can make changes as needed.  But the flight plan’s purpose is to know that the pilot and his plane are where he said they would be, and to react if the aircraft is overdue and out of communication for a period of time, or did not even depart as scheduled.

This is a good practice in an enterprise system.  A specific process should report its launch to a central location, send periodic updates that it is still running and processing, and ultimately report its completion.  With that in place, a monitoring layer can compare the flight status against the flight plan and report processes that have not provided updates (hung or crashed), or that did not launch as scheduled.  This is an important feature in a system with defined SLAs.
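
As a rough illustration, here is a minimal sketch of a process filing its flight plan at launch, sending heartbeats while it works, and reporting completion.  The central tracker here is a stand-in that just prints; a real implementation would post to whatever service your monitoring layer watches.

```python
# A minimal sketch of "flight tracking" from inside the application code.
import time
import uuid

def report(flight_id: str, status: str) -> None:
    # Stand-in for a call to the central flight tracker (e.g., an HTTP POST).
    print(f"[tracker] flight={flight_id} status={status} at={time.time():.0f}")

def nightly_export(batches: int) -> None:
    flight_id = str(uuid.uuid4())
    report(flight_id, "departed")    # the "flight plan" is now active
    for batch in range(batches):
        time.sleep(1)                # real work happens here
        report(flight_id, f"enroute batch={batch + 1}/{batches}")  # heartbeat
    report(flight_id, "landed")      # monitoring stops expecting updates

if __name__ == "__main__":
    nightly_export(batches=3)
```

The monitoring side then alerts on flights that departed but have gone quiet past their expected heartbeat interval, or on plans that never departed at all.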

While there are a number of passive monitors available (Nagios, etc.), there are times when the passive monitor will report the application or service as running while the app/service is actually doing nothing.  By writing active flight-status reporting into the application code itself, the confidence level is higher.  Think of it as an aircraft on autopilot.  Even if the pilot has passed out at the controls, the plane will look fine on radar for a while (passive monitoring).  Only direct communication from the pilot via the radio will ensure confidence that the flight is going as intended.

* * * * *

There are a slew of other issues that need to be addressed in design, but these items are the core of protecting your sleep (and sometimes, even your sanity).  These five core principles all establish a standard, broad-based view of a system that keeps everyday operation as simple as possible, and keeps the developer focused on developing.