Thursday, October 30, 2008

The Exception-Tolerant Organization - Part 2

As explained in my introductory post on the Exception-Tolerant Organization, ETOs build the communication pathways, business processes, and technology features to handle exceptions systematically. Where the business processes and technology features are largely matters of construction, the essential communication pathways can often be the component most elusive to an organization.

To relate this to a recent business event: an organization was forming a new business venture, bringing mature and existing technologies and processes in house from one of their partners. In the suite of this venture's features, there was one particularly public-facing feature that was not even close to being on par with the rest. Very early on in the business venture, one principal surmised that this feature had the strong potential to sully the reputation of the organization, and told the other principals of his analysis and recommendations for a course of action. A few principals were sorely disappointed that this was brought to their attention so early in the business venture, as they felt that this "negative" analysis was not what was needed during the venture's "honeymoon" period.

About one year later, the other principals began to see the strongly negative customer reactions to this public-facing feature. When they approached the principal who originally gave his analysis, they said to him "You didn't tell me it was THAT bad." In this case, the communication pathways were cut off early in the venture, perhaps when they were needed the most. And when they were opened again a year later, it was done in a reactive fashion that does not lend itself towards a systematic way to handle exceptions.

So how can you create these communication pathways? You can begin an email thread, as the principal above did, but these threads either are ignored early, or become so long that the interest drops off quickly. You can appoint a single point of contact to receive and dispatch the exceptions, but that person ends up being a single choke point more often than not. Or you can set up a system purely to capture issues and exceptions for review, which relies on people willing to take the few necessary minutes to input their issues into the system. This system can provide an effective entry point for exceptions and issues, if your organization is culturally adaptable to putting an entry, approval, and management process in place. The organization above did go and implement such a system.

Whether this type of system is a cultural fit for your organization or not, a regularly-scheduled gathering of the principals solely for the purpose of reviewing exceptions would go a long way towards getting the effort off the ground. Your organization may need to lighten the "status meeting" load somewhat to make room for this type of gathering. But because of the regular schedule, you are that much closer to systematically handling your exceptions.

As stated above, the business processes and technology features are largely matters of construction. But where business processes are concerned, many organizations focus on constructing processes for project lifecycles, development lifecycles, change control, and administration. These can be helpful in measuring performance when they are not over-engineered, but they are all intended to handle the "normal" course of business. This still leaves out the what-just-happened, where-do-we-go, and what-do-we-do when an exception does occur.

As a starting point, take the lead from technology support teams. Great technology support teams have an initial point of contact, an exception entry and tracking system, and "run books" that contain procedures written in detailed step-wise fashion documenting EXACTLY what to do when a particular exception occurs. When an exception occurs and is identified, the support team executes the procedures step for step. If there is an exception that they cannot identify, they still have a procedure for this case that may involve contacting someone with more knowledge of the systems.

Handling exceptions in this way leaves little guesswork as to how to initially react. For many exceptions, recovery is a matter of reacting safely and sanely first, and then following the steps. So for your organization's exception-handling process, write down in step-wise fashion what people should do when there is an emergency exception, an impactful exception, and a minor exception. Every detail, including which people/departments should be contacted, the appropriate time frames to wait for responses, and any system or documentation entries, should be noted. As a bonus, when you are out of the office and someone is covering for you, they can cover for you as effectively as possible during times of exception by following the steps.
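As a rough illustration of how such a step-wise procedure might look once it is written down in a system, here is a minimal Python sketch. The three severity levels come from the paragraph above; the step wording, contacts, and wait times are hypothetical placeholders, not a recommended procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str        # what to do
    contact: str       # person/department to notify (hypothetical names)
    wait_minutes: int  # how long to wait for a response before the next step


# Hypothetical run book: one ordered list of steps per severity level.
RUN_BOOK = {
    "emergency": [
        Step("Log the exception in the tracking system", "help desk", 0),
        Step("Phone the on-call principal", "operations", 15),
        Step("Escalate to the full principal group", "principals", 30),
    ],
    "impactful": [
        Step("Log the exception in the tracking system", "help desk", 0),
        Step("Email the owning department", "owning department", 120),
    ],
    "minor": [
        Step("Log the exception in the tracking system", "help desk", 0),
    ],
}

# Exceptions that cannot be identified still get a procedure of their own.
UNIDENTIFIED = [
    Step("Log the exception in the tracking system", "help desk", 0),
    Step("Contact someone with deeper knowledge of the systems", "system owner", 30),
]


def procedure_for(severity: str) -> list[Step]:
    """Return the ordered steps for a severity, or the fallback procedure."""
    return RUN_BOOK.get(severity, UNIDENTIFIED)


def checklist(severity: str) -> list[str]:
    """Format the procedure as numbered lines that anyone covering can follow."""
    return [
        f"{number}. {step.action} "
        f"(contact: {step.contact}; wait up to {step.wait_minutes} min)"
        for number, step in enumerate(procedure_for(severity), start=1)
    ]


for line in checklist("emergency"):
    print(line)
```

Keeping the steps as plain data rather than logic means the person covering for you only needs to read the checklist, and the principals can revise it without touching any code.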

For those of you who dislike process and procedure, keep in mind that the exception procedures exist to assist you, not to burden you. Your brain-power is most needed for problem-solving during the exception period - not remembering to make entries in systems A, B, and C; and worse, not remembering to notify people critical to your business. Another point of assistance is rendered when you can collect data from your exception-handling system and analyze the types and frequencies of the exceptions that occur in your organization. This may lead to some business insights your organization may otherwise not have reached.
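The kind of analysis mentioned above can start very small. The sketch below assumes, hypothetically, that the exception-handling system can export each exception as a (type, severity) pair; the sample entries are invented purely for illustration.

```python
from collections import Counter

# Hypothetical export from an exception-tracking system: (type, severity).
exception_log = [
    ("missing price", "impactful"),
    ("late data feed", "minor"),
    ("missing price", "impactful"),
    ("system outage", "emergency"),
    ("missing price", "minor"),
]


def counts_by_type(log):
    """How often each type of exception occurred."""
    return Counter(kind for kind, _severity in log)


def counts_by_severity(log):
    """How often each severity level occurred."""
    return Counter(severity for _kind, severity in log)


by_type = counts_by_type(exception_log)
for kind, count in by_type.most_common():
    print(f"{kind}: {count}")
```

Even a tally like this can show which exception types recur often enough to deserve a permanent fix rather than another pass through the run book.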

Technology features for handling exceptions will be covered in a later post, but it is sufficient to say for now that your technology features that handle exceptions should interface and operate just like the technology features that run the normal business.



Tuesday, October 28, 2008

2008 Fairfield/Westchester .NET Code Camp

I will be presenting at the 2008 Fairfield/Westchester .NET Code Camp on Saturday, November 8, 2008. The Code Camp is an all-day conference that gives professional developers and students the opportunity to hear from experts in the field on a variety of topics, from programming language tools to Web 2.0 development to the latest and greatest in Microsoft technologies. Several Microsoft MVPs, Evangelists, and some of my past colleagues will be presenting. The camp will be held at UCONN Stamford Campus in Stamford, CT, and you can get all the details here.

As part of the Web 2.0/Agile track, I will be leading a live interactive Test-Driven Development session, which will allow you to observe a test-driven approach to solving a real-world problem. I hope to see you there!

Friday, October 17, 2008

Does Our Technology Equate To Lies?

I recently had a conversation with a CTO with experience heading large and global teams. His recent work had concentrated on installing SOA-based systems into his organizations. He brought a viewpoint of his to my attention that related custom software development to a lie. The lie that you tell yourself, he said, was that a custom or one-off module or block of code solves your business problem the way you want it to. You're lying to yourself because the problem wasn't solved in a configurable, maintainable, and service-oriented way.

He went on to support this statement by relating a story about lining up the dates in a fiscal calendar of a system that did not lend itself easily to change. A custom solution working around the constraints of an existing system was needed to meet the demands of the business, and he was not quite satisfied that the solution had to be of a custom nature. In his view, he preferred the solution to be service-oriented and configurable.

While I understand the CTO's view towards a service-oriented and configurable architecture, I don't think that his lie is attached to the appropriate concepts. Service-oriented or not, you can effectively eliminate the lie he speaks of by managing the lifecycle of the customized solutions.

If your business is sufficiently pained by the problem where you need a customized solution implemented and deployed urgently, and the solution has been appropriately constructed and tested, then by all means deploy it. But if the custom solution is not optimal, configurable, or maintainable, then put an expiration date on it and schedule the time and resources to implement the optimal solution by the expiration date. This way you would provide urgently-needed relief of the pain your business is experiencing, while having an outlook toward the optimal future.

But it is crucial that the follow-up happens. The common occurrence, and the real fear here, is that once a solution is implemented and deployed, people move on to solving the next problem or working on the next great project without coming back to readdress that sub-optimal solution. This custom development often ends up being yet another buried non-catalogued nugget of logic.

In the CTO's anecdote above, the custom solution is necessary to work around the larger system's constraints. But what if this custom development was configurable, maintainable, and service-oriented from the start? It is still a separate custom component, one more component to be maintained in the catalog of assets. In maintaining a service-oriented architecture, having an up-to-date and discoverable catalog of components, the contracts they satisfy, and their development artifacts is key to performing effective maintenance. SOA implementations typically have more components than non-service solutions, not fewer. SOA implementations without these catalogs can quickly become burdensome and error-prone to maintain.
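To make the two ideas above a little more concrete (an expiration date on stop-gap solutions, and a discoverable catalog of components and contracts), here is a minimal Python sketch. The entry fields, component names, owners, and dates are all hypothetical; a real catalog would likely live in a registry or database rather than a list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    contract: str                      # the contract the component satisfies
    owner: str
    expires_on: Optional[date] = None  # set only for sub-optimal, stop-gap solutions


def overdue(catalog, today):
    """Stop-gap entries whose expiration date has already passed."""
    return [e for e in catalog
            if e.expires_on is not None and e.expires_on < today]


def due_within(catalog, today, days):
    """Stop-gap entries expiring within the next `days` days (today included)."""
    return [e for e in catalog
            if e.expires_on is not None
            and 0 <= (e.expires_on - today).days <= days]


# Hypothetical catalog contents, for illustration only.
catalog = [
    CatalogEntry("fiscal-calendar-workaround", "FiscalCalendar v1",
                 "finance IT", date(2009, 3, 31)),
    CatalogEntry("pricing-service", "Pricing v2", "middle office IT"),
    CatalogEntry("legacy-report-export", "ReportExport v1",
                 "reporting", date(2008, 9, 30)),
]

today = date(2008, 10, 17)
for entry in overdue(catalog, today):
    print(f"OVERDUE: {entry.name} (owner: {entry.owner})")
for entry in due_within(catalog, today, days=180):
    print(f"Due soon: {entry.name} expires {entry.expires_on}")
```

Running a report like this on a schedule is one way to keep the follow-up from depending on someone remembering that the stop-gap exists.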

So I think that the real ways organizations lie about custom software development are:

- not assigning an expiration date to sub-optimal solutions, and not managing the transition process from sub-optimal to optimal

- not maintaining an effective catalog of solutions, contracts, and components

- expecting a silver bullet or all-inclusive solution to eliminate the need for custom development

- not actually solving the organization's problems because no custom (or market-leading) solution is considered to be the optimal solution for the business

Unlike the CTO's organization, I've seen organizations with very painful problems put off implementing solutions for years because the solution candidates don't fit into the "optimal" category, service-oriented or not. If the solutions are not considered to be the holy grail, then nothing will be implemented at all. And the business continues to limp along in pain without further technology assistance.

Wrapping up my conversation with the CTO, he was considering implementing one of the larger off-the-shelf tools that can effectively assist in service orientation. He left me with the impression that these solutions, their price tag, and their large deployment footprint offered him great comfort in removing the custom development lies from his organization. But alas, that is a lie for another day.

Sunday, October 12, 2008

Invited To An Idea

Once upon a time a company had an idea - an idea whose direction was generally contrary to the overall company's market. This company made their idea wildly successful, as the great sages of the company built systems and processes to effectively implement the company's idea. But the great sages became trapped by their systems, because they required so much of their hands-on manual effort and control to execute successfully.

Enter our hero. Early in our hero's career, a few great sages invited our hero to become part of this company's idea. They did this by inviting our hero into their office and explaining the reports they needed to systematize and automate their manual systems. The great sages also explained that having these reports in place would free them to continue implementing the company's idea.

The great sages took their time with our hero, explaining why the reports were important and badly needed, and how the reports fit into supporting the company's idea. They always took time to answer our hero's questions and concerns about the reports.

These reports solved certain problems with supporting the company's idea, and freed the great sages to pursue supporting the company's idea further. This led to building robust systems, and led to freedom for many to further implement the company's idea. Over time our hero became a great sage at the company, one who would be free to enlist others and further implement the company's idea.

But after the years of work successfully implementing the company's idea, our hero was still not free.

There were certain functions in the systems that required only our hero. Our hero had willingly executed these functions because our hero believed strongly in the company's idea. And when there were problems with the systems requiring corrective action, our hero was there to the rescue nearly every single time.

Over the years our hero grew to be responsible for executing more than double the number of functions originally intended. The company's great sages became used to the excellent service of our hero, and cheered our hero on all the way. Our hero had been invited to become part of the company's idea, but instead our hero had unwittingly been self-installed as a key cog in the systems implementing the idea. Under this arrangement, our hero could not be free.

When our hero finally realized the situation, our hero took corrective action. Our hero systematized and automated more extensively and faster than ever before - freeing our hero, and many others, from having to execute the functions that our hero alone used to perform.

But by this time, the company hit some turbulent times, lost sight of its idea, and lost one-third of its people. The company was no longer interested in inviting people to become part of its idea. But people were still needed to handle exceptions and support the company's idea using the systems the great sages and our hero had put in place. And without these people, our hero could not be free.

Our hero realized that at this point the only way to achieve true freedom to pursue the company's idea was by freeing himself from the company. And so our hero, exhausted but thoroughly grateful for the experience, moved on from the company.

Has your organization invited you to become part of its idea? Have you become stuck in the execution of your organization's systems like our hero was? Can you see a way to systematize and automate to free yourself, before moving on becomes the only path to freedom?

Monday, October 6, 2008

The Exception-Tolerant Organization – Part 1

As explained in my introductory post on the Exception-Tolerant Organization (ETO), ETOs can embrace change and uncertainty while systematically executing their business. There are two keys to making this practical:

- Being able to systematically execute the business

- Having an entry point in each business process for welcoming the change and uncertainty once an exception event occurs

First, ETOs are able to systematically execute their business. There is a system for each business process that the ETO’s people follow to execute, manage, and report on their business. This is not as complicated as it sounds, as there are systems everywhere in business: accounts payable, software development, computer machine and image preparation, accounting. And when the ETO’s people understand that the systems exist to support the major ideas and goals of the organization, no system is considered too mundane to be improved, and none is ignored, allowed to decay, or allowed to bloat in size. There are many books and references on the web related to understanding the importance of systems and business processes, so I will not expand further here.

If your organization is not systematically executing your business, but rather executing in an ad-hoc and undisciplined fashion, it can be difficult to embrace any changes or external events. Your organization is already dealing with so much noise and individual solutions to regular business issues that it will not be able to differentiate an exception event from a regular business event. Note that this can be an advantage when looking to create a system, in that your best ad-hoc process may work just as well for handling an exception event as it does for conducting your regular business. If you find your organization in this situation, use your best ad-hoc process as a starting point for implementing a system that can be executed repeatedly without fail.

Second, each one of an ETO’s systems and business processes has at least one entry point for addressing an exception or a change. Organizations need a way for someone to bring an event warranting change or representing uncertainty to the business’ attention. As an example, Toyota production workers are able to stop the line when they see a problem during the production process. Stopping the line is their entry point.

Entry points for processes that must be executed daily and on time can be more difficult to see with the naked eye. As an example, consider an investment portfolio that must be valued on a daily basis: one or two issues with the price of an investment can make the portfolio's reported value wildly inaccurate and grossly misleading to a fund manager's investors. The system of valuing the portfolio must have an entry point where exceptions with prices can be raised, diagnosed, and handled. The identifying and handling of these exceptions is a systematic process itself, often assisted by robust technology. The entry point can be an automated review of the portfolio, followed by an exception reporting tool with a pricing exception report as a backup.
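As a sketch of what an automated review of the portfolio could look like, the Python below flags prices that deserve a human look before the portfolio is valued. The checks and the 20% move threshold are assumptions chosen for illustration, as are the position fields and sample data; a real pricing exception report would use rules agreed with the business.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Position:
    instrument: str
    quantity: float
    previous_price: float
    price: Optional[float]  # None when no price arrived today


def pricing_exceptions(positions, max_move=0.20):
    """Return (instrument, reason) pairs for prices that need review.

    max_move is a hypothetical tolerance: the largest day-over-day
    change (as a fraction) accepted without review.
    """
    exceptions = []
    for p in positions:
        if p.price is None:
            exceptions.append((p.instrument, "missing price"))
        elif p.price <= 0:
            exceptions.append((p.instrument, "non-positive price"))
        elif (p.previous_price > 0
              and abs(p.price - p.previous_price) / p.previous_price > max_move):
            exceptions.append((p.instrument, "large move versus previous price"))
    return exceptions


# Invented sample data, for illustration only.
portfolio = [
    Position("ABC equity", 100, 50.0, 50.5),   # ordinary 1% move
    Position("XYZ swap", 10, 20.0, None),      # no price received
    Position("DEF bond", 5, 100.0, 1000.0),    # possible misplaced decimal
]

for instrument, reason in pricing_exceptions(portfolio):
    print(f"{instrument}: {reason}")
```

The report is the entry point: anything it lists is raised for diagnosis and handling, and everything else flows through the normal daily valuation.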

An entry point for an organization experiencing a problem with internally-developed software is a help desk department, which has rules and procedures around when it is available to take calls, and its expected turnaround time when responding to issues and exceptions. In other words, it is a system. Other organizations may have a developer, system administrator, DBA, infrastructure engineer, or manager as the entry point for problems. All of these technologists have different schedules and structures to their day, and can serve as an effective entry point - when they are available. Business principals may even feel more comfortable going to them directly, as they feel that the technologists are closer to the solutions than the help desk professionals.

This direct route is sometimes more effective, but it is less systematic. Technologists are often pulled away from their scheduled and time-sensitive work to handle exceptions, particularly after-hours and overnight. This is a situation that calls for a system, so that effectiveness can be assured every single time the entry point is used. Being exception-tolerant allows us to handle these exceptions without negatively impacting regular business. This will be continued in Part 2.

Thursday, October 2, 2008

Leadership During Downtime

When I browsed to LinkedIn early this morning, this is what showed up on a web page entitled "Oops!":

Sorry, we can't display this page right now.

Something unexpected has gone wrong. Please wait a few seconds and try again by hitting the reload button.

We apologize for the inconvenience. An error report has been filed and our team is working on fixing the problem.

If you have any questions, please email us at customer_service@linkedin.com.


For many developers and business users of web applications, this is an all-too-familiar sight when a web site is experiencing problems. But does it need to be?

LinkedIn is no doubt a leader in on-line networking and community. But while there is a link to contact customer service via email, the web page is pretty much out of character with the rest of the web site: no ads, no links to its user community's services and sites, no information about LinkedIn itself - in other words, nothing useful. We might as well have received the standard error page from the browser.

In a period of downtime, LinkedIn is missing out on an opportunity to continue leading the way as a premier networking and community portal. Just a few links and paragraphs of text can make all the difference, so that during downtime LinkedIn would never be completely offline.

Wednesday, October 1, 2008

A Breakdown In System Testing

The latest release of the middle-office system has gone live, but somehow the pricing analysis screen, the most important screen in the entire system, will not accept prices entered for equity swap positions. It is at the end of the trading day, and the traders are getting heatedly upset.

Ask the developers, who are newcomers to developing for this system, if they tested the screen, and they say "Yes, but we didn't modify anything on that screen, so we didn't test it fully." Ask the business analyst if he tested the system, and he says "Yes, but I didn't test that screen because the developers said they didn't modify anything on that screen, so I just entered a few prices as a litmus test and that was it." Ask the development manager what went wrong, and he says "Didn't anyone bother to test the system? If I go into the system and try to enter this price, it's clear to me that it doesn't work! Are you all blind?"

Did the developers modify the screen directly? No. Is the business analyst lazy? No. Is everyone in the development manager's group blind? No. But is this breakdown of testing a common occurrence in software development? Yes. (A change to a pricing validator component shared by multiple components of the system, including the pricing analysis screen, was the cause of this situation.) And is everyone feeling sore about it? Absolutely.

So what is to be done? In this particular environment there is no systematic procedure for testing. The development group is weeks or months away from being fully up-and-running with automated development and testing practices and facilities, if they can carve out the time away from their regular responsibilities. Other than the business analyst, there is no budget for a dedicated QA/testing group. Yet something must be put in place quickly so that a system with shared yet inter-connected components can be reliably tested without negatively affecting the business users upon release.

This is the perfect time for this group to begin a testing practice built upon a foundation of acceptance testing. Acceptance testing proves the real-world conditions that every feature and function of the system must satisfy correctly and repeatedly.

Acceptance tests represent the common point of understanding and agreement between the business users and the technologists responsible for a system. If the system can satisfy all these tests consistently, every time the tests are run, then any changes to components can be readily verified as not having a detrimental effect on the system. And since acceptance tests are satisfying real-world conditions, the business users are given some measure of the system's reliability before the system is released. *

It does take a bit of work to get started and enumerate the acceptance test cases. It also takes some work to get both the business users and IT developers and analysts to buy into the process and realize the benefits. But ask the traders above and their staff whether they would rather be frustrated by a malfunctioning system. And building the foundation takes less time than you may think. You can start with one simple spreadsheet listing the test cases; the point is to start somewhere. The journey of 1000 miles begins with a single step.

Once you have this foundation, then your developers can branch out into Test-Driven-Development and other testing practices, and your business users can be more self-sufficient in setting up test cases. There are even open-source tools that can translate Excel files and "natural language" test cases into actionable code (FIT, for example). Imagine the situation where your business users can keep up with changing business conditions by submitting test cases in Excel on their own, without having to know XML or a cryptic language. To receive the most up-to-date feedback, automate the tests so that their execution is a convenience to be enjoyed by all, not a burdensome task to be carried out by a lone savior/scapegoat. You may even find your organization creating defect-free releases before too long.
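To show how little is needed to get from a spreadsheet of test cases to something automated, here is a minimal Python sketch. It is not FIT, just plain Python reading rows as they might be saved from Excel in CSV form. The validate_price function is a hypothetical stand-in for the shared pricing validator in the story, and the column names and cases are invented for illustration.

```python
import csv
import io

# Hypothetical stand-in for the shared pricing validator component.
SUPPORTED_TYPES = {"equity", "bond", "equity swap"}


def validate_price(instrument_type: str, price_text: str) -> bool:
    """Accept a price only for a supported type and a positive number."""
    if instrument_type not in SUPPORTED_TYPES:
        return False
    try:
        return float(price_text) > 0
    except ValueError:
        return False


# Test cases as business users might keep them in a spreadsheet (CSV export).
CASES_CSV = """instrument_type,price,expected
equity,101.25,accept
equity swap,12.5,accept
bond,-3,reject
equity,abc,reject
"""


def run_acceptance_tests(csv_text):
    """Run every row and return a list of (row, passed) results."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        accepted = validate_price(row["instrument_type"], row["price"])
        passed = accepted == (row["expected"] == "accept")
        results.append((row, passed))
    return results


results = run_acceptance_tests(CASES_CSV)
failures = [row for row, passed in results if not passed]
print(f"{len(results) - len(failures)} of {len(results)} acceptance tests passed")
for row in failures:
    print("FAILED:", row)
```

Had a row for equity swap prices existed before the release described above, a change to the shared validator that rejected those prices would have shown up as a failed row rather than as a surprise at the end of the trading day.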

And the next time the middle-office system is tested, the only things taking shots are the components, business processes, and assumptions made about the features and functions - but NOT the people developing, testing, or using the system.


*BONUS FEATURE: Acceptance tests provide some very valuable documentation of the functions and features of the system. The value of documentation will be addressed in a later post.