For over two months now, we've been preparing to migrate our services to a new datacenter. This has been, without a doubt, the most important project in the history of our company, for a number of reasons. We sell, host, and support websites for a single vertical market: a few thousand sites, all operating off a single shared framework/codebase. This is up from the few hundred sites we were hosting when I joined the company almost three years back. Our sites have been a testament to the power of the ColdFusion server platform, and I thought it important to share with you all the significance of our move, the challenges we overcame, and the benefits we've gained.
I want to give you a little background here, so that you'll have an understanding of the history and growth of our application, and how critical it is to think in terms of scale. In talking with several developers while at WebManiacs, I was surprised to find out just how many had come across similar issues at one time or another.
Humble Beginnings
Eight years ago, our company was started by a few sales guys. They had an idea, ran with it, and got some great results. Being tied to a single vertical market, they needed a way to produce sites that were highly customized, with a few key applications, but had a consistent underlying architecture. They contracted a company to build them a framework, a system of interfaces where these sales guys could enter some details and the framework would generate the 'bones' of a site for a client, something that could then be customized with graphics and data, all input through web-based interfaces. What they built was basically a market-specific Content Management System, complete with custom theme uploaders, navigation and page editors, a user management system, the works. For its time (circa ColdFusion 4.0/4.5 with an MS Excel spreadsheet for a database) the system was great. The company didn't need to hire any HTML coders, just a few graphics/Flash guys and data entry people to help customize sites.
As time went on, the company needed to add new applications and features to the system. The contractors that had built the original framework had long since been let go. One of the Flash designers was ambitious enough to pick up a book and teach himself ColdFusion. After a while it was obvious that Access (they had already upgraded from Excel by this point) wasn't going to sustain their growth, so the same designer picked up a book on MS SQL Server and began that transition. Growth was quick, and staffing low, so new applications and features were written quickly, cobbled together with the existing framework, tested to 'work', and put to market. Systems were added, and finally moved out of a hosting facility and into a datacenter downtown. The server platforms were upgraded to ColdFusion 6.1. Multi-talented developers were added to our development staff, writing Flash-based applications using Flash Remoting with the relatively new ColdFusion Components. The system grew.
The Issues At Hand
After several years of "throwing the mud against the wall and praying it will stick," the initial framework was nearly nonexistent and the codebase had expanded to several thousand templates of double conditionals and unscoped variables, with includes nested ten levels deep and SELECT * everywhere. Four or five different ColdFusion developers, each with varying degrees of skill (or lack thereof, depending on the individual), had contributed to the codebase, and only sub-applications written in the last three to four years had a modular design, well-written SQL statements, and properly scoped variables. Each domain created a separate application on the server, with its own set of APPLICATION or SESSION variables controlling application flow. As more clients came on, system resources would dwindle, but ColdFusion kept chugging away. Additional servers were added, and a load balancer to spread the wealth. The clients kept coming, and so did their users. A separate SQL Server was added just to track and report on user statistics. Growth was incredible.
Growth was also debilitating. A terrific challenge to overcome, from a business perspective, is growing as a business beyond your infrastructure's capacity to sustain it. As time progressed, it became obvious (quickly) that changes would be required to continue to support our growth. The rack was a mess of wires and outdated equipment, and the time had come for more up-to-date equipment that was prepared to scale. On top of this, due to a number of issues, we had ended up running our web platform on multiple versions of ColdFusion, and it needed to be standardized on a single version.
Planning a wholesale upgrade to systems architecture can be daunting. Not only are you evaluating the necessary components to remain running, you also have to pad for the needs of growth and scale. Months are spent identifying the best server resources for the task, designing the network infrastructure, and determining the technologies that will be placed in play. After this (when owned by a higher corporate entity) you also have to write justifications for your decisions, strong, irrefutable arguments that will get you what you need from those who don't truly know what your needs are. This process can take three times as long as the initial planning, with seemingly endless rounds of justification documentation, and technology can, and does, change during this process. That means that adjustments are made to your initial architecture plans, and testing of new technologies must be completed to ensure they are the right 'fit' for your company's needs.
Meanwhile, we kept getting new clients, a greater number of users, and quickly began to feel the strain on our application performance. ColdFusion's JVM utilization began to build throughout the day, as more and more traffic came in. The SQL server was working overtime to process the tons of requests, and page response time began to lag. Clients started calling, support tickets went up, and you have to deal with it. Now you start to seriously review your applications. With a few thousand templates of code, only a small percentage of which had been refactored in the past few years, it was time to really dive deep into internal storage mechanisms. Where are the bottlenecks? We moved dozens of variables from the SESSION scope into the APPLICATION scope. We moved variables from the APPLICATION scope into the SERVER scope. We var scoped every 'object' (read: CFC) within our codebase. There were nice gains, but traffic continued to increase. Then we spent time learning JVM tuning. I learned more about JRun in a nine-month period than I ever thought I would need to know. Growth can be debilitating.
The Big Switch
After months of planning, cajoling, mounting, and testing, our new systems were finally in place and ready to "go live." We were moving from a multi-version ColdFusion platform (6.1, 7, and 8) to ColdFusion 8.0.1 with all hotfixes. We were moving from SQL 2000 Standard to SQL 2005 Enterprise. We were moving from direct attached storage to a SAN. And we were moving into virtualization. Drastic changes in architecture, but built for us to succeed and scale. After days (and nights) of final load testing, the day was finally upon us. We flipped the switch, changing the DNS settings of our client sites to point to our new systems, and watched as traffic slowly switched from one facility to another.
Over two weeks have passed. There were minor initial hurdles, typically related to changes in network topology or file system permissions, but overall everything has gone very smoothly (knock on wood). One of the greatest indicators to date has been the JVM performance. Our code barely changed between facilities, with only minor changes to accommodate changes in network topology, but JVM performance has improved by large numbers. In fact, the overall performance gains have been pretty outstanding. Currently we can only attribute this to better equipment, and the switch to SQL 2005 Enterprise (which is smokin' fast!). Don't ever let anyone tell you that your application's performance isn't tied to your database tier. Now we're stepping out of 'firefighter' mode, and back to the task of product improvement and enhancement and new application development. It will be good to work in CFEclipse again, writing code, rather than staring at FusionReactor's AIR Dashboard (what a godsend product).
The Next Phase
Now that we sit on a standardized platform, with servers that perform well and less time spent monitoring server health, it's time to think about application architecture. To support a more agile development model, changes must be made to core application architecture, to streamline, simplify, and align with best-practice standards. The first phase of this will be to document our current architecture, one piece at a time, to truly understand all aspects of the related business logic and the end goals of each and every piece of our applications. This will make it easier to identify areas of over-complexity, catalog what works well and what doesn't, and begin to develop better, and scalable, methods of achieving the same goals, with an eye on the ability to expand through modularization and reusability of code.
The lesson we learned was that there is a very fine line, at times, between being a small business and being an enterprise-class business. Defining a timeline for when to refactor isn't really as important as doing it right the first time, but sometimes you don't have that luxury. Make the time to reassess your application's performance on a regular basis, tune your JVM to fit your application (read Mike Brunt), and rewrite anything that needs to be written better whenever you get a few minutes.


#1 by Mike Brunt on 7/21/08 - 3:45 PM