CI Tools and Best Practices in the Cloud

Continuous Integration





The Rule of Log for DevOps


"This is interesting-didn't we see this same thing in the logs just before we crashed last time?"
- an anonymous and somewhat embarrassed DevOps team member

Unfortunately, the "same thing" is an obscure, otherwise uninteresting log entry in a directory structure that's rarely checked. More unfortunate is that the "last time" was two months ago, when their eCommerce site went down on Black Friday, costing hundreds of thousands of dollars. They had noticed this same entry back then but only now made the correlation. Companies have made great strides in attempting to speed up correlations like this and improve MTTR. The question that always arises (usually not long after the CFO sees the price tag) goes something like, "Is it really worth deploying an agent to all of our key app servers with the sole purpose of gathering and merging disconnected artifacts (logs) in the hope of finding an interesting correlation when it counts?"
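To make that kind of correlation concrete, here is a minimal sketch of what the team was doing by hand: looking for messages that show up shortly before every known outage. The function names, the 30-minute window, the log messages, and the timestamps are all made up for illustration; no real aggregator works exactly like this.

```python
from collections import Counter
from datetime import datetime, timedelta


def messages_before(entries, incident, window):
    """Return the distinct messages logged within `window` before `incident`."""
    return {msg for ts, msg in entries if incident - window <= ts < incident}


def recurring_precursors(entries, incidents, window=timedelta(minutes=30)):
    """Return messages that appeared shortly before every listed incident."""
    counts = Counter()
    for incident in incidents:
        counts.update(messages_before(entries, incident, window))
    return sorted(msg for msg, n in counts.items() if n == len(incidents))


# Hypothetical merged log entries: (timestamp, message).
entries = [
    (datetime(2015, 11, 27, 9, 50), "cache pool exhausted, falling back to disk"),
    (datetime(2015, 11, 27, 9, 55), "user login ok"),
    (datetime(2016, 1, 27, 14, 40), "cache pool exhausted, falling back to disk"),
    (datetime(2016, 1, 27, 14, 45), "nightly report scheduled"),
]
# Hypothetical outage start times.
incidents = [datetime(2015, 11, 27, 10, 0), datetime(2016, 1, 27, 15, 0)]

suspects = recurring_precursors(entries, incidents)
print(suspects)
```

Even in this toy form, the approach only works if someone has already gathered the logs in one place and recorded when the outages happened, which is exactly the cost the CFO is asking about.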

Early in my career, I remember finding a "too many open files" error in MS-DOS. I knew how to fix that. As the designer and programmer, I would either be more judicious with my file handles, or do the simple thing and increase the allotment directly in the OS instance. The reality is that today, with agile dev cycles and outsourced staff, the immediate fix is not always known, and in production it gets tricky to get sign-off for a change request that you are not 100 percent confident in. Further, microservices teams are taking over the role of ops and supporting the DevOps culture as small teams write and manage their own code well into production. Great, now we have someone who knows how to fix it, right? Kind of. What about two years from now when they are redeployed to different work streams, or what if their webservice was a really small piece of a larger system that is completely down, due only in part to their code?

Logs have made many heroes. I know, because I've been one. But let's understand why. It's typically because some programmer 10 years ago put some small artifact into their crappy code that provided an inkling that the app was in a bad state. Perhaps they even told you what part of the application was in trouble or the service it was attempting to connect to. The reality is that never, not ever, does a programmer write out to a log something that's immediately actionable. If they did, they would have written the code to remedy it. That said, I would love someday to see the log entry that states:

10 SEP 2015, 05:03:23 hey guys, sorry about that poorly formed parameterization for that web service, and if you're seeing this, it's likely because I forgot to account for special characters in the password. So, please refer to line number 407 and tell dev to fix it. Oh, and you'll need to do the same thing on the other four parallel modules and restart all services.
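Short of that dream entry, the most a programmer can realistically do is record which part of the application was in trouble and which service it was trying to reach. Below is a rough sketch using Python's standard logging module; the logger name, the component, target, and hint fields, the URL, and the call_service function are all hypothetical and stand in for whatever your code actually does.

```python
import io
import logging

# Hypothetical format: every record must carry component, target, and hint.
FORMAT = ("%(levelname)s %(name)s component=%(component)s "
          "target=%(target)s hint=%(hint)s :: %(message)s")


def make_logger(stream):
    """Build a logger that writes formatted records to `stream`."""
    logger = logging.getLogger("rule_of_log.example")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(FORMAT))
    logger.addHandler(handler)
    return logger


def call_service(logger, url, password):
    """Pretend to call a web service; refuse passwords with special characters."""
    if not password.isalnum():
        logger.error(
            "request rejected before sending",
            extra={
                "component": "auth-client",
                "target": url,
                "hint": "password contains characters that must be escaped",
            },
        )
        return False
    return True


buffer = io.StringIO()
log = make_logger(buffer)
ok = call_service(log, "https://billing.example.invalid/login", "p@ss word")
output = buffer.getvalue()
print(output, end="")
```

Note that the hint is still only an inkling. It tells ops where to look, not what to change, which is the point of the paragraph above.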

Companies that have opted to tackle the very difficult task of understanding app server logic take much of the burden and correlation guessing games away from DevOps. Some APM vendors are now realizing the magic in automatic correlation, anomaly detection, and smart analytics based on data within the running code. Because of native code profiling techniques and process inspection, code written in Java, .NET, PHP, Node.js, Python, and C/C++ can be monitored in production without dev intervention - no code change required - and with very little overhead.

Screenshot: AppDynamics resides within appserver code to create dynamic visualizations of related systems

Given new solutions like this, do logs still matter? Yes, but much less so. It is more valuable for an ops team to know where to look, being taken directly to the exact SQL query, webservice call, method, or line number where the problem manifested itself, while the supporting system logs are still there for audit and validation. The key is context - context that ties interesting log information to what actually happened for that single, key transaction.
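As an illustration of what "context" means here, the sketch below stamps every log line with the ID of the transaction that produced it, so the lines for one transaction can be pulled out later. This is a hand-rolled, application-side approximation using Python's contextvars and logging modules, not a description of how AppDynamics or any other APM agent works. The names current_txn, TransactionFilter, and handle_checkout, and the transaction IDs, are invented for the example.

```python
import contextvars
import io
import logging

# Hypothetical name: holds the ID of the business transaction being handled.
current_txn = contextvars.ContextVar("current_txn", default="-")


class TransactionFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record):
        record.txn = current_txn.get()
        return True


def make_logger(stream):
    """Build a logger whose lines start with the transaction ID."""
    logger = logging.getLogger("rule_of_log.context")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TransactionFilter())
    handler.setFormatter(logging.Formatter("txn=%(txn)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_checkout(logger, txn, slow_payment):
    """Simulate one business transaction and log under its ID."""
    token = current_txn.set(txn)
    try:
        logger.info("running order lookup query")
        if slow_payment:
            logger.warning("payment webservice call exceeded 2s")
    finally:
        current_txn.reset(token)


def lines_for(log_text, txn):
    """Return only the log lines that belong to transaction `txn`."""
    prefix = f"txn={txn} "
    return [line for line in log_text.splitlines() if line.startswith(prefix)]


buffer = io.StringIO()
log = make_logger(buffer)
handle_checkout(log, "checkout-1001", slow_payment=False)
handle_checkout(log, "checkout-1002", slow_payment=True)
slow_txn_lines = lines_for(buffer.getvalue(), "checkout-1002")
print("\n".join(slow_txn_lines))
```

The catch is that this requires a code change in every service that touches the transaction, which is the dev intervention that agent-based tools claim to avoid.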

Screenshot: AppDynamics keeps log information associated with key business transactions in context

Looking at the question (and cost) of deploying an agent: does it make sense to deploy a single-purpose log aggregator to all related systems? Complicate that question with containers running within VMs within clusters, and more evolved DevOps groups are finding that getting into the code is key. With logs at each container layer, at the OS, and at the app level, it's difficult to keep the context and know what's relevant, especially when scaled to any significant deployment.

So stop looking at artifacts, stop examining remnants, stop the manual correlation. The technology now exists to truly understand your modern apps in ways never before possible. If you're not in the code, you do not have true context, no matter how detailed the log.

More Stories By Jason Trunk

Jason Trunk is Field Chief Technology Officer at AppDynamics.
