A role at my workplace that I've noticed a huge need for is that of a performance architect/analyst, and it's a role I have been attempting to play a more active part in. By the very nature of the .NET Framework and how much work it does for us, I think many of us let performance considerations fall by the wayside and just assume that everything will work when rolled out into production. More often than not this is indeed the case initially, but inevitably performance issues are going to crop up, and it will be time to fire up a production debugger/profiler and some of the CLR performance monitor counters. A little background may be in order...and this is from a corporate IT slant.
There are developers that get their jollies by designing slick interfaces/GUIs, as invariably this is the only part of the application that the end user is going to see, and 9 times out of 10 when giving a demo to a corporate sponsor, the only remarks made about the application concern the GUI, i.e. “it looks great!” or “can it do this?”, etc. I am not one of these developers, mainly because I cannot design my way out of a cardboard box. I am not saying that frontend design/performance isn't important, but usually that kind of stuff is in the “here and now”, and can be dynamic in nature and easily changed. What can't be changed down the line is the plumbing of an application, and this is where I feel most at home. Every application needs built-in tracing/logging functionality, diagnostic abilities, and performance monitoring facilities. Performance guidelines need to be adhered to, because tests in labs will never fully simulate the real environment in which your application will roll out.
Recently, an IIS cluster of ours started misbehaving badly, and the aspnet_wp needed to be killed and restarted half a dozen times a day to get everything running smoothly again. The only way our infrastructure team knew something was wrong was when the phone calls started coming in. This kind of reactive “user based monitoring” simply isn't acceptable. The culprit was the aspnet_wp racing the processor and consuming quite a bit of memory on the machines, until the server became unresponsive. Through no fault of their own, our infrastructure team simply didn't have the knowledge to troubleshoot .NET CLR/ASP.NET production issues. Given that this is something I've always been interested in, I decided to step in and have a look around. There are numerous ways an ASP.NET application can be tuned up post-deployment via the machine.config file, in addition to guidelines that should be followed pre-deployment while coding. I will go over a handful of both. In this situation, the first step we took was to try and isolate the issue by removing layers of complexity one by one (we don't have a full blown test lab here, so we had to do this in production, both after hours and during a full load to get reliable numbers). Our cluster is only 2 machines watched over by Application Center 2000. I am most certainly not an expert on it, but one thing I did learn about App Center is that while it does replicate the IIS 5.0 metabase around, it does not replicate the machine.config to member machines. (I wonder if this is different with IIS 6.0, as the machine.config file's processModel section is built in; my guess is that these settings would indeed be replicated in IIS 6.0.) So the first thought was to remove App Center and isolate the issue down to one machine. Enter the machine.config file, and the processModel section.
Here is some useful information concerning machine.config and how it relates to ASP.NET performance tuning, followed by a brief overview of some of the .NET CLR performance counters that ship with the .NET framework.
Reactive Performance Monitoring Guidelines
ASP.NET Configuration Settings:
The processModel section cannot be overridden in either an App.config or Web.config file; it is system wide and applies to every web application on IIS 5.0 (6.0 introduces application pools, which alleviate this constraint). Also, when any other section of the machine.config file is changed, aspnet_wp will recycle itself...this is not true with the processModel section, as it's read by the unmanaged aspnet_isapi.dll. Editing the processModel section requires an IISRESET command to be issued on that machine before the changes take effect. Standard HTML comment syntax within the processModel section is not allowed and will result in a parsing error. I learned this the hard way a while back, so it's worth mentioning here.
I'll be honest, the default values for each process recycling related processModel attribute suck. These should not be used in a production environment, as they lend themselves to the aspnet_wp spiraling rapidly out of control. In fact, the only attribute with even some semblance of a normal configuration value is memoryLimit, and even this is way off (the default value is 80% of physical RAM before the aspnet_wp recycles itself; for a bare bones web server that *might* be enough, but I doubt it). Lowering this value to 60% or less is what I would recommend. The aspnet_wp recycling itself (although resource intensive in its own right) is much more graceful than getting out of memory errors. A quick note about the aspnet_wp recycling itself: any requests executing on the recycled process will be served before that aspnet_wp is quietly killed off, and all new requests are immediately dispatched to the newly spawned process. This is completely transparent to the end user, as they won't lose session state, and application state is transferred over to the new process. For initial troubleshooting in my case, I set this value to 20%, which on this machine is ~200 megs of memory. The issue still persisted though, more on that later. There are a couple of other recycling related attributes worth mentioning, as their default values IMO are unacceptable (bear in mind that while our web apps are 24/7, they are only heavily trafficked for about 12-15 of those hours).
The most common processModel attributes to tweak are:
- timeout -- the default value is “infinite”, meaning the aspnet_wp will never be restarted automatically. Again, while it is expensive resource-wise to spawn a new aspnet_wp process, in order to look out for its own health, the aspnet_wp needs to be recycled on a pre-determined basis. When left to its own devices, something can and will eventually go wrong. Only careful monitoring can determine this value, though I recommend nothing less than 60 minutes unless things are really going haywire, in which case it is probably time to visit other steps that I will mention in a bit. This is probably the best proactive setting in the processModel section.
- idleTimeout -- the default value is “infinite”, meaning the aspnet_wp will never be restarted, even if it's at a standstill. In my case, I know when an acceptable time to kill off the aspnet_wp is; it's a safe bet that after 20-30 minutes of inactivity on our machines, no one is hitting them (and it's probably late in the night), so I can safely shut down the aspnet_wp after this amount of time. Obviously, this would not work for a full blown 24/7 operation, thus this attribute is meaningless for that scenario.
- requestLimit -- once again, the default value is “infinite”, meaning the aspnet_wp will keep serving requests no matter how many it has already handled. For obvious reasons, this value needs to be profiled for and set according to server resources. In my case, I knew this didn't need too much fine tuning as this box never saw more than 200 concurrent requests (given the resources on the machine, 200 concurrent requests should be handled easily).
- responseDeadlockInterval -- I personally have only run into issues with this setting once, and it was due to a long running SQL query executing on an aspnet_wp thread. The default for this setting is 3 minutes, however the query ran longer than that, so eventually it timed out. Of course the real fix for the issue there would have been running the query as a .bat file, firing it off asynchronously, etc., but I thought it was worth mentioning; by default, any request taking longer than 3 minutes on an aspnet_wp thread will cause issues.
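To make the attributes above a bit more concrete, here is a rough sketch of how a processModel fragment could be sanity checked. It is written in Python purely for illustration and is not part of any .NET tooling. The sample fragment, its values, the 1024 MB of physical RAM, and the helper names are all my own assumptions, not settings taken from our servers.

```python
import xml.etree.ElementTree as ET

# Hypothetical processModel fragment; the values are placeholders for
# illustration only, not recommendations.
SAMPLE = """<processModel enable="true" timeout="Infinite" idleTimeout="00:30:00"
    requestLimit="Infinite" memoryLimit="20"
    responseDeadlockInterval="00:03:00" />"""

# The recycling attributes discussed above, all of which default to Infinite.
RECYCLING_ATTRIBUTES = ("timeout", "idleTimeout", "requestLimit")


def unbounded_recycling_attributes(xml_text):
    """Return the recycling attributes that are Infinite or not set at all."""
    element = ET.fromstring(xml_text)
    return [
        name
        for name in RECYCLING_ATTRIBUTES
        if element.get(name, "Infinite").lower() == "infinite"
    ]


def memory_limit_megabytes(xml_text, physical_ram_mb):
    """Translate the memoryLimit percentage into megabytes of physical RAM.

    Returns None when the fragment does not set memoryLimit.
    """
    element = ET.fromstring(xml_text)
    percent = element.get("memoryLimit")
    if percent is None:
        return None
    return physical_ram_mb * int(percent) / 100


if __name__ == "__main__":
    print(unbounded_recycling_attributes(SAMPLE))
    print(memory_limit_megabytes(SAMPLE, 1024))
```

With a memoryLimit of 20 on an assumed 1024 MB machine, the second helper lands in the same ~200 meg ballpark mentioned earlier.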
Our web server is a dual processor machine; however, because the machine.config file was in its default state, the webGarden attribute was set to “false” and only one processor was being utilized. Glancing at the Processor performance counter verified this. This is not something that can just be switched on in production though. There are a couple of caveats to enabling a web garden for ASP.NET, the most important of which is that InProc session state will not work in a web garden; you must use either StateServer (aspnet_state.exe) or SqlServer to persist session state. This is due to the threading model introduced in SMP systems, and is not a shortcoming of ASP.NET. This also means that the Session_End event in Global.asax isn't fired, so if you have any cleanup code there, it will need to be moved elsewhere; as a consequence, Session.Abandon of course will not fire Session_End either. A nice side effect of using either of the out of process models is that they are durable through aspnet_wp recycles and IISRESETs. If you choose to use a StateServer, make sure you configure the ASP.NET State Server Service to start up automatically, and to use either a domain account or a network account for logging on to the network. Some other considerations for using a StateServer are:
- In a web farm, make sure you have the same <machineKey> in all your web servers. See KB 313091 on how to do it.
- For session state to be maintained across different web servers in the web farm, the Application Path of the website (For example \LM\W3SVC\2) in the IIS Metabase should be identical in all the web servers in the web farm. See KB 325056 for details.
If you decide to go with SQLServer session state, take note of the following:
- If you specify integrated security in the connection string (e.g. "trusted_connection=true", or "integrated security=sspi"), it won't work if you also turn on impersonation in asp.net. Unfortunately, this bug isn't reported in KB yet. (There is a QFE fix for it.)
- For session state to be maintained across different web servers in the web farm, the Application Path of the website (For example \LM\W3SVC\2) in the IIS Metabase should be identical in all the web servers in the web farm. See KB 325056 for details.
Offloading session state to another machine will also free up valuable resources on the web servers, albeit at a performance hit for cross process marshalling calls. Of course, SqlServer session state is the most durable, but it suffers from the biggest performance hit, so it would only be used in the most critical of applications. In any event, choosing to enable the webGarden attribute will need to be tested thoroughly, and if you are on 1.0 of the framework, be mindful of this article (if an object is not marked as Serializable, trying to store it in session will result in an error...this has been remedied in 1.1). We are still in the conceptual phase for rolling this out at work, though I have successfully implemented it elsewhere.
ASP.NET/CLR Performance Counters:
After tuning up your processModel section, the need for benchmarking is apparent. The .NET CLR ships with dozens of performance monitoring counters. I will attempt to give a high level overview of which counters to use, what to look for, and guidelines that have served me well.
In our case, it appeared to be a memory issue, so after throttling aspnet_wp to only consume 20% of physical memory it was time to see what was going on under the hood memory wise. A common assumption is that since the CLR is garbage collected, memory “leaks” cannot happen. Generally speaking this is correct; however, if objects are allocated improperly in code, or caching is implemented improperly, or InProc session state/ViewState is abused, memory issues will indeed crop up. The best counter to monitor for memory consumption is Process/aspnet_wp/Private Bytes. These bytes are private to the aspnet_wp and cannot be shared with other processes. If this number rises and the .NET CLR Memory/# Bytes in All Heaps counter stays flat, this is a good indication of a memory leak outside the managed heap. An excellent paper by Thomas Marquardt should be read by anyone needing to gather information from .NET performance counters, and does a more thorough job of explaining the monitoring process than I could ever hope to do.
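The "Private Bytes rising while # Bytes in All Heaps stays flat" check can be sketched roughly as follows. This is Python used only to illustrate the comparison, and the 20% and 5% thresholds, the function names, and the sample numbers are arbitrary assumptions of mine rather than published guidance.

```python
def growth(samples):
    """Relative growth from the first to the last sample (0.10 means 10%)."""
    first, last = samples[0], samples[-1]
    return (last - first) / first


def looks_like_unmanaged_leak(private_bytes, heap_bytes,
                              rise_threshold=0.20, flat_threshold=0.05):
    """Flag the pattern described above.

    private_bytes: samples of Process/aspnet_wp/Private Bytes
    heap_bytes:    samples of .NET CLR Memory/# Bytes in All Heaps
    The thresholds are hypothetical defaults chosen for this sketch.
    """
    private_is_rising = growth(private_bytes) >= rise_threshold
    heap_is_flat = abs(growth(heap_bytes)) <= flat_threshold
    return private_is_rising and heap_is_flat


if __name__ == "__main__":
    # Made-up samples in megabytes, taken at regular intervals.
    private = [150, 170, 195, 230]
    managed = [80, 82, 81, 83]
    print(looks_like_unmanaged_leak(private, managed))
```

If both series climb together, the growth is more likely coming from managed allocations (caching, session state, and so on) than from a leak outside the managed heap.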
Essential performance counters include the following:
- .NET CLR Exceptions/# of Exceptions Thrown and # of Exceptions Thrown per Second. Exception handling is expensive, and the old adage of only throw exceptional exceptions holds true here. Be advised that HttpResponse.Redirect, Server.Transfer, and Response.End all throw an uncatchable ThreadAbortException, so use these counters in conjunction with ASP.NET Applications/Errors Total to see if the exceptions being thrown actually bring the application down. Generally speaking, the # of Exceptions per Second should be around 5% of the ASP.NET Applications/Requests per Second counter.
- .NET CLR Memory/# Bytes in All Heaps (use this in conjunction with Private Bytes), /% Time in GC (this average should be 5%, though spikes are not uncommon), and /Large Object Heap Size (any object over 20kb at time of creation is stored on this heap, and these objects are not promoted through the GC's generations of collection; an abnormal ratio of this size to # Bytes in All Heaps should set off a flag). In addition, the # Gen 0, 1, and 2 Collections counters should be monitored. The ratio of collections should be 10 gen 0's per every gen 1, and 10 gen 1's per every gen 2. Gen 2 collections are very expensive, and also suspend all threads until they have completed. Note that there is a way around this, which I will mention later (the framework actually ships with 2 GC's, one geared towards workstations, and one geared towards servers; workstation is used by default).
- ASP.NET/Request Execution Time and Worker Process Restarts. These are pretty obvious in their functionality and can be used to identify a bottleneck and processModel inflicted restarts, respectively. In particular, Worker Process Restarts can be used to fine-tune the processModel settings mentioned earlier.
- ASP.NET Applications/Errors Total, Pipeline Instance Count, Requests Executing, and Requests per Second. Errors Total represents the number of unhandled exceptions resulting in a runtime error; this number should be low in comparison to # of Exceptions Thrown. Pipeline Instance Count can be almost any reasonable number, however it should remain virtually unchanged; if this number climbs rapidly it's a sign of bottlenecks downstream (e.g. SQL Server denying connections). Requests Executing and Requests per Second can be used to identify server issues; these numbers should be low and consistent.
- ASP.NET Applications/Cache [all]. These counters will give some insight as to how efficiently caching is being utilized, and are a good indicator as to where the application can be further tuned for caching efficiency. I will speak more about page caching in a future post, but it bears mentioning that if the Output Cache Hit Ratio is low (below 80%), the application needs some work on page caching. The Cache Total counters reflect both internal ASP.NET caching as well as external use of the cache via the exposed APIs. The Cache API entries reflect only the external use of the exposed caching APIs. The main key to remember here is high ratio, and low turnover. Caching objects/pages is a good thing, but if taken to an extreme can do just as much harm as good.
- Memory/Available MBytes. If the GC is doing its job correctly, this number should fluctuate around a consistent number (+- 100 megs or so). This counter is impacted by the number of connections and the connection pool for database connections (among other things). A spike in the number of web connections or excessive database connections will cause this number to drop. A sharp dip from time to time won't be uncommon; just monitor the GC counters if you notice an influx of activity on this counter.
- Process/aspnet_wp/% Processor Time, Private Bytes, and Thread Count. % Processor Time can be used to identify cases in which the aspnet_wp is racing the processor. Prolonged values above 75% should be monitored closely. I believe if you have more than one processor, the value can actually exceed 100%, in which case it needs to be divided by the number of processors (i.e. if you have an 8 processor box, and see a value of 400%, it's actually utilizing 50% of the cumulative processors' time). Private Bytes should remain steady; a slow increase in this number needs to be watched very carefully and correlated with the .NET CLR Memory counters. Thread Count should remain at a steady number if your application is single-threaded. Excessive locks or asynchronous calls in your code could cause this number to fluctuate...too many threads will severely hinder performance.
- Web Service/Current Connections. This is a baseline to see just how busy your server really is. In the controlled corporate IT world, this number is predictable and can be used to calculate various server resource consumption metrics.
- It's also a good idea to monitor all of the .NET CLR Data counters. There is a known issue of these counters freezing; if you close the monitoring window for 1-2 minutes, they should start back up again. This is a good way to check and see if connection pooling is behaving correctly. The number of connection pools should roughly equal the ASP.NET/Applications Running counter; if it's substantially higher than this, something is wrong with the way your applications are building the pool. Remember, .NET uses a kind of primitive way to lump connections together into a pool; it looks for connections using the same connection string. If you are dynamically building connection strings and they don't match 100% to another string (even if it's hitting the exact same database), .NET will create a new connection pool, and this is expensive resource-wise.
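A few of the rules of thumb in the list above boil down to simple arithmetic, which can be sketched like this. It is Python used only as a calculator, the function names are hypothetical, and reading "around 5%" as an upper limit is my own interpretation for the sake of the example.

```python
def normalized_cpu_percent(raw_percent, processor_count):
    """Scale a multi-processor % Processor Time reading back to 0-100."""
    return raw_percent / processor_count


def exception_rate_ok(exceptions_per_sec, requests_per_sec, limit=0.05):
    """True if exceptions stay within roughly 5% of requests per second."""
    if requests_per_sec == 0:
        return exceptions_per_sec == 0
    return exceptions_per_sec / requests_per_sec <= limit


def gc_ratios(gen0, gen1, gen2):
    """Return (gen0 per gen1, gen1 per gen2); both should sit near 10."""
    gen0_per_gen1 = gen0 / gen1 if gen1 else None
    gen1_per_gen2 = gen1 / gen2 if gen2 else None
    return gen0_per_gen1, gen1_per_gen2


def distinct_pool_count(connection_strings):
    """Count pools the way the text describes: one per exact string."""
    return len(set(connection_strings))


if __name__ == "__main__":
    print(normalized_cpu_percent(400, 8))
    print(exception_rate_ok(2, 100))
    print(gc_ratios(1000, 100, 10))
    # Same database, but the second string is ordered differently,
    # so it would not share a pool with the first.
    print(distinct_pool_count([
        "Server=db;Database=app",
        "Database=app;Server=db",
        "Server=db;Database=app",
    ]))
```

The last helper is only a model of the exact-match behavior described above; it is handy for spotting how many distinct strings an application is really generating.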
In addition to monitoring these counters:
- A stable benchmark for comparison needs to be gathered. Do not do this in the lab or a controlled environment, use your production server(s). Performance counters are indeed invasive, but this is something that must be done prior to monitoring an unstable machine. If you only monitor a server when it is misbehaving, you won't have anything to compare it to. I recommend 2 baseline benchmarks; one at a near idle (in my case, I came in on a Sunday and sampled 2 hours of data), and one during a normal stress load (don't do it before lunchtime...towards the end of the day is usually good). Microsoft performance counters can be exported to Excel, so you have a wealth of data manipulation at your fingertips right there.
- If you are running a multi-processor server, use the server GC that ships with the .NET Framework (mscorsvr.dll). Two articles here and here give more details on when/how/why to do this. Here is a newsgroup posting on how to do it with an environment variable. I have no experience with this and only just learned of it recently, but it seems like a good idea, and I know of at least one blogger who has had some luck with it.
- In the end, experience monitoring both healthy servers and unhealthy servers is the key. While I'm not suggesting that you run out and attach every performance counter you can to your servers, it's a good idea to start monitoring early and regularly. You can schedule logs to run at given intervals and notify operators of issues using the performance monitoring tool (perfmon). And of course, .Net provides classes to build your own counters if you wish, but that is outside the realm of this post.
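Comparing a misbehaving server against its baseline can be sketched along these lines. This is Python for illustration only, and the CSV layout (a Time column followed by one column per counter), the sample numbers, and the function names are simplifying assumptions of mine; a real perfmon export has its own header format and would need adapting.

```python
import csv
import io
from statistics import mean

# Made-up samples in the assumed layout: Time, then one column per counter.
BASELINE_CSV = "Time,Private Bytes,Thread Count\n09:00,100,20\n09:01,110,20\n"
INCIDENT_CSV = "Time,Private Bytes,Thread Count\n14:00,200,22\n14:01,220,22\n"


def counter_averages(csv_text, time_column="Time"):
    """Average every counter column in a log, skipping the time column."""
    columns = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if name != time_column:
                columns.setdefault(name, []).append(float(value))
    return {name: mean(values) for name, values in columns.items()}


def percent_change(baseline, current):
    """Percent change of each counter average relative to the baseline."""
    return {
        name: (current[name] - baseline[name]) / baseline[name] * 100
        for name in baseline
        if name in current and baseline[name] != 0
    }


if __name__ == "__main__":
    baseline = counter_averages(BASELINE_CSV)
    incident = counter_averages(INCIDENT_CSV)
    for name, change in percent_change(baseline, incident).items():
        print(f"{name}: {change:+.1f}% vs baseline")
```

Keeping one idle baseline and one normal-load baseline, as suggested above, would just mean running the same comparison against two different baseline logs.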
I still haven't isolated exactly what is going on with the server in question, but through a series of monitoring sessions it's been easy to eliminate what is not the issue. In the end, I believe the issue is going to relate to this Q article. If not, then it's back to the drawing board. At least I'm armed with the tools I need to track down whatever the underlying issue really is.
In a future post I will go over some proactive coding best practices to prevent some of the situations mentioned in this post.
Useful Links:
CLR Memory Performance Counters
Performance Counters for ASP.NET
Microsoft .NET Framework Performance: Tips, Tools, and Techniques
Performance and Diagnostics
processModel Element
Configuring Session Cache Servers
ASP.NET Performance Monitoring
Performance Considerations
Operating .NET-Based Applications
Production Debugging for .NET Framework Applications
Posted Mar 26 2004, 11:41 PM by Jayson Knight