Sunday, September 23, 2012

ASP.NET IIS Issue and Resolution



Our team is responsible for backend process of supply and delivery of customer online shopping for one of the worlds leading retailers.

Recently we were working on the new project release and also system upgrade across all the servers where we operate for day to day activities. Interestingly we had a very peculiar issue and took some time out of our schedules and halted our rollout operations.  Our new server build is ready and our application is installed on our OPS environment for the signoff of the build with the latest application versions.

Our Applications deployment specifications were like this:

1)DotNet Framework  1.1 / 2.0 / 3.5  
2)IIS 6.0 
3)We have lot of applications with Windows / web  -  Asp.net website , WCF and Webservers also. 
4)Only One App-pool has been configured on IIS.

Intermittently we faced the server error occurring  and it was stopping our testing activities. Initially we thought it might be some memory or pagesize issue or app-pool recycle issue.

Client Application (Windows Service) consuming IIS hosted WCF service is failing with 0x80070006 (The handle is invalid) error, once App Pool is recycled.

And the event log captured says something like this :-

Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 06/07/2012 17:48:54
Event time (UTC): 06/07/2012 16:48:54
Event ID: 9bae3be46e794c248586e5ee0f66504f
Event sequence: 3
Event occurrence: 2
Event detail code: 0

Application information:
Application domain: /LM/W3SVC/1/Root/GHS_PCS_WS-1-129860669341875000
Trust level: Full
Application Virtual Path: /GHS_PCS_WS
Application Path: D:\Inetpub\wwwroot\GHS_PCS_WS\
Machine name: GHS9159FSR01

Process information:
Process ID: 22360
Process name: w3wp.exe
Account name: NT AUTHORITY\NETWORK SERVICE
Exception information:
Exception type: COMException
Exception message: The handle is invalid. (Exception from HRESULT: 0x80070006 (E_HANDLE))

Request information:
Request URL: http://localhost/GHS_PCS_WS/DeliveryWebService.svc
Request path: /GHS_PCS_WS/DeliveryWebService.svc
User host address: 127.0.0.1
User: NT AUTHORITY\SYSTEM
Is authenticated: True
Authentication Type: Negotiate
Thread account name: NT AUTHORITY\NETWORK SERVICE

Thread information:
Thread ID: 6
Thread account name: NT AUTHORITY\NETWORK SERVICE
Is impersonating: False

Stack trace:

Custom event details:
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
What is the exact text of any error messages that you received that are associated with this problem? Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 06/07/2012 17:48:54
Event time (UTC): 06/07/2012 16:48:54
Event ID: 9bae3be46e794c248586e5ee0f66504f
Event sequence: 3
Event occurrence: 2
Event detail code: 0

Application information:
Application domain: /LM/W3SVC/1/Root/GHS_PCS_WS-1-129860669341875000
Trust level: Full
Application Virtual Path: /GHS_PCS_WS
Application Path: D:\Inetpub\wwwroot\GHS_PCS_WS\
Machine name: GHS9159FSR01

Process information:
Process ID: 22360
Process name: w3wp.exe
Account name: NT AUTHORITY\NETWORK SERVICE
Exception information:
Exception type: COMException
Exception message: The handle is invalid. (Exception from HRESULT: 0x80070006 (E_HANDLE))

Request information:
Request URL: http://localhost/GHS_PCS_WS/DeliveryWebService.svc
Request path: /GHS_PCS_WS/DeliveryWebService.svc
User host address: 127.0.0.1
User: NT AUTHORITY\SYSTEM
Is authenticated: True
Authentication Type: Negotiate
Thread account name: NT AUTHORITY\NETWORK SERVICE

Thread information:
Thread ID: 6
Thread account name: NT AUTHORITY\NETWORK SERVICE
Is impersonating: False


We could replicate the issue when we do a apppool recycle and after resetting iis it was working allright. But in Live server we could not sustain such a uncertainty as it involves huge loss.

A team of 3 engineers started analyzing the issue and step by step opting out the possibilities and the options tried are

1.            Iis reset
2.            Force App-pool recycle
3.            Scheduled time App-pool recycle.
4.            Stop App-pool recycling.
5.            Change the Page Size limit
6.            Change Authentication types.
7.            App Pool separation for different framework applications.


But in spite of all the above changes and implementing all the suggestions on some of the blogs we could not over come the issue and we have to raise a ticket to Microsoft team as we didn’t had much time to do more R&D on the issue.

This is the first time in my career where I interacted with Microsoft  for technical assistance.

A assignee of the ticket has started the next hour to investigate the issue. He has called me many times to understand our project the system specifications, and also learnt how to replicate the issue.

We had already sent the Event logs to the team for  their  analysis. But they had come back with more log requirements for their investigation

1.            Application Event Log
2.            IIS Log
3.            Process Monitor Log
4.            ETW Trace Log
5.            WCF Trace Log
6.            Com+ and Com Diagonostic Tracing
7.            AdPlus FullOnFirst Crash Dump.



We had to replicate the issue and capture all the above logs at the same time.

After this they analyzed the logs and came up with the suggestions.

1.            Remove impersonation in WCF web site.
2.            Separation of  Webservies and WCF services website
3.            App-Pool Separation of all the Framework and different type of application


When we implement these changes, those issues disappeared and now its smooth operation.

Some Information on the logs:


1) AddPlus- FullOnFirst Crash Dump:

A user mode memory dump is a file typically with a DMP extension that contains a portion of virtual memory and CPU registers.
Two important items captured by the memory dump include the call stack for each thread and the exception record.  This information is captured for the moment in time that the memory dump is taken.
A call stack is the way the operating system keeps track of the currently executing functions within the thread.  Each thread within the process will have its own call stack.  The call stack is a LIFO queue with the currently executing function being at the top of the stack.
The call stack reported by the tools used to capture the memory dump will only be as good as the symbols that are present.  Without symbols for a module the reports output by the tools will show the memory offset from the beginning of the module, or try to get an export from the binaries, in which case it will be from the nearest function call.  In the event that the memory dump is captured on a machine that does not have the symbols, or the symbols are not setup correctly they will be recalculated by memory address when it is analyzed on a machine with the proper files.

2) Process Monitor Logs:


Process Monitor is an advanced monitoring tool that shows real-time file system, registry and process activity. It combines the features of two legacy Sysinternals utilities, Filemon and Regmon, and adds a number of other enhancements.
Process Monitor can be used to track system and software activity to troubleshoot some of product issues, especially when it is necessary to track what particular application or process accesses a file or a registry key.
The main Process Monitor window lists all system operations along with their exact time, process name, ID and the result for each single operation:


To access advanced information on any single operation right-click on the operation line and choose Properties:

When analyzing a Process Monitor log, it is recommended to filter out entries. For example you can right click on Successes under Results, and exclude it. You can also filter out Processes and generally any field you like.

3) IIS Logs

The IIS logs provide a great deal of information about the activity of a Web application. You can find the IIS logs in systemroot\System32\LogFiles\W3SVCnumber, where numberis the site ID for the Web site.

4)ETW Tracing

Event Tracing for Windows (ETW) is a feature of the Windows operating system that allows you to troubleshoot issues with HTTP requests in IIS components like Active Server Pages (ASP), Internet Server Application Programming Interface (ISAPI) extensions, and the Secure Sockets Layer (SSL) Filter service. If an HTTP request in IIS fails or becomes unresponsive while ETW is enabled, you can view ETW trace data, called events, to determine which component caused the failure
Difference between IIS Logs and ETW Tracing
In general, IIS log is more traditonal way to collect data, and ETW provides more dynamic and verbose information.
For performance tracking, both provide time-taken information per request.
By using ETW, we can collect data in real time or file. But, you need to write consumer to get it in real-time.

5) Com+ and Com Diagnostic Tracing.

COM tracing can help you troubleshoot issues that are related to permission, activation, or registration.

The tracing mechanism produces a binary file. The binary file must be formatted by Microsoft so that it can be analyzed.

You can trace the following COM+ and COM providers:
COMSVCS
COMADMIN
DCOMSCM (RPCSS)
OLE32

6) WCF Trace Log:

Windows Communication Foundation (WCF) outputs the following data for diagnostic tracing:
Traces for process milestones across all components of the applications, such as operation calls, code exceptions, warnings and other significant processing events.
Windows error events when the tracing feature malfunctions. See Event Logging.
WCF tracing is built on top of System.Diagnostics. To use tracing, you should define trace sources in the configuration file or in code. WCF defines a trace source for each WCF assembly. The System.ServiceModel trace source is the most general WCF trace source, and records processing milestones across the WCF communication stack, from entering/leaving transport to entering/leaving user code. The System.ServiceModel.MessageLogging trace source records all messages that flow through the system.

Tracing is not enabled by default. To activate tracing, you must create a trace listener and set a trace level other than "Off" for the selected trace source in configuration; otherwise, WCF does not generate any traces. If you do not specify a listener, tracing is automatically disabled. If a listener is defined, but no level is specified, the level is set to "Off" by default, which means that no trace is emitted.


7)Event Logs

The application Event log contains events logged by programs. For example, a database program may record a file error in the application log. Events that are written to the application log are determined by the developers of the software program.



References:

1) http://blogs.msdn.com/b/cclayton/archive/2010/02/11/how-to-collect-a-memory-dump-using-adplus-and-debugdiag.aspx
2)http://msdn.microsoft.com/en-us/library/ms733025.aspx
3)http://support.microsoft.com/kb/308427