When the going gets tough, the tough get WinDbg

03/18/2012 -

When I forced myself to use WinDbg for the first time, it looked to me like some alien jibberish. I was never into the low lever assembler kind of stuff, but as I was slowly learning its basics, it turned to be a reliable companion over the years and saved me more than a couple of times in some difficult, project breaking situations. Last week a client reported a problem in one of the ASP.NET apps we developed – a button on one particular screen caused unusually long ajax callbacks which lasted on average about 30 seconds or so. The callback caused the whole screen to rebind and it involved quite a bit of code, but after examining it closely, everything seemed to be in order. Whenever there is a process hang or crash problem, WinDbg is the best and, once you learn its ways, the fastest tool to pinpoint the cause of the problem. Thus I delved into it once again.

Since these breaking situations don’t happen often and I don’t use it on a daily basis, I tend to forget some of the necessary steps. Hence I created myself a little tutorial, just to get me through the basic stuff. I’ll share it here so maybe I’ll save someone a couple of hours. Of course, these are only the basics, for some of the advanced stuff the best place to visit is If broken it is, fix it you should site by Tess Hernandez, an absolute debugging guru. When I think about all the after hours I spent digging through her posts, I feel almost like I know her in person. I hope she won’t mind saying this, but  she looks hot and knows her WinDbg, what more could a man wish for…eh if only I was single and a bit younger…

Anyway, here’s how to quickly start with WinDbg. The first thing one needs is a dump file from a problematic process. For the process hang or intensive CPU utilization problems that can be easily reproduced, one can just go to Windows task manager, right click on a process and select Create Dump File option during the problematic event occurrence. For the crash problems or hang problems which are not easy to spot or reproduce, the things are not that simple, one needs a listener ADPlus.vbs script which would automatically create a dump file in case of a crash or hang. For more info check the Ok, now how do I capture my dump? article.

The second step is, of course, to install the WinDbg. Now comes the first tricky part. A couple of years ago I was still working on a 32 bit system and had the 32 bit version installed. When I moved to the 64 bit Windows 7, I naturally thought that I need the 64 bit version. It actually depends on the dump one wants to debug. For dumps created for a 32 bit process, 32 bit WinDbg version is needed and the 64 bit one is needed for the 64 bit process dumps. So, I actually have both installed, since there are still people in my company working on 32 bit machines and sometimes I need to debug their dumps. Both versions currently can be found here, by running an install of Windows SDK and selecting the Debugging Tools for Windows option. After a successful install both 32 and 64 bit installation packages can be found in C:\Program Files\Microsoft SDKs\Windows\v7.1\Redist\Debugging Tools for Windows (default location).

When a dump file is opened in WinDbg, before any managed code can be analyzed, the SOS debugging extension  (SOS.dll) needs to be loaded. For .NET versions prior to 4.0 the loading statement is:

.loadby sos mscorwks

For .NET 4.0 debugging, the line is:

.loadby sos clr

It’s also worth mentioning that instead of this, one can use a psscor Windows debugger extension (psscor2 for .NET 2.0 - 3.5 and psscor4 for .NET 4.0), which is the Microsoft official add-on to the SOS, it's way more powerful, provides nicer formatting of memory addresses and  some neat extra commands and information (for example an extra column showing which GC generation an object belongs to). Here’s how to load it:

.load InstalledPath\psscor2.dll

it’s advisable that if the psscor is used, then plain SOS shouldn’t be loaded as well. One can check what is loaded by issuing .chain command and unload extensions with the .unload command.

Now, there is quite a number of commands one can issue depending on the type of problem, I advise further reading and searching trough previously mentioned Tess’s blog posts, she has all of the basics covered, sample problems described, learning labs, etc. I’ll focus here on the problem that I was faced with, as a real life example. I created a dump via task manager after causing the problematic button click action. Issuing a !threadpool command as per Tess’s hang troubleshooting instructions produces the following:

0:000> !threadpool
Work Request in Queue: 0
Number of Timers: 17
CPU utilization 100%
Worker Thread: Total: 2 Running: 1 Idle: 1 MaxLimit: 200 MinLimit: 2
Completion Port Thread:Total: 1 Free: 1 MaxFree: 4 CurrentLimit: 0 MaxLimit: 200 MinLimit: 2

This confirms that I captured the dump at the right time, the CPU was indeed at 100% at the time of capturing. Executing the !runaway command helps me to get a grasp of the threads that consumed most of the CPU time:

0:000> !runaway
 User Mode Time
  Thread       Time
  18:213c      0 days 0:00:55.661
  19:22dc      0 days 0:00:42.338
  13:1f68      0 days 0:00:01.045
   7:1ea0      0 days 0:00:00.920
  14:134       0 days 0:00:00.873
   4:230c      0 days 0:00:00.655
   0:21f0      0 days 0:00:00.031
   6:1964      0 days 0:00:00.015
  25:8fc       0 days 0:00:00.000
  24:150c      0 days 0:00:00.000
  23:1780      0 days 0:00:00.000
  22:1fa0      0 days 0:00:00.000
  21:1eb0      0 days 0:00:00.000
  20:137c      0 days 0:00:00.000
  17:227c      0 days 0:00:00.000
  16:10f0      0 days 0:00:00.000
  15:1f2c      0 days 0:00:00.000
  12:1858      0 days 0:00:00.000
  11:1bf4      0 days 0:00:00.000
  10:232c      0 days 0:00:00.000
   9:15e4      0 days 0:00:00.000
   8:155c      0 days 0:00:00.000
   5:1d44      0 days 0:00:00.000
   3:146c      0 days 0:00:00.000
   2:18b4      0 days 0:00:00.000
   1:1d98      0 days 0:00:00.000

It looks like the threads 18 and 19 need to be checked, since they are standing out in the crowd with their times. A ~THREAD_IDs command is used to select the specified thread and !clrstack is used to check the running managed code if one exists in the thread. Let’s check the thread 19 first:

0:000> ~19s
00000000`774e135a c3              ret
0:019> !clrstack
OS Thread Id: 0x22dc (19)
Unable to walk the managed stack. The current thread is likely not a
managed thread. You can run !threads to get a list of managed threads in
the process

So, this one is not a managed thread, it’s most likely an idle, waiting CLR thread as indicated by the !threadpool results above and explained in this article. Running kb command confirms it:

0:019> kb
RetAddr           : Args to Child                                                           : Call Site
000007fe`fdc510dc : 00000000`02fccff0 ffffffff`ffffffff 00000000`00000001 000007fe`ee12bde6 : ntdll!ZwWaitForSingleObject+0xa
000007fe`edfb1b40 : 00000000`ffffffff 00000000`00009c40 00000000`00000000 00000000`000002c8 : KERNELBASE!WaitForSingleObjectEx+0x79
000007fe`edfb1a2f : 00000000`00000000 00000000`02fccff0 00000000`00000000 00000000`00009c40 : mscorwks!CLREvent::WaitEx+0x174
000007fe`edfb178b : 000007fe`ee0cc0c8 00000000`00000001 00000000`02fccff0 000007fe`edfab953 : mscorwks!CLREvent::WaitEx+0x63
000007fe`edfbe71a : 00000000`00000003 00000000`02fccff0 00000000`00000001 00000000`00000000 : mscorwks!ThreadpoolMgr::SafeWait+0x7b
000007fe`edf7b0dc : 00000000`00000000 00000000`00000000 00000000`0a0afcb0 00000000`00000000 : mscorwks!ThreadpoolMgr::WorkerThreadStart+0x11a
00000000`7738652d : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : mscorwks!Thread::intermediateThreadProc+0x78
00000000`774bc521 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d

That leaves the thread 18, the !clrstack works there, here are the results:

0:019> ~18s
000007fe`ee14cdc0 488bc1          mov     rax,rcx
0:018> !clrstack
OS Thread Id: 0x213c (18)
*** WARNING: Unable to verify checksum for System.Core.ni.dll
*** WARNING: Unable to verify checksum for System.Web.ni.dll
Child-SP         RetAddr          Call Site
000000000b3abf90 000007fee3bc66bd Monops2.Web.Controls.PogDogInputForm.<BindPosition>b__d(Monops2.DAL.EntityClasses.PositionEntity)
000000000b3ac070 000007fee3bb8142 System.Linq.Enumerable+WhereEnumerableIterator`1[[System.__Canon, mscorlib]].MoveNext()
000000000b3ac0c0 000007fee3bb82c1 System.Linq.Enumerable.Contains[[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.__Canon, System.Collections.Generic.IEqualityComparer`1<System.__Canon>)
000000000b3ac150 000007ff0282c793 System.Linq.Enumerable.Contains[[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1<System.__Canon>, System.__Canon)
000000000b3ac1b0 000007ff00a83b4d MonoSoftware.Web.Workspace.Lookup`1+#c.#p[[System.__Canon, mscorlib]].#oC(System.Collections.Generic.KeyValuePair`2<System.Object,System.String>)
000000000b3ac200 000007ff00a8317d System.Linq.Enumerable+WhereEnumerableIterator`1[[System.Collections.Generic.KeyValuePair`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]], mscorlib]].MoveNext()
000000000b3ac280 000007ff0270fdb5 System.Linq.Enumerable+<ConcatIterator>d__71`1[[System.Collections.Generic.KeyValuePair`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]], mscorlib]].MoveNext()
000000000b3ac320 000007ff02705fc5 Telerik.Web.UI.ControlDataBinder.BindToEnumerableData(System.Collections.IEnumerable)
000000000b3ac3a0 000007ff02705cd2 Telerik.Web.UI.ControlItemContainer.PerformDataBinding(System.Collections.IEnumerable)
000000000b3ac410 000007ff02705c82 Telerik.Web.UI.RadComboBox.PerformDataBinding(System.Collections.IEnumerable)
000000000b3ac450 000007ff0270580e Telerik.Web.UI.RadComboBox.OnDataSourceViewSelectCallback(System.Collections.IEnumerable)
000000000b3ac4a0 000007ff027052d2 Telerik.Web.UI.RadComboBox.OnDataBinding(System.EventArgs)
000000000b3ac520 000007ff02705201 Telerik.Web.UI.RadComboBox.PerformSelect()
000000000b3ac560 000007ff0282c675 Telerik.Web.UI.RadComboBox.DataBind()
000000000b3ac590 000007ff0282da3b Monops2.Web.Controls.PogDogInputForm.BindPosition()
000000000b3ac640 000007ff02827005 Monops2.Web.Controls.PogDogInputForm.Monops2.Web.Controls.ICustomInputControl.WriteFormValues(SD.LLBLGen.Pro.ORMSupportClasses.IEntity2)
000000000b3ace40 000007ff028267c2 Monops2.Web.Controls.EntityManager+InputFormTemplate.InstantiateIn(System.Web.UI.Control)
000000000b3ad940 000007ff0282641c Telerik.Web.UI.GridEditFormItem.InitializeTemplateForm(Telerik.Web.UI.GridColumn[], System.Web.UI.Control, Telerik.Web.UI.GridEditFormSettings)
000000000b3ad980 000007ff02816d5c Telerik.Web.UI.GridEditFormItem.InitializeEditForm(Telerik.Web.UI.GridColumn[])
000000000b3adb40 000007ff0271a706 Telerik.Web.UI.GridEditFormItem.SetupItem(Boolean, System.Object, Telerik.Web.UI.GridColumn[], System.Web.UI.ControlCollection)
000000000b3adba0 000007ff026f0078 Telerik.Web.UI.GridItemBuilder.CreateItems(Telerik.Web.UI.GridGroupingContext)
000000000b3adc50 000007ff026e8772 Telerik.Web.UI.GridTableView.CreateControlHierarchy(Boolean)
000000000b3adcf0 000007fee88f0588 Telerik.Web.UI.GridTableView.CreateChildControls(System.Collections.IEnumerable, Boolean)
000000000b3add50 000007fee88bb9f3 System.Web.UI.WebControls.CompositeDataBoundControl.PerformDataBinding(System.Collections.IEnumerable)
000000000b3adda0 000007fee88bbb45 System.Web.UI.WebControls.DataBoundControl.OnDataSourceViewSelectCallback(System.Collections.IEnumerable)
000000000b3adde0 000007ff026e7e17 System.Web.UI.WebControls.DataBoundControl.PerformSelect()
000000000b3ade40 000007ff026e7bfb Telerik.Web.UI.GridTableView.PerformSelect()
000000000b3ade70 000007ff026e2419 Telerik.Web.UI.GridTableView.DataBind()
000000000b3aded0 000007ff026e1a18 Telerik.Web.UI.RadGrid.DataBind()
000000000b3adf10 000007ff02823a68 Telerik.Web.UI.RadGrid.AutoDataBind(Telerik.Web.UI.GridRebindReason)
000000000b3adfc0 000007ff02821ea9 Telerik.Web.UI.GridCommandEventArgs.ExecuteCommand(System.Object)
000000000b3ae080 000007fee8063bd6 Telerik.Web.UI.RadGrid.OnBubbleEvent(System.Object, System.EventArgs)
000000000b3ae0d0 000007ff02821765 System.Web.UI.Control.RaiseBubbleEvent(System.Object, System.EventArgs)
000000000b3ae110 000007fee8063bd6 Telerik.Web.UI.GridItem.OnBubbleEvent(System.Object, System.EventArgs)
000000000b3ae150 000007ff028217a3 System.Web.UI.Control.RaiseBubbleEvent(System.Object, System.EventArgs)
000000000b3ae190 000007fee8063bd6 Telerik.Web.UI.GridItem.OnBubbleEvent(System.Object, System.EventArgs)
000000000b3ae1d0 000007fee80638cd System.Web.UI.Control.RaiseBubbleEvent(System.Object, System.EventArgs)
000000000b3ae210 000007fee807cd15 System.Web.UI.Page.RaisePostBackEvent(System.Web.UI.IPostBackEventHandler, System.String)
000000000b3ae240 000007fee807bff0 System.Web.UI.Page.ProcessRequestMain(Boolean, Boolean)
000000000b3ae310 000007fee807bf1b System.Web.UI.Page.ProcessRequest(Boolean, Boolean)
000000000b3ae370 000007fee807beb0 System.Web.UI.Page.ProcessRequest()
000000000b3ae3d0 000007ff025c31f4 System.Web.UI.Page.ProcessRequest(System.Web.HttpContext)
000000000b3ae430 000007fee8083357 ASP.stranice_pogdog_pogonskidogadjaj_aspx.ProcessRequest(System.Web.HttpContext)
000000000b3ae460 000007fee804671b System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
000000000b3ae510 000007fee8733971 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
000000000b3ae5b0 000007fee87243b2 System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
000000000b3ae740 000007fee8705df9 System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
000000000b3ae790 000007fee882dde1 System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
000000000b3ae8b0 000007fee882d9ab System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000b3aea30 000007fee882cc14 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000b3aea90 000007feee14e0ba DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)
000000000b3af2c0 000007fee882df10 DomainNeutralILStubClass.IL_STUB(IntPtr, System.Web.RequestNotificationStatus ByRef)
000000000b3af3a0 000007fee882d9ab System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
000000000b3af520 000007fee882cc14 System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
000000000b3af580 000007feee14e30b DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)

The listing shows the whole sequence of methods and the interesting point is obviously the last one (bolded at the top) called BindPosition. It is actually quite a simple method, just a few lines of code:

private void BindPosition()
    cboPosition.SelectedValue = String.Empty;
    cboPosition.DataSource = PositionEntity.GetLookup().GetBindingSource(entity => entity.LocationId == (!String.IsNullOrEmpty(cboLocation.SelectedValue) ? new Guid(cboLocation.SelectedValue) : Guid.Empty));
    cboPosition.DataSource = null;

Without going into too much details, the cboPosition is a Telerik RadComboBox control that is dependent on another RadComboBox cboLocation. Its selected value determines the data source of cboPosition, which is fetched from cache containing all positions. The only suspicious line there is a lambda expression which filters out positions according to a parent location. I just changed the code to this:

private void BindPosition()
    cboPosition.SelectedValue = String.Empty;
    Guid locationId = !String.IsNullOrEmpty(cboLocation.SelectedValue) ? new Guid(cboLocation.SelectedValue) : Guid.Empty;
    cboPosition.DataSource = PositionEntity.GetLookup().GetBindingSource(Resources.EntityManager_EmptyValue_Undefined, entity => entity.LocationId == locationId);
    cboPosition.DataSource = null;

and voila - the callback was reduced from 30 seconds to a glimpse of a second. Now, there are close to 3000 cached position entities and lambda expression is evaluated for each of them. I still wouldn’t expect that the lambda piece of code even repeated 3000 times would cause such a delay. I further discovered that the part responsible for creating new guid from string caused about 5 seconds of delay, the remaining part was caused by Telerik RadComboBox SelectedValue property getter, but I didn’t get deeper into what causes it. The solution was sufficient for my purposes. So, as a note to myself and an advice to anyone that encounters something similar – it’s safer to keep lambda expressions as short as possible and have everything that can be calculated/set outside of it done that way. As for the WinDbg, yet again it hasn’t failed me and I’d recommend it to everyone. The learning curve is a bit steep, but the pay off is more than worthwhile.

Rated 4.40, 5 vote(s).