Closed Bug 712109 Opened 13 years ago Closed 12 years ago

Implement non-fatal chromehang by turning on frame-pointers on a branch

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla13
Tracking Status
firefox13 --- wontfix
firefox14 --- fixed

People

(Reporter: taras.mozilla, Assigned: vladan)

References

Details

(Whiteboard: [Snappy:P1])

Attachments

(5 files, 9 obsolete files)

20.69 KB, text/plain
Details
28.01 KB, patch
ehsan.akhgari
: review+
jorendorff
: feedback+
Details | Diff | Splinter Review
30.34 KB, patch
vladan
: checkin+
Details | Diff | Splinter Review
9.37 KB, patch
ehsan.akhgari
: review+
Details | Diff | Splinter Review
32.14 KB, patch
Details | Diff | Splinter Review
Chromehang has a great way to identify causes and severity of pauses in Firefox. Crashing the browser meant it would only capture a single pause per run, we should do better than this. In the last snappy meeting we discussed maintaining a parallel branch of Firefox that we can occasionally switch nightly users to. It would be identical to the mainline minus -omit-frame-pointer. This should allow us to "easily" walk the stack and submit multiple stack frames per firefox run via telemetry.
The profiling branch is mostly working now.  I'll publicize it early January.
Depends on: 711221
Assignee: nobody → vdjeric
Attached patch Report chrome hangs to Telemetry (obsolete) — Splinter Review
First draft of patch
Attachment #586140 - Flags: review?(ehsan)
Comment on attachment 586140 [details] [diff] [review]
Report chrome hangs to Telemetry

Review of attachment 586140 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +47,5 @@
>  #include "jsapi.h" 
>  #include "nsStringGlue.h"
>  #include "nsITelemetry.h"
>  #include "Telemetry.h" 
> +#include "nsTArray.h"

Not needed, you already include this in the header file.  :-)

@@ +456,5 @@
> +    jsval v = STRING_TO_JSVAL(str);
> +    if (!JS_SetElement(cx, report_array_obj, i, &v)) {
> +      return NS_ERROR_FAILURE;
> +    }
> +  }

I don't know these js APIs.  While the code looks sane, please ask somebody who knows this stuff to rubber-stamp this as well.

@@ +578,5 @@
> +  nsCAutoString hangReportStr;
> +  PRUint32 stackSize = callStack.Length();
> +  for (size_t i = 0; i < stackSize; ++i) {
> +    hangReportStr.AppendPrintf("0x%p", callStack[i]);
> +    if (i + 1 < stackSize) {

Nit: please do the arithmetic at the right side of the < operator.

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +89,5 @@
> +   * An array of chrome hang reports. Each element is the main thread's PC stack during a hang.
> +   * The PC stack is formatted as a comma-delimited string of program counters (in hexadecimal).
> +   */
> +  [implicit_jscontext]
> +  readonly attribute jsval chromeHangs;

Please rev the uuid of this interface as well.

::: xpcom/base/nsStackWalk.cpp
@@ +337,5 @@
>  
>  struct WalkStackData {
>    PRUint32 skipFrames;
>    HANDLE thread;
> +  bool walkCallingThread;

Can you please split this into its own bug?  The code looks good, but making this live inside this patch will cause us problems if for example we back this patch out for some reason, etc.

::: xpcom/threads/HangMonitor.cpp
@@ +79,5 @@
>  volatile PRIntervalTime gTimestamp;
>  
> +// Main thread ID used in reporting Chrome hangs under Windows
> +#ifdef XP_WIN
> +HANDLE winMainThreadHandle = NULL;

Nit: make this static please.

@@ +125,5 @@
> +  MOZ_ASSERT(aClosure);
> +
> +  char buf[1024];
> +  nsCodeAddressDetails details;
> +  NS_DescribeCodeAddress(aPC, &details);

I don't know if this is an issue here or not, but note that the first time you call NS_DescribeCodeAddress, if it finds the PDB files, it reads them synchronously from the disk.  In my part reincarnations when I was instrumenting the Firefox startup, I remember this taking 100-150ms on a machine without an SSD, which would totally screw up my measurements.

@@ +126,5 @@
> +
> +  char buf[1024];
> +  nsCodeAddressDetails details;
> +  NS_DescribeCodeAddress(aPC, &details);
> +  NS_FormatCodeAddressDetails(aPC, &details, buf, sizeof(buf));

Nit: please use mozilla::ArrayLength.

@@ +127,5 @@
> +  char buf[1024];
> +  nsCodeAddressDetails details;
> +  NS_DescribeCodeAddress(aPC, &details);
> +  NS_FormatCodeAddressDetails(aPC, &details, buf, sizeof(buf));
> +  fputs(buf, stderr);

I'm not sure if this code is experimental or not, but in general writing stuff to the stderr on Windows is not a great idea, since if the process does not have a console (which our opt builds don't), this will go to /dev/null unless you redirect stderr to a file.  Even that wouldn't work if you hit a case where Firefox restarts itself...
Attachment #586140 - Flags: review?(ehsan)
Depends on: 716590
Applied ehsan's comments + ifdef'd under MOZ_PROFILING + enabled chrome hang monitoring by default on profiling branch when Telemetry is enabled
Attachment #586140 - Attachment is obsolete: true
Attachment #587049 - Attachment is obsolete: true
Attachment #587519 - Flags: review?(ehsan)
Comment on attachment 587519 [details] [diff] [review]
Report chrome hangs to Telemetry, v2

Review of attachment 587519 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: xpcom/threads/HangMonitor.cpp
@@ +191,5 @@
>            MonitorAutoUnlock unlock(*gMonitor);
> +#if defined(XP_WIN) && defined(MOZ_PROFILING)
> +          ReportChromeHang();
> +#else
> +           Crash();

Nit: indentation.
Attachment #587519 - Flags: review?(ehsan) → review+
Please test before landing, cause I sure did not!
Added reporting of information about modules in memory for server side symbolication
Attachment #587519 - Attachment is obsolete: true
Attachment #591321 - Flags: review?(ehsan)
Added reporting of information about modules in memory for server side symbolication

Merged into latest profiling branch, fixed a typo with an #ifdef
Attachment #591321 - Attachment is obsolete: true
Attachment #591321 - Flags: review?(ehsan)
Attachment #591496 - Flags: review?(ehsan)
Thanks for the sample.

irc summary:
Please don't use key/values for memory mapping. An array will do.
Don't include identical heap maps.
Comment on attachment 591496 [details] [diff] [review]
Report chrome hangs to Telemetry, v3

Vladan told me that this is not quite ready for review yet.
Attachment #591496 - Flags: review?(ehsan)
Leaner hang report format.
A) Removes modules not referenced on the call stack from the report
B) If there are multiple hang reports, the module info in the reports after the first one is a diff against the module info in the first report
Attachment #591496 - Attachment is obsolete: true
Attachment #593492 - Flags: review?(ehsan)
Note that unreferenced modules are not reported in the "memoryMap" sections of these chrome hangs.

I also modified firefox.exe's reported start address with a debugger during the second hang in order to show the "diff" format used for module info. The 2nd chrome hang report only has the one (modified) memory module because all its other modules are the same as the ones in the first hang report.
Attachment #591912 - Attachment is obsolete: true
Comment on attachment 593492 [details] [diff] [review]
Report chrome hangs to Telemetry, v4

Review of attachment 593492 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +80,5 @@
>                                    const nsACString &dbName,
>                                    PRUint32 delay);
> +  static void RecordChromeHang(PRUint32 duration,
> +                               const nsTArray<uintptr_t> &callStack,
> +                               nsTArray<mozilla::Telemetry::ModuleInfo> &moduleMap);

You should be able to say Telemetry::foo here and in the rest of the places in this file.  Same with the changes in HangMonitor.cpp.

@@ +89,5 @@
>    };
>    typedef nsBaseHashtableET<nsCStringHashKey, StmtStats> SlowSQLEntryType;
> +  struct HangReport {
> +    PRUint32 duration;
> +    nsTArray<uintptr_t> callStack;

This type recurs throughout your patch.  Maybe it's worth typedef'ing it?

@@ +664,5 @@
>    return NS_OK;
>  }
>  
>  NS_IMETHODIMP
> +TelemetryImpl::GetChromeHangs(JSContext *cx, jsval *ret)

I only skimmed over this function.  Please get somebody who knows a bit about the JS API to review this function.

There's one very bad thing about this function though.  You hit an error case, and you leak everything that you've allocated so far!

::: toolkit/components/telemetry/Telemetry.h
@@ +168,5 @@
> +            (size == other.size) &&
> +            (pdbAge == other.pdbAge) &&
> +            (pdbSignature.Equals(other.pdbSignature)));
> +  }
> +};

We have some code similar to this in the profiler.  Let's not create a second version of this.  Please switch to using http://mxr.mozilla.org/mozilla-central/source/tools/profiler/sps/shared-libraries.h.  Of course you would need to extend it because it currently doesn't have the notion of pdbAge and pdbSignature.  You should also implement an operator== for it, but that shouldn't be very hard.  ;-)

::: xpcom/threads/HangMonitor.cpp
@@ +144,5 @@
> +  callStack->AppendElement(reinterpret_cast<uintptr_t>(aPC));
> +}
> +
> +static void
> +GetModuleMap(nsTArray<mozilla::Telemetry::ModuleInfo> &moduleMap)

Please remove this code and rely on the one existing in the profiler.

@@ +173,5 @@
> +      moduleInfo.pdbAge = pdbInfo.age;
> +      moduleInfo.pdbSignature.m0 = pdbInfo.guid.Data1;
> +      moduleInfo.pdbSignature.m1 = pdbInfo.guid.Data2;
> +      moduleInfo.pdbSignature.m2 = pdbInfo.guid.Data3;
> +      memcpy(moduleInfo.pdbSignature.m3, pdbInfo.guid.Data4, sizeof(pdbInfo.guid.Data4));

You could just memcpy the entire GUID wholesale.

@@ +207,5 @@
> +    return;
> +  }
> +  nsTArray<uintptr_t> callStack;
> +  NS_StackWalk(ChromeStackWalker, 0, &callStack,
> +               reinterpret_cast<uintptr_t>(winMainThreadHandle));

OK, I should have probably complained about this sooner, so sorry about that!

When you hang, sending only the stack of the main thread is not helpful, because for example there's a chance that the hang is a deadlock involving all of the threads.  So you should really send the stack of all of the threads in the application (including the current one).

The r- is mainly coming from this point.

@@ +213,5 @@
> +  if (ret == -1) {
> +    return;
> +  }
> +
> +  // Remove all modules not referenced by a PC on the stack

Why are you doing this?  I think having a list of all of the loaded modules can be useful, even if they don't occur anywhere in our captured stack(s).

@@ +285,5 @@
>      if (timestamp != PR_INTERVAL_NO_WAIT &&
>          timestamp == lastTimestamp &&
>          gTimeout > 0) {
>        ++waitCount;
> +#if !(defined(MOZ_PROFILING) && defined(XP_WIN))

Please define a constant such as USE_NS_STACKWALK based on these macros so that you don't need to repeat this all over the place.  See the example in nsTableTicker.cpp.
Attachment #593492 - Flags: review?(ehsan) → review-
Ehsan, I'm not sure that increasing the size of the telemetry report to get all the stacks is a good first choice. If we discover that we could actually use that data later, we could include it, but for now I tend to think that we should focus just on the stack of the main thread, and not worry about the unrelated module list or other threads.
How big would this data be?  Maybe my intuition is off, but I was thinking that it's less than 1KB per thread?
(In reply to Ehsan Akhgari [:ehsan] from comment #17)
> How big would this data be?  Maybe my intuition is off, but I was thinking
> that it's less than 1KB per thread?

- A single thread's stack is about 1KB, we have ~25 threads at any given time
- For the attached sample, the full module map is about 10KB vs 0.5KB for modules needed to symbolicate main thread call stack

We can decide if we need to report on all threads and modules after we start getting hang data from the current design.
Can you suggest a person for looking over my JS API code?

As a sidenote, this chrome hang reporter will only be reporting on transient hangs... Telemetry is sent to our servers on the main thread, so deadlocks involving the main thread will never get reported regardless.
(In reply to Vladan Djeric (:vladan) from comment #18)
> As a sidenote, this chrome hang reporter will only be reporting on transient
> hangs... Telemetry is sent to our servers on the main thread, so deadlocks
> involving the main thread will never get reported regardless.

Windows crash reporter gives us info about main-thread hangs after the user kills the process, right?
I know that telemetry is the cool cool and stuff, but I still wonder a lot why we need to do this there and why we can't just end up using the infrastructure we already have in breakpad and Socorro and somehow channel reports there. Sure, we need to find a way to send data through those channels without actually crashing and we need to figure out privacy etc. (i.e. what prefs/decisions to base sending on) - but all the difficult infrastructure for both sending the kind of data we can act on and for analyzing the reports is already there.
Can we usefully explain why this must run outside this infrastructure and in telemetry instead?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #20)
> I know that telemetry is the cool cool and stuff, but I still wonder a lot
> why we need to do this there and why we can't just end up using the
> infrastructure we already have in breakpad and Socorro and somehow channel
> reports there. Sure, we need to find a way to send data through those
> channels without actually crashing and we need to figure out privacy etc.
> (i.e. what prefs/decisions to base sending on) - but all the difficult
> infrastructure for both sending the kind of data we can act on and for
> analyzing the reports is already there.

Actually privacy/etc is a difficult thing which telemetry takes care of. 

Another reason is that we can have this data inline with other telemetry metrics to help us diagnose what is going on. It doesn't make sense to have some telemetry for measuring responsiveness...and then do the rest via breakpad.
Whiteboard: [Snappy] → [Snappy:P1]
jorendorff: Could I ask you to take a quick look at my use of JSAPI in the TelemetryImpl::GetChromeHangs function?

From what I understand, since jsval *_retval is rooted by its caller and all the JS objects I allocate are connected to it I should be safe w.r.t. leaks and GC
Attachment #593492 - Attachment is obsolete: true
Attachment #597262 - Flags: review?(ehsan)
Attachment #597262 - Flags: feedback?(jorendorff)
In the latest patch (version 5), I reuse code from SPS's shared-libraries.h and apply Ehsan's style comments.

(In reply to Ehsan Akhgari [:ehsan] from comment #15)
> I only skimmed over this function.  Please get somebody who knows a bit
> about the JS API to review this function.
> 
> There's one very bad thing about this function though.  You hit an error
> case, and you leak everything that you've allocated so far!

Wouldn't GC take care of those allocations?

> @@ +173,5 @@
> > +      moduleInfo.pdbAge = pdbInfo.age;
> > +      moduleInfo.pdbSignature.m0 = pdbInfo.guid.Data1;
> > +      moduleInfo.pdbSignature.m1 = pdbInfo.guid.Data2;
> > +      moduleInfo.pdbSignature.m2 = pdbInfo.guid.Data3;
> > +      memcpy(moduleInfo.pdbSignature.m3, pdbInfo.guid.Data4, sizeof(pdbInfo.guid.Data4));
> 
> You could just memcpy the entire GUID wholesale.

Is it safe to assume the Windows GUID structure will always be bit-compatible with our nsID structure?
(In reply to Vladan Djeric (:vladan) from comment #23)
> In the latest patch (version 5), I reuse code from SPS's shared-libraries.h
> and apply Ehsan's style comments.
> 
> (In reply to Ehsan Akhgari [:ehsan] from comment #15)
> > I only skimmed over this function.  Please get somebody who knows a bit
> > about the JS API to review this function.
> > 
> > There's one very bad thing about this function though.  You hit an error
> > case, and you leak everything that you've allocated so far!
> 
> Wouldn't GC take care of those allocations?

I don't know, you need to check.

> > @@ +173,5 @@
> > > +      moduleInfo.pdbAge = pdbInfo.age;
> > > +      moduleInfo.pdbSignature.m0 = pdbInfo.guid.Data1;
> > > +      moduleInfo.pdbSignature.m1 = pdbInfo.guid.Data2;
> > > +      moduleInfo.pdbSignature.m2 = pdbInfo.guid.Data3;
> > > +      memcpy(moduleInfo.pdbSignature.m3, pdbInfo.guid.Data4, sizeof(pdbInfo.guid.Data4));
> > 
> > You could just memcpy the entire GUID wholesale.
> 
> Is it safe to assume the Windows GUID structure will always be
> bit-compatible with our nsID structure?

Ah, right, then no, that's not a good assumption.  So I retract my comment here.
Comment on attachment 597262 [details] [diff] [review]
Report chrome hangs to Telemetry, v5

Review of attachment 597262 [details] [diff] [review]:
-----------------------------------------------------------------

Looks really good!

::: toolkit/components/telemetry/Telemetry.cpp
@@ +920,5 @@
> +  }
> +
> +  HangReport newReport = { duration, callStack, moduleMap };
> +  sTelemetry->mHangReports.AppendElement(newReport);
> +  return;

I'm pretty sure the compiler will take care of this.  ;-)

::: tools/profiler/shared-libraries.h
@@ +120,5 @@
> +#ifdef XP_WIN
> +  nsID GetPdbSignature() const { return mPdbSignature; }
> +  uint32_t GetPdbAge() const { return mPdbAge; }
> +#endif
> +  char* GetName() const { return mName; }

Thanks for making these const.

@@ +177,5 @@
>    }
> +
> +  void SortByAddress()
> +  {
> +    std::sort(mEntries.begin(), mEntries.end(), CompareAddresses);

Seeing this and std::find almost made me cry happy tears!
Attachment #597262 - Flags: review?(ehsan) → review+
Comment on attachment 597262 [details] [diff] [review]
Report chrome hangs to Telemetry, v5

Yep, this is fine. Just minor comments.

>+    JSObject *pcArray = JS_NewArrayObject(cx, reportCount, nsnull);

The second argument is reportCount, which isn't what you meant. The same thing happens in one other JS_NewArrayObject call. Passing 0 as the second argument is best.

JS_DefineElement is a little better than JS_SetElement. It's a bit faster and more robust against scripts doing stupid things to Array.prototype. You might not want to bother changing it; JS_SetElement will work fine.

>+      val = INT_TO_JSVAL(module.GetEnd() - module.GetStart());

INT_TO_JSVAL takes an int32_t argument, and this is an uintptr_t. This might cause a warning on some platforms, so:

        val = INT_TO_JSVAL(int32_t(module.GetEnd() - module.GetStart()));

or even:

        val = UINT_TO_JSVAL(uint32_t(module.GetEnd() - module.GetStart()));
Attachment #597262 - Flags: feedback?(jorendorff) → feedback+
Ehsan: I removed the SymSrvGetFileIndexInfo API call with a function that manually walks the PE image to get the PDB version info.  The API call was not available in WinXP & Win2k3 versions of dbghelp.dll and I didn't want to mess with redistributing a new version of dbghelp.dll. 

The only change in this patch is the GetPdbInfo function in tools/profiler/shared-libraries-win32.cc.
Attachment #601312 - Flags: review?(ehsan)
Attachment #601312 - Attachment is obsolete: true
Attachment #601312 - Flags: review?(ehsan)
Attachment #601315 - Flags: review?(ehsan)
Attachment #601315 - Attachment is obsolete: true
Attachment #601315 - Flags: review?(ehsan)
Attachment #601336 - Flags: review?(ehsan)
Attachment #601404 - Flags: review+
Attachment #601336 - Flags: review?(ehsan)
Why is this not landing?
https://hg.mozilla.org/integration/mozilla-inbound/rev/35c41def4857
OS: Windows 7 → All
Hardware: x86 → All
Attachment #601336 - Flags: checkin+
Target Milestone: --- → mozilla13
https://hg.mozilla.org/mozilla-central/rev/35c41def4857
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 735422
Blocks: 735829
I would like to back out chrome hang from Aurora to make a few changes and give it time to soak on mozilla-central.
Attachment #605916 - Flags: approval-mozilla-aurora?
Comment on attachment 605916 [details] [diff] [review]
Back out from mozilla-aurora

[Triage Comment]
Approved for Aurora 13 - if it's still early in the Aurora cycle when we feel this has had enough time to bake, please consider nominating for uplift.
Attachment #605916 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Blocks: 722368
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: