16:42 January 31, 2012

Talos Signal from Noise: analyzing the data

Recently, a change was pushed as part of the Signal from Noise effort to improve Talos statistics: https://bugzilla.mozilla.org/show_bug.cgi?id=710484 The idea is that the way we are currently doing things skews the data and doesn't help with noise.

Currently, pageloader calculates the median after throwing out the highest point: http://hg.mozilla.org/build/pageloader/file/beca399c3a16/chrome/report.js#l114 We introduced --ignoreFirst to instead ignore the first point and calculate the median of the remaining runs.
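In Python terms, the two filters look like this (a sketch only; the real implementation is the JavaScript in report.js linked above):

```python
def median(values):
    """Median of a list of numbers."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def old_filter(runs):
    """Old behavior: throw out the single highest run, then take the median."""
    return median(sorted(runs)[:-1])

def new_filter(runs):
    """--ignoreFirst behavior: drop the first run, then take the median."""
    return median(runs[1:])

# Example: the high-point runs for big-optimizable-group-opacity-2500.svg
runs = [147.0, 1130.0, 1078.0, 92.0, 100.0]
print(old_filter(runs))  # 123.5 -- matches the median in the log below
print(new_filter(runs))  # 589.0
```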

However, after introducing the change we noticed that our distribution had gone bimodal during side by side staging:

Were we doing something other than what we thought we were doing? Were our calculations wrong? Or was something else going on?

So jmaher and I dove in to take a look at the data. jmaher dug up a high-mode and a low-mode case from the TBPL logs corresponding to the push sets displayed on graphserver.

high point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,109,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|109|average|354.25|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;123.5;354.25;92;1130;147;1130;1078;92;100
NOISE: |1;small-group-opacity-2500.svg;109;2333.25;103;9247;103;9012;9247;111;107
NOISE: __end_tp_report
low point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,108,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|108|average|113.00|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
NOISE: |1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
NOISE: __end_tp_report

From http://pastebin.mozilla.org/1470000 .

Since, being a mere human, I can't really read this, I modified results.py to parse the data:

+if __name__ == '__main__':
+    import sys
+    string_high = """..."""  # the high-point log snippet, pasted verbatim
+    string_low = """..."""   # the low-point log snippet, pasted verbatim
+    big = PageloaderResults(string_high)
+    small = PageloaderResults(string_low)
+    import pdb; pdb.set_trace()

This makes some PageloaderResults objects that can be explored with pdb. While I did this as a one-off hack, it's something we'll probably want more generally as part of Signal from Noise: https://bugzilla.mozilla.org/show_bug.cgi?id=722915

Then I looked at the data:

(Pdb) pp(small.results)
[{'index': '|0',
  'max': 1132.0,
  'mean': 353.75,
  'median': 119.0,
  'min': 91.0,
  'page': 'big-optimizable-group-opacity-2500.svg',
  'runs': [139.0, 1132.0, 1086.0, 91.0, 99.0]},
 {'index': '|1',
  'max': 9116.0,
  'mean': 113.0,
  'median': 108.0,
  'min': 103.0,
  'page': 'small-group-opacity-2500.svg',
  'runs': [103.0, 133.0, 9116.0, 108.0, 108.0]}]
(Pdb) pp(big.results)
[{'index': '|0',
  'max': 1130.0,
  'mean': 354.25,
  'median': 123.5,
  'min': 92.0,
  'page': 'big-optimizable-group-opacity-2500.svg',
  'runs': [147.0, 1130.0, 1078.0, 92.0, 100.0]},
 {'index': '|1',
  'max': 9247.0,
  'mean': 2333.25,
  'median': 109.0,
  'min': 103.0,
  'page': 'small-group-opacity-2500.svg',
  'runs': [103.0, 9012.0, 9247.0, 111.0, 107.0]}]

You'll notice a few things from the runs data:

  • the runs data is indeed bifurcated: in every case there are low values, around a hundred, and high values in the thousands
  • contrary to the assumption that the first datapoint may be biased high, you can't really see any such bias, at least compared to the magnitude of the bifurcation
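A quick check of the first runs backs this up (the 500ms cutoff between the two modes is my own ad hoc choice, not anything Talos defines):

```python
# Run data from the two logs above. In every case the first run falls
# in the low population, not a biased-high outlier.
all_runs = {
    ("high", "big-optimizable-group-opacity-2500.svg"): [147, 1130, 1078, 92, 100],
    ("high", "small-group-opacity-2500.svg"): [103, 9012, 9247, 111, 107],
    ("low", "big-optimizable-group-opacity-2500.svg"): [139, 1132, 1086, 91, 99],
    ("low", "small-group-opacity-2500.svg"): [103, 133, 9116, 108, 108],
}
for key, runs in all_runs.items():
    first = runs[0]
    mode = "low" if first < 500 else "high"
    print(key, "first run:", first, "->", mode, "population")
```

All four first runs (147, 103, 139, 103) land in the low population.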

So how does this compare to the graphserver results? http://graphs-new.mozilla.org/graph.html#tests=[[170,1,21],[57,1,21]]&sel=1327791635000,1328041307110&displayrange=7&datatype=running

For the old data and the low value of the new data, we see times around 110-120ms. The high value of the new data is around 590ms. Are these numbers what we'd expect?

Throwing away the highest value and taking the median (the old algorithm) gives a number on the order of 100 for both data sets. The median acts as a filter that pulls the bifurcated results towards the majority population. Since the low population is slightly in the majority, dropping the highest number the way pageloader does biases further towards it. It is not surprising that we see no bifurcation in the old data.
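As a sanity check, a small Python sketch of the drop-highest-then-median filter reproduces every median shown in the two logs above:

```python
def drop_highest_median(runs):
    """Old pageloader behavior: throw out the highest run, median the rest."""
    s = sorted(runs)[:-1]
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2.0 if n % 2 == 0 else float(s[n // 2])

logs = {
    "high/big": [147, 1130, 1078, 92, 100],
    "high/small": [103, 9012, 9247, 111, 107],
    "low/big": [139, 1132, 1086, 91, 99],
    "low/small": [103, 133, 9116, 108, 108],
}
for name, runs in logs.items():
    print(name, drop_highest_median(runs))
# high/big 123.5, high/small 109.0, low/big 119.0, low/small 108.0:
# every result lands near the low mode, so no bifurcation is visible.
```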

For the new data, we drop the first run. Coincidentally or not, in the cases studied the first run was part of the low population, so dropping it tends towards bifurcation. Taking the median of the remaining data points gives:

High case:

  • big-optimizable-group-opacity-2500.svg : (1078 + 100) / 2 = 589
  • small-group-opacity-2500.svg : (9012 + 111) / 2 = 4561.5

Low case:

  • big-optimizable-group-opacity-2500.svg : (99 + 1086) / 2 = 592.5
  • small-group-opacity-2500.svg : (133 + 108) / 2  =  120.5
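The same arithmetic as a sketch (the helper name is mine, not pageloader's):

```python
def ignore_first_median(runs):
    """--ignoreFirst behavior: drop the first run, median the rest."""
    s = sorted(runs[1:])
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2.0 if n % 2 == 0 else float(s[n // 2])

print(ignore_first_median([147, 1130, 1078, 92, 100]))   # high big:   589.0
print(ignore_first_median([103, 9012, 9247, 111, 107]))  # high small: 4561.5
print(ignore_first_median([139, 1132, 1086, 91, 99]))    # low big:    592.5
print(ignore_first_median([103, 133, 9116, 108, 108]))   # low small:  120.5
```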

So why does the high case come out high and the low case come out low? There is even more magic. Graphserver reports an average by taking the mean of all the pages after discarding the highest result: http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208 (from http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l265 from http://hg.mozilla.org/graphs/file/d93235e751c1/server/collect.cgi ). In the high case, both pages exhibit the high value of the bifurcation, so graphserver reports the lower of the two bifurcated values: 589, from big-optimizable-group-opacity-2500.svg. In the low case, only one of the pages is bifurcated, so you get the low value: 120.5, from small-group-opacity-2500.svg .
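A minimal sketch of that amalgamation step (a Python stand-in for the collect.py logic linked above; with only two pages, it simply reports the lower one):

```python
def graphserver_average(page_values):
    """Discard the single highest per-page value, then average the rest."""
    kept = sorted(page_values)[:-1]
    return sum(kept) / len(kept)

print(graphserver_average([589.0, 4561.5]))  # high case -> 589.0
print(graphserver_average([592.5, 120.5]))   # low case  -> 120.5
```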

Okay, mystery solved. We know why graphserver is reporting the data it is reporting, and we know that our algorithm is doing what we think it is doing. However, this is the beginning of the problem rather than the end.

By averaging two data points after discarding the higher one, we are doing something weird and wrong: we are effectively reporting only one of the two pages. Note that in the high and the low cases we are actually viewing data from different pages! This is misleading and probably outright wrong. We essentially run two pages just to throw one of them away, and then we have no confidence in what we are looking at. I'm not sure the code at http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208 would even work for a single page. Probably not. In general, I grow increasingly skeptical of our amalgamation of results. We increasingly need to be able to get at and manipulate the raw data, and we certainly need a way of digging into the stats so that we know what we're looking at and can have confidence in it. Talos, pageloader, and graphserver all need to be made both easier to try new filters with and more transparent about what is actually happening.

We have been trying to bias towards the low numbers. Looking at the data for the four tests shows that there are 13 low-state numbers and 7 high-state numbers. While there are more numbers in the low state, it is not an overwhelming majority.
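That tally can be reproduced with a quick sketch, again using 500ms as an ad hoc cutoff between the two modes (the cutoff is my assumption, not anything Talos defines):

```python
# All twenty runs from the four run sets in the logs above.
all_runs = [
    [147, 1130, 1078, 92, 100],   # high case, big page
    [103, 9012, 9247, 111, 107],  # high case, small page
    [139, 1132, 1086, 91, 99],    # low case, big page
    [103, 133, 9116, 108, 108],   # low case, small page
]
flat = [r for runs in all_runs for r in runs]
low = sum(1 for r in flat if r < 500)
high = sum(1 for r in flat if r >= 500)
print(low, high)  # 13 low-state runs, 7 high-state runs
```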

This leaves the big elephant in the room: why are these runs bifurcated? Are we hitting a different code path, or is something else happening on these builders that leads to bifurcated results? While this will be challenging to investigate, IMHO we should know why this happens. While our method of throwing out the highest data point, taking the median, sending the data to graphserver, then getting back the average of the whole pageset has the positive effect of minimizing noise (which is important), it also sweeps a lot under the rug. We need to have confidence that what we're ignoring is okay to ignore. I don't have that confidence yet.