15:48 January 8, 2013

Mozilla Automation and Testing: Signal from Noise, 2012

We've written up what we've been doing as part of the huge effort of the Signal from Noise project.

Look at:

10:20 November 13, 2012

Perils of Version Pegging in Python Packaging

When working on an ecosystem of python packages where some packages depend on other packages, the question arises of which versions of the dependencies to require. There are three basic choices (a setup.py sketch follows the list):

  1. Unpegged: If foo depends on bar, allow any version of bar to be used.
  2. Exactly pegged: If foo depends on bar, require a specific version of bar. This is done in python with the string bar == 3.14 to require version 3.14 of bar.
  3. Forward compatible: If foo depends on bar, require a minimum version of bar. This is done in python with the string bar >= 3.14 to require at least version 3.14 of bar.
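
To make these choices concrete, here is a minimal setup.py sketch (hypothetical package names and versions) showing where each kind of requirement string goes:

# setup.py -- a hypothetical consumer package, foo, illustrating the three
# pegging strategies for its dependency, bar (pick one per dependency)
from setuptools import setup

setup(name='foo',
      version='1.0',
      packages=['foo'],
      install_requires=[
          'bar',                   # 1. unpegged: any version of bar satisfies the dependency
          # 'bar == 3.14',         # 2. exactly pegged: only version 3.14 is acceptable
          # 'bar >= 3.14',         # 3. forward compatible: version 3.14 or anything newer
          # 'bar >= 3.14, < 4.0',  #    a common refinement: newer, but below a known API break
      ])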

There is no magic bullet: all of these strategies have advantages and disadvantages. In general, the API of dependencies will change, and a consumer of a particular version will only work with a certain range of versions of the dependency. Because it is in general unknown whether the next version of a dependency will break the API for consuming software, there is no blanket strategy whereby compatibility can be guaranteed via a setup.py file.

Considering the cases, case 1 allows for the most flexibility: if any version of the dependency (bar) is installed, the dependency is satisfied. (Otherwise, the latest version of the dependency will be downloaded and installed from e.g. http://pypi.python.org/ .) However, case 1 is very vulnerable to API changes in the dependency: it does nothing to ensure that the dependency is compatible with the consuming software. Assuming that the latest versions of a set of packages are internally compatible, a fresh install will give an internally compatible set of packages. However, if a package is updated, there is nothing to guarantee that the API remains compatible.

Case 2 is the most strict: the consuming package demands a particular version of a dependency. If this strategy is followed for all dependencies, it can be assured that a particular version of the consuming software (foo) uses a compatible version of the dependency (bar). However, this comes at the price of losing forward compatibility. If a new version of the dependency (bar) is available, it will not be used, regardless of compatibility.

Case 3 seeks to balance the alternatives: the consuming package demands a version of a dependency of at least a given version. This protects against using an API that is too old for the package of interest. This strategy also allows newer versions of the dependency to be installed without complaint. If the API hasn't changed, then this is good. However, it still does not protect against API changes. If the newest version of bar has a different API from the minimum version specified in foo's setup.py, setup.py won't complain, but the software will not work. Ideally, one would be able to note post facto that there was an API-breaking change in the new version and that all software pegged to bar >= 0.1 should really be pegged to bar >= 0.1, bar < 1.0. However, once a distribution of (e.g.) foo is released, it cannot meaningfully be re-released.

15:10 August 29, 2012

Mozilla Automation and Testing: How Talos Works and Why SfN is Hard

I've had several conversations since starting the Signal from Noise project about enhancing the statistical fidelity of Talos numbers, most of which boil down to "Why is this hard?". From a developer point of view, you look at http://graphs.mozilla.org/ for a particular test and you see a nice number per changeset. The numbers might be a little rough (or very rough), but things are good enough, right? We just need to make the numbers a little better and turn TBPL orange on failure.

The truth of the matter is that those nice series of numbers hide a whole story behind them. For complex software like Firefox, performance testing is not an easy problem. Talos performance testing has historically been done by engineers who wanted to have some numbers to compare. While this is often how software starts -- throwing things together -- it should not be mistaken for rigorous or extensible.

Where we are now

I debated whether to start with how things currently look or how things should look. While starting with how things should look gives an unfettered view of Firefox performance testing, I've decided to start with how things currently work, both for those familiar with the current system and to emphasize the challenges of getting from here to where we need to go. I'm not justifying (or contesting, for that matter) the decisions as to why it's done the way that it is. I'm just trying to explain it.

To start off, we have two kinds of tests: startup tests and page load tests. In the interest of time and simplicity, let's pretend that startup tests start the browser, load a URL, measure the time at an event (onload or mozAfterPaint), and then shut down the browser; and that page load tests start the browser, load a list of pages from a manifest (each page N times), and likewise measure at some event. There are many variations on this theme: tests can compute their own metrics, you can load the pages in different order, etc., but the above is the basic idea.

From this, we get a series of numbers. For startup tests, it is just a list of (e.g.) times. For page load tests, you get (e.g.) N numbers per page.

Outside of a little streamlining and a lot of details I'm glossing over, we mostly want to keep the above procedure. The disparity begins with what we do with those numbers.

In order to send data to our graphserver, we have to get the data for each test into a format that graphserver likes. Since the startup test results are just a list of (e.g.) page load times, these can be directly translated to the graphserver format, using NULL for page names. For page load tests, on the other hand, we have N numbers for each page. So Talos averages the values for each page and sends a list of averages. Note that this average may not be a straight mean. The default is to ignore the maximum value (per page) and take the mean of the remaining iterations. But this is configurable per-test.
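
As an illustration, a minimal sketch of that default per-page summarization (drop the single maximum, then take the mean of the rest) might look like the following; this is a simplification, not the production Talos code:

def drop_max_mean(values):
    """Summarize one page's replicates as described above: ignore the
    single maximum value and take the mean of what remains."""
    values = sorted(values)[:-1]  # discard the (one) maximum value
    return sum(values) / float(len(values))

# hypothetical page load times (ms) for one page, N=5 iterations
print(drop_max_mean([123.0, 117.0, 119.0, 580.0, 121.0]))  # -> 120.0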

The list of numbers and page names is uploaded to the graphserver. When you look at graphserver, you see a single point for each test for each changeset, not a list. This is because graphserver does additional averaging across the page set, ignoring the maximum value and taking the mean of the remaining numbers.

This is the crux of the problem. Every time an average (of any form) is taken, you're reducing a spectrum to a single scalar. While having a single number makes it easy to read and deal with, what that number means is obscured. The graphserver averaging is particularly hazardous. Since we average across pages, we are averaging numbers that may be of very different scales. So pages that take longer to load/render have more weight than pages that take less time to load...EXCEPT we throw away the most expensive page. If you think about it, this is strange: the most expensive page is likely to be consistent from run to run (if it's not, then other strange things could happen in this averaging). We run this page many times, upload it, and then ignore it.

The averaging on the Talos side is also problematic, though more subtly. As documented in Larres' thesis, load/render times for a particular page do not follow a bell curve distribution. Multi-modal distributions are often seen in practice, and dropping the maximum value was (probably) done in order to nudge the data towards the lowest mode. However, when this doesn't work, the averaging is just misleading. While several hypotheses have been proposed, no one ultimately knows what conditions cause the multi-modality. This would be a worthy field of study in its own right.

So now we've reduced (in the page load test case) a 2d array of numbers into a single number per test (or page set, depending on your perspective) per changeset for display on http://graphs.mozilla.org/ . Now how do we detect regressions?

A regression:

http://graphs.mozilla.org/graph.html#tests=[[230,131,12]]&sel=1346117321000,1346290121000&displayrange=7&datatype=running

According to our documentation, "to determine whether a given point is 'good' or 'bad', we take 20-30 points of historical data, and 5 points of future data." Of course, it doesn't say how we use this data. Larres tells us more here: https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 Essentially, we make two windows: one before the data point, and one after the data point. We use a t-test to see if there is statistical significance between the two series. If a regression is detected, the dev-tree-management list is emailed. Then mbrubeck actually looks at the data and tries to figure out if it's an actual regression. I get fifty or more of these emails per day. Most of them don't appear to be actual regressions, at least to the naked eye, given the amount of noise in the various data sets.

Using the "past" and "future" window is intended as a "before" and "after" picture. However, a big implicit assumption is that the numbers are flat for each segment: that is to say that nothing is pushed in the range before alters performance and nothing pushed pushed in the range after alters performance. This is a pretty brazen assumption and has certainly been wrong long enough in practice.

While looking at a single number on graphserver is very convenient, it is also misleading. The statistics applied to Talos data to determine whether a performance regression or improvement has occurred are a good example of a very engineer-y metric: various tactics are tried until something is found that is sorta stable and "looks right", but it isn't clear at all what it measures or how rigorous it is.

We can do better.

Where we want to be

The most important part of solving a problem is to get people to care about the problem. A critical part of getting software engineers to care about a problem is to build a system that is easy (and maybe even fun) to use. We need to be able to rigorously identify regressions. This is a hard task. If a regression is seen, whatever UI we build for it should clearly display it and clearly display why it is a regression. One artefact of our current system where we reduce all the data into a single number per changeset is that it is not at all clear if the regression is real or noise. We have no ability to drill down in the data and see which pages regressed. We have no particular clue as to what happened. In fact, regressions aren't marked on the chart at all.

Another critical part is sending the right signal to the right people. This means getting rigorous regression information into the hands of people that can understand (with the tools given) the extent of the regression and can hopefully help determine why. TBPL should go orange if a regression is pushed. This does not mean that we can never have regressions -- that is unrealistic, as desired features may require performance regressions to implement, as well as trade-off decisions between competing performance metrics (the infamous example being speed vs memory).

https://wiki.mozilla.org/Auto-tools/Projects/Signal_From_Noise and other blog posts here have discussed in detail why we want to keep the full spectrum of numbers that we get. We don't measure our noise levels. We don't know how many samples are required for convergence or if we've reached it or if that's even possible (though Larres has done some analysis). We need to know this. Some tests we might run for too many iterations. Others we run for most assuredly far too few.

And, as discussed elsewhere, I think it is extremely important that people actually look at this data. Having a system that is easy to use, rigorous, and that documents how it does its calculations will be a huge help, as people would actually have a reason to want to use it. But we also need someone who is really ready and willing to drill down and mine the data for the knowledge it contains. Why do we get multi-modal distributions? What are we testing? What aren't we testing?

If you think this all sounds hard....it is! It's a lot of work and there aren't many appreciable shortcuts. Much of our work thus far has been ripping out hacks that were made for expediency in the past and replacing them with less hacky code. There are some things worth doing right. Going without performance tests for Firefox is pretty much unthinkable, so we're left with the alternative: actually making a system that works.

13:53 July 24, 2012

Mozilla Automation and Testing : The Naming of (Talos) Things

As a Talos developer, I have found it confusing how Talos tests are named on TBPL and graphserver. I am not alone: https://bugzilla.mozilla.org/show_bug.cgi?id=770460 So I sat down to figure out how Talos is run by buildbot and how to correlate test names across Talos, buildbot, graphserver, and TBPL.

Buildbot initiates PerfConfigurator to generate a YAML configuration file, which is then executed by run_tests.py. This may invoke any number of tests. Talos reports this information to the graphserver. The buildbot suite name is reported to TBPL, as are the links returned from graphserver.

See also: https://wiki.mozilla.org/Buildbot/Talos#How_Talos_is_Run_in_Production

Buildbot

I set out to make a script that gathered this information and followed the information flow. The basic buildbot configuration is found in http://hg.mozilla.org/build/buildbot-configs/raw-file/tip/mozilla-tests/config.py . While I only needed the SUITES variable, which contains the name as reported to TBPL as well as the Talos command line for each suite, the entire file has to be imported and read by python to work. So I added buildbot as a package dependency. In addition, I had to mock the project_branches.py and localconfig.py files. For localconfig, I purely stubbed it, since I didn't need it anyway: http://k0s.org/mozilla/hg/talosnames/raw-file/tip/talosnames/localconfig.py For project_branches.py, I could have pulled this down in real time, and should for up-to-date information, but for momentary expedience I just copied it: http://k0s.org/mozilla/hg/talosnames/file/tip/talosnames/project_branches.py

Talos

This takes care of the buildbot information. For desktop talos, it is then possible to call PerfConfigurator with the arguments from mozilla-tests/config.py and generate a Talos configuration file. remotePerfConfigurator currently requires a device to be attached in order to work correctly, so I punted on that problem for the time being. Having the config file, it can be read to introspect how the tests are being run.

Hovering over a talos letter on TBPL, you can see the full name of the associated (TBPL) suite, e.g. "Talos nochrome opt was successful, took 12mins" when one hovers over T (n). If you click on the n, you will see the name of the suite as reported by buildbot: "Rev4 MacOSX Lion 10.7 mozilla-central talos nochromer". Note the nochromer, from http://hg.mozilla.org/build/buildbot-configs/file/68c191f31d39/mozilla-tests/config.py#l291 You can also see the name of the test as reported to graphserver, in this case:

tdhtmlr_nochrome_paint: 738.29

Here the 738.29 is a link to the graphserver data. The name, tdhtmlr_nochrome_paint, is the name of the talos test plus the test name extension, which applies for page load tests but not for startup tests: http://hg.mozilla.org/build/talos/file/de24503258c7/talos/output.py#l174

The test_name_extension appends _nochrome and/or _paint depending on if these flags are set, usually via --noChrome and --mozAfterPaint arguments to PerfConfigurator. In order to determine the correct test name extension, I used the talos.test module and inspected if the test was a subclass of TsBase or PageloaderTest : http://k0s.org/mozilla/hg/talosnames/file/ef8590b55605/talosnames/web.py#l63
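
A rough sketch of that check, simplified from what talosnames actually does and assuming the classes are importable from talos.test:

from talos.test import PageloaderTest, TsBase  # classes mentioned above; import path assumed

def test_name_extension(test_class, nochrome=False, mozafterpaint=False):
    """Build the graphserver-style suffix; only page load tests get one."""
    if issubclass(test_class, TsBase) or not issubclass(test_class, PageloaderTest):
        return ''  # startup tests report their bare name
    extension = ''
    if nochrome:
        extension += '_nochrome'   # e.g. from --noChrome
    if mozafterpaint:
        extension += '_paint'      # e.g. from --mozAfterPaint
    return extension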

TBPL

TBPL determines its suite name and letter from a long if..else regex matching chain in Data.js: http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/bad7f21362be/js/Data.js#l512 This takes the buildbot suite name, applies some magic and glue, and yields its long name, which is then matched up in http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/tip/js/Config.js to get the TBPL initial. Rather than trying to stub how this is done, I took advantage of the structure of this file in a horrible hack, whereby I matched the regexes with a regex and then extracted the information I wanted from them (don't try this at home, kids!): http://k0s.org/mozilla/hg/talosnames/file/ef8590b55605/talosnames/api.py#l77 This is highly undesirable, but it does work (for the time being).

Graphserver

So we have the buildbot, TBPL, and Talos sides of things figured out, nicely lining us up to tackle graphserver. Graphserver details the test mapping from short name to long name in the rather Kafka-esque data.sql schema: http://hg.mozilla.org/graphs/file/da54bac92c1b/sql/data.sql#l2568 I wanted to at least get the long graphserver names from the short names, as these are the only strings displayed in the UI. So I created an in-memory database using SQLite, as there was no desire to persist the data, just read it, and SQLite is built into python and avoids database-deployment woes. The table definitions were not SQLite-compatible, so I created my own table definitions. unix_timestamp() is not a SQLite function, so I removed lines containing a reference to it. Fortunately, this does not affect any of the test lines I care about.
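
The approach boils down to something like the following sketch; the table definition here is a simplified stand-in for the SQLite-compatible definitions talosnames actually uses, and it assumes one INSERT statement per line in data.sql:

import sqlite3

# simplified stand-in schema; the real data.sql has more tables and columns
SCHEMA = "CREATE TABLE tests (id INTEGER, pretty_name TEXT, name TEXT);"

def load_data_sql(path):
    db = sqlite3.connect(':memory:')  # nothing is persisted; we only want to read it
    db.executescript(SCHEMA)
    with open(path) as f:
        for line in f:
            if 'unix_timestamp(' in line:
                continue  # unix_timestamp() is not a SQLite function; skip those lines
            if line.strip().upper().startswith('INSERT'):
                db.execute(line)
    return db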

Putting it all together you get a table following the information flow:

  • buildbot has test suites which contain arguments to PerfConfigurator
  • PerfConfigurator generates a YAML file which is used as configuration to run one or more tests
  • the tests report results to graphserver and the resulting links are displayed on TBPL
  • the buildbot suite is reported to TBPL
  • graphserver maps the Talos test names, plus an extension for the page load test case, to a full name displayed in its UI

I called the script I wrote to parse all of this talosnames: http://k0s.org/mozilla/hg/talosnames . It's one of the messiest scripts I've ever written, though I suppose it's partially amazing, given that no one ever thought about doing this before, that it was possible to write at all.

Currently, talosnames outputs just a single page, which I host here: http://k0s.org/mozilla/talos/talosnames.html It's not dynamic currently, but if it needs to be regenerated please feel free to ping me and I can do this.

How this could be easier

In general, this was mostly an exercise in untangling a web that we ourselves wove. If we had decided and stuck with conventions up front, there would be nothing to do here.

  • if Data.js was a JSON structure, talosnames could read this JSON and do the regex matching itself: https://bugzilla.mozilla.org/show_bug.cgi?id=774942
  • if talos test name extensions didn't depend on startup test vs page load test the world would be a better place
  • remotePerfConfigurator currently requires a device to be attached to generate configuration: https://bugzilla.mozilla.org/show_bug.cgi?id=775221 . If remotePerfConfigurator could work sans a device, we could generate and inspect this test information in talosnames.
  • I couldn't really figure out which buildbot command lines were for desktop and which were for mobile. I probably could have eventually tracked this down, or done a much easier hack whereby, if --fennecIDs was in the command line, I'd call remotePerfConfigurator, though the above prevented action on this anyway
  • data.sql should mostly go away
  • up-to-date data structures: talosnames grabs the tip of TBPL's Config.js and Data.js, buildbot-configs tip's config.py, and graphserver tip's data.sql. While this gets the latest information, it is unknown what the deployment state of any of these files is.

TODO

While I am glad to be able to sort this out a bit, a lot more could be done given the time.

  • a TBPL-like view that displays the TBPL abbreviations and maps to buildbot suites and tests
  • list which buildbot suites are active or inactive
  • Talos counters: the talos test config lists some of the counters (although not all) in http://k0s.org/mozilla/talos/talosnames.html . Graphserver, on the other hand, has entries for each of these counters on a per-test basis. Counters are mostly a mess in Talos. It would be nice to consolidate them there and display in talosnames all of the counters associated with a Talos test.

09:03 May 14, 2012

Automation and Testing : Overhaul of Talos Configuration

Last week I pushed a fix to bug 704654 that fixes a number of issues, conceptual and user-facing, with how Talos handles configuration. I've had an idea of how I wanted to do this for a few months now, but it has always been tabled. Then came my (joking, sorry) pledge to Bob Moss to fix all bugs in Talos by the end of the quarter.

I had a free weekend, so instead of killing the prerequisite bugs one by one as I usually do, I decided to tackle the problem in one go. My goals:

I hear people prefer blog posts with pictures, so for no reason here is a bunch of cute foxes:

/mozilla/images/panda_adoption.jpg

I've moved the basis of the Talos configuration to PerfConfigurator.py instead of some combination of .config files, PerfConfigurator.py, and run_tests.py. This gets rid of the duplication between the various config files as well as among the command line options. In fact, there isn't much left of the configuration files.

I don't like configuration to live in code, and so empathize with those who look at this cautiously from that point of view. However, PerfConfigurator following my rework isn't so much configuration, but a configuration basis. Given the goals above, some piece of code has to validate a given configuration, has to know what data is in a configuration, and has to provide whatever command line options are used to front-end the configuration. The previous incarnation of Talos and PerfConfigurator had a significant amount of code to this end, but it was both spread out and incomplete. So I don't think putting it all in one place is a big conceptual change. Having a piece of code that knows the allowable form of configuration gives great power and having the code all in one place just makes it more human-readable.

The unofficial history of Talos configuration, as I understand it, goes something like this: initially, there was one configuration file. You copied it, edited it by hand, and ran your tests with it. At some point, this became cumbersome, and PerfConfigurator was created to automatically fill in values from a set of command-line choices and, in addition, allow the values to be marked up a bit. The road was already paved for some part of the configuration basis living in code versus in the .config file. Then, as the need to run tests in different configurations grew, .config files flourished to this end. I'd like to think of the changes for bug 704654 as the next logical step in Talos's configuration evolution.

Longer term, we'd like to remove even more of Talos's configuration and replace .yml files with command line options. The complexity of configuration will be managed by mozharness.

09:33 April 25, 2012

Automation and Testing : Considering a Page-Centric Talos

Currently, the canonical unit of Talos tests is a page set. However, a page-centric point of view offers several intrinsic advantages on top of being, in my opinion, more conceptually coherent.

A page-centric point of view allows easy adding and updating of pages. Currently, making a new page set is a big deal. Since we average over all pages in a page set to obtain a quality metric, adding a new page (or removing a page) will change this number, and the entire baseline for comparison has to be recentered. If we made the page the canonical unit of testing, then adding or removing a page wouldn't involve recentering, as each page would have a quality metric associated with it.

Taking an average over all pages to get a quality metric, as we do, gives a higher weight to pages that take (e.g.) longer to load. For instance, consider the output for tsvg:

|i|pagename|runs|
|0;gearflowers.svg;79;65;68;68;67
|1;composite-scale.svg;46;35;44;41;42
|2;composite-scale-opacity.svg;21;22;24;22;20
|3;composite-scale-rotate.svg;23;21;21;20;19
|4;composite-scale-rotate-opacity.svg;19;24;19;19;23
|5;hixie-001.xml;45643;14976;17807;14971;17235
|6;hixie-002.xml;51257;15193;21693;14969;14974
|7;hixie-003.xml;5016;37375;5021;5024;5008
|8;hixie-004.xml;5052;5053;5054;5054;5053
|9;hixie-005.xml;4618;4533;4611;4532;4554
|10;hixie-006.xml;5059;5107;9741;5107;5089
|11;hixie-007.xml;1629;1651;1648;1652;1649

A performance loss (or gain) in e.g. gearflowers.svg is likely not to be noticed in this pageset, as its numbers are several orders of magnitude lower than those of (e.g.) hixie-002.xml, so small percentage-wise noise in the latter could easily hide a legitimate regression in the former.
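
To put rough numbers on that, here is a small sketch using two made-up pages of the same magnitudes as the table above; a 50% regression in the fast page moves the cross-page mean far less than 2% noise in the slow page:

# rough per-page values (ms) in the spirit of the tsvg output above
baseline = {'gearflowers.svg': 68, 'hixie-002.xml': 15000}

def page_average(values):
    return sum(values) / float(len(values))

base = page_average(baseline.values())

# a 50% regression in the small page vs. 2% noise in the big page
regressed = page_average([68 * 1.5, 15000])   # gearflowers regresses badly
noisy = page_average([68, 15000 * 1.02])      # hixie-002 merely wiggles

print(base)       # 7534.0
print(regressed)  # 7551.0 -> +0.2% change from a 50% regression
print(noisy)      # 7684.0 -> +2.0% change from ordinary noise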

Having this additional data of what changes regress which pages allows us to explore how these particular page modifications affect performance. If we can isolate patterns, we can fix them.

One conceptual disadvantage to a page-centric approach is that deciding whether a changeset is a net regression or not becomes harder. Ideally a human (or other expert system) would evaluate all of the data across pages and decide whether a change is a regression or not. However, we have many pages and not enough people, so this is harder to do than crafting a formula for a quality metric. To obtain an overall quality metric for a push, some sort of averaging over pages must be done. We currently throw away the highest value and take the mean of the remaining page averages. If we continue with this approach, we throw away the ability to easily add and remove pages without futzing with the metric. Instead, a method should be sought whereby adding a new page does not affect the metric.

12:44 February 15, 2012

Talos Signal from Noise: Configurable Talos Data Filters

As part of Signal from Noise I introduced a patch that changes the way --ignoreFirst works and adds configurable data filters to Talos:

While this is a small change in terms of how the code currently works, it lays the groundwork for a window of possibilities in terms of Talos statistics. Currently, pageloader calculates the "median" (ignoring the high value), the mean, the max, and the min, and outputs these along with the raw run data. Pageloader is for loading pages and taking measurements, not really for doing statistics. So it would be nice to move this upstream: first to Talos, then to graphserver proper.

Being able to specify data filters with --filter from the command line and filter: in the .yml configuration file allows the test-runner to change the "interesting number" by which we measure performance metrics on the fly. While there are currently only a few filters available, it is easy to add more metrics as we need them.
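
Conceptually, a filter is just a function that reduces a list of replicates to a smaller list or a single number, and a --filter specification chains them together. A minimal sketch of the idea (not the actual Talos filter module):

def ignore_first(values):
    """drop the first replicate (the --ignoreFirst behaviour)"""
    return values[1:]

def ignore_max(values):
    """drop the single highest replicate"""
    return sorted(values)[:-1]

def median(values):
    values = sorted(values)
    middle = len(values) // 2
    if len(values) % 2:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2.0

def apply_filters(values, filters):
    """reduce raw replicates to an 'interesting number' via a filter chain"""
    for f in filters:
        values = f(values)
    return values

# e.g. something like --filter ignore_first,median
print(apply_filters([147.0, 1130.0, 1078.0, 92.0, 100.0], [ignore_first, median]))  # 589.0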

In a parallel effort, the JetPerf software consumes Talos filters. This is a good example of the expansion of the Talos ecosystem: building tests and frameworks on top of Talos as a critical part of our performance testing infrastructure. In general, the A Team is moving towards a testing ecosystem of reusable parts and sane APIs.

Data filters were added to talos as an interim measure to make the "interesting number" calculations more flexible. As we play with different types of statistics, we need the ability to change configuration without having to jump through too many hoops, and this fulfills that immediate need.

However, in the longer term, Talos and pageloader shouldn't really be doing statistics at all. They are in the "statistics gathering" camp, whereas graphserver is in the "statistics processing" business. It would also be nice if there were a piece of software that let you analyze Talos results locally, ideally using the same statistics processing package that graphserver uses. This is outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=721902 .

http://k0s.org/mozilla/talos/bug-721902.gv.txt

16:42 January 31, 2012

Talos Signal from Noise: analyzing the data

Recently, a change was pushed as part of the Signal from Noise effort in order to make Talos statistics better: https://bugzilla.mozilla.org/show_bug.cgi?id=710484 The idea being that the way we are doing things is skewing the data and not helping with noise.

Currently, pageloader calculates the median after throwing out the highest point: http://hg.mozilla.org/build/pageloader/file/beca399c3a16/chrome/report.js#l114 We introduced --ignoreFirst to instead ignore the first point and calculate the median of the remaining runs.

However, after introducing the change, we noticed that our distribution had gone bimodal during side-by-side staging:

Were we doing something other than what we thought we were doing? Were our calculations wrong? Or was something else going on?

So jmaher and I dove in to take a look at the data. jmaher dug up a high-mode and a low-mode case from the TBPL logs corresponding to the push sets displayed on graphserver.

https://tbpl.mozilla.org/php/getParsedLog.php?id=8982519&tree=Firefox&full=1
high point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,109,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|109|average|354.25|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;123.5;354.25;92;1130;147;1130;1078;92;100
NOISE: |1;small-group-opacity-2500.svg;109;2333.25;103;9247;103;9012;9247;111;107
NOISE: __end_tp_report
https://tbpl.mozilla.org/php/getParsedLog.php?id=8982267&tree=Firefox&full=1
low point:
NOISE: __start_tp_report
NOISE: _x_x_mozilla_page_load,108,NaN,NaN
NOISE: _x_x_mozilla_page_load_details,avgmedian|108|average|113.00|minimum|NaN|maximum|NaN|stddev|NaN
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
NOISE: |1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
NOISE: __end_tp_report

From http://pastebin.mozilla.org/1470000 .

Since I can't really read this, being a mere human being, I modified results.py to parse this data:

+
+if __name__ == '__main__':
+    import sys
+    string_high = """
+|0;big-optimizable-group-opacity-2500.svg;123.5;354.25;92;1130;147;1130;1078;92;100
+|1;small-group-opacity-2500.svg;109;2333.25;103;9247;103;9012;9247;111;107
+"""
+    string_low = """
+|0;big-optimizable-group-opacity-2500.svg;119;353.75;91;1132;139;1132;1086;91;99
+|1;small-group-opacity-2500.svg;108;113;103;9116;103;133;9116;108;108
+"""
+    big = PageloaderResults(string_high)
+    small = PageloaderResults(string_low)
+    import pdb; pdb.set_trace()

This makes some PageloaderResults objects that are explorable with pdb. While I did this as a one-off hack, this is something we'll probably want more generally as part of Signal from Noise: https://bugzilla.mozilla.org/show_bug.cgi?id=722915

Then I looked at the data:

(Pdb) pp(small.results)
[{'index': '|0',
  'max': 1132.0,
  'mean': 353.75,
  'median': 119.0,
  'min': 91.0,
  'page': 'big-optimizable-group-opacity-2500.svg',
  'runs': [139.0, 1132.0, 1086.0, 91.0, 99.0]},
 {'index': '|1',
  'max': 9116.0,
  'mean': 113.0,
  'median': 108.0,
  'min': 103.0,
  'page': 'small-group-opacity-2500.svg',
  'runs': [103.0, 133.0, 9116.0, 108.0, 108.0]}]
(Pdb) pp(big.results)
[{'index': '|0',
  'max': 1130.0,
  'mean': 354.25,
  'median': 123.5,
  'min': 92.0,
  'page': 'big-optimizable-group-opacity-2500.svg',
  'runs': [147.0, 1130.0, 1078.0, 92.0, 100.0]},
 {'index': '|1',
  'max': 9247.0,
  'mean': 2333.25,
  'median': 109.0,
  'min': 103.0,
  'page': 'small-group-opacity-2500.svg',
  'runs': [103.0, 9012.0, 9247.0, 111.0, 107.0]}]

You'll notice a few things from the runs data:

  • the runs data is indeed bifurcated. In all cases there is a low value, around a hundred, and a high value in the thousands
  • contrary to the assumption that the first datapoint may be biased and high, you can't really see any bias, at least compared to the magnitude of the bifurcation

So how does this compare to the graphserver results? http://graphs-new.mozilla.org/graph.html#tests=[[170,1,21],[57,1,21]]&sel=1327791635000,1328041307110&displayrange=7&datatype=running

For the old data and the low value of the new data, we see times around 110-120ms. The high value of the new data is around 590ms. Are these numbers what we'd expect?

Throwing away the high value and taking the median for both data sets gives a number on the order of 100 or so (the old algorithm). Taking the median acts as a filter that pulls the bifurcated results towards the majorant population. Since the low population is slightly more majorant, dropping the highest number in the way that pageloader does further biases towards it. It is not surprising that we see no bifurcation in the old data.

For the new data, we drop the first run. Coincidentally or not, for the cases studied the first run was part of the low population, so that tends towards bifurcation. Taking the median of the remaining data points gives:

High case:

  • big-optimizable-group-opacity-2500.svg : (1078 + 100) / 2 = 589
  • small-group-opacity-2500.svg : (9012 + 111) / 2 = 4561.5

Low case:

  • big-optimizable-group-opacity-2500.svg : (99 + 1086) / 2 = 592.5
  • small-group-opacity-2500.svg : (133 + 108) / 2 = 120.5

So why does the high case come out high and the low case come out low? There is even more magic. Graphserver reports an average by taking the mean over all the pages but discarding the highest result: http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208 (from http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l265 from http://hg.mozilla.org/graphs/file/d93235e751c1/server/collect.cgi ). Since both of the pages exhibit the high value of the bifurcation in the high case, you report the lower of the two bifurcated values: 589, from big-optimizable-group-opacity-2500.svg. Since in the low case only one of the values is bifurcated, you get the low value: 120.5, from small-group-opacity-2500.svg.
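
The whole chain can be reproduced from the runs above in a few lines; this is a sketch of the described behaviour (drop-the-first-run median per page, then graphserver's drop-the-max mean across pages), not the production code:

def median(values):
    values = sorted(values)
    mid = len(values) // 2
    return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2.0

def talos_value(runs):
    return median(runs[1:])  # --ignoreFirst: drop the first run, median of the rest

def graphserver_average(page_values):
    values = sorted(page_values)[:-1]         # drop the highest per-page value...
    return sum(values) / float(len(values))   # ...and take the mean of the rest

high = {'big-optimizable-group-opacity-2500.svg': [147, 1130, 1078, 92, 100],
        'small-group-opacity-2500.svg': [103, 9012, 9247, 111, 107]}
low = {'big-optimizable-group-opacity-2500.svg': [139, 1132, 1086, 91, 99],
       'small-group-opacity-2500.svg': [103, 133, 9116, 108, 108]}

print(graphserver_average([talos_value(r) for r in high.values()]))  # 589.0
print(graphserver_average([talos_value(r) for r in low.values()]))   # 120.5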

Okay mystery solved. We know why graphserver is reporting what data it is reporting and we also know that our algorithm is doing what we think it is doing. However, this is the beginning instead of the end of the problem.

By taking the average of two data points and discarding the high value, we are doing something weird and wrong. We are effectively only reporting one of the two pages. Note that for the high and the low cases we are actually viewing data from different pages! This is misleading and probably outright wrong. We essentially have two pages just so we can throw one of them away, and then we have no confidence in what we are looking at. I'm not sure the code at http://hg.mozilla.org/graphs/file/d93235e751c1/server/pyfomatic/collect.py#l208 would even work for a single page. Probably not. In general, I grow increasingly skeptical of our amalgamation of results. We increasingly need to be able to get to and manipulate the raw data. We certainly need a way of digging into the stats, knowing what we're looking at, and having confidence in it. In general, talos, pageloader, and graphserver need to be made such that it is both easier to try new filters and more transparent what is actually happening.

We have been trying to bias towards the low numbers. Looking at the data for the four sets of runs shows that there are 13 low-state numbers and 7 high-state numbers. While there are more numbers in the low state, it is not an overwhelming majority.

This leaves the big elephant in the room: why are these runs bifurcated? Are we seeing a different code path, or is something else happening on these builders that leads to bifurcated results? While this will be challenging to investigate, IMHO we should know why this happens. While our method of throwing out the highest data point, taking the median, sending the data to graphserver, and then getting the average of the whole pageset back has the positive effect of minimizing noise (which is important), it is also sweeping a lot under the rug. We need to have confidence that what we're ignoring is okay to ignore. I don't have that confidence yet.

13:10 January 24, 2012

Mozilla Automation and Testing - Jetpack Performance Testing

I have a working proof of concept for Jetpack performance testing (JetPerf): http://k0s.org/mozilla/hg/jetperf . JetPerf uses mozharness to run Talos ts tests with an addon built with the Jetpack addon-sdk to measure differences between performance with and without the addon installed.

Playing with Jetpack + Talos performance lets us explore statistics in a bit more straightforward manner than the production Talos numbers allow. As part of the Signal from Noise project, which I am also part of, staging even small changes in how we process Talos data takes a lot of work, since the system involved has many moving parts (Talos, pageloader, graphserver). By contrast, since JetPerf is a new project, it is much more flexible for exploring data that we have not hitherto explored.

I made a mozharness script to clone the hg mirror of addon-sdk. It then builds a sample addon and runs Talos with it installed.

Looking at raw numbers wasn't very interesting, so I made a parser for Talos's data format. It was pretty quick to get some averages out before and after the addon was installed, but I thought it would be more useful to display the raw data along with the averages.

https://bug717036.bugzilla.mozilla.org/attachment.cgi?id=591224

These really aren't fair numbers, as currently the stub jetpack I use prints to a file, but it's at least the start of a methodology.

The reason I'm sharing this isn't just to make a progress report, but more to present some ideas about what to do with Talos data. While this was done for JetPerf, much of it also applies to Signal from Noise. You run Talos and get some results. What do you do with them? Currently we just shove them into http://graphs.mozilla.org/ and say that's where you process them, but I think looking at them locally is not only important but necessary if you're doing development work. I think a big part of any statistics-heavy project is to make it easy for all of the stakeholders to explore data, apply different filters, and see how things fit together. While it takes a statistician to be rigorous about the process, anyone can play with statistics, and it takes a village to really conceptualize what is being looked at. I hope, to this end, developers will use my software so that they can understand what it is doing and provide the valuable feedback I need.

TODO

JetPerf is still very much at the proof of concept stage. Ignoring the fact that none of it is in production, there are still many outstanding questions about basic facts of what we are doing here. But outside of polishing rough edges, here are some things in the pipeline.

  • test more variations of addons; currently we just load a panel and print something to a file
  • test on checkin (CI): the main point of JetPerf is to get a better idea of which SDK changes cause addon performance regressions, and to be able to quantify them. While as stated this is a very open-ended project, one thing that would turn this from a casual exploration into a developer tool is running the tests on checkin. This would give a real-time indication of whether a checkin hurts performance.
  • graphserver: in order to assess Jetpack's performance over time, we will want to send numbers to some sort of graphserver. This will allow us to keep track of the data, to view it, and apply various operations to it.

I may also spin off the (ad hoc) graphing portion and the Talos log parser portion into their own modules, as they may be useful outside of just JetPerf.

16:51 January 1, 2012

Mozilla Automation + Testing - MozBase Continuous Integration

As part of the A-Team 2011 Q4 goals I was able to devote a few days to setting up continuous integration (CI) for MozBase . I revived and extended autobot to support buildbot 0.8.5, set up tests and a simple test runner for mozbase, and deployed a test instance to k0s.org. You can see the waterfall here: http://k0s.org:8010

While buildbot comes with a gitpoller, the version in buildbot 0.8.5 (the current release on http://pypi.python.org/ ) did not work with git 1.6.3, the version on k0s.org. Since my box is on an ancient version of Ubuntu (and is remote and not trivially upgradable), I brought the generic autobot poller from being buildbot 0.8.3 compatible to 0.8.5 compatible (which, it is worth noting, is not trivial). Also, while a patch for an hgpoller was submitted by Mozilla developers some four years ago, it has been WONTFIXed, so I went ahead with a generic polling architecture, which (IMHO) seems a wiser architectural choice. While I sympathize with the architectural ideology of using a push-based architecture, and believe this is closer to ideal, polling will always work and does not require access to the repository servers, which is a huge factor when using https://github.com or even Mozilla hg repositories. (Incidentally, I found neither this patch nor http://hg.mozilla.org/build/buildbotcustom/file/tip/changes/hgpoller.py to work OOTB, so, sadly, I proceeded to roll my own. Also incidentally, it is not trivial to depend on buildbotcustom using install_requires due to its lack of a setup.py file.) After debugging the gitpoller I pushed a test change and was happy to see that autobot built correctly. Autobot now listens for MozBase changes!

I was unable to finish the (parenthetical) Q4 goal of having autobot report to autolog, so this remains outstanding work. There is a lot that could be done with autolog. The basic idea and TODOs are outlined in the README (which itself could use some work; it is largely up to date except for the Projects section, though incomplete). I will endeavor to work on this in my available time or as need escalates, but my priority for 2012 Q1 will be separating Talos signal from noise, so it is unlikely I will be able to put a lot of time into autobot (sadly). On the other hand, I am more than willing to help and advise if anyone wants any features or to iron out the crinkles. While the architecture is not completely straightforward, it is a decent approximation to a convex hull over the problem space of having simple-to-write, simple-to-maintain, simple-to-debug continuous integration for small(er) projects. As usual, if anyone wants to seek out alternate solutions, that is fine too, but I am essentially happy with my architecture decisions and technology choices.

Regardless of whether the CI solution for MozBase is autobot or (other), it is important to remember that continuous integration is a safety net and not a first line of defense. It is regrettable that autobot has no more notifications (yet) than the waterfall display and the autobot character lurking in #ateam (the default IRC bot isn't very verbal OOTB and I haven't had time to customize it). But I think having some (admittedly smokescreen) automated testing for MozBase is an important step towards the evolution of the software as well as towards development practices in general.

12:18 December 28, 2011

Auto-tools Q4 in reflection: progress on mozbase and talos

Most of my effort this quarter was spent on two related goals:

  1. developing a sane set of python packages to build test harnesses on top of. We call this MozBase: https://wiki.mozilla.org/Auto-tools/Projects/MozBase
  2. Making Talos sane and porting it to use the MozBase set of packages.

These are illustrated in our goals page: https://wiki.mozilla.org/Auto-tools/Goals/2011Q4#Mozbase

From one point of view, this isn't exciting work. But I live for this stuff. I think of software as an ecosystem to be cultivated and I live to cultivate it. So while, for the most part, I can't point to any exciting features that I implemented (nor were there planned to be), in retrospect I am proud of the fruits of my efforts and those of my team-mates and comrades. A big shout out to BYK and others who have stepped up to the plate to help the A-Team with these super-important efforts.

When I look back I see:

  • Talos wasn't a python package. Now it is!
  • MozBase didn't even exist or have a repo. Now it does
  • MozBase didn't have documentation or tests worth speaking of. Now it has at least a good start!
  • Talos even has a test for installation. We need more tests, but it's a good start!
  • There has been a lot of cleanup of Talos towards the end of making it more robust, easier to use, and easier to contribute to.
  • The A-Team didn't have any community contributors. Now we do! This one actually makes me the happiest :)

When I look at the progress, I see Talos evolving towards what I would call real software (instead of a one-off that has been extended to do way too much to still be a one-off) that Mozillians can hack on and extend and make useful changes to. This also sets the stage for making Talos easier for developers to use locally to test their changes, for getting more of our test harnesses to use the MozBase suite of utilities, and for making it easier to write new harnesses without reinventing so much of the wheel.

One of our next priorities towards these ends is Bug 713055 - get Talos on Mozharness in production. This is a huge step towards making buildbot more extensible as well as having desktop talos be more accessible to developers in a way that should be identical to the way it is run in automation. :aki has done a bunch of work to start moving our aging buildbot infrastructure towards something more sane. This is mozharness.

Armen (:armenzg) also updated the way that talos.zip is sought so that it can be decoupled from buildbot. This is another big step forward, which he details in his blog post: talos.zip, talos.json and you.

So a huge shout out to :jmaher and :wlach for all the Talos help, and :ahal and :ctalbert as well as all the help from those in release engineering for making all of this possible. I look forward to getting this all better in the coming year.

20:06 December 5, 2011

The state of Talos this week:

This is a rough map of what we want to do. As said, with so many balls in the air, we will want to block on as little as possible and make as few really big changes at a time as we can, so that we can ensure that each piece of the puzzle fits together correctly.

10:42 November 21, 2011

I've been developing Talos recently. There are many caveats to working on this test harness, and it demands a more rigorous process than, say, a webapp. It has a large amount of necessary platform-specific code. It is deployed in a complex infrastructure environment. And it has no tests.

In order to test Talos, the A*Team has an internal staging environment (thanks to the efforts of anode and bhearsum and others) that mirrors the production testing infrastructure. Like production, it requires an HTTP-hosted URL structure containing pageloader, a pageset (tp5), and other resources necessary for buildbot plus Talos. (We should probably document the directory structure.)

In order to test Talos, you point the A*Team staging environment configuration to the HTTP-hosted location of your copy of this structure of resources. Then you issue a buildbot sendchange (which can be scripted for ease of use) that corresponds to a set of Talos tests to be run on each platform of interest. We have some simple scripts (e.g. ./chrome.sh or ./dirty.sh) to run sets of tests as we do in production; these translate to a variety of buildbot sendchange commands appropriate for the tests to be run. Green runs mean good.

In order to test my Talos changes, I needed to set up a system whereby I could translate my changes into a hosted copy of talos, pageloader, etc. So here is what I did.

Steps:

  1. Replicate http://people.mozilla.org/~jmaher/taloszips/tip/

    It would be nice to provide a sane base template for this.

  2. Put the talos zips on a web server:

    cd mozilla/web/talos # change to a desired hosted directory
    wget -r -l0 --no-parent http://people.mozilla.org/~jmaher/taloszips/tip/
    mv people.mozilla.org/~jmaher/taloszips/tip . # the piece you need
    rm -rf people.mozilla.org # cleanup unneeded directories
    find tip -iname 'index.html*' -delete # remove unneeded index pages
    

    [Example: http://k0s.org/mozilla/talos/tip/]

  3. Clone a copy of Talos:

    cd ~/mozilla/src/
    virtualenv.py talos-staging
    cd talos-staging; mkdir src; cd src
    hg clone http://k0s.org/mozilla/hg/talos
    echo 'default-push = ssh://k0s.org/mozilla/hg/talos' >> talos/.hg/hgrc
    
  4. Development process:

Based on jmaher's update_talos.sh, I wrote a script to help me turn my changes into changes in my hosted copy of talos.zip. Since I work largely with diffs hosted on bugzilla or in my mercurial queue of Talos patches, I wanted a script that would apply a series of changes to a checkout of talos. In addition, I wanted to keep the flexibility of being able to edit these files on disk.

The script lives at http://k0s.org/mozilla/update_talos.py . I will endeavor to improve it as testing needs become more apparent. It sadly loses update_talos.sh's feature of creating versioned zips. I thought about hosting a dedicated talos repository for testing (and still may, if that seems better down the line), but I usually want to test a specific change and roll back to a known state.

The script does the following (a rough sketch in python follows the list):

  1. Cleans up and reclones, optionally
  2. Applies a series of diffs
  3. Creates a talos.zip and moves it to the appropriate place on disk.
  4. Fetches a fresh copy of pageloader.xpi
  5. Syncs the files with the HTTP server
  6. Cleans up and reclones, optionally
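
A rough sketch of the shape of such a script; all paths, URLs, and hosts here are hypothetical placeholders, and the real script (linked above) differs in the details:

import os
import subprocess

TALOS_SRC = os.path.expanduser('~/mozilla/src/talos-staging/src/talos')  # hypothetical checkout
WEB_TIP = os.path.expanduser('~/mozilla/web/talos/tip')                  # hypothetical hosted dir
PATCHES = ['/path/to/some-bug.diff']                                     # hypothetical diffs to apply

def run(*command, **kwargs):
    subprocess.check_call(list(command), **kwargs)

# 1. (optionally) get back to a known state; the real script can re-clone instead
run('hg', 'revert', '--all', '--no-backup', cwd=TALOS_SRC)

# 2. apply a series of diffs
for patch in PATCHES:
    run('patch', '-p1', '-i', patch, cwd=TALOS_SRC)

# 3. create talos.zip and put it where the staging config expects it
run('zip', '-r', os.path.join(WEB_TIP, 'zips', 'talos.zip'), 'talos',
    cwd=os.path.dirname(TALOS_SRC))

# 4. fetch a fresh pageloader.xpi (URL is a placeholder)
run('wget', '-O', os.path.join(WEB_TIP, 'pageloader.xpi'), 'http://example.com/pageloader.xpi')

# 5. sync the files to the HTTP server (host and path are placeholders)
run('rsync', '-az', WEB_TIP + '/', 'user@example.com:/var/www/talos/tip/')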

After the HTTP copy is updated, I can run (e.g.) xperf.sh to trigger that set of tests in the staging environment and watch the waterfall to assess the viability of the change.

It would be nice to have something more generic, but the path to good software is through iteration. Perhaps as more people develop their own scripts to test Talos in the staging environment, we will evolve towards a more generic script to update talos, as well as copies or templates of the needed URL/directory structure and of the staging software.

10:59 November 15, 2011

Introducing MozBase

Over the years, Mozilla has developed a number of test harnesses for automated testing of Firefox and other applications. Most of the harness code is written in python due to its utility towards this type of development. As one would expect, the harnesses arose from necessity and grew organically. However, as the harnesses grew it became apparent that there were several generic tasks that the harnesses shared:

  • creating and manipulating a profile
  • installing addons into the profile
  • invoking (e.g.) Firefox in a desired manner
  • process management
  • ...a few other things

These pieces have largely been developed in a vacuum (in the early stages) or copy+pasted from other harnesses (in the later stages). This has led to duplicated functionality; harness software that is inconsistent and difficult to maintain (since fixing something in one place means it probably needs to be fixed in other places too); and a system which was fully understood by no one once it became sufficiently complex. The harness software could not be reused because it was tightly coupled to its implementation, even when the underlying intent was generic.

Meet MozBase!

As software grows, it should be cultivated such that its effectiveness and its knowledge base are maximized. Code should be made reusable and the architecture evolved towards a representation of intent. This is the goal of the MozBase effort by the A-Team: https://wiki.mozilla.org/Auto-tools/Projects/MozBase

  • we want to make high quality components to build test harnesses
  • ... and other pieces of software
  • ... that might be useful on their own
  • we want to replace existing code with these pieces
  • ... but cultivate their knowledge base
  • we want to develop canonical and reusable python tools
  • ... and encourage the community to use them

Developing MozBase is one of the A-Team goals this quarter. While cultivating software is an ongoing effort, we're off to a good start. We already have several MozBase python packages:

Our immediate goals are to cultivate these into high-quality tools, taking lessons from the existing harnesses, and then to port the harnesses to these tools so that they can be maintained in a unified manner. Right now, we're working on Talos, both because it is a good proving ground for these tools and because much of its code can be replaced with MozBase code easily (for some definition of "easy").

While MozBase is about software, it is also about having a sane and maintainable environment to cultivate software in. While modular packages are great, their utility is in how they may be used together (as well as with other code) instead of in the craft of an individual package. So we're tackling these issues too.

Python importing in Mozilla Central: currently (most) python in mozilla-central is not packaged, and we manually futz with PYTHONPATH and sys.path in several inconsistent and hard-to-maintain ways. In order to move towards python packages in any reasonable fashion, we need to make importing easy and unified, as well as move towards how the python world typically does importing. There is bug 661908 for creating a unified virtualenv in the $OBJDIR. Work is likely to start on this or a similar effort soon (either this quarter or Q1 2012).

Mirroring software to Mozilla Central: we have hampered ourselves -- rewritten software and avoided fixing bugs -- by not using third-party python packages for tools that live in mozilla-central. In addition, since many of the test harnesses already live in m-c, if we are going to move these to consume mozbase we will need a strategy to mirror it and other software to the tree. While nothing has been definitively decided, preliminary discussion has pointed towards having a script to fetch resources from a variety of locations and add them to mozilla-central or elsewhere. We're having a meeting this week to figure out what we really want to do and go from there.

Such is the MozBase effort. I am excited to start moving our code into a solid, maintainable structure, and I hope you are too. If you are, please check out our github project or sign in to #ateam and tell us what you think. We'd love contributors!

15:13 November 14, 2011

jhammel now maintains mozregression

So the secret is out!

http://harthur.wordpress.com/2011/11/01/new-mozregression-owner/

I am going to be maintaining mozregression going forward. I released a 0.6 version to pypi today which hopefully fixes a few setup.py issues. You can find me at jhammel __at__ mozilla __dot__ com or as jhammel in #ateam.

http://groups.google.com/group/mozilla.tools/t/b1f12f5127761207

14:32 November 14, 2011

Talos is now a python package

The A-Team is working on creating a set of high-quality python utilities that are consumable, general purpose, and interoperable in an effort called MozBase. A huge part of this quarter's effort is to improve Talos to consume MozBase software and to make it an extensible harness that may also be consumed.

As one of the first steps towards making Talos consume upstream MozBase packages, I have made Talos a python package. This allows Talos to depend on upstream python packages in an automated fashion, permits additional setup/install-time steps to be automated, and installs Talos in such a manner that dotted paths against talos can be resolved by python import. That is, other packages can now usefully import talos without depending on a set directory structure.

Unfortunately, since the talos repository was arranged such that all the python scripts and other data lived in a fairly disorganized top-level directory, this involved making a talos subdirectory and moving all files (except the README) into that subdirectory and carefully ensuring that all data resources were properly installed alongside the python scripts.
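
In setup.py terms, the gist is something like the following simplified sketch (not the actual talos setup.py; the data globs and the PyYAML requirement are illustrative assumptions):

from setuptools import setup, find_packages

setup(name='talos',
      version='0.0',
      packages=find_packages(),  # now picks up the new talos/ subdirectory as a package
      # install data resources alongside the python scripts (globs are hypothetical)
      package_data={'talos': ['*.config', '*.yml']},
      install_requires=['PyYAML'],  # illustrative; the configuration files are YAML
      )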

Even more unfortunately, this change led to some confusion that could have been avoided ahead of time. Talos uses a tests.zip file that contains both the scripts and the data, and though I would have liked to do additional cleanup as part of making Talos a python package, I deliberately held off on changing anything that would invalidate this methodology. However, unbeknownst to me, there were other resources that depended on the talos directory structure, and these got broken with my change. I apologize for that, and will communicate these changes more widely next time. In the meantime, if you have any tools that depend on the talos directory structure, know that they will break next time you update. If you have questions about this, please contact me.

Although the fallout was regrettable, I think this is a necessary and forward-facing change in light of MozBase, Mozharness, and good python practices in general. We're now looking at deprecating the tests.zip methodology and moving towards a Mozharness script for running Talos for both desktop testers and production. More on that as things progress.