Tester Recommendation Engine

Internal App | Applause Quality App | 08-2019

Product Design, UI, UX Writing

The Problem

Test cycle staffing is currently a manual process done by test engineers. It favors top testers, which creates bias and gaps in coverage. The goal is to automate this process for more reliable test results and better coverage for Applause’s customers. Any noise on a cycle requires extra time from the test engineer and increases cost to the business.

The MVP will evolve in phases until full automation, so we have to frame the problem differently for each stage of maturity.

How Might We…provide test engineers with a recommendation engine to help them evaluate qualified testers and deliver successful test results?

How Might We…enable test engineers with a feedback mechanism in order to train the engine to deliver more accurate recommendations?

How Might We…empower test engineers with automated test cycle optimization to save them time and reduce cost?

The Challenges

  • Tester selection bias will remain until full automation

  • Balance users’ trust in human validation with the business goal of automation

  • UI decisions have to feed back into the data science model attributes

The Process

Working closely with the Data Scientists and Product Manager, we discussed and defined each attribute that feeds the model during our product definition phase. As you can see in the diagram below (left), historical tester data feeds into the three data science models, which produce predictive attributes that combine into a total ‘Fit’ score.

Each attribute can be broken down into three value ranges: High, Medium, and Low. Three attributes with three ranges each creates a total of 27 different combinations. Below (right) is an example of one of the combinations.

HHM = High (Bug), High (Test Case), Medium (Noise).

[Diagram: ds-models.png, the three data science models feeding the composite ‘Fit’ score (left) and an example attribute combination (right)]
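To make the composite concrete, here is a minimal sketch in Python of how the three attributes might bucket into value ranges and roll up into a Fit score. The attribute names come from this case study; the cut points, the 0 to 1 score scale, and the equal weighting are illustrative assumptions, not the production model.

```python
# Minimal sketch (assumed scales and weights): bucket each model attribute
# into High / Medium / Low and roll the three up into a composite Fit score.
from itertools import product

ATTRIBUTES = ["bug", "test_case", "noise"]  # the three model attributes

def value_range(score: float) -> str:
    """Bucket a raw attribute score in [0, 1] into H / M / L (assumed cut points)."""
    if score >= 0.66:
        return "H"
    return "M" if score >= 0.33 else "L"

def combination(scores: dict) -> str:
    """e.g. {'bug': 0.9, 'test_case': 0.8, 'noise': 0.5} -> 'HHM'."""
    return "".join(value_range(scores[a]) for a in ATTRIBUTES)

def fit_score(scores: dict) -> int:
    """Composite Fit score on a 0-100 scale; equal weighting is an assumption."""
    return round(100 * sum(scores[a] for a in ATTRIBUTES) / len(ATTRIBUTES))

# Three attributes x three value ranges = 27 possible combinations.
assert len(list(product("HML", repeat=len(ATTRIBUTES)))) == 27
```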

Next, I created user flows that outlined a test engineer’s decision-making process when evaluating and selecting testers. It was crucial at this stage to balance automation goals with the user’s trust while we trained the algorithm with human validation and feedback. To prevent selection bias, we were very careful about which pieces of platform data to disclose at each stage of maturity.

The Solution

UI decisions and copy were meticulously crafted and tested. All ‘Fit’ scores were ranked into three ranges, which translated on the UI to ‘Great Fit’, ‘Good Fit’ and ‘Poor Fit’. Early user testing showed that raw Fit scores (e.g. 72) were misleading when not scaled and resulted in distrust of the engine. So it became a conscious design decision to hide some of what goes into the score and only display an attribute’s value and its associated sentiment. As you can see on the tester scorecard, there is the option to Agree or Disagree with the attribute’s sentiment about a particular tester, which then feeds back into the data science models. Comments are also collected at this early stage to clarify any discrepancies.
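As a rough illustration of that scorecard behaviour, the sketch below buckets raw Fit scores into the three UI labels and logs an Agree/Disagree click as a training signal. The thresholds, field names, and feedback shape are assumptions for illustration only.

```python
# Hypothetical sketch: raw Fit scores are never shown; they are bucketed into
# three labels, and each Agree / Disagree click is logged as a training signal.
from dataclasses import dataclass

def fit_label(raw_score: float) -> str:
    """Rank a raw Fit score into the three UI labels (assumed thresholds)."""
    if raw_score >= 75:
        return "Great Fit"
    return "Good Fit" if raw_score >= 50 else "Poor Fit"

@dataclass
class AttributeFeedback:
    tester_id: str
    attribute: str     # "bug", "test_case", or "noise"
    sentiment: str     # the sentiment shown on the scorecard
    agrees: bool       # Agree / Disagree from the test engineer
    comment: str = ""  # optional comment collected in the early stage

feedback_log: list = []  # would be queued back to the data science models

def record_feedback(entry: AttributeFeedback) -> None:
    feedback_log.append(entry)

# Hypothetical example of one validation click:
record_feedback(AttributeFeedback("tester-42", "noise", "low noise", agrees=True))
```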

Cycle Fit behaves like an advanced filter on the search results UI, allowing the user to adjust each data science attribute based on their cycle needs. It also offers a way to build user trust as they familiarize themselves with how the algorithm behaves. Copy was carefully written and tested to avoid selection bias or leading questions. I learned that treating the values as levels of importance was more neutral than trying to quantify the goal of the cycle. The 3-tab selector in the UI maps back to the three value ranges of each attribute: High, Medium, and Low.
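One way to picture the Cycle Fit filter, assuming a tester’s attributes are summarised by a combination string like ‘HHM’: each attribute gets one of the three importance levels from the 3-tab selector, and testers whose value range meets or exceeds that level are kept. The names and the filtering rule here are illustrative assumptions, not the shipped logic.

```python
# Illustrative sketch of Cycle Fit as an advanced filter over search results.
ORDER = {"H": 2, "M": 1, "L": 0}
LEVEL_TO_RANGE = {"High": "H", "Medium": "M", "Low": "L"}
ATTRIBUTES = ("bug", "test_case", "noise")

def cycle_fit_filter(testers, importance):
    """Keep testers whose value range meets or exceeds each requested level.

    testers: e.g. [{"id": "t1", "combo": "HHM"}]
    importance: e.g. {"bug": "High", "test_case": "Medium", "noise": "Low"}
    """
    wanted = [LEVEL_TO_RANGE[importance[a]] for a in ATTRIBUTES]
    return [
        t for t in testers
        if all(ORDER[have] >= ORDER[want] for have, want in zip(t["combo"], wanted))
    ]

# Example: a cycle that prioritises bug finding above all else.
testers = [{"id": "t1", "combo": "HHM"}, {"id": "t2", "combo": "MLH"}]
print(cycle_fit_filter(testers, {"bug": "High", "test_case": "Low", "noise": "Low"}))
# -> [{'id': 't1', 'combo': 'HHM'}]
```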


 

Early Adoption

The MVP of the tester recommendation engine was measured by the usage of recommended testers in a test cycle. Early adoption indicates that 33% of ‘Great Fit’ testers were staffed on a cycle, up from 23% previously, and fewer ‘Poor Fit’ testers were selected, dropping from 16% to 12%.

Learnings

  • The user experience is two-way, as the user plays an integral part in training the model

  • Through many phases and iterations, I learned to be flexible as the vision evolved

  • UX writing was very important in balancing business goals with user goals