Building Quality into HMH Ed

Brendan Donegan
Published in HMH Engineering
12 min read · Oct 5, 2021


As in any form of engineering, quality needs to be baked into every aspect of the software engineering process.

This year saw the successful launch of our new learning platform, which brought a fresh user experience to all of our different users, from teachers and school administrators to students. Just as important, though, was the high level of quality that was built into the platform throughout its development. In this post I’ll give a tour of the test strategy we put in place for the development of the new platform, which helped us attain these quality standards.

First of all, I should give some background on where we were coming from in terms of testing and quality during development. Previously, our approach to testing, and test automation in particular, was heavily focused on browser-based end-to-end testing. Any new feature or bug fix needed end-to-end tests added, while lower-level coverage from unit or integration tests was very low (approximately 40–50%).

This meant two things. The first was that we had a very large and unstable suite of tests that needed to be run at every stage of our deployment pipeline.

These tests would fail for a variety of reasons. Some had to be written to explicitly deal with slowness in backend services, which is difficult to anticipate, so failures of that kind often needed to be patched after the fact. They would also fail due to errors in the backend services themselves; these were escalated to the appropriate teams, but still meant a lot of time invested in investigating issues that weren’t directly related to the UI code. Sometimes page locators were updated without the corresponding tests being updated, and the tests would fail for that reason. Occasionally they would fail because the application genuinely had an issue, but these cases were much less common, and even then the issues were often trivial code errors that could have been caught much earlier.

The amount of engineers’ time that had to be invested in investigating these failures and confirming they weren’t signs of the application misbehaving was very large. Whilst we got a high level of confidence from these tests, it simply wasn’t worth the effort. The second issue was that, because of the low level of test coverage at the coding stage, very trivial issues were getting into the build, necessitating backing the offending code out and blocking the deployment pipeline.

So when the development of the new platform gave us the chance to overhaul our quality practices into something more modern, we jumped at it. We started by setting some goals for what we wanted our new test strategy to achieve. The main goal was to improve our ability to practice ‘continuous deployment’: the idea that any build good enough to pass the required automated tests should be shipped in front of customers. If our only goal was to ship more builds, we could have done this easily by simply removing all quality gates and deploying every build straight to production. Obviously this isn’t realistic, so we needed appropriate quality gates throughout the development lifecycle.

You might have seen the image below before; it’s what’s called a ‘testing pyramid’. While it shouldn’t be taken too literally, the intention is to reflect the idea that a large number of tests at the coding stage matters more than a large number of tests at the ‘acceptance’ or ‘system’ testing phase, which is where end-to-end browser tests belong.

A pyramid shaped chart showing the different types of testing, the base being unit tests, followed by integration tests, then end-to-end tests

In addition to the concept of the test pyramid, we also wanted to build the concept of ‘shifting left’ into the test strategy. This describes the idea that any quality practice that can be moved earlier in the process should be: the earlier bugs are found, the cheaper they are to find and fix.

One of the most important underpinnings of this strategy is the static code analysis and code-coverage metrics we put in place. For this we use a tool called SonarQube. SonarQube supplies a large set of rules for identifying bugs and ‘code smells’ (code issues which don’t necessarily cause bugs but reduce the maintainability or readability of the code). It also provides functionality for measuring code-coverage levels per pull request, which allowed us to set a target of 90% code coverage for each pull request.

SonarQube analysis of a pull request in the Lift repository
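
The coverage gate itself lives in SonarQube, but to give a flavour of the plumbing, here is a rough sketch of the kind of Jest configuration that produces the lcov report SonarQube reads. The paths, exclusions and local threshold below are assumptions for illustration, not Lift’s real settings, and it assumes a TypeScript config is loadable (Jest supports this when ts-node is available).

```ts
// jest.config.ts — illustrative only; real paths and thresholds are assumptions.
import type { Config } from '@jest/types';

const config: Config.InitialOptions = {
  collectCoverage: true,
  collectCoverageFrom: ['src/**/*.{ts,tsx}', '!src/**/*.stories.{ts,tsx}'],
  // lcov is the format SonarQube's JavaScript/TypeScript analysis consumes.
  coverageReporters: ['lcov', 'text-summary'],
  // A local safety net mirroring the 90% per-pull-request target enforced in SonarQube.
  coverageThreshold: {
    global: { lines: 90, statements: 90 },
  },
};

export default config;
```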

Whenever a pull request is raised for the Ed platform, the changes are tested automatically, and the author of the pull request cannot merge it until those checks pass. All tests apart from the end-to-end browser tests are run at this stage, along with the SonarQube analysis, and the results are posted to the pull request in GitHub.

This pull request failed the required checks
This one is ready to go!

When working with a large team of developers, having quality checks that are properly enforced is essential to make sure standards are adhered to continuously and consistently.

The middle layers of the pyramid are probably the ones there is most debate about. It can be difficult to draw a clear distinction between ‘component’ and ‘unit’ tests; really there’s no fundamental difference, and the purpose of those parts of the pyramid is just to illustrate that both may be required. At the end of the day, both types of test require the component under test to be fully isolated from its dependencies using mocks or stubs. The pyramid also acknowledges that code often needs to be written in isolation from the code that is going to use it, for example developing the API library code for a particular endpoint before actually using that endpoint in an application or component.
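
As an illustration of what a test at this level might look like, a unit test for an API library module can stub out the shared HTTP client entirely. The module names and endpoint below are hypothetical, invented for the sketch rather than taken from Lift.

```ts
// Hypothetical example: testing an API library module in isolation with Jest.
// '../src/api/assignments' and '../src/api/httpClient' are invented for illustration.
import { fetchAssignments } from '../src/api/assignments';
import { httpClient } from '../src/api/httpClient';

jest.mock('../src/api/httpClient');
const mockedClient = httpClient as jest.Mocked<typeof httpClient>;

test('fetchAssignments calls the endpoint and unwraps the payload', async () => {
  // The real HTTP client is replaced by a stub, so no network is involved.
  mockedClient.get.mockResolvedValue({ data: { assignments: [{ id: 'a-1' }] } });

  const result = await fetchAssignments('class-42');

  expect(mockedClient.get).toHaveBeenCalledWith('/classes/class-42/assignments');
  expect(result).toEqual([{ id: 'a-1' }]);
});
```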

My colleague Aislinn wrote extensively in a previous post about how we go about application-level integration testing, so I don’t want to go too deep into the details again here. I will say that this is probably the most important and valuable part of the test strategy, as it gives us a high degree of confidence in the functionality of the application whilst the tests can still run in our CI.

Essentially, we use React Testing Library and Mock Service Worker in tandem to allow developers to write tests which render the application they are working on and mock the responses to any API requests that are made. Importantly, the API responses are the only thing that is mocked. This means the application under test is very close to what a user will actually see, providing a high level of confidence from tests that run quickly and reliably.
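
As a rough sketch of the pattern (the component, endpoint and response shape below are invented for illustration; the real setup is covered in Aislinn’s post):

```tsx
// Illustrative only: the application entry point, endpoint and data are hypothetical.
import React from 'react';
import '@testing-library/jest-dom';
import { render, screen } from '@testing-library/react';
import { rest } from 'msw';
import { setupServer } from 'msw/node';
import { MyClassesApp } from '../src/MyClassesApp';

// Only the network layer is mocked; the application itself renders for real.
const server = setupServer(
  rest.get('/api/classes', (req, res, ctx) =>
    res(ctx.json([{ id: 'c-1', name: 'Grade 5 Reading' }]))
  )
);

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

test('renders the classes returned by the API', async () => {
  render(<MyClassesApp />);
  // The app fetches /api/classes and should display the mocked class.
  expect(await screen.findByText('Grade 5 Reading')).toBeInTheDocument();
});
```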

Being able to have fast and reliable application integration tests that exercise user-facing functionality yet still run early in the continuous integration process is excellent, but the comprehensive use of API mocking does create a gap: we have no way of knowing when a change to an API breaks our application code. This is where Pact comes in. Pact is what is called a ‘consumer-driven contract testing framework’. Much more detail about Pact can be found on its homepage, but a shortened version of the basic process is:

  1. The consumer of an API writes a test which describes an interaction with the provider of that API, including the path, the request method (GET, POST, etc.) and any headers, parameters or body that need to be sent, as well as the response that is returned (a sketch of such a test follows this list). Notably, these tests should focus on the parts of the response the consumer actually uses. These two pieces of information together form a ‘contract’.
  2. These contracts are published to a middleman called a ‘broker’ along with metadata such as the version of the consumer and some tags.
  3. These contracts can then be used in the CI of the relevant providers to run a set of tests which confirm that the response they return for a given request is correct.
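
To make step 1 concrete, here is a minimal consumer-test sketch using @pact-foundation/pact. The consumer and provider names, endpoint and client function are hypothetical, chosen purely to illustrate the shape of such a test.

```ts
// Hypothetical Pact consumer test; names, endpoint and the client module are invented.
import path from 'path';
import { Pact, Matchers } from '@pact-foundation/pact';
import { getClassRoster } from '../src/api/roster'; // imaginary API client under test

const { like, eachLike } = Matchers;

const provider = new Pact({
  consumer: 'lift-my-classes',
  provider: 'roster-service',
  port: 8991,
  // Contracts are written here, then published to the broker in a separate step.
  dir: path.resolve(process.cwd(), 'pacts'),
});

describe('roster-service contract', () => {
  beforeAll(() => provider.setup());
  afterAll(() => provider.finalize());

  it('returns the roster for a class', async () => {
    await provider.addInteraction({
      state: 'a class with id 42 exists',
      uponReceiving: 'a request for the class roster',
      withRequest: {
        method: 'GET',
        path: '/classes/42/roster',
        headers: { Accept: 'application/json' },
      },
      willRespondWith: {
        status: 200,
        headers: { 'Content-Type': 'application/json' },
        // Describe only the fields the consumer actually uses.
        body: { students: eachLike({ id: like('s-1'), name: like('Ada') }) },
      },
    });

    const roster = await getClassRoster('42', { baseUrl: 'http://localhost:8991' });
    expect(roster.students.length).toBeGreaterThan(0);

    await provider.verify();
  });
});
```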

There is a lot more to it than this, of course, and successful usage of Pact, or contract testing in general, requires a large investment in training, setting up the broker and reconfiguring CI pipelines, on top of actually writing the consumer tests themselves. It enables us to adopt an approach that is less focused on end-to-end testing, which pays dividends over time.

Before jumping into the ‘tip’ of the test pyramid in the form of end-to-end testing, I wanted to say a quick word about how we work in a monorepo and the tooling we use to support that approach. All UI platform code for Ed is stored in a single repository, which goes by the name ‘Lift’. Within this repository are a large number of individual packages with specific purposes. Some of them represent applications within the platform, like the ‘My Classes’ application or the ‘Professional Learning’ application. Most of them are components, libraries or APIs which may be used by a number of other packages. Running all of the tests in the repository against every change would take a long time (in fact we know it takes around an hour). Lerna allows us to run only the tests that belong to packages which have changed, or which depend on packages which have changed.

Changing Component A on the left pulls in only Application A, which depends on it, but changing Component B on the right pulls in both applications, as both depend on it.

This means that a pull request which changes only one package, with nothing depending on it, often takes less than a minute to run the tests for.

As mentioned earlier in the article, end-to-end browser-based testing is very much a double-edged sword. Having the confidence that a particular feature works end to end for real users in a real browser is great, but the fact that these tests run relatively late in the release process and can fail in a variety of ways outside the control of the team means they shouldn’t be overused.

The guidance that we give to our teams is that such end-to-end tests should be developed for only the most important user flows provided by the platform. Because the Ed platform is so large and has such a wide variety of users and usages, we still have a significant end-to-end suite, but more issues are caught earlier in the process by our other kinds of tests and the ones we do have provide more value per test than they did previously.

An example of such a user flow might be a teacher locating a resource such as an e-book using the Discover application, assigning it to a student, having the student open the resource and mark it as completed, then finally checking that the teacher sees the status as completed.

These end-to-end tests are run in our deployment pipeline, both in our staging environments prior to production deployment, and on production after a new build has been deployed there.

Codecept is the framework we use for end-to-end testing. It is JavaScript-based, which means all members of the frontend development teams can work with it easily.

Whilst it’s less important than the actual content of the tests that we write, I should also note that all of the new end-to-end tests were written with Codecept.io. We needed to migrate away from Protractor, which our previous end-to-end suite was written in, and Codecept was found to be the best fit for our requirements.
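
To give a flavour of what these tests look like, here is a compressed CodeceptJS sketch of a flow like the one described above. The routes, button labels and login steps are illustrative rather than Ed’s real selectors.

```ts
// Illustrative CodeceptJS sketch; routes, labels and login steps are not real Ed code.
Feature('Assign and complete an e-book');

Scenario('teacher assigns, student completes, teacher sees the status', async ({ I }) => {
  // ...teacher login steps omitted...
  I.amOnPage('/discover');
  I.fillField('Search', 'Grade 5 e-book');
  I.click('Assign');
  I.see('Assignment created');

  // A separate browser session keeps the student's login isolated from the teacher's.
  session('student', () => {
    // ...student login steps omitted...
    I.amOnPage('/assignments');
    I.click('Grade 5 e-book');
    I.click('Mark as completed');
  });

  I.see('Completed', '.assignment-status');
});
```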

On top of all of this automated testing running in our CI/CD pipelines, there are a variety of other quality-related activities that we perform on a regular basis. The majority of these activities take place outside the pipeline, either because they can take a long time to run or because there is a manual element.

The performance of Ed is an important consideration for us, so we perform a blended performance test which simulates user flows in the platform, making the same requests to backend services that would be made during real usage. These tests are run with user volumes which exceed the expected actual usage, so that we can be sure the backend services will cope under those loads. The tool we use for this is Gatling, a Scala-based tool with a convenient DSL for declaring performance-testing scenarios.

An example of the report generated by Gatling for one of the scenarios tested.

As well as the performance of the backend services used by the UI, we also test and monitor the performance of the UI itself during real usage using tools such as Sitespeed.io and the Core Web Vitals information surfaced in Datadog.

We also recognise that our customers, particularly students, want to use Ed on a variety of different devices and browsers. Although Chrome is by far the most widely used browser, there are still significant numbers of users on Mac and iOS with Safari, as well as on Android phones. We perform cross-browser testing according to our browser support matrix at the team level, and also use Datadog RUM and Browser Logs to identify browser-specific issues.

Some of the main browsers we support. Chrome, Microsoft Edge, Firefox and Safari.

Testing for accessibility is also built into our processes; my colleague Anne-Marie has previously written about why accessibility is important for HMH and how we go about testing for it. In summary, we have accessibility tests built in at all stages of the development process, including automated tests using jest-axe and manual testing of keyboard navigation and screen readers.
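
For example, a jest-axe check looks roughly like this; the component and its props are hypothetical, used only to show the shape of the test.

```tsx
// Minimal jest-axe example; AssignmentCard is an invented component for illustration.
import React from 'react';
import { render } from '@testing-library/react';
import { axe, toHaveNoViolations } from 'jest-axe';
import { AssignmentCard } from '../src/components/AssignmentCard';

expect.extend(toHaveNoViolations);

test('AssignmentCard has no detectable accessibility violations', async () => {
  const { container } = render(
    <AssignmentCard title="Chapter 3 quiz" dueDate="2021-10-12" />
  );
  expect(await axe(container)).toHaveNoViolations();
});
```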

Whilst we firmly believe in the value of reliable, fast and repeatable automated testing, we also recognise as a Quality Engineering team that testing is a skill and that there is room in our process for good old-fashioned ‘bug hunting’. This takes the form of in-sprint exploratory testing of specific features, as well as regular exploratory sessions we call ‘Test Days’, which I may cover in a later post.

You might be asking what benefits we have seen from adopting this test strategy. At the beginning of this post I mentioned that we had a few main goals:

  1. Increase the deployment velocity of the platform, following the principle of continuous deployment that dictates that more frequent deployment leads to better quality.
  2. Observe a higher level of quality in the platform

I’m happy to say that we’ve been more than successful in both of these goals. At the start of this journey we were very much operating with the reverse of the model described here, with a large volume of end-to-end tests and low test coverage at the lower levels, and we were deploying less than once a day. This year, as of the writing of this article, we have performed over 600 production deployments of our front-end platform code, a rate of more than three a day. On top of this we see a much lower build failure rate, with almost 65% of builds making it in front of customers, versus as few as 20% previously.

We have also had great feedback from the engineers working on the platform about how much easier the new deployment pipeline is to interact with compared to the previous one.

The proof that all of this hangs together came back in July, when we released the new version of Ed to the largest portion of our user base: teachers and students. The launch went extremely smoothly, with no major problems, which is a big achievement for the first release of a new platform in front of customers. We’re happy that we now have an approach that is sustainable and will allow us to continue building on a solid foundation.
