The Guide to Practical and Pragmatic IT Architecture Design

Stress Test Methodology

For stress and performance testing, the methodology distinguishes three categories of tests that need to be verified:
  1. The first and most important tests are the online tests. These tests focus on the high-volume online interactions between users and the system.
  2. The second category are the concurrent tests, which focus on concurrent events that happen in production while online users are interacting with the system. These concurrent events are typically things like interfaces that use system resources while users access the system, but could also be backups, data synchronization or any other event. The objective of the online user and concurrent event tests is to verify that the system can handle the load that would typically occur at peak without becoming too slow or crashing.
  3. The last category is batch testing, which looks at the batch schedule and the activities that happen during the batch window. Its purpose is to verify that the most critical and most time-consuming batch activities can finish within their specified timeframes.
 
For each of these categories we need to select the most critical activities or events that could impact the system. The selection criteria are a mix of the following (a simple scoring sketch follows the examples below):
  • Frequency and volume: which transactions are triggered most frequently and carry the highest volume
  • Number of users impacted: if the specific activity does not work, how many users are affected
  • Business criticality: how severely the business is impacted if the activity does not work

So, for instance, for online testing we would look for the 10 highest-volume transactions that are important to the business and affect a large number of users. Some examples are:
Online tests:
Login
Get quote
Buy product or service
Query data or account
..
Concurrent tests:
(Near) real-time interfaces
Interfaces that run hourly or daily
Backup (if that situation could occur in production)
..
Batch tests:
Data processing or data reconciliation activities
..
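
To keep the selection objective, the three criteria above can be combined into a simple weighted score per candidate transaction. The sketch below is illustrative only; the weights, ratings and transaction names are assumptions, not prescribed values.

    # Illustrative weighted scoring of candidate transactions; weights and ratings are assumptions.
    CRITERIA_WEIGHTS = {"volume": 0.4, "users_impacted": 0.3, "business_criticality": 0.3}

    candidates = [
        # (name, volume rating, users-impacted rating, business-criticality rating), each rated 1-5
        ("Login", 5, 5, 4),
        ("Get quote", 4, 3, 4),
        ("Buy product or service", 3, 4, 5),
        ("Query data or account", 4, 3, 3),
    ]

    def score(volume, users, criticality):
        # Weighted sum of the three selection criteria.
        return (CRITERIA_WEIGHTS["volume"] * volume
                + CRITERIA_WEIGHTS["users_impacted"] * users
                + CRITERIA_WEIGHTS["business_criticality"] * criticality)

    # Rank the candidates and keep the highest-scoring ones in the test scope.
    for name, *ratings in sorted(candidates, key=lambda c: score(*c[1:]), reverse=True):
        print(f"{name:25s} score={score(*ratings):.1f}")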

Performance Test Approach

Once the test cases have been identified, the conditions, volumes and expected results need to be determined. Expected results are performance targets such as expected response times or processing time windows. As each run, even under similar conditions, will produce slightly different response times, the targets need to be expressed statistically, typically in the form "80% of transaction X responds within 2 seconds".
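
As a minimal sketch of how such a statistical target can be checked, the snippet below takes a set of measured response times (the sample values are made up) and verifies an assumed target of 80% within 2 seconds.

    # Check a statistical target such as "80% of transaction X responds within 2 seconds".
    # The measured response times below are made up for illustration.
    import statistics

    response_times = [1.2, 0.9, 1.8, 2.4, 1.1, 3.0, 1.5, 1.7, 0.8, 1.9]  # seconds

    TARGET_SECONDS = 2.0
    TARGET_PERCENTILE = 80

    # Share of measurements that met the target response time.
    within_target = 100 * sum(1 for t in response_times if t <= TARGET_SECONDS) / len(response_times)

    # The 80th-percentile response time (the value 80% of the transactions stay under).
    p80 = statistics.quantiles(response_times, n=100)[TARGET_PERCENTILE - 1]

    print(f"{within_target:.0f}% within {TARGET_SECONDS}s (target {TARGET_PERCENTILE}%), p80 = {p80:.2f}s")
    print("PASS" if within_target >= TARGET_PERCENTILE else "FAIL")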
 
With the test cases defined, we need to plan the overall performance test and the test cycles. For a performance test, the most common approach is to run 3 cycles, where each cycle runs all tests and ends with an intermediate report of the results and a window for performance tuning.
 
Each test run starts with a smoke test. This test is done with a single transaction and checks that the system is properly set up and configured and that the data has been properly populated.
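A minimal sketch of such a smoke check is shown below; the test environment URL and the record it fetches are hypothetical placeholders.

    # Smoke-test sketch: a single transaction against the test environment, only to confirm
    # that the system is reachable, configured and populated with data before any load is applied.
    # BASE_URL and the record path are hypothetical placeholders.
    import time
    import urllib.request

    BASE_URL = "https://perf-test.example.com"

    start = time.monotonic()
    with urllib.request.urlopen(f"{BASE_URL}/account/12345") as resp:
        body = resp.read()
    elapsed = time.monotonic() - start

    assert resp.status == 200, "environment not reachable or not configured correctly"
    assert body, "expected test data has not been populated"
    print(f"Smoke test OK in {elapsed:.2f}s")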
Then there is a ramp-up test, which increases the volume of transactions step by step and observes how the system responds. Does CPU and memory usage increase linearly, or is there abnormal behavior that needs to be investigated? The ramp-up is an important step in the performance test, as it can identify behavior that cannot be observed once a full-volume test is running.
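The sketch below illustrates the idea of a stepwise ramp-up; run_load() and sample_resources() are placeholder stubs standing in for whatever load tool and monitoring the project actually uses.

    # Ramp-up sketch: increase the simulated load step by step and record how the system responds.
    import random

    def run_load(concurrent_users, duration_minutes):
        # Placeholder for the real load driver; returns the measured 80th-percentile response time.
        return {"p80_response_s": 0.5 + 0.002 * concurrent_users + random.uniform(0, 0.1)}

    def sample_resources():
        # Placeholder for reading CPU and memory usage from monitoring or the system logs.
        return {"cpu_pct": random.uniform(20, 80), "mem_pct": random.uniform(30, 70)}

    RAMP_STEPS = [10, 25, 50, 100, 200, 400]   # concurrent users per step (illustrative)

    for users in RAMP_STEPS:
        metrics = run_load(concurrent_users=users, duration_minutes=15)
        usage = sample_resources()
        # If response times or resource usage grow much faster than the load,
        # stop and investigate before moving on to the full-volume tests.
        print(f"{users:4d} users: p80={metrics['p80_response_s']:.2f}s "
              f"cpu={usage['cpu_pct']:.0f}% mem={usage['mem_pct']:.0f}%")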
 
Performance Test Architecture


Once the ramp-up test has shown that the system behaves correctly, the system needs to go through a soak stress test. A soak stress test runs the platform at the normal expected volume for at least 72 hours. The extended 72-hour period is important, as these tests can identify memory leaks (increasing memory usage without properly releasing it) or degrading performance (due to the growing volume of data).
The test ends with a short peak test to see if the system can handle the expected peak volumes during the day, week or month.
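
One simple way to spot a leak during the soak window is to fit a trend line to periodic memory samples; a rising slope while the load is held constant is a warning sign. The sketch below uses made-up sample data and an illustrative threshold.

    # Soak-test check: fit a least-squares trend to memory samples taken over the 72-hour window.
    # The sample data and the threshold are made up for illustration.
    samples_hours = list(range(0, 72, 6))                            # one sample every 6 hours
    memory_pct = [41, 43, 46, 48, 52, 55, 58, 61, 64, 68, 71, 74]    # memory in use (%)

    n = len(samples_hours)
    mean_x = sum(samples_hours) / n
    mean_y = sum(memory_pct) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(samples_hours, memory_pct))
             / sum((x - mean_x) ** 2 for x in samples_hours))        # growth in % per hour

    print(f"Memory growth: {slope:.2f}% per hour")
    if slope > 0.1:
        print("WARNING: memory keeps growing under constant load - possible memory leak")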

Only in the last test cycle is a break test performed, to see at which volumes the system would really break or, more precisely, when the system stops responding within the expected response time targets.
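
The sketch below illustrates one way to drive a break test: keep increasing the load until the 80th-percentile response time exceeds the agreed target. run_load() is again a placeholder stub for the real load driver.

    # Break-test sketch: increase the load until the system stops meeting the response-time target.
    import random

    def run_load(concurrent_users, duration_minutes):
        # Placeholder for the real load driver (same idea as in the ramp-up sketch).
        return {"p80_response_s": 0.5 + 0.0008 * concurrent_users + random.uniform(0, 0.1)}

    TARGET_P80_SECONDS = 2.0
    users = 400                      # start from the highest level that passed in earlier cycles
    breaking_point = None

    while users <= 20000:            # safety cap for the illustration
        metrics = run_load(concurrent_users=users, duration_minutes=10)
        if metrics["p80_response_s"] > TARGET_P80_SECONDS:
            breaking_point = users
            break
        users = int(users * 1.5)     # increase the load by 50% each round

    if breaking_point:
        print(f"Response-time target no longer met at ~{breaking_point} concurrent users")
    else:
        print("No breaking point found within the tested range")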


Environment and Tooling

The logistics of performance testing need careful planning. The key element is the environment, which needs to mimic the production-sized environment as closely as possible.
Additionally, test tooling needs to be used that helps automate running the test cases and measuring the response times. In parallel, system logs need to be reviewed to look at resource use such as CPU, memory, disk and I/O usage.
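
A lightweight way to capture resource usage on the system under test while the load runs is to sample it periodically; the sketch below assumes the third-party psutil package, and the interval and sample count are illustrative.

    # Resource-sampling sketch for the system under test; assumes the third-party psutil package
    # (pip install psutil). Interval and sample count are illustrative.
    import time
    import psutil

    INTERVAL_SECONDS = 30
    SAMPLES = 120                                      # roughly one hour of samples

    for _ in range(SAMPLES):
        cpu = psutil.cpu_percent(interval=1)           # CPU usage over a 1-second sample
        mem = psutil.virtual_memory().percent          # memory in use
        io = psutil.disk_io_counters()                 # cumulative disk I/O counters
        print(f"{time.strftime('%H:%M:%S')} cpu={cpu:.0f}% mem={mem:.0f}% "
              f"read={io.read_bytes} write={io.write_bytes}")
        time.sleep(INTERVAL_SECONDS)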

The tooling also helps with simulating a high volume of user transactions. Not only can a single user be simulated performing a transaction with a think time, i.e. the time between user interactions, but a full load of users can also be distributed statistically, randomly spreading the interactions over time.
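
As one example of such tooling, the open-source Locust tool (Python-based) lets you describe a simulated user with think times and weighted transactions; the endpoints, weights and think times below are assumptions for illustration.

    # Locust sketch of a simulated user with think time and weighted transactions.
    # Endpoints, weights and think times are illustrative assumptions.
    from locust import HttpUser, task, between

    class ShopUser(HttpUser):
        wait_time = between(2, 8)          # think time in seconds between user interactions

        @task(3)                           # triggered three times as often as buying
        def get_quote(self):
            self.client.get("/quote")

        @task(1)
        def buy_product(self):
            self.client.post("/checkout", json={"sku": "ABC-123"})

Run with, for example, locust -f loadtest.py --host https://perf-test.example.com --users 500 --spawn-rate 25. Locust then spreads the simulated users' requests over time according to the configured think times, approximating the statistically distributed load described above.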

Use case:

For a large eCommerce implementation, the performance test was the only part that still needed to be completed. A new manager was responsible for executing the performance test and, having little experience, executed the tests without tooling, but was able to simulate the same volumes of transactions and load as specified. However, once the system went into production, it collapsed after 30 hours and the go-live had to be rolled back and postponed.

What happened? Even though the system was tested with the right volumes, the conditions under which the test was executed did not simulate a real production environment. It turned out that the volume loads had been generated from the same IP addresses, so technically the system saw them as the same user, even though they had different user names. Once in production, every user obviously connected from a different device, and due to a misconfiguration the system could not handle that. Performance testing tooling would have helped here, as it simulates traffic from different devices, and a costly re-run could have been avoided.

Timeline

From a timeline perspective, a performance test execution is relatively quick. For small systems, it can be done within a week; for large implementations such as ERP or eCommerce, it would take 3-4 weeks to execute the performance tests. What takes more time is the planning, the definition of the test cases and the logistics. Below is shown a test plan for a Core implementation with 3 cycles. A cycle would run for a week, with the tests run in the first 3 days and the results reviewed in the second part of the week. The final week would be used to run the final cycle with break testing and to prepare the final report.
 
The final report would show the conditions and the environment that have been tested. It would also list the transactions and interfaces that have been tested and their respective response times.
The key to testing is to cover as much as possible, but there may be open cases left after testing that could not be resolved during the test cycles. The report needs to identify which cases need further attention and what needs to be done to mitigate them.

Case study:

We ran a performance test on a large system and found a memory leak: memory usage kept increasing and was never released. The problem would only surface after 2 days, so given the go-live date we decided to go into production, but with a mitigating measure: restarting the system at midnight during the first week to release the memory. That gave us more time to find out how to resolve the problem without endangering the production date.
