A Comprehensive Game Design Methodology
From First Ideas to Spectacular Pitches and Proposals

The content of this website is licensed under a Creative Commons Attribution–NonCommercial–ShareAlike 4.0 International License (CC BY-NC-SA 4.0). You can freely use and share everything non-commercially and under the same license as long as you attribute it to and link to:

J. Martin

Level One: Prototyping

Proposition Phase Level One

Beat 2. Assess

What principal function playtests serve is a frequent source of confusion, so the following needs to be spelled out early and clearly:

  • Conducting playtests is not about asking for opinions; it is about learning what’s wrong with your design.

That’s it. Everything else follows from there. It doesn’t mean that you shouldn’t ask your playtesters for their opinions, far from it. But the interesting thing is not their opinions. The interesting thing is what their opinions tell you about the strengths and weaknesses of your design.

Then, playtesting should not be confused with quality assurance testing. QA testing is conducted by a team of QA testers, who are professionals and intimately familiar with the game, for the completely different purpose of quality control as an essential part of the development cycle.

Finally, another source of confusion is how playtests should be conducted. The most common approach is to confront playtesters with a prototype or a new build, observe them playing and perhaps have them self-comment during play, make notes, ask questions about their experience with a prepared questionnaire or with open questions, possibly record everything, and then sit down and analyze the data. On the surface, there’s nothing wrong with any of this; that’s more or less how it should be done. It’s under the surface where misconceptions lurk. What game designers often fail to realize is that such tests are, for all practical purposes, social science research. And to obtain test results that are valid, reliable, significant, and relevant, in other words, actually useful, these tests must follow a number of rules.

The first rule of playtesting is, you have to meticulously prepare every playtest in advance. There are two different sets of preparations for two different question types. These types differ epistemologically with regard to the nature of knowledge and how that knowledge can be obtained.

  • The Unknowns. The first type are questions that you already have, e.g.: are there enough clues for the player to find the hidden path to the tower? Are there enough enemies to keep the player engaged? Will the number of enemies overwhelm the player? And so on.
  • The Unknown Unknowns. The second type are questions that you don’t know exist, e.g.: does the player correctly interpret the level objective? (Playtesters miss the hidden path to the tower not because they can’t find it, but because they misinterpret the level objective.) Would the player rather go exploring instead of fighting? (The number of enemies is perfect, but playtesters aren’t happy because they would rather go exploring at this point.) And so on.

As a general rule, you can answer the first type of questions with testing methods that fall under the label of quantitative research, and the second with testing methods that fall under the label of qualitative research. These technical terms are not important; what matters is how you proceed in each case. Let’s have a look at both types of research and their associated methods in detail.

For the first type, collect everything you already know you want to know, and translate that into questions in such a way that each question asks exactly one thing, and one thing only. As you might have noticed, the example question above about the correct number of enemies was split in two (not enough enemies to keep the player engaged? too many enemies for the player to handle?) for that very reason. Then, once you have your set of questions, create a hypothesis for each question that, again, hypothesizes exactly one thing! There’s no need to create your hypothesis according to what you believe to be true. On the contrary: your results might be more reliable when you use your hypothesis to try to disprove something that you believe to be true. Also, be imaginative, because different hypotheses will teach you different things! Let’s say your question is: “Does the player find the hidden path to the tower?” There’s a great number of possible hypotheses for this question. For example, with separate mirror-hypotheses in parentheses: “Players will miss (find) the path to the tower.” “X percent of all players will find (will miss) the path to the tower.” “If players recognize at least two clues (fail to recognize two or more clues), they will find (they will miss) the path to the tower.” And so on. Whatever hypothesis you pick, it should be the one with the greatest potential to expand your knowledge effectively and efficiently for subsequent design decisions.

Now, why go through all this trouble to ask a simple question? There are three major reasons.

  • Focus. The first reason is that you need to focus. You cannot simply observe something and be aware of all the details, neither live nor in replay. But for playtesting, it’s the details that count. Almost all of what you’re not directly focused on will be gone. Sensory data, as you might remember from the Process phase, is first stored in sensory memory, where it decays rapidly: visual and haptic data will last from a few milliseconds to a maximum of two seconds, and audio data up to three seconds, with some luck. And the vast majority of that data isn’t forwarded to working memory, not to speak of long-term memory, to be processed. It’s dismissed because you’re focused on something else. To give you a sense of impact, there’s the famous selective attention test by Daniel J. Simons and Christopher F. Chabris, conducted at Harvard University and more popularly known as the “gorilla test.” Up to half of the participants fail to register, and therefore fail to remember, an actor in a gorilla costume who moves openly and deliberately through a group of six ball players divided into two teams. That’s because participants are occupied, as instructed, with counting the passes made by one of the two teams.
  • Judgments. The second reason is that human thought processes are riddled with systematic errors in judging and thinking. These errors, called cognitive biases, make up convincing reasons to dismiss the undesired and embrace the expected on the spot. Watertight hypotheses make it as tough as possible for our brains to give in to these biases and weasel out of the results and the consequences of our observations. Also, never change your hypothesis after the test! Maybe it turns out your hypothesis wasn’t a good fit for your setup or your results, or you realize it was sloppily formulated. Whatever the reason, it doesn’t matter. Throw away the test results, create a better hypothesis, and run a new test from scratch. Otherwise, your biases will win every time.
  • Resources. The third reason is that you will have a much better idea of how many playtesters you will need, what kind of playtesters you will need, how many observers and assistants you will need, what kind of testing equipment and technologies you will have to employ, and how, overall, you can allocate your resources in the most effective manner.
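One way to hold yourself to the "one question, one hypothesis, no changes after the test" discipline is to write each hypothesis down as an immutable record before the first session. The following is only a minimal sketch: the class name, the field names, and the 70 percent threshold are illustrative assumptions, not part of the methodology itself.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Hypothesis:
    question: str      # asks exactly one thing
    statement: str     # hypothesizes exactly one thing
    threshold: float   # hypothetical pass mark, fixed before the test

    def supported_by(self, observed_share: float) -> bool:
        """True if the observed share of players meets the threshold."""
        return observed_share >= self.threshold


# Hypothetical example based on the tower question from the text.
tower_path = Hypothesis(
    question="Does the player find the hidden path to the tower?",
    statement="At least 70 percent of all players will find the path.",
    threshold=0.70,
)

# Trying to "adjust" the hypothesis after the test raises an error.
try:
    tower_path.threshold = 0.50
    was_changed = True
except FrozenInstanceError:
    was_changed = False
```

If a hypothesis turns out to be badly formulated, the sketch forces the same move the text recommends: create a new `Hypothesis` object and run a new test, rather than editing the old one.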

When you’re sure that you’re testing what you actually want to be testing (that’s the “validity” part), you should then see to it that the conditions are exactly the same for every session and every playtester in a particular setup (your test needs to produce similar results under consistent conditions to pass the “reliability” check). Playtesters should play with the same equipment; different rooms or the same room at different times should have roughly the same noise level, temperature, and lighting conditions; the welcome and introduction should be the same; and whatever else applies to your setup should always remain consistent. In the same manner, following up on a topic that we discussed in the Process phase in Level Two: Interactivity, you can’t have random (or rather pseudo-random) events in your prototype’s test setup if you want to get results that can be compared and interpreted in a meaningful manner. Unless, of course, you want to test a pseudo-random event!
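If your prototype relies on a pseudo-random number generator, one common way to keep sessions comparable is to fix the seed per test setup, so every playtester meets the same sequence of events. Here is a small sketch under that assumption; the function name, the seed value, and the enemy counts are made up for illustration.

```python
import random


def spawn_sequence(seed: int, waves: int) -> list[int]:
    """Return the number of enemies per wave for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randint(2, 6) for _ in range(waves)]


TEST_SETUP_SEED = 20240501  # hypothetical: one fixed seed per test setup

# Two sessions of the same setup produce identical enemy waves.
session_a = spawn_sequence(TEST_SETUP_SEED, waves=5)
session_b = spawn_sequence(TEST_SETUP_SEED, waves=5)
sessions_match = session_a == session_b
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps other code from advancing the generator between sessions.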

Finally, after your test players have finished playing, you can and should ask them prepared questions to verify your observations and to stress-test your hypotheses. These questions should be the same for every player; in other words, they should be “standardized.” That way, the answers from all your playtesters from a particular test setup can be turned into a data set and compared, analyzed, and interpreted.
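To illustrate what "turned into a data set" can look like in practice, here is a minimal sketch. It assumes every playtester answers the same questions on a 1 to 5 scale; the question wording, the tester IDs, and all scores are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical standardized questions, identical for every playtester.
QUESTIONS = {
    "q1_clues": "How easy was it to notice the clues to the hidden path? (1-5)",
    "q2_enemies": "How manageable was the number of enemies? (1-5)",
}

# Invented answers, one row per playtester of the same test setup.
answers = [
    {"tester": "P01", "q1_clues": 2, "q2_enemies": 4},
    {"tester": "P02", "q1_clues": 3, "q2_enemies": 5},
    {"tester": "P03", "q1_clues": 1, "q2_enemies": 4},
    {"tester": "P04", "q1_clues": 2, "q2_enemies": 3},
]


def summarize(rows: list[dict], question_id: str) -> dict:
    """Collect all scores for one question and compute simple statistics."""
    scores = [row[question_id] for row in rows]
    return {"n": len(scores), "mean": mean(scores), "median": median(scores)}


summary = {qid: summarize(answers, qid) for qid in QUESTIONS}
```

Because every row has the same keys, answers can be compared per question across all playtesters, which would not be possible if each tester had been asked something different.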

That should suffice for the first type of research, the “quantitative” type.

The second type of research, the “qualitative” type, needs a completely different kind of setup: the semi-structured or unstructured interview. Remember, with this type it’s about questions you don’t yet have, questions you don’t yet know exist. So you can’t just go and ask them! Here, in stark contrast to your preparations for the first type of research, you only prepare a very loose and very broad set of questions, all of which should be as open as possible. You can ask these questions after or even during a playtest (more on that below), individually with each playtester or with several playtesters as a group (also called “focus group”). You can ask all of your loosely prepared questions, or only a few, in whatever order you like, and you can and should pursue interesting clues with new questions and follow-up questions that you make up on the spot.

In Players Making Decisions, Zack Hiwiller makes a very important point in this regard, and that is to keep questioning. About our own work and our own accomplishments, we all have the tendency to answer questions and to correct wrong assumptions. And we want to be helpful! But when your playtester asks questions or makes bad assumptions or needs help, you should keep questioning instead—questions like, to take a few examples from Hiwiller, “Why would you think that?” or “What did you think at that moment?” or “Why did that bother you?” or “Why would you want to do X at that point?” and similar. For unstructured or semi-structured interviews, this is among the best advice you can get to drill down and lay bare the really interesting stuff.

A very effective method is K. Anders Ericsson and Herbert A. Simon’s “Think Aloud Method” or “Think Aloud Protocol,” proposed in their 1980 paper “Verbal Reports as Data.” With this method, the “interview” is conducted while the playtester is playing, and the “interviewer” has one task, and one task only: to keep the playtester talking about what they’re doing and why they’re doing it. You can’t ask questions that are more open than that! Should your playtesters be playing in their own homes, they can record their comments and impressions in so-called “diaries.” Which is still an effective method, even if you can’t nudge them on from time to time.

But whatever method you employ for this second type of questions, always make sure you don’t overprepare. The whole reason is to elicit interesting observations and experiences from your playtesters that tell you something about your design that you haven’t thought of yet, to shed light on the “unknown unknowns.”

Finally, there’s the question of multiplayer and coop testing for both types of research. In most cases, you want to have a setup where one playtester or one team of playtesters plays against, or together with, carefully instructed players from your developer team. Your own players could play cautiously, aggressively, professionally, naively, annoyingly, or in any manner you instruct them to, so that you can observe the behavior and reactions of your playtester or playtesters in different emotional circumstances. Only by isolating playing behavior, and by reducing the number of events and variables you need to track, will you be able to accomplish your test objectives in coop or multiplayer setups. There might be exceptions to this. But if you want to let loose a group or groups of playtesters against each other, be sure to have a great set of hypotheses for that.

That’s enough with respect to methods. Next up, you have to decide what kind of playtesters you need. This, of course, depends on your particular game, the state of your prototype, your test setup, and your questions. But to make sense, all conceivable scenarios should have at least these two elements in common:

  • Target Audience. Your playtesters must belong to your primary target audience. Otherwise, testing doesn’t make sense for a whole raft of reasons, some of which we discussed in the context of difficulty and familiarity in the Process phase. Playtesting with players that don’t belong to your primary target audience works both ways, so to speak, and not in your favor. On the one hand, you don’t learn about the strengths and weaknesses of your design for players that will actually play your game. On the other hand, you might end up fixing things that don’t need fixing. Imagine you rebuild your whole interface because it took your playtesters too long to figure it out, but your primary target audience would have adapted to your original design in a snap. That’s a lot of resources that went into what could be called, very generously, a medium priority item.
  • Tester Type. For each setup, you have to determine whether you need playtesters that already have some experience with your game, either from earlier tests or because they have seen some of it in action, or so-called “tissue testers,” playtesters who come fresh to your game and cannot be used again for this or any other setup that calls for tissue testers. (But they can later be used for other setups that do not involve tissue testers.)

Obviously, it can be difficult to find not only good playtesters, but the right playtesters for your prototype and for your particular test setup. If it helps, you’re not the only one with this problem! Not just playtesting, but social science research in general has been plagued by the problem of “convenience samples” since forever. But shortcuts of this kind won’t get you anywhere in the end, neither in research nor for playtesting purposes.

When you have found the right playtesters for your project, you can mix and match all the different testing methods discussed above as you please. But don’t push your playtesters past exhaustion. Also, there’s the question of recompense. During development and with a budget, it should go without saying that playtesters need to be paid. But even for a proof of concept–prototype at only the meagerest of budgets, it’s good manners to offer some recompense when you’re trying to recruit playtesters. You don’t have to shower them with silver and gold. But playing a prototype riddled with issues and answering whole catalogs of questions isn’t a reward in itself. So think of something that would be appreciated. (Promises of a copy of the final game, years away in a highly uncertain future, don’t count.)

Now that we have discussed playtesting methods and playtesters, let’s move on to test data. From your playtests, you should always record as many data points and as many different types of data as you can, and meticulously document everything. For your proof of concept–prototype, this will most certainly not include elaborate technical tools like automatically recorded, graphable event data, heat maps and hot spot measuring, physiological tests that measure electrodermal activity or muscle activity, or a split testing infrastructure, to name a few. Or artificial neural networks that you train to play and break your game! Yet, there’s still a lot you can do and record with a minimum of effort.

To begin with, you should track time, record everything that is said during tests and interviews, and record physical events with a camera. The gameplay should also be recorded, obviously, but there’s more. Take the built-in computer camera, or any old mounted cell phone, and record the facial expressions of your playtesters. Later, you can run these recordings side-by-side with your gameplay recordings to review these sessions in detail. Facial expressions, most prominently so-called micro-expressions, will tell you a lot about emotional reactions, from confusion and annoyance to flow and fiero. And don’t forget fitness trackers: even run-of-the-mill models can record several types of physical data that you can correlate with in-game events later! Finally, there should always be an assistant present whose only job is to watch and take notes about everything noteworthy, from peculiarities to irregularities, with regard to setup and procedure.
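Lining up tracker data with in-game events mostly comes down to shared timestamps. The following sketch assumes that both recordings use seconds since session start on the same clock, and that the tracker exports time-sorted (time, heart rate) samples; all names and values are invented for illustration.

```python
from bisect import bisect_right

# Hypothetical tracker export: (seconds since session start, beats per minute).
heart_rate = [(0.0, 72), (5.0, 74), (10.0, 90), (15.0, 95), (20.0, 80)]

# Hypothetical in-game events noted by the observer or logged by the prototype.
events = [
    (4.0, "enters courtyard"),
    (11.5, "ambush starts"),
    (19.0, "finds hidden path"),
]


def reading_at(samples, t):
    """Return the most recent reading taken at or before time t, or None."""
    times = [ts for ts, _ in samples]  # samples must be sorted by time
    i = bisect_right(times, t) - 1
    return samples[i][1] if i >= 0 else None


# Pair each event with the heart rate that was current when it happened.
aligned = [(label, reading_at(heart_rate, t)) for t, label in events]
```

Such a table is only a starting point for interpretation. Whether a spike means excitement, frustration, or just a fidgeting playtester is something the face and gameplay recordings have to answer.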

After making sure you have all the data you need, the last step is analysis and interpretation. Of the four parameters of sound scientific research we mentioned earlier in this beat, we already touched upon “valid” and “reliable” in the methods section. When it comes to analysis and interpretation, the other two become important: “significance” and “relevance.”

Fig.5.2 Test Setup

A high significance means that the result has a low probability of having come about by chance, or by error. (It’s a bit more complicated than that, but it’ll do for our purposes.) One of the strongest indicators of shaky significance is a very low number of playtesters, or a number of wrong playtesters (in terms of target audience, repeat or tissue testers, and so on). Without the complex math social scientists apply to setups and analyses, you can’t do much about it, except be careful and considerate. Yet, being careful and considerate goes a long way. You do not want to put a lot of time and effort into redesigning parts of your prototype, or later your game, on the basis of a faulty test result.
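To get a rough feel for why a very low number of playtesters makes results shaky, a simple binomial calculation can help. This is a sketch, not a substitute for proper statistical analysis. It assumes, purely for illustration, that finding the hidden path were a coin flip (a 50 percent chance per player), and asks how likely it would be to see the same 80 percent success share with 5 and with 40 playtesters.

```python
from math import comb


def binomial_tail(successes: int, n: int, p: float) -> float:
    """Probability of at least `successes` hits in n independent tries."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(successes, n + 1)
    )


# Hypothetical results: 80 percent of players find the path in both cases.
small_test = binomial_tail(4, 5, 0.5)    # 4 of 5 playtesters: 0.1875
large_test = binomial_tail(32, 40, 0.5)  # 32 of 40 playtesters: below 0.001
```

With five playtesters, pure chance would produce that result almost one time in five; with forty, practically never. The same observed share therefore carries very different weight depending on how many playtesters stand behind it.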

The last parameter, relevance, means exactly what it seems to mean. A test can be reliable and valid, and its results significant, but the results might not matter enough to warrant action, or even matter at all. Don’t waste your resources on findings that are not relevant.

All this, from methods to playtesters to data to evaluation, is certainly not exhaustive. But it should give you a head start. In any case, it should be more than enough for the purpose of testing your prototype.

Nevertheless, we shouldn’t close this beat without the advice already given with regard to creating your reward system in Level Six: Integral Perspectives II: if all this isn’t your cup of tea, which is perfectly okay, you might want to try and get someone on the team, at least temporarily, who has a background in social sciences. Not necessarily for your proof of concept–prototype, that would be overkill. But for your pre-production prototype and the development cycle in general. This is an investment that you will not regret.