Volume 33, Issue 1 p. 226-243
METHODS DIALOGUE

Scale use and abuse: Towards best practices in the deployment of scales

Kelly L. Haws (Corresponding Author)
Owen Graduate School of Management – Marketing, Vanderbilt University, Nashville, Tennessee, USA

Correspondence: Kelly L. Haws, Owen Graduate School of Management – Marketing, Vanderbilt University, 401 21st Avenue South, Nashville, Tennessee 37203, USA. Email: [email protected]

Kevin L. Sample
University of Rhode Island - Marketing, Kingston, Rhode Island, USA

John Hulland
Marketing Department, Terry College of Business, University of Georgia, Athens, Georgia, USA

First published: 16 July 2022

Accepted by Lauren Block, Editor; Associate Editor, Joel Huber

See relevant article: https://doi.org/10.1002/jcpy.1319, Commentaries on “Scale use and abuse: Towards best practices in the deployment of scales” by Constantine S. Katsikeas; Shilpa Madan; C. Miguel Brendl; Bobby J. Calder; Donald R. Lehmann; Hans Baumgartner; Bert Weijters; Mo Wang; Chengquan Huang; Joel Huber.

Abstract

Given that consumer researchers and other social scientists often operate with latent constructs that are not directly observable, sound measurement practices are essential for the continual development of scientific knowledge. An abundance of validated and reliable scales to measure constructs of interest exists within the literature. However, once these measures are introduced, how are they subsequently utilized? In this article, we focus on the deployment of measurement scales and the critical underlying issues consumer behavior and marketing researchers should consider. We discuss recent practices in scale deployment and specifically scale modification (through changes in wording, length, and dimensionality). Building from this perspective, we provide recommendations for best practices in the usage, adaptation, validation, and reporting of previously introduced scales.

Understanding consumer-related phenomena often involves investigating both established and new theoretical constructs. Accordingly, it is imperative that researchers accurately manipulate and measure these constructs. In the current article, we focus on the measurement of constructs via scales—established multi-item instruments that measure focal constructs in a reliable and valid manner (Churchill, 1979; Netemeyer et al., 2003). In some cases, researchers undertake an extensive process to develop useful and psychometrically sound measurements for use in a variety of research (e.g., Hulland et al., 2022; Netemeyer et al., 2003). However, our focus here is on the usage of these measurement tools after their introduction. The issue of scale deployment has received far less attention than scale development. Whereas any researcher measuring a theoretical construct of interest needs to ensure that they appropriately measure it, we focus primarily on consumer behavior research, noting that the conclusions presented here can also be applied to other areas of marketing research (and beyond). In combination with our review of relevant literature more broadly, we offer tangible examples drawn from the Journal of Consumer Psychology (JCP).

So, how are consumer researchers using scales? To aid our discussion, we begin with four recent examples of scale usage in JCP that illustrate how researchers apply acceptable current standards in scale deployment. To begin, Farmer et al. (2021) utilized several scales in their paper, but we focus on two instances here. First, they utilized the intolerance of ambiguity scale exactly as originally created and validated (Webster & Kruglanski, 1994), which we label as “as is, validated” usage, and we argue is the gold standard for scale deployment. As a second example, Farmer et al. (2021) also used a scale to measure political ideology (Kidwell et al., 2013). This latter scale, however, was not empirically validated when originally developed but was utilized by Farmer et al. (2021) exactly as originally published, which we label as “as is, improvised” usage. Thus, in both instances, a scale was used as originally published, but in the first case, the employed scale was validated when developed (i.e., it underwent initial empirical validation), whereas in the second case the employed scale was improvised (i.e., the published source for the original scale did not empirically validate it).

In our third example, Raghunathan and Chandrasekaran (2021) utilized a validated scale of corporate social responsibility (Turker, 2009). However, Raghunathan and Chandrasekaran shortened the length of this scale and modified the item wording, which we refer to as “modified, validated” usage. Our final example of scale deployment is drawn from Affonso et al. (2021), who deployed a scale to measure decision difficulty (Laroche et al., 2005). As with the political ideology scale, this scale was never empirically validated through scale development procedures. Additionally, Affonso et al. (2021) modified the wording of this scale, which we label as “modified, improvised” usage. Thus, in our third and fourth examples, the scales used were both modified, but one of these scales was validated, whereas the other was improvised. Herein we ask, what are the implications of these various scale usages for consumer researchers?

Ultimately, the researchers in these examples had to ask themselves, “How can our construct best be measured?” This seemingly straightforward question led to four different approaches, each potentially appropriate. However, the selection of a suitable approach should not be taken lightly. Inappropriately applying a scale can have serious consequences for research, running the risk of introducing potential methodological confounds or unknowingly undermining theory, resulting in inconsistencies in findings that can harm “the basis for cumulating the findings of studies necessary to obtain generalized knowledge about marketing constructs” (Finn & Kayande, 2005; p. 18), a concern echoed by Gilliam and Voss (2013) and Diamantopoulos (2005).

For instance, a researcher may use all or part of an existing, but not exactly appropriate, measure—attempting to inform theory using a partially related scale or a previously utilized yet not fully validated set of measures (which we refer to as “improvised”). In other situations, a researcher may end up utilizing a scale in a way that differs substantially from the original's intent or structure (changing length, scope, intent, content, and/or item wording), while relying on a citation to the prior measure to justify its current usage. Unfortunately, such practices can destabilize (both empirically and conceptually) a stream of research over time. This is particularly a concern given that constructs are rarely studied or measured in a vacuum, but rather, they are assessed alongside other constructs in an effort to understand the interrelationships between them (Krosnick, 1999; Stanton et al., 2002).

Given that consumer researchers and other social scientists often operate with latent constructs that are not directly observable, sound measurement practices are essential for the continual development of scientific knowledge. In contrast to arguments for not measuring constructs (e.g., Calder et al., 2021), we assert that scales can and should be effectively utilized to measure key constructs. Yet, in doing so, researchers need to consider how this might be done most appropriately. Accordingly, the purpose of this article is threefold. First, we provide a glimpse into how scales are currently being used in consumer research. Second, we offer considerations for the appropriate selection and deployment of scales. Finally, we specify the due diligence and reporting needed when deploying scales within marketing research. Given the ready availability of online supplements for recent research articles, more thorough reporting of the process taken when utilizing scales is both feasible and important. To aid readers throughout, we provide tables to succinctly capture the essence of our arguments, starting with an initial glossary of key existing and new scale terminology in Table 1.

TABLE 1. Glossary of scale-related terminology
Established Terminology

Reliability: "…that portion of measurement that is due to permanent effects that persist from sample to sample." (p. 10)*
Internal Consistency/Reliability: "…assesses item interrelatedness. Items composing a scale (or subscale) should show high levels of internal consistency." (p. 10)* Typically assessed with Cronbach's alpha but may be assessed with McDonald's omega or a host of other procedures.
Alpha: "The most widely used internal consistency reliability coefficient…" (p. 11)* Developed by Cronbach (1951); equally weights all items.
Omega: A metric to assess internal consistency developed by McDonald (1999) that is, for particular instances, more appropriate than alpha (see Hayes & Coutts, 2020).
Composite Reliability: A specific way of assessing internal consistency, which uses SEM to weight items.
Test–retest: "…the stability of a respondent's item responses over time." (p. 10)*
Validity: "…refers to how well a measure actually measures the construct it is intended to measure. Construct validity is the ultimate goal in the development of an assessment instrument and encompasses all evidence bearing on a measure." (p. 11)*
Face Validity: "…the mere appearance that a measure has validity (Kaplan & Saccuzzo, 1997, p. 132)."* "…one aspect of content validity (Nunnally & Bernstein, 1994, p. 110)." (p. 6)**
Convergent Validity: "…the degree to which two measures designed to measure the same construct are related." (p. 8)**
Discriminant Validity: "…assesses the degree to which two measures, designed to measure similar but conceptually different constructs, are related." (p. 8)**
Factor Analysis: "…is an appropriate and popular method for assessing dimensionality of constructs" … determining "the number of common factors or latent constructs needed to account for the correlation among the variables." (p. 27)* In addition, factor analysis is used to estimate loadings between items and constructs.
Exploratory Factor Analysis (EFA): Used when "the researcher generally has a limited idea with respect to the dimensionality of constructs and which items belong or load on which factor." Also, "…is typically conducted during the initial stage of scale development." (p. 27)* EFA is most typically conducted using principal component analysis.
Confirmatory Factor Analysis (CFA): "…a commonly accepted method to test/confirm dimensionality…focuses on whether a hypothesized factor model does or does not fit the data." (p. 27)* Fit is most typically determined using maximum likelihood estimation (MLE).
Item Loadings: An estimate (often standardized) of the relationship between an individual measurement item and an underlying factor or latent construct.

New Terminology for our Article

Types of Scales: Refers to the scale selected by a researcher.
Validated: A multi-item measure that has been vetted per scale development guidelines.
Improvised: A multi-item measure that has not been formally vetted per scale development guidelines; can also be referred to as an "ad hoc" scale.
Types of Scale Deployment: Refers to how a researcher uses a scale in an experiment.
As Is: Utilizing a scale exactly as previously published.
Modified: Adapting a previously presented validated or improvised scale through changes to the wording, length, and/or dimensionality of a scale.

Note: *Quoted from Netemeyer et al. (2003); **quoted from Bearden et al. (2010).
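To make the alpha and omega entries above concrete, the following illustrative sketch (ours, not drawn from the article) computes Cronbach's alpha from simulated item responses; the variable names and data are hypothetical.

```python
# Illustrative only: Cronbach's alpha for a respondents-by-items matrix of numeric responses.
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total (summed) score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated responses: 100 respondents, 5 items sharing one underlying trait (hypothetical data).
rng = np.random.default_rng(1)
trait = rng.normal(size=(100, 1))
items = trait + rng.normal(scale=0.8, size=(100, 5))
print(f"alpha = {cronbach_alpha(items):.2f}")
# McDonald's omega additionally requires loadings from a one-factor model (Hayes & Coutts, 2020)
# and is therefore not reproduced in this short sketch.
```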

THE CURRENT STATE OF SCALE USAGE

Journal of Consumer Psychology scale usage review

To understand how scales are being deployed in consumer research, we conducted a review of recently published research. To avoid biases through the selection of hand-picked exemplars, we conducted a census review of articles from four consecutive recent issues of the Journal of Consumer Psychology [30(3), 30(4), 31(1), 31(2)]. This review identified 44 published articles, with nine of these being conceptual and 35 empirical. Careful review of each of the 35 empirical articles identified the use of some form of an existing scale in 22 articles (i.e., 50% of the articles overall, 63% of the empirical articles). If a paper did not cite a source for a set of items or only utilized one item, we assumed that it was either a straightforward construct not requiring a validated measure or an improvised scale constructed by the authors, and we did not include it in our review. Overall, there were 66 unique instances of scale deployment, with many articles using multiple scales. We considered the use of the same scale more than once within the same paper as a single deployment instance. For example, Teeny et al.'s (2020) use of an arousal scale in studies 2, 3, 4, and 5 counts as one usage.

We examined each of these 66 scale usages in more detail. In most cases, the authors cited the source of the scale in the body of the paper during the introduction of the experiment and provided more details in the Methodological Details Appendix (MDA). Although we were initially most interested in determining whether scales were being used as developed in the original source, and if not, how they were being changed, we also examined the validation procedures (or lack thereof) of the scale within the initial cited source. Specifically, we conducted a comparison of the focal JCP paper and its accompanying MDA with the cited scale source (sometimes the cited source referenced another paper for the original scale, and we reviewed this additional source as well). This allowed us to determine whether the cited scale had undergone an initial validation process within one of the cited works or if it was an improvised scale. Although we acknowledge that there is significant variation in terms of the extent to which validation evidence is provided, we coded a scale as previously validated if the initial validation process provided evidence of the scale's validity and reliability through at least a rudimentary application of scale development methods (Churchill, 1979; Netemeyer et al., 2003). Further, we were able to ascertain if any modifications from the cited source of the scale had been undertaken and, if so, the type of modifications. In five instances, due to a lack of information, it was indeterminate as to what type of modification took place. Table 2 provides an overall summary of our findings. A more complete picture of the specific scales used in each paper and their conceptual usage (e.g., independent variable, mediator, moderator, manipulation check) is presented in Table A1.

TABLE 2. Scale deployment summary from review of JCP issues 30(3)–31(2)

Scale deployment            Scale type selected          Total
                            Validated     Improvised
Part A: Summary of scale usage
Used as is                  16            11             27
Modified                    17            22             39
Total                       35            31             66
Part B: Summary of scale modifications
Wording only                 3             7             10
Length only                  2             1              3
Dimensionality only          8             0              8
Multiple modifications       3            10             13
Indeterminate                1             4              5

Scale selection and deployment

As summarized in Table 2, our review shows that researchers select one of two general types of scales [validated (53.0% in our review) or improvised (47.0%)]. Validated scales have been developed using scale development procedures as laid out by Churchill (1979), Netemeyer et al. (2003), and others. In contrast, improvised scales are constructed in a more ad hoc manner, having not been subjected to formal development procedures.

In addition to making a choice about scale selection, researchers choose how to deploy a scale. Looking at Table 2, we see that scales are deployed in one of two manners [as is (40.9%) or modified (59.1%)]. Deploying a scale “as is” refers to using the scale as previously published, whereas modifying a scale can refer to a change in item wording, scale length, or scale dimensionality (i.e., the number of dimensions assessed). These modifications can also occur concurrently (as shown in Panel B of Table 2). In fact, our review documents that the use of multiple modifications is the most common form of modification (most typically both wording and length).

As illustrated in our opening examples (and shown in Table 2), researchers make use of all four combinations of scale type (previously validated vs. improvised) and deployment (“as is” vs. “modified”). Yet, how do researchers initially decide which scale to use and how to deploy it? Additionally, is there a better way to make these decisions? In the next two sections, we reflect on the considerations that emerge for both the scale type and deployment decisions. Following this, we suggest best practices for scale deployment.

CONSIDERATIONS FOR VALIDATED AND IMPROVISED SCALES

The Holistic Construal framework proposed by Bagozzi (1984) provides a conceptual lens to help understand consumer researchers' collective use of both validated and improvised scales over time. Bagozzi's Holistic Construal explicitly shows the connection between theoretical constructs and empirically observed measures. The core idea underlying the holistic construal (encapsulated in the larger, rounded-edged rectangle seen in Figure 1) is that the conceptual framework that links a focal construct to both antecedents and consequences is established in a particular time/context.

FIGURE 1. Holistic construal conceptual framework

When a researcher uses a theory in one or more studies, both the theory and the measures are tied to that specific context. As time and/or context changes, however, all elements are potentially subject to some level of obsolescence or irrelevance. For example, the definition of “environmentalism” as a construct in the 1970s and its relevant measures as identified by researchers at the time resulted in a strong connection between theory and measurement at that moment. However, fifty years later the construct definition has broadened, and many of the original measures (e.g., those related to phosphates) are no longer relevant. In this example, both the theory and measures have evolved, but they have done so independently of one another. In such cases, strict adherence to the exact same set of questions originally devised to measure the underlying construct will limit generalizability (and likely confound measures with findings), and it may be better to use reasonable variations (i.e., conceptual replication).

As Bagozzi (1984, p. 27) notes, by “addressing the content and structure of our theories in more depth, we can make the science and art of marketing less haphazard and more subject to evaluation and control.” These considerations are at the crux of scale development because instruments must correctly measure their intended constructs. This means that any measurement of a construct of theoretical consequence should be done with a validated scale. Furthermore, over time these relationships may need to be revisited. Still, at other times, it is acceptable for research to use a more improvised scale when seeking to answer an immediate question (e.g., several items assessing purchase likelihood), as this may carry no long-term effects on theoretical constructs.

Given the evolving space/time continuum within which every construct lies, how can researchers assess the long-term impact of scale usage? Whereas there is no way to truly know this, researchers should nonetheless look at the entities being measured and the type of scale employed to arrive at a reasonable answer. As a starting point for this assessment, one can first examine the nature of the scale itself. Scales can measure any of several specific entities (i.e., independent variables, moderators, mediators, and outcome variables), some that are more enduring (typically “trait” measures) and some that are more transitory (typically “state” and/or “response” measures). Trait scales measure an enduring characteristic of a consumer [e.g., consumers who tend to be picky shoppers (Cheng et al., 2021), tend to be environmentally conscientious (Haws et al., 2014), or prefer local food (Reich et al., 2018)]. Measurement of individual differences most typically serves the role of independent variables or moderators. Trait scales are expected to remain relatively stable over time (often accompanied by test–retest reliability when the scale is initially developed; see Netemeyer et al., 2003), and therefore the set of measures employed should remain consistent.

In contrast, other scale usages are more transitory in nature and depend almost exclusively on the precise stimuli and manipulations presented in a specific research study (i.e., state and response scales). State scales measure the momentary disposition of a consumer that can be manipulated [e.g., PANAS (Watson et al., 1988); coping (Duhachek, 2005)], and tend to appear in conceptual frameworks as consequences or mediating variables. Response scales gauge a reaction to another entity [e.g., product designs (Homburg et al., 2015), brands (Warren et al., 2019), or more general product or brand evaluations], and most often appear as consequent or outcome variables. Still, even though responses to these state and response measures may change from study to study, the measures used to assess them are less likely to do so, and any such change occurs only slowly over time.

We now provide considerations for scale deployment, first when used “as is” and second when modified from the original source, while embedding observations about the usage of improvised and validated scales. From a measurement perspective, we suggest that enduring and transitory scales are largely the same, and therefore we do not continue to emphasize this distinction. Throughout, we draw upon key examples from both our systematic JCP review and the literature at large, offering perspectives on the potential benefits and downfalls of these various forms of scale usage.

CONSIDERATIONS FOR “AS IS” AND MODIFIED SCALE DEPLOYMENT

As is scale deployment

When utilizing a previous scale “as is,” there are three key considerations for its deployment. (1) Domain of applicability: Using a scale “as is” is not the same as using a scale as intended, and researchers should take care to ensure that their investigative context is appropriate for the scale's original, stated purpose. (2) Multiple scale options: Choosing among multiple scales when more than one option exists. (3) Use of prior improvised measures: Using a set of measures previously published but not specifically validated in the prior article(s).

Domain of applicability

First, without deviating from the intended theoretical foundations of their work, researchers must ensure that their current constructs appropriately match the measures of the scale chosen for deployment. This typically involves investigating the research in which the scale was initially developed, ensuring that its definitions and domains reasonably match the current intended usage.

An illustration of inappropriately matching scales over time involves the deployment of Hofstede's cultural dimensions scale (Hofstede, 1980) as an individual-level psychological measure. Hofstede (1980) originally developed a cultural model that encompassed four dimensions, each of which is defined at the societal or national level. Over the decades, Hofstede and colleagues have provided country-specific scores for each of these dimensions. Whereas such scores nicely capture cross-country cultural differences, they are defined at the national level. Yet, many researchers have applied these Hofstede scores at the individual level. As noted by Yoo, Donthu, and Lenartowicz (2011, p. 195), this involves a process whereby “individuals are equally assigned Hofstede's national culture indices by their national identity.” Such a process ignores individual differences within cultures and commits the ecological fallacy of interpreting differences between populations as if they applied between individuals (see Hofstede, 1980; Williamson, 2002). In contrast, Triandis et al. (1985) introduce an idiocentrism–allocentrism scale that refers to individualism–collectivism at the individual level. Triandis et al. (1985, p. 396) conclude that this distinction between the psychological and cultural levels is an important one, and they propose that it is more appropriate that “collectivism-individualism terminology be employed for analyses at the cultural level and the allocentric-idiocentric for analyses at the individual level.” Therefore, to avoid such misapplications, we urge researchers to carefully ensure the appropriate usage of a scale.

Multiple options

Our second consideration is that, in some cases, more than one viable scale may be available and potentially relevant. Ideally, one scale more closely aligns with the researchers' conceptual framework or has been shown to be superior within prior literature. However, this is not always the case. In such situations where multiple options exist, the consumer researcher can use Google Scholar (or other suitable databases) to restrict searches to certain journals or domains, find scales of interest that have been used within these domains, determine how they have been used, and identify relevant usage citations, including any updated and validated versions of the original scale. This can provide insight as to how scales have been historically deployed and help determine which scale would be most appropriate for the investigation at hand. This information (i.e., the source of the original and modified scale) can then be cited within the manuscript.

Another way to decide between multiple options is to consider scale length. In an example from our JCP review, Chan (2020) makes a reasonable determination of this sort, choosing the Brief Need for Cognitive Closure Scale (BNCCS) over the Need for Cognitive Closure Scale (NCCS), stating that participants “…completed the 15-item BNCCS (Roets & Van Hiel, 2011), which we chose over the original NCCS as it had 42 items (Webster & Kruglanski, 1994) that we felt was too long for an online study” (p. 517). Whether a decision is made based on length or relevance, providing clarity as to why one scale was chosen over another is important for reference by future researchers.

Use of prior improvised measures

A third consideration for researchers deploying a scale “as is” revolves around the use of improvised scales. In our JCP review, we found 31 instances of improvised measures used from prior research (see Table 2), the majority of which had been further modified (22 occurrences of “modified” usage). For example, Farmer et al. (2020, 2021) utilized the improvised political ideology scale they had originally adapted (see Kidwell et al., 2013) from a prior source (Nail et al., 2009). Whereas they utilized the scale as is, the initial lack of validation could be problematic. We do note that Farmer et al. attempted a limited assessment of validity by checking the correlation between the scale and political party identification. However, in other manuscripts encountered during our review that deployed improvised scales, we noted no use of validation procedures. We strongly urge researchers to find a validated scale or engage in additional validation procedures (as explained in a later section).

“As is” usage summary

Ultimately, researchers using scales developed in prior research “as is” are attempting to follow good practices. However, they should ensure that the measures are in fact appropriate to the current study context, that reasons for using the measures are explicitly provided (especially when multiple options exist), and that improvised scales are appropriately vetted (we provide an overview of this relatively straightforward task in our Best Practices section below).

Modified scale deployment

Researchers must often wrestle with the fact that existing construct measures do not completely fit with the particular context that they are studying. This often leads to modifications such as shortening established scales, using only one or a few dimensions of a multidimensional scale, creating a set of improvised measures for a specific context, and/or combining scales, resulting in a measure loosely based on prior scales. The negative consequences of doing so can be substantial [e.g., measures become unreliable and/or invalid, which leads to difficulty in drawing conclusions from findings (Hinkin, 1995)]. As evidenced from our JCP review, scales developed in prior research are often modified (39 of the 66 deployments, with 17 of these being for previously validated scales and 22 being for previously improvised scales). Moreover, we observe that both validated and improvised measures are modified in three primary ways (that are not mutually exclusive): (1) wording, (2) length, and (3) dimensionality, which we discuss next, in turn. We note that other modifications, such as changes to the rating scales, could be made, but we do not focus on these issues here (we direct the reader to Weathers et al., 2005 and Weijters et al., 2010 for more information).

Wording modifications

Appropriate scale wording changes often reflect a cultural shift, an adaptation of the language for a change in understanding, or the replacement of content that has become obsolete or outdated. Additionally, researchers may modify the wording of scale items to better align with specific researcher intentions. This can be seen with our opening example of Affonso et al. (2021) and their simple modification to the wording of the improvised decision difficulty scale to customize it to the product categories used (e.g., car or vacation package).

One key reason for modification of the content of an existing scale may be to reflect a cultural shift. A cultural shift occurs when a scale developed in a particular culture is applied to another wherein the meanings and norms may differ. Wong et al. (2003) specifically explore this conundrum at a high level. They demonstrate through cross-cultural studies involving numerous countries and over 1200 participants that scales would fare better cross-culturally by being framed as questions and avoiding reverse-worded phrasings (although doing so may hide and not address the underlying issues; see Weijters & Baumgartner, 2012 and Baumgartner et al., 2018). In another example, Thompson (2007) made efforts to introduce a shortened form of PANAS that incorporated language more reflective of varying cultural backgrounds, yielding what is labeled an “International PANAS Short Form” scale. For other situations, shifts are needed merely to ensure that the measures will be understood in the native tongue (for more insight, see Harkness et al., 2010; McGorry, 2000). In these instances, the scale is typically translated into the desired language and then back-translated into the original language to verify the accuracy of item wording (e.g., Kopalle et al., 2010). For some scales, researchers have already taken the steps to create language-specific versions of the scale [e.g., the Big Five in Italian (Guido et al., 2015)].

Additional needs for scale wording modifications arise as the norms and references within a society evolve over time. Many scales that researchers wish to deploy may have phrases and/or wording that have fallen out of normal societal usage. Consequently, modifications to item wordings may enhance clarity or avoid outdated references. For instance, when developing the GREEN scale, Haws et al. (2014) utilized prior scales for validation that contained dated references to such issues as low phosphate detergent (Straughan & Roberts, 1999), with which most consumers at the time of the GREEN scale's development were unfamiliar. Therefore, they specifically state (p. 338) “Our intent was…to develop a concise scale that would not easily become outdated.” Further, the wording of scale items may be modified to better fit the research goals of the researchers. Typically, in these instances, a researcher presents measures that are at a minimum “based on” or “adapted from” an existing, validated measure. For instance, from our JCP review, numerous papers make slight changes to adapt to the study context or make scales similar in readability [e.g., Teeny et al., 2020 using “More energized” instead of the original scale term “Energetic”; Granulo et al., 2021 changing the need for uniqueness scale to ask questions about the frame (vs. lenses) of eyeglasses].

Length modifications

The most common approach identified in our JCP review for length modification was reducing the length of the scale (13 out of 16 length modifications). In most cases, the apparent goal was simply to reduce the number of items on a scale while still measuring the original construct. Academics and practitioners often feel that scales are too long or cumbersome for their desired purposes, and certainly, the length of surveys has been noted previously as a serious concern (e.g., Diamantopoulos et al., 2012; Katsikeas et al., 2006; Stanton et al., 2002).

In some cases, prior research provides carefully developed and validated shortened versions of existing scales. This can occur either sequentially or at the same time the original scale is developed (e.g., Graf et al., 2018). Examples include the Materials Value Scales (reduced from 18 to 15 and 9, with initial evidence that a 6-item version might suffice under certain circumstances, see Richins, 2004) and the Positive and Negative Affect Schedule, or PANAS (reduced from 20 to 10 items, see Thompson, 2007). Another common example is a reduced version of the Big Five factors of personality, including the Mini-IPIP scale (reduced from 50 items to 20, Donnellan et al., 2006) and further the Ten-Item Personality Inventory (TIPI; Gosling et al., 2003), which is a common choice for researchers (e.g., Spiller & Belogolova, 2017; Wessling et al., 2017). In these examples, the overall scope of the construct is maintained (i.e., the number of dimensions stays constant), while the number of items per dimension is reduced.

In many cases, however, a shorter version of a scale has not been previously validated, requiring the researchers to decide whether to validate the shortened scale in their own work. This is done with varying degrees of rigor. Some follow a path of purification through psychometrics, dropping items according to recommendations from Churchill (1979) and Gerbing and Anderson (1988). At other times, researchers drop items for conceptual reasons before testing them. Even though there may be merit in these decisions, in many cases insufficient information is provided as to why only some of the items were utilized (and why those specific ones). Within our JCP review, we note differential usage of the same scale (PANAS; Watson et al., 1988); for example, Kapoor and Tripathi (2020) use the full 20-item version whereas Reich and Pittman (2020) employ only six items. Admittedly, scale reduction is needed at times, but this can lead to questionable findings when no explicit justification is offered for seemingly arbitrary changes.

Although scale reduction appears to be much more common, lengthening of scales can also occur (3 out of 16 length modifications in our review). This can play out in one of two ways: introducing additional improvised measures whether validated or not (expansion) or using more than one scale together (amalgamation). Expansion typically involves researchers introducing a few improvised measures to an existing scale, usually in an effort to enhance and/or update it. Amalgamation of items from various scales occurs to potentially incorporate more than one existing measure or to broaden the scope of what is being studied. In an example from our recent JCP review, Wongkitrungrueng et al. (2020) report modifying and combining measures of trustworthiness from two different sources.

Dimensionality modifications

Finally, researchers adapt a scale to a context by utilizing only some of the original subdimensions. While this does alter the overall length of a scale, the theoretical implications are potentially more substantial than when only engaging in reduction (i.e., dropping one or two items from a unidimensional scale). One common manner by which modifications to scale dimensionality occurs is via the use of a dimensional subset of an original scale, typically one or a few dimensions from a multidimensional measure (this was the most common form of scope change in our JCP review and occurred most frequently with validated scale usage). For instance, from our JCP review Farmer et al. (2021) use the intolerance of ambiguity dimension of Webster and Kruglanski's (1994) need for cognitive closure scale as their mediator in assessing how political ideology affects the response to ambiguity, whereas three different articles (Bryksina, 2020; Granulo et al., 2021; Li et al., 2021) only use the 11-item counter-conformity subscale of the need for uniqueness scale (Tian et al., 2001). In these cases, the authors argue reasonably that a particular subdimension was most relevant to their conceptualization, resulting in no assessment of the other dimensions. However, in other situations, where a scale's dimensions were developed to be utilized collectively, reducing dimensionality risks unraveling the theoretical construct.

“Modified” scale usage summary

Overall, scale modification is quite common and takes many different forms within the consumer behavior literature. Yet, care should be taken to ensure that when such modifications are used, researchers avoid unintentionally undermining rather than advancing theory. Improvised changes, especially within enduring research, risk an acceleration of drift between the constructs and measures that are part of the holistic construal (Figure 1). In an extreme case, this leads to a complete disconnect between the conceptual and empirical elements. Whereas improvised scale deployment and modifications have become a common part of theoretically-driven consumer research, we believe that more rigor is needed within the deployment of these instruments and the manner in which their usage is reported. So, what should researchers do, particularly when no well-established scale fits the current research conceptually and/or logistically? In the next section, we offer some guidance into what we believe are best practices in the deployment of scale measures.

BEST PRACTICES FOR SCALE DEPLOYMENT IN CONSUMER BEHAVIOR RESEARCH

Whereas best practices in scale deployment call for using validated scales as is, over time and across contexts the various elements in the Holistic Construal framework (Figure 1) are likely to evolve. The core theory can change as antecedents and outcomes are added (e.g., new mediators), as moderators are uncovered, as theoretical frameworks are extended (e.g., multiple mediators, sequential mediators), and as new theoretical contexts (e.g., sustainability; robot and artificial intelligence applications) become more important. Similarly, the measures aligned with the core constructs can become dated, as new methods are developed, existing measures are found wanting, and so forth. Importantly, these evolutionary shifts in theory and measures do not necessarily occur in lockstep with one another. At times, these shifts may necessitate a new scale development process, whereas at other times, modifications to existing measures (accompanied by validation procedures we discuss below) may suffice.

In our JCP review, only a few of the modifications we identified were accompanied by any reported psychometric analysis. In Hulland et al. (2022), an extensive review of scale development processes in eight top marketing journals (International Journal of Research in Marketing, Journal of the Academy of Marketing Science, Journal of Consumer Psychology, Journal of Consumer Research, Journal of Marketing, Journal of Marketing Research, Journal of Service Research, and Marketing Science) for articles published between 2000 and 2021 found 262 articles explicitly stating some type of discriminant validity and/or item reliability assessment when making modifications to an existing scale (see Table A2 for a full list of these articles). Both this more extensive review and our JCP review revealed that the level of detail provided about how modifications were carried out varies considerably across articles. Consequently, there are several areas for improvement.

In this section, we provide a pragmatic approach to balance the competing demands of consistency and generalizability against those of context specificity and convenience. As we do not want to unnecessarily burden researchers with stringent requirements, we attempt to be clear, concise, and reasonable in our recommendations. We also stress that these are general recommendations intended to guide researchers towards enhanced clarity in both their thinking and reporting when using prior scales. We proceed by systematically addressing four primary considerations for scale deployment: (1) scale selection and fit assessment, (2) scale modification, (3) scale validation, and (4) scale reporting. We provide a decision tree in Figure 2 that covers the various decisions and steps needed to address scale deployment. This process begins with identifying potential scale(s) and evaluating whether it fits the context appropriately as is or may do so with minor modifications. This is followed by validation procedures, and then reporting precisely how the scale was deployed. The remaining tables supplement these actions with additional procedural details. Table 3 focuses on the scale selection and fit assessment process (which applies to all scale usages), Table 4 provides recommendations for various types of scale modifications, Table 5 summarizes recommended validation procedures, and Table 6 provides recommendations for reporting one's usage of scales.

FIGURE 2. Scale deployment decision tree
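As a rough illustration of the decision flow in Figure 2 and Tables 3–6 (our own sketch; the class and function names below are hypothetical placeholders, not tools from the article), the logic can be summarized as follows.

```python
# Hypothetical sketch of the deployment decision logic; all names are placeholders.
from dataclasses import dataclass

@dataclass
class Scale:
    name: str
    previously_validated: bool   # vetted per scale-development guidelines when introduced?
    needs_modification: bool     # outcome of the Table 3 fit assessment

def deployment_steps(scale):
    """Return the ordered steps implied by the decision tree, per our reading of Figure 2."""
    steps = ["select scale and assess fit (Table 3)"]
    if scale.needs_modification:
        steps.append("modify wording/length/dimensionality (Table 4)")
        steps.append("validate: face validity, reliability, convergent/discriminant (Table 5)")
    elif not scale.previously_validated:
        steps.append("validate the improvised scale even when used 'as is' (Table 5)")
    steps.append("report deployment details (Table 6)")
    return steps

print(deployment_steps(Scale("decision difficulty", previously_validated=False, needs_modification=True)))
```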
TABLE 3. Recommended scale selection and fit assessment

(1) Specify Construct: Clearly define the construct to be measured, the domain, and the intended type of measurement.

(2) Identify Instrument: Conduct a literature review to find the most closely aligned scale(s) to your construct and domain. Review sources to determine whether the scale was originally validated or improvised.

(3) Assess Alignment:
  • Highly aligned: If deployment is highly consistent with prior use, appropriate fit may be assumed.
  • Uncertain or less closely aligned: If deployment is inconsistent with prior use or researchers are uncertain of alignment, conduct an informal assessment: solicit open-ended feedback from 2 to 3 experts, providing them with the construct definition, domain, and all items from the proposed scale to be used, asking if the proposed scale is appropriate for intended measurement and if modifications are needed.

If modifications are desired, refer to Table 4 for modification guidelines, followed by Table 5 for validation guidelines.

TABLE 4. Recommended scale modification guidelines

Wording
  • It is acceptable to make minimal changes to reflect a different context or to avoid dated or polarizing language.
  • If the culture or language in which the scale will be utilized differs from the one in which it was developed, such shifts are acceptable if appropriately modified (see Harkness et al., 2010, for these guidelines).
  • Arbitrary changes are typically unacceptable.
  • Any wording shifts should be thoroughly explained and justified, as these types of shifts have enduring effects on scales and future use.

Length
  • If changes to length are desired, conduct an initial pretest of at least n = 50. CFA can be utilized to find potential items to be removed before any validation procedures as indicated in Table 5.
  • When reducing length, maintain a minimum of 3 items per dimension (4–5 is preferred).
  • Scales should use the full set of original items. Adding new improvised items to an existing scale should be avoided, as such actions require more rigorous scale validation and/or scale development procedures.

Dimensions
  • Ensure that your construct definition, domain, and intended measurement fully justify the removal/addition of one or more dimensions, and provide this rationale in your research.
  • Best practice entails collecting all subdimensions in at least one study and reporting findings comparing the full set of dimensions to the focal subdimensions in the supplemental materials.

Multiple Modifications
  • Multiple modifications can quickly erode the theoretical foundations of scales. If absolutely necessary and having only transitory consequences within the current context, multiple modifications can be made, supported with additional validation procedures as outlined in Table 5. Otherwise, avoid multiple modifications.
TABLE 5. Recommended scale validation procedures^a

(1) Face Validity

Do the scale items align with intended deployment?

Conduct Formal Fit Assessment:
  • Provide 2–3 experts (e.g., trained academics, field experts) or, if appropriate, a panel of at least n = 50 respondents with your construct definition and domain to evaluate scale items. Randomly present all items from the proposed scale of deployment on a seven-point scale ranging from “very bad fit” (−3) to “very good fit” (3). Acceptable fit for items occurs when means are greater than 0.
  • Revisit modification guidelines in Table 4 for poor face validity results. Follow this with a reassessment of face validity.

(2) Internal Reliability

Do the scale items hold together?

Internal Reliability Pretest (at least n = 50):
  • Examine the internal consistency of the entire measure with Cronbach's Alpha (or potentially Omega [Hayes & Coutts, 2020]). Results should be 0.70 or greater.
  • Conduct Confirmatory Factor Analysis (CFA). Item loadings should be greater than 0.70.
  • Revisit modification guidelines in Table 4 for poor internal reliability. Follow this with a reassessment of internal reliability.

(3A) Convergent Validity

Does the scale measure the same construct as a validated scale?

Convergent Validity Assessment (at least n = 50)
  • Typically conducted in a separate study than internal reliability.
  • Run a within-subjects study where participants respond to both your scale and an established, validated scale (i.e., typically the original, validated scale) that measures the same construct. Correlations between the two scales should be significant and high (r = 0.70 or higher; see the sketch following this table).
  • More than one validated scale may be needed depending upon the dimensionality of the intended scale of deployment.
  • May not be necessary if only minor modifications to a validated scale have occurred.
  • May not be possible with improvised scales. If not, discriminant validity should be assessed with at least 2 related, validated scales.
  • Revisit modification guidelines in Table 4 for poor convergence. Follow this with a reassessment of convergent validity or choose the alternative scale, returning to Table 3.

(3B) Discriminant Validity

Does the scale measure a distinct construct from a validated, related, yet different, scale?

Discriminant Validity Assessment (at least n = 50)
  • Can be simultaneously run with convergent validity assessment.
  • Run a within-subjects study where participants respond to both your scale and an established, validated scale that measures the related, yet distinct construct. Correlations between the two scales should be low (r below 0.70).
  • More than one validated scale may be needed depending upon the dimensionality of the intended scale of deployment.
  • This assessment may not be necessary if only minor modifications to a validated scale have occurred.
  • Revisit modification guidelines in Table 4 for poor discrimination. Follow this with a reassessment of discriminant validity or choose the alternative scale, returning to Table 3.
^a In addition to modified scales, we recommend these procedures for all “as is” usages of improvised scales that have not been subjected to a form of this validation previously.
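A minimal sketch of the convergent (3A) and discriminant (3B) checks above, assuming composite scores have already been computed per respondent for the deployed scale, a validated scale of the same construct, and a validated but distinct scale; the data here are simulated placeholders rather than results from the article.

```python
# Illustrative convergent/discriminant correlation check with simulated composite scores.
import numpy as np

rng = np.random.default_rng(5)
construct = rng.normal(size=200)                                        # hypothetical latent construct
deployed = construct + rng.normal(scale=0.5, size=200)                  # deployed (modified/improvised) scale
same_construct = construct + rng.normal(scale=0.5, size=200)            # validated scale, same construct
related_construct = 0.4 * construct + rng.normal(scale=1.0, size=200)   # validated scale, distinct construct

r_convergent = np.corrcoef(deployed, same_construct)[0, 1]
r_discriminant = np.corrcoef(deployed, related_construct)[0, 1]
print(f"convergent r = {r_convergent:.2f} (target: 0.70 or higher)")
print(f"discriminant r = {r_discriminant:.2f} (target: below 0.70)")
```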
TABLE 6. Recommended reporting of scale usage

Scale Content

"As Is" Scale Deployment (validated and improvised scales):
  1. Provide the name, citation, and number of items of the as-is scale in the manuscript.
  2. Report the scale items as utilized in an appendix or supplemental online material.
  3. Note the anchors used, reporting the measures in the order assessed or noting that the order was randomized.

Modified Scale Deployment (validated and improvised scales):
  1. Provide the name, citation, and number of items of the modified scale in the manuscript.
  2. Note the exact modifications to the scale and the reasoning behind these modifications in the body of the manuscript.
  3. Report the scale as fully utilized in an appendix, ideally including additional explanation and procedures for modifications.

Scale Validity and Reliability

"As Is" Scale Deployment:
  Validated scales:
    1. Provide reliability results (alpha or omega) in the manuscript.
  Improvised scales:
    1. Provide face validity results in an appendix or supplemental online material.
    2. Provide item loadings and either alpha or omega in the manuscript.
    3. Provide convergent and discriminant validity results and details in the manuscript.

Modified Scale Deployment:
  Validated scales:
    1. Provide face validity results in an appendix or supplemental online material.
    2. Provide item loadings and either alpha or omega in the manuscript.
    3. Provide convergent and discriminant validity results in an appendix or supplemental online material.
  Improvised scales:
    • Report in line with modified, validated scales (1, 2, and 3 immediately above) but report convergent and discriminant validity results in the manuscript.

Scale selection and fit assessment

The first hurdle researchers encounter is scale selection and assessment of scale fit with the current research context. In our review, we often found articles citing prior research when utilizing scales but also citing prior research when measuring very straightforward constructs that are less theoretically nuanced (e.g., purchase intentions, general attitudes, “response” measures in general). Because not every construct requiring assessment needs a multi-item scale or a supporting citation to be measured, citations for these straightforward measures are not strictly necessary. However, for more theoretically nuanced constructs that require a scale, we outline procedures for assessing the fit of the scale to the research context below, summarizing key steps in Table 3.

First, researchers must have a clear definition of the construct intended to be measured and the domain within which it is to be utilized. This allows for a more direct search (the second step in scale selection) within databases or sources of validated scales (e.g., Google Scholar; Handbook of Marketing Scales; Bearden et al., 2010) for relevant literature to appropriately match the intended construct with a scale. If one or more scales are available, researchers should examine the prior validation of the scale, favoring those that were previously validated over those previously presented in an improvised manner.

The final step in scale selection involves ensuring that there is a strong connection for the unit/level of analysis between the construct and measures (e.g., utilizing the idiocentric-allocentric psychological scale over Hofstede's scale when looking at individual consumer traits). Using their expertise, researchers should determine which scale provides the strongest fit, by matching the construct definition with the scale, by selecting a scale at the appropriate level of analysis, and by reviewing how the scale has been deployed in prior literature. If more than one scale remains and these scales perform similarly in prior research, it may be prudent to collect all measures for each of the alternative scales being considered for use with the underlying construct in a pretest, and then compare the empirical results (as sketched below). However, it is also acceptable for the researcher to base this decision on length, choosing the shorter scale.
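A minimal sketch of such a pretest comparison, assuming simulated responses to two hypothetical candidate scales (A and B) measuring the same construct; the decision rule in the comments simply restates the guidance above.

```python
# Illustrative pretest comparison of two candidate scales (simulated data).
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(7)
trait = rng.normal(size=(80, 1))                        # hypothetical pretest, n = 80
scale_a = trait + rng.normal(scale=0.7, size=(80, 6))   # 6-item candidate scale A
scale_b = trait + rng.normal(scale=1.0, size=(80, 4))   # 4-item candidate scale B

print(f"alpha A = {cronbach_alpha(scale_a):.2f}, alpha B = {cronbach_alpha(scale_b):.2f}")
r_ab = np.corrcoef(scale_a.mean(axis=1), scale_b.mean(axis=1))[0, 1]
print(f"correlation between composites = {r_ab:.2f}")
# All else equal, favor the scale with stronger reliability, or the shorter scale when
# reliabilities are comparable (consistent with the length consideration discussed earlier).
```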

If historical scale usage is less closely aligned or the researchers are uncertain, it is at this point that we recommend having 2 or 3 experts (i.e., trained academics and professionals within the domain of interest) provide open-ended feedback via an informal fit assessment. We suggest asking if the proposed scale(s) and the associated items are suitable for the researchers' construct definition, the domain, and the intended measurement. Depending upon this feedback, one can either select a different scale, use the scale as is, or choose to modify the scale. If choosing to modify the scale, we direct researchers to the next section along with Table 4. If using a previously improvised scale “as is,” we direct researchers to the validation section along with Table 5. If using a scale as is that was previously validated, a researcher can proceed with deployment, reporting scale usage in the manuscript per Table 6.

Scale modification

We encourage researchers to utilize established, validated scales “as is” when feasible, as this facilitates the accumulation (and generalization) of empirical findings over time (Bagozzi, 1984; Hinkin, 1995; Netemeyer et al., 2003). Yet this is not always practical, and a previous scale might not pass the informal assessment of fit described above despite being the most closely related measure to a construct. In such cases, modifications become necessary. All modifications should be clearly acknowledged and briefly justified by the authors. We highlight specific considerations for each type of modification (wording, length, and dimensionality) in Table 4. Whereas we advocate for strong validation procedures for any form of modification (described more fully in the next section and summarized in Table 5), when engaging in multiple, significant modifications authors are effectively creating a new scale and should follow scale development guidelines (e.g., Churchill, 1979; Hulland et al., 2022; Netemeyer et al., 2003).

Wording modifications

Minor wording changes are likely to occur and even be essential, and it is quite possible that the fit assessment discussed previously would help identify items that might need wording changes. Simple wording changes may be needed to appropriately relate the scale to the current context (e.g., Affonso et al.'s (2021) modification of the decision difficulty scale to reflect the specific product categories used in their studies). Although this example is perhaps of the simplest form, similar changes may be necessary when adapting an existing measure from a related field outside of marketing. In general, minor changes made to fit a new context are acceptable and require minimal additional validation procedures. If more substantial changes to the wording of items are needed (e.g., shifts are needed to avoid dated language), empirical evidence for validity (see Table 5) should be presented. Researchers desiring to adapt a scale to a different language should refer to the explicit language guidelines presented in Harkness et al. (2010).

Length modifications

As mentioned previously, most modifications to length involve reducing the number of items, which can be motivated both by a resource-based, pragmatic need to shorten one's study instrument and/or to avoid the use of historically poor performing items. In both cases, the expert judgment suggested for the initial fit assessment in Table 3 can provide insights regarding what items to eliminate. Stanton et al. (2002) address this issue extensively, suggesting numerous steps to undertake when shortening a scale (see Table A3).

For shortening length, we recommend conducting a pretest involving at least 50 respondents (before validation procedures; we suggest that more would be better, especially in the case of higher-order constructs), assessing internal reliability via loadings from a Confirmatory Factor Analysis (or an Exploratory Factor Analysis for unidimensional scales) and/or coefficient alpha (or the less common omega; Cho, 2016; Hayes & Coutts, 2020) to find potential items for exclusion. When values fall below 0.70, researchers should seriously consider removing these items (e.g., see Hulland et al., 2018).
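As an illustrative sketch of this screening step (using corrected item-total correlations as a simple stand-in for the CFA/EFA loading diagnostics mentioned above; the data frame and threshold below are hypothetical):

```python
# Illustrative item screening on a simulated pretest; flags weak items before formal validation.
import numpy as np
import pandas as pd

def item_screen(df, threshold=0.70):
    """Corrected item-total correlation for each item (item vs. sum of the remaining items)."""
    rows = []
    for col in df.columns:
        rest_total = df.drop(columns=col).sum(axis=1)
        r_it = df[col].corr(rest_total)
        rows.append({"item": col, "item_total_r": round(r_it, 2), "flag_for_review": r_it < threshold})
    return pd.DataFrame(rows)

# Hypothetical pretest: 60 respondents, four coherent items plus one unrelated (weak) item.
rng = np.random.default_rng(3)
latent = rng.normal(size=60)
df = pd.DataFrame({f"item{i}": latent + rng.normal(scale=0.6, size=60) for i in range(1, 5)})
df["item5"] = rng.normal(size=60)
print(item_screen(df))
```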

Another salient issue is the minimum number of items to retain. There is a general desire to use single-item measures (e.g., Fei et al., 2020; Isaac et al., 2021), as doing so minimizes the costs of survey administration. However, Diamantopoulos et al. (2012) demonstrate that whereas single-item measures may suffice for transitory variables, they are often insufficient for the reliable measurement of enduring constructs. Ideally, a shortened scale should retain at least four to five items to avoid under-identification problems (Böckenholt & Lehmann, 2015; Netemeyer et al., 2003). Regardless of the extent of the length modification, we emphasize the need for validation following the procedures outlined in Table 5.

Dimensionality modifications

It is possible that the scale search process identifies scales for which only a part of a multidimensional construct is relevant. This narrower fit is likely to be noted by the authors or by the experts/respondents in a fit assessment, per Table 3. Although the narrower fit may well hold conceptually, a key concern is that the omitted dimensions might still matter if measured, given their pre-existing conceptual relationship within the hierarchically structured original measure. Accordingly, unless the researchers who developed the initial scale specifically note that the dimensions can be used separately, we suggest assessing the relationship between the entire scale and the focal construct of interest in one of the primary studies in the paper, and then proceeding with the relevant dimension(s) in subsequent studies (although a separate pretest could suffice).
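
As one illustration of this recommendation, the sketch below (Python; the dimension labels, item names, and outcome variable are hypothetical) compares how each dimension score and the full-scale score relate to a focal outcome in a primary study, so the retained dimension's behavior can be checked against the entire scale before it is used alone.

```python
import pandas as pd


def dimension_check(df: pd.DataFrame, dims: dict, outcome: str) -> pd.DataFrame:
    """Correlate each dimension score and the full-scale score with a focal outcome,
    allowing the retained dimension(s) to be compared against the entire scale."""
    scores = pd.DataFrame({name: df[cols].mean(axis=1) for name, cols in dims.items()})
    all_items = [c for cols in dims.values() for c in cols]
    scores["full_scale"] = df[all_items].mean(axis=1)
    return scores.corrwith(df[outcome]).to_frame("r_with_outcome")


# Hypothetical two-dimensional scale of which only "dim_a" appears conceptually relevant:
dims = {"dim_a": ["a1", "a2", "a3"], "dim_b": ["b1", "b2", "b3"]}
# report = dimension_check(study1_data, dims, outcome="purchase_intent")
```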

Scale validation

We have mentioned the need for scale validation in the previous section while discussing different forms of scale modification because most scale modifications require some level of validation. We also suggested that improvised scales should undergo validation procedures even when deployed “as is,” unless previously validated. We provide a discussion of the primary validation steps in the following subsections, with a summary presented in Table 5.

Face validity

The first step in validating an improvised and/or modified scale involves ensuring that the items remain relevant for the current application. For this, we suggest a formal fit assessment with at least 50 panel respondents or two or three experts. Practically, this can be done either online or offline by providing the raters with the construct definition and then presenting them with each scale item (in random order) to rate its fit with the construct on a seven-point scale ranging from "very bad fit" (−3) to "very good fit" (+3). If all item means are above zero, the items can be utilized as presented and one can move to the assessment of internal reliability. However, if any means fall below zero, the researchers should return to the scale modification procedures laid out in the previous section, utilizing the item-level insights gained from this fit assessment. Alternatively, they may find another scale or undertake new scale development.
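
A minimal sketch of how the resulting ratings might be summarized is shown below (Python, assuming the fit ratings are stored with raters as rows and items as columns; all names are hypothetical). Items whose mean fit is above zero are flagged as acceptable, consistent with the decision rule described above.

```python
import pandas as pd


def face_validity_report(ratings: pd.DataFrame) -> pd.DataFrame:
    """Item-level means of the -3 to +3 fit ratings. Items whose mean fit is above zero
    can be used as presented; the remaining items should be modified, replaced, or dropped."""
    report = ratings.mean(axis=0).to_frame("mean_fit")
    report["acceptable_fit"] = report["mean_fit"] > 0
    return report.sort_values("mean_fit")


# Hypothetical usage: rows are raters (panel respondents or experts), columns are items,
# cells are ratings from -3 ("very bad fit") to +3 ("very good fit").
# print(face_validity_report(rater_data))
```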

Internal reliability

After confirming satisfactory face validity, the next step is to ensure that the scale items cohere as a single measurement instrument. This can be done via an internal reliability pretest. Researchers should present at least 50 respondents with all scale items for the construct of interest, in random order (or in the original order if the scale was designed to be administered in a fixed sequence). In analyzing the data, Confirmatory Factor Analysis (CFA) should be conducted, and either Cronbach's alpha (for most applications) or McDonald's omega (primarily for multidimensional constructs; Cho, 2016; Hayes & Coutts, 2020; McDonald, 1999) should be checked. CFA loadings should generally be 0.70 or higher, and alpha (or omega) should also be above 0.70. If the results are not satisfactory, modification procedures and/or the use of an alternative scale may be required.
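
As a rough illustration, the sketch below (Python) estimates single-factor loadings with the factor_analyzer package as an exploratory stand-in for the CFA described above (a full CFA would typically be run in dedicated SEM software), along with coefficient alpha and an omega estimate derived from those loadings. The package choice, the use of 1 − λ² as an approximation of each item's uniqueness, and the data layout are all assumptions of this sketch.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # assumes the factor_analyzer package is installed


def reliability_pretest(items: pd.DataFrame) -> dict:
    """Single-factor loadings, coefficient alpha, and an omega estimate, with items
    whose loadings fall below the 0.70 guideline flagged for attention."""
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(items)
    loadings = pd.Series(fa.loadings_.ravel(), index=items.columns)

    k = items.shape[1]
    alpha = (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
    # Omega total, approximating each item's uniqueness as 1 - loading**2.
    omega = loadings.sum() ** 2 / (loadings.sum() ** 2 + (1 - loadings ** 2).sum())

    return {
        "loadings": loadings.round(3),
        "items_below_.70": loadings[loadings < 0.70].index.tolist(),
        "alpha": round(alpha, 3),
        "omega": round(omega, 3),
    }


# Hypothetical usage with the 50-respondent pretest data described in the text:
# print(reliability_pretest(pretest_items))
```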

Convergent validity

The third step of scale validation ensures that the proposed scale of deployment measures the same construct as an alternative, established measure; that is, researchers should assess convergent validity. For example, administering a modified version of the short form of the PANAS proposed by Thompson (2007) alongside the original PANAS scale would be an attempt to assess convergent validity. To assess convergent validity, at least 50 respondents should be presented with all items from the proposed scale of deployment in addition to an already established measure of the focal construct. Correlations should be significant and high (i.e., r above 0.70). For many scales, this means including the original, validated scale. However, as improvised scales may not have a validated counterpart, researchers should instead perform a discriminant validity assessment (see below) between at least two related, validated scales.
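
A minimal sketch of this check appears below (Python; the column names are hypothetical, and the 0.70 threshold follows the guideline above). It simply correlates composite scores from the proposed scale and the established measure collected from the same respondents.

```python
import pandas as pd
from scipy.stats import pearsonr


def convergent_check(df: pd.DataFrame, new_items: list, established_items: list) -> dict:
    """Correlate composite (mean) scores of the proposed scale and an established measure
    of the same construct; a significant r above 0.70 supports convergent validity."""
    r, p = pearsonr(df[new_items].mean(axis=1), df[established_items].mean(axis=1))
    return {"r": round(r, 3), "p": round(p, 4), "convergent": (r > 0.70) and (p < 0.05)}


# Hypothetical usage with a 50+ respondent pretest containing both sets of items:
# print(convergent_check(pretest, ["new_1", "new_2", "new_3"], ["orig_1", "orig_2", "orig_3"]))
```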

Whereas it may be possible to determine convergent validity simultaneously with internal consistency, completing these assessments sequentially is more common. In cases where only minor changes are being made to a previously validated scale, this convergent validity step may not be necessary. In contrast, substantial modifications call for the inclusion of this step, and in such cases, when results are not satisfactory (i.e., correlations below 0.70), further modifications and/or use of an alternative scale may be necessary.

Discriminant validity

The final step in scale validation provides evidence that the proposed scale of deployment is distinct from established measures of a different, yet related, construct (i.e., their correlation is below 0.70). This is an assessment of discriminant validity. For example, Luchs et al. (2021) showed that their proposed measure of consumer wisdom was distinct from a common measure of general wisdom (r = 0.33). Discriminant validity can be assessed simultaneously with convergent validity, meaning that the same (at least 50) respondents answer both the proposed scale of deployment and the established scale measuring a different (but theoretically related) construct. As with convergent validity, a discriminant validity assessment is probably unnecessary for slight modifications to a validated scale. For all other scales of deployment, failure to show adequate discriminant validity between constructs (i.e., correlations above 0.70) suggests that the proposed scale and the distinct scale are likely confounded, indicating the need for additional modification and/or use of an alternative scale.
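
The corresponding discriminant check is sketched below (Python; the variable names are hypothetical). It mirrors the convergent check but requires the correlation with the related-but-distinct construct's measure to fall below 0.70.

```python
import pandas as pd


def discriminant_check(proposed_score: pd.Series, distinct_score: pd.Series) -> dict:
    """Correlate composite scores of the proposed scale and an established measure of a
    different but related construct; correlations below 0.70 support discriminant validity."""
    r = proposed_score.corr(distinct_score)
    return {"r": round(r, 3), "discriminant": abs(r) < 0.70}


# Hypothetical usage, reusing the same 50+ respondents:
# print(discriminant_check(pretest[new_items].mean(axis=1), pretest[distinct_items].mean(axis=1)))
```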

Nomological validity

Scale developers often mention nomological validity as an additional facet of construct validation. Cronbach and Meehl (1955) refer to this as the extent of “lawful” fit between the proposed scale measure for a focal construct and a network of related constructs. The evaluation of nomological validity “involves investigating both the theoretical relationships between different constructs and the empirical relationships between measures of those constructs” (Netemeyer et al., 2003, p. 82). If the proposed scale is sound, it will exhibit strong empirical connections with these other measures as predicted by theory. Whereas this is a critical step in the process of scale development (e.g., Hulland et al., 2022; Netemeyer et al., 2003), for theory-testing papers—our focus in this paper—it is an unnecessary additional step.

Scale reporting

In our review, the deployment of scales was too often accompanied by vague language. Consequently, we provide recommendations regarding the reporting of scale usage (Table 6). Overall, regardless of the exact type of scale (validated or improvised) or the manner of deployment (as is or modified), researchers need to communicate what they have done and be clear about why they made any modifications or utilized an improvised scale. When using a previously modified scale, the original source of the measure should also be cited. Authors also need to report the relative effectiveness of different scales to provide guidance to other researchers. Although we acknowledge that the same time and space constraints that lead researchers to shorten measures when collecting data also limit the extent of reporting on these measures in research articles, nearly all journals now allow additional reporting through online appendices or other supporting material. Thus, there is ample opportunity to carefully report on scale selection, usage, modification, and validation. To aid researchers in this process, we provide guidance in Table 6 regarding what information to report and where to report it (in the manuscript or in online materials). This information is useful for clarity in the review process, for readers of the work, and for future researchers utilizing the same measures.

We suggest that crucial information regarding measures be conveyed in the manuscript or in a web appendix. This information includes a clear construct definition, the citation for the scale chosen to measure the construct, the reasoning behind this choice, and whether any modifications were made. If no modifications were made, this should be stated explicitly in the first study in which the scale is used (e.g., used "as is"); if modifications were made, they should be explicitly described there. Further, face validity assessments and the other validity assessments aligned with Table 5 should be noted. The validity assessment results, how the scale was presented to participants, the exact wording of items, and randomization procedures can be detailed in the manuscript or in online supplementary materials, depending on the selected scale and method of deployment. We again provide specific recommendations for this in Table 6.

Further considerations

As theories evolve and measures change, the correspondence between measures and constructs in the Holistic Construal (Figure 1) needs to be revisited from time to time (e.g., Smith & McCarthy, 1995; Stanton et al., 2002). At other times, researchers cannot find a measure that is appropriate for their well-defined construct, or the available measure may impose excessive time burdens on a study, increasing participant fatigue. In both situations, where existing scales or slightly modified versions thereof cannot be used, we recommend one of two steps: (1) re-evaluation of an existing scale or (2) development of a new scale. Re-evaluation focuses on empirically confirming (or modifying) existing measures via a full range of reliability and validity assessments. If re-evaluation still does not yield the needed measure for a well-defined construct, the researcher should engage in scale development. A detailed discussion of both re-evaluation and scale development is beyond the scope of this paper; we recommend that researchers follow best practices for scale development described elsewhere (Churchill, 1979; Hulland et al., 2022; Netemeyer et al., 2003). Finally, at times, research is more exploratory than theory-testing in nature, and using modified (and often shorter) versions of longer validated scales is likely sufficient to illuminate potential relationships of interest for further study (e.g., Alba's, 2012, "bumbling").

Best practices summary

We have attempted to provide researchers with a clear path toward bolstering theory and repairing the fraying connection between constructs and measures when deploying scales in their research. For scale selection, researchers must clearly define constructs and take appropriate steps to find the most closely related scale. When deploying an improvised scale or modifying a scale, perhaps a necessary evil in consumer behavior research, researchers need to be more rigorous. Consequently, we have suggested deployment steps that are rarely utilized within marketing or consumer behavior research. However, mindful of researcher constraints, we have erred on the side of less burdensome requirements for validity assessments. Ultimately, the most important takeaway from our suggestions may in fact be more thorough and careful reporting of scale usage. As evidenced in our JCP review, it was often difficult to understand how researchers deployed scales. We therefore ask that researchers be very clear about how and why a scale was used and how it performed within their studies. Again, please refer to Tables 3, 4, 5, and 6 for our general recommendations regarding the entirety of scale deployment.

CONCLUSION

Many researchers agree that the use of validated scales makes it easier to compare findings across studies while also facilitating the "development and testing of theory" (Hinkin, 1995, p. 983). The use of enduring validated measures enhances the communication of results between researchers and promotes the generalizability of findings (Netemeyer et al., 2003; Sharma & Weathers, 2003). Given similar conditions, such measures make it possible for researchers to replicate findings, particularly because the measures themselves are relatively stable over time.

Amidst a time in which social science has been plagued by what some have deemed a "replication crisis" (Loken & Gelman, 2017; Shrout & Rodgers, 2018), we address critical issues related to the use of measures in research. Specifically, as illustrated in our review of four recent issues of the Journal of Consumer Psychology, researchers often utilize measurement scales from prior research that have undergone unvalidated modifications. We do note that the recent increase in the use of web appendices and other supplementary information about research procedures has made information about the precise measures used more accessible (although we did at times have difficulty accessing this information). Yet there are no clearly accepted standards for how to report such modifications. We advocate that, in addition to providing the exact measures used, authors also identify the nature of their modifications, that is, the wording, length, or dimensionality changes we have discussed herein. As the precision of measurement increases, we not only gain greater insight into phenomena of interest, but we also enable other researchers to build upon previous findings. Hence, deliberate, precise measures are essential to solid research, and this is where the usage of scales has an important influence.

We also note that current experimental research tends to favor parsimony, which enhances the overall usability of measurement scales, over rigor. These choices can introduce inconsistencies in findings across studies (i.e., they represent a methodological confound). We acknowledge that there are trade-offs between comprehensiveness and parsimony, and that there is not a scale for every construct. However, despite being trained in how to use scales effectively, researchers pressed for time and burdened by publication pressures often take shortcuts, resulting in potentially invalid and misleading research. If a study finds a surprising or expected null effect, is that because of a boundary condition or because of a poor measure? In another likely case, researchers might avoid measuring constructs of high relevance to their current research purposes in order to sidestep the potential pitfalls of haphazardly including modified scale measures. In that case, we inherently limit our ability to integrate and grow our scientific insights. Such trade-offs ultimately represent Type I versus Type II errors at a broader level.

Our general advice is to use carefully validated measures of constructs where possible and to apply a more critical eye to measurement rigor when modifications are necessary. To this end, we propose a set of best practices. In making these suggestions, we recognize that researchers must make trade-offs among many study design considerations. The guidelines presented herein represent what we believe to be prudent yet practical advice for researchers wishing to ensure measurement integrity and to add to a growing body of knowledge in a particular domain. Although our review draws primarily from consumer behavior research, the measurement issues we address also apply to managerial contexts and beyond. We also highlight the need for more nuanced insight into these issues when using higher-order constructs, a topic that deserves future attention. Further, multi-sample research, including cross-cultural measurement equivalence testing, is an important context in which our recommendations for scale deployment warrant further consideration.

In closing, we have attempted to provide recommended guidelines consistent with the various types of modification in use, but we conclude with a reminder that our findings, from the most interesting and counterintuitive to the most straightforward, will always be limited in their generalizability and robustness by the strength of the measures we use to assess them.