Topic 1

Fundamentals of Measurement

Measurement is the process of assigning numbers or labels to objects, events, or characteristics according to specific rules. In research, we measure variables to test hypotheses and answer research questions. Understanding measurement principles is essential for collecting meaningful data and drawing valid conclusions.

What is Measurement?

Measurement is the systematic assignment of values to represent properties of objects, events, or people according to a set of rules.

Key Components of Measurement:

Concept

The abstract idea you want to measure

Examples: Intelligence, satisfaction, anxiety, motivation

Construct

A concept that has been given a precise theoretical definition

Example: "Self-efficacy" defined as belief in one's ability to succeed

Operationalization

The process of defining how a construct will be measured

Example: Measuring self-efficacy using Bandura's 10-item scale

Variable

The measurable representation of a construct

Example: Self-efficacy score ranging from 10-40

Indicator

Observable evidence of the construct (individual items)

Example: "I can solve difficult problems if I try hard enough"

The Operationalization Process

1. Abstract Concept: Start with the general idea
   Example: "Job Satisfaction"

2. Conceptual Definition: Define what you mean theoretically
   Example: "A positive emotional state resulting from the appraisal of one's job"

3. Dimensions: Identify components of the concept
   Example: Pay, promotion, supervision, coworkers, work itself

4. Indicators: Create observable/measurable items
   Example: "I am satisfied with my current salary" (1-5 scale)

5. Measurement: Collect and score the data
   Example: Total score = sum of all items

Complete Operationalization Example

Concept: Academic Stress

Conceptual Definition: The physical and psychological tension experienced by students resulting from academic demands that exceed their perceived ability to cope.

Dimensions:

  • Workload stress (amount of work)
  • Exam stress (testing anxiety)
  • Performance pressure (expectations)
  • Time pressure (deadlines)

Indicators (sample items):

  • "I feel overwhelmed by the amount of coursework"
  • "I feel anxious before exams"
  • "I worry about not meeting my professors' expectations"
  • "I often feel rushed to meet deadlines"

Response Scale: 1 (Never) to 5 (Always)

Scoring: Sum of items; higher scores = higher stress
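As a quick sketch, the scoring rule above can be expressed in a few lines of Python; the item responses here are invented for illustration only:

```python
# Hypothetical responses to the four sample stress items (1 = Never ... 5 = Always).
# Each inner list is one student's answers to the four indicators above.
responses = [
    [4, 5, 3, 4],   # student A
    [2, 1, 2, 3],   # student B
    [5, 5, 4, 5],   # student C
]

# Score = sum of item responses; with 4 items on a 1-5 scale the range is 4-20.
scores = [sum(r) for r in responses]
print(scores)   # higher score = higher reported academic stress
```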

Types of Variables

By Role in Research

Independent Variable (IV)

The presumed cause; manipulated or measured as predictor

Example: Teaching method (traditional vs. online)

Dependent Variable (DV)

The presumed effect; the outcome you measure

Example: Student test scores

Mediating Variable

Explains HOW the IV affects the DV (mechanism)

Example: Student engagement mediates teaching method → scores

Moderating Variable

Affects WHEN or for WHOM the effect occurs

Example: Learning style moderates teaching method effect

Control Variable

Held constant or statistically controlled

Example: Prior GPA, age

Confounding Variable

Unwanted variable that affects both IV and DV

Example: Socioeconomic status affecting both study habits and scores

By Nature of Data

Categorical (Qualitative)

Categories or groups

  • Nominal: Gender, major, country
  • Ordinal: Education level, rank

Continuous (Quantitative)

Numerical values with meaningful intervals

  • Interval: Temperature, test scores
  • Ratio: Age, income, weight

Why Good Measurement Matters

The quality of your research depends on the quality of your measurements:

  • Poor measurement → Invalid conclusions
  • If you don't measure what you think you're measuring, your conclusions are meaningless
  • Statistical significance is worthless if the underlying measurement is flawed
  • "Garbage in, garbage out" applies to research

Topic 2

Levels of Measurement

The level of measurement determines what mathematical operations you can perform with your data and what statistical tests are appropriate. Understanding these levels is crucial for proper data analysis and interpretation.

The Four Levels of Measurement

1 Nominal

Lowest Level

Definition: Categories or names with no inherent order or ranking

Properties:
  • Categories are mutually exclusive
  • Categories are exhaustive
  • No ranking or order
  • Numbers are just labels
Examples:
  • Gender (male, female, non-binary)
  • Nationality (Thai, American, Japanese)
  • Blood type (A, B, AB, O)
  • Marital status (single, married, divorced)
  • Major (Psychology, Engineering, Business)
  • Jersey numbers (not for math, just identification)
Allowed Operations:

= ≠ (equal, not equal)

Cannot: < > + - × ÷

Appropriate Statistics:
  • Mode (most frequent)
  • Frequency counts
  • Percentages
  • Chi-square test
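A minimal sketch of these nominal-level operations in Python, using made-up data; the chi-square statistic is computed by hand from a hypothetical 2x2 table rather than with a library call:

```python
from collections import Counter

# Hypothetical nominal data: declared major for a small sample.
majors = ["Psychology", "Business", "Psychology", "Engineering",
          "Psychology", "Business"]

counts = Counter(majors)                 # frequency counts per category
mode = counts.most_common(1)[0][0]       # mode = most frequent category

# Chi-square for an invented 2x2 contingency table (e.g., gender x preference):
# expected[i][j] = row_total * col_total / grand_total
observed = [[30, 20], [10, 40]]
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))
print(mode, round(chi2, 2))
```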

2 Ordinal

Ranked

Definition: Categories that can be ranked or ordered, but intervals between ranks are not equal

Properties:
  • Has all properties of nominal
  • Categories can be ranked
  • Distance between ranks unknown
  • No true zero point
Examples:
  • Education level (high school, bachelor's, master's, PhD)
  • Socioeconomic status (low, medium, high)
  • Race finishing position (1st, 2nd, 3rd)
  • Pain level (none, mild, moderate, severe)
  • Likert items* (strongly disagree to strongly agree)
  • Military rank (private, corporal, sergeant)

*Likert scales (multiple items summed) are often treated as interval

Allowed Operations:

= ≠ < > (equal, not equal, greater, less)

Cannot: + - × ÷ (no meaningful addition)

Appropriate Statistics:
  • Mode, Median (not mean)
  • Percentiles, ranks
  • Spearman correlation
  • Mann-Whitney U, Kruskal-Wallis
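To make the "ranks only" idea concrete, here is a small sketch of Spearman's rank correlation on invented data, using the no-ties shortcut formula:

```python
import numpy as np

# Hypothetical ordinal data: two rankings of the same five people (no ties).
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# With no ties, Spearman's rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
# which equals Pearson's r computed on the ranks.
d = x - y
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho)   # 0.8: strong agreement between the two rankings
```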

3 Interval

Equal Intervals

Definition: Ordered categories with equal intervals between values, but no true zero point

Properties:
  • Has all properties of ordinal
  • Equal intervals between values
  • No absolute zero (zero is arbitrary)
  • Ratios are not meaningful
Examples:
  • Temperature in Celsius or Fahrenheit (0° doesn't mean "no temperature")
  • Calendar year (Year 0 is arbitrary)
  • IQ scores (0 doesn't mean no intelligence)
  • Standardized test scores (SAT, GRE)
  • Likert scales (when summed across items)
  • pH scale
Allowed Operations:

= ≠ < > + - (addition, subtraction meaningful)

Cannot: × ÷ (ratios not meaningful)

30°C - 20°C = 10°C difference ✓

30°C is NOT "twice as hot" as 15°C ✗

Appropriate Statistics:
  • Mean, Standard deviation
  • Pearson correlation
  • t-tests, ANOVA
  • Regression analysis
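A short sketch of the interval-level statistics named above, on invented test scores:

```python
import numpy as np

# Hypothetical interval data: two standardized test scores per student.
scores_a = np.array([95, 110, 105, 120, 100])
scores_b = np.array([90, 115, 100, 125, 95])

mean_a = scores_a.mean()                       # mean is meaningful at interval level
sd_a = scores_a.std(ddof=1)                    # sample standard deviation
r = np.corrcoef(scores_a, scores_b)[0, 1]      # Pearson correlation

# Differences are meaningful (120 - 95 = 25 points), but ratios are not.
print(mean_a, round(sd_a, 2), round(r, 2))
```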

4 Ratio

Highest Level

Definition: Equal intervals AND a true zero point (zero means absence of the property)

Properties:
  • Has all properties of interval
  • True zero point exists
  • Ratios are meaningful
  • All mathematical operations allowed
Examples:
  • Age (0 = birth)
  • Income (0 = no income)
  • Weight (0 = no weight)
  • Height (0 = no height)
  • Time (0 = no time)
  • Temperature in Kelvin (0 = absolute zero)
  • Number of children (0 = no children)
  • Reaction time in milliseconds
Allowed Operations:

= ≠ < > + - × ÷ (all operations)

$60,000 is twice $30,000 ✓

Someone 40 years old has lived twice as long as someone 20 ✓

Appropriate Statistics:
  • All statistics available
  • Geometric mean
  • Coefficient of variation
  • All parametric tests

Summary Comparison

Level      Categories   Order   Equal Intervals   True Zero   Example
Nominal    ✓            ✗       ✗                 ✗           Gender
Ordinal    ✓            ✓       ✗                 ✗           Education
Interval   ✓            ✓       ✓                 ✗           Temperature °C
Ratio      ✓            ✓       ✓                 ✓           Age, Income

The Likert Scale Debate

Are Likert Scales Ordinal or Interval?

Technically Ordinal

Individual Likert items (single questions) are ordinal because:

  • Categories have order
  • Intervals may not be equal
  • Is the gap between "agree" and "strongly agree" the same as between "neutral" and "agree"?

Often Treated as Interval

Likert scales (sum of multiple items) are often treated as interval because:

  • Multiple items average out irregularities
  • Research shows parametric tests are robust
  • Practical convention in social sciences
  • Allows more powerful statistical analyses
Practical Advice:
  • Use 5-7 point scales (more points = more interval-like)
  • Sum multiple items to create a scale (not single items)
  • Report your assumption and justify it
  • When in doubt, use both parametric and non-parametric tests
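The "use both" advice can be sketched as follows: the same invented summed-scale data analyzed both ways. If the two coefficients agree, the ordinal/interval choice makes little practical difference:

```python
import numpy as np

# Hypothetical summed Likert scores (5 items x 1-5, so range 5-25) for two
# constructs measured on the same eight respondents (no tied values).
satisfaction = np.array([22, 18, 15, 24, 10, 20, 17, 13])
engagement   = np.array([20, 17, 14, 23, 12, 21, 15, 11])

# Parametric view (treat sums as interval): Pearson r on the raw scores.
pearson_r = np.corrcoef(satisfaction, engagement)[0, 1]

# Non-parametric view (treat as ordinal): Spearman rho = Pearson r on ranks.
def ranks(a):
    return a.argsort().argsort() + 1   # simple ranking; assumes no ties

spearman_rho = np.corrcoef(ranks(satisfaction), ranks(engagement))[0, 1]
print(round(pearson_r, 2), round(spearman_rho, 2))
```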

Common Mistakes

  • Treating nominal as ordinal: Coding male=1, female=2 doesn't make gender ordinal!
  • Computing means for ordinal data: "Average education level = 2.3" is problematic
  • Assuming equal intervals: The difference between ranks 1 and 2 may not equal 2 and 3
  • Using wrong statistics: Can't use Pearson correlation with nominal variables

Topic 3

Validity of Measurement

Validity refers to whether you are measuring what you intend to measure. A valid measure accurately captures the construct of interest. Without validity, your research conclusions are questionable regardless of how reliable your measure might be.

Validity is the degree to which a measurement instrument measures what it is supposed to measure.

Types of Validity

Face Validity

Weakest

Definition: Does the measure appear to measure what it's supposed to measure on the surface?

How assessed: Subjective judgment—"Does it look right?"

Example:

A math test that includes math problems has face validity. A "creativity test" that only asks about favorite colors might lack face validity.

Limitations:

  • Subjective and superficial
  • Not sufficient alone
  • Can be misleading

When important:

  • Participant buy-in (they should feel items are relevant)
  • Stakeholder acceptance

Content Validity

Important

Definition: Does the measure cover all aspects (dimensions) of the construct adequately?

How assessed:

  • Expert judgment
  • Systematic review of literature
  • Content validity ratio (CVR)
  • Content validity index (CVI)

Example:

A depression scale should cover cognitive, emotional, and physical symptoms—not just sadness. Missing "fatigue" or "sleep problems" would reduce content validity.

Process:

  1. Define the construct comprehensively
  2. Identify all dimensions
  3. Create items for each dimension
  4. Have experts rate item relevance
  5. Calculate CVI (should be >.80)

Construct Validity

Most Important

Definition: Does the measure actually capture the theoretical construct it claims to measure?

Sub-types:

Convergent Validity

Measure correlates highly with other measures of the same construct

Example: Your new anxiety scale should correlate strongly with existing validated anxiety scales

Should be: r > .50

Discriminant (Divergent) Validity

Measure does NOT correlate highly with measures of different constructs

Example: Your anxiety scale should not correlate too highly with a personality scale (showing they're different constructs)

Should be: r < .30 with unrelated constructs

Known-Groups Validity

Measure can distinguish between groups known to differ

Example: Depression scale should show higher scores for clinical patients vs. healthy controls

How assessed:

  • Factor analysis (do items load on expected factors?)
  • Multitrait-multimethod matrix
  • Correlation with other measures
  • Experimental manipulation
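As a minimal sketch of convergent and discriminant checks, with entirely invented scale scores; the .50 and .30 cutoffs are the rules of thumb given above:

```python
import numpy as np

# Hypothetical scores for the same eight respondents: a new anxiety scale,
# an established anxiety scale, and an unrelated extraversion scale.
new_anxiety  = np.array([12, 18, 25, 9, 30, 22, 15, 27])
old_anxiety  = np.array([14, 17, 27, 8, 28, 20, 16, 25])
extraversion = np.array([28, 15, 30, 24, 27, 22, 20, 18])

convergent = np.corrcoef(new_anxiety, old_anxiety)[0, 1]     # same construct
discriminant = np.corrcoef(new_anxiety, extraversion)[0, 1]  # different construct

# Convergent evidence: r > .50 with the established measure.
# Discriminant evidence: |r| < .30 with the unrelated measure.
print(round(convergent, 2), round(discriminant, 2))
```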

Criterion Validity

Very Important

Definition: Does the measure correlate with an external criterion or outcome?

Sub-types:

Concurrent Validity

Measure correlates with criterion assessed at the SAME time

Example: Depression scale scores correlate with clinical diagnosis (assessed simultaneously)

Predictive Validity

Measure predicts FUTURE criterion or behavior

Example: SAT scores predict college GPA (assessed years later)

How assessed:

  • Correlation with criterion variable
  • Regression analysis
  • Sensitivity/specificity (for diagnostic measures)

Challenge: Finding a good "gold standard" criterion
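A small sketch of a predictive-validity check on invented data: correlate the measure with a criterion collected later, and fit a simple regression line for prediction:

```python
import numpy as np

# Hypothetical data: admissions test score now, GPA two years later.
test_score = np.array([1000, 1200, 1100, 1350, 950, 1250])
later_gpa  = np.array([2.8, 3.4, 3.0, 3.7, 2.6, 3.5])

r = np.corrcoef(test_score, later_gpa)[0, 1]             # validity coefficient
slope, intercept = np.polyfit(test_score, later_gpa, 1)  # simple linear regression

# Use the fitted line to predict the criterion for a new score of 1150.
predicted = slope * 1150 + intercept
print(round(r, 2), round(predicted, 2))
```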

Validity Assessment Summary

Type        Question Asked                                  How to Assess                   Evidence Needed
Face        Does it look like it measures the construct?    Subjective judgment             General appearance
Content     Does it cover all aspects of the construct?     Expert review, CVI              CVI > .80
Construct   Does it capture the theoretical construct?      Factor analysis, correlations   Good factor structure; convergent/discriminant evidence
Criterion   Does it relate to real-world outcomes?          Correlation with criterion      Significant correlation with outcome

Threats to Validity

Social Desirability Bias

Participants respond in socially acceptable ways rather than honestly

Solution: Anonymous responses, indirect questions, social desirability scales

Acquiescence Bias

Tendency to agree regardless of content

Solution: Mix positively and negatively worded items

Extreme Response Bias

Tendency to use extreme endpoints of scales

Solution: Use more scale points, consider cultural factors

Ambiguous Items

Questions that can be interpreted differently

Solution: Clear wording, pilot testing, cognitive interviews

Validity is NOT All-or-Nothing

  • Validity is a matter of degree
  • Validity is established over time through accumulated evidence
  • Validity is specific to a purpose and population
  • A measure can be valid for one use but not another

Example: SAT is valid for predicting college success, but not for measuring creativity or emotional intelligence.

Topic 4

Reliability of Measurement

Reliability refers to the consistency and stability of measurement. A reliable measure produces the same results under consistent conditions. While reliability is necessary for validity, it is not sufficient—you can have a reliable measure that consistently measures the wrong thing.

Reliability is the degree to which a measurement instrument yields consistent, reproducible results across time, settings, and assessors.

The Relationship Between Validity and Reliability

Not Reliable, Not Valid

Scattered all over—inconsistent and inaccurate

Reliable, Not Valid

Consistent but systematically wrong

Not Reliable, Valid on Average

Centered on the target on average, but too variable to be useful

Reliable AND Valid

Consistent and accurate—the goal!

Key Insight: Reliability is necessary but not sufficient for validity. A measure can be reliable without being valid, but it cannot be valid without being reliable.

Types of Reliability

Test-Retest Reliability

Definition: Consistency of scores when the same test is given to the same people at two different times

How assessed:

  1. Administer measure to sample
  2. Wait appropriate interval (usually 2-4 weeks)
  3. Administer same measure again
  4. Correlate scores from Time 1 and Time 2

Statistic: Pearson correlation coefficient (r)

Acceptable values:

  • r ≥ .70 for research purposes
  • r ≥ .80 for individual decisions

Considerations:

  • Time interval matters (too short = memory effects; too long = real change)
  • Practice effects may inflate reliability
  • Assumes construct is stable over time
  • Not appropriate for unstable constructs (mood states)
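The four assessment steps above reduce to a single correlation once the data are in; a minimal sketch with invented scores:

```python
import numpy as np

# Hypothetical test-retest data: six people measured two weeks apart.
time1 = np.array([24, 30, 18, 27, 21, 33])
time2 = np.array([25, 28, 19, 27, 20, 31])

# Test-retest reliability = correlation between Time 1 and Time 2 scores.
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))   # compare against the .70 (research) / .80 (decisions) benchmarks
```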

Parallel Forms Reliability

Definition: Consistency between two equivalent versions of the same test

How assessed:

  1. Create two equivalent forms of the test
  2. Administer both forms (same people, same time or different times)
  3. Correlate scores from Form A and Form B

When used:

  • Pre-test and post-test designs (avoid memory effects)
  • Repeated assessments (e.g., clinical monitoring)
  • High-stakes testing (prevent cheating)

Challenge: Creating truly equivalent forms is difficult and time-consuming

Internal Consistency

Definition: Degree to which items in a scale measure the same construct (hang together)

Most common measures:

Cronbach's Alpha (α)

Most widely used; assesses overall internal consistency

Interpretation:

  • α ≥ .90 — Excellent
  • α .80-.89 — Good
  • α .70-.79 — Acceptable
  • α .60-.69 — Questionable
  • α < .60 — Poor/Unacceptable

Note: Alpha is affected by number of items—more items = higher alpha (can be artificially inflated)

Split-Half Reliability

Divide items into two halves; correlate the halves

Usually corrected with Spearman-Brown formula
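The Spearman-Brown correction itself is one line; here r_half is an invented half-test correlation:

```python
# Split-half sketch: the correlation between two half-tests underestimates the
# reliability of the full-length test, so it is corrected with Spearman-Brown.
r_half = 0.70                          # illustrative half-test correlation
r_full = (2 * r_half) / (1 + r_half)   # Spearman-Brown prophecy formula
print(round(r_full, 3))   # 0.824: estimated full-length reliability
```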

McDonald's Omega (ω)

Newer alternative; makes fewer assumptions than alpha

Increasingly recommended over alpha

Item-Total Correlation:

Correlation between each item and the total score. Items with r < .30 may need to be revised or removed.
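Cronbach's alpha is easy to compute directly from its definition; a sketch on an invented 6-respondent x 4-item matrix, including the corrected item-total correlations described above:

```python
import numpy as np

# Hypothetical item-level data: 6 respondents x 4 Likert items (1-5).
items = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
], dtype=float)

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)        # variance of each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score

# Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / total variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Corrected item-total correlation: each item vs. the sum of the OTHER items;
# items below .30 would be candidates for revision or removal.
totals = items.sum(axis=1)
item_total = [np.corrcoef(items[:, j], totals - items[:, j])[0, 1]
              for j in range(k)]
print(round(alpha, 2), [round(c, 2) for c in item_total])
```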

Inter-Rater Reliability

Definition: Agreement between different raters/observers assessing the same thing

When needed:

  • Observations
  • Interviews coded by multiple coders
  • Content analysis
  • Essay grading
  • Clinical assessments

Measures:

Percent Agreement

Simple but doesn't account for chance agreement

Formula: (Agreements / Total) × 100

Cohen's Kappa (κ)

For categorical data (2 raters)

Accounts for chance agreement

  • κ ≥ .81 — Almost perfect
  • κ .61-.80 — Substantial
  • κ .41-.60 — Moderate
  • κ .21-.40 — Fair
  • κ < .20 — Slight/Poor
Intraclass Correlation (ICC)

For continuous data; multiple raters

Most appropriate for many situations
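Percent agreement and Cohen's kappa can both be computed by hand; a sketch with invented codes from two raters:

```python
from collections import Counter

# Hypothetical categorical codes from two raters for the same 10 observations.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "no"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # percent agreement

# Expected chance agreement: both raters pick the same category independently.
c1, c2 = Counter(rater1), Counter(rater2)
expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)
print(observed, round(kappa, 2))   # 0.8 raw agreement, kappa 0.6 (moderate)
```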

Reliability Summary

Type                   What It Assesses           Statistic          When to Use
Test-Retest            Stability over time        Pearson r          Stable traits; ability tests
Parallel Forms         Equivalence of forms       Pearson r          Pre/post designs; repeated testing
Internal Consistency   Item homogeneity           Cronbach's α, ω    Multi-item scales
Inter-Rater            Agreement between raters   Kappa, ICC         Observations, coding, subjective scoring

Factors Affecting Reliability

Number of Items

More items generally = higher reliability

Recommendation: Use at least 3-5 items per construct, ideally more

Sample Homogeneity

Homogeneous samples = lower reliability (restricted range)

Recommendation: Report sample characteristics; use diverse samples

Response Options

More response options = higher reliability

Recommendation: Use 5-7 point scales rather than 2-3

Item Clarity

Ambiguous items reduce reliability

Recommendation: Pilot test, cognitive interviewing, clear wording

Testing Conditions

Distractions, fatigue affect responses

Recommendation: Standardize administration conditions

Rater Training

Untrained raters = lower inter-rater reliability

Recommendation: Train raters, use clear coding schemes, practice

Reporting Reliability

Always report reliability for your study, not just from previous studies:

  • "Internal consistency was acceptable (α = .82)"
  • "Inter-rater reliability was substantial (κ = .74)"
  • "Test-retest reliability over 2 weeks was r = .85"

Note: Reliability can vary across samples, so calculate for YOUR data!

Topic 5

Developing and Selecting Scales

Whether you develop your own scale or use an existing one, understanding scale development principles is essential. This topic covers common scale types, the process of developing new measures, and guidelines for selecting existing validated instruments.

Common Types of Rating Scales

Likert Scale

Measures agreement with statements

"I enjoy my work."

1 = Strongly Disagree   2 = Disagree   3 = Neutral   4 = Agree   5 = Strongly Agree
Tips:
  • Use 5-7 points for better discrimination
  • Include reverse-coded items
  • Label all points, not just endpoints
  • Consider including or excluding neutral

Semantic Differential Scale

Rates concepts using bipolar adjectives

Rate your job:

Boring      1  2  3  4  5  6  7   Exciting
Difficult   1  2  3  4  5  6  7   Easy
Tips:
  • Use truly opposite adjectives
  • 7-point scale most common
  • Good for measuring attitudes, perceptions

Visual Analog Scale (VAS)

Continuous line between two endpoints

Rate your pain:

No Pain |--------------------------------| Worst Pain

Score: Distance from left (0-100mm)

Tips:
  • Usually 100mm line
  • Measure in millimeters for score
  • Good for subjective experiences
  • Avoids number bias

Numerical Rating Scale

Direct numerical rating

How satisfied are you with the service? (0-10)

0   1   2   3   4   5   6   7   8   9   10
(0 = Not at all satisfied, 10 = Completely satisfied)

Frequency Scale

Measures how often something occurs

"How often do you exercise?"

  • ○ Never
  • ○ Rarely (few times a year)
  • ○ Sometimes (monthly)
  • ○ Often (weekly)
  • ○ Very often (daily)
Tips:
  • Define time frames clearly
  • Make categories mutually exclusive
  • Consider specific numbers vs. labels

Scale Development Process

Step 1: Define the Construct

  • Review literature thoroughly
  • Write clear conceptual definition
  • Identify dimensions/facets
  • Distinguish from related constructs

Step 2: Generate Item Pool

  • Write many more items than needed (2-3x)
  • Cover all dimensions
  • Vary item wording (positive/negative)
  • Use clear, simple language
  • Review existing scales for ideas

Step 3: Expert Review

  • Have experts evaluate content validity
  • Calculate Content Validity Index
  • Revise based on feedback
  • Eliminate poor items

Step 4: Pilot Testing

  • Administer to small sample (30-50)
  • Check for comprehension issues
  • Cognitive interviews
  • Initial item analysis

Step 5: Main Validation Study

  • Collect data from larger sample (N > 200)
  • Exploratory Factor Analysis (EFA)
  • Item analysis (correlations, distributions)
  • Calculate reliability (α)

Step 6: Confirmatory Testing

  • New sample for Confirmatory Factor Analysis
  • Test convergent/discriminant validity
  • Assess criterion validity
  • Test-retest reliability

Step 7: Cross-Validation

  • Test in different populations
  • Test measurement invariance
  • Establish norms if applicable
  • Publish and disseminate

Using Existing Scales

When to Use Existing Scales:

  • Well-validated scale already exists
  • Want to compare with previous research
  • Limited time/resources for development
  • Need established norms

Criteria for Selecting a Scale:

Theoretical fit

Does the conceptual definition match yours?

Validity evidence

Has validity been established? In what populations?

Reliability evidence

What are reported reliability coefficients?

Population appropriateness

Was it validated on similar population?

Practical considerations

Length, language, cost, permissions

Citation count

How widely used? (check Google Scholar)

Where to Find Existing Scales:

  • Published articles: Check methods sections
  • PsycTESTS: APA database of psychological tests
  • RAND Health Care: Free validated health measures
  • Measurement instrument databases: PROMIS, NIH Toolbox
  • Systematic reviews: Reviews of measures in specific areas
  • Original authors: Contact for permissions and materials

Important Considerations

  • Permission: Many scales require permission or purchase
  • Modification: Changes may affect validity—report any modifications
  • Translation: Validated translation required if using in different language
  • Cultural adaptation: May need adaptation for different cultures
  • Re-validation: Consider validating in your specific population

Item Writing Guidelines

DO

  • Keep items short and simple
  • Use clear, concrete language
  • Focus on one idea per item
  • Use active voice
  • Write at appropriate reading level
  • Include both positive and negative items
  • Match items to target population

AVOID

  • Double-barreled items ("I am happy and healthy")
  • Double negatives
  • Jargon or technical terms
  • Leading or loaded language
  • Absolute terms ("always," "never")
  • Hypothetical situations
  • Items all keyed in same direction

Quick Checklist for Your Measure

  • ☐ Construct clearly defined
  • ☐ Items reviewed by experts
  • ☐ Pilot tested for clarity
  • ☐ Reliability calculated and reported
  • ☐ Validity evidence provided
  • ☐ Appropriate for your population
  • ☐ Permissions obtained (if using existing scale)
  • ☐ Any modifications documented

Summary

Module 07 Key Takeaways

What You've Learned

  • Measurement involves operationalizing abstract concepts into measurable variables
  • Four levels of measurement (nominal, ordinal, interval, ratio) determine appropriate statistics
  • Validity asks "Are we measuring what we think we're measuring?" (face, content, construct, criterion)
  • Reliability asks "Are we measuring consistently?" (test-retest, internal consistency, inter-rater)
  • Scale development is rigorous; using existing validated scales is often preferred

Next Steps

In Module 08: Ethics in Research, you'll learn about ethical principles and guidelines for conducting research, including informed consent, IRB approval, protecting participants, and responsible conduct of research.

Practice

Measurement Practice Exercises

Applied Measurement Tasks

  1. Operationalization: Choose a construct (e.g., "student engagement") and:
    • Write a conceptual definition
    • Identify 3-4 dimensions
    • Create 2-3 indicators for each dimension
  2. Level Identification: Classify these variables by level of measurement:
    • Political party affiliation
    • Customer satisfaction (1-10)
    • Response time in seconds
    • Pain severity (none/mild/moderate/severe)
    • Temperature in Kelvin
  3. Scale Selection: Find an existing validated scale for a construct related to your research interest:
    • Document validity and reliability evidence
    • Identify the population it was validated on
    • Assess appropriateness for your study
  4. Item Critique: Identify problems with these items:
    • "I never sometimes feel stressed"
    • "How satisfied are you with your salary and benefits?"
    • "Don't you think exercise is important?"
  5. Reliability Practice: Calculate Cronbach's alpha for a short scale using SPSS, R, or online calculator. Interpret the result.