Topic 1

Fundamentals of Measurement

Measurement is the process of assigning numbers or labels to objects, events, or characteristics according to specific rules. In research, we measure variables to test hypotheses and answer research questions. Understanding measurement principles is essential for collecting meaningful data and drawing valid conclusions.

What is Measurement?

Measurement is the systematic assignment of values to represent properties of objects, events, or people according to a set of rules.

Key Components of Measurement:

Concept

The abstract idea you want to measure

Examples: Intelligence, satisfaction, anxiety, motivation

Construct

A concept that has been given a precise theoretical definition

Example: "Self-efficacy" defined as belief in one's ability to succeed

Operationalization

The process of defining how a construct will be measured

Example: Measuring self-efficacy using Bandura's 10-item scale

Variable

The measurable representation of a construct

Example: Self-efficacy score ranging from 10-40

Indicator

Observable evidence of the construct (individual items)

Example: "I can solve difficult problems if I try hard enough"

The Operationalization Process

1. Abstract Concept: Start with the general idea
   Example: "Job Satisfaction"

2. Conceptual Definition: Define what you mean theoretically
   Example: "A positive emotional state resulting from the appraisal of one's job"

3. Dimensions: Identify components of the concept
   Example: Pay, promotion, supervision, coworkers, work itself

4. Indicators: Create observable/measurable items
   Example: "I am satisfied with my current salary" (1-5 scale)

5. Measurement: Collect and score the data
   Example: Total score = sum of all items

Complete Operationalization Example

Concept: Academic Stress

Conceptual Definition: The physical and psychological tension experienced by students resulting from academic demands that exceed their perceived ability to cope.

Dimensions:

  • Workload stress (amount of work)
  • Exam stress (testing anxiety)
  • Performance pressure (expectations)
  • Time pressure (deadlines)

Indicators (sample items):

  • "I feel overwhelmed by the amount of coursework"
  • "I feel anxious before exams"
  • "I worry about not meeting my professors' expectations"
  • "I often feel rushed to meet deadlines"

Response Scale: 1 (Never) to 5 (Always)

Scoring: Sum of items; higher scores = higher stress
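As a quick sketch, the scoring rule above can be expressed in a few lines of Python; the item responses here are invented for illustration only:

```python
# Hypothetical responses to the four sample stress items (1 = Never ... 5 = Always).
# Each inner list is one student's answers to the four indicators above.
responses = [
    [4, 5, 3, 4],   # student A
    [2, 1, 2, 3],   # student B
    [5, 5, 4, 5],   # student C
]

# Score = sum of item responses; with 4 items on a 1-5 scale the range is 4-20.
scores = [sum(r) for r in responses]
print(scores)   # higher score = higher reported academic stress
```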

Types of Variables

By Role in Research

Independent Variable (IV)

The presumed cause; manipulated or measured as predictor

Example: Teaching method (traditional vs. online)

Dependent Variable (DV)

The presumed effect; the outcome you measure

Example: Student test scores

Mediating Variable

Explains HOW the IV affects the DV (mechanism)

Example: Student engagement mediates teaching method → scores

Moderating Variable

Affects WHEN or for WHOM the effect occurs

Example: Learning style moderates teaching method effect

Control Variable

Held constant or statistically controlled

Example: Prior GPA, age

Confounding Variable

Unwanted variable that affects both IV and DV

Example: Socioeconomic status affecting both study habits and scores

By Nature of Data

Categorical (Qualitative)

Categories or groups

  • Nominal: Gender, major, country
  • Ordinal: Education level, rank

Continuous (Quantitative)

Numerical values with meaningful intervals

  • Interval: Temperature, test scores
  • Ratio: Age, income, weight

Why Good Measurement Matters

The quality of your research depends on the quality of your measurements:

  • Poor measurement → Invalid conclusions
  • If you don't measure what you think you're measuring, your conclusions are meaningless
  • Statistical significance is worthless if the underlying measurement is flawed
  • "Garbage in, garbage out" applies to research

Topic 2

Levels of Measurement

The level of measurement determines what mathematical operations you can perform with your data and what statistical tests are appropriate. Understanding these levels is crucial for proper data analysis and interpretation.

The Four Levels of Measurement

1 Nominal

Lowest Level

Definition: Categories or names with no inherent order or ranking

Properties:
  • Categories are mutually exclusive
  • Categories are exhaustive
  • No ranking or order
  • Numbers are just labels
Examples:
  • Gender (male, female, non-binary)
  • Nationality (Thai, American, Japanese)
  • Blood type (A, B, AB, O)
  • Marital status (single, married, divorced)
  • Major (Psychology, Engineering, Business)
  • Jersey numbers (not for math, just identification)
Allowed Operations:

= ≠ (equal, not equal)

Cannot: < > + - × ÷

Appropriate Statistics:
  • Mode (most frequent)
  • Frequency counts
  • Percentages
  • Chi-square test
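A minimal sketch of these nominal-level operations in Python, using made-up data; the chi-square statistic is computed by hand from a hypothetical 2x2 table rather than with a library call:

```python
from collections import Counter

# Hypothetical nominal data: declared major for a small sample.
majors = ["Psychology", "Business", "Psychology", "Engineering",
          "Psychology", "Business"]

counts = Counter(majors)                 # frequency counts per category
mode = counts.most_common(1)[0][0]       # mode = most frequent category

# Chi-square for an invented 2x2 contingency table (e.g., gender x preference):
# expected[i][j] = row_total * col_total / grand_total
observed = [[30, 20], [10, 40]]
row = [sum(r) for r in observed]
col = [sum(c) for c in zip(*observed)]
total = sum(row)
chi2 = sum((observed[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))
print(mode, round(chi2, 2))
```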

2 Ordinal

Ranked

Definition: Categories that can be ranked or ordered, but intervals between ranks are not equal

Properties:
  • Has all properties of nominal
  • Categories can be ranked
  • Distance between ranks unknown
  • No true zero point
Examples:
  • Education level (high school, bachelor's, master's, PhD)
  • Socioeconomic status (low, medium, high)
  • Race finishing position (1st, 2nd, 3rd)
  • Pain level (none, mild, moderate, severe)
  • Likert items* (strongly disagree to strongly agree)
  • Military rank (private, corporal, sergeant)

*Likert scales (multiple items summed) are often treated as interval

Allowed Operations:

= ≠ < > (equal, not equal, greater, less)

Cannot: + - × ÷ (no meaningful addition)

Appropriate Statistics:
  • Mode, Median (not mean)
  • Percentiles, ranks
  • Spearman correlation
  • Mann-Whitney U, Kruskal-Wallis
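To make the "ranks only" idea concrete, here is a small sketch of Spearman's rank correlation on invented data, using the no-ties shortcut formula:

```python
import numpy as np

# Hypothetical ordinal data: two rankings of the same five people (no ties).
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# With no ties, Spearman's rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
# which equals Pearson's r computed on the ranks.
d = x - y
n = len(x)
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rho)   # 0.8: strong agreement between the two rankings
```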

3 Interval

Equal Intervals

Definition: Ordered categories with equal intervals between values, but no true zero point

Properties:
  • Has all properties of ordinal
  • Equal intervals between values
  • No absolute zero (zero is arbitrary)
  • Ratios are not meaningful
Examples:
  • Temperature in Celsius or Fahrenheit (0° doesn't mean "no temperature")
  • Calendar year (Year 0 is arbitrary)
  • IQ scores (0 doesn't mean no intelligence)
  • Standardized test scores (SAT, GRE)
  • Likert scales (when summed across items)
  • pH scale
Allowed Operations:

= ≠ < > + - (addition, subtraction meaningful)

Cannot: × ÷ (ratios not meaningful)

30°C - 20°C = 10°C difference ✓

30°C is NOT "twice as hot" as 15°C ✗

Appropriate Statistics:
  • Mean, Standard deviation
  • Pearson correlation
  • t-tests, ANOVA
  • Regression analysis
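A short sketch of the interval-level statistics named above, on invented test scores:

```python
import numpy as np

# Hypothetical interval data: two standardized test scores per student.
scores_a = np.array([95, 110, 105, 120, 100])
scores_b = np.array([90, 115, 100, 125, 95])

mean_a = scores_a.mean()                       # mean is meaningful at interval level
sd_a = scores_a.std(ddof=1)                    # sample standard deviation
r = np.corrcoef(scores_a, scores_b)[0, 1]      # Pearson correlation

# Differences are meaningful (120 - 95 = 25 points), but ratios are not.
print(mean_a, round(sd_a, 2), round(r, 2))
```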

4 Ratio

Highest Level

Definition: Equal intervals AND a true zero point (zero means absence of the property)

Properties:
  • Has all properties of interval
  • True zero point exists
  • Ratios are meaningful
  • All mathematical operations allowed
Examples:
  • Age (0 = birth)
  • Income (0 = no income)
  • Weight (0 = no weight)
  • Height (0 = no height)
  • Time (0 = no time)
  • Temperature in Kelvin (0 = absolute zero)
  • Number of children (0 = no children)
  • Reaction time in milliseconds
Allowed Operations:

= ≠ < > + - × ÷ (all operations)

$60,000 is twice $30,000 ✓

Someone 40 years old has lived twice as long as someone 20 ✓

Appropriate Statistics:
  • All statistics available
  • Geometric mean
  • Coefficient of variation
  • All parametric tests

Summary Comparison

Level      Categories   Order   Equal Intervals   True Zero   Example
Nominal    ✓            ✗       ✗                 ✗           Gender
Ordinal    ✓            ✓       ✗                 ✗           Education
Interval   ✓            ✓       ✓                 ✗           Temperature °C
Ratio      ✓            ✓       ✓                 ✓           Age, Income

The Likert Scale Debate

Are Likert Scales Ordinal or Interval?

Technically Ordinal

Individual Likert items (single questions) are ordinal because:

  • Categories have order
  • Intervals may not be equal
  • Is the gap between "agree" and "strongly agree" the same as between "neutral" and "agree"?

Often Treated as Interval

Likert scales (sum of multiple items) are often treated as interval because:

  • Multiple items average out irregularities
  • Research shows parametric tests are robust
  • Practical convention in social sciences
  • Allows more powerful statistical analyses
Practical Advice:
  • Use 5-7 point scales (more points = more interval-like)
  • Sum multiple items to create a scale (not single items)
  • Report your assumption and justify it
  • When in doubt, use both parametric and non-parametric tests
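The "use both" advice can be sketched as follows: the same invented summed-scale data analyzed both ways. If the two coefficients agree, the ordinal/interval choice makes little practical difference:

```python
import numpy as np

# Hypothetical summed Likert scores (5 items x 1-5, so range 5-25) for two
# constructs measured on the same eight respondents (no tied values).
satisfaction = np.array([22, 18, 15, 24, 10, 20, 17, 13])
engagement   = np.array([20, 17, 14, 23, 12, 21, 15, 11])

# Parametric view (treat sums as interval): Pearson r on the raw scores.
pearson_r = np.corrcoef(satisfaction, engagement)[0, 1]

# Non-parametric view (treat as ordinal): Spearman rho = Pearson r on ranks.
def ranks(a):
    return a.argsort().argsort() + 1   # simple ranking; assumes no ties

spearman_rho = np.corrcoef(ranks(satisfaction), ranks(engagement))[0, 1]
print(round(pearson_r, 2), round(spearman_rho, 2))
```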

Common Mistakes

  • Treating nominal as ordinal: Coding male=1, female=2 doesn't make gender ordinal!
  • Computing means for ordinal data: "Average education level = 2.3" is problematic
  • Assuming equal intervals: The difference between ranks 1 and 2 may not equal 2 and 3
  • Using wrong statistics: Can't use Pearson correlation with nominal variables

Topic 3

Validity of Measurement

Validity refers to whether you are measuring what you intend to measure. A valid measure accurately captures the construct of interest. Without validity, your research conclusions are questionable regardless of how reliable your measure might be.

Validity is the degree to which a measurement instrument measures what it is supposed to measure.

Types of Validity

Face Validity

Weakest

Definition: Does the measure appear to measure what it's supposed to measure on the surface?

How assessed: Subjective judgment—"Does it look right?"

Example:

A math test that includes math problems has face validity. A "creativity test" that only asks about favorite colors might lack face validity.

Limitations:

  • Subjective and superficial
  • Not sufficient alone
  • Can be misleading

When important:

  • Participant buy-in (they should feel items are relevant)
  • Stakeholder acceptance

Content Validity

Important

Definition: Does the measure cover all aspects (dimensions) of the construct adequately?

How assessed:

  • Expert judgment
  • Systematic review of literature
  • Content validity ratio (CVR)
  • Content validity index (CVI)

Example:

A depression scale should cover cognitive, emotional, and physical symptoms—not just sadness. Missing "fatigue" or "sleep problems" would reduce content validity.

Process:

  1. Define the construct comprehensively
  2. Identify all dimensions
  3. Create items for each dimension
  4. Have experts rate item relevance
  5. Calculate CVI (should be >.80)

Construct Validity

Most Important

Definition: Does the measure actually capture the theoretical construct it claims to measure?

Sub-types:

Convergent Validity

Measure correlates highly with other measures of the same construct

Example: Your new anxiety scale should correlate strongly with existing validated anxiety scales

Should be: r > .50

Discriminant (Divergent) Validity

Measure does NOT correlate highly with measures of different constructs

Example: Your anxiety scale should not correlate too highly with a personality scale (showing they're different constructs)

Should be: r < .30 with unrelated constructs

Known-Groups Validity

Measure can distinguish between groups known to differ

Example: Depression scale should show higher scores for clinical patients vs. healthy controls

How assessed:

  • Factor analysis (do items load on expected factors?)
  • Multitrait-multimethod matrix
  • Correlation with other measures
  • Experimental manipulation
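As a minimal sketch of convergent and discriminant checks, with entirely invented scale scores; the .50 and .30 cutoffs are the rules of thumb given above:

```python
import numpy as np

# Hypothetical scores for the same eight respondents: a new anxiety scale,
# an established anxiety scale, and an unrelated extraversion scale.
new_anxiety  = np.array([12, 18, 25, 9, 30, 22, 15, 27])
old_anxiety  = np.array([14, 17, 27, 8, 28, 20, 16, 25])
extraversion = np.array([28, 15, 30, 24, 27, 22, 20, 18])

convergent = np.corrcoef(new_anxiety, old_anxiety)[0, 1]     # same construct
discriminant = np.corrcoef(new_anxiety, extraversion)[0, 1]  # different construct

# Convergent evidence: r > .50 with the established measure.
# Discriminant evidence: |r| < .30 with the unrelated measure.
print(round(convergent, 2), round(discriminant, 2))
```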

Criterion Validity

Very Important

Definition: Does the measure correlate with an external criterion or outcome?

Sub-types:

Concurrent Validity

Measure correlates with criterion assessed at the SAME time

Example: Depression scale scores correlate with clinical diagnosis (assessed simultaneously)

Predictive Validity

Measure predicts FUTURE criterion or behavior

Example: SAT scores predict college GPA (assessed years later)

How assessed:

  • Correlation with criterion variable
  • Regression analysis
  • Sensitivity/specificity (for diagnostic measures)

Challenge: Finding a good "gold standard" criterion
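A small sketch of a predictive-validity check on invented data: correlate the measure with a criterion collected later, and fit a simple regression line for prediction:

```python
import numpy as np

# Hypothetical data: admissions test score now, GPA two years later.
test_score = np.array([1000, 1200, 1100, 1350, 950, 1250])
later_gpa  = np.array([2.8, 3.4, 3.0, 3.7, 2.6, 3.5])

r = np.corrcoef(test_score, later_gpa)[0, 1]             # validity coefficient
slope, intercept = np.polyfit(test_score, later_gpa, 1)  # simple linear regression

# Use the fitted line to predict the criterion for a new score of 1150.
predicted = slope * 1150 + intercept
print(round(r, 2), round(predicted, 2))
```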

Validity Assessment Summary

Type        Question Asked                                  How to Assess                   Evidence Needed
Face        Does it look like it measures the construct?    Subjective judgment             General appearance
Content     Does it cover all aspects of the construct?     Expert review, CVI              CVI > .80
Construct   Does it capture the theoretical construct?      Factor analysis, correlations   Good factor structure; convergent/discriminant evidence
Criterion   Does it relate to real-world outcomes?          Correlation with criterion      Significant correlation with outcome

Threats to Validity

Social Desirability Bias

Participants respond in socially acceptable ways rather than honestly

Solution: Anonymous responses, indirect questions, social desirability scales

Acquiescence Bias

Tendency to agree regardless of content

Solution: Mix positively and negatively worded items

Extreme Response Bias

Tendency to use extreme endpoints of scales

Solution: Use more scale points, consider cultural factors

Ambiguous Items

Questions that can be interpreted differently

Solution: Clear wording, pilot testing, cognitive interviews

Validity is NOT All-or-Nothing

  • Validity is a matter of degree
  • Validity is established over time through accumulated evidence
  • Validity is specific to a purpose and population
  • A measure can be valid for one use but not another

Example: SAT is valid for predicting college success, but not for measuring creativity or emotional intelligence.

Topic 4

Reliability of Measurement

Reliability refers to the consistency and stability of measurement. A reliable measure produces the same results under consistent conditions. While reliability is necessary for validity, it is not sufficient—you can have a reliable measure that consistently measures the wrong thing.

Reliability is the degree to which a measurement instrument yields consistent, reproducible results across time, settings, and assessors.

The Relationship Between Validity and Reliability

Not Reliable, Not Valid

Scattered all over—inconsistent and inaccurate

Reliable, Not Valid

Consistent but systematically wrong

Not Reliable, Valid on Average

Centered on the target on average, but too variable to be useful

Reliable AND Valid

Consistent and accurate—the goal!

Key Insight: Reliability is necessary but not sufficient for validity. A measure can be reliable without being valid, but it cannot be valid without being reliable.

Types of Reliability

Test-Retest Reliability

Definition: Consistency of scores when the same test is given to the same people at two different times

How assessed:

  1. Administer measure to sample
  2. Wait appropriate interval (usually 2-4 weeks)
  3. Administer same measure again
  4. Correlate scores from Time 1 and Time 2

Statistic: Pearson correlation coefficient (r)

Acceptable values:

  • r ≥ .70 for research purposes
  • r ≥ .80 for individual decisions

Considerations:

  • Time interval matters (too short = memory effects; too long = real change)
  • Practice effects may inflate reliability
  • Assumes construct is stable over time
  • Not appropriate for unstable constructs (mood states)
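The four assessment steps above reduce to a single correlation once the data are in; a minimal sketch with invented scores:

```python
import numpy as np

# Hypothetical test-retest data: six people measured two weeks apart.
time1 = np.array([24, 30, 18, 27, 21, 33])
time2 = np.array([25, 28, 19, 27, 20, 31])

# Test-retest reliability = correlation between Time 1 and Time 2 scores.
r = np.corrcoef(time1, time2)[0, 1]
print(round(r, 2))   # compare against the .70 (research) / .80 (decisions) benchmarks
```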

Parallel Forms Reliability

Definition: Consistency between two equivalent versions of the same test

How assessed:

  1. Create two equivalent forms of the test
  2. Administer both forms (same people, same time or different times)
  3. Correlate scores from Form A and Form B

When used:

  • Pre-test and post-test designs (avoid memory effects)
  • Repeated assessments (e.g., clinical monitoring)
  • High-stakes testing (prevent cheating)

Challenge: Creating truly equivalent forms is difficult and time-consuming

Internal Consistency

Definition: Degree to which items in a scale measure the same construct (hang together)

Most common measures:

Cronbach's Alpha (α)

Most widely used; assesses overall internal consistency

Interpretation:

  • α ≥ .90 — Excellent
  • α .80-.89 — Good
  • α .70-.79 — Acceptable
  • α .60-.69 — Questionable
  • α < .60 — Poor/Unacceptable

Note: Alpha is affected by number of items—more items = higher alpha (can be artificially inflated)

Split-Half Reliability

Divide items into two halves; correlate the halves

Usually corrected with Spearman-Brown formula
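The Spearman-Brown correction itself is one line; here r_half is an invented half-test correlation:

```python
# Split-half sketch: the correlation between two half-tests underestimates the
# reliability of the full-length test, so it is corrected with Spearman-Brown.
r_half = 0.70                          # illustrative half-test correlation
r_full = (2 * r_half) / (1 + r_half)   # Spearman-Brown prophecy formula
print(round(r_full, 3))   # 0.824: estimated full-length reliability
```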

McDonald's Omega (ω)

Newer alternative; makes fewer assumptions than alpha

Increasingly recommended over alpha

Item-Total Correlation:

Correlation between each item and the total score. Items with r < .30 may need to be revised or removed.
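Cronbach's alpha is easy to compute directly from its definition; a sketch on an invented 6-respondent x 4-item matrix, including the corrected item-total correlations described above:

```python
import numpy as np

# Hypothetical item-level data: 6 respondents x 4 Likert items (1-5).
items = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
], dtype=float)

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)        # variance of each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score

# Cronbach's alpha = k/(k-1) * (1 - sum(item variances) / total variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Corrected item-total correlation: each item vs. the sum of the OTHER items;
# items below .30 would be candidates for revision or removal.
totals = items.sum(axis=1)
item_total = [np.corrcoef(items[:, j], totals - items[:, j])[0, 1]
              for j in range(k)]
print(round(alpha, 2), [round(c, 2) for c in item_total])
```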

Inter-Rater Reliability

Definition: Agreement between different raters/observers assessing the same thing

When needed:

  • Observations
  • Interviews coded by multiple coders
  • Content analysis
  • Essay grading
  • Clinical assessments

Measures:

Percent Agreement

Simple but doesn't account for chance agreement

Formula: (Agreements / Total) × 100

Cohen's Kappa (κ)

For categorical data (2 raters)

Accounts for chance agreement

  • κ ≥ .81 — Almost perfect
  • κ .61-.80 — Substantial
  • κ .41-.60 — Moderate
  • κ .21-.40 — Fair
  • κ < .20 — Slight/Poor
Intraclass Correlation (ICC)

For continuous data; multiple raters

Most appropriate for many situations
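Percent agreement and Cohen's kappa can both be computed by hand; a sketch with invented codes from two raters:

```python
from collections import Counter

# Hypothetical categorical codes from two raters for the same 10 observations.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "no"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # percent agreement

# Expected chance agreement: both raters pick the same category independently.
c1, c2 = Counter(rater1), Counter(rater2)
expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

# Cohen's kappa corrects observed agreement for chance agreement.
kappa = (observed - expected) / (1 - expected)
print(observed, round(kappa, 2))   # 0.8 raw agreement, kappa 0.6 (moderate)
```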

Reliability Summary

Type                   What It Assesses           Statistic          When to Use
Test-Retest            Stability over time        Pearson r          Stable traits; ability tests
Parallel Forms         Equivalence of forms       Pearson r          Pre/post designs; repeated testing
Internal Consistency   Item homogeneity           Cronbach's α, ω    Multi-item scales
Inter-Rater            Agreement between raters   Kappa, ICC         Observations, coding, subjective scoring

Factors Affecting Reliability

Number of Items

More items generally = higher reliability

Recommendation: Use at least 3-5 items per construct, ideally more

Sample Homogeneity

Homogeneous samples = lower reliability (restricted range)

Recommendation: Report sample characteristics; use diverse samples

Response Options

More response options = higher reliability

Recommendation: Use 5-7 point scales rather than 2-3

Item Clarity

Ambiguous items reduce reliability

Recommendation: Pilot test, cognitive interviewing, clear wording

Testing Conditions

Distractions, fatigue affect responses

Recommendation: Standardize administration conditions

Rater Training

Untrained raters = lower inter-rater reliability

Recommendation: Train raters, use clear coding schemes, practice

Reporting Reliability

Always report reliability for your study, not just from previous studies:

  • "Internal consistency was acceptable (α = .82)"
  • "Inter-rater reliability was substantial (κ = .74)"
  • "Test-retest reliability over 2 weeks was r = .85"

Note: Reliability can vary across samples, so calculate for YOUR data!

Topic 5

Developing and Selecting Scales

Whether you develop your own scale or use an existing one, understanding scale development principles is essential. This topic covers common scale types, the process of developing new measures, and guidelines for selecting existing validated instruments.

Common Types of Rating Scales

Likert Scale

Measures agreement with statements

"I enjoy my work."

1 = Strongly Disagree   2 = Disagree   3 = Neutral   4 = Agree   5 = Strongly Agree
Tips:
  • Use 5-7 points for better discrimination
  • Include reverse-coded items
  • Label all points, not just endpoints
  • Consider including or excluding neutral

Semantic Differential Scale

Rates concepts using bipolar adjectives

Rate your job:

Boring      1  2  3  4  5  6  7   Exciting
Difficult   1  2  3  4  5  6  7   Easy
Tips:
  • Use truly opposite adjectives
  • 7-point scale most common
  • Good for measuring attitudes, perceptions

Visual Analog Scale (VAS)

Continuous line between two endpoints

Rate your pain:

No Pain |--------------------------------| Worst Pain

Score: Distance from left (0-100mm)

Tips:
  • Usually 100mm line
  • Measure in millimeters for score
  • Good for subjective experiences
  • Avoids number bias

Numerical Rating Scale

Direct numerical rating

How satisfied are you with the service? (0-10)

0   1   2   3   4   5   6   7   8   9   10
(0 = Not at all satisfied, 10 = Completely satisfied)

Frequency Scale

Measures how often something occurs

"How often do you exercise?"

  • ○ Never
  • ○ Rarely (few times a year)
  • ○ Sometimes (monthly)
  • ○ Often (weekly)
  • ○ Very often (daily)
Tips:
  • Define time frames clearly
  • Make categories mutually exclusive
  • Consider specific numbers vs. labels

Scale Development Process

Step 1: Define the Construct

  • Review literature thoroughly
  • Write clear conceptual definition
  • Identify dimensions/facets
  • Distinguish from related constructs

Step 2: Generate Item Pool

  • Write many more items than needed (2-3x)
  • Cover all dimensions
  • Vary item wording (positive/negative)
  • Use clear, simple language
  • Review existing scales for ideas

Step 3: Expert Review

  • Have experts evaluate content validity
  • Calculate Content Validity Index
  • Revise based on feedback
  • Eliminate poor items

Step 4: Pilot Testing

  • Administer to small sample (30-50)
  • Check for comprehension issues
  • Cognitive interviews
  • Initial item analysis

Step 5: Main Validation Study

  • Collect data from larger sample (N > 200)
  • Exploratory Factor Analysis (EFA)
  • Item analysis (correlations, distributions)
  • Calculate reliability (α)

Step 6: Confirmatory Testing

  • New sample for Confirmatory Factor Analysis
  • Test convergent/discriminant validity
  • Assess criterion validity
  • Test-retest reliability

Step 7: Cross-Validation

  • Test in different populations
  • Test measurement invariance
  • Establish norms if applicable
  • Publish and disseminate

Using Existing Scales

When to Use Existing Scales:

  • Well-validated scale already exists
  • Want to compare with previous research
  • Limited time/resources for development
  • Need established norms

Criteria for Selecting a Scale:

Theoretical fit

Does the conceptual definition match yours?

Validity evidence

Has validity been established? In what populations?

Reliability evidence

What are reported reliability coefficients?

Population appropriateness

Was it validated on similar population?

Practical considerations

Length, language, cost, permissions

Citation count

How widely used? (check Google Scholar)

Where to Find Existing Scales:

  • Published articles: Check methods sections
  • PsycTESTS: APA database of psychological tests
  • RAND Health Care: Free validated health measures
  • Measurement instrument databases: PROMIS, NIH Toolbox
  • Systematic reviews: Reviews of measures in specific areas
  • Original authors: Contact for permissions and materials

Important Considerations

  • Permission: Many scales require permission or purchase
  • Modification: Changes may affect validity—report any modifications
  • Translation: Validated translation required if using in different language
  • Cultural adaptation: May need adaptation for different cultures
  • Re-validation: Consider validating in your specific population

Item Writing Guidelines

DO

  • Keep items short and simple
  • Use clear, concrete language
  • Focus on one idea per item
  • Use active voice
  • Write at appropriate reading level
  • Include both positive and negative items
  • Match items to target population

AVOID

  • Double-barreled items ("I am happy and healthy")
  • Double negatives
  • Jargon or technical terms
  • Leading or loaded language
  • Absolute terms ("always," "never")
  • Hypothetical situations
  • Items all keyed in same direction

Quick Checklist for Your Measure

  • ☐ Construct clearly defined
  • ☐ Items reviewed by experts
  • ☐ Pilot tested for clarity
  • ☐ Reliability calculated and reported
  • ☐ Validity evidence provided
  • ☐ Appropriate for your population
  • ☐ Permissions obtained (if using existing scale)
  • ☐ Any modifications documented

Summary

Module 07 Key Takeaways

What You've Learned

  • Measurement involves operationalizing abstract concepts into measurable variables
  • Four levels of measurement (nominal, ordinal, interval, ratio) determine appropriate statistics
  • Validity asks "Are we measuring what we think we're measuring?" (face, content, construct, criterion)
  • Reliability asks "Are we measuring consistently?" (test-retest, internal consistency, inter-rater)
  • Scale development is rigorous; using existing validated scales is often preferred

Next Steps

In Module 08: Ethics in Research, you'll learn about ethical principles and guidelines for conducting research, including informed consent, IRB approval, protecting participants, and responsible conduct of research.

Practice

Measurement Practice Exercises

Applied Measurement Tasks

  1. Operationalization: Choose a construct (e.g., "student engagement") and:
    • Write a conceptual definition
    • Identify 3-4 dimensions
    • Create 2-3 indicators for each dimension
  2. Level Identification: Classify these variables by level of measurement:
    • Political party affiliation
    • Customer satisfaction (1-10)
    • Response time in seconds
    • Pain severity (none/mild/moderate/severe)
    • Temperature in Kelvin
  3. Scale Selection: Find an existing validated scale for a construct related to your research interest:
    • Document validity and reliability evidence
    • Identify the population it was validated on
    • Assess appropriateness for your study
  4. Item Critique: Identify problems with these items:
    • "I never sometimes feel stressed"
    • "How satisfied are you with your salary and benefits?"
    • "Don't you think exercise is important?"
  5. Reliability Practice: Calculate Cronbach's alpha for a short scale using SPSS, R, or online calculator. Interpret the result.