Topic 1

Data Preparation & Cleaning

Before analyzing data, you must prepare and clean it. Raw data almost always contains errors, inconsistencies, and missing values that need to be addressed. Data preparation is often estimated to take 60-80% of a data analyst's time; it is tedious but essential for valid results.

The Data Analysis Workflow

Collect Data
Prepare & Clean
Explore
Analyze
Report

Setting Up Your Data File

Spreadsheet Structure

ID Age Gender Score_Pre Score_Post Group
001 23 1 45 52 1
002 25 2 38 41 2
003 21 1 52 58 1

Rows = Cases/Participants (one row per person)

Columns = Variables (one column per variable)

Variable Naming Rules

Use short, descriptive names

Age, Score_Pre, GroupID

Start with a letter

Q1, Item_1, Var_A

Use underscores for spaces

Pre_Test, Post_Test

Avoid spaces

Pre Test ✗

Avoid special characters

Score#1, Age@start ✗

Avoid very long names

ParticipantPreTestScoreFirstSession ✗

Creating a Codebook

A codebook documents all variables, their meanings, and how values are coded.

Example Codebook

Variable | Description | Type | Values | Missing
ID | Participant identifier | String | 001-150 | None
Age | Age in years | Numeric | 18-65 | -99
Gender | Self-reported gender | Categorical | 1=Male, 2=Female, 3=Other | -99
Score_Pre | Pre-test score (0-100) | Numeric | 0-100 | -99
Group | Experimental condition | Categorical | 1=Control, 2=Treatment | None

Data Cleaning Checklist

Check for Errors

  • Out-of-range values: Age = 250? Score = -5?

    How: Run frequencies/descriptives and check min/max

  • Impossible combinations: Pregnant male? 10-year-old with PhD?

    How: Cross-tabulate related variables

  • Typos: "Femalee" instead of "Female"

    How: Check unique values for categorical variables

  • Duplicate entries: Same participant entered twice

    How: Check for duplicate IDs
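
The four checks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the records are hypothetical and deliberately contain one problem of each kind, and Gender is stored as text labels here (an assumption made so that the typo check has something to find).

```python
from collections import Counter

# Hypothetical records, invented to trigger each check.
records = [
    {"ID": "001", "Age": 23, "Gender": "Male", "Score_Pre": 45},
    {"ID": "002", "Age": 250, "Gender": "Female", "Score_Pre": 38},
    {"ID": "003", "Age": 21, "Gender": "Femalee", "Score_Pre": -5},
    {"ID": "003", "Age": 21, "Gender": "Female", "Score_Pre": 52},
]

# Valid ranges taken from the example codebook (Age 18-65, Score_Pre 0-100).
VALID_RANGES = {"Age": (18, 65), "Score_Pre": (0, 100)}
VALID_GENDERS = {"Male", "Female", "Other"}


def out_of_range(rows, ranges):
    """Return (ID, variable, value) for every value outside its valid range."""
    problems = []
    for row in rows:
        for var, (low, high) in ranges.items():
            if not low <= row[var] <= high:
                problems.append((row["ID"], var, row[var]))
    return problems


def unexpected_categories(rows, var, allowed):
    """Return category labels that do not appear in the codebook."""
    return {row[var] for row in rows} - allowed


def duplicate_ids(rows):
    """Return IDs that appear more than once."""
    counts = Counter(row["ID"] for row in rows)
    return [id_ for id_, n in counts.items() if n > 1]


range_problems = out_of_range(records, VALID_RANGES)
typos = unexpected_categories(records, "Gender", VALID_GENDERS)
dupes = duplicate_ids(records)

print("Out of range:", range_problems)
print("Unexpected categories:", typos)
print("Duplicate IDs:", dupes)
```

Impossible combinations can be checked the same way, by filtering rows on two variables at once.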

Handle Missing Data

  • Identify missing values: Are they coded consistently?

    Common codes: blank, NA, -99, 999, "missing"

  • Assess patterns: Is data missing randomly or systematically?

    MCAR (completely random), MAR (random given other variables), MNAR (not random)

  • Decide how to handle:
    • Listwise deletion (remove cases with any missing)
    • Pairwise deletion (use available data for each analysis)
    • Mean/mode imputation (replace missing values with the variable's mean or most common value)
    • Multiple imputation (advanced statistical method)
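
As a rough sketch of the first steps, the snippet below recodes missing-value codes to None, then applies listwise deletion and mean imputation. The rows are hypothetical, and treating 999 as a missing code is only safe when 999 cannot be a real value for the variable.

```python
from statistics import fmean

# Missing-value codes listed above; adjust to match your own codebook.
MISSING_CODES = {-99, 999, "", "NA", "missing"}

# Hypothetical rows: (ID, Age, Score_Pre)
rows = [
    ("001", 23, 45),
    ("002", -99, 38),
    ("003", 21, "NA"),
    ("004", 30, 52),
]


def to_none(value):
    """Map any recognized missing-data code to None."""
    return None if value in MISSING_CODES else value


clean = [(id_, to_none(age), to_none(score)) for id_, age, score in rows]

# Identify: how many cases have at least one missing value?
n_incomplete = sum(1 for row in clean if None in row)
pct_incomplete = 100 * n_incomplete / len(clean)

# Listwise deletion: keep only cases with no missing values.
complete_cases = [row for row in clean if None not in row]

# Mean imputation for Age: replace missing ages with the mean of observed ages.
observed_ages = [age for _, age, _ in clean if age is not None]
mean_age = fmean(observed_ages)
imputed_ages = [mean_age if age is None else age for _, age, _ in clean]

print(f"{n_incomplete} of {len(clean)} cases incomplete ({pct_incomplete:.0f}%)")
print("Complete cases:", complete_cases)
print("Ages after mean imputation:", imputed_ages)
```

Mean imputation is shown because it is simple; it shrinks the variable's spread, which is one reason multiple imputation is often preferred.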

Transform Variables

  • Recode: Collapse categories, reverse-score items

    5-point scale: 1→5, 2→4, 3→3, 4→2, 5→1

  • Compute: Create new variables (totals, averages)

    Total_Score = Q1 + Q2 + Q3 + Q4 + Q5

  • Categorize: Convert continuous to categorical

    Age_Group: 18-25=1, 26-35=2, 36-45=3, etc.
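
The three transformations can be sketched as follows. The participant and the choice of Q2 as the reverse-worded item are hypothetical; the age bands beyond 45 are not defined in the text, so the function leaves them undefined rather than guessing.

```python
def reverse_score(value, scale_min=1, scale_max=5):
    """Reverse-score an item: on a 1-5 scale, 1→5, 2→4, 3→3, 4→2, 5→1."""
    return scale_max + scale_min - value


def age_group(age):
    """Categorize age using the bands above: 18-25=1, 26-35=2, 36-45=3."""
    if 18 <= age <= 25:
        return 1
    if 26 <= age <= 35:
        return 2
    if 36 <= age <= 45:
        return 3
    return None  # later bands are elided ("etc.") in the text


# Hypothetical participant; Q2 is assumed to be the reverse-worded item.
participant = {"Age": 29, "Q1": 4, "Q2": 2, "Q3": 5, "Q4": 3, "Q5": 4}

# Recode: reverse-score Q2 into a new variable, keeping the original.
participant["Q2_rev"] = reverse_score(participant["Q2"])

# Compute: total score across the five items, using the reversed Q2.
items = ["Q1", "Q2_rev", "Q3", "Q4", "Q5"]
participant["Total_Score"] = sum(participant[q] for q in items)

# Categorize: convert continuous age to an age group code.
participant["Age_Group"] = age_group(participant["Age"])

print(participant)
```

Writing recoded values to new variables, rather than overwriting the originals, keeps every transformation reversible.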

Handling Outliers

Outliers are data points that are unusually far from other observations. They can be legitimate extreme values or errors.

Detection Methods

Visual Inspection
  • Box plots
  • Scatter plots
  • Histograms
Statistical Rules
  • Z-score > 3 or < -3
  • IQR method: below Q1-1.5×IQR or above Q3+1.5×IQR
  • Mahalanobis distance (multivariate)
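
The two univariate rules can be sketched with Python's statistics module. The data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other software may compute quartiles slightly differently.

```python
from statistics import mean, quantiles, stdev


def z_score_outliers(values, threshold=3.0):
    """Return values whose z-score is above +threshold or below -threshold."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs((x - m) / s) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical scores with one suspicious value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]

z_flagged = z_score_outliers(data)
iqr_flagged = iqr_outliers(data)

print("Flagged by z-score rule:", z_flagged)
print("Flagged by IQR rule:", iqr_flagged)
```

In this sample the IQR rule flags 60 but the z-score rule flags nothing: the outlier inflates the standard deviation, and with only 10 values a sample z-score cannot exceed about 2.85. This is one reason to combine statistical rules with visual inspection.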

What to Do with Outliers

Investigate

Is it a data entry error? Can you verify the true value?

Keep

If legitimate, keep it—extreme values are real data

Winsorize

Replace extreme values with less extreme ones (e.g., 95th percentile)

Remove

If error or truly anomalous, remove with documentation

Transform

Log transformation can reduce impact of extreme values
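
Winsorizing and log transformation can be sketched as below. This is one possible implementation, not a standard routine: the 5th and 95th percentile cutoffs and the "inclusive" percentile method are assumptions, and the data are hypothetical.

```python
import math
from statistics import quantiles


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles."""
    # quantiles(n=100) returns the 1st through 99th percentiles.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, low), high) for x in values]


# Hypothetical scores with one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]

winsorized = winsorize(data)

# Log transformation requires strictly positive values.
logged = [math.log(x) for x in data]

print("Winsorized:", winsorized)
print("Log-transformed:", [round(v, 2) for v in logged])
```

Both approaches keep every case in the dataset; whichever is used, record it in the cleaning log.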

Document Everything!

Keep a log of all data cleaning decisions:

  • What changes were made
  • Why each change was made
  • How many cases were affected

Also keep the original raw data file untouched and apply all changes to a copy.

This ensures transparency and reproducibility.

Topic 2

Measures of Central Tendency

Measures of central tendency describe the "typical" or "average" value in a dataset. The three main measures are mean, median, and mode. Each has strengths and is appropriate for different types of data and distributions.

The Three Measures

Mean (Average)

x̄ or M

Sum of all values divided by number of values

Mean = Σx / n

Sum of all values ÷ Number of values

Example: Scores: 12, 15, 18, 20, 25

Mean = (12 + 15 + 18 + 20 + 25) / 5

Mean = 90 / 5 = 18

Strengths
  • Uses all data points
  • Most common measure
  • Basis for many statistics
Limitations
  • Sensitive to outliers
  • Can be misleading for skewed data
  • Only for interval/ratio data

Use when: Data is interval/ratio and approximately normally distributed

Median

Mdn

The middle value when data is ordered from lowest to highest

Odd n: Middle value

Even n: Average of two middle values

Example (odd): 12, 15, 18, 20, 25

Median = 18 (middle value)

Example (even): 12, 15, 18, 20, 25, 30

Median = (18 + 20) / 2 = 19

Strengths
  • Not affected by outliers
  • Good for skewed data
  • Works for ordinal data
Limitations
  • Ignores the magnitude of most values (uses only their order)
  • Less common in statistics
  • Harder to use in formulas

Use when: Data is skewed, contains outliers, or is ordinal

Mode

Mo

The most frequently occurring value

Value that appears most often

Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)

Example: 12, 15, 15, 15, 18, 20, 25

Mode = 15 (appears 3 times)

Bimodal: 10, 10, 15, 20, 20, 25

Modes = 10 and 20

Strengths
  • Works for all data types
  • Only option for nominal data
  • Easy to understand
Limitations
  • May not exist or may not be unique
  • Ignores other values
  • Less useful for continuous data

Use when: Data is nominal/categorical, or you want the most common response
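
The worked examples for all three measures can be reproduced with Python's statistics module. This is a small sketch; `multimode` is used because it returns every mode, which covers the bimodal case.

```python
from statistics import mean, median, multimode

scores = [12, 15, 18, 20, 25]

mean_score = mean(scores)                          # 90 / 5 = 18
median_odd = median(scores)                        # middle value = 18
median_even = median([12, 15, 18, 20, 25, 30])     # (18 + 20) / 2 = 19.0
one_mode = multimode([12, 15, 15, 15, 18, 20, 25])  # [15]
two_modes = multimode([10, 10, 15, 20, 20, 25])     # [10, 20]

print(mean_score, median_odd, median_even, one_mode, two_modes)
```

When no value repeats, `multimode` returns every value, which corresponds to the "no mode" case.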

Choosing the Right Measure

Data Level | Best Measure | Example
Nominal | Mode only | Most common major: "Psychology"
Ordinal | Median (or mode) | Median satisfaction: "Satisfied"
Interval/Ratio (symmetric) | Mean | Mean test score: 78.5
Interval/Ratio (skewed) | Median | Median income: $52,000

Effect of Skewness

Left (Negative) Skew

Typically: Mean < Median < Mode

Example: Easy test scores (most high, few low)

Normal (Symmetric)

Mean ≈ Median ≈ Mode

Example: Height in a population

Right (Positive) Skew

Typically: Mode < Median < Mean

Example: Income (most low/moderate, few very high)

Real-World Example: Income

Why do we report median household income, not mean?

  • U.S. Median household income: ~$70,000
  • U.S. Mean household income: ~$95,000

The mean is pulled up by a relatively small number of very high-income households. The median better represents the "typical" household because income is right-skewed.

Topic 3

Measures of Variability

Measures of variability (or dispersion) describe how spread out the data is. Two datasets can have the same mean but very different spreads. Understanding variability is essential for interpreting data and making statistical inferences.

Why Variability Matters

Class A

70, 72, 74, 76, 78

Mean = 74

Consistent performance—all students similar

Class B

50, 60, 74, 88, 98

Mean = 74

High variability—students very different

Same mean, very different stories! We need measures of spread to understand the full picture.
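
A quick check of the two classes in Python shows the same mean with very different spread. The `summarize` helper is a hypothetical name used only for this sketch.

```python
from statistics import mean, stdev

class_a = [70, 72, 74, 76, 78]
class_b = [50, 60, 74, 88, 98]


def summarize(scores):
    """Return the mean, range, and sample standard deviation of a list."""
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "sd": stdev(scores),
    }


summary_a = summarize(class_a)
summary_b = summarize(class_b)

print("Class A:", summary_a)
print("Class B:", summary_b)
```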

Range

Difference between highest and lowest values

Range = Maximum - Minimum

Example: Scores: 45, 52, 67, 73, 89

Range = 89 - 45 = 44

Pro: Easy to calculate and understand

Con: Based on only 2 values; very sensitive to outliers

Interquartile Range (IQR)

Range of the middle 50% of the data

IQR = Q3 - Q1

Q1 = 25th percentile, Q3 = 75th percentile

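[Figure: number line marking Min, Q1, Median (Q2), Q3, and Max. Each of the four segments holds 25% of the data; the IQR spans the middle 50%, from Q1 to Q3.]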

Pro: Not affected by outliers; good for skewed data

Con: Ignores the lowest 25% and highest 25% of the data
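
The range and IQR can be computed as in the sketch below, using the scores from the range example. Quartiles have several accepted definitions, so the snippet shows both methods offered by `statistics.quantiles`; a hand calculation or another package may match either one.

```python
from statistics import quantiles

scores = [45, 52, 67, 73, 89]

value_range = max(scores) - min(scores)


def iqr(values, method="exclusive"):
    """Return Q3 - Q1 using the chosen quartile method."""
    q1, _, q3 = quantiles(values, n=4, method=method)
    return q3 - q1


iqr_exclusive = iqr(scores, method="exclusive")
iqr_inclusive = iqr(scores, method="inclusive")

print("Range:", value_range)
print("IQR (exclusive method):", iqr_exclusive)
print("IQR (inclusive method):", iqr_inclusive)
```

The two methods disagree most in small samples like this one, so it helps to state which method was used when reporting an IQR.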

Variance and Standard Deviation

Variance (s²)

Average of squared deviations from the mean

s² = Σ(x - x̄)² / (n - 1)

For sample variance; population uses n instead of n-1

Variance is in squared units (e.g., years²), which is hard to interpret directly.

Standard Deviation (s or SD)

Square root of the variance; roughly the typical distance of values from the mean

s = √[Σ(x - x̄)² / (n - 1)]

In original units—much more interpretable than variance.

Calculating Standard Deviation: Step by Step

Data: 4, 8, 6, 5, 7

x | x - x̄ | (x - x̄)²
4 | 4 - 6 = -2 | 4
8 | 8 - 6 = 2 | 4
6 | 6 - 6 = 0 | 0
5 | 5 - 6 = -1 | 1
7 | 7 - 6 = 1 | 1
Σ = 30 | Σ = 0 | Σ = 10

Step 1: Calculate mean: x̄ = 30/5 = 6

Step 2: Calculate deviations (x - x̄)

Step 3: Square each deviation

Step 4: Sum squared deviations: Σ(x - x̄)² = 10

Step 5: Divide by n-1: s² = 10/4 = 2.5 (variance)

Step 6: Take square root: s = √2.5 = 1.58 (SD)
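
The six steps translate directly into Python. This sketch mirrors the hand calculation and then compares the result with the built-in `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

data = [4, 8, 6, 5, 7]
n = len(data)

mean = sum(data) / n                      # Step 1: mean
deviations = [x - mean for x in data]     # Step 2: deviations from the mean
squared = [d ** 2 for d in deviations]    # Step 3: squared deviations
sum_sq = sum(squared)                     # Step 4: sum of squared deviations
variance = sum_sq / (n - 1)               # Step 5: sample variance
sd = sqrt(variance)                       # Step 6: standard deviation

print(f"mean={mean}, sum of squares={sum_sq}, variance={variance}, SD={sd:.2f}")
print("Matches statistics.stdev:", abs(sd - stdev(data)) < 1e-12)
```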

Interpreting Standard Deviation

The Empirical Rule (68-95-99.7 Rule)

For normally distributed data:

  • About 68% of data falls within ±1 SD of the mean
  • About 95% of data falls within ±2 SD of the mean
  • About 99.7% of data falls within ±3 SD of the mean

Example Application

If test scores have Mean = 75, SD = 10:

  • 68% of students score between 65 and 85
  • 95% of students score between 55 and 95
  • 99.7% of students score between 45 and 105
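
The intervals in this example, and the percentages behind the rule, can be checked with `statistics.NormalDist`. The sketch assumes the scores follow an exact normal distribution, which real test scores only approximate.

```python
from statistics import NormalDist

MEAN, SD = 75, 10
scores = NormalDist(mu=MEAN, sigma=SD)


def sd_interval(mean, sd, k):
    """Return the interval mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


intervals = {k: sd_interval(MEAN, SD, k) for k in (1, 2, 3)}

# Share of a normal distribution that falls inside each interval.
coverage = {
    k: scores.cdf(high) - scores.cdf(low)
    for k, (low, high) in intervals.items()
}

for k in (1, 2, 3):
    low, high = intervals[k]
    print(f"±{k} SD: {low} to {high} covers {coverage[k]:.1%}")
```

The upper bound of 105 exceeds a 100-point maximum, a reminder that the normal model is only an approximation for bounded scores.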

Summary: Choosing a Measure of Variability

Measure | Best For | Pair With
Range | Quick overview; ordinal data | Any central tendency
IQR | Skewed data; outliers present | Median
Standard Deviation | Normally distributed interval/ratio data | Mean

Report Both Central Tendency and Variability!

Always report a measure of center AND spread together:

  • "Mean score was 75.4 (SD = 12.3)"
  • "Median income was $52,000 (IQR = $28,000-$78,000)"

One without the other tells an incomplete story.

Topic 4

Data Visualization Basics

Visualizations help you explore data, identify patterns, and communicate findings. The right chart makes data accessible; the wrong chart misleads. This topic covers essential chart types and when to use each.

Charts for Categorical Data

Bar Chart

Use for: Comparing counts or values across categories

Example: Number of students in each major

Tips:
  • Start y-axis at zero
  • Order bars meaningfully (by size or logically)
  • Use horizontal bars for long category names

Pie Chart

Use for: Showing parts of a whole (percentages)

Example: Percentage of budget by category

Tips:
  • Limit to 5-7 slices maximum
  • Include percentages as labels
  • Consider bar chart instead for precise comparisons

Charts for Continuous Data

Histogram

Use for: Showing distribution of continuous variable

Example: Distribution of test scores

Tips:
  • Bars touch (continuous data)
  • Shows shape: symmetric, skewed, bimodal
  • Experiment with bin width
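
To see why bin width matters, the sketch below counts the same hypothetical test scores into bins of width 10 and width 20 and prints a text histogram. It uses only the standard library; a plotting package would draw the same counts as touching bars.

```python
from collections import Counter


def bin_counts(values, bin_width, start=0):
    """Count values per bin; each bin covers [lower, lower + bin_width)."""
    counts = Counter(
        start + ((v - start) // bin_width) * bin_width for v in values
    )
    return dict(sorted(counts.items()))


# Hypothetical test scores.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 74, 75, 77, 78, 81, 83, 85, 88, 92]

narrow = bin_counts(scores, bin_width=10, start=50)
wide = bin_counts(scores, bin_width=20, start=50)

for label, counts, width in (("Width 10", narrow, 10), ("Width 20", wide, 20)):
    print(label)
    for lower, count in counts.items():
        print(f"  {lower}-{lower + width - 1}: {'#' * count}")
```

The narrower bins show a peak in the 70s with a roughly symmetric shape; the wider bins hide most of that detail.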

Box Plot (Box-and-Whisker)

Use for: Showing distribution with quartiles; comparing groups

Example: Comparing test scores across classes

Anatomy:
  • Box: Middle 50% (IQR)
  • Line in box: Median
  • Whiskers: Extend to the min/max, or to the most extreme points within 1.5×IQR of the box
  • Dots: Outliers beyond whiskers

Charts for Relationships

Scatter Plot

Use for: Showing relationship between two continuous variables

Example: Study hours vs. test score

Look for:
  • Direction: Positive, negative, or no relationship
  • Strength: Tight cluster vs. wide spread
  • Form: Linear or curved
  • Outliers: Points far from pattern

Line Graph

Use for: Showing change over time (trends)

Example: Monthly sales over a year

Tips:
  • X-axis should be time
  • Connect points only when continuity makes sense
  • Can show multiple lines for comparison

Choosing the Right Chart

Your Goal | Data Type | Best Chart
Compare categories | Categorical | Bar chart
Show parts of whole | Categorical (percentages) | Pie chart (or stacked bar)
Show distribution | Continuous | Histogram
Compare distributions | Continuous by group | Box plot
Show relationship | Two continuous | Scatter plot
Show trend over time | Continuous + time | Line graph

Common Visualization Mistakes

Truncated Y-Axis

Not starting at zero exaggerates differences

Fix: Start at zero or clearly indicate break

3D Effects

3D charts distort perception and add no information

Fix: Use 2D charts—cleaner and more accurate

Too Many Categories

Pie charts with 15 slices are unreadable

Fix: Combine small categories into "Other"

Missing Labels

Charts without axis labels or titles are meaningless

Fix: Always label axes, include title and units

Dual Y-Axes

Two y-axes can be manipulated to show false relationships

Fix: Use separate charts or normalize data

Wrong Chart Type

Using pie chart for non-percentage data, line for categories

Fix: Match chart to data and purpose

Good Visualization Principles

  • Clarity: The message should be immediately clear
  • Accuracy: Data should be represented truthfully
  • Efficiency: Maximize the data-ink ratio (no chartjunk)
  • Accessibility: Consider colorblind-friendly palettes
  • Context: Provide enough information to interpret

Topic 5

Introduction to Statistical Inference

Statistical inference allows us to draw conclusions about populations based on sample data. This topic introduces the foundational concepts: populations vs. samples, sampling distributions, hypothesis testing logic, and p-values.

Populations and Samples

Population

The entire group you want to study

Parameters: Characteristics of the population

Symbols: μ (mean), σ (SD), ρ (correlation)

Example: All university students in Bangladesh

Sample

A subset of the population you actually study

Statistics: Characteristics of the sample (estimates of parameters)

Symbols: x̄ (mean), s (SD), r (correlation)

Example: 500 students from 5 universities

We use sample statistics to estimate population parameters. This is statistical inference.

Sampling Error and Sampling Distribution

Sampling Error: The difference between a sample statistic and the population parameter. It occurs because samples are only part of the population.

The Sampling Distribution

Imagine taking many samples from the same population and calculating the mean of each:

Population

μ = 100

Sample 1: x̄ = 98
Sample 2: x̄ = 103
Sample 3: x̄ = 101
Sample 4: x̄ = 97
Sample 5: x̄ = 102
...

Sampling Distribution of Means

Centered at μ = 100

The distribution of all these sample means is called the sampling distribution.

Central Limit Theorem

Regardless of the population distribution, the sampling distribution of the mean:

  • Approaches a normal distribution as sample size increases
  • Has mean equal to the population mean (μ)
  • Has standard deviation (standard error) = σ/√n

Why it matters: This lets us use the normal distribution for inference even when the population isn't normal, provided n is large enough (a common rule of thumb is n ≥ 30).
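
A small simulation can make the theorem concrete. The sketch below draws repeated samples from a strongly right-skewed population (an exponential distribution with μ = 1 and σ = 1, chosen here purely for illustration) and checks that the sample means center on μ with a spread close to σ/√n. The seed, sample size, and number of samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(42)  # fixed seed so the run is repeatable

N_SAMPLES = 2000   # number of repeated samples
SAMPLE_SIZE = 30   # n, the size of each sample
POP_MEAN = 1.0     # mean of an exponential distribution with rate 1
POP_SD = 1.0       # SD of an exponential distribution with rate 1


def sample_mean(n):
    """Draw n values from the skewed population and return their mean."""
    return fmean([random.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

mean_of_means = fmean(sample_means)
se_observed = stdev(sample_means)
se_theory = POP_SD / sqrt(SAMPLE_SIZE)

print(f"Mean of sample means: {mean_of_means:.3f} (population mean = {POP_MEAN})")
print(f"SD of sample means:   {se_observed:.3f} (σ/√n = {se_theory:.3f})")
```

Plotting `sample_means` as a histogram would show a roughly bell-shaped distribution even though the population itself is skewed.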

Standard Error

Standard Error (SE): The standard deviation of the sampling distribution. It measures how much sample statistics vary from sample to sample.

SE = s / √n

Sample standard deviation (used as an estimate of σ) divided by the square root of the sample size

Key Implications:

  • Larger sample → Smaller SE → More precise estimate
  • To halve the SE, you need to quadruple the sample size
  • SE is used to construct confidence intervals and calculate test statistics
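
The effect of sample size on the standard error is easy to verify. The sketch below assumes SD = 10 for illustration and compares n = 100 with n = 400.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / sqrt(n)


se_100 = standard_error(10, 100)
se_400 = standard_error(10, 400)  # four times the sample size

print(f"n=100: SE = {se_100}")
print(f"n=400: SE = {se_400}")
```

Quadrupling the sample from 100 to 400 cuts the SE in half, which is why extra precision becomes increasingly expensive.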

The Logic of Hypothesis Testing

Step 1: State Hypotheses

Null Hypothesis (H₀)

No effect, no difference, no relationship

"There is no difference in scores between groups"

Alternative Hypothesis (H₁ or Hₐ)

There IS an effect, difference, or relationship

"There is a difference in scores between groups"

Step 2: Set Significance Level (α)

The threshold for deciding if results are "statistically significant"

Typically α = .05 (5% chance of rejecting H₀ when it's actually true)

Step 3: Collect Data & Calculate Test Statistic

Conduct study and calculate appropriate statistic (t, F, χ², etc.)

Step 4: Find the p-value

Probability of getting results this extreme (or more) IF the null hypothesis were true

Step 5: Make Decision

If p ≤ α: Reject H₀

Result is statistically significant

If p > α: Fail to reject H₀

Result is not statistically significant
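
The five steps can be walked through in code. The sketch below uses an exact permutation test rather than a t-test, a deliberate substitution: it needs only the standard library, and it computes the p-value directly from its definition by asking how often a difference at least this large arises when group labels are assigned at random, as H₀ implies. The scores are hypothetical.

```python
from itertools import combinations
from statistics import fmean

# Step 1: H0 = no difference in mean scores; H1 = there is a difference.
# Step 2: set the significance level.
ALPHA = 0.05

# Step 3: hypothetical scores under two teaching methods.
method_a = [78, 85, 90, 72, 88]
method_b = [70, 65, 80, 68, 75]


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    observed = abs(fmean(group_a) - fmean(group_b))
    extreme = total = 0
    # Try every way of relabeling the pooled scores into two groups.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(fmean(a) - fmean(b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Step 4: find the p-value.
p_value = permutation_p_value(method_a, method_b)

# Step 5: make the decision.
decision = "reject H0" if p_value <= ALPHA else "fail to reject H0"
print(f"p = {p_value:.4f}: {decision}")
```

Enumerating every relabeling is practical only for small groups; larger studies typically use a random subset of relabelings or a parametric test.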

Understanding P-Values

P-value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.

What P-Value IS:

  • Probability of the data (or more extreme) given H₀ is true
  • A measure of evidence against the null hypothesis
  • Smaller p-value = stronger evidence against H₀

What P-Value is NOT:

  • NOT the probability that H₀ is true
  • NOT the probability that results are due to chance
  • NOT a measure of effect size or importance

Interpreting P-Values

p > .10 Little to no evidence against H₀
p = .05-.10 Weak evidence (sometimes called "marginally significant")
p = .01-.05 Moderate evidence against H₀
p = .001-.01 Strong evidence against H₀
p < .001 Very strong evidence against H₀

Types of Errors

Decision | H₀ is Actually True | H₀ is Actually False
Reject H₀ | Type I Error (false positive); probability = α | Correct (true positive); power = 1 - β
Fail to Reject H₀ | Correct (true negative) | Type II Error (false negative); probability = β

Type I Error Example:

Concluding a new drug works when it actually doesn't (false alarm)

Type II Error Example:

Concluding a drug doesn't work when it actually does (missed opportunity)

Statistical Significance ≠ Practical Importance

A statistically significant result (p < .05) doesn't mean the effect is:

  • Large: With big samples, tiny effects can be significant
  • Important: A 0.5 point difference on a 100-point scale may be significant but trivial
  • Real-world meaningful: Always report and interpret effect sizes!
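
To illustrate how sample size drives significance, the sketch below computes the p-value for the same 0.5-point difference at two sample sizes. It uses a two-sample z-test with a known SD of 10 and equal group sizes; these are simplifying assumptions for illustration, and a real analysis would usually use a t-test.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with known, equal SDs."""
    se = sd * sqrt(2 / n_per_group)   # SE of the difference between two means
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


DIFF, SD = 0.5, 10                    # 0.5 points on a 100-point scale
standardized_diff = DIFF / SD         # difference in SD units

p_small = two_sided_p(DIFF, SD, n_per_group=100)
p_large = two_sided_p(DIFF, SD, n_per_group=10_000)

print(f"Difference in SD units: {standardized_diff}")
print(f"n = 100 per group:    p = {p_small:.3f}")
print(f"n = 10,000 per group: p = {p_large:.5f}")
```

The difference is the same tiny 0.05 SD in both cases; only the sample size changes, yet the larger study yields a statistically significant result.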

Summary

Module 09 Key Takeaways

What You've Learned

  • Data preparation (cleaning, coding, handling missing data) is essential before analysis
  • Mean, median, and mode describe central tendency; choose based on data type and distribution
  • Range, IQR, and standard deviation describe variability; always report with central tendency
  • Choose visualizations based on data type and purpose; avoid common chart mistakes
  • Statistical inference uses sample statistics to estimate population parameters; p-values indicate evidence against null hypothesis

Next Steps

In Module 10: Quantitative Analysis, you'll learn about specific statistical tests including t-tests, ANOVA, correlation, and regression—when to use each and how to interpret results.

Practice

Data Analysis Practice Exercises

Applied Analysis Tasks

  1. Data Cleaning: Given a messy dataset, identify and document:
    • Missing values (how many? what percentage?)
    • Out-of-range values
    • Potential outliers
    • Your plan for handling each issue
  2. Descriptive Statistics: Calculate by hand:
    • Mean, median, and mode for: 5, 8, 10, 12, 12, 15, 18, 45
    • Range, IQR, and standard deviation for the same data
    • Explain which measures are most appropriate and why
  3. Chart Selection: For each scenario, select the best chart type:
    • Comparing sales across 4 product categories
    • Showing the relationship between age and income
    • Displaying how student grades are distributed
    • Tracking stock price over 12 months
  4. Hypothesis Testing: For a study comparing two teaching methods:
    • State the null and alternative hypotheses
    • If p = .03, what is your conclusion at α = .05?
    • What type of error could you be making?