Module 09: Introduction to Data Analysis

Topic 1

Data Preparation & Cleaning

Before analyzing data, you must prepare and clean it. Raw data almost always contains errors, inconsistencies, and missing values that need to be addressed. Data preparation is commonly estimated to take 60-80% of a data analyst's time. It is tedious but essential for valid results.

The Data Analysis Workflow

Collect Data → Prepare & Clean → Explore → Analyze → Report

Setting Up Your Data File

Spreadsheet Structure

ID	Age	Gender	Score_Pre	Score_Post	Group
001	23	1	45	52	1
002	25	2	38	41	2
003	21	1	52	58	1

Rows = Cases/Participants (one row per person)

Columns = Variables (one column per variable)

Variable Naming Rules

Use short, descriptive names

Age, Score_Pre, GroupID

Start with a letter

Q1, Item_1, Var_A

Use underscores for spaces

Pre_Test, Post_Test

Avoid spaces

Pre Test ✗

Avoid special characters

Score#1, Age@start ✗

Avoid very long names

ParticipantPreTestScoreFirstSession ✗
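
The naming rules above can be checked mechanically. The following is a minimal sketch, not a standard: the function name `is_valid_name` and the 20-character cutoff for "very long" are assumptions made for illustration, and your statistics package may impose its own limits.

```python
import re

# Assumed cutoff for "very long"; the rules above do not give a number.
MAX_LEN = 20
# Starts with a letter, then only letters, digits, or underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_name(name: str) -> bool:
    """Return True if the name follows the naming rules sketched above."""
    return len(name) <= MAX_LEN and NAME_PATTERN.fullmatch(name) is not None

for candidate in ["Score_Pre", "Pre Test", "Score#1",
                  "ParticipantPreTestScoreFirstSession"]:
    print(candidate, is_valid_name(candidate))
```

Only `Score_Pre` passes; the other three fail on a space, a special character, and length respectively.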

Creating a Codebook

A codebook documents all variables, their meanings, and how values are coded.

Example Codebook

Variable	Description	Type	Values	Missing
ID	Participant identifier	String	001-150	None
Age	Age in years	Numeric	18-65	-99
Gender	Self-reported gender	Categorical	1=Male, 2=Female, 3=Other	-99
Score_Pre	Pre-test score (0-100)	Numeric	0-100	-99
Group	Experimental condition	Categorical	1=Control, 2=Treatment	None

Data Cleaning Checklist

Check for Errors

Out-of-range values: Age = 250? Score = -5?
How: Run frequencies/descriptives and check min/max
Impossible combinations: Pregnant male? 10-year-old with PhD?
How: Cross-tabulate related variables
Typos: "Femalee" instead of "Female"
How: Check unique values for categorical variables
Duplicate entries: Same participant entered twice
How: Check for duplicate IDs
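
Three of the checks above (out-of-range values, typos, duplicates) can be sketched in a few lines. The records below are made up for illustration, and gender is stored as text labels here (rather than the numeric codes in the codebook) so that the typo check has something to find.

```python
from collections import Counter

# Made-up records for illustration only.
rows = [
    {"ID": "001", "Age": 23, "Gender": "Female"},
    {"ID": "002", "Age": 250, "Gender": "Male"},     # out-of-range age
    {"ID": "003", "Age": 21, "Gender": "Femalee"},   # typo
    {"ID": "002", "Age": 25, "Gender": "Male"},      # duplicate ID
]

AGE_RANGE = (18, 65)                        # valid range from the codebook
VALID_GENDERS = {"Male", "Female", "Other"}

# Out-of-range values: compare each value with the documented min/max.
bad_age_ids = [r["ID"] for r in rows
               if not AGE_RANGE[0] <= r["Age"] <= AGE_RANGE[1]]

# Typos: unique category labels that are not in the codebook.
unexpected_labels = sorted({r["Gender"] for r in rows} - VALID_GENDERS)

# Duplicates: any ID that occurs more than once.
id_counts = Counter(r["ID"] for r in rows)
duplicate_ids = [pid for pid, count in id_counts.items() if count > 1]

print(bad_age_ids)        # ['002']
print(unexpected_labels)  # ['Femalee']
print(duplicate_ids)      # ['002']
```

Impossible combinations need a cross-tabulation of two variables and are not shown in this sketch.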

Handle Missing Data

Identify missing values: Are they coded consistently?
Common codes: blank, NA, -99, 999, "missing"
Assess patterns: Is data missing randomly or systematically?
MCAR (completely random), MAR (random given other variables), MNAR (not random)
Decide how to handle:
- Listwise deletion (remove cases with any missing)
- Pairwise deletion (use available data for each analysis)
- Mean/mode imputation (replace missing values with the variable's mean, or with its mode for categorical data)
- Multiple imputation (advanced statistical method)
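
As a rough illustration of the first steps above, the sketch below recodes several missing-value codes to a single marker (`None`), then applies listwise deletion and mean imputation. The data are invented, and the helper name `clean` is hypothetical. Pairwise deletion and multiple imputation are not shown.

```python
from statistics import mean

MISSING_CODES = {"", "NA", -99, 999, "missing"}  # the common codes listed above

# Made-up data for five participants.
rows = [
    {"ID": "001", "Age": 23, "Score_Pre": 45},
    {"ID": "002", "Age": -99, "Score_Pre": 38},
    {"ID": "003", "Age": 21, "Score_Pre": "NA"},
    {"ID": "004", "Age": 30, "Score_Pre": 52},
    {"ID": "005", "Age": 26, "Score_Pre": 41},
]

def clean(row):
    """Replace every missing-value code with None so missingness is coded one way."""
    return {k: (None if v in MISSING_CODES else v) for k, v in row.items()}

cleaned = [clean(r) for r in rows]

# Listwise deletion: keep only cases with no missing values at all.
complete_cases = [r for r in cleaned if None not in r.values()]

# Mean imputation for Age: fill missing ages with the mean of the observed ages.
observed_ages = [r["Age"] for r in cleaned if r["Age"] is not None]
age_mean = mean(observed_ages)
imputed = [{**r, "Age": age_mean if r["Age"] is None else r["Age"]}
           for r in cleaned]

print(len(complete_cases))  # 3
print(age_mean)             # 25
```

Treating 999 as a missing code is only safe when 999 cannot be a real value for that variable, which is one reason to record the codes in the codebook.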

Transform Variables

Recode: Collapse categories, reverse-score items
5-point scale: 1→5, 2→4, 3→3, 4→2, 5→1
Compute: Create new variables (totals, averages)
Total_Score = Q1 + Q2 + Q3 + Q4 + Q5
Categorize: Convert continuous to categorical
Age_Group: 18-25=1, 26-35=2, 36-45=3, etc.
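
The three transformations above might look like this in code. This is a sketch under stated assumptions: the function names are hypothetical, Q2 is assumed to be the reverse-scored item, and ages of 46 and over are lumped into a single group 4 because the original list ends with "etc."

```python
def reverse_score(value: int, scale_min: int = 1, scale_max: int = 5) -> int:
    """Reverse-score an item: on a 1-5 scale, 1->5, 2->4, 3->3, 4->2, 5->1."""
    return scale_max + scale_min - value

def age_group(age: int) -> int:
    """Categorize age using the bands above (18-25=1, 26-35=2, 36-45=3)."""
    if age < 18:
        raise ValueError("age is below the documented range")
    if age <= 25:
        return 1
    if age <= 35:
        return 2
    if age <= 45:
        return 3
    return 4  # assumption: one extra group for 46 and older

# Made-up responses for one participant.
participant = {"Q1": 4, "Q2": 2, "Q3": 5, "Q4": 3, "Q5": 1, "Age": 29}

# Recode: assume Q2 is negatively worded and must be reversed before summing.
participant["Q2"] = reverse_score(participant["Q2"])

# Compute: Total_Score = Q1 + Q2 + Q3 + Q4 + Q5
total_score = sum(participant[f"Q{i}"] for i in range(1, 6))

print(total_score)                    # 17
print(age_group(participant["Age"]))  # 2
```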

Handling Outliers

Outliers are data points that are unusually far from other observations. They can be legitimate extreme values or errors.

Detection Methods

Visual Inspection

Box plots
Scatter plots
Histograms

Statistical Rules

Z-score > 3 or < -3
IQR method: below Q1-1.5×IQR or above Q3+1.5×IQR
Mahalanobis distance (multivariate)
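
The two univariate rules above can be compared on the same data. The scores below are made up. Quartiles come from Python's `statistics.quantiles` with its default method; other software may compute quartiles slightly differently, so the fences can shift a little.

```python
from statistics import mean, stdev, quantiles

scores = [52, 55, 58, 60, 61, 63, 64, 66, 68, 120]  # made-up data

# Z-score rule: flag values more than 3 SDs from the mean.
m, s = mean(scores), stdev(scores)
z_outliers = [x for x in scores if abs((x - m) / s) > 3]

# IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, _, q3 = quantiles(scores, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in scores if x < low_fence or x > high_fence]

print(z_outliers)    # []
print(iqr_outliers)  # [120]
```

The z-score rule flags nothing here because the extreme value inflates the SD it is measured against. With only 10 observations no z-score can exceed about 2.85, so the rule is better suited to larger samples.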

What to Do with Outliers

Investigate

Is it a data entry error? Can you verify the true value?

Keep

If legitimate, keep it—extreme values are real data

Winsorize

Replace extreme values with less extreme ones (e.g., cap values above the 95th percentile at the 95th percentile)

Remove

If it is an error or truly anomalous, remove it and document the removal

Transform

Log transformation can reduce impact of extreme values
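
Two of the options above, winsorizing and log transformation, can be sketched as follows. The data are made up, and the choice of the 5th and 95th percentiles as caps is an assumption for illustration. The `inclusive` method of `statistics.quantiles` is used so that the percentiles stay inside the observed range for a sample this small.

```python
import math
from statistics import quantiles

scores = [52, 55, 58, 60, 61, 63, 64, 66, 68, 120]  # made-up data

# 19 cut points at the 5th, 10th, ..., 95th percentiles.
cuts = quantiles(scores, n=20, method="inclusive")
p5, p95 = cuts[0], cuts[-1]

# Winsorize: pull values outside [p5, p95] back to the nearest cap.
winsorized = [min(max(x, p5), p95) for x in scores]

# Log transformation: compresses large values more than small ones.
logged = [math.log10(x) for x in scores]

print(max(scores), max(winsorized))
print(round(logged[-1] - logged[0], 3))
```

Winsorizing at both ends also raises the smallest value slightly, so record the caps you used in your cleaning log.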

Document Everything!

Keep a log of all data cleaning decisions:

What changes were made
Why each change was made
How many cases were affected
Keep original raw data file untouched

This ensures transparency and reproducibility.

Topic 2

Measures of Central Tendency

Measures of central tendency describe the "typical" or "average" value in a dataset. The three main measures are mean, median, and mode. Each has strengths and is appropriate for different types of data and distributions.

The Three Measures

Mean (Average)

x̄ or M

Sum of all values divided by number of values

Mean = Σx / n

Sum of all values ÷ Number of values

Example: Scores: 12, 15, 18, 20, 25

Mean = (12 + 15 + 18 + 20 + 25) / 5

Mean = 90 / 5 = 18

Strengths

Uses all data points
Most common measure
Basis for many statistics

Limitations

Sensitive to outliers
Can be misleading for skewed data
Only for interval/ratio data

Use when: Data is interval/ratio and approximately normally distributed

Median

Mdn

The middle value when data is ordered from lowest to highest

Odd n: Middle value

Even n: Average of two middle values

Example (odd): 12, 15, 18, 20, 25

Median = 18 (middle value)

Example (even): 12, 15, 18, 20, 25, 30

Median = (18 + 20) / 2 = 19

Strengths

Resistant to outliers
Good for skewed data
Works for ordinal data

Limitations

Uses only the order of the values, not their magnitudes
Used less often in inferential statistics
Harder to work with in formulas

Use when: Data is skewed, contains outliers, or is ordinal

Mode

Mo

The most frequently occurring value

Value that appears most often

Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)

Example: 12, 15, 15, 15, 18, 20, 25

Mode = 15 (appears 3 times)

Bimodal: 10, 10, 15, 20, 20, 25

Modes = 10 and 20

Strengths

Works for all data types
Only option for nominal data
Easy to understand

Limitations

May not exist or may not be unique
Ignores other values
Less useful for continuous data

Use when: Data is nominal/categorical, or you want the most common response
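
The worked examples for all three measures can be reproduced with Python's standard `statistics` module. This is only a convenience check of the arithmetic above; `multimode` (Python 3.8+) returns every mode as a list.

```python
from statistics import mean, median, multimode

mean_scores = mean([12, 15, 18, 20, 25])
median_odd = median([12, 15, 18, 20, 25])
median_even = median([12, 15, 18, 20, 25, 30])
one_mode = multimode([12, 15, 15, 15, 18, 20, 25])
two_modes = multimode([10, 10, 15, 20, 20, 25])

print(mean_scores)   # 18
print(median_odd)    # 18
print(median_even)   # 19.0
print(one_mode)      # [15]
print(two_modes)     # [10, 20]
```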

Choosing the Right Measure

Data Level	Best Measure	Example
Nominal	Mode only	Most common major: "Psychology"
Ordinal	Median (or mode)	Median satisfaction: "Satisfied"
Interval/Ratio (symmetric)	Mean	Mean test score: 78.5
Interval/Ratio (skewed)	Median	Median income: $52,000

Effect of Skewness

Left (Negative) Skew

Typically: Mean < Median < Mode

Example: Easy test scores (most high, few low)

Normal (Symmetric)

Mean ≈ Median ≈ Mode

Example: Height in a population

Right (Positive) Skew

Typically: Mode < Median < Mean

Example: Income (most low/moderate, few very high)

Real-World Example: Income

Why do we report median household income, not mean?

U.S. Median household income: ~$70,000
U.S. Mean household income: ~$95,000

The mean is pulled up by a small number of very high-income households. The median better represents the "typical" household because income is right-skewed.
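
A toy example shows the same effect. The incomes below (in thousands of dollars) are invented, with one very high value standing in for the right tail.

```python
from statistics import mean, median

# Made-up household incomes in thousands of dollars; one extreme value.
incomes = [30, 35, 40, 45, 50, 55, 60, 1000]

income_mean = mean(incomes)
income_median = median(incomes)

print(income_mean)    # 164.375
print(income_median)  # 47.5
```

Seven of the eight households earn far less than the mean, while the median sits in the middle of the typical incomes.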

Topic 3

Measures of Variability

Measures of variability (or dispersion) describe how spread out the data is. Two datasets can have the same mean but very different spreads. Understanding variability is essential for interpreting data and making statistical inferences.

Why Variability Matters

Class A

70, 72, 74, 76, 78

Mean = 74

Consistent performance—all students similar

Class B

50, 60, 74, 88, 98

Mean = 74

High variability—students very different

Same mean, very different stories! We need measures of spread to understand the full picture.
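
The two classes above can be compared numerically. The sketch uses the sample standard deviation (`statistics.stdev`, which divides by n - 1) and the range, both of which are defined later in this topic.

```python
from statistics import mean, stdev

class_a = [70, 72, 74, 76, 78]
class_b = [50, 60, 74, 88, 98]

for name, scores in (("Class A", class_a), ("Class B", class_b)):
    spread = max(scores) - min(scores)
    print(name, "mean:", mean(scores),
          "range:", spread,
          "SD:", round(stdev(scores), 2))
# Class A mean: 74 range: 8 SD: 3.16
# Class B mean: 74 range: 48 SD: 19.65
```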

Range

Difference between highest and lowest values

Range = Maximum - Minimum

Example: Scores: 45, 52, 67, 73, 89

Range = 89 - 45 = 44

Pro: Easy to calculate and understand

Con: Based on only 2 values; very sensitive to outliers

Interquartile Range (IQR)

Range of the middle 50% of the data

IQR = Q3 - Q1

Q1 = 25th percentile, Q3 = 75th percentile

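[Figure: number line marked Min, Q1, Median (Q2), Q3, Max; each segment holds 25% of the data, and the IQR spans the middle 50% between Q1 and Q3]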

Pro: Resistant to outliers; good for skewed data

Con: Ignores the lowest 25% and the highest 25% of the data
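
Range and IQR can be computed for the scores from the range example. Note that textbooks and software do not all locate Q1 and Q3 the same way. The sketch below shows the two methods offered by Python's `statistics.quantiles`, which give different answers for a sample this small.

```python
from statistics import quantiles

scores = [45, 52, 67, 73, 89]

value_range = max(scores) - min(scores)

# Default ("exclusive") method.
q1_ex, _, q3_ex = quantiles(scores, n=4)
iqr_exclusive = q3_ex - q1_ex

# "Inclusive" method: treats the data as including the min and max of the population.
q1_in, _, q3_in = quantiles(scores, n=4, method="inclusive")
iqr_inclusive = q3_in - q1_in

print(value_range)    # 44
print(iqr_exclusive)  # 32.5
print(iqr_inclusive)  # 21.0
```

When reporting an IQR, it helps to name the software or method used so that others can reproduce the quartiles.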

Variance and Standard Deviation

Variance (s²)

Average of squared deviations from the mean

s² = Σ(x - x̄)² / (n - 1)

For sample variance; population uses n instead of n-1

Variance is in squared units (e.g., years²), which is hard to interpret directly.

Standard Deviation (s or SD)

Square root of the variance; roughly the typical distance of a value from the mean

s = √[Σ(x - x̄)² / (n - 1)]

In original units—much more interpretable than variance.

Calculating Standard Deviation: Step by Step

Data: 4, 8, 6, 5, 7

x	x - x̄	(x - x̄)²
4	4 - 6 = -2	4
8	8 - 6 = 2	4
6	6 - 6 = 0	0
5	5 - 6 = -1	1
7	7 - 6 = 1	1
Σ = 30	Σ = 0	Σ = 10

Step 1: Calculate mean: x̄ = 30/5 = 6

Step 2: Calculate deviations (x - x̄)

Step 3: Square each deviation

Step 4: Sum squared deviations: Σ(x - x̄)² = 10

Step 5: Divide by n-1: s² = 10/4 = 2.5 (variance)

Step 6: Take square root: s = √2.5 = 1.58 (SD)
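
The six steps above translate directly into code. The final line cross-checks the result against `statistics.stdev`, which also uses the n - 1 (sample) formula.

```python
import math
from statistics import stdev

data = [4, 8, 6, 5, 7]
n = len(data)

x_bar = sum(data) / n                      # Step 1: mean = 30 / 5 = 6.0
deviations = [x - x_bar for x in data]     # Step 2: x - mean
squared = [d ** 2 for d in deviations]     # Step 3: square each deviation
sum_of_squares = sum(squared)              # Step 4: 10.0
variance = sum_of_squares / (n - 1)        # Step 5: 10 / 4 = 2.5
sd = math.sqrt(variance)                   # Step 6: about 1.58

print(variance, round(sd, 2))              # 2.5 1.58
print(math.isclose(sd, stdev(data)))       # True
```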

Interpreting Standard Deviation

The Empirical Rule (68-95-99.7 Rule)

For normally distributed data:

68% of data falls within ±1 SD of mean
95% of data falls within ±2 SD of mean
99.7% of data falls within ±3 SD of mean

Example Application

If test scores have Mean = 75, SD = 10:

68% of students score between 65 and 85
95% of students score between 55 and 95
99.7% of students score between 45 and 105
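
The percentages in the empirical rule are rounded values from the normal curve. As a quick check, the sketch below uses `statistics.NormalDist` with the example's mean of 75 and SD of 10, assuming the scores are exactly normal.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

shares = {}
for k in (1, 2, 3):
    low, high = 75 - k * 10, 75 + k * 10
    # Share of the distribution between low and high.
    shares[k] = scores.cdf(high) - scores.cdf(low)
    print(f"{low} to {high}: {shares[k]:.1%}")
# 65 to 85: 68.3%
# 55 to 95: 95.4%
# 45 to 105: 99.7%
```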

Summary: Choosing a Measure of Variability

Measure	Best For	Pair With
Range	Quick overview; ordinal data	Any central tendency
IQR	Skewed data; outliers present	Median
Standard Deviation	Normally distributed interval/ratio data	Mean

Report Both Central Tendency and Variability!

Always report a measure of center AND spread together:

"Mean score was 75.4 (SD = 12.3)"
"Median income was $52,000 (IQR = $28,000-$78,000)"

One without the other tells an incomplete story.

Topic 4

Data Visualization Basics

Visualizations help you explore data, identify patterns, and communicate findings. The right chart makes data accessible; the wrong chart misleads. This topic covers essential chart types and when to use each.

Charts for Categorical Data

Bar Chart

Use for: Comparing counts or values across categories

Example: Number of students in each major

Tips:

Start y-axis at zero
Order bars meaningfully (by size or logically)
Use horizontal bars for long category names

Pie Chart

Use for: Showing parts of a whole (percentages)

Example: Percentage of budget by category

Tips:

Limit to 5-7 slices maximum
Include percentages as labels
Consider bar chart instead for precise comparisons

Charts for Continuous Data

Histogram

Use for: Showing distribution of continuous variable

Example: Distribution of test scores

Tips:

Bars touch (continuous data)
Shows shape: symmetric, skewed, bimodal
Experiment with bin width
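
To see why bin width matters, the sketch below prints a rough text histogram of the same made-up scores at three bin widths. The helper `bin_counts` is hypothetical, and unlike a real histogram it simply skips empty bins.

```python
from collections import Counter

# Made-up test scores.
scores = [52, 55, 58, 61, 63, 64, 66, 67, 68, 71, 72, 74, 75, 78, 83, 91]

def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

for width in (5, 10, 20):
    print(f"bin width {width}")
    for start, count in bin_counts(scores, width).items():
        print(f"  {start:>3}-{start + width - 1:<3} {'#' * count}")
```

Narrow bins show more detail but look noisy, while very wide bins can hide the shape of the distribution altogether.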

Box Plot (Box-and-Whisker)

Use for: Showing distribution with quartiles; comparing groups

Example: Comparing test scores across classes

Anatomy:

Box: Middle 50% (IQR)
Line in box: Median
Whiskers: Extend to the min/max, or to the most extreme values within 1.5×IQR of the box
Dots: Outliers beyond whiskers

Charts for Relationships

Scatter Plot

Use for: Showing relationship between two continuous variables

Example: Study hours vs. test score

Look for:

Direction: Positive, negative, or no relationship
Strength: Tight cluster vs. wide spread
Form: Linear or curved
Outliers: Points far from pattern

Line Graph

Use for: Showing change over time (trends)

Example: Monthly sales over a year

Tips:

X-axis should be time
Connect points only when continuity makes sense
Can show multiple lines for comparison

Choosing the Right Chart

Your Goal	Data Type	Best Chart
Compare categories	Categorical	Bar chart
Show parts of whole	Categorical (percentages)	Pie chart (or stacked bar)
Show distribution	Continuous	Histogram
Compare distributions	Continuous by group	Box plot
Show relationship	Two continuous	Scatter plot
Show trend over time	Continuous + time	Line graph

Common Visualization Mistakes

Truncated Y-Axis

Not starting at zero exaggerates differences

Fix: Start at zero or clearly indicate break

3D Effects

3D charts distort perception and add no information

Fix: Use 2D charts—cleaner and more accurate

Too Many Categories

Pie charts with 15 slices are unreadable

Fix: Combine small categories into "Other"

Missing Labels

Charts without axis labels or titles are meaningless

Fix: Always label axes, include title and units

Dual Y-Axes

Two y-axes can be manipulated to show false relationships

Fix: Use separate charts or normalize data

Wrong Chart Type

Using pie chart for non-percentage data, line for categories

Fix: Match chart to data and purpose

Good Visualization Principles

Clarity: The message should be immediately clear
Accuracy: Data should be represented truthfully
Efficiency: Maximize the data-ink ratio (no chartjunk)
Accessibility: Consider colorblind-friendly palettes
Context: Provide enough information to interpret

Topic 5

Introduction to Statistical Inference

Statistical inference allows us to draw conclusions about populations based on sample data. This topic introduces the foundational concepts: populations vs. samples, sampling distributions, hypothesis testing logic, and p-values.

Populations and Samples

Population

The entire group you want to study

Parameters: Characteristics of the population

Symbols: μ (mean), σ (SD), ρ (correlation)

Example: All university students in Bangladesh

Sample

A subset of the population you actually study

Statistics: Characteristics of the sample (estimates of parameters)

Symbols: x̄ (mean), s (SD), r (correlation)

Example: 500 students from 5 universities

We use sample statistics to estimate population parameters

This is statistical inference

Sampling Error and Sampling Distribution

Sampling Error: The difference between a sample statistic and the population parameter. It occurs because samples are only part of the population.

The Sampling Distribution

Imagine taking many samples from the same population and calculating the mean of each:

Population

μ = 100

Sample 1: x̄ = 98

Sample 2: x̄ = 103

Sample 3: x̄ = 101

Sample 4: x̄ = 97

Sample 5: x̄ = 102

...

Sampling Distribution of Means

Centered at μ = 100

The distribution of all these sample means is called the sampling distribution.

Central Limit Theorem

Regardless of the population distribution, the sampling distribution of the mean:

Approaches a normal distribution as sample size increases
Has mean equal to the population mean (μ)
Has standard deviation (standard error) = σ/√n

Why it matters: This allows us to use the normal distribution for inference even when the population isn't normal, provided n is large enough (n ≥ 30 is a common rule of thumb)
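
A small simulation can make the theorem concrete. The sketch below assumes a right-skewed (exponential) population with μ = 1 and σ = 1, draws 2,000 samples of n = 30, and compares the mean and SD of the sample means with μ and σ/√n. The seed, the population, and the number of samples are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population (mu = 1, sigma = 1)."""
    return mean(random.expovariate(1.0) for _ in range(n))

n = 30
sample_means = [sample_mean(n) for _ in range(2000)]

theoretical_se = 1 / n ** 0.5  # sigma / sqrt(n), about 0.183

print(round(mean(sample_means), 3))   # close to mu = 1
print(round(stdev(sample_means), 3))  # close to the theoretical SE
print(round(theoretical_se, 3))
```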

Standard Error

Standard Error (SE): The standard deviation of the sampling distribution. It measures how much sample statistics vary from sample to sample.

SE = s / √n

Sample standard deviation divided by the square root of the sample size (s estimates the unknown population σ)

Key Implications:

Larger sample → Smaller SE → More precise estimate
To halve the SE, you need to quadruple the sample size
SE is used to construct confidence intervals and calculate test statistics
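
The implications above can be checked with a few lines. The SD of 10 and the sample mean of 75 are illustrative values, and the interval uses the usual normal-based multiplier of 1.96 for an approximate 95% confidence interval, which is an assumption that suits larger samples.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """SE = s / sqrt(n)."""
    return sd / math.sqrt(n)

se_25 = standard_error(10, 25)    # 2.0
se_100 = standard_error(10, 100)  # 1.0: quadrupling n halves the SE

# Approximate 95% confidence interval for a sample mean of 75 with n = 100.
sample_mean = 75
ci_low = sample_mean - 1.96 * se_100
ci_high = sample_mean + 1.96 * se_100

print(se_25, se_100)
print(round(ci_low, 2), round(ci_high, 2))  # 73.04 76.96
```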

The Logic of Hypothesis Testing

Step 1: State Hypotheses

Null Hypothesis (H₀)

No effect, no difference, no relationship

"There is no difference in scores between groups"

Alternative Hypothesis (H₁ or Hₐ)

There IS an effect, difference, or relationship

"There is a difference in scores between groups"

Step 2: Set Significance Level (α)

The threshold for deciding if results are "statistically significant"

Typically α = .05 (5% chance of rejecting H₀ when it's actually true)

Step 3: Collect Data & Calculate Test Statistic

Conduct study and calculate appropriate statistic (t, F, χ², etc.)

Step 4: Find the p-value

Probability of getting results this extreme (or more) IF the null hypothesis were true

Step 5: Make Decision

If p ≤ α: Reject H₀

Result is statistically significant

If p > α: Fail to reject H₀

Result is not statistically significant
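
The five steps can be traced end to end on a tiny, made-up dataset. As a stand-in for the test statistics named above (t, F, χ²), this sketch uses an exact permutation test on the difference in group means, chosen only because it needs no statistical tables. It computes the p-value directly from its definition: the share of all possible group assignments that give a difference at least as extreme as the one observed.

```python
from itertools import combinations
from statistics import mean

# Step 1: H0 = no difference in mean scores; H1 = there is a difference.
method_a = [61, 64, 66, 68, 70]   # made-up scores
method_b = [72, 75, 77, 80, 83]

ALPHA = 0.05                      # Step 2: significance level

# Step 3: test statistic = observed difference in means.
observed = mean(method_b) - mean(method_a)

# Step 4: p-value = share of all regroupings at least as extreme as observed.
pooled = method_a + method_b
indices = range(len(pooled))
extreme = 0
total = 0
for group_a_idx in combinations(indices, len(method_a)):
    chosen = set(group_a_idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in indices if i not in chosen]
    total += 1
    if abs(mean(b) - mean(a)) >= abs(observed) - 1e-12:
        extreme += 1
p_value = extreme / total

# Step 5: decision.
decision = "Reject H0" if p_value <= ALPHA else "Fail to reject H0"

print(total, extreme)              # 252 2
print(round(p_value, 4), decision) # 0.0079 Reject H0
```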

Understanding P-Values

P-value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.

What P-Value IS:

Probability of the data (or more extreme) given H₀ is true
A measure of evidence against the null hypothesis
Smaller p-value = stronger evidence against H₀

What P-Value is NOT:

NOT the probability that H₀ is true
NOT the probability that results are due to chance
NOT a measure of effect size or importance

Interpreting P-Values

p > .10	Little to no evidence against H₀
p = .05-.10	Weak evidence (sometimes called "marginally significant")
p = .01-.05	Moderate evidence against H₀
p = .001-.01	Strong evidence against H₀
p < .001	Very strong evidence against H₀

Types of Errors

	H₀ is Actually True	H₀ is Actually False
Reject H₀	Type I Error (false positive); probability = α	Correct (true positive); power = 1 - β
Fail to Reject H₀	Correct (true negative)	Type II Error (false negative); probability = β

Type I Error Example:

Concluding a new drug works when it actually doesn't (false alarm)

Type II Error Example:

Concluding a drug doesn't work when it actually does (missed opportunity)
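
A simulation can show what "probability = α" means for Type I errors. In the sketch below the null hypothesis is true by construction, so every rejection is a false positive. It uses a two-sided z-test with a known population SD, an assumption made to keep the example in the standard library. The seed, sample size, and number of trials are arbitrary.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is repeatable
ALPHA = 0.05
N = 25
standard_normal = NormalDist()

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value, assuming the population SD (sigma) is known."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - standard_normal.cdf(abs(z)))

# H0 is true: every sample is drawn from a population with mean 0.
trials = 2000
false_positives = sum(
    z_test_p_value([random.gauss(0, 1) for _ in range(N)]) <= ALPHA
    for _ in range(trials)
)
type_i_rate = false_positives / trials

print(type_i_rate)  # close to ALPHA = 0.05
```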

Statistical Significance ≠ Practical Importance

A statistically significant result (p < .05) doesn't mean the effect is:

Large: With big samples, tiny effects can be significant
Important: A 0.5 point difference on a 100-point scale may be significant but trivial
Real-world meaningful: Always report and interpret effect sizes!
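
The first two points can be illustrated with the 0.5-point difference mentioned above. The sketch assumes two equal-sized groups with a known SD of 10 and uses a two-sided z-test, a simplification for illustration. The standardized difference (difference divided by SD, often called Cohen's d) is one common effect size. It stays at 0.05 no matter how large the sample is.

```python
from statistics import NormalDist

def two_group_p_value(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for a difference in means (equal known SD)."""
    se = sd * (2 / n_per_group) ** 0.5
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

diff, sd = 0.5, 10
effect_size = diff / sd  # standardized difference, 0.05 at every sample size

p_values = {n: two_group_p_value(diff, sd, n) for n in (50, 500, 5000, 50000)}
for n, p in p_values.items():
    print(f"n per group = {n:>5}: p = {p:.4f}, effect size = {effect_size}")
```

The same trivial difference is far from significant with 50 per group and clearly significant with 5,000 per group, which is why the effect size should be reported alongside the p-value.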

Summary

Module 09 Key Takeaways

What You've Learned

Data preparation (cleaning, coding, handling missing data) is essential before analysis
Mean, median, and mode describe central tendency; choose based on data type and distribution
Range, IQR, and standard deviation describe variability; always report with central tendency
Choose visualizations based on data type and purpose; avoid common chart mistakes
Statistical inference uses sample statistics to estimate population parameters; p-values indicate evidence against null hypothesis

Next Steps

In Module 10: Quantitative Analysis, you'll learn about specific statistical tests including t-tests, ANOVA, correlation, and regression—when to use each and how to interpret results.

Practice

Data Analysis Practice Exercises

Applied Analysis Tasks

Data Cleaning: Given a messy dataset, identify and document:
- Missing values (how many? what percentage?)
- Out-of-range values
- Potential outliers
- Your plan for handling each issue
Descriptive Statistics: Calculate by hand:
- Mean, median, and mode for: 5, 8, 10, 12, 12, 15, 18, 45
- Range, IQR, and standard deviation for the same data
- Explain which measures are most appropriate and why
Chart Selection: For each scenario, select the best chart type:
- Comparing sales across 4 product categories
- Showing the relationship between age and income
- Displaying how student grades are distributed
- Tracking stock price over 12 months
Hypothesis Testing: For a study comparing two teaching methods:
- State the null and alternative hypotheses
- If p = .03, what is your conclusion at α = .05?
- What type of error could you be making?