Use for: Comparing counts or values across categories
Example: Number of students in each major
Tips:
- Start y-axis at zero
- Order bars meaningfully (by size or logically)
- Use horizontal bars for long category names
Learn the fundamentals of preparing, organizing, and analyzing research data. This module covers data preparation, descriptive statistics, data visualization basics, and introduces the logic of statistical analysis. Build a solid foundation for understanding and interpreting quantitative research findings.
Module Status: Complete with comprehensive coverage of data preparation, descriptive statistics, visualization, and analysis foundations.
Before analyzing data, you must prepare and clean it. Raw data almost always contains errors, inconsistencies, and missing values that need to be addressed. Data preparation is often estimated to take 60-80% of a data analyst's time; it is tedious but essential for valid results.
| ID | Age | Gender | Score_Pre | Score_Post | Group |
|---|---|---|---|---|---|
| 001 | 23 | 1 | 45 | 52 | 1 |
| 002 | 25 | 2 | 38 | 41 | 2 |
| 003 | 21 | 1 | 52 | 58 | 1 |
Rows = Cases/Participants (one row per person)
Columns = Variables (one column per variable)
- Use short, descriptive names: Age, Score_Pre, GroupID
- Start with a letter: Q1, Item_1, Var_A
- Use underscores for spaces: Pre_Test, Post_Test
- Avoid spaces: Pre Test ✗
- Avoid special characters: Score#1, Age@start ✗
- Avoid very long names: ParticipantPreTestScoreFirstSession ✗
A codebook documents all variables, their meanings, and how values are coded.
| Variable | Description | Type | Values | Missing |
|---|---|---|---|---|
| ID | Participant identifier | String | 001-150 | None |
| Age | Age in years | Numeric | 18-65 | -99 |
| Gender | Self-reported gender | Categorical | 1=Male, 2=Female, 3=Other | -99 |
| Score_Pre | Pre-test score (0-100) | Numeric | 0-100 | -99 |
| Group | Experimental condition | Categorical | 1=Control, 2=Treatment | None |
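A codebook can also be kept in machine-readable form so that coded values are decoded the same way every time. The sketch below is one possible layout, not a standard: the `CODEBOOK` dictionary and the `decode` helper are hypothetical names, and the value labels are copied from the table above.

```python
# Hypothetical machine-readable codebook mirroring two rows of the table above.
CODEBOOK = {
    "Gender": {"labels": {1: "Male", 2: "Female", 3: "Other"}, "missing": -99},
    "Group": {"labels": {1: "Control", 2: "Treatment"}, "missing": None},
}


def decode(variable, value):
    """Return the label for a coded value, or None if it is the missing code."""
    entry = CODEBOOK[variable]
    if entry["missing"] is not None and value == entry["missing"]:
        return None
    return entry["labels"][value]


print(decode("Gender", 2))    # Female
print(decode("Group", 1))     # Control
print(decode("Gender", -99))  # None
```

A code that is not in the codebook raises a `KeyError`, which is usually what you want during cleaning: an undocumented value should stop the script rather than pass silently.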
- Out-of-range values: run frequencies/descriptives and check min/max
- Inconsistent responses: cross-tabulate related variables
- Invalid categories: check unique values for categorical variables
- Duplicate cases: check for duplicate IDs
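The range, category, and duplicate checks can be scripted. The following is a minimal sketch using only the standard library; the `rows` records are made up for illustration, and the valid ranges (ages 18-65, gender codes 1-3) are taken from the codebook table.

```python
from collections import Counter

# Hypothetical records: the third row repeats an ID, the fourth has an
# impossible age and an undefined gender code.
rows = [
    {"ID": "001", "Age": 23, "Gender": 1},
    {"ID": "002", "Age": 25, "Gender": 2},
    {"ID": "002", "Age": 21, "Gender": 1},
    {"ID": "004", "Age": 230, "Gender": 4},
]

# Range check: inspect min/max, then list the IDs outside 18-65.
ages = [r["Age"] for r in rows]
print("Age min/max:", min(ages), max(ages))
out_of_range = [r["ID"] for r in rows if not 18 <= r["Age"] <= 65]

# Category check: count each unique value and flag codes outside 1-3.
gender_counts = Counter(r["Gender"] for r in rows)
invalid_gender = sorted(v for v in gender_counts if v not in {1, 2, 3})

# Duplicate check: any ID that occurs more than once.
id_counts = Counter(r["ID"] for r in rows)
duplicate_ids = sorted(i for i, c in id_counts.items() if c > 1)

print(out_of_range, invalid_gender, duplicate_ids)
```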
Common missing-value codes: blank, NA, -99, 999, "missing"
Missing-data mechanisms: MCAR (missing completely at random), MAR (missing at random, given other observed variables), MNAR (missing not at random)
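Missing-value codes must be converted before any calculation, otherwise a code such as -99 is treated as a real number. A minimal sketch, assuming a hypothetical `raw_ages` column and treating every code listed above as missing:

```python
# The codes listed above; "" stands for a blank cell.
MISSING_CODES = {"", "NA", -99, 999, "missing"}

raw_ages = [23, -99, 21, "NA", 30]  # hypothetical raw column

# Replace every missing-value code with None so it cannot enter calculations.
clean_ages = [None if v in MISSING_CODES else v for v in raw_ages]

valid = [v for v in clean_ages if v is not None]
mean_age = sum(valid) / len(valid)

print(clean_ages)           # [23, None, 21, None, 30]
print(round(mean_age, 2))   # 24.67
```

In practice the set of codes should be defined per variable, because a code like 999 can be a legitimate value for some measures.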
Reverse coding (5-point scale): 1→5, 2→4, 3→3, 4→2, 5→1
Computing a total score: Total_Score = Q1 + Q2 + Q3 + Q4 + Q5
Recoding into categories: Age_Group: 18-25=1, 26-35=2, 36-45=3, etc.
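These three transformations can be sketched as small functions. The names `reverse_code` and `age_group` are illustrative, the participant's answers are invented, and the choice of Q3 as the reverse-worded item is an assumption made only for the example.

```python
def reverse_code(value, scale_min=1, scale_max=5):
    """Reverse a scale item: on a 1-5 scale, 1 -> 5, 2 -> 4, ..., 5 -> 1."""
    return scale_max + scale_min - value


def age_group(age):
    """Recode age into the bands above: 18-25 = 1, 26-35 = 2, 36-45 = 3."""
    if 18 <= age <= 25:
        return 1
    if 26 <= age <= 35:
        return 2
    if 36 <= age <= 45:
        return 3
    return None  # bands beyond 45 are not spelled out in the text


# Hypothetical participant; Q3 is assumed to be the negatively worded item.
answers = {"Q1": 4, "Q2": 5, "Q3": 2, "Q4": 3, "Q5": 4}
answers["Q3"] = reverse_code(answers["Q3"])  # 2 becomes 4
total_score = sum(answers[q] for q in ("Q1", "Q2", "Q3", "Q4", "Q5"))

print(total_score)    # 20
print(age_group(23))  # 1
```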
Outliers are data points that are unusually far from other observations. They can be legitimate extreme values or errors.
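One common screening convention, not prescribed by this module, flags values more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below applies that rule to made-up scores; a flagged value still needs a human decision about whether it is an error or a legitimate extreme.

```python
import statistics

scores = [12, 14, 15, 15, 16, 17, 18, 45]  # hypothetical data

# Quartiles via the "inclusive" method; other methods give slightly
# different cut points.
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)  # 14.75 17.25 2.5
print(outliers)     # [45]
```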
Keep a log of all data cleaning decisions. This ensures transparency and reproducibility.
Measures of central tendency describe the "typical" or "average" value in a dataset. The three main measures are mean, median, and mode. Each has strengths and is appropriate for different types of data and distributions.
Sum of all values divided by the number of values
Mean = Σx / n
Example: Scores: 12, 15, 18, 20, 25
Mean = (12 + 15 + 18 + 20 + 25) / 5
Mean = 90 / 5 = 18
Use when: Data is interval/ratio and approximately normally distributed
The middle value when data is ordered from lowest to highest
Odd n: Middle value
Even n: Average of two middle values
Example (odd): 12, 15, 18, 20, 25
Median = 18 (middle value)
Example (even): 12, 15, 18, 20, 25, 30
Median = (18 + 20) / 2 = 19
Use when: Data is skewed, contains outliers, or is ordinal
The most frequently occurring value
Can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)
Example: 12, 15, 15, 15, 18, 20, 25
Mode = 15 (appears 3 times)
Bimodal: 10, 10, 15, 20, 20, 25
Modes = 10 and 20
Use when: Data is nominal/categorical, or you want the most common response
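The worked examples for mean, median, and mode can be reproduced with Python's standard `statistics` module. This is only a convenience check of the arithmetic above; `multimode` requires Python 3.8 or later.

```python
import statistics

mean_value = statistics.mean([12, 15, 18, 20, 25])
median_odd = statistics.median([12, 15, 18, 20, 25])
median_even = statistics.median([12, 15, 18, 20, 25, 30])
single_mode = statistics.mode([12, 15, 15, 15, 18, 20, 25])
both_modes = statistics.multimode([10, 10, 15, 20, 20, 25])

print(mean_value)   # 18
print(median_odd)   # 18
print(median_even)  # 19.0
print(single_mode)  # 15
print(both_modes)   # [10, 20]
```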
| Data Level | Best Measure | Example |
|---|---|---|
| Nominal | Mode only | Most common major: "Psychology" |
| Ordinal | Median (or mode) | Median satisfaction: "Satisfied" |
| Interval/Ratio (symmetric) | Mean | Mean test score: 78.5 |
| Interval/Ratio (skewed) | Median | Median income: $52,000 |
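Negatively skewed (tail to the left): typically Mean < Median < Mode
Example: Easy test scores (most high, few low)
Symmetric: Mean ≈ Median ≈ Mode
Example: Height in a population
Positively skewed (tail to the right): typically Mode < Median < Mean
Example: Income (most low/moderate, few very high)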
Why do we report median household income, not mean?
Income is right-skewed: a small number of very high incomes pulls the mean upward, so the median better represents the "typical" household.
Measures of variability (or dispersion) describe how spread out the data is. Two datasets can have the same mean but very different spreads. Understanding variability is essential for interpreting data and making statistical inferences.
70, 72, 74, 76, 78
Mean = 74
Consistent performance—all students similar
50, 60, 74, 88, 98
Mean = 74
High variability—students very different
Same mean, very different stories! We need measures of spread to understand the full picture.
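To put a number on that difference, the sketch below computes the mean and the sample standard deviation (defined later in this section) for both sets of scores. The names `class_a` and `class_b` are labels added here for convenience.

```python
import statistics

class_a = [70, 72, 74, 76, 78]
class_b = [50, 60, 74, 88, 98]

mean_a, sd_a = statistics.mean(class_a), statistics.stdev(class_a)
mean_b, sd_b = statistics.mean(class_b), statistics.stdev(class_b)

print(mean_a, round(sd_a, 2))  # 74 3.16
print(mean_b, round(sd_b, 2))  # 74 19.65
```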
Difference between highest and lowest values
Range = Maximum - Minimum
Example: Scores: 45, 52, 67, 73, 89
Range = 89 - 45 = 44
Pro: Easy to calculate and understand
Con: Based on only 2 values; very sensitive to outliers
Range of the middle 50% of the data
IQR = Q3 - Q1
Q1 = 25th percentile, Q3 = 75th percentile
Pro: Resistant to outliers; good for skewed data
Con: Describes only the middle 50% of the data
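Range and IQR for the example scores can be computed as follows. Quartile conventions differ between textbooks and software, so the IQR here depends on the assumed `method="inclusive"` setting of `statistics.quantiles`; other conventions can give a different value for such a small dataset.

```python
import statistics

scores = [45, 52, 67, 73, 89]

value_range = max(scores) - min(scores)

# Quartiles using the "inclusive" convention (an assumption, see above).
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(value_range)  # 44
print(q1, q3, iqr)  # 52.0 73.0 21.0
```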
Average of squared deviations from the mean
s² = Σ(x - x̄)² / (n - 1)
For sample variance; population uses n instead of n-1
Variance is in squared units (e.g., years²), which is hard to interpret directly.
Square root of the variance; roughly the typical distance of values from the mean
s = √[Σ(x - x̄)² / (n - 1)]
In original units—much more interpretable than variance.
Data: 4, 8, 6, 5, 7
| x | x - x̄ | (x - x̄)² |
|---|---|---|
| 4 | 4 - 6 = -2 | 4 |
| 8 | 8 - 6 = 2 | 4 |
| 6 | 6 - 6 = 0 | 0 |
| 5 | 5 - 6 = -1 | 1 |
| 7 | 7 - 6 = 1 | 1 |
| Σ = 30 | Σ = 0 | Σ = 10 |
Step 1: Calculate mean: x̄ = 30/5 = 6
Step 2: Calculate deviations (x - x̄)
Step 3: Square each deviation
Step 4: Sum squared deviations: Σ(x - x̄)² = 10
Step 5: Divide by n-1: s² = 10/4 = 2.5 (variance)
Step 6: Take square root: s = √2.5 = 1.58 (SD)
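The six steps translate directly into code. The sketch below follows them one by one and then cross-checks the result against the standard library's sample formulas.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]
n = len(data)

mean = sum(data) / n                      # Step 1: 30 / 5 = 6.0
deviations = [x - mean for x in data]     # Step 2: x - mean
squared = [d ** 2 for d in deviations]    # Step 3: square each deviation
sum_of_squares = sum(squared)             # Step 4: 10.0
variance = sum_of_squares / (n - 1)       # Step 5: 10 / 4 = 2.5
sd = math.sqrt(variance)                  # Step 6: about 1.58

# Cross-check against the library's sample variance and SD.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))

print(variance, round(sd, 2))  # 2.5 1.58
```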
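For normally distributed data:
- About 68% of data falls within ±1 SD of the mean
- About 95% of data falls within ±2 SD of the mean
- About 99.7% of data falls within ±3 SD of the mean
If test scores have Mean = 75, SD = 10:
- About 68% of scores fall between 65 and 85
- About 95% of scores fall between 55 and 95
- About 99.7% of scores fall between 45 and 105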
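For the test-score example (Mean = 75, SD = 10), the ±1, ±2, and ±3 SD intervals and the share of a normal distribution inside each can be computed with `statistics.NormalDist`. This is a sketch that assumes the scores are exactly normal, which real data never quite are.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

intervals = []
shares = []
for k in (1, 2, 3):
    low = scores.mean - k * scores.stdev
    high = scores.mean + k * scores.stdev
    share = scores.cdf(high) - scores.cdf(low)
    intervals.append((low, high))
    shares.append(share)
    print(f"±{k} SD: {low:.0f} to {high:.0f} covers {share:.1%}")
```

The printed shares are 68.3%, 95.4%, and 99.7%, which is where the familiar rounded "68-95-99.7" figures come from.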
| Measure | Best For | Pair With |
|---|---|---|
| Range | Quick overview; ordinal data | Any central tendency |
| IQR | Skewed data; outliers present | Median |
| Standard Deviation | Normally distributed interval/ratio data | Mean |
Always report a measure of center and a measure of spread together, such as the mean with the standard deviation or the median with the IQR. One without the other tells an incomplete story.
Visualizations help you explore data, identify patterns, and communicate findings. The right chart makes data accessible; the wrong chart misleads. This topic covers essential chart types and when to use each.
Bar chart
Use for: Comparing counts or values across categories
Example: Number of students in each major
Pie chart
Use for: Showing parts of a whole (percentages)
Example: Percentage of budget by category
Histogram
Use for: Showing the distribution of a continuous variable
Example: Distribution of test scores
Box plot
Use for: Showing a distribution with quartiles; comparing groups
Example: Comparing test scores across classes
Scatter plot
Use for: Showing the relationship between two continuous variables
Example: Study hours vs. test score
Line graph
Use for: Showing change over time (trends)
Example: Monthly sales over a year
| Your Goal | Data Type | Best Chart |
|---|---|---|
| Compare categories | Categorical | Bar chart |
| Show parts of whole | Categorical (percentages) | Pie chart (or stacked bar) |
| Show distribution | Continuous | Histogram |
| Compare distributions | Continuous by group | Box plot |
| Show relationship | Two continuous | Scatter plot |
| Show trend over time | Continuous + time | Line graph |
Not starting the y-axis at zero exaggerates differences between bars
Fix: Start at zero or clearly indicate the axis break
3D charts distort perception and add no information
Fix: Use 2D charts—cleaner and more accurate
Pie charts with 15 slices are unreadable
Fix: Combine small categories into "Other"
Charts without axis labels or titles are meaningless
Fix: Always label axes, include title and units
Two y-axes can be manipulated to show false relationships
Fix: Use separate charts or normalize data
Using a pie chart for non-percentage data, or a line graph for unordered categories
Fix: Match chart to data and purpose
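The first mistake in this list, an axis that does not start at zero, can be quantified. A viewer compares the drawn heights of bars, so the sketch below computes the ratio of drawn heights for two values under different axis starting points. The function name `apparent_ratio` and the values 50 and 55 are made up for illustration.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)


honest = apparent_ratio(50, 55)                    # axis starts at 0
truncated = apparent_ratio(50, 55, axis_start=45)  # axis starts at 45

print(honest)     # 1.1
print(truncated)  # 2.0
```

A true difference of 10% is drawn as a bar twice as tall once the axis starts at 45.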
Statistical inference allows us to draw conclusions about populations based on sample data. This topic introduces the foundational concepts: populations vs. samples, sampling distributions, hypothesis testing logic, and p-values.
Population: The entire group you want to study
Parameters: Characteristics of the population
Symbols: μ (mean), σ (SD), ρ (correlation)
Example: All university students in Bangladesh
Sample: A subset of the population that you actually study
Statistics: Characteristics of the sample (estimates of parameters)
Symbols: x̄ (mean), s (SD), r (correlation)
Example: 500 students from 5 universities
We use sample statistics to estimate population parameters. This is statistical inference.
Sampling Error: The difference between a sample statistic and the population parameter. It occurs because samples are only part of the population.
Imagine taking many samples from the same population and calculating the mean of each:
[Figure: a population with μ = 100 and the resulting sampling distribution of means, centered at μ = 100]
The distribution of all these sample means is called the sampling distribution.
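Central Limit Theorem: For a population with finite variance, regardless of its shape, the sampling distribution of the mean:
- Is approximately normal when the sample size is large enough
- Is centered at the population mean μ
- Has a standard deviation of σ/√n
Why it matters: This allows us to use the normal distribution for inference even when the population isn't normal (if n is large enough; n ≥ 30 is a common rule of thumb)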
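A small simulation makes the sampling distribution concrete. The sketch below draws 2,000 samples of n = 30 from a strongly right-skewed population and summarizes the sample means. The exponential population (mean 1, SD 1), the seed, and the sample counts are all arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
n, n_samples = 30, 2000

# Exponential population with rate 1: right-skewed, mean 1, SD 1.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

center = statistics.fmean(sample_means)  # expected to be close to 1
spread = statistics.stdev(sample_means)  # expected to be close to 1 / sqrt(30)

print(round(center, 3), round(spread, 3))
print("theory:", 1.0, round(1 / n ** 0.5, 3))
```

Even though individual values are far from normal, the sample means cluster around the population mean with a spread near σ/√n, about 0.18 here.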
Standard Error (SE): The standard deviation of the sampling distribution. It measures how much sample statistics vary from sample to sample.
SE = s / √n
Standard deviation divided by square root of sample size
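As a quick worked example of the formula, the sketch below computes the standard error for a hypothetical sample SD of 10 at two sample sizes. The helper name `standard_error` is illustrative.

```python
import math


def standard_error(s, n):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return s / math.sqrt(n)


print(standard_error(10, 25))   # 2.0
print(standard_error(10, 100))  # 1.0
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.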
Null hypothesis (H₀): No effect, no difference, no relationship
"There is no difference in scores between groups"
Alternative hypothesis (H₁): There is an effect, difference, or relationship
"There is a difference in scores between groups"
Significance level (α): The threshold for deciding if results are "statistically significant"
Typically α = .05 (a 5% chance of rejecting H₀ when it is actually true)
Test statistic: Conduct the study and calculate the appropriate statistic (t, F, χ², etc.)
p-value: Probability of getting results this extreme (or more extreme) if the null hypothesis were true
Decision:
If p ≤ α: Reject H₀; the result is statistically significant
If p > α: Fail to reject H₀; the result is not statistically significant
P-value: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
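The definition can be made concrete with a case small enough to count by hand. Suppose, hypothetically, a coin lands heads 9 times in 10 flips and H₀ says the coin is fair. The sketch below adds up the probability of every outcome at least as far from 5 heads as the observed one, then applies the decision rule. The function names are illustrative, and the exact binomial test is just one convenient choice of test.

```python
from math import comb


def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    observed_distance = abs(heads - flips / 2)
    # Count the equally likely sequences whose number of heads is at least
    # as far from flips / 2 as the observed result.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme / 2 ** flips


def decide(p, alpha=0.05):
    """Apply the decision rule: reject H0 when p <= alpha."""
    return "reject H0" if p <= alpha else "fail to reject H0"


p_nine = binomial_two_sided_p(9, 10)
p_seven = binomial_two_sided_p(7, 10)

print(p_nine, decide(p_nine))    # 0.021484375 reject H0
print(p_seven, decide(p_seven))  # 0.34375 fail to reject H0
```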
| p-value | Interpretation |
|---|---|
| p > .10 | Little to no evidence against H₀ |
| p = .05-.10 | Weak evidence (sometimes called "marginally significant") |
| p = .01-.05 | Moderate evidence against H₀ |
| p = .001-.01 | Strong evidence against H₀ |
| p < .001 | Very strong evidence against H₀ |
| | H₀ is Actually True | H₀ is Actually False |
|---|---|---|
| Reject H₀ | Type I Error (false positive); probability = α | Correct (true positive); power = 1 - β |
| Fail to Reject H₀ | Correct (true negative) | Type II Error (false negative); probability = β |
Type I error example: Concluding a new drug works when it actually doesn't (false alarm)
Type II error example: Concluding a drug doesn't work when it actually does (missed opportunity)
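The statement that the Type I error probability equals α can be checked by simulation. The sketch below repeatedly samples from a population where H₀ is true by construction and counts how often a test rejects it anyway. The two-sided z-test with known SD, the seed, and the sample sizes are assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(7)   # fixed seed so the run is repeatable
alpha, n, trials = 0.05, 25, 2000
standard_normal = NormalDist()

false_positives = 0
for _ in range(trials):
    # H0 is true here: every sample comes from a population with mean 0, SD 1.
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z_stat = fmean(sample) / (1 / n ** 0.5)   # z-test with known SD = 1
    p = 2 * (1 - standard_normal.cdf(abs(z_stat)))
    if p <= alpha:
        false_positives += 1                  # rejected a true H0

type_i_rate = false_positives / trials
print(type_i_rate)  # expected to be near alpha = 0.05
```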
A statistically significant result (p < .05) doesn't mean the effect is: