flowchart TD
  A[Raw Data] --> B[Sort into Bins]
  B --> C[Count Observations\nin Each Bin]
  C --> D[Plot as Histogram]
  D --> E[Analyze Distribution Shape]
  E --> F[Central Tendency\nMean & Median]
  E --> G[Spread\nStandard Deviation & IQR]
  E --> H[Skewness\nSymmetry]
  E --> I[Kurtosis\nTail Heaviness]
  E --> J[Modality\nNumber of Peaks]
Key Takeaways: Histogram Distribution Analysis
- Purpose: Visualize and analyze data distribution shapes using histograms
- What it shows: Data spread, central tendency, skewness, kurtosis, and potential outliers
- Key metrics: Mean, median, standard deviation, skewness, and kurtosis
- Normal distribution: Appears as a symmetric bell-shaped curve with mean = median
- Skewed distributions: Right-skewed (tail extends to right) or left-skewed (tail extends to left)
- Applications: Normality testing, distribution comparison, data exploration
- Advantages over formal tests: Provides visual intuition about distribution characteristics
What are Histogram Distribution Analyses?
Histograms are powerful graphical tools that display the frequency distribution of a dataset by dividing it into bins (intervals) and showing the count or density of observations in each bin. They provide immediate visual insights into a dataset’s shape, center, spread, and potential outliers, making them one of the most fundamental and useful tools in data analysis.
When to use histogram distribution analysis:
- When exploring a new dataset to understand its basic distribution characteristics
- When checking if data follows a normal distribution or other theoretical distribution
- When identifying skewness, kurtosis, or multimodality in your data
- When deciding whether to use parametric or non-parametric statistical methods
- When selecting appropriate data transformations
- When communicating data distribution features to non-technical audiences
- When checking for outliers or unusual patterns in your data
This interactive tool allows you to quickly generate histograms, analyze distribution shapes, and receive detailed interpretations to guide your statistical analysis decisions.
#| standalone: true
#| viewerHeight: 1400
library(shiny)
library(bslib)
library(ggplot2)
library(bsicons)
library(vroom)
library(shinyjs)
library(moments) # For skewness and kurtosis
ui <- page_sidebar(
title = "Distribution Histogram Analysis: Assess Data Normality",
useShinyjs(), # Enable shinyjs for dynamic UI updates
sidebar = sidebar(
width = 400,
card(
card_header("Data Input"),
accordion(
accordion_panel(
"Manual Input",
textAreaInput("data_input", "Enter your data (one value per row):", rows = 8,
placeholder = "Paste values here..."),
div(
actionLink("use_example", "Use example data", style = "color:#0275d8;"),
tags$span(bs_icon("file-earmark-text"), style = "margin-left: 5px; color: #0275d8;")
)
),
accordion_panel(
"File Upload",
fileInput("file_upload", "Upload CSV or TXT file:",
accept = c("text/csv", "text/plain", ".csv", ".txt")),
checkboxInput("header", "File has header", TRUE),
conditionalPanel(
condition = "output.file_uploaded",
div(
selectInput("selected_var", "Select variable:", choices = NULL),
actionButton("clear_file", "Clear File", class = "btn-danger btn-sm")
)
)
),
id = "input_method",
open = 1
),
# Plot Settings accordion
accordion(
accordion_panel(
"Histogram Settings",
card(
card_header("Bin Options:"),
card_body(
selectInput("bin_method", "Bin Width Method:",
choices = c("Auto" = "auto",
"Sturges" = "sturges",
"Scott" = "scott",
"Freedman-Diaconis" = "fd",
"Manual" = "manual"),
selected = "auto"),
conditionalPanel(
condition = "input.bin_method == 'manual'",
sliderInput("bin_count", "Number of Bins:", min = 5, max = 100, value = 30, step = 1)
)
)
),
card(
card_header("Distribution Options:"),
card_body(
checkboxInput("show_density", "Show Density Curve", TRUE),
checkboxInput("show_normal", "Show Normal Curve", TRUE),
checkboxInput("show_rug", "Show Data Rug", FALSE),
selectInput("compare_dist", "Compare with Distribution:",
choices = c("Normal" = "norm",
"t (Student's)" = "t",
"Exponential" = "exp",
"Lognormal" = "lnorm",
"Uniform" = "unif"),
selected = "norm"),
conditionalPanel(
condition = "input.compare_dist == 't'",
sliderInput("t_df", "Degrees of Freedom:", min = 1, max = 30, value = 5, step = 1)
)
)
),
card(
card_header("Visual Options:"),
card_body(
checkboxInput("show_stats", "Show Statistics Overlay", TRUE),
selectInput("fill_color", "Histogram Color:",
choices = c("Blue" = "blue",
"Green" = "green",
"Red" = "red",
"Purple" = "purple",
"Orange" = "orange"),
selected = "blue"),
selectInput("theme_choice", "Plot Theme:",
choices = c("Minimal" = "minimal",
"Classic" = "classic",
"Light" = "light",
"Dark" = "dark"),
selected = "minimal")
)
)
),
open = FALSE
),
actionButton("analyze", "Analyze Distribution", class = "btn btn-primary")
),
hr(),
card(
card_header("Interpretation Guide"),
card_body(
div(class = "alert alert-info",
tags$p("The histogram reveals important characteristics of your data distribution:"),
tags$ul(
tags$li(tags$b("Bell shape:"), " Suggests normal distribution."),
tags$li(tags$b("Right skew:"), " Tail extends to the right (higher values)."),
tags$li(tags$b("Left skew:"), " Tail extends to the left (lower values)."),
tags$li(tags$b("Bimodal:"), " Two peaks, suggesting two subgroups."),
tags$li(tags$b("Uniform:"), " Similar frequency across all values."),
tags$li(tags$b("Heavy tails:"), " More extreme values than expected in normal distribution.")
)
)
)
)
),
layout_column_wrap(
width = 1,
card(
card_header("Distribution Visualization"),
card_body(
navset_tab(
nav_panel("Histogram", plotOutput("histogram", height = "500px")),
nav_panel("Density Plot", plotOutput("density_plot", height = "500px")),
nav_panel("Cumulative Distribution", plotOutput("cdf_plot", height = "500px"))
)
)
),
card(
card_header("Distribution Assessment"),
card_body(
navset_tab(
nav_panel("Results", uiOutput("error_message"), verbatimTextOutput("summary_results")),
nav_panel("Interpretation", div(style = "font-size: 0.9rem;",
uiOutput("interpretation"),
hr(),
h5("Common Distribution Shapes"),
div(
class = "row",
div(class = "col-md-6",
h6("Normal Distribution:"),
p("Symmetric bell-shaped curve, mean = median."),
h6("Right-Skewed (Positive):"),
p("Longer tail to the right, mean > median."),
h6("Left-Skewed (Negative):"),
p("Longer tail to the left, mean < median.")
),
div(class = "col-md-6",
h6("Bimodal Distribution:"),
p("Two distinct peaks, suggesting two subgroups."),
h6("Uniform Distribution:"),
p("Similar frequency across all values."),
h6("Multimodal Distribution:"),
p("Multiple peaks, suggesting multiple subgroups.")
)
)
))
)
)
)
)
)
server <- function(input, output, session) {
# Example data - mixed normal distribution with slight skew
example_data <- "23.2\n24.5\n21.6\n22.8\n25.3\n20.9\n23.7\n22.2\n24.8\n21.5\n23.9\n25.6\n22.3\n24.1\n23.4\n21.8\n25.1\n24.3\n23.6\n22.1\n24.9\n23.2\n21.7\n25.8\n22.9\n24.4\n23.8\n21.3\n26.4\n28.7\n19.5\n18.8"
# Track input method
input_method <- reactiveVal("manual")
# Function to clear file inputs
clear_file_inputs <- function() {
updateSelectInput(session, "selected_var", choices = NULL)
reset("file_upload")
}
# Function to clear text inputs
clear_text_inputs <- function() {
updateTextAreaInput(session, "data_input", value = "")
}
# When example data is used, clear file inputs and set text inputs
observeEvent(input$use_example, {
input_method("manual")
clear_file_inputs()
updateTextAreaInput(session, "data_input", value = example_data)
})
# When file is uploaded, clear text inputs and set file method
observeEvent(input$file_upload, {
if (!is.null(input$file_upload)) {
input_method("file")
clear_text_inputs()
}
})
# When clear file button is clicked, clear file and set manual method
observeEvent(input$clear_file, {
input_method("manual")
clear_file_inputs()
})
# When text input changes, clear file inputs if it has content
observeEvent(input$data_input, {
if (!is.null(input$data_input) && nchar(input$data_input) > 0) {
input_method("manual")
clear_file_inputs()
}
}, ignoreInit = TRUE)
# Process uploaded file
file_data <- reactive({
req(input$file_upload)
tryCatch({
vroom::vroom(input$file_upload$datapath, delim = NULL, col_names = input$header, show_col_types = FALSE)
}, error = function(e) {
showNotification(paste("File read error:", e$message), type = "error")
NULL
})
})
# Update variable selection dropdown with numeric columns from uploaded file
observe({
df <- file_data()
if (!is.null(df)) {
num_vars <- names(df)[sapply(df, is.numeric)]
updateSelectInput(session, "selected_var", choices = num_vars)
}
})
output$file_uploaded <- reactive({
!is.null(input$file_upload)
})
outputOptions(output, "file_uploaded", suspendWhenHidden = FALSE)
# Function to parse text input
parse_text_input <- function(text) {
if (is.null(text) || text == "") return(NULL)
input_lines <- strsplit(text, "\\r?\\n")[[1]]
input_lines <- input_lines[input_lines != ""]
numeric_values <- suppressWarnings(as.numeric(input_lines))
if (all(is.na(numeric_values))) return(NULL)
return(na.omit(numeric_values))
}
# Get data values based on input method
data_values <- reactive({
if (input_method() == "file" && !is.null(file_data()) && !is.null(input$selected_var)) {
df <- file_data()
return(na.omit(df[[input$selected_var]]))
} else {
return(parse_text_input(input$data_input))
}
})
# Validate input data
validate_data <- reactive({
values <- data_values()
if (is.null(values)) {
return("Error: Please check your input. Make sure all values are numeric.")
}
if (length(values) < 5) {
return("Error: At least 5 values are recommended for meaningful distribution analysis.")
}
if (length(unique(values)) == 1) {
return("Error: All values are identical. Distribution analysis requires variation in the data.")
}
return(NULL)
})
# Display error messages
output$error_message <- renderUI({
error <- validate_data()
if (!is.null(error) && input$analyze > 0) {
if (startsWith(error, "Warning")) {
div(class = "alert alert-warning", error)
} else {
div(class = "alert alert-danger", error)
}
}
})
# Prepare data for analysis
processed_data <- reactive({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
values <- data_values()
values
})
# Calculate bin width based on selected method
bin_width <- reactive({
req(processed_data())
values <- processed_data()
if (input$bin_method == "auto") {
# Automatic binning
NULL
} else if (input$bin_method == "manual") {
# Manual bin count
diff(range(values)) / input$bin_count
} else {
# Other methods
h <- switch(input$bin_method,
"sturges" = diff(range(values)) / (1 + log2(length(values))),
"scott" = 3.5 * sd(values) * length(values)^(-1/3),
"fd" = 2 * IQR(values) * length(values)^(-1/3),
NULL)
h
}
})
# Calculate number of bins
bin_count <- reactive({
req(processed_data())
values <- processed_data()
if (input$bin_method == "manual") {
input$bin_count
} else if (input$bin_method == "auto") {
min(30, max(10, ceiling(length(values)/5)))
} else {
ceiling(diff(range(values)) / bin_width())
}
})
# Generate histogram
output$histogram <- renderPlot({
req(processed_data())
values <- processed_data()
# Create base histogram
p <- ggplot(data.frame(x = values), aes(x = x))
# Choose color based on input
fill_colors <- c(
"blue" = "#3498db",
"green" = "#2ecc71",
"red" = "#e74c3c",
"purple" = "#9b59b6",
"orange" = "#e67e22"
)
fill_color <- fill_colors[input$fill_color]
line_color <- colorspace::darken(fill_color, 0.3) # requires the colorspace package
# Add histogram with density scaling; manual and auto binning specify a
# bin count, while the rule-based methods specify a bin width
if (input$bin_method %in% c("manual", "auto")) {
p <- p + geom_histogram(aes(y = after_stat(density)), bins = bin_count(),
fill = fill_color, color = line_color, alpha = 0.7)
} else {
p <- p + geom_histogram(aes(y = after_stat(density)), binwidth = bin_width(),
fill = fill_color, color = line_color, alpha = 0.7)
}
# Add density curve if selected
if (input$show_density) {
p <- p + geom_density(color = "#c0392b", linewidth = 1.2)
}
# Add rug plot if selected
if (input$show_rug) {
p <- p + geom_rug(color = "#34495e", alpha = 0.5)
}
# Add normal curve if selected
if (input$show_normal) {
if (input$compare_dist == "norm") {
p <- p + stat_function(fun = dnorm, args = list(mean = mean(values), sd = sd(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "t") {
# Scale the t-distribution to match data
scaled_dt <- function(x) {
sd_val <- sd(values)
mean_val <- mean(values)
dt((x - mean_val) / sd_val, df = input$t_df) / sd_val
}
p <- p + stat_function(fun = scaled_dt, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "exp") {
# Scale exponential to match data
rate_param <- 1 / mean(values - min(values) + 0.01)
scaled_dexp <- function(x) {
dexp(x - min(values) + 0.01, rate = rate_param)
}
p <- p + stat_function(fun = scaled_dexp, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "lnorm") {
# Scale lognormal to match data
if (min(values) <= 0) {
# Shift data to positive domain for lognormal
shift <- abs(min(values)) + 1
shifted_values <- values + shift
log_mean <- mean(log(shifted_values))
log_sd <- sd(log(shifted_values))
scaled_dlnorm <- function(x) {
dlnorm(x + shift, meanlog = log_mean, sdlog = log_sd)
}
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
scaled_dlnorm <- function(x) {
dlnorm(x, meanlog = log_mean, sdlog = log_sd)
}
}
p <- p + stat_function(fun = scaled_dlnorm, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "unif") {
p <- p + stat_function(fun = dunif, args = list(min = min(values), max = max(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
}
}
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
sd_val <- sd(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = Inf, label = paste("Mean =", round(mean_val, 2)),
vjust = 2, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = Inf, label = paste("Median =", round(median_val, 2)),
vjust = 4, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Distribution Histogram",
subtitle = paste("n =", length(values), ", bins =", bin_count()),
x = "Value",
y = "Density"
)
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
p
})
# Generate density plot
output$density_plot <- renderPlot({
req(processed_data())
values <- processed_data()
# Create density plot
p <- ggplot(data.frame(x = values), aes(x = x)) +
geom_density(fill = fill_colors[input$fill_color], alpha = 0.5, color = "#2c3e50", linewidth = 1)
# Add rug plot if selected
if (input$show_rug) {
p <- p + geom_rug(color = "#34495e", alpha = 0.5)
}
# Add comparative distribution if selected
if (input$show_normal) {
if (input$compare_dist == "norm") {
p <- p + stat_function(fun = dnorm, args = list(mean = mean(values), sd = sd(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "t") {
# Scale the t-distribution to match data
scaled_dt <- function(x) {
sd_val <- sd(values)
mean_val <- mean(values)
dt((x - mean_val) / sd_val, df = input$t_df) / sd_val
}
p <- p + stat_function(fun = scaled_dt, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "exp") {
# Scale exponential to match data
rate_param <- 1 / mean(values - min(values) + 0.01)
scaled_dexp <- function(x) {
dexp(x - min(values) + 0.01, rate = rate_param)
}
p <- p + stat_function(fun = scaled_dexp, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "lnorm") {
# Scale lognormal to match data
if (min(values) <= 0) {
# Shift data to positive domain for lognormal
shift <- abs(min(values)) + 1
shifted_values <- values + shift
log_mean <- mean(log(shifted_values))
log_sd <- sd(log(shifted_values))
scaled_dlnorm <- function(x) {
dlnorm(x + shift, meanlog = log_mean, sdlog = log_sd)
}
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
scaled_dlnorm <- function(x) {
dlnorm(x, meanlog = log_mean, sdlog = log_sd)
}
}
p <- p + stat_function(fun = scaled_dlnorm, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "unif") {
p <- p + stat_function(fun = dunif, args = list(min = min(values), max = max(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
}
}
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = Inf, label = paste("Mean =", round(mean_val, 2)),
vjust = 2, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = Inf, label = paste("Median =", round(median_val, 2)),
vjust = 4, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Density Plot",
subtitle = paste("n =", length(values)),
x = "Value",
y = "Density"
)
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
p
})
# Generate cumulative distribution function plot
output$cdf_plot <- renderPlot({
req(processed_data())
values <- processed_data()
# Create empirical CDF data
n <- length(values)
ordered_values <- sort(values)
ecdf_y <- (1:n) / n # avoids floating-point issues with seq(1/n, 1, by = 1/n)
ecdf_df <- data.frame(
x = ordered_values,
y = ecdf_y
)
# Create theoretical CDF data for comparison
x_range <- seq(min(values) - 0.5*sd(values), max(values) + 0.5*sd(values), length.out = 500)
if (input$compare_dist == "norm") {
theoretical_cdf <- pnorm(x_range, mean = mean(values), sd = sd(values))
dist_name <- "Normal"
} else if (input$compare_dist == "t") {
# Scale t distribution to match data mean and sd
z_scores <- (x_range - mean(values)) / sd(values)
theoretical_cdf <- pt(z_scores, df = input$t_df)
dist_name <- paste0("t (df = ", input$t_df, ")")
} else if (input$compare_dist == "exp") {
# Shift to positive domain if needed
min_shift <- min(values) - 0.01
theoretical_cdf <- pexp(x_range - min_shift, rate = 1/mean(values - min_shift))
dist_name <- "Exponential"
} else if (input$compare_dist == "lnorm") {
if (min(values) <= 0) {
# Shift data for lognormal
shift <- abs(min(values)) + 1
shifted_x <- x_range + shift
log_mean <- mean(log(values + shift))
log_sd <- sd(log(values + shift))
theoretical_cdf <- plnorm(shifted_x, meanlog = log_mean, sdlog = log_sd)
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
theoretical_cdf <- plnorm(x_range, meanlog = log_mean, sdlog = log_sd)
}
dist_name <- "Log-normal"
} else if (input$compare_dist == "unif") {
theoretical_cdf <- punif(x_range, min = min(values), max = max(values))
dist_name <- "Uniform"
}
theoretical_df <- data.frame(
x = x_range,
y = theoretical_cdf
)
# Create CDF plot
p <- ggplot() +
geom_step(data = ecdf_df, aes(x = x, y = y), color = "#c0392b", linewidth = 1.2) +
geom_line(data = theoretical_df, aes(x = x, y = y), color = "#2471a3",
linewidth = 1.2, linetype = "dashed")
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = 0, label = paste("Mean =", round(mean_val, 2)),
vjust = -1, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = 0, label = paste("Median =", round(median_val, 2)),
vjust = -3, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Cumulative Distribution Function",
subtitle = paste("Empirical CDF vs", dist_name, "CDF"),
x = "Value",
y = "Cumulative Probability"
) +
scale_y_continuous(limits = c(0, 1))
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
# Add legend
p <- p +
annotate("text", x = -Inf, y = 0.25, label = "Empirical CDF", color = "#c0392b",
hjust = -0.1, vjust = 0, fontface = "bold") +
annotate("text", x = -Inf, y = 0.15, label = paste(dist_name, "CDF"), color = "#2471a3",
hjust = -0.1, vjust = 0, fontface = "bold")
p
})
# Calculate distribution statistics
distribution_stats <- reactive({
req(processed_data())
values <- processed_data()
# Calculate basic statistics
n <- length(values)
mean_val <- mean(values)
median_val <- median(values)
sd_val <- sd(values)
min_val <- min(values)
max_val <- max(values)
range_val <- max_val - min_val
q1 <- quantile(values, 0.25)
q3 <- quantile(values, 0.75)
iqr <- IQR(values)
# Calculate skewness and kurtosis
skew_val <- tryCatch({
moments::skewness(values)
}, error = function(e) {
NA
})
kurt_val <- tryCatch({
moments::kurtosis(values) - 3 # Excess kurtosis (normal = 0)
}, error = function(e) {
NA
})
# Calculate normality test results
sw_test <- tryCatch({
shapiro.test(values)
}, error = function(e) {
list(statistic = NA, p.value = NA)
})
# Calculate mean-median relationship
mean_median_diff <- mean_val - median_val
mean_median_ratio <- ifelse(median_val != 0, mean_val / median_val, NA)
# Return all statistics
list(
n = n,
mean = mean_val,
median = median_val,
sd = sd_val,
min = min_val,
max = max_val,
range = range_val,
q1 = q1,
q3 = q3,
iqr = iqr,
skewness = skew_val,
kurtosis = kurt_val,
sw_stat = sw_test$statistic,
sw_p = sw_test$p.value,
mean_median_diff = mean_median_diff,
mean_median_ratio = mean_median_ratio,
bins = bin_count()
)
})
# Display summary results
output$summary_results <- renderPrint({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
stats <- distribution_stats()
cat("Distribution Analysis Results:\n")
cat("==============================\n\n")
cat("Basic Statistics:\n")
cat("----------------\n")
cat("Sample size:", stats$n, "\n")
cat("Minimum:", round(stats$min, 4), "\n")
cat("Maximum:", round(stats$max, 4), "\n")
cat("Range:", round(stats$range, 4), "\n")
cat("Mean:", round(stats$mean, 4), "\n")
cat("Median:", round(stats$median, 4), "\n")
cat("Standard deviation:", round(stats$sd, 4), "\n")
cat("Q1 (25th percentile):", round(stats$q1, 4), "\n")
cat("Q3 (75th percentile):", round(stats$q3, 4), "\n")
cat("Interquartile range (IQR):", round(stats$iqr, 4), "\n\n")
cat("Shape Characteristics:\n")
cat("---------------------\n")
if (!is.na(stats$skewness)) {
cat("Skewness:", round(stats$skewness, 4),
ifelse(abs(stats$skewness) < 0.5, " (approximately symmetric)",
ifelse(stats$skewness > 0, " (right-skewed/positive)", " (left-skewed/negative)")), "\n")
}
if (!is.na(stats$kurtosis)) {
cat("Excess kurtosis:", round(stats$kurtosis, 4),
ifelse(abs(stats$kurtosis) < 0.5, " (approximately mesokurtic)",
ifelse(stats$kurtosis > 0, " (leptokurtic/heavy-tailed)", " (platykurtic/light-tailed)")), "\n")
}
cat("Mean-median difference:", round(stats$mean_median_diff, 4),
ifelse(abs(stats$mean_median_diff) < 0.1*stats$sd, " (close)",
ifelse(stats$mean_median_diff > 0, " (mean > median, suggests right skew)",
" (mean < median, suggests left skew)")), "\n\n")
cat("Normality Assessment:\n")
cat("--------------------\n")
if (!is.na(stats$sw_stat)) {
cat("Shapiro-Wilk test: W =", round(stats$sw_stat, 4),
", p-value =", format.pval(stats$sw_p, digits = 4), "\n")
if (stats$sw_p < 0.05) {
cat("The p-value is less than 0.05, suggesting the data significantly\n")
cat("deviates from a normal distribution.\n\n")
} else {
cat("The p-value is greater than or equal to 0.05, suggesting the data\n")
cat("does not significantly deviate from a normal distribution.\n\n")
}
}
cat("Histogram Information:\n")
cat("----------------------\n")
cat("Number of bins:", stats$bins, "\n")
cat("Average observations per bin:", round(stats$n / stats$bins, 1), "\n")
})
# Generate interpretation text
output$interpretation <- renderUI({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
stats <- distribution_stats()
# Interpretation based on statistics
interpretation_text <- div(
h5("Distribution Shape Assessment"),
p("Based on the histogram and statistical analysis, your data shows the following characteristics:")
)
# Determine distribution characteristics
dist_shape <- ""
skew_text <- ""
kurt_text <- ""
normality_text <- ""
transformation_text <- ""
# Assess skewness
if (!is.na(stats$skewness)) {
if (abs(stats$skewness) < 0.5) {
skew_text <- "The distribution appears approximately symmetric (skewness near zero)."
} else if (stats$skewness > 1) {
skew_text <- "The distribution is strongly right-skewed (positive skew), with a long tail extending toward higher values."
} else if (stats$skewness > 0.5) {
skew_text <- "The distribution is moderately right-skewed (positive skew)."
} else if (stats$skewness < -1) {
skew_text <- "The distribution is strongly left-skewed (negative skew), with a long tail extending toward lower values."
} else if (stats$skewness < -0.5) {
skew_text <- "The distribution is moderately left-skewed (negative skew)."
}
}
# Assess kurtosis
if (!is.na(stats$kurtosis)) {
if (abs(stats$kurtosis) < 0.5) {
kurt_text <- "The distribution has a similar tail thickness to a normal distribution (mesokurtic)."
} else if (stats$kurtosis > 1) {
kurt_text <- "The distribution has heavy tails (leptokurtic), with more extreme values than expected in a normal distribution."
} else if (stats$kurtosis > 0.5) {
kurt_text <- "The distribution has slightly heavier tails than a normal distribution."
} else if (stats$kurtosis < -1) {
kurt_text <- "The distribution has light tails (platykurtic), with fewer extreme values than expected in a normal distribution."
} else if (stats$kurtosis < -0.5) {
kurt_text <- "The distribution has slightly lighter tails than a normal distribution."
}
}
# Assess normality based on Shapiro-Wilk test
if (!is.na(stats$sw_p)) {
if (stats$sw_p < 0.05) {
normality_text <- "Formal testing (Shapiro-Wilk) suggests the data significantly deviates from a normal distribution."
} else {
normality_text <- "Formal testing (Shapiro-Wilk) does not detect significant deviation from a normal distribution."
}
}
# Generate transformation recommendations
if (!is.na(stats$skewness)) {
if (stats$skewness > 0.7) {
transformation_text <- "For right-skewed data, consider applying a log, square root, or reciprocal transformation to achieve a more normal distribution."
} else if (stats$skewness < -0.7) {
transformation_text <- "For left-skewed data, consider applying a square, cube, or exponential transformation to achieve a more normal distribution."
}
}
# Complete interpretation
interpretation_text <- tagList(
interpretation_text,
tags$ul(
if (skew_text != "") tags$li(tags$strong(skew_text)),
if (kurt_text != "") tags$li(tags$strong(kurt_text)),
if (normality_text != "") tags$li(normality_text)
)
)
# Add advice based on the analysis
advice <- div(
h5("Recommendations"),
p("Based on this histogram analysis, consider the following:")
)
advice_items <- list()
if (!is.na(stats$sw_p) && stats$sw_p >= 0.05 && !is.na(stats$skewness) && abs(stats$skewness) < 0.5) {
# Appears approximately normal
advice_items <- c(advice_items,
list(tags$li("Your data appears approximately normally distributed. Parametric statistical methods are likely appropriate.")))
} else {
# Not clearly normal
advice_items <- c(advice_items,
list(tags$li("Your data shows deviations from normality. Consider using non-parametric methods if normality is an assumption in your analysis.")))
# Add transformation advice if appropriate
if (transformation_text != "") {
advice_items <- c(advice_items, list(tags$li(transformation_text)))
}
}
# Check for possible bimodality
# This is a simple heuristic and might not detect all bimodal distributions
if (!is.na(stats$kurtosis) && stats$kurtosis < -1.2) {
advice_items <- c(advice_items,
list(tags$li("Your data may have a bimodal or multimodal distribution. Consider density estimation or mixture modeling approaches.")))
}
# Always suggest visual confirmation
advice_items <- c(advice_items,
list(tags$li("Always supplement this histogram analysis with other normality checks like Q-Q plots and formal tests.")))
# Complete advice
advice <- tagList(
advice,
tags$ul(advice_items)
)
# Return complete interpretation
tagList(interpretation_text, advice)
})
}
# Color palettes for use in the app
fill_colors <- c(
"blue" = "#3498db",
"green" = "#2ecc71",
"red" = "#e74c3c",
"purple" = "#9b59b6",
"orange" = "#e67e22"
)
shinyApp(ui = ui, server = server)
How Histograms Reveal Distribution Properties
Histograms provide visual insights into multiple aspects of your data distribution:
Key Distribution Characteristics
When analyzing a histogram, look for these key characteristics (a short R sketch for computing them follows the list):
- Central Tendency: Where is the center of the distribution?
- Mean: The arithmetic average (affected by outliers)
- Median: The middle value (robust to outliers)
- In a normal distribution, mean = median
- In skewed distributions, mean and median differ
- Spread: How dispersed is the data?
- Standard Deviation: Average distance from the mean
- Range: Distance from minimum to maximum
- Interquartile Range (IQR): Distance between 25th and 75th percentiles
- Skewness: Is the distribution symmetric or asymmetric?
- Right-skewed (positive): Longer tail to the right, mean > median
- Left-skewed (negative): Longer tail to the left, mean < median
- Symmetric: Balanced on both sides, mean ≈ median
- Kurtosis: How heavy are the tails compared to a normal distribution?
- Mesokurtic: Normal tail thickness (normal distribution)
- Leptokurtic (positive kurtosis): Heavier tails, more extreme values
- Platykurtic (negative kurtosis): Lighter tails, fewer extreme values
- Modality: How many peaks does the distribution have?
- Unimodal: One peak (e.g., normal distribution)
- Bimodal: Two peaks (possibly two subpopulations)
- Multimodal: Multiple peaks (possibly multiple subpopulations)
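All of these numeric summaries can be computed directly in R. A minimal sketch, using base R plus the moments package that the app above already loads:

```r
library(moments) # skewness() and kurtosis()

x <- c(23.2, 24.5, 21.6, 22.8, 25.3, 20.9, 23.7, 22.2, 24.8, 21.5)

mean(x)          # central tendency: arithmetic average (outlier-sensitive)
median(x)        # central tendency: middle value (outlier-robust)
sd(x)            # spread: standard deviation
IQR(x)           # spread: interquartile range (Q3 - Q1)
skewness(x)      # asymmetry: > 0 right-skewed, < 0 left-skewed
kurtosis(x) - 3  # excess kurtosis: about 0 for normal tails
plot(density(x)) # modality: count the peaks in the density estimate
```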
Mathematical Formulation
The mathematical calculations behind histogram analysis include:
Bin Width Calculation: Different methods exist for determining optimal bin width:
- Sturges’ Rule: \(k = 1 + \log_2(n)\)
- Scott’s Rule: \(h = 3.5\sigma n^{-1/3}\)
- Freedman-Diaconis Rule: \(h = 2 \times IQR \times n^{-1/3}\)
Where \(n\) is sample size, \(\sigma\) is standard deviation, and \(IQR\) is interquartile range.
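As a sketch, the three rules translate directly into R; this mirrors the `bin_width()` reactive in the app above:

```r
x <- rnorm(200) # example data; substitute your own numeric vector
n <- length(x)

h_sturges <- diff(range(x)) / (1 + log2(n)) # Sturges: width = range / k
h_scott   <- 3.5 * sd(x) * n^(-1/3)         # Scott's rule
h_fd      <- 2 * IQR(x) * n^(-1/3)          # Freedman-Diaconis rule

c(sturges = h_sturges, scott = h_scott, fd = h_fd) # the widths usually differ
```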
Skewness Calculation:
\[\text{Skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3/n}{s^3}\]
Where \(\bar{x}\) is the mean and \(s\) is the standard deviation.
Kurtosis Calculation:
\[\text{Kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4/n}{s^4} - 3\]
The “-3” term gives excess kurtosis, where 0 represents a normal distribution.
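A minimal sketch of both moment formulas, cross-checked against the moments package used by the app (note that these formulas use the population standard deviation, i.e. dividing by n rather than n - 1):

```r
x <- c(1.2, 3.4, 2.2, 8.9, 2.5, 3.1, 2.8, 4.0, 2.6, 12.3) # right-skewed toy data
n <- length(x)
m <- mean(x)
s <- sqrt(sum((x - m)^2) / n) # population standard deviation

skew    <- sum((x - m)^3) / n / s^3     # third standardized moment
ex_kurt <- sum((x - m)^4) / n / s^4 - 3 # fourth standardized moment, minus 3

c(skew, moments::skewness(x))        # should match
c(ex_kurt, moments::kurtosis(x) - 3) # should match
```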
Histograms vs. Other Visualization Methods
Method | Strengths | Limitations |
---|---|---|
Histogram | Shows full distribution shape; easy to interpret | Bin width selection affects appearance; discontinuous |
Box Plot | Shows outliers and quartiles concisely | Hides distribution shape details |
Density Plot | Smooth continuous estimate of distribution | Requires bandwidth selection; somewhat abstract |
Q-Q Plot | Excellent for comparing to theoretical distributions | More difficult to interpret for non-specialists |
ECDF Plot | Shows full distribution without binning decisions | Less intuitive for general audience |
For the most comprehensive assessment, use histograms alongside other visualization methods.
Example 1: Normally Distributed Data
A university instructor collected final exam scores for a class of 120 students and wants to understand the distribution of scores.
Data Summary:
- Sample size: 120 students
- Mean score: 72.5
- Median score: 73.0
- Standard deviation: 8.6
- Range: 48 to 97
- Skewness: -0.12
- Kurtosis: -0.08
Analysis:
- Histogram Assessment:
- The histogram shows a roughly symmetric bell-shaped curve
- The peak is centered near 73
- Tails on both sides are approximately equal
- Mean (72.5) and median (73.0) are very close
- Statistical Interpretation:
- Skewness near zero (-0.12) indicates good symmetry
- Kurtosis near zero (-0.08) indicates normal tail thickness
- Shapiro-Wilk test: W = 0.987, p = 0.294 (supports normality)
Conclusion:
The exam scores follow an approximately normal distribution. The slight negative skewness is minimal and not statistically significant. This normal distribution suggests that the exam difficulty was appropriate for the student population, with most students scoring near the average and fewer students at the extremes.
How to Report: “Examination of the histogram and statistical analysis revealed that student exam scores (M = 72.5, SD = 8.6) followed an approximately normal distribution (Shapiro-Wilk: W = 0.987, p = 0.294). The symmetric bell-shaped distribution with nearly identical mean and median (73.0) supports the use of parametric statistical methods for further analysis of this data.”
Example 2: Right-Skewed Data
A researcher collected data on customer wait times (in minutes) at a service center.
Data Summary:
- Sample size: 75 customers
- Mean wait time: 12.4 minutes
- Median wait time: 9.8 minutes
- Standard deviation: 8.3 minutes
- Range: 1.2 to 42.5 minutes
- Skewness: 1.63
- Kurtosis: 2.45
Analysis:
- Histogram Assessment:
- The histogram shows a clear right-skewed (positive skew) distribution
- Many customers had short wait times (peak at left)
- A long tail extends to the right (some customers waited much longer)
- Mean (12.4) is noticeably larger than median (9.8)
- Statistical Interpretation:
- High positive skewness (1.63) confirms right skew
- Positive kurtosis (2.45) indicates heavy tails with some extreme values
- Shapiro-Wilk test: W = 0.824, p < 0.001 (rejects normality)
Conclusion: The wait time data follows a strongly right-skewed distribution, which is common for time-related measurements. Most customers experience relatively short wait times, but a small percentage face much longer waits. This skewed pattern suggests the service center generally handles customers efficiently but occasionally experiences backups or complex cases.
How to Report: “Histogram analysis of customer wait times revealed a strongly right-skewed distribution (skewness = 1.63), with most customers experiencing relatively short waits (Mdn = 9.8 minutes) but some facing much longer times, pulling the mean higher (M = 12.4 minutes). For statistical analyses of this non-normal data (Shapiro-Wilk: W = 0.824, p < 0.001), non-parametric methods or log transformation would be appropriate.”
Common Histogram Shapes and What They Mean
Understanding specific histogram shapes helps identify the exact nature of your data:
Normal Distribution
A normal distribution shows:
- Symmetric bell-shaped curve
- Mean and median are nearly identical
- Approximately 68% of data within one standard deviation of the mean
- Common in natural phenomena, measurement errors, and averages of random variables
- Example applications: Height, IQ scores, measurement errors
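The 68% figure is easy to verify numerically in R:

```r
pnorm(1) - pnorm(-1) # ~0.6827: share of a normal distribution within 1 SD
pnorm(2) - pnorm(-2) # ~0.9545: within 2 SDs
pnorm(3) - pnorm(-3) # ~0.9973: within 3 SDs
```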
Right-Skewed (Positive Skew) Distribution
A right-skewed distribution shows:
- Peak toward the left (lower values)
- Long tail extending to the right (higher values)
- Mean > median
- Common in time-to-event data, income, and counts with lower bounds
- Example applications: Income distributions, reaction times, house prices
Left-Skewed (Negative Skew) Distribution
A left-skewed distribution shows:
- Peak toward the right (higher values)
- Long tail extending to the left (lower values)
- Mean < median
- Common in bounded distributions near their upper limit
- Example applications: Age at death, exam scores on easy tests, purity measures
Bimodal Distribution
A bimodal distribution shows:
- Two distinct peaks
- Suggests two subpopulations or groups within the data
- May indicate a mixture of two different distributions
- Example applications: Heights of mixed gender groups, political opinions in polarized contexts
Uniform Distribution
A uniform distribution shows:
- Roughly equal frequency across all bins
- No clear peak
- Common in randomly generated numbers or well-mixed populations
- Example applications: Random numbers, positions of randomly placed objects
Heavy-Tailed Distribution
A heavy-tailed distribution shows:
- More extreme values than expected in a normal distribution
- Central peak may be higher and narrower
- Positive kurtosis value
- Example applications: Financial returns, catastrophic event magnitudes
Choosing Data Transformations Based on Histograms
Histograms can guide your choice of data transformation to achieve a more normal distribution:
Distribution Pattern | Recommended Transformations |
---|---|
Right-skewed (positive) | Log, Square root, Reciprocal, Box-Cox |
Left-skewed (negative) | Square, Cube, Exponential, Reflect-and-log |
Heavy-tailed | Winsorization, Trimming, Rank transformation |
Bimodal | Consider analyzing subgroups separately |
After applying a transformation, create a new histogram to assess whether normality has improved.
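A minimal sketch of this check for right-skewed data, using synthetic values (the `wait_times` name is illustrative):

```r
set.seed(42)
wait_times <- rlnorm(200, meanlog = 2, sdlog = 0.6) # synthetic right-skewed data

moments::skewness(wait_times) # strongly positive before transformation

log_wait <- log(wait_times)   # log transform; requires strictly positive values
moments::skewness(log_wait)   # near zero afterwards

op <- par(mfrow = c(1, 2))
hist(wait_times, main = "Original (right-skewed)")
hist(log_wait, main = "Log-transformed")
par(op)
```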
Test Your Understanding
1. What does a right-skewed (positively skewed) histogram indicate about the relationship between mean and median?
   - A. Mean equals median
   - B. Mean is greater than median
   - C. Mean is less than median
   - D. No consistent relationship
2. What bin width method is generally recommended for histograms when you want to detect bimodality?
   - A. Sturges’ rule
   - B. Scott’s rule
   - C. Freedman-Diaconis rule
   - D. Equal-width bins
3. What transformation is typically most effective for right-skewed data?
   - A. Square transformation
   - B. Log transformation
   - C. Subtraction of a constant
   - D. Addition of a constant
4. What does positive kurtosis (leptokurtic distribution) indicate about a distribution’s shape?
   - A. It has heavier tails than a normal distribution
   - B. It has lighter tails than a normal distribution
   - C. It is more skewed than a normal distribution
   - D. It has multiple peaks
5. Why is the histogram a better visualization than a boxplot for detecting bimodality?
   - A. Histograms are always more precise
   - B. Histograms show the complete shape of the distribution
   - C. Histograms require fewer data points
   - D. Histograms never show outliers
Answers: 1-B, 2-C, 3-B, 4-A, 5-B
Common Questions About Histogram Analysis
How does the number of bins affect my histogram?
The number of bins in a histogram significantly affects its appearance and interpretation:
- Too few bins: Obscures important details and patterns in the data
- Too many bins: Creates a noisy, jagged appearance with many empty or near-empty bins
Several methods exist for determining optimal bin count:
- Sturges’ rule: \(k = 1 + \log_2(n)\) where \(n\) is sample size
- Scott’s rule: Based on minimizing integrated mean squared error for normal distributions
- Freedman-Diaconis rule: Uses IQR to be robust against outliers
- Square root rule: \(k = \sqrt{n}\)
For detecting bimodality or fine structure, the Freedman-Diaconis rule often works best. For general purposes, using 10-20 bins for small to medium datasets is a reasonable starting point. Always try multiple bin widths to check if the apparent distribution shape is sensitive to bin choice.
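Base R ships implementations of these rules in the grDevices package, so you can compare the suggested bin counts directly; a short sketch:

```r
x <- rnorm(500)

c(sturges = nclass.Sturges(x),        # also hist()'s default rule
  scott   = nclass.scott(x),
  fd      = nclass.FD(x),
  sqrt    = ceiling(sqrt(length(x)))) # square root rule, computed by hand

hist(x, breaks = "FD") # hist() also accepts the rule name directly
```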
What is the difference between a frequency histogram and a density histogram?
Both display the same data but with different y-axis scaling:
- Frequency histogram: Y-axis shows the count of observations in each bin
- Advantage: Shows actual observation counts, which can be helpful for understanding sample size
- Disadvantage: Difficult to compare distributions with different sample sizes
- Density histogram: Y-axis shows the proportion or probability density
- Advantage: Area under the histogram equals 1, making it comparable to theoretical density functions
- Advantage: Allows comparison between datasets of different sizes
- Disadvantage: Less intuitive interpretation of the y-axis values
Most statistical analysis focuses on density histograms because they facilitate comparison with theoretical distributions (like normal curves) and between different samples.
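A minimal base-R illustration of the two scalings:

```r
x <- rnorm(300, mean = 50, sd = 10)

op <- par(mfrow = c(1, 2))
hist(x, freq = TRUE, main = "Frequency", ylab = "Count")
hist(x, freq = FALSE, main = "Density", ylab = "Density")
# Overlaying a theoretical curve only makes sense on the density scale
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, lwd = 2, lty = 2)
par(op)
```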
How do I interpret skewness and kurtosis values?
Skewness measures the asymmetry of a distribution:
- Values near 0: Approximately symmetric
- Positive values: Right-skewed (tail extends to the right)
- Negative values: Left-skewed (tail extends to the left)
- Rule of thumb for significance:
- |Skewness| < 0.5: Approximately symmetric
- 0.5 < |Skewness| < 1: Moderately skewed
- |Skewness| > 1: Highly skewed
Kurtosis (specifically excess kurtosis) measures the heaviness of the tails relative to a normal distribution:
- Value of 0: Same tail thickness as a normal distribution (mesokurtic)
- Positive values: Heavier tails than normal (leptokurtic)
- Negative values: Lighter tails than normal (platykurtic)
- Rule of thumb for significance:
- |Kurtosis| < 0.5: Approximately normal tails
- 0.5 < |Kurtosis| < 1: Moderately non-normal tails
- |Kurtosis| > 1: Extremely non-normal tails
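These rules of thumb are easy to wrap in a small helper; the `describe_shape()` function below is an illustrative sketch, not part of the app:

```r
describe_shape <- function(x) {
  sk <- moments::skewness(x)
  ku <- moments::kurtosis(x) - 3 # excess kurtosis
  list(
    skewness = sk,
    kurtosis = ku,
    symmetry = if (abs(sk) < 0.5) "approximately symmetric"
               else if (sk > 0) "right-skewed" else "left-skewed",
    tails    = if (abs(ku) < 0.5) "approximately normal tails"
               else if (ku > 0) "heavy (leptokurtic)" else "light (platykurtic)"
  )
}

describe_shape(rexp(200)) # exponential data: right-skewed with heavy tails
```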
Can histograms be used to detect outliers?
Yes, histograms can help identify outliers, but with some limitations:
Advantages:
- Provide visual context for what values might be unusually far from the central tendency
- Show whether potential outliers align with the overall distribution shape
- Can reveal if “outliers” are actually part of a heavy-tailed distribution
Limitations:
- Binning can obscure individual extreme values
- Very extreme outliers might not appear if they fall outside the plot range
- Not as precise as dedicated outlier detection methods
For a more complete outlier assessment, combine histograms with:
- Box plots (which explicitly mark outliers)
- Z-scores or modified Z-scores
- Statistical tests like Grubbs’ test or Dixon’s Q test
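A short sketch of the first two checks in base R (`boxplot.stats()` and `mad()` ship with the default packages):

```r
x <- c(rnorm(100), 8, -7) # mostly normal data with two planted outliers

# Box-plot rule: points beyond 1.5 * IQR from the quartiles
boxplot.stats(x)$out

# Modified z-scores based on the median and raw MAD (robust to the outliers)
mz <- 0.6745 * (x - median(x)) / mad(x, constant = 1)
x[abs(mz) > 3.5] # 3.5 is a commonly used cutoff
```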
Are histograms sufficient for testing normality?
Histograms alone are not sufficient for definitive normality testing, but they provide valuable initial insight:
What histograms can tell you:
- Obvious deviations from normality (strong skewness, multiple peaks)
- General distribution shape and potential issues
- Whether transformation might be necessary
Why histograms aren’t sufficient:
- Bin width choice affects the apparent shape
- Visual assessment is subjective
- Small samples may not show clear patterns
- Cannot provide objective criteria for decision-making
For comprehensive normality assessment, combine histograms with:
1. Q-Q plots (more sensitive to deviations in the tails)
2. Formal tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
3. Skewness and kurtosis statistics
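A minimal sketch combining all three checks on one sample:

```r
x <- rlnorm(100) # skewed example data

# 1. Q-Q plot: points should track the line if the data are normal
qqnorm(x)
qqline(x, lwd = 2)

# 2. Formal test (Shapiro-Wilk is valid for 3 <= n <= 5000)
shapiro.test(x)

# 3. Shape statistics
c(skewness = moments::skewness(x),
  excess_kurtosis = moments::kurtosis(x) - 3)
```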
How can I detect bimodality in my data?
Bimodality (having two peaks) can indicate mixed populations or interesting subgroups:
Tips for detecting bimodality:
- Use an appropriate bin width - too few bins might obscure bimodality
- The Freedman-Diaconis rule is often good for detecting multiple modes
- Look for a “valley” between two peaks
- Try kernel density estimation as a smoothed alternative
Formal approaches:
- Hartigan’s dip test - tests specifically for unimodality vs. multimodality
- Silverman’s test - uses kernel density estimation to test for modes
- Mixture model fitting - statistical test of whether two component distributions fit better than one
Remember that sampling variation can create apparent bimodality, so be cautious with small samples.
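A brief sketch of the informal and formal routes; the diptest package is an assumed extra dependency:

```r
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4)) # two-component mixture

plot(density(x)) # kernel density estimate: look for two peaks and a valley

# Hartigan's dip test of unimodality (install.packages("diptest") if needed)
diptest::dip.test(x) # a small p-value suggests more than one mode
```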
Examples of When to Use Histogram Analysis
- Data exploration: Initial assessment of any new dataset’s distribution
- Normality checking: Before applying parametric statistical methods
- Transformation selection: To determine appropriate data transformations
- Outlier detection: To identify unusual values in their distributional context
- Quality control: To monitor process distributions and detect shifts
- Finance: To analyze return distributions and assess risk
- Healthcare: To examine patient metrics and lab test distributions
- Education: To evaluate test score distributions and identify student groups
- Manufacturing: To assess product variation and quality metrics
- Environmental science: To understand distributions of pollutants or natural phenomena
References
- Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66(3), 605-610.
- Freedman, D., & Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4), 453-476.
- Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association, 21(153), 65-66.
- Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician), 47(1), 183-189.
- Knuth, D. E. (2014). Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley Professional.
- Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman and Hall.
- Hartigan, J. A., & Hartigan, P. M. (1985). The dip test of unimodality. The Annals of Statistics, 13(1), 70-84.
Citation
@online{kassambara2025,
  author = {Kassambara, Alboukadel},
  title = {Histogram Analysis Tool | Check Data Distribution \& Normality},
  date = {2025-04-14},
  url = {https://www.datanovia.com/apps/statfusion/analysis/inferential/goodness-fit/normality/histogram-distribution-analysis.html},
  langid = {en}
}