flowchart TD
  A[Raw Data] --> B[Sort into Bins]
  B --> C[Count Observations\nin Each Bin]
  C --> D[Plot as Histogram]
  D --> E[Analyze Distribution Shape]
  E --> F[Central Tendency\nMean & Median]
  E --> G[Spread\nStandard Deviation & IQR]
  E --> H[Skewness\nSymmetry]
  E --> I[Kurtosis\nTail Heaviness]
  E --> J[Modality\nNumber of Peaks]
Key Takeaways: Histogram Distribution Analysis
- Purpose: Visualize and analyze data distribution shapes using histograms
- What it shows: Data spread, central tendency, skewness, kurtosis, and potential outliers
- Key metrics: Mean, median, standard deviation, skewness, and kurtosis
- Normal distribution: Appears as a symmetric bell-shaped curve with mean = median
- Skewed distributions: Right-skewed (tail extends to right) or left-skewed (tail extends to left)
- Applications: Normality testing, distribution comparison, data exploration
- Advantages over formal tests: Provides visual intuition about distribution characteristics
What are Histogram Distribution Analyses?
Histograms are powerful graphical tools that display the frequency distribution of a dataset by dividing it into bins (intervals) and showing the count or density of observations in each bin. They provide immediate visual insights into a dataset’s shape, center, spread, and potential outliers, making them one of the most fundamental and useful tools in data analysis.
When to use histogram distribution analysis:
- When exploring a new dataset to understand its basic distribution characteristics
- When checking if data follows a normal distribution or other theoretical distribution
- When identifying skewness, kurtosis, or multimodality in your data
- When deciding whether to use parametric or non-parametric statistical methods
- When selecting appropriate data transformations
- When communicating data distribution features to non-technical audiences
- When checking for outliers or unusual patterns in your data
This interactive tool allows you to quickly generate histograms, analyze distribution shapes, and receive detailed interpretations to guide your statistical analysis decisions.
#| standalone: true
#| viewerHeight: 1400
library(shiny)
library(bslib)
library(ggplot2)
library(bsicons)
library(vroom)
library(shinyjs)
library(moments) # For skewness and kurtosis
ui <- page_sidebar(
title = "Distribution Histogram Analysis: Assess Data Normality",
useShinyjs(), # Enable shinyjs for dynamic UI updates
sidebar = sidebar(
width = 400,
card(
card_header("Data Input"),
accordion(
accordion_panel(
"Manual Input",
textAreaInput("data_input", "Enter your data (one value per row):", rows = 8,
placeholder = "Paste values here..."),
div(
actionLink("use_example", "Use example data", style = "color:#0275d8;"),
tags$span(bs_icon("file-earmark-text"), style = "margin-left: 5px; color: #0275d8;")
)
),
accordion_panel(
"File Upload",
fileInput("file_upload", "Upload CSV or TXT file:",
accept = c("text/csv", "text/plain", ".csv", ".txt")),
checkboxInput("header", "File has header", TRUE),
conditionalPanel(
condition = "output.file_uploaded",
div(
selectInput("selected_var", "Select variable:", choices = NULL),
actionButton("clear_file", "Clear File", class = "btn-danger btn-sm")
)
)
),
id = "input_method",
open = 1
),
# Plot Settings accordion
accordion(
accordion_panel(
"Histogram Settings",
card(
card_header("Bin Options:"),
card_body(
selectInput("bin_method", "Bin Width Method:",
choices = c("Auto" = "auto",
"Sturges" = "sturges",
"Scott" = "scott",
"Freedman-Diaconis" = "fd",
"Manual" = "manual"),
selected = "auto"),
conditionalPanel(
condition = "input.bin_method == 'manual'",
sliderInput("bin_count", "Number of Bins:", min = 5, max = 100, value = 30, step = 1)
)
)
),
card(
card_header("Distribution Options:"),
card_body(
checkboxInput("show_density", "Show Density Curve", TRUE),
checkboxInput("show_normal", "Show Normal Curve", TRUE),
checkboxInput("show_rug", "Show Data Rug", FALSE),
selectInput("compare_dist", "Compare with Distribution:",
choices = c("Normal" = "norm",
"t (Student's)" = "t",
"Exponential" = "exp",
"Lognormal" = "lnorm",
"Uniform" = "unif"),
selected = "norm"),
conditionalPanel(
condition = "input.compare_dist == 't'",
sliderInput("t_df", "Degrees of Freedom:", min = 1, max = 30, value = 5, step = 1)
)
)
),
card(
card_header("Visual Options:"),
card_body(
checkboxInput("show_stats", "Show Statistics Overlay", TRUE),
selectInput("fill_color", "Histogram Color:",
choices = c("Blue" = "blue",
"Green" = "green",
"Red" = "red",
"Purple" = "purple",
"Orange" = "orange"),
selected = "blue"),
selectInput("theme_choice", "Plot Theme:",
choices = c("Minimal" = "minimal",
"Classic" = "classic",
"Light" = "light",
"Dark" = "dark"),
selected = "minimal")
)
)
),
open = FALSE
),
actionButton("analyze", "Analyze Distribution", class = "btn btn-primary")
),
hr(),
card(
card_header("Interpretation Guide"),
card_body(
div(class = "alert alert-info",
tags$p("The histogram reveals important characteristics of your data distribution:"),
tags$ul(
tags$li(tags$b("Bell shape:"), " Suggests normal distribution."),
tags$li(tags$b("Right skew:"), " Tail extends to the right (higher values)."),
tags$li(tags$b("Left skew:"), " Tail extends to the left (lower values)."),
tags$li(tags$b("Bimodal:"), " Two peaks, suggesting two subgroups."),
tags$li(tags$b("Uniform:"), " Similar frequency across all values."),
tags$li(tags$b("Heavy tails:"), " More extreme values than expected in normal distribution.")
)
)
)
)
),
layout_column_wrap(
width = 1,
card(
card_header("Distribution Visualization"),
card_body(
navset_tab(
nav_panel("Histogram", plotOutput("histogram", height = "500px")),
nav_panel("Density Plot", plotOutput("density_plot", height = "500px")),
nav_panel("Cumulative Distribution", plotOutput("cdf_plot", height = "500px"))
)
)
),
card(
card_header("Distribution Assessment"),
card_body(
navset_tab(
nav_panel("Results", uiOutput("error_message"), verbatimTextOutput("summary_results")),
nav_panel("Interpretation", div(style = "font-size: 0.9rem;",
uiOutput("interpretation"),
hr(),
h5("Common Distribution Shapes"),
div(
class = "row",
div(class = "col-md-6",
h6("Normal Distribution:"),
p("Symmetric bell-shaped curve, mean = median."),
h6("Right-Skewed (Positive):"),
p("Longer tail to the right, mean > median."),
h6("Left-Skewed (Negative):"),
p("Longer tail to the left, mean < median.")
),
div(class = "col-md-6",
h6("Bimodal Distribution:"),
p("Two distinct peaks, suggesting two subgroups."),
h6("Uniform Distribution:"),
p("Similar frequency across all values."),
h6("Multimodal Distribution:"),
p("Multiple peaks, suggesting multiple subgroups.")
)
)
))
)
)
)
)
)
server <- function(input, output, session) {
# Example data - mixed normal distribution with slight skew
example_data <- "23.2\n24.5\n21.6\n22.8\n25.3\n20.9\n23.7\n22.2\n24.8\n21.5\n23.9\n25.6\n22.3\n24.1\n23.4\n21.8\n25.1\n24.3\n23.6\n22.1\n24.9\n23.2\n21.7\n25.8\n22.9\n24.4\n23.8\n21.3\n26.4\n28.7\n19.5\n18.8"
# Track input method
input_method <- reactiveVal("manual")
# Function to clear file inputs
clear_file_inputs <- function() {
updateSelectInput(session, "selected_var", choices = NULL)
reset("file_upload")
}
# Function to clear text inputs
clear_text_inputs <- function() {
updateTextAreaInput(session, "data_input", value = "")
}
# When example data is used, clear file inputs and set text inputs
observeEvent(input$use_example, {
input_method("manual")
clear_file_inputs()
updateTextAreaInput(session, "data_input", value = example_data)
})
# When file is uploaded, clear text inputs and set file method
observeEvent(input$file_upload, {
if (!is.null(input$file_upload)) {
input_method("file")
clear_text_inputs()
}
})
# When clear file button is clicked, clear file and set manual method
observeEvent(input$clear_file, {
input_method("manual")
clear_file_inputs()
})
# When text input changes, clear file inputs if it has content
observeEvent(input$data_input, {
if (!is.null(input$data_input) && nchar(input$data_input) > 0) {
input_method("manual")
clear_file_inputs()
}
}, ignoreInit = TRUE)
# Process uploaded file
file_data <- reactive({
req(input$file_upload)
tryCatch({
vroom::vroom(input$file_upload$datapath, delim = NULL, col_names = input$header, show_col_types = FALSE)
}, error = function(e) {
showNotification(paste("File read error:", e$message), type = "error")
NULL
})
})
# Update variable selection dropdown with numeric columns from uploaded file
observe({
df <- file_data()
if (!is.null(df)) {
num_vars <- names(df)[sapply(df, is.numeric)]
updateSelectInput(session, "selected_var", choices = num_vars)
}
})
output$file_uploaded <- reactive({
!is.null(input$file_upload)
})
outputOptions(output, "file_uploaded", suspendWhenHidden = FALSE)
# Function to parse text input
parse_text_input <- function(text) {
if (is.null(text) || text == "") return(NULL)
input_lines <- strsplit(text, "\\r?\\n")[[1]]
input_lines <- input_lines[input_lines != ""]
numeric_values <- suppressWarnings(as.numeric(input_lines))
if (all(is.na(numeric_values))) return(NULL)
return(na.omit(numeric_values))
}
# Get data values based on input method
data_values <- reactive({
if (input_method() == "file" && !is.null(file_data()) && !is.null(input$selected_var)) {
df <- file_data()
return(na.omit(df[[input$selected_var]]))
} else {
return(parse_text_input(input$data_input))
}
})
# Validate input data
validate_data <- reactive({
values <- data_values()
if (is.null(values)) {
return("Error: Please check your input. Make sure all values are numeric.")
}
if (length(values) < 5) {
return("Error: At least 5 values are recommended for meaningful distribution analysis.")
}
if (length(unique(values)) == 1) {
return("Error: All values are identical. Distribution analysis requires variation in the data.")
}
return(NULL)
})
# Display error messages
output$error_message <- renderUI({
error <- validate_data()
if (!is.null(error) && input$analyze > 0) {
if (startsWith(error, "Warning")) {
div(class = "alert alert-warning", error)
} else {
div(class = "alert alert-danger", error)
}
}
})
# Prepare data for analysis
processed_data <- reactive({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
values <- data_values()
values
})
# Calculate bin width based on selected method
bin_width <- reactive({
req(processed_data())
values <- processed_data()
if (input$bin_method == "auto") {
# Automatic binning
NULL
} else if (input$bin_method == "manual") {
# Manual bin count
diff(range(values)) / input$bin_count
} else {
# Other methods
h <- switch(input$bin_method,
"sturges" = diff(range(values)) / (1 + log2(length(values))),
"scott" = 3.5 * sd(values) * length(values)^(-1/3),
"fd" = 2 * IQR(values) * length(values)^(-1/3),
NULL)
h
}
})
# Calculate number of bins
bin_count <- reactive({
req(processed_data())
values <- processed_data()
if (input$bin_method == "manual") {
input$bin_count
} else if (input$bin_method == "auto") {
min(30, max(10, ceiling(length(values)/5)))
} else {
ceiling(diff(range(values)) / bin_width())
}
})
# Generate histogram
output$histogram <- renderPlot({
req(processed_data())
values <- processed_data()
# Create base histogram
p <- ggplot(data.frame(x = values), aes(x = x))
# Choose color based on input
fill_colors <- c(
"blue" = "#3498db",
"green" = "#2ecc71",
"red" = "#e74c3c",
"purple" = "#9b59b6",
"orange" = "#e67e22"
)
fill_color <- fill_colors[input$fill_color]
line_color <- colorspace::darken(fill_color, 0.3) # requires the colorspace package
# Add histogram with density scaling; manual and auto binning specify a
# bin count, while the rule-based methods specify a bin width
if (input$bin_method %in% c("manual", "auto")) {
p <- p + geom_histogram(aes(y = after_stat(density)), bins = bin_count(),
fill = fill_color, color = line_color, alpha = 0.7)
} else {
p <- p + geom_histogram(aes(y = after_stat(density)), binwidth = bin_width(),
fill = fill_color, color = line_color, alpha = 0.7)
}
# Add density curve if selected
if (input$show_density) {
p <- p + geom_density(color = "#c0392b", linewidth = 1.2)
}
# Add rug plot if selected
if (input$show_rug) {
p <- p + geom_rug(color = "#34495e", alpha = 0.5)
}
# Add normal curve if selected
if (input$show_normal) {
if (input$compare_dist == "norm") {
p <- p + stat_function(fun = dnorm, args = list(mean = mean(values), sd = sd(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "t") {
# Scale the t-distribution to match data
scaled_dt <- function(x) {
sd_val <- sd(values)
mean_val <- mean(values)
dt((x - mean_val) / sd_val, df = input$t_df) / sd_val
}
p <- p + stat_function(fun = scaled_dt, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "exp") {
# Scale exponential to match data
rate_param <- 1 / mean(values - min(values) + 0.01)
scaled_dexp <- function(x) {
dexp(x - min(values) + 0.01, rate = rate_param)
}
p <- p + stat_function(fun = scaled_dexp, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "lnorm") {
# Scale lognormal to match data
if (min(values) <= 0) {
# Shift data to positive domain for lognormal
shift <- abs(min(values)) + 1
shifted_values <- values + shift
log_mean <- mean(log(shifted_values))
log_sd <- sd(log(shifted_values))
scaled_dlnorm <- function(x) {
dlnorm(x + shift, meanlog = log_mean, sdlog = log_sd)
}
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
scaled_dlnorm <- function(x) {
dlnorm(x, meanlog = log_mean, sdlog = log_sd)
}
}
p <- p + stat_function(fun = scaled_dlnorm, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "unif") {
p <- p + stat_function(fun = dunif, args = list(min = min(values), max = max(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
}
}
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
sd_val <- sd(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = Inf, label = paste("Mean =", round(mean_val, 2)),
vjust = 2, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = Inf, label = paste("Median =", round(median_val, 2)),
vjust = 4, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Distribution Histogram",
subtitle = paste("n =", length(values), ", bins =", bin_count()),
x = "Value",
y = "Density"
)
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
p
})
# Generate density plot
output$density_plot <- renderPlot({
req(processed_data())
values <- processed_data()
# Create density plot
p <- ggplot(data.frame(x = values), aes(x = x)) +
geom_density(fill = fill_colors[input$fill_color], alpha = 0.5, color = "#2c3e50", linewidth = 1)
# Add rug plot if selected
if (input$show_rug) {
p <- p + geom_rug(color = "#34495e", alpha = 0.5)
}
# Add comparative distribution if selected
if (input$show_normal) {
if (input$compare_dist == "norm") {
p <- p + stat_function(fun = dnorm, args = list(mean = mean(values), sd = sd(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "t") {
# Scale the t-distribution to match data
scaled_dt <- function(x) {
sd_val <- sd(values)
mean_val <- mean(values)
dt((x - mean_val) / sd_val, df = input$t_df) / sd_val
}
p <- p + stat_function(fun = scaled_dt, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "exp") {
# Scale exponential to match data
rate_param <- 1 / mean(values - min(values) + 0.01)
scaled_dexp <- function(x) {
dexp(x - min(values) + 0.01, rate = rate_param)
}
p <- p + stat_function(fun = scaled_dexp, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "lnorm") {
# Scale lognormal to match data
if (min(values) <= 0) {
# Shift data to positive domain for lognormal
shift <- abs(min(values)) + 1
shifted_values <- values + shift
log_mean <- mean(log(shifted_values))
log_sd <- sd(log(shifted_values))
scaled_dlnorm <- function(x) {
dlnorm(x + shift, meanlog = log_mean, sdlog = log_sd)
}
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
scaled_dlnorm <- function(x) {
dlnorm(x, meanlog = log_mean, sdlog = log_sd)
}
}
p <- p + stat_function(fun = scaled_dlnorm, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
} else if (input$compare_dist == "unif") {
p <- p + stat_function(fun = dunif, args = list(min = min(values), max = max(values)),
color = "#2471a3", linewidth = 1.2, linetype = "dashed")
}
}
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = Inf, label = paste("Mean =", round(mean_val, 2)),
vjust = 2, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = Inf, label = paste("Median =", round(median_val, 2)),
vjust = 4, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Density Plot",
subtitle = paste("n =", length(values)),
x = "Value",
y = "Density"
)
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
p
})
# Generate cumulative distribution function plot
output$cdf_plot <- renderPlot({
req(processed_data())
values <- processed_data()
# Create empirical CDF data
n <- length(values)
ordered_values <- sort(values)
ecdf_y <- (1:n) / n # avoids floating-point issues with seq(1/n, 1, by = 1/n)
ecdf_df <- data.frame(
x = ordered_values,
y = ecdf_y
)
# Create theoretical CDF data for comparison
x_range <- seq(min(values) - 0.5*sd(values), max(values) + 0.5*sd(values), length.out = 500)
if (input$compare_dist == "norm") {
theoretical_cdf <- pnorm(x_range, mean = mean(values), sd = sd(values))
dist_name <- "Normal"
} else if (input$compare_dist == "t") {
# Scale t distribution to match data mean and sd
z_scores <- (x_range - mean(values)) / sd(values)
theoretical_cdf <- pt(z_scores, df = input$t_df)
dist_name <- paste0("t (df = ", input$t_df, ")")
} else if (input$compare_dist == "exp") {
# Shift to positive domain if needed
min_shift <- min(values) - 0.01
theoretical_cdf <- pexp(x_range - min_shift, rate = 1/mean(values - min_shift))
dist_name <- "Exponential"
} else if (input$compare_dist == "lnorm") {
if (min(values) <= 0) {
# Shift data for lognormal
shift <- abs(min(values)) + 1
shifted_x <- x_range + shift
log_mean <- mean(log(values + shift))
log_sd <- sd(log(values + shift))
theoretical_cdf <- plnorm(shifted_x, meanlog = log_mean, sdlog = log_sd)
} else {
log_mean <- mean(log(values))
log_sd <- sd(log(values))
theoretical_cdf <- plnorm(x_range, meanlog = log_mean, sdlog = log_sd)
}
dist_name <- "Log-normal"
} else if (input$compare_dist == "unif") {
theoretical_cdf <- punif(x_range, min = min(values), max = max(values))
dist_name <- "Uniform"
}
theoretical_df <- data.frame(
x = x_range,
y = theoretical_cdf
)
# Create CDF plot
p <- ggplot() +
geom_step(data = ecdf_df, aes(x = x, y = y), color = "#c0392b", linewidth = 1.2) +
geom_line(data = theoretical_df, aes(x = x, y = y), color = "#2471a3",
linewidth = 1.2, linetype = "dashed")
# Add statistics overlay if selected
if (input$show_stats) {
# Calculate statistics
mean_val <- mean(values)
median_val <- median(values)
# Add vertical lines for mean and median
p <- p +
geom_vline(xintercept = mean_val, color = "#e74c3c", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = median_val, color = "#27ae60", linetype = "dotted", linewidth = 1)
# Add text annotations
p <- p +
annotate("text", x = mean_val, y = 0, label = paste("Mean =", round(mean_val, 2)),
vjust = -1, hjust = ifelse(mean_val > median_val, -0.1, 1.1), color = "#e74c3c") +
annotate("text", x = median_val, y = 0, label = paste("Median =", round(median_val, 2)),
vjust = -3, hjust = ifelse(median_val > mean_val, -0.1, 1.1), color = "#27ae60")
}
# Set title and labels
p <- p + labs(
title = "Cumulative Distribution Function",
subtitle = paste("Empirical CDF vs", dist_name, "CDF"),
x = "Value",
y = "Cumulative Probability"
) +
scale_y_continuous(limits = c(0, 1))
# Apply selected theme
if (input$theme_choice == "minimal") {
p <- p + theme_minimal(base_size = 14)
} else if (input$theme_choice == "classic") {
p <- p + theme_classic(base_size = 14)
} else if (input$theme_choice == "light") {
p <- p + theme_light(base_size = 14)
} else if (input$theme_choice == "dark") {
p <- p + theme_dark(base_size = 14) +
theme(
plot.background = element_rect(fill = "gray10"),
panel.background = element_rect(fill = "gray15"),
panel.grid = element_line(color = "gray30")
)
}
# Add legend
p <- p +
annotate("text", x = -Inf, y = 0.25, label = "Empirical CDF", color = "#c0392b",
hjust = -0.1, vjust = 0, fontface = "bold") +
annotate("text", x = -Inf, y = 0.15, label = paste(dist_name, "CDF"), color = "#2471a3",
hjust = -0.1, vjust = 0, fontface = "bold")
p
})
# Calculate distribution statistics
distribution_stats <- reactive({
req(processed_data())
values <- processed_data()
# Calculate basic statistics
n <- length(values)
mean_val <- mean(values)
median_val <- median(values)
sd_val <- sd(values)
min_val <- min(values)
max_val <- max(values)
range_val <- max_val - min_val
q1 <- quantile(values, 0.25)
q3 <- quantile(values, 0.75)
iqr <- IQR(values)
# Calculate skewness and kurtosis
skew_val <- tryCatch({
moments::skewness(values)
}, error = function(e) {
NA
})
kurt_val <- tryCatch({
moments::kurtosis(values) - 3 # Excess kurtosis (normal = 0)
}, error = function(e) {
NA
})
# Calculate normality test results
sw_test <- tryCatch({
shapiro.test(values)
}, error = function(e) {
list(statistic = NA, p.value = NA)
})
# Calculate mean-median relationship
mean_median_diff <- mean_val - median_val
mean_median_ratio <- ifelse(median_val != 0, mean_val / median_val, NA)
# Return all statistics
list(
n = n,
mean = mean_val,
median = median_val,
sd = sd_val,
min = min_val,
max = max_val,
range = range_val,
q1 = q1,
q3 = q3,
iqr = iqr,
skewness = skew_val,
kurtosis = kurt_val,
sw_stat = sw_test$statistic,
sw_p = sw_test$p.value,
mean_median_diff = mean_median_diff,
mean_median_ratio = mean_median_ratio,
bins = bin_count()
)
})
# Display summary results
output$summary_results <- renderPrint({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
stats <- distribution_stats()
cat("Distribution Analysis Results:\n")
cat("==============================\n\n")
cat("Basic Statistics:\n")
cat("----------------\n")
cat("Sample size:", stats$n, "\n")
cat("Minimum:", round(stats$min, 4), "\n")
cat("Maximum:", round(stats$max, 4), "\n")
cat("Range:", round(stats$range, 4), "\n")
cat("Mean:", round(stats$mean, 4), "\n")
cat("Median:", round(stats$median, 4), "\n")
cat("Standard deviation:", round(stats$sd, 4), "\n")
cat("Q1 (25th percentile):", round(stats$q1, 4), "\n")
cat("Q3 (75th percentile):", round(stats$q3, 4), "\n")
cat("Interquartile range (IQR):", round(stats$iqr, 4), "\n\n")
cat("Shape Characteristics:\n")
cat("---------------------\n")
if (!is.na(stats$skewness)) {
cat("Skewness:", round(stats$skewness, 4),
ifelse(abs(stats$skewness) < 0.5, " (approximately symmetric)",
ifelse(stats$skewness > 0, " (right-skewed/positive)", " (left-skewed/negative)")), "\n")
}
if (!is.na(stats$kurtosis)) {
cat("Excess kurtosis:", round(stats$kurtosis, 4),
ifelse(abs(stats$kurtosis) < 0.5, " (approximately mesokurtic)",
ifelse(stats$kurtosis > 0, " (leptokurtic/heavy-tailed)", " (platykurtic/light-tailed)")), "\n")
}
cat("Mean-median difference:", round(stats$mean_median_diff, 4),
ifelse(abs(stats$mean_median_diff) < 0.1*stats$sd, " (close)",
ifelse(stats$mean_median_diff > 0, " (mean > median, suggests right skew)",
" (mean < median, suggests left skew)")), "\n\n")
cat("Normality Assessment:\n")
cat("--------------------\n")
if (!is.na(stats$sw_stat)) {
cat("Shapiro-Wilk test: W =", round(stats$sw_stat, 4),
", p-value =", format.pval(stats$sw_p, digits = 4), "\n")
if (stats$sw_p < 0.05) {
cat("The p-value is less than 0.05, suggesting the data significantly\n")
cat("deviates from a normal distribution.\n\n")
} else {
cat("The p-value is greater than or equal to 0.05, suggesting the data\n")
cat("does not significantly deviate from a normal distribution.\n\n")
}
}
cat("Histogram Information:\n")
cat("----------------------\n")
cat("Number of bins:", stats$bins, "\n")
cat("Average observations per bin:", round(stats$n / stats$bins, 1), "\n")
})
# Generate interpretation text
output$interpretation <- renderUI({
req(input$analyze > 0)
error <- validate_data()
if (!is.null(error) && startsWith(error, "Error")) return(NULL)
stats <- distribution_stats()
# Interpretation based on statistics
interpretation_text <- div(
h5("Distribution Shape Assessment"),
p("Based on the histogram and statistical analysis, your data shows the following characteristics:")
)
# Determine distribution characteristics
dist_shape <- ""
skew_text <- ""
kurt_text <- ""
normality_text <- ""
transformation_text <- ""
# Assess skewness
if (!is.na(stats$skewness)) {
if (abs(stats$skewness) < 0.5) {
skew_text <- "The distribution appears approximately symmetric (skewness near zero)."
} else if (stats$skewness > 1) {
skew_text <- "The distribution is strongly right-skewed (positive skew), with a long tail extending toward higher values."
} else if (stats$skewness > 0.5) {
skew_text <- "The distribution is moderately right-skewed (positive skew)."
} else if (stats$skewness < -1) {
skew_text <- "The distribution is strongly left-skewed (negative skew), with a long tail extending toward lower values."
} else if (stats$skewness < -0.5) {
skew_text <- "The distribution is moderately left-skewed (negative skew)."
}
}
# Assess kurtosis
if (!is.na(stats$kurtosis)) {
if (abs(stats$kurtosis) < 0.5) {
kurt_text <- "The distribution has a similar tail thickness to a normal distribution (mesokurtic)."
} else if (stats$kurtosis > 1) {
kurt_text <- "The distribution has heavy tails (leptokurtic), with more extreme values than expected in a normal distribution."
} else if (stats$kurtosis > 0.5) {
kurt_text <- "The distribution has slightly heavier tails than a normal distribution."
} else if (stats$kurtosis < -1) {
kurt_text <- "The distribution has light tails (platykurtic), with fewer extreme values than expected in a normal distribution."
} else if (stats$kurtosis < -0.5) {
kurt_text <- "The distribution has slightly lighter tails than a normal distribution."
}
}
# Assess normality based on Shapiro-Wilk test
if (!is.na(stats$sw_p)) {
if (stats$sw_p < 0.05) {
normality_text <- "Formal testing (Shapiro-Wilk) suggests the data significantly deviates from a normal distribution."
} else {
normality_text <- "Formal testing (Shapiro-Wilk) does not detect significant deviation from a normal distribution."
}
}
# Generate transformation recommendations
if (!is.na(stats$skewness)) {
if (stats$skewness > 0.7) {
transformation_text <- "For right-skewed data, consider applying a log, square root, or reciprocal transformation to achieve a more normal distribution."
} else if (stats$skewness < -0.7) {
transformation_text <- "For left-skewed data, consider applying a square, cube, or exponential transformation to achieve a more normal distribution."
}
}
# Complete interpretation
interpretation_text <- tagList(
interpretation_text,
tags$ul(
if (skew_text != "") tags$li(tags$strong(skew_text)),
if (kurt_text != "") tags$li(tags$strong(kurt_text)),
if (normality_text != "") tags$li(normality_text)
)
)
# Add advice based on the analysis
advice <- div(
h5("Recommendations"),
p("Based on this histogram analysis, consider the following:")
)
advice_items <- list()
if (!is.na(stats$sw_p) && stats$sw_p >= 0.05 && !is.na(stats$skewness) && abs(stats$skewness) < 0.5) {
# Appears approximately normal
advice_items <- c(advice_items,
list(tags$li("Your data appears approximately normally distributed. Parametric statistical methods are likely appropriate.")))
} else {
# Not clearly normal
advice_items <- c(advice_items,
list(tags$li("Your data shows deviations from normality. Consider using non-parametric methods if normality is an assumption in your analysis.")))
# Add transformation advice if appropriate
if (transformation_text != "") {
advice_items <- c(advice_items, list(tags$li(transformation_text)))
}
}
# Check for possible bimodality
# This is a simple heuristic and might not detect all bimodal distributions
if (!is.na(stats$kurtosis) && stats$kurtosis < -1.2) {
advice_items <- c(advice_items,
list(tags$li("Your data may have a bimodal or multimodal distribution. Consider density estimation or mixture modeling approaches.")))
}
# Always suggest visual confirmation
advice_items <- c(advice_items,
list(tags$li("Always supplement this histogram analysis with other normality checks like Q-Q plots and formal tests.")))
# Complete advice
advice <- tagList(
advice,
tags$ul(advice_items)
)
# Return complete interpretation
tagList(interpretation_text, advice)
})
}
# Color palettes for use in the app
fill_colors <- c(
"blue" = "#3498db",
"green" = "#2ecc71",
"red" = "#e74c3c",
"purple" = "#9b59b6",
"orange" = "#e67e22"
)
shinyApp(ui = ui, server = server)
How Histograms Reveal Distribution Properties
Histograms provide visual insights into multiple aspects of your data distribution:
Key Distribution Characteristics
When analyzing a histogram, look for these key characteristics (a short R sketch for computing them follows the list):
- Central Tendency: Where is the center of the distribution?
- Mean: The arithmetic average (affected by outliers)
- Median: The middle value (robust to outliers)
- In a normal distribution, mean = median
- In skewed distributions, mean and median differ
- Spread: How dispersed is the data?
- Standard Deviation: Average distance from the mean
- Range: Distance from minimum to maximum
- Interquartile Range (IQR): Distance between 25th and 75th percentiles
- Skewness: Is the distribution symmetric or asymmetric?
- Right-skewed (positive): Longer tail to the right, mean > median
- Left-skewed (negative): Longer tail to the left, mean < median
- Symmetric: Balanced on both sides, mean ≈ median
- Kurtosis: How heavy are the tails compared to a normal distribution?
- Mesokurtic: Normal tail thickness (normal distribution)
- Leptokurtic (positive kurtosis): Heavier tails, more extreme values
- Platykurtic (negative kurtosis): Lighter tails, fewer extreme values
- Modality: How many peaks does the distribution have?
- Unimodal: One peak (e.g., normal distribution)
- Bimodal: Two peaks (possibly two subpopulations)
- Multimodal: Multiple peaks (possibly multiple subpopulations)
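All of these numeric summaries can be computed directly in R. A minimal sketch, using base R plus the moments package that the app above already loads:

```r
library(moments) # skewness() and kurtosis()

x <- c(23.2, 24.5, 21.6, 22.8, 25.3, 20.9, 23.7, 22.2, 24.8, 21.5)

mean(x)          # central tendency: arithmetic average (outlier-sensitive)
median(x)        # central tendency: middle value (outlier-robust)
sd(x)            # spread: standard deviation
IQR(x)           # spread: interquartile range (Q3 - Q1)
skewness(x)      # asymmetry: > 0 right-skewed, < 0 left-skewed
kurtosis(x) - 3  # excess kurtosis: about 0 for normal tails
plot(density(x)) # modality: count the peaks in the density estimate
```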
Mathematical Formulation
The mathematical calculations behind histogram analysis include:
Bin Width Calculation: Different methods exist for determining optimal bin width:
- Sturges’ Rule: \(k = 1 + \log_2(n)\)
- Scott’s Rule: \(h = 3.5\sigma n^{-1/3}\)
- Freedman-Diaconis Rule: \(h = 2 \times IQR \times n^{-1/3}\)
Where \(n\) is sample size, \(\sigma\) is standard deviation, and \(IQR\) is interquartile range.
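As a sketch, the three rules translate directly into R; this mirrors the `bin_width()` reactive in the app above:

```r
x <- rnorm(200) # example data; substitute your own numeric vector
n <- length(x)

h_sturges <- diff(range(x)) / (1 + log2(n)) # Sturges: width = range / k
h_scott   <- 3.5 * sd(x) * n^(-1/3)         # Scott's rule
h_fd      <- 2 * IQR(x) * n^(-1/3)          # Freedman-Diaconis rule

c(sturges = h_sturges, scott = h_scott, fd = h_fd) # the widths usually differ
```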
Skewness Calculation:
\[\text{Skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3/n}{s^3}\]
Where \(\bar{x}\) is the mean and \(s\) is the standard deviation.
Kurtosis Calculation:
\[\text{Kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4/n}{s^4} - 3\]
The “-3” term gives excess kurtosis, where 0 represents a normal distribution.
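A minimal sketch of both moment formulas, cross-checked against the moments package used by the app (note that these formulas use the population standard deviation, i.e. dividing by n rather than n - 1):

```r
x <- c(1.2, 3.4, 2.2, 8.9, 2.5, 3.1, 2.8, 4.0, 2.6, 12.3) # right-skewed toy data
n <- length(x)
m <- mean(x)
s <- sqrt(sum((x - m)^2) / n) # population standard deviation

skew    <- sum((x - m)^3) / n / s^3     # third standardized moment
ex_kurt <- sum((x - m)^4) / n / s^4 - 3 # fourth standardized moment, minus 3

c(skew, moments::skewness(x))        # should match
c(ex_kurt, moments::kurtosis(x) - 3) # should match
```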
Histograms vs. Other Visualization Methods
Method | Strengths | Limitations |
---|---|---|
Histogram | Shows full distribution shape; easy to interpret | Bin width selection affects appearance; discontinuous |
Box Plot | Shows outliers and quartiles concisely | Hides distribution shape details |
Density Plot | Smooth continuous estimate of distribution | Requires bandwidth selection; somewhat abstract |
Q-Q Plot | Excellent for comparing to theoretical distributions | More difficult to interpret for non-specialists |
ECDF Plot | Shows full distribution without binning decisions | Less intuitive for general audience |
For the most comprehensive assessment, use histograms alongside other visualization methods.
Example 1: Normally Distributed Data
A university instructor collected final exam scores for a class of 120 students and wants to understand the distribution of scores.
Data Summary:
- Sample size: 120 students
- Mean score: 72.5
- Median score: 73.0
- Standard deviation: 8.6
- Range: 48 to 97
- Skewness: -0.12
- Kurtosis: -0.08
Analysis:
- Histogram Assessment:
- The histogram shows a roughly symmetric bell-shaped curve
- The peak is centered near 73
- Tails on both sides are approximately equal
- Mean (72.5) and median (73.0) are very close
- Statistical Interpretation:
- Skewness near zero (-0.12) indicates good symmetry
- Kurtosis near zero (-0.08) indicates normal tail thickness
- Shapiro-Wilk test: W = 0.987, p = 0.294 (supports normality)
Conclusion:
The exam scores follow an approximately normal distribution. The slight negative skewness is minimal and not statistically significant. This normal distribution suggests that the exam difficulty was appropriate for the student population, with most students scoring near the average and fewer students at the extremes.
How to Report: “Examination of the histogram and statistical analysis revealed that student exam scores (M = 72.5, SD = 8.6) followed an approximately normal distribution (Shapiro-Wilk: W = 0.987, p = 0.294). The symmetric bell-shaped distribution with nearly identical mean and median (73.0) supports the use of parametric statistical methods for further analysis of this data.”
Example 2: Right-Skewed Data
A researcher collected data on customer wait times (in minutes) at a service center.
Data Summary:
- Sample size: 75 customers
- Mean wait time: 12.4 minutes
- Median wait time: 9.8 minutes
- Standard deviation: 8.3 minutes
- Range: 1.2 to 42.5 minutes
- Skewness: 1.63
- Kurtosis: 2.45
Analysis:
- Histogram Assessment:
- The histogram shows a clear right-skewed (positive skew) distribution
- Many customers had short wait times (peak at left)
- A long tail extends to the right (some customers waited much longer)
- Mean (12.4) is noticeably larger than median (9.8)
- Statistical Interpretation:
- High positive skewness (1.63) confirms right skew
- Positive kurtosis (2.45) indicates heavy tails with some extreme values
- Shapiro-Wilk test: W = 0.824, p < 0.001 (rejects normality)
Conclusion: The wait time data follows a strongly right-skewed distribution, which is common for time-related measurements. Most customers experience relatively short wait times, but a small percentage face much longer waits. This skewed pattern suggests the service center generally handles customers efficiently but occasionally experiences backups or complex cases.
How to Report: “Histogram analysis of customer wait times revealed a strongly right-skewed distribution (skewness = 1.63), with most customers experiencing relatively short waits (Mdn = 9.8 minutes) but some facing much longer times, pulling the mean higher (M = 12.4 minutes). For statistical analyses of this non-normal data (Shapiro-Wilk: W = 0.824, p < 0.001), non-parametric methods or log transformation would be appropriate.”
Common Histogram Shapes and What They Mean
Understanding specific histogram shapes helps identify the exact nature of your data:
Normal Distribution
A normal distribution shows:
- Symmetric bell-shaped curve
- Mean and median are nearly identical
- Approximately 68% of data within one standard deviation of the mean
- Common in natural phenomena, measurement errors, and averages of random variables
- Example applications: Height, IQ scores, measurement errors
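The 68% figure is easy to verify numerically in R:

```r
pnorm(1) - pnorm(-1) # ~0.6827: share of a normal distribution within 1 SD
pnorm(2) - pnorm(-2) # ~0.9545: within 2 SDs
pnorm(3) - pnorm(-3) # ~0.9973: within 3 SDs
```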
Right-Skewed (Positive Skew) Distribution
A right-skewed distribution shows:
- Peak toward the left (lower values)
- Long tail extending to the right (higher values)
- Mean > median
- Common in time-to-event data, income, and counts with lower bounds
- Example applications: Income distributions, reaction times, house prices
Left-Skewed (Negative Skew) Distribution
A left-skewed distribution shows:
- Peak toward the right (higher values)
- Long tail extending to the left (lower values)
- Mean < median
- Common in bounded distributions near their upper limit
- Example applications: Age at death, exam scores on easy tests, purity measures
Bimodal Distribution
A bimodal distribution shows:
- Two distinct peaks
- Suggests two subpopulations or groups within the data
- May indicate a mixture of two different distributions
- Example applications: Heights of mixed gender groups, political opinions in polarized contexts
Uniform Distribution
A uniform distribution shows:
- Roughly equal frequency across all bins
- No clear peak
- Common in randomly generated numbers or well-mixed populations
- Example applications: Random numbers, positions of randomly placed objects
Heavy-Tailed Distribution
A heavy-tailed distribution shows:
- More extreme values than expected in a normal distribution
- Central peak may be higher and narrower
- Positive kurtosis value
- Example applications: Financial returns, catastrophic event magnitudes
Choosing Data Transformations Based on Histograms
Histograms can guide your choice of data transformation to achieve a more normal distribution:
Distribution Pattern | Recommended Transformations |
---|---|
Right-skewed (positive) | Log, Square root, Reciprocal, Box-Cox |
Left-skewed (negative) | Square, Cube, Exponential, Reflect-and-log |
Heavy-tailed | Winsorization, Trimming, Rank transformation |
Bimodal | Consider analyzing subgroups separately |
After applying a transformation, create a new histogram to assess whether normality has improved.
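A minimal sketch of this check for right-skewed data, using synthetic values (the `wait_times` name is illustrative):

```r
set.seed(42)
wait_times <- rlnorm(200, meanlog = 2, sdlog = 0.6) # synthetic right-skewed data

moments::skewness(wait_times) # strongly positive before transformation

log_wait <- log(wait_times)   # log transform; requires strictly positive values
moments::skewness(log_wait)   # near zero afterwards

op <- par(mfrow = c(1, 2))
hist(wait_times, main = "Original (right-skewed)")
hist(log_wait, main = "Log-transformed")
par(op)
```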
Test Your Understanding
1. What does a right-skewed (positively skewed) histogram indicate about the relationship between mean and median?
   - A. Mean equals median
   - B. Mean is greater than median
   - C. Mean is less than median
   - D. No consistent relationship
2. What bin width method is generally recommended for histograms when you want to detect bimodality?
   - A. Sturges’ rule
   - B. Scott’s rule
   - C. Freedman-Diaconis rule
   - D. Equal-width bins
3. What transformation is typically most effective for right-skewed data?
   - A. Square transformation
   - B. Log transformation
   - C. Subtraction of a constant
   - D. Addition of a constant
4. What does positive kurtosis (leptokurtic distribution) indicate about a distribution’s shape?
   - A. It has heavier tails than a normal distribution
   - B. It has lighter tails than a normal distribution
   - C. It is more skewed than a normal distribution
   - D. It has multiple peaks
5. Why is the histogram a better visualization than a boxplot for detecting bimodality?
   - A. Histograms are always more precise
   - B. Histograms show the complete shape of the distribution
   - C. Histograms require fewer data points
   - D. Histograms never show outliers
Answers: 1-B, 2-C, 3-B, 4-A, 5-B
Common Questions About Histogram Analysis
How does the number of bins affect my histogram?
The number of bins in a histogram significantly affects its appearance and interpretation:
- Too few bins: Obscures important details and patterns in the data
- Too many bins: Creates a noisy, jagged appearance with many empty or near-empty bins
Several methods exist for determining optimal bin count:
- Sturges’ rule: \(k = 1 + \log_2(n)\) where \(n\) is sample size
- Scott’s rule: Based on minimizing integrated mean squared error for normal distributions
- Freedman-Diaconis rule: Uses IQR to be robust against outliers
- Square root rule: \(k = \sqrt{n}\)
For detecting bimodality or fine structure, the Freedman-Diaconis rule often works best. For general purposes, using 10-20 bins for small to medium datasets is a reasonable starting point. Always try multiple bin widths to check if the apparent distribution shape is sensitive to bin choice.
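Base R ships implementations of these rules in the grDevices package, so you can compare the suggested bin counts directly; a short sketch:

```r
x <- rnorm(500)

c(sturges = nclass.Sturges(x),        # also hist()'s default rule
  scott   = nclass.scott(x),
  fd      = nclass.FD(x),
  sqrt    = ceiling(sqrt(length(x)))) # square root rule, computed by hand

hist(x, breaks = "FD") # hist() also accepts the rule name directly
```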
What is the difference between a frequency histogram and a density histogram?
Both display the same data but with different y-axis scaling:
- Frequency histogram: Y-axis shows the count of observations in each bin
- Advantage: Shows actual observation counts, which can be helpful for understanding sample size
- Disadvantage: Difficult to compare distributions with different sample sizes
- Density histogram: Y-axis shows the proportion or probability density
- Advantage: Area under the histogram equals 1, making it comparable to theoretical density functions
- Advantage: Allows comparison between datasets of different sizes
- Disadvantage: Less intuitive interpretation of the y-axis values
Most statistical analysis focuses on density histograms because they facilitate comparison with theoretical distributions (like normal curves) and between different samples.
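A minimal base-R illustration of the two scalings:

```r
x <- rnorm(300, mean = 50, sd = 10)

op <- par(mfrow = c(1, 2))
hist(x, freq = TRUE, main = "Frequency", ylab = "Count")
hist(x, freq = FALSE, main = "Density", ylab = "Density")
# Overlaying a theoretical curve only makes sense on the density scale
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, lwd = 2, lty = 2)
par(op)
```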
How do I interpret skewness and kurtosis values?
Skewness measures the asymmetry of a distribution:
- Values near 0: Approximately symmetric
- Positive values: Right-skewed (tail extends to the right)
- Negative values: Left-skewed (tail extends to the left)
- Rule of thumb for significance:
- |Skewness| < 0.5: Approximately symmetric
- 0.5 < |Skewness| < 1: Moderately skewed
- |Skewness| > 1: Highly skewed
Kurtosis (specifically excess kurtosis) measures the heaviness of the tails relative to a normal distribution:
- Value of 0: Same tail thickness as a normal distribution (mesokurtic)
- Positive values: Heavier tails than normal (leptokurtic)
- Negative values: Lighter tails than normal (platykurtic)
- Rule of thumb for significance:
- |Kurtosis| < 0.5: Approximately normal tails
- 0.5 < |Kurtosis| < 1: Moderately non-normal tails
- |Kurtosis| > 1: Extremely non-normal tails
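These rules of thumb are easy to wrap in a small helper; the `describe_shape()` function below is an illustrative sketch, not part of the app:

```r
describe_shape <- function(x) {
  sk <- moments::skewness(x)
  ku <- moments::kurtosis(x) - 3 # excess kurtosis
  list(
    skewness = sk,
    kurtosis = ku,
    symmetry = if (abs(sk) < 0.5) "approximately symmetric"
               else if (sk > 0) "right-skewed" else "left-skewed",
    tails    = if (abs(ku) < 0.5) "approximately normal tails"
               else if (ku > 0) "heavy (leptokurtic)" else "light (platykurtic)"
  )
}

describe_shape(rexp(200)) # exponential data: right-skewed with heavy tails
```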
Can histograms be used to detect outliers?
Yes, histograms can help identify outliers, but with some limitations:
Advantages:
- Provide visual context for what values might be unusually far from the central tendency
- Show whether potential outliers align with the overall distribution shape
- Can reveal if “outliers” are actually part of a heavy-tailed distribution
Limitations:
- Binning can obscure individual extreme values
- Very extreme outliers might not appear if they fall outside the plot range
- Not as precise as dedicated outlier detection methods
For a more complete outlier assessment, combine histograms with:
- Box plots (which explicitly mark outliers)
- Z-scores or modified Z-scores
- Statistical tests like Grubbs’ test or Dixon’s Q test
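A short sketch of the first two checks in base R (`boxplot.stats()` and `mad()` ship with the default packages):

```r
x <- c(rnorm(100), 8, -7) # mostly normal data with two planted outliers

# Box-plot rule: points beyond 1.5 * IQR from the quartiles
boxplot.stats(x)$out

# Modified z-scores based on the median and raw MAD (robust to the outliers)
mz <- 0.6745 * (x - median(x)) / mad(x, constant = 1)
x[abs(mz) > 3.5] # 3.5 is a commonly used cutoff
```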
Are histograms sufficient for testing normality?
Histograms alone are not sufficient for definitive normality testing, but they provide valuable initial insight:
What histograms can tell you:
- Obvious deviations from normality (strong skewness, multiple peaks)
- General distribution shape and potential issues
- Whether transformation might be necessary
Why histograms aren’t sufficient:
- Bin width choice affects the apparent shape
- Visual assessment is subjective
- Small samples may not show clear patterns
- Cannot provide objective criteria for decision-making
For comprehensive normality assessment, combine histograms with:
1. Q-Q plots (more sensitive to deviations in the tails)
2. Formal tests (Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov)
3. Skewness and kurtosis statistics
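A minimal sketch combining all three checks on one sample:

```r
x <- rlnorm(100) # skewed example data

# 1. Q-Q plot: points should track the line if the data are normal
qqnorm(x)
qqline(x, lwd = 2)

# 2. Formal test (Shapiro-Wilk is valid for 3 <= n <= 5000)
shapiro.test(x)

# 3. Shape statistics
c(skewness = moments::skewness(x),
  excess_kurtosis = moments::kurtosis(x) - 3)
```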
How can I detect bimodality in my data?
Bimodality (having two peaks) can indicate mixed populations or interesting subgroups:
Tips for detecting bimodality:
- Use an appropriate bin width - too few bins might obscure bimodality
- The Freedman-Diaconis rule is often good for detecting multiple modes
- Look for a “valley” between two peaks
- Try kernel density estimation as a smoothed alternative
Formal approaches:
- Hartigan’s dip test - tests specifically for unimodality vs. multimodality
- Silverman’s test - uses kernel density estimation to test for modes
- Mixture model fitting - statistical test of whether two component distributions fit better than one
Remember that sampling variation can create apparent bimodality, so be cautious with small samples.
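A brief sketch of the informal and formal routes; the diptest package is an assumed extra dependency:

```r
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4)) # two-component mixture

plot(density(x)) # kernel density estimate: look for two peaks and a valley

# Hartigan's dip test of unimodality (install.packages("diptest") if needed)
diptest::dip.test(x) # a small p-value suggests more than one mode
```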
Examples of When to Use Histogram Analysis
- Data exploration: Initial assessment of any new dataset’s distribution
- Normality checking: Before applying parametric statistical methods
- Transformation selection: To determine appropriate data transformations
- Outlier detection: To identify unusual values in their distributional context
- Quality control: To monitor process distributions and detect shifts
- Finance: To analyze return distributions and assess risk
- Healthcare: To examine patient metrics and lab test distributions
- Education: To evaluate test score distributions and identify student groups
- Manufacturing: To assess product variation and quality metrics
- Environmental science: To understand distributions of pollutants or natural phenomena
References
- Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, 66(3), 605-610.
- Freedman, D., & Diaconis, P. (1981). On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4), 453-476.
- Sturges, H. A. (1926). The choice of a class interval. Journal of the American Statistical Association, 21(153), 65-66.
- Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician), 47(1), 183-189.
- Knuth, D. E. (2014). Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley Professional.
- Silverman, B. W. (1986). Density estimation for statistics and data analysis. Chapman and Hall.
- Hartigan, J. A., & Hartigan, P. M. (1985). The dip test of unimodality. The Annals of Statistics, 13(1), 70-84.
Citation
@online{kassambara2025,
  author = {Kassambara, Alboukadel},
  title = {Histogram Analysis Tool | Check Data Distribution \& Normality},
  date = {2025-04-14},
  url = {https://www.datanovia.com/apps/statfusion/analysis/inferential/goodness-fit/normality/histogram-distribution-analysis.html},
  langid = {en}
}