QQ Plot Analysis & Generator | Test Data Normality Visually

Check If Your Data Follows a Normal Distribution with Quantile-Quantile Plots

Free online QQ plot generator and analyzer to visually assess data normality. More intuitive than formal tests, with interactive visualizations and detailed interpretations of distribution patterns.

Published

April 12, 2025

Modified

April 16, 2025

Keywords

qq plot, quantile quantile plot, normality test visual, check data distribution, normal probability plot, qq plot generator, test normality qq plot, qq plot interpretation

Key Takeaways: QQ Plot Analysis

Tip
  • Purpose: Visually assess whether data follows a particular distribution (typically normal)
  • How it works: Compares the quantiles of your data to the quantiles of a theoretical distribution
  • Advantage over formal tests: Shows exactly where and how data deviates from the theoretical distribution
  • Interpretation: Points following the diagonal line suggest the distributions match
  • Common patterns: Identify skewness, heavy/light tails, and outliers from specific curve patterns
  • Complementary to: Formal tests like Shapiro-Wilk or Kolmogorov-Smirnov
  • Applications: Testing assumptions for parametric tests, data exploration, distribution fitting

What is a QQ Plot?

A Quantile-Quantile (QQ) plot is a powerful graphical technique for comparing two probability distributions by plotting their quantiles against each other. When testing for normality, it compares the quantiles of your data against the quantiles of a normal distribution. QQ plots are widely considered one of the most effective visual methods for assessing whether a dataset follows a particular distribution.

Tip

When to use QQ plots:

  • When checking if data follows a normal distribution (or other theoretical distribution)
  • When formal normality tests give borderline results
  • When you need to understand exactly how your data deviates from normality
  • Before applying statistical methods that assume a particular distribution
  • When analyzing residuals from regression or ANOVA models
  • When you need to choose an appropriate data transformation

This interactive tool allows you to quickly generate QQ plots, visualize your data distribution patterns, and receive detailed interpretations to guide your statistical analysis decisions.



#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 1300

library(shiny)
library(bslib)
library(ggplot2)
library(bsicons)
library(vroom)
library(shinyjs)

ui <- page_sidebar(
  title = "QQ Plot Analysis: Assess Data Normality",
  useShinyjs(),  # Enable shinyjs for dynamic UI updates
  sidebar = sidebar(
    width = 400,
    
    card(
      card_header("Data Input"),
      accordion(
        accordion_panel(
          "Manual Input",
          textAreaInput("data_input", "Enter your data (one value per row):", rows = 8,
                      placeholder = "Paste values here..."),
          div(
            actionLink("use_example", "Use example data", style = "color:#0275d8;"),
            tags$span(bs_icon("file-earmark-text"), style = "margin-left: 5px; color: #0275d8;")
          )
        ),
        accordion_panel(
          "File Upload",
          fileInput("file_upload", "Upload CSV or TXT file:",
                   accept = c("text/csv", "text/plain", ".csv", ".txt")),
          checkboxInput("header", "File has header", TRUE),
          conditionalPanel(
            condition = "output.file_uploaded",
            div(
              selectInput("selected_var", "Select variable:", choices = NULL),
              actionButton("clear_file", "Clear File", class = "btn-danger btn-sm")
            )
          )
        ),
        id = "input_method",
        open = 1
      ),
      
      # Plot Settings accordion
      accordion(
        accordion_panel(
          "Plot Settings",
          
          card(
            card_header("Distribution Options:"),
            card_body(
              selectInput("dist_type", "Reference Distribution:",
                         choices = c("Normal" = "norm", 
                                    "t (Student's)" = "t",
                                    "Uniform" = "unif",
                                    "Log-Normal" = "lnorm",
                                    "Exponential" = "exp"),
                         selected = "norm"),
              conditionalPanel(
                condition = "input.dist_type == 't'",
                sliderInput("t_df", "Degrees of Freedom:", min = 1, max = 30, value = 5, step = 1)
              ),
              checkboxInput("standardize", "Standardize Data (Z-scores)", value = FALSE)
            )
          ),
          
          card(
            card_header("Visual Options:"),
            card_body(
              checkboxInput("conf_interval", "Show Confidence Bands", TRUE),
              sliderInput("conf_level", "Confidence Level:", min = 0.90, max = 0.99, value = 0.95, step = 0.01),
              checkboxInput("points_fill", "Color points by deviation", TRUE),
              checkboxInput("show_deviations", "Show Deviation Segments", FALSE),
              selectInput("theme_choice", "Plot Theme:",
                         choices = c("Minimal" = "minimal",
                                    "Classic" = "classic",
                                    "Light" = "light",
                                    "Dark" = "dark"),
                         selected = "minimal")
            )
          )
        ),
        open = FALSE
      ),
      
      actionButton("analyze", "Analyze Data", class = "btn btn-primary")
    ),

    hr(),

    card(
      card_header("Interpretation Guide"),
      card_body(
        div(class = "alert alert-info",
          tags$p("QQ (Quantile-Quantile) plots compare your data distribution to a theoretical distribution:"),
          tags$ul(
            tags$li(tags$b("Straight line:"), " Data follows the reference distribution."),
            tags$li(tags$b("S-shaped curve:"), " Data shows skewness."),
            tags$li(tags$b("Points above line at ends:"), " Data has heavy tails (more extreme values)."),
            tags$li(tags$b("Points below line at ends:"), " Data has light tails (fewer extreme values)."),
            tags$li(tags$b("Confidence bands:"), " Points falling outside these bands indicate significant deviations from the reference distribution.")
          )
        )
      )
    )
  ),

  layout_column_wrap(
    width = 1,

    card(
      card_header("QQ Plot Analysis"),
      card_body(
        navset_tab(
          nav_panel("QQ Plot", plotOutput("qqplot", height = "500px")),
          nav_panel("Histogram", plotOutput("histogram", height = "500px")),
          nav_panel("Deviations Analysis", plotOutput("deviation_plot", height = "500px"))
        )
      )
    ),

    card(
      card_header("Normality Assessment"),
      card_body(
        navset_tab(
          nav_panel("Results", uiOutput("error_message"), verbatimTextOutput("summary_results")),
          nav_panel("Interpretation", div(style = "font-size: 0.9rem;",
            uiOutput("interpretation"),
            uiOutput("deviation_pattern"),
            hr(),
            h5("Common Patterns in QQ Plots"),
            div(
              class = "row",
              div(class = "col-md-6",
                  h6("Normal Distribution:"),
                  p("Points follow the diagonal line with random scatter."),
                  h6("Skewed Right (Positive Skew):"),
                  p("Points curve above the line at right end (high values)."),
                  h6("Skewed Left (Negative Skew):"),
                  p("Points curve below the line at left end (low values).")
              ),
              div(class = "col-md-6",
                  h6("Heavy Tails (Leptokurtic):"),
                  p("S-shape with points below line in middle, above at ends."),
                  h6("Light Tails (Platykurtic):"),
                  p("Inverted S-shape with points above line in middle, below at ends."),
                  h6("Bimodal/Multimodal:"),
                  p("Step-like pattern with horizontal segments.")
              )
            )
          ))
        )
      )
    )
  )
)

server <- function(input, output, session) {
  # Example data - normally distributed with some outliers
  example_data <- "5.2\n5.5\n6.1\n5.8\n5.5\n5.9\n6.2\n5.7\n5.6\n6.0\n5.8\n5.5\n5.7\n6.3\n5.9\n5.8\n5.6\n5.7\n6.1\n5.9\n5.4\n5.8\n6.2\n5.7\n5.6\n5.8\n6.4\n5.9\n5.7\n5.5\n7.2\n4.3"

  # Track input method
  input_method <- reactiveVal("manual")
  
  # Function to clear file inputs
  clear_file_inputs <- function() {
    updateSelectInput(session, "selected_var", choices = NULL)
    reset("file_upload")
  }
  
  # Function to clear text inputs
  clear_text_inputs <- function() {
    updateTextAreaInput(session, "data_input", value = "")
  }

  # When example data is used, clear file inputs and set text inputs
  observeEvent(input$use_example, {
    input_method("manual")
    clear_file_inputs()
    updateTextAreaInput(session, "data_input", value = example_data)
  })

  # When file is uploaded, clear text inputs and set file method
  observeEvent(input$file_upload, {
    if (!is.null(input$file_upload)) {
      input_method("file")
      clear_text_inputs()
    }
  })

  # When clear file button is clicked, clear file and set manual method
  observeEvent(input$clear_file, {
    input_method("manual")
    clear_file_inputs()
  })
  
  # When text input changes, clear file inputs if it has content
  observeEvent(input$data_input, {
    if (!is.null(input$data_input) && nchar(input$data_input) > 0) {
      input_method("manual")
      clear_file_inputs()
    }
  }, ignoreInit = TRUE)

  # Process uploaded file
  file_data <- reactive({
    req(input$file_upload)
    tryCatch({
      vroom::vroom(input$file_upload$datapath, delim = NULL, col_names = input$header, show_col_types = FALSE)
    }, error = function(e) {
      showNotification(paste("File read error:", e$message), type = "error")
      NULL
    })
  })

  # Update variable selection dropdown with numeric columns from uploaded file
  observe({
    df <- file_data()
    if (!is.null(df)) {
      num_vars <- names(df)[sapply(df, is.numeric)]
      updateSelectInput(session, "selected_var", choices = num_vars)
    }
  })

  output$file_uploaded <- reactive({
    !is.null(input$file_upload)
  })
  outputOptions(output, "file_uploaded", suspendWhenHidden = FALSE)

  # Function to parse text input
  parse_text_input <- function(text) {
    if (is.null(text) || text == "") return(NULL)
    input_lines <- strsplit(text, "\\r?\\n")[[1]]
    input_lines <- input_lines[input_lines != ""]
    numeric_values <- suppressWarnings(as.numeric(input_lines))
    if (all(is.na(numeric_values))) return(NULL)
    return(na.omit(numeric_values))
  }

  # Get data values based on input method
  data_values <- reactive({
    if (input_method() == "file" && !is.null(file_data()) && !is.null(input$selected_var)) {
      df <- file_data()
      return(na.omit(df[[input$selected_var]]))
    } else {
      return(parse_text_input(input$data_input))
    }
  })

  # Validate input data
  validate_data <- reactive({
    values <- data_values()
    
    if (is.null(values)) {
      return("Error: Please check your input. Make sure all values are numeric.")
    }
    
    if (length(values) < 5) {
      return("Error: At least 5 values are recommended for meaningful QQ plot analysis.")
    }
    
    if (length(unique(values)) == 1) {
      return("Error: All values are identical. QQ plot analysis requires variation in the data.")
    }
    
    return(NULL)
  })

  # Display error messages
  output$error_message <- renderUI({
    error <- validate_data()
    if (!is.null(error) && input$analyze > 0) {
      if (startsWith(error, "Warning")) {
        div(class = "alert alert-warning", error)
      } else {
        div(class = "alert alert-danger", error)
      }
    }
  })

  # Prepare data for analysis
  processed_data <- reactive({
    req(input$analyze > 0)
    error <- validate_data()
    if (!is.null(error) && startsWith(error, "Error")) return(NULL)
    
    values <- data_values()
    
    # Standardize if selected
    if (input$standardize) {
      values <- scale(values)
    }
    
    values
  })

  # Generate QQ plot
  output$qqplot <- renderPlot({
    req(processed_data())
    
    values <- processed_data()
    
    # Set up base qq plot data
    n <- length(values)
    
    # Theoretical quantiles based on selected distribution
    p <- (1:n - 0.5) / n
    
    if (input$dist_type == "norm") {
      theoretical_quantiles <- qnorm(p)
      dist_name <- "Normal"
    } else if (input$dist_type == "t") {
      theoretical_quantiles <- qt(p, df = input$t_df)
      dist_name <- paste0("t (df = ", input$t_df, ")")
    } else if (input$dist_type == "unif") {
      theoretical_quantiles <- qunif(p)
      dist_name <- "Uniform"
    } else if (input$dist_type == "lnorm") {
      theoretical_quantiles <- qlnorm(p)
      dist_name <- "Log-Normal"
    } else if (input$dist_type == "exp") {
      theoretical_quantiles <- qexp(p)
      dist_name <- "Exponential"
    }
    
    # Sort the observed data
    ordered_values <- sort(values)
    
    # Calculate deviations from the theoretical line
    if (input$dist_type == "norm") {
      # For normal distribution, calculate line based on mean and SD
      slope <- sd(values)
      intercept <- mean(values)
      
      # Theoretical line values
      line_values <- intercept + slope * theoretical_quantiles
      
      # Deviations
      deviations <- ordered_values - line_values
    } else {
      # For other distributions, use the sorted values as observed quantiles
      # and calculate deviations directly
      deviations <- ordered_values - theoretical_quantiles
    }
    
    # Create data frame for plotting
    qq_data <- data.frame(
      theoretical = theoretical_quantiles,
      observed = ordered_values,
      deviation = deviations,
      deviation_color = ifelse(abs(deviations) > 1.5*sd(deviations), "Large", "Small")
    )
    
    # Calculate confidence bands (simulation-based for normal)
    if (input$conf_interval) {
      alpha <- 1 - input$conf_level
      
      # Number of simulations
      n_sim <- 1000
      
      # Initialize matrix to hold simulation results
      sorted_sims <- matrix(NA, nrow = n, ncol = n_sim)
      
      # Generate simulations based on selected distribution
      for (i in 1:n_sim) {
        if (input$dist_type == "norm") {
          sim_data <- rnorm(n, mean = mean(values), sd = sd(values))
        } else if (input$dist_type == "t") {
          sim_data <- rt(n, df = input$t_df)
          if (input$standardize == FALSE) {
            sim_data <- sim_data * sd(values) + mean(values)
          }
        } else if (input$dist_type == "unif") {
          sim_data <- runif(n, min = min(values), max = max(values))
        } else if (input$dist_type == "lnorm") {
          sim_data <- rlnorm(n)
          if (input$standardize == FALSE) {
            sim_data <- sim_data * sd(values) / sd(sim_data) + mean(values) - mean(sim_data) * sd(values) / sd(sim_data)
          }
        } else if (input$dist_type == "exp") {
          sim_data <- rexp(n, rate = 1/mean(values))
        }
        
        sorted_sims[, i] <- sort(sim_data)
      }
      
      # Calculate confidence bands
      lower_band <- apply(sorted_sims, 1, function(x) quantile(x, alpha/2))
      upper_band <- apply(sorted_sims, 1, function(x) quantile(x, 1 - alpha/2))
      
      # Add to data frame
      qq_data$lower <- lower_band
      qq_data$upper <- upper_band
    }
    
    # Create QQ plot
    p <- ggplot(qq_data, aes(x = theoretical, y = observed)) +
      labs(title = paste("QQ Plot against", dist_name, "Distribution"),
           subtitle = paste("Sample size =", n),
           x = "Theoretical Quantiles",
           y = "Sample Quantiles")
    
    # Add confidence bands if selected
    if (input$conf_interval) {
      p <- p + geom_ribbon(aes(ymin = lower, ymax = upper), fill = "gray80", alpha = 0.3)
    }
    
    # Add reference line for normal distribution
    if (input$dist_type == "norm") {
      p <- p + geom_abline(intercept = mean(values), slope = sd(values), 
                         color = "blue", linetype = "dashed", linewidth = 1)
    } else {
      p <- p + geom_line(aes(x = theoretical, y = theoretical), 
                       color = "blue", linetype = "dashed", linewidth = 1)
    }
    
    # Add points colored by deviation if selected
    if (input$points_fill) {
      p <- p + geom_point(aes(color = deviation_color), size = 3, alpha = 0.7) +
        scale_color_manual(values = c("Small" = "#2980b9", "Large" = "#c0392b"), 
                         name = "Deviation")
    } else {
      p <- p + geom_point(color = "#2980b9", size = 3, alpha = 0.7)
    }
    
    # Add deviation segments if selected
    if (input$show_deviations) {
      if (input$dist_type == "norm") {
        p <- p + geom_segment(aes(xend = theoretical, 
                               yend = mean(values) + sd(values) * theoretical,
                               color = deviation_color), 
                           alpha = 0.5, linewidth = 0.8)
      } else {
        p <- p + geom_segment(aes(xend = theoretical, 
                               yend = theoretical,
                               color = deviation_color), 
                           alpha = 0.5, linewidth = 0.8)
      }
    }
    
    # Apply selected theme
    if (input$theme_choice == "minimal") {
      p <- p + theme_minimal(base_size = 14)
    } else if (input$theme_choice == "classic") {
      p <- p + theme_classic(base_size = 14)
    } else if (input$theme_choice == "light") {
      p <- p + theme_light(base_size = 14)
    } else if (input$theme_choice == "dark") {
      p <- p + theme_dark(base_size = 14) +
        theme(
          plot.background = element_rect(fill = "gray10"),
          panel.background = element_rect(fill = "gray15"),
          panel.grid = element_line(color = "gray30")
        )
    }
    
    p
  })

  # Generate histogram
  output$histogram <- renderPlot({
    req(processed_data())
    
    values <- processed_data()
    
    # Create histogram with density
    p <- ggplot(data.frame(x = values), aes(x = x)) +
      geom_histogram(aes(y = after_stat(density)), bins = min(30, max(10, length(values)/3)), 
                     fill = "#5dade2", color = "#2874a6", alpha = 0.7) +
      geom_density(color = "#c0392b", linewidth = 1.2) +
      labs(title = "Distribution of Data",
           subtitle = paste("Sample size =", length(values)),
           x = "Value", y = "Density")
    
    # Add theoretical density function
    if (input$dist_type == "norm") {
      p <- p + stat_function(fun = dnorm, args = list(mean = mean(values), sd = sd(values)), 
                          color = "#2471a3", linewidth = 1.2, linetype = "dashed")
      
    } else if (input$dist_type == "t") {
      # Scale the t-distribution to match data
      scaled_dt <- function(x) {
        sd_val <- sd(values)
        mean_val <- mean(values)
        dt((x - mean_val) / sd_val, df = input$t_df) / sd_val
      }
      p <- p + stat_function(fun = scaled_dt, color = "#2471a3", linewidth = 1.2, linetype = "dashed")
      
    } else if (input$dist_type == "unif") {
      p <- p + stat_function(fun = dunif, args = list(min = min(values), max = max(values)), 
                          color = "#2471a3", linewidth = 1.2, linetype = "dashed")
      
    } else if (input$dist_type == "lnorm") {
      # Scale log-normal to match data
      log_mean <- mean(log(values[values > 0]))
      log_sd <- sd(log(values[values > 0]))
      p <- p + stat_function(fun = dlnorm, args = list(meanlog = log_mean, sdlog = log_sd), 
                          color = "#2471a3", linewidth = 1.2, linetype = "dashed")
      
    } else if (input$dist_type == "exp") {
      p <- p + stat_function(fun = dexp, args = list(rate = 1/mean(values)), 
                          color = "#2471a3", linewidth = 1.2, linetype = "dashed")
    }
    
    # Apply selected theme
    if (input$theme_choice == "minimal") {
      p <- p + theme_minimal(base_size = 14)
    } else if (input$theme_choice == "classic") {
      p <- p + theme_classic(base_size = 14)
    } else if (input$theme_choice == "light") {
      p <- p + theme_light(base_size = 14)
    } else if (input$theme_choice == "dark") {
      p <- p + theme_dark(base_size = 14) +
        theme(
          plot.background = element_rect(fill = "gray10"),
          panel.background = element_rect(fill = "gray15"),
          panel.grid = element_line(color = "gray30")
        )
    }
    
    p
  })

  # Generate deviation analysis plot
  output$deviation_plot <- renderPlot({
    req(processed_data())
    
    values <- processed_data()
    
    # Set up base qq plot data
    n <- length(values)
    
    # Theoretical quantiles based on selected distribution
    p <- (1:n - 0.5) / n
    
    if (input$dist_type == "norm") {
      theoretical_quantiles <- qnorm(p)
      dist_name <- "Normal"
    } else if (input$dist_type == "t") {
      theoretical_quantiles <- qt(p, df = input$t_df)
      dist_name <- paste0("t (df = ", input$t_df, ")")
    } else if (input$dist_type == "unif") {
      theoretical_quantiles <- qunif(p)
      dist_name <- "Uniform"
    } else if (input$dist_type == "lnorm") {
      theoretical_quantiles <- qlnorm(p)
      dist_name <- "Log-Normal"
    } else if (input$dist_type == "exp") {
      theoretical_quantiles <- qexp(p)
      dist_name <- "Exponential"
    }
    
    # Sort the observed data
    ordered_values <- sort(values)
    
    # Calculate deviations from the theoretical line
    if (input$dist_type == "norm") {
      # For normal distribution, calculate line based on mean and SD
      slope <- sd(values)
      intercept <- mean(values)
      
      # Theoretical line values
      line_values <- intercept + slope * theoretical_quantiles
      
      # Deviations
      deviations <- ordered_values - line_values
    } else {
      # For other distributions, use the sorted values as observed quantiles
      # and calculate deviations directly
      deviations <- ordered_values - theoretical_quantiles
    }
    
    # Create data frame for plotting
    deviation_data <- data.frame(
      index = 1:n,
      theoretical = theoretical_quantiles,
      observed = ordered_values,
      deviation = deviations,
      deviation_color = ifelse(abs(deviations) > 1.5*sd(deviations), "Large", "Small")
    )
    
    # Create deviation plot
    p <- ggplot(deviation_data, aes(x = theoretical, y = deviation)) +
      geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
      geom_point(aes(color = deviation_color), size = 3, alpha = 0.7) +
      geom_smooth(method = "loess", se = TRUE, color = "#8e44ad", fill = "#9b59b6", alpha = 0.2) +
      scale_color_manual(values = c("Small" = "#2980b9", "Large" = "#c0392b"), 
                       name = "Deviation Size") +
      labs(title = paste("Deviations from", dist_name, "Distribution"),
           subtitle = "Pattern of deviations reveals distribution characteristics",
           x = "Theoretical Quantiles",
           y = "Deviation (Sample - Theoretical)") 
    
    # Apply selected theme
    if (input$theme_choice == "minimal") {
      p <- p + theme_minimal(base_size = 14)
    } else if (input$theme_choice == "classic") {
      p <- p + theme_classic(base_size = 14)
    } else if (input$theme_choice == "light") {
      p <- p + theme_light(base_size = 14)
    } else if (input$theme_choice == "dark") {
      p <- p + theme_dark(base_size = 14) +
        theme(
          plot.background = element_rect(fill = "gray10"),
          panel.background = element_rect(fill = "gray15"),
          panel.grid = element_line(color = "gray30")
        )
    }
    
    p
  })

  # Calculate deviation statistics
  deviation_stats <- reactive({
    req(processed_data())
    
    values <- processed_data()
    
    # Set up base qq plot data
    n <- length(values)
    
    # Theoretical quantiles based on selected distribution
    p <- (1:n - 0.5) / n
    
    if (input$dist_type == "norm") {
      theoretical_quantiles <- qnorm(p)
    } else if (input$dist_type == "t") {
      theoretical_quantiles <- qt(p, df = input$t_df)
    } else if (input$dist_type == "unif") {
      theoretical_quantiles <- qunif(p)
    } else if (input$dist_type == "lnorm") {
      theoretical_quantiles <- qlnorm(p)
    } else if (input$dist_type == "exp") {
      theoretical_quantiles <- qexp(p)
    }
    
    # Sort the observed data
    ordered_values <- sort(values)
    
    # Calculate deviations from the theoretical line
    if (input$dist_type == "norm") {
      # For normal distribution, calculate line based on mean and SD
      slope <- sd(values)
      intercept <- mean(values)
      
      # Theoretical line values
      line_values <- intercept + slope * theoretical_quantiles
      
      # Deviations
      deviations <- ordered_values - line_values
    } else {
      # For other distributions, use the sorted values as observed quantiles
      # and calculate deviations directly
      deviations <- ordered_values - theoretical_quantiles
    }
    
    # Divide deviations into segments (lower tail, middle, upper tail)
    lower_third <- deviations[1:floor(n/3)]
    middle_third <- deviations[(floor(n/3)+1):ceiling(2*n/3)]
    upper_third <- deviations[(ceiling(2*n/3)+1):n]
    
    # Calculate statistics
    list(
      mean_deviation = mean(deviations),
      sd_deviation = sd(deviations),
      max_deviation = max(abs(deviations)),
      mean_lower = mean(lower_third),
      mean_middle = mean(middle_third),
      mean_upper = mean(upper_third),
      sd_lower = sd(lower_third),
      sd_middle = sd(middle_third),
      sd_upper = sd(upper_third)
    )
  })

  # Display summary results
  output$summary_results <- renderPrint({
    req(input$analyze > 0)
    error <- validate_data()
    if (!is.null(error) && startsWith(error, "Error")) return(NULL)
    
    values <- processed_data()
    stats <- deviation_stats()
    
    cat("QQ Plot Analysis Results:\n\n")
    
    cat("Data Summary:\n")
    cat("Sample size:", length(values), "\n")
    cat("Mean:", round(mean(values), 4), "\n")
    cat("Median:", round(median(values), 4), "\n")
    cat("Standard deviation:", round(sd(values), 4), "\n")
    
    # Calculate skewness and kurtosis if e1071 is available
    skew_val <- tryCatch({
      e1071::skewness(values)
    }, error = function(e) {
      NA
    })
    
    kurt_val <- tryCatch({
      e1071::kurtosis(values)
    }, error = function(e) {
      NA
    })
    
    if (!is.na(skew_val)) {
      cat("Skewness:", round(skew_val, 4), 
          ifelse(abs(skew_val) < 0.5, " (approximately symmetric)", 
                ifelse(skew_val > 0, " (right-skewed)", " (left-skewed)")), "\n")
    }
    
    if (!is.na(kurt_val)) {
      cat("Kurtosis:", round(kurt_val, 4), 
          ifelse(abs(kurt_val) < 0.5, " (approximately mesokurtic)", 
                ifelse(kurt_val > 0, " (leptokurtic - heavy tails)", " (platykurtic - light tails)")), "\n")
    }
    
    cat("\nDeviation Analysis:\n")
    cat("Mean deviation:", round(stats$mean_deviation, 4), "\n")
    cat("Std. deviation of deviations:", round(stats$sd_deviation, 4), "\n")
    cat("Maximum absolute deviation:", round(stats$max_deviation, 4), "\n\n")
    
    cat("Distribution Pattern:\n")
    cat("- Lower tail (mean deviation):", round(stats$mean_lower, 4), "\n")
    cat("- Middle section (mean deviation):", round(stats$mean_middle, 4), "\n")
    cat("- Upper tail (mean deviation):", round(stats$mean_upper, 4), "\n\n")
    
    # Normality tests if the selected distribution is normal
    if (input$dist_type == "norm") {
      cat("Formal Normality Tests:\n")
      
      # Shapiro-Wilk Test
      sw_test <- shapiro.test(values)
      cat("Shapiro-Wilk Test: W =", round(sw_test$statistic, 4), 
          ", p-value =", format.pval(sw_test$p.value, digits = 4), "\n")
      
      # Kolmogorov-Smirnov Test
      ks_test <- ks.test(values, "pnorm", mean = mean(values), sd = sd(values))
      cat("Kolmogorov-Smirnov Test: D =", round(ks_test$statistic, 4), 
          ", p-value =", format.pval(ks_test$p.value, digits = 4), "\n")
      
      cat("\nInterpretation of formal tests:\n")
      if (sw_test$p.value < 0.05 || ks_test$p.value < 0.05) {
        cat("At least one formal test indicates significant deviation from normality (p < 0.05).\n")
        cat("This supports the visual assessment from the QQ plot.\n")
      } else {
        cat("Formal tests do not indicate significant deviation from normality (p ≥ 0.05).\n")
        cat("This supports the visual assessment from the QQ plot if points generally follow the reference line.\n")
      }
    }
  })

  # Generate interpretation text
  output$interpretation <- renderUI({
    req(input$analyze > 0)
    error <- validate_data()
    if (!is.null(error) && startsWith(error, "Error")) return(NULL)
    
    stats <- deviation_stats()
    values <- processed_data()
    
    # Interpretation based on pattern of deviations
    interpretation_text <- div(
      h5("QQ Plot Assessment"),
      p("Based on the QQ plot analysis, the data shows the following characteristics:")
    )
    
    # Calculate measures to determine distribution characteristics
    tail_diff <- stats$mean_upper - stats$mean_lower
    middle_vs_tails <- stats$mean_middle - (stats$mean_lower + stats$mean_upper)/2
    
    # Determine distribution type based on deviation patterns
    dist_type <- ""
    
    if (abs(tail_diff) < 0.25 * stats$sd_deviation && 
        abs(middle_vs_tails) < 0.25 * stats$sd_deviation) {
      # Close to reference distribution
      if (input$dist_type == "norm") {
        dist_type <- "The data appears to approximately follow a normal distribution."
      } else {
        dist_type <- paste0("The data appears to approximately follow the selected ", 
                           input$dist_type, " distribution.")
      }
    } else if (tail_diff > 0.5 * stats$sd_deviation) {
      # Right skewed
      dist_type <- "The data appears to be right-skewed (positive skew)."
    } else if (tail_diff < -0.5 * stats$sd_deviation) {
      # Left skewed
      dist_type <- "The data appears to be left-skewed (negative skew)."
    } else if (middle_vs_tails > 0.5 * stats$sd_deviation) {
      # Light tails (platykurtic)
      dist_type <- "The data appears to have light tails (platykurtic) compared to the reference distribution."
    } else if (middle_vs_tails < -0.5 * stats$sd_deviation) {
      # Heavy tails (leptokurtic)
      dist_type <- "The data appears to have heavy tails (leptokurtic) compared to the reference distribution."
    } else {
      # No clear pattern
      dist_type <- "The data shows some deviations from the reference distribution, but without a clear pattern."
    }
    
    # Add overall deviation assessment
    if (stats$max_deviation > 2 * stats$sd_deviation) {
      overall_dev <- "There are notable deviations from the reference distribution, particularly in certain regions."
    } else {
      overall_dev <- "The overall fit to the reference distribution is reasonable, with only minor deviations."
    }
    
    # Complete interpretation
    interpretation_text <- tagList(
      interpretation_text,
      tags$ul(
        tags$li(tags$strong(dist_type)),
        tags$li(overall_dev)
      )
    )
    
    # Add advice based on the analysis
    advice <- div(
      h5("Recommendations"),
      p("Based on this analysis, consider the following:")
    )
    
    advice_items <- list()
    
    if (grepl("appears to approximately follow", dist_type)) {
      # Good fit to reference distribution
      if (input$dist_type == "norm") {
        advice_items <- c(advice_items, 
                         list(tags$li("Parametric statistical methods assuming normality are likely appropriate.")))
      } else {
        advice_items <- c(advice_items, 
                         list(tags$li("Statistical methods appropriate for the selected distribution can be used.")))
      }
    } else {
      # Poor fit to reference distribution
      if (input$dist_type == "norm") {
        advice_items <- c(advice_items, 
                         list(tags$li("Consider non-parametric statistical methods due to deviations from normality.")))
        
        # Suggest data transformations if appropriate
        if (grepl("right-skewed", dist_type)) {
          advice_items <- c(advice_items, 
                           list(tags$li("Try log or square root transformations to address right skew.")))
        } else if (grepl("left-skewed", dist_type)) {
          advice_items <- c(advice_items, 
                           list(tags$li("Try square or cube transformations to address left skew.")))
        } else if (grepl("heavy tails", dist_type)) {
          advice_items <- c(advice_items, 
                           list(tags$li("Consider robust statistical methods that are less sensitive to outliers.")))
        }
      } else {
        advice_items <- c(advice_items, 
                         list(tags$li("Try different theoretical distributions to find a better fit for your data.")))
      }
    }
    
    # Always suggest visual confirmation
    advice_items <- c(advice_items, 
                     list(tags$li("Always supplement this analysis with domain knowledge and other diagnostic plots.")))
    
    # Complete advice
    advice <- tagList(
      advice,
      tags$ul(advice_items)
    )
    
    # Return complete interpretation
    tagList(interpretation_text, advice)
  })

  # Generate deviation pattern interpretation
  output$deviation_pattern <- renderUI({
    req(input$analyze > 0)
    error <- validate_data()
    if (!is.null(error) && startsWith(error, "Error")) return(NULL)
    
    stats <- deviation_stats()
    
    # Create detailed deviation pattern analysis
    div(
      h5("Deviation Pattern Analysis"),
      p("The pattern of deviations in the QQ plot provides insight into how the data distribution differs from the reference distribution:"),
      tags$ul(
        tags$li(tags$b("Lower Tail:"), 
               ifelse(stats$mean_lower < -0.5 * stats$sd_deviation, 
                      " Points fall below the reference line, indicating fewer small values than expected.", 
                      ifelse(stats$mean_lower > 0.5 * stats$sd_deviation,
                             " Points fall above the reference line, indicating more small values than expected.",
                             " Points generally follow the reference line, indicating the lower tail fits the reference distribution."))),
        tags$li(tags$b("Middle Section:"), 
               ifelse(stats$mean_middle < -0.5 * stats$sd_deviation, 
                      " Points fall below the reference line, suggesting the center of the distribution is shifted left.", 
                      ifelse(stats$mean_middle > 0.5 * stats$sd_deviation,
                             " Points fall above the reference line, suggesting the center of the distribution is shifted right.",
                             " Points generally follow the reference line, indicating the middle fits the reference distribution."))),
        tags$li(tags$b("Upper Tail:"), 
               ifelse(stats$mean_upper < -0.5 * stats$sd_deviation, 
                      " Points fall below the reference line, indicating fewer large values than expected.", 
                      ifelse(stats$mean_upper > 0.5 * stats$sd_deviation,
                             " Points fall above the reference line, indicating more large values than expected.",
                             " Points generally follow the reference line, indicating the upper tail fits the reference distribution.")))
      )
    )
  })
}

shinyApp(ui = ui, server = server)

How QQ Plots Work

QQ plots work by comparing the quantiles (percentiles) of your observed data to the quantiles of a theoretical distribution:

Mathematical Procedure

  1. Sort your data in ascending order: \(x_{(1)} \leq x_{(2)} \leq ... \leq x_{(n)}\)

  2. Calculate empirical quantiles by assigning probabilities to each data point:

    • For \(n\) observations, the \(i\)-th ordered value corresponds approximately to the \((i - 0.5)/n\) quantile
    • So each data point \(x_{(i)}\) is plotted against the theoretical quantile corresponding to probability \((i - 0.5)/n\)
  3. Calculate theoretical quantiles for the reference distribution:

    • For normal distribution, use \(\Phi^{-1}((i - 0.5)/n)\) where \(\Phi^{-1}\) is the inverse of the standard normal CDF
    • For other distributions, use the appropriate quantile function
  4. Plot empirical quantiles against theoretical quantiles:

    • If the data follows the theoretical distribution, points will approximately follow a straight line
    • Deviations from the straight line indicate deviations from the theoretical distribution
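
To make the procedure concrete, here is a minimal base R sketch of the same steps for a normal reference distribution; the sample `x` is a placeholder you would replace with your own data.

# Hand-rolled normal QQ plot following steps 1-4 above
x <- c(5.2, 5.5, 6.1, 5.8, 5.5, 5.9, 6.2, 5.7, 5.6, 6.0, 5.8, 5.5)  # placeholder sample

n      <- length(x)
probs  <- (seq_len(n) - 0.5) / n      # step 2: plotting positions (i - 0.5)/n
emp_q  <- sort(x)                     # step 1: empirical quantiles (ordered data)
theo_q <- qnorm(probs)                # step 3: standard normal quantiles

plot(theo_q, emp_q,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal QQ plot (manual construction)")               # step 4
abline(a = mean(x), b = sd(x), col = "blue", lty = 2, lwd = 2)    # reference line fitted by mean/SD
# qqnorm(x); qqline(x)   # the equivalent built-in shortcut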

Interpreting QQ Plot Patterns

Different patterns in QQ plots reveal specific types of deviations from the theoretical distribution:

  • Points follow the diagonal line: Data follows the theoretical distribution
  • Curved pattern bending in one direction: Data has different skewness than the theoretical distribution
  • Points curve upward at the ends: Data has heavier tails (more extreme values)
  • Points curve downward at the ends: Data has lighter tails (fewer extreme values)
  • Points above the line at the left end, below at the right end: Data is left-skewed (negative skew)
  • Points below the line at the left end, above at the right end: Data is right-skewed (positive skew)
  • Step pattern: Data is discrete or has tied values
  • Isolated points at the ends: Data contains outliers

Confidence Bands

Confidence bands on a QQ plot provide a visual guide to assess whether deviations from the diagonal line are statistically significant:

  • Points falling within the bands could be consistent with random variation from the theoretical distribution
  • Points falling outside the bands suggest significant deviations from the theoretical distribution
  • Width of the bands depends on the sample size and chosen confidence level
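
As a rough illustration of how such bands can be computed, the sketch below mirrors the simulation approach used in the app above: repeatedly simulate samples from the fitted normal distribution, sort each one, and take pointwise quantiles of the sorted values. The sample `x` and the helper name `qq_band` are illustrative choices, not part of the app.

# Simulation-based pointwise confidence bands for a normal QQ plot
qq_band <- function(x, conf = 0.95, n_sim = 1000) {
  n     <- length(x)
  sims  <- replicate(n_sim, sort(rnorm(n, mean(x), sd(x))))  # n x n_sim matrix of sorted simulated samples
  alpha <- 1 - conf
  list(theoretical = qnorm((seq_len(n) - 0.5) / n),
       observed    = sort(x),
       lower       = apply(sims, 1, quantile, probs = alpha / 2),
       upper       = apply(sims, 1, quantile, probs = 1 - alpha / 2))
}

set.seed(42)
x     <- rnorm(40, mean = 100, sd = 15)   # placeholder data
bands <- qq_band(x)

plot(bands$theoretical, bands$observed, pch = 19,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
lines(bands$theoretical, bands$lower, lty = 2)
lines(bands$theoretical, bands$upper, lty = 2)
abline(a = mean(x), b = sd(x), col = "blue", lwd = 2)

Points outside the dashed band would be the ones flagged as significant deviations at the chosen confidence level.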

QQ Plots vs. Formal Normality Tests

QQ plots complement formal normality tests like Shapiro-Wilk or Kolmogorov-Smirnov:

  • Output: a QQ plot gives a visual assessment; a formal test returns a p-value
  • Information: a QQ plot shows where and how the data deviate; a formal test gives a binary decision (normal / not normal)
  • Sample size sensitivity: QQ plots work for any sample size; formal tests may flag trivial deviations in large samples
  • Multiple distributions: QQ plots can compare against any distribution; formal tests are usually specific to one distribution
  • Learning curve: QQ plots take practice to interpret; a p-value has a simple interpretation
  • Transformation guidance: QQ plots show which transformation might help; formal tests don’t suggest transformations

For the most comprehensive assessment, use both QQ plots and formal tests together.
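
In practice, the combined workflow can be as short as a few lines of base R; the sample below is simulated purely as a placeholder.

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)   # replace with your own data

qqnorm(x)                           # visual assessment
qqline(x, col = "blue", lwd = 2)
shapiro.test(x)                     # formal test: p >= 0.05 gives no evidence against normality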

Example 1: Normally Distributed Data

A researcher collected measurements of adult heights (in cm) and wants to check if the data follows a normal distribution.

Data (sample of 30 height measurements in cm):

165, 172, 168, 175, 171, 163, 169, 170, 178, 167, 173, 180, 166, 174, 172, 169, 177, 168, 173, 171, 169, 175, 172, 170, 174, 168, 176, 171, 173, 170

Analysis:

  1. QQ Plot Assessment:
    • Points generally follow the diagonal line with minor random deviations
    • No systematic curvature is observed
    • All points fall within the 95% confidence bands
  2. Deviation Analysis:
    • Lower tail: Small deviations with no systematic pattern
    • Middle section: Points closely follow the diagonal line
    • Upper tail: Small deviations with no systematic pattern
  3. Formal Tests:
    • Shapiro-Wilk: W = 0.9827, p = 0.8833
    • Kolmogorov-Smirnov: D = 0.0948, p = 0.9367

Conclusion:

The height data appears to follow a normal distribution. Both the visual assessment (QQ plot) and formal tests support this conclusion. The minor deviations from the diagonal line are consistent with random sampling variation.

How to Report: “Adult height measurements were assessed for normality using a QQ plot and formal tests. The QQ plot showed points following the diagonal reference line with no systematic deviations, and formal tests did not reject normality (Shapiro-Wilk: W = 0.983, p = 0.883; Kolmogorov-Smirnov: D = 0.095, p = 0.937). This supports the use of parametric statistical methods assuming normality for analyzing these data.”
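
For readers who want to reproduce this example, the plot and formal tests can be obtained with base R as in the sketch below (the KS test with estimated parameters mirrors the app’s approach; ties in the heights trigger a harmless warning).

heights <- c(165, 172, 168, 175, 171, 163, 169, 170, 178, 167,
             173, 180, 166, 174, 172, 169, 177, 168, 173, 171,
             169, 175, 172, 170, 174, 168, 176, 171, 173, 170)

qqnorm(heights, main = "Normal QQ plot of adult heights (cm)")
qqline(heights, col = "blue", lwd = 2)

shapiro.test(heights)
ks.test(heights, "pnorm", mean = mean(heights), sd = sd(heights))  # parameters estimated from the data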

Example 2: Right-Skewed Data

A researcher collected response time data (in milliseconds) from a cognitive experiment and wants to check the distribution.

Data Summary:

  • Sample size: 40 response times
  • Mean: 342 ms
  • Median: 328 ms
  • Range: 203 to 687 ms

Analysis:

  1. QQ Plot Assessment:
    • Points follow an upward curve pattern (below the line on the left, above the line on the right)
    • Several points in the upper tail fall outside the confidence bands
    • The deviation pattern indicates right skew (positive skew)
  2. Deviation Analysis:
    • Lower tail: Points fall below the diagonal line, indicating fewer small values than expected
    • Middle section: Points cross the diagonal line from below to above
    • Upper tail: Points fall significantly above the diagonal line, indicating more large values than expected
  3. Formal Tests:
    • Shapiro-Wilk: W = 0.8651, p = 0.0012
    • Kolmogorov-Smirnov: D = 0.1862, p = 0.0012

Conclusion:

The response time data shows a clear right-skewed (positively skewed) distribution and does not follow a normal distribution. Both the QQ plot and formal tests strongly indicate non-normality.

How to Report: “Response time data was assessed for normality using a QQ plot and formal tests. The QQ plot revealed a clear right-skewed pattern with points falling below the diagonal line in the lower tail and above the line in the upper tail. Formal tests confirmed significant deviation from normality (Shapiro-Wilk: W = 0.865, p = 0.001). A log transformation is recommended before applying parametric statistical methods to this data.”
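
The raw response times are not listed here, so the sketch below uses simulated log-normal data with a similar center purely to illustrate the pattern and the effect of a log transformation; the numbers will not match the results reported above.

set.seed(123)
rt_ms <- rlnorm(40, meanlog = log(330), sdlog = 0.3)   # hypothetical right-skewed response times (ms)

op <- par(mfrow = c(1, 2))
qqnorm(rt_ms, main = "Raw times (right-skewed)"); qqline(rt_ms, col = "blue", lwd = 2)
qqnorm(log(rt_ms), main = "Log-transformed");     qqline(log(rt_ms), col = "blue", lwd = 2)
par(op)

shapiro.test(rt_ms)        # typically rejects normality for skewed data
shapiro.test(log(rt_ms))   # usually consistent with normality after the transform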

Common QQ Plot Patterns and What They Mean

Understanding the specific patterns in QQ plots helps identify the exact nature of your data distribution:

Normal Distribution

If your data follows a normal distribution, the QQ plot shows:

  • Points following the diagonal reference line
  • Minor random deviations equally distributed above and below the line
  • No systematic patterns or curvature
  • Most or all points falling within the confidence bands

Right-Skewed (Positive Skew)

A right-skewed distribution shows:

  • Points following a curve that starts below the diagonal line on the left
  • Points rising above the diagonal line on the right
  • The pattern resembles a curve that bends upward
  • Consider log or square root transformations

Left-Skewed (Negative Skew)

A left-skewed distribution shows:

  • Points following a curve that starts above the diagonal line on the left
  • Points falling below the diagonal line on the right
  • The pattern resembles a curve that bends downward
  • Consider square or cube transformations

Heavy Tails (Leptokurtic)

Data with heavier tails than the normal distribution shows:

  • Points following an S-shaped pattern
  • Middle points falling below the diagonal line
  • End points rising above the diagonal line at both ends
  • Indicates more extreme values than expected in a normal distribution

Light Tails (Platykurtic)

Data with lighter tails than the normal distribution shows:

  • Points following an inverted S-shaped pattern
  • Middle points rising above the diagonal line
  • End points falling below the diagonal line at both ends
  • Indicates fewer extreme values than expected in a normal distribution

Bimodal/Multimodal Distribution

Data with multiple modes often shows:

  • Step-like patterns in the QQ plot
  • Horizontal segments or abrupt changes in slope
  • May indicate mixed populations or grouped data
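
A quick way to train your eye on these patterns is to simulate each one and compare the resulting normal QQ plots side by side; the sketch below does this with arbitrary simulation settings.

set.seed(2025)
n <- 200
samples <- list(
  "Normal"                = rnorm(n),
  "Right-skewed"          = rlnorm(n, sdlog = 0.6),
  "Heavy tails (t, df=3)" = rt(n, df = 3),
  "Light tails (uniform)" = runif(n),
  "Bimodal"               = c(rnorm(n / 2, mean = -2), rnorm(n / 2, mean = 2))
)

op <- par(mfrow = c(2, 3))
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm, pch = 19, cex = 0.6)
  qqline(samples[[nm]], col = "blue", lwd = 2)
}
par(op)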

Choosing Data Transformations Based on QQ Plots

QQ plots can guide your choice of data transformation to achieve normality:

  • Right-skewed (positive skew): Log, square root, or reciprocal
  • Left-skewed (negative skew): Square, cube, or exponential
  • Heavy-tailed: Box-Cox or logit
  • Light-tailed: Arcsine or probit
  • Bimodal: Consider analyzing subgroups separately

A good transformation will produce a new QQ plot with points that better follow the diagonal line.
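
One practical way to compare candidate transformations is to draw a QQ plot for each and, as a rough numeric summary, look at the Shapiro-Wilk p-values; the data below are simulated right-skewed values used only for illustration.

set.seed(7)
x <- rlnorm(60, meanlog = 2, sdlog = 0.8)   # hypothetical right-skewed measurements

cands <- list(Original = x, Log = log(x), "Square root" = sqrt(x), Reciprocal = 1 / x)

op <- par(mfrow = c(2, 2))
for (nm in names(cands)) {
  qqnorm(cands[[nm]], main = nm); qqline(cands[[nm]], col = "blue", lwd = 2)
}
par(op)

sapply(cands, function(v) shapiro.test(v)$p.value)  # larger p-value = less evidence against normality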

Test Your Understanding

  1. What does a straight line pattern in a QQ plot indicate?
      A. The data is bimodal
      B. The data follows the reference distribution
      C. The data has outliers
      D. The data is highly skewed
  2. If points in a QQ plot curve upward at both ends (above the line), what does this suggest?
      A. The data has heavy tails (more extreme values than a normal distribution)
      B. The data has light tails (fewer extreme values than a normal distribution)
      C. The data is right-skewed
      D. The data is left-skewed
  3. What pattern would indicate right-skewed (positively skewed) data in a QQ plot?
      A. Points above the line at the left end, below at the right end
      B. Points below the line at the left end, above at the right end
      C. Points following the diagonal line perfectly
      D. Points forming a horizontal pattern
  4. What advantage do QQ plots have over formal normality tests like Shapiro-Wilk?
      A. They’re always more accurate
      B. They provide exact p-values
      C. They show where and how data deviates from normality
      D. They require smaller sample sizes
  5. If data follows a log-normal distribution, what pattern would you expect in a QQ plot against a normal distribution?
      A. A straight diagonal line
      B. An upward curve (suggesting positive skew)
      C. A downward curve (suggesting negative skew)
      D. A horizontal line

Answers: 1-B, 2-A, 3-B, 4-C, 5-B

Common Questions About QQ Plots

What is the difference between a QQ plot and a PP plot?

Both QQ (Quantile-Quantile) and PP (Probability-Probability) plots assess distributional fit, but they do so differently:

  • QQ plots compare the actual data quantiles against theoretical distribution quantiles. They’re better at showing deviations in the tails of distributions and are more widely used for normality assessment.

  • PP plots compare the empirical cumulative distribution function (CDF) to the theoretical CDF. They’re sometimes better for assessing the fit in the center of the distribution.

QQ plots are generally preferred for normality testing because they magnify deviations in the tails, which are often of particular interest.
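
The contrast is easiest to see side by side; the sketch below builds both plots for the same heavy-tailed sample using base R only (the sample itself is simulated for illustration).

set.seed(11)
x  <- rt(100, df = 3)                  # heavy-tailed sample
xs <- sort(x)
p  <- (seq_along(xs) - 0.5) / length(xs)

op <- par(mfrow = c(1, 2))
# QQ plot: theoretical vs sample quantiles (tail deviations are magnified)
plot(qnorm(p, mean(x), sd(x)), xs, main = "QQ plot",
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1, col = "blue", lwd = 2)
# PP plot: theoretical vs empirical cumulative probabilities (center emphasized)
plot(pnorm(xs, mean(x), sd(x)), p, main = "PP plot",
     xlab = "Theoretical probability", ylab = "Empirical probability")
abline(0, 1, col = "blue", lwd = 2)
par(op)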

What sample size do I need for a QQ plot?

QQ plots can be used with almost any sample size, but their interpretability varies:

  • With very small samples (n < 10), patterns may be difficult to discern, and random variation can mislead interpretation
  • For moderate samples (10 ≤ n ≤ 50), QQ plots are highly effective and provide reliable visual assessment
  • For large samples (n > 50), even minor deviations from normality become visible, so you should focus on practical significance rather than perfect alignment

Unlike formal tests that may become overly sensitive with large samples, QQ plots remain useful as they show the magnitude and pattern of deviations.

What do the confidence bands in a QQ plot mean?

Confidence bands in QQ plots provide a visual guide to assess whether deviations from the diagonal line are statistically significant:

  • Points falling within the bands are consistent with random variation from the theoretical distribution
  • Points falling outside the bands suggest statistically significant deviations
  • The width of the bands depends on the sample size and chosen confidence level (typically 95%)

The bands are particularly useful for distinguishing between meaningful deviations and random sampling variation, especially with smaller sample sizes.

Should I rely on QQ plots or formal normality tests?

The best approach is to use both QQ plots and formal tests together:

  • QQ plots show the pattern, location, and magnitude of deviations from normality, helping you understand how your data deviates
  • Formal tests (like Shapiro-Wilk or Kolmogorov-Smirnov) provide objective criteria through p-values

When they agree, you can be more confident in your conclusion. When they disagree (particularly with large samples where formal tests may reject normality despite minor deviations), QQ plots help you assess practical significance.

Can QQ plots be used for distributions other than the normal?

Yes, QQ plots can compare your data to any theoretical distribution, not just the normal:

  • For lognormal distributions, exponential distributions, Weibull distributions, etc.
  • You can even create QQ plots to compare two empirical data samples to each other

When using non-normal reference distributions, the interpretation remains the same: points following a straight line indicate that your data follows the specified distribution.
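
For the two-sample case, base R already provides qqplot(); the two groups below are simulated placeholders.

set.seed(3)
group_a <- rnorm(80, mean = 50, sd = 10)
group_b <- rnorm(60, mean = 55, sd = 10)   # shifted by 5 units

qqplot(group_a, group_b,
       xlab = "Group A quantiles", ylab = "Group B quantiles",
       main = "Two-sample QQ plot")
abline(0, 1, col = "blue", lwd = 2)   # identity line: points on it suggest identical distributions

A parallel shift of the points away from the identity line suggests a location difference, while a changing slope suggests a difference in spread.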

How do I choose a data transformation based on a QQ plot?

The pattern of deviation in your QQ plot suggests which transformation might normalize your data:

  • Upward curve (right skew/positive skew): Try log, square root, or inverse transformations
  • Downward curve (left skew/negative skew): Try square, cube, or exponential transformations
  • S-curve with heavy tails: Try logit transformation or Box-Cox with λ < 1
  • Inverted S-curve with light tails: Try arcsine transformation or Box-Cox with λ > 1

After applying a transformation, create a new QQ plot to assess whether normality has improved. You may need to try several transformations to find the optimal one.

Examples of When to Use QQ Plots

  1. Before parametric tests: To verify normality assumptions for t-tests, ANOVA, or regression
  2. Regression diagnostics: To check if residuals are normally distributed (see the sketch after this list)
  3. Exploratory data analysis: To understand the shape and characteristics of your data distribution
  4. After data transformations: To evaluate if transformations successfully normalized your data
  5. Model comparison: To determine which theoretical distribution best fits your data
  6. Quality control: To detect changes in process distributions
  7. Financial analysis: To assess return distributions and risk models
  8. Environmental studies: To examine distribution of pollutant measurements
  9. Multivariate outlier detection: To identify multivariate outliers using Mahalanobis distances
  10. Meta-analysis: To examine the distribution of effect-size estimates (often alongside funnel plots used to assess publication bias)
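
For use case 2, checking residual normality takes only a few lines; the regression data below are simulated for illustration.

set.seed(9)
x   <- runif(100, 0, 10)
y   <- 3 + 2 * x + rnorm(100, sd = 1.5)   # hypothetical data meeting the normality assumption
fit <- lm(y ~ x)

qqnorm(resid(fit), main = "Normal QQ plot of residuals")
qqline(resid(fit), col = "blue", lwd = 2)
shapiro.test(resid(fit))                  # optional formal check on the residuals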


Reuse

Citation

BibTeX citation:
@online{kassambara2025,
  author = {Kassambara, Alboukadel},
  title = {QQ {Plot} {Analysis} \& {Generator} \textbar{} {Test} {Data}
    {Normality} {Visually}},
  date = {2025-04-12},
  url = {https://www.datanovia.com/apps/statfusion/analysis/inferential/goodness-fit/normality/qq-plot-analysis.html},
  langid = {en}
}
For attribution, please cite this work as:
Kassambara, Alboukadel. 2025. “QQ Plot Analysis & Generator | Test Data Normality Visually.” April 12, 2025. https://www.datanovia.com/apps/statfusion/analysis/inferential/goodness-fit/normality/qq-plot-analysis.html.