Seriation in R: How to Optimally Order Objects in a Data Matrice

seriation-in-r-logo-1.png


Seriation in R: How to Optimally Order Objects in a Data Matrice

This article describes seriation methods, which consists of finding a suitable linear order for a set of objects in data using loss or merit functions.

There are different seriation algorithms. The input data can be either a dissimilarity matrix or a standard data matrix.

You will learn how to perform seriation in R and visualize the reordered data using the seriation R package.



Contents:

Prerequisites

Load required R packages:

library(seriation)

Data preparation

Demo data: iris

# Load the data
data("iris")
df <- iris
head(df, 2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
# Remove the column `species` (column 5)
df <- df[, -5]
# Reorder the objects randomly
set.seed(123)
df <- df[sample(seq_len(nrow(df))),]
head(df, 2)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 44           5.0         3.5          1.6         0.6
## 118          7.7         3.8          6.7         2.2

Basic example of seriation

Reorder objects and inspect the effect of seriation on dissimilarity matrix:

# Compute dissimilarity matrix
dist_result <- dist(df)
# Seriate objects, reorder rows based on their similarity
object_order <- seriate(dist_result)
# Extract object orders
head(get_order(object_order), 15)
##  [1]  78   2  52  83  76 139 148  32  59  20 129 103 143   4  85
# Visualize the effect of seriation on dissimilarity matrix
pimage(dist_result, main = "Random order")

pimage(dist_result, order = object_order, main = "Reordered")

Effect of the seriation on the original data scale. We standardize the data using scale such that the visualized value is the number of standard diviations an object differs from the variable mean. For matrices containing negative values, pimage() uses automatically a divergent palette.

Since the seriation above only produced an order for the rows of the data, we use NA for columns orders in the R code below.

# Heamap of the raw data
pimage(scale(df), main = "Random")

# Heatmap of the reordered data
pimage(scale(df), order = c(object_order, NA), main = "Reordered")

Heat maps

A heat map is a color coded data matrix with a dendrogram added to one side and to the top to indicate the order of rows and columns.

Typically, reordering is done according to row or column means within the restrictions imposed by the dendrogram.

It is possible to find the optimal ordering of the leaf nodes of a dendrogram which minimizes the distances between adjacent objects. Such an order might provide an improvement over using simple reordering such as the row or column means with respect to presentation.

The R function hmap() [seriation package] uses optimal ordering and can also use seriation directly on distance matrices without using hierarchical clustering to produce dendrograms first. It uses the function gplots::heatmap.2() to create the heatmap.

For the following example, we use again the randomly reordered iris data set from the examples above. To make the variables (columns) comparable, we use standard scaling.

# Standardize the data
df_scaled <- scale(df, center = FALSE)

# Produce a heat map with optimally reordered dendrograms
# Hierarchical clustering is used to produce dendrograms
hmap(df_scaled, margin = c(7, 4), cexCol = 1, labRow = FALSE)

# Specify seriation method
# seriation on the dissimilarity matrices for rows and columns is performed
hmap(df_scaled, method = "MDS")

Bertin’s permutation matrix

The idea is to reveal a more homogeneous structure in a data matrix by simultaneously rearranging rows and columns. The rearranged matrix is displayed and cases and variables can be grouped manually to gain a better understanding of the data.

As an example, we use the USArrests dataset, which contains the violent crime rates by US states in 1973. To make values comparable across columns (variables), the ranks of the values for each variable are used instead of the original values.

For seriation, we calculate distances between rows and between columns using the sum of absolute rank differences (this is equal to the Minkowski distance with power 1).

In the example, below the output is illustrated as a matrix of bars. High values are highlighted (filled blocks). The cases are displayed as the columns and the variables as rows.

# Data preparation
# Load the dataset
data("USArrests")
# Replace original values by their ranks
df <- head(apply(USArrests, 2, rank), 30)

# Perform seriation on row and columns
row_order <- seriate(dist(df, "minkowski", p = 1),  method ="TSP")
col_order <- seriate(dist(t(df), "minkowski", p = 1),  method ="TSP")
orders <- c(row_order, col_order)

# Visualization: matrix of bars 
# Original matrix
bertinplot(df)

# Rearranged matrix
bertinplot(df, orders)

Binary data matrices

Binary data are 0-1 data matrices. The standard visualization of bertinplot(), do not make much sense for binary data. We can use the panel functions panel.squares() or panel.circles().

# Load demo data
data("Townships")
# Visualize the original data
bertinplot(
  Townships, 
  options = list(panel = panel.circles)
  )

# Seriate rows and columns using the bond energy algorithm (BEA)
set.seed(1234)
orders <- seriate(Townships, method = "BEA", control = list(rep = 10))
bertinplot(
  Townships, order = orders, 
  options = list(panel = panel.circles)
  )

A clear structure is visible in the reordered matrix is displayed. The variables (rows in a Bertin plot) can be split into the three categories describing different evolution states of townships:

  1. Rural: No doctor, one-room school and possibly also no water supply
  2. Intermediate: Land reallocation, veterinary and agricultural cooperative
  3. Urban: Railway station, high school and police station

The townships also clearly fall into these three groups which tentatively can be called villages (first 7), towns (next 5) and cities (final 2). The townships B and C are on the transition to the next higher group.

Summary

This articles describes how to reorder objects in a dataset using seriation methods.

References



Version: Français





Comments ( 3 )

  • Tim

    In your “Township”example you: #Seriate rows and columns using the bond energy algorithm (BEA)
    I wanted to know if it is possible to only seriate the rows and fix the columns. I tried with the “margin” but I then get a error for plotting.

    Thanks for your help and your amazing package (sounds dirty 😉

    • Kassambara

      Seriate rows only. When creating the image, you need to specify the columns order even if you seriate only the rows; so, consider using NA for column orders:

      library(seriation)
      data("Townships")
      
      # Seriate rows only using margin = 1
      set.seed(1234)
      row_orders <- seriate(
        Townships, method = "BEA", control = list(rep = 10),
        margin = 1
        )
      
      # Visualize
      # You need to specify both rows and columns order
      # Use NA for column orders
      bertinplot(
        Townships, order = c(row_orders, NA),
        options = list(panel = panel.circles)
      )

      plot of chunk seriate-rows-only

      • Tim

        Perfect! Thank you for the quick reply.

Give a comment

Want to post an issue with R? If yes, please make sure you have read this: How to Include Reproducible R Script Examples in Datanovia Comments