Topological Data Analysis

Application in Spatial Statistics

Necessary Packages

This is the main and standard library for Topological Data Analysis in R.

library(TDA)

The R-package TDA is actually a wrapper of several standard Python/C++ libraries: Gudhi, Dionysus and PHAT.

A Toy Example

Data generation

  • Let’s use a very simple point pattern.
  • There is not need for spatstat there but let’s use it for training purposes.
library(spatstat)
Loading required package: spatstat.data
Loading required package: spatstat.geom
spatstat.geom 2.4-0
Loading required package: spatstat.random
spatstat.random 2.2-0
Loading required package: spatstat.core
Loading required package: nlme
Loading required package: rpart
spatstat.core 2.4-4
Loading required package: spatstat.linnet
spatstat.linnet 2.3-2

spatstat 2.3-4       (nickname: 'Watch this space') 
For an introduction to spatstat, type 'beginner' 
spatstat.options(transparent=F) # needed for rendering the html file. Remove transparency in plot of ppp object.
PPtoy <- ppp(x=c(-1.2,1,0,0),
             y=c(0,0,-0.5,0.5), 
             window=owin(c(-2,2),c(-2,2)))

A simple plot of these 4 points.

plot(PPtoy, cex=2, pch=20)

Filtration with Rips complexes

Rfiltration <- ripsFiltration(X=cbind(PPtoy$x, PPtoy$y), 
                              maxdimension=4, 
                              maxscale=3, 
                              dist = "euclidean",
                              library = "GUDHI", 
                              printProgress = FALSE)

A view on the argument:

  • In the TDA package, the input cannot be a point pattern (ppp object) but a matrix with on the first column, the first coordinates of the points, on the second column, the second coordinates … Thus we use cbind.
  • maxdimension indicates the maximal dimension of simplices that we allowed.
  • maxscale how far away the function should look for neighbooring points.
  • distance the distance used (see the documentation for more info)
  • library which library to use: gudhi, dionysys or PHAT. Except for one functionality dionysus, this does not change many things when dealing with a few thousands points.

Let’s display the filtration object. The returned object is a list. It contains the following.

A list of complexes

A list of the simplices appearing in the filtration.

Rfiltration$cmplx
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 4 3

[[6]]
[1] 3 2

[[7]]
[1] 4 2

[[8]]
[1] 4 3 2

[[9]]
[1] 3 1

[[10]]
[1] 4 1

[[11]]
[1] 4 3 1

[[12]]
[1] 2 1

[[13]]
[1] 3 2 1

[[14]]
[1] 4 2 1

[[15]]
[1] 4 3 2 1
  • The indices represents a label for the points
  • \(1\ 2\) means that this is a 1-dimensional simplex: the edge with the points 1 and 2.
  • \(4\ 3\ 2\) means that this is a 2-dimensional simplex: the filled triangle with the points 4, 3, and 2.
  • \(4\ 3\ 2\ 1\) means that this is a 3-dimensional simplex: the filled tetrahedron with the points 4, 3, 2 and 1.

The filtration values of the complexes

  • The filtration values indicates when a simplex appears for the first time in the filtration.
  • Not always true but ease the understanding: Think of it as the value of the increasing radius of the growing cirlce.
  • This is ordered as Rfiltration$cmplx.
Rfiltration$values
 [1] 0.000000 0.000000 0.000000 0.000000 1.000000 1.118034 1.118034 1.118034
 [9] 1.300000 1.300000 1.300000 2.200000 2.200000 2.200000 2.200000
  • We see that the vertices/0-dimensional complexes appears at the filtration values 0.
  • Notice how several complexes appear at the end, when there is no more information on the structure.

Other outputs

The following indicates if the filtration values are indeed increasing.

Rfiltration$increasing  
[1] TRUE

The following return the distance matrix of the vertices. It appears only if the argument dist="euclidean" is used.

Rfiltration$coordinates
     [,1] [,2]
[1,] -1.2  0.0
[2,]  1.0  0.0
[3,]  0.0 -0.5
[4,]  0.0  0.5

Using the persistence diagram

This is done with the function ripsDiag. Arguments are the same as for ripsFiltration up to location. We will come back on it later.

Rdiag <- ripsDiag(X = cbind(PPtoy$x, PPtoy$y), 
                 maxdimension=4, 
                 maxscale=3,
                 library = c("GUDHI"), 
                 location = F, 
                 printProgress = FALSE)

The output is

Rdiag
$diagram
     dimension Birth    Death
[1,]         0     0 3.000000
[2,]         0     0 1.300000
[3,]         0     0 1.118034
[4,]         0     0 1.000000

Warning: The output is not directly the diagram but a list containing the diagram.

Let’s plot it. Note that there is no loop.

plot(Rdiag$diagram)

The barcode is obtained by adding one argument to the plot function.

plot(Rdiag$diagram, barcode=T)

Another filtration – The \(\alpha\)-complexes

Afiltration <- alphaComplexFiltration(
    X=cbind(PPtoy$x, PPtoy$y), 
    library = "GUDHI", 
    printProgress = FALSE)

A view on the argument:

  • As in ripsFiltration but without maxdimension, maxscale, dist.

The output is in the same format as for ripsFiltration. Note however that the values itself are different.

This was expected. An advantage of the \(\alpha\)-complexe is that it does not include simplicies of dimension higher than the dimension of the ambient space, here 2.

Afiltration$cmplx
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 4 3

[[6]]
[1] 3 2

[[7]]
[1] 4 2

[[8]]
[1] 4 3 2

[[9]]
[1] 3 1

[[10]]
[1] 4 1

[[11]]
[1] 4 3 1

Using the persistence diagram

This is similar to what we did for the Rips complexes but without the arguments maxscale, dist.

Adiag <- alphaComplexDiag(X = cbind(PPtoy$x, PPtoy$y), 
                 maxdimension=1,
                 library = c("GUDHI"), 
                 location = F, 
                 printProgress = FALSE)

The output is

Adiag
$diagram
     dimension  Birth     Death
[1,]         0 0.0000       Inf
[2,]         0 0.0000 0.4225000
[3,]         0 0.0000 0.3125000
[4,]         0 0.0000 0.2500000
[5,]         1 0.3125 0.3906250
[6,]         1 0.4225 0.4958507

Warning: The output is not directly the diagram but a list containing the diagram.

Let’s plot it. Note that this is time, there is 2 loops. This looks more in adequation with the data.

plot(Adiag$diagram)

plot(Adiag$diagram, barcode=T)

On the argument location

We have used the argument location=F. What does it means?

  • We may be interested in the precise location the point that kills or gives birth to a topological feature.
  • It is possible to find it if we use the library “PHAT” or “Dionysus”. Note that the TDA package requires “GUDHI” for some computation so both libraries needs to be put in input as follows.
Adiag2 <- alphaComplexDiag(X = cbind(PPtoy$x, PPtoy$y), 
                 maxdimension=2,
                 library = c("GUDHI", "Dionysus"), 
                 location = T, 
                 printProgress = FALSE)

Each point involved in the birth of a topological feature.

Adiag2$birthLocation
     [,1] [,2]
[1,] -1.2  0.0
[2,]  1.0  0.0
[3,]  0.0 -0.5
[4,]  0.0  0.5
[5,]  1.0  0.0
[6,] -1.2  0.0

Each point involved in the death of a topological feature.

Adiag2$deathLocation
     [,1] [,2]
[1,] -1.2  0.0
[2,] -1.2  0.0
[3,]  1.0  0.0
[4,]  0.0 -0.5
[5,]  1.0  0.0
[6,] -1.2  0.0

Warning: the TDA package returns also an element called cycleLocation which represents the location of the cycles defining the topological feature. For example, for a loop it returns its border.
This is, in several ways, dangerously wrong

  • This works quite well on the simple example in the documentaiton but not in a general setting.
  • It actually returns a representant of the loop, void… There is an infinity of representants and worst than that, the mathematical definition of a loop is not trivial.
  • We will not cover it in the lecture but I recommend the additional references given for this course and also the excellent list of tutorials given on the youtube channels: https://www.youtube.com/c/AppliedAlgebraicTopologyNetwork.
  • In particular the tutorial of Teresa Heiss: https://youtu.be/a-va5BRs14k

A second example – Points on the unit circle

The TDA package includes a simple function for training purposes.

Circle1 <- circleUnif(n=60, r=1)
Circle1
                x1          x2
 [1,] -0.999458716  0.03289796
 [2,]  0.902074100 -0.43158118
 [3,] -0.017186760  0.99985230
 [4,] -0.796020582  0.60526955
 [5,] -0.605256801 -0.79603028
 [6,]  0.870878299  0.49149872
 [7,] -0.328762353  0.94441268
 [8,] -0.999566470  0.02944268
 [9,] -0.532656995  0.84633122
[10,] -0.199369468  0.97992439
[11,]  0.987266034  0.15907790
[12,] -0.387242253  0.92197800
[13,] -0.999935737 -0.01133673
[14,] -0.562699411  0.82666158
[15,]  0.596392589  0.80269289
[16,]  0.548770906 -0.83597278
[17,]  0.990839510 -0.13504468
[18,]  0.878282006 -0.47814299
[19,] -0.778898733 -0.62714971
[20,] -0.526743597  0.85002422
[21,] -0.617295257  0.78673157
[22,]  0.331538819  0.94344158
[23,] -0.071813918 -0.99741805
[24,]  0.176011244  0.98438816
[25,]  0.546265486 -0.83761209
[26,] -0.629178606 -0.77726076
[27,]  0.329666422 -0.94409748
[28,]  0.993050378 -0.11769004
[29,] -0.659254608 -0.75191978
[30,]  0.989357205  0.14550712
[31,] -0.202474187 -0.97928760
[32,]  0.577507632 -0.81638529
[33,] -0.295534205  0.95533216
[34,]  0.790605439 -0.61232593
[35,] -0.865357733 -0.50115466
[36,]  0.826890832  0.56236247
[37,] -0.197546625 -0.98029349
[38,]  0.637730361 -0.77025969
[39,] -0.995754442  0.09204940
[40,]  0.995833333 -0.09119196
[41,]  0.360995855 -0.93256742
[42,] -0.551423914 -0.83422519
[43,]  0.969672723  0.24440706
[44,] -0.769145135 -0.63907414
[45,] -0.076006375 -0.99710733
[46,] -0.748277582 -0.66338576
[47,]  0.649431917 -0.76041974
[48,] -0.164547164 -0.98636922
[49,]  0.214354955 -0.97675583
[50,]  0.930361921 -0.36664246
[51,] -0.652015278 -0.75820583
[52,]  0.005898459  0.99998260
[53,]  0.860835838 -0.50888276
[54,]  0.655741073  0.75498586
[55,] -0.810509211  0.58572589
[56,] -0.993230534  0.11615983
[57,] -0.358396974 -0.93356928
[58,] -0.762914521  0.64649937
[59,] -0.803922674  0.59473383
[60,]  0.983642206 -0.18013331

As you may see, this consists of 60 points on the unit circle.

plot(Circle1, asp=1)

The most “efficient” filtration – The Rips complexes

Rfiltration <- ripsFiltration(X=Circle1, 
                              maxdimension=2, 
                              maxscale=3, 
                              dist = "euclidean",
                              library = "GUDHI", 
                              printProgress = FALSE)

Displaying the filtration is no more a solution as there is too many simplices.

length(Rfiltration$cmplx)
[1] 523685

Thankfully, we do not need it to compute the persistence diagram. This time we see clearly one loop which is what we expect from an unit disc.

DiagRips <- ripsDiag(X = Circle1, maxdimension=1, maxscale=1,
library = "GUDHI", location = F, printProgress = FALSE)
plot(DiagRips$diagram)

Other interesting functions of the TDA package

Summary statistics computation

Let’s use the summary statistics defined for \(r >0\) by:\[\int_0^{r} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d\] and

\[\frac{1}{\sqrt{\rho}|W|} \int_0^{r/\sqrt{\rho}} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d.\]

We may compute it in R as follows. Let’s assume that coords represents a \(n\) by \(2\) matrix representing the \(x,y\) coordinates of \(n\) points.

coords <- cbind(PPtoy$x, PPtoy$y)

diag <- alphaComplexDiag(coords)$diagram # return the persistence diagram computed from the alpha complex 

diag_df <-  data.frame(dimension = diag[,1], 
                           death = diag[,3]) # Convert the diagram object 
# to a data frame. This is easier to handle 

sort(diag_df[diag_df$dimension == 0, 'death'])  # extract the connected components (dimension 0) and sort them by increasing death time.  
[1] 0.2500 0.3125 0.4225    Inf

The summary statistics is now easy to compute as it consists into as for any argument \(r\) it consists into the sum of the death times until \(r\).

Let’s first gather the code above in a function.

tda0_raw <- function(coords){    
    diag <- alphaComplexDiag(coords)$diagram
    diag_df <-  data.frame(dimension = diag[,1], 
                           death = diag[,3])
    sort(diag_df[diag_df$dimension == 0, 'death'])  
}

In the following function:

  • X is the input point pattern as ppp object.
  • rc correspond to the upper bound of the integral before being scaled.

We also use the R function findInterval which provide us a nice shortcut to compute what we want. See the help function for more information.

tda0_scaled <- function(X, 
                 rc =  0.3){ 

    dt <- tda0_raw(cbind(X$x, X$y)) # return the increasing list of death time. 
    
    intens <- X$n / area.owin(X$w) # Return the estimate intensity, equivalent to intensity(X)
    
    tx <- seq(0,rc/sqrt(intens), length.out=1e3)

    return(mean(findInterval(tx, dt)/ (sqrt(intens) * area.owin(X$w))))
}

Deviation test

We will test if an observed point process is a realisation of an homogeneous Poisson point process with intensity 2 on the window \([0,10]^2\).

This will be based on an application of the CLT presented in the slides. To that end, we need to estimate the mean and standard deviation of the summary statistics under the null hypothesis. To that end we will :

  • Generate many point processes under the null hypothesis.
  • Calculate the chosen summary statistics for each of them.
  • Take the mean and standard deviation.
  • For the observed point pattern, compute the summary statistics.
  • Look how much it deviates from the theoretical normal distribution.
  • If it deviates by too much, we reject the null hypothesis.

We will chose for this presentation the summary statistic: \[\frac{1}{10\sqrt{2}} \int_0^{0.1/\sqrt{2}} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d.\]

Note that this choice is arbitrary and the “best choice” is most likely dependent on your problem.

Simulations under the null hypothesis

Let’s simulate a lot of point processes under the null hypothesis.

set.seed(35)
nsim <- 1e4 # the more the better but be gentle with your laptop.

ppps <- rpoispp(2, 
                win = owin(c(0,10), c(0,10)), 
                nsim = nsim)

We now compute the value of the test statistic on each point process, then estimate the mean and standard deviation.

Statval <- NULL

for (i in 1:length(ppps)) {
    Statval[i] <- tda0_scaled(ppps[[i]], 
                    rc= 0.1)
}

Avstat <- mean(Statval)

This is the mean:

Avstat
[1] 0.4840136

And now the standard deviation:

sdstat <- sd(Statval)
sdstat
[1] 0.04281816

A sanity check that our statistic follows a central limit theorem.

hist((Statval-Avstat)/sdstat, prob=T, breaks=25) 
lines(seq(-4,4,0.01), dnorm(seq(-4,4,0.01)), col=2, lwd=4) # Plot the density of a standard normal random variable.

This looks pretty good.

Assessing type I and II error

Estimate the type I error

We simulate some Poisson point processes according to the null hypothesis.

set.seed(134)
pptype1 <- rpoispp(2, 
                win = owin(c(0,10), c(0,10)), 
                nsim = 1000)

The following function returns the estimated quantile of the test statistics for a given point process. Taking their average gives the estimated p-value.

pval_teststat <- function(X){ 
    z_val <- abs(tda0_scaled(X, rc= 0.1) - Avstat) / sdstat
    2 * (1 - pnorm(z_val))
}

Estimate the p-value from a sample of Poisson process

A<-sapply(pptype1, pval_teststat)
mean(A)
[1] 0.5116972

Let’s see how many time we will reject the test at \(5\%\).

mean(A<0.05) # Shoult return 5% approximately
[1] 0.044

We find approximately the correct type I error.

Estimate type II error

To estimate type II error, we need to restrict ourselve. Let’s simulate matern cluster processes and see if our test rejects the assumption that they are Poisson.

set.seed(78)
nsim <- 1e3

mctypeII <- rMatClust(kappa=2,
                      scale=.5, 
                      mu= 1, 
                      win = owin(c(0,10), c(0,10)), 
                      nsim = nsim) # this process should have intensity 2. Can you check it?  

Can you empirically checked that these point processes have intensity 2?

Let’s look at a plot.

plot(mctypeII[[1]]$x, mctypeII[[1]]$y, xlim=c(0,10), ylim=c(0,10), pch=20, cex=2, xlab="", ylab="",main="")

The type II error when the type I error is \(5\%\) is given by estimating how many time we reject the null hypothesis when it should be rejected.

mean(sapply(mctypeII, pval_teststat)< 0.05)
[1] 0.69

Here we estimate the type II error to approximately \(69\%\).

Conclusion

  • We have deviced a statistical test based on our central limit theorem.
  • Note that we could have used another statistics based on the loop for example.
  • For example, we could have used the accumulated persistence function based on the loop: \[\mbox{APF}_k(m) = \sum_{p \in D} l_p \mathbf{1}_{\lbrace b_p \leq m \rbrace}.\]
  • It can be computed with the following code:
APF1 <- function(coords, r){
    diag <- alphaComplexDiag(coords)$diagram # compute the persistence diagram
    
    # Generate the lifetime and meanage for all features
    lifetime = diag[,3]-diag[,2]
    meanage = (diag[,3]+diag[,2])/2

    # Gather the lifetime and meanage in a dataframe, together with the corresponding dimension of the feature
    diagrot_df <-  data.frame(dimension = diag[,1], 
                            lifetime = lifetime,
                            meanage = meanage )

    # Extract the 1 dimensional features
    diagrot1_df <-  diagrot_df[diagrot_df[,1]==1,]                   

    # In order for the function APF1 to apply on a vector of values for r, I use 
    # sapply. For each element i in each r, I extract from the data frame 
    # diagrot1_df the row with meanage smaller than i 
    # and take the value on the second column which is the lifetime. 
    # I then sum these values to obtain the APF evaluated at i. 
    sapply(r, FUN=function(i){
        extracted_lifetime <- diagrot1_df[diagrot1_df[,3]<i,2]
        return(sum(extracted_lifetime))
    })
}

Let’s see if it works on our toy example.

m=seq(0,0.5,0.001)
plot(m, APF1(coords=cbind(PPtoy$x, PPtoy$y), m), type="l")