Topological Data Analysis
Application in Spatial Statistics
Necessary Packages
This is the main and standard library for Topological Data Analysis in R.
library(TDA)
The R-package TDA is actually a wrapper of several standard Python/C++ libraries: Gudhi, Dionysus and PHAT.
A Toy Example
Data generation
- Let’s use a very simple point pattern.
- There is not need for spatstat there but let’s use it for training purposes.
library(spatstat)
Loading required package: spatstat.data
Loading required package: spatstat.geom
spatstat.geom 2.4-0
Loading required package: spatstat.random
spatstat.random 2.2-0
Loading required package: spatstat.core
Loading required package: nlme
Loading required package: rpart
spatstat.core 2.4-4
Loading required package: spatstat.linnet
spatstat.linnet 2.3-2
spatstat 2.3-4 (nickname: 'Watch this space')
For an introduction to spatstat, type 'beginner'
spatstat.options(transparent=F) # needed for rendering the html file. Remove transparency in plot of ppp object.
<- ppp(x=c(-1.2,1,0,0),
PPtoy y=c(0,0,-0.5,0.5),
window=owin(c(-2,2),c(-2,2)))
A simple plot of these 4 points.
plot(PPtoy, cex=2, pch=20)
Filtration with Rips complexes
<- ripsFiltration(X=cbind(PPtoy$x, PPtoy$y),
Rfiltration maxdimension=4,
maxscale=3,
dist = "euclidean",
library = "GUDHI",
printProgress = FALSE)
A view on the argument:
- In the TDA package, the input cannot be a point pattern (ppp object) but a matrix with on the first column, the first coordinates of the points, on the second column, the second coordinates … Thus we use
cbind
. maxdimension
indicates the maximal dimension of simplices that we allowed.maxscale
how far away the function should look for neighbooring points.distance
the distance used (see the documentation for more info)library
which library to use: gudhi, dionysys or PHAT. Except for one functionality dionysus, this does not change many things when dealing with a few thousands points.
Let’s display the filtration object. The returned object is a list. It contains the following.
A list of complexes
A list of the simplices appearing in the filtration.
$cmplx Rfiltration
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 4 3
[[6]]
[1] 3 2
[[7]]
[1] 4 2
[[8]]
[1] 4 3 2
[[9]]
[1] 3 1
[[10]]
[1] 4 1
[[11]]
[1] 4 3 1
[[12]]
[1] 2 1
[[13]]
[1] 3 2 1
[[14]]
[1] 4 2 1
[[15]]
[1] 4 3 2 1
- The indices represents a label for the points
- \(1\ 2\) means that this is a 1-dimensional simplex: the edge with the points 1 and 2.
- \(4\ 3\ 2\) means that this is a 2-dimensional simplex: the filled triangle with the points 4, 3, and 2.
- \(4\ 3\ 2\ 1\) means that this is a 3-dimensional simplex: the filled tetrahedron with the points 4, 3, 2 and 1.
The filtration values of the complexes
- The filtration values indicates when a simplex appears for the first time in the filtration.
- Not always true but ease the understanding: Think of it as the value of the increasing radius of the growing cirlce.
- This is ordered as
Rfiltration$cmplx
.
$values Rfiltration
[1] 0.000000 0.000000 0.000000 0.000000 1.000000 1.118034 1.118034 1.118034
[9] 1.300000 1.300000 1.300000 2.200000 2.200000 2.200000 2.200000
- We see that the vertices/0-dimensional complexes appears at the filtration values 0.
- Notice how several complexes appear at the end, when there is no more information on the structure.
Other outputs
The following indicates if the filtration values are indeed increasing.
$increasing Rfiltration
[1] TRUE
The following return the distance matrix of the vertices. It appears only if the argument dist="euclidean"
is used.
$coordinates Rfiltration
[,1] [,2]
[1,] -1.2 0.0
[2,] 1.0 0.0
[3,] 0.0 -0.5
[4,] 0.0 0.5
Using the persistence diagram
This is done with the function ripsDiag
. Arguments are the same as for ripsFiltration
up to location
. We will come back on it later.
<- ripsDiag(X = cbind(PPtoy$x, PPtoy$y),
Rdiag maxdimension=4,
maxscale=3,
library = c("GUDHI"),
location = F,
printProgress = FALSE)
The output is
Rdiag
$diagram
dimension Birth Death
[1,] 0 0 3.000000
[2,] 0 0 1.300000
[3,] 0 0 1.118034
[4,] 0 0 1.000000
Warning: The output is not directly the diagram but a list containing the diagram.
Let’s plot it. Note that there is no loop.
plot(Rdiag$diagram)
The barcode is obtained by adding one argument to the plot function.
plot(Rdiag$diagram, barcode=T)
Another filtration – The \(\alpha\)-complexes
<- alphaComplexFiltration(
Afiltration X=cbind(PPtoy$x, PPtoy$y),
library = "GUDHI",
printProgress = FALSE)
A view on the argument:
- As in
ripsFiltration
but withoutmaxdimension, maxscale, dist
.
The output is in the same format as for ripsFiltration
. Note however that the values itself are different.
This was expected. An advantage of the \(\alpha\)-complexe is that it does not include simplicies of dimension higher than the dimension of the ambient space, here 2.
$cmplx Afiltration
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 4 3
[[6]]
[1] 3 2
[[7]]
[1] 4 2
[[8]]
[1] 4 3 2
[[9]]
[1] 3 1
[[10]]
[1] 4 1
[[11]]
[1] 4 3 1
Using the persistence diagram
This is similar to what we did for the Rips complexes but without the arguments maxscale, dist
.
<- alphaComplexDiag(X = cbind(PPtoy$x, PPtoy$y),
Adiag maxdimension=1,
library = c("GUDHI"),
location = F,
printProgress = FALSE)
The output is
Adiag
$diagram
dimension Birth Death
[1,] 0 0.0000 Inf
[2,] 0 0.0000 0.4225000
[3,] 0 0.0000 0.3125000
[4,] 0 0.0000 0.2500000
[5,] 1 0.3125 0.3906250
[6,] 1 0.4225 0.4958507
Warning: The output is not directly the diagram but a list containing the diagram.
Let’s plot it. Note that this is time, there is 2 loops. This looks more in adequation with the data.
plot(Adiag$diagram)
plot(Adiag$diagram, barcode=T)
On the argument location
We have used the argument location=F
. What does it means?
- We may be interested in the precise location the point that kills or gives birth to a topological feature.
- It is possible to find it if we use the library “PHAT” or “Dionysus”. Note that the TDA package requires “GUDHI” for some computation so both libraries needs to be put in input as follows.
<- alphaComplexDiag(X = cbind(PPtoy$x, PPtoy$y),
Adiag2 maxdimension=2,
library = c("GUDHI", "Dionysus"),
location = T,
printProgress = FALSE)
Each point involved in the birth of a topological feature.
$birthLocation Adiag2
[,1] [,2]
[1,] -1.2 0.0
[2,] 1.0 0.0
[3,] 0.0 -0.5
[4,] 0.0 0.5
[5,] 1.0 0.0
[6,] -1.2 0.0
Each point involved in the death of a topological feature.
$deathLocation Adiag2
[,1] [,2]
[1,] -1.2 0.0
[2,] -1.2 0.0
[3,] 1.0 0.0
[4,] 0.0 -0.5
[5,] 1.0 0.0
[6,] -1.2 0.0
Warning: the TDA package returns also an element called cycleLocation
which represents the location of the cycles defining the topological feature. For example, for a loop it returns its border.
This is, in several ways, dangerously wrong
- This works quite well on the simple example in the documentaiton but not in a general setting.
- It actually returns a representant of the loop, void… There is an infinity of representants and worst than that, the mathematical definition of a loop is not trivial.
- We will not cover it in the lecture but I recommend the additional references given for this course and also the excellent list of tutorials given on the youtube channels: https://www.youtube.com/c/AppliedAlgebraicTopologyNetwork.
- In particular the tutorial of Teresa Heiss: https://youtu.be/a-va5BRs14k
A second example – Points on the unit circle
The TDA package includes a simple function for training purposes.
<- circleUnif(n=60, r=1)
Circle1 Circle1
x1 x2
[1,] -0.999458716 0.03289796
[2,] 0.902074100 -0.43158118
[3,] -0.017186760 0.99985230
[4,] -0.796020582 0.60526955
[5,] -0.605256801 -0.79603028
[6,] 0.870878299 0.49149872
[7,] -0.328762353 0.94441268
[8,] -0.999566470 0.02944268
[9,] -0.532656995 0.84633122
[10,] -0.199369468 0.97992439
[11,] 0.987266034 0.15907790
[12,] -0.387242253 0.92197800
[13,] -0.999935737 -0.01133673
[14,] -0.562699411 0.82666158
[15,] 0.596392589 0.80269289
[16,] 0.548770906 -0.83597278
[17,] 0.990839510 -0.13504468
[18,] 0.878282006 -0.47814299
[19,] -0.778898733 -0.62714971
[20,] -0.526743597 0.85002422
[21,] -0.617295257 0.78673157
[22,] 0.331538819 0.94344158
[23,] -0.071813918 -0.99741805
[24,] 0.176011244 0.98438816
[25,] 0.546265486 -0.83761209
[26,] -0.629178606 -0.77726076
[27,] 0.329666422 -0.94409748
[28,] 0.993050378 -0.11769004
[29,] -0.659254608 -0.75191978
[30,] 0.989357205 0.14550712
[31,] -0.202474187 -0.97928760
[32,] 0.577507632 -0.81638529
[33,] -0.295534205 0.95533216
[34,] 0.790605439 -0.61232593
[35,] -0.865357733 -0.50115466
[36,] 0.826890832 0.56236247
[37,] -0.197546625 -0.98029349
[38,] 0.637730361 -0.77025969
[39,] -0.995754442 0.09204940
[40,] 0.995833333 -0.09119196
[41,] 0.360995855 -0.93256742
[42,] -0.551423914 -0.83422519
[43,] 0.969672723 0.24440706
[44,] -0.769145135 -0.63907414
[45,] -0.076006375 -0.99710733
[46,] -0.748277582 -0.66338576
[47,] 0.649431917 -0.76041974
[48,] -0.164547164 -0.98636922
[49,] 0.214354955 -0.97675583
[50,] 0.930361921 -0.36664246
[51,] -0.652015278 -0.75820583
[52,] 0.005898459 0.99998260
[53,] 0.860835838 -0.50888276
[54,] 0.655741073 0.75498586
[55,] -0.810509211 0.58572589
[56,] -0.993230534 0.11615983
[57,] -0.358396974 -0.93356928
[58,] -0.762914521 0.64649937
[59,] -0.803922674 0.59473383
[60,] 0.983642206 -0.18013331
As you may see, this consists of 60 points on the unit circle.
plot(Circle1, asp=1)
The most “efficient” filtration – The Rips complexes
<- ripsFiltration(X=Circle1,
Rfiltration maxdimension=2,
maxscale=3,
dist = "euclidean",
library = "GUDHI",
printProgress = FALSE)
Displaying the filtration is no more a solution as there is too many simplices.
length(Rfiltration$cmplx)
[1] 523685
Thankfully, we do not need it to compute the persistence diagram. This time we see clearly one loop which is what we expect from an unit disc.
<- ripsDiag(X = Circle1, maxdimension=1, maxscale=1,
DiagRips library = "GUDHI", location = F, printProgress = FALSE)
plot(DiagRips$diagram)
Other interesting functions of the TDA package
Summary statistics computation
Let’s use the summary statistics defined for \(r >0\) by:\[\int_0^{r} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d\] and
\[\frac{1}{\sqrt{\rho}|W|} \int_0^{r/\sqrt{\rho}} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d.\]
We may compute it in R as follows. Let’s assume that coords
represents a \(n\) by \(2\) matrix representing the \(x,y\) coordinates of \(n\) points.
<- cbind(PPtoy$x, PPtoy$y)
coords
<- alphaComplexDiag(coords)$diagram # return the persistence diagram computed from the alpha complex
diag
<- data.frame(dimension = diag[,1],
diag_df death = diag[,3]) # Convert the diagram object
# to a data frame. This is easier to handle
sort(diag_df[diag_df$dimension == 0, 'death']) # extract the connected components (dimension 0) and sort them by increasing death time.
[1] 0.2500 0.3125 0.4225 Inf
The summary statistics is now easy to compute as it consists into as for any argument \(r\) it consists into the sum of the death times until \(r\).
Let’s first gather the code above in a function.
<- function(coords){
tda0_raw <- alphaComplexDiag(coords)$diagram
diag <- data.frame(dimension = diag[,1],
diag_df death = diag[,3])
sort(diag_df[diag_df$dimension == 0, 'death'])
}
In the following function:
X
is the input point pattern as ppp object.rc
correspond to the upper bound of the integral before being scaled.
We also use the R function findInterval
which provide us a nice shortcut to compute what we want. See the help function for more information.
<- function(X,
tda0_scaled rc = 0.3){
<- tda0_raw(cbind(X$x, X$y)) # return the increasing list of death time.
dt
<- X$n / area.owin(X$w) # Return the estimate intensity, equivalent to intensity(X)
intens
<- seq(0,rc/sqrt(intens), length.out=1e3)
tx
return(mean(findInterval(tx, dt)/ (sqrt(intens) * area.owin(X$w))))
}
Deviation test
We will test if an observed point process is a realisation of an homogeneous Poisson point process with intensity 2 on the window \([0,10]^2\).
This will be based on an application of the CLT presented in the slides. To that end, we need to estimate the mean and standard deviation of the summary statistics under the null hypothesis. To that end we will :
- Generate many point processes under the null hypothesis.
- Calculate the chosen summary statistics for each of them.
- Take the mean and standard deviation.
- For the observed point pattern, compute the summary statistics.
- Look how much it deviates from the theoretical normal distribution.
- If it deviates by too much, we reject the null hypothesis.
We will chose for this presentation the summary statistic: \[\frac{1}{10\sqrt{2}} \int_0^{0.1/\sqrt{2}} \sum_{i\leq d}\mathbf{1}_{(0,i) \in D}\mathrm{d}d.\]
Note that this choice is arbitrary and the “best choice” is most likely dependent on your problem.
Simulations under the null hypothesis
Let’s simulate a lot of point processes under the null hypothesis.
set.seed(35)
<- 1e4 # the more the better but be gentle with your laptop.
nsim
<- rpoispp(2,
ppps win = owin(c(0,10), c(0,10)),
nsim = nsim)
We now compute the value of the test statistic on each point process, then estimate the mean and standard deviation.
<- NULL
Statval
for (i in 1:length(ppps)) {
<- tda0_scaled(ppps[[i]],
Statval[i] rc= 0.1)
}
<- mean(Statval) Avstat
This is the mean:
Avstat
[1] 0.4840136
And now the standard deviation:
<- sd(Statval)
sdstat sdstat
[1] 0.04281816
A sanity check that our statistic follows a central limit theorem.
hist((Statval-Avstat)/sdstat, prob=T, breaks=25)
lines(seq(-4,4,0.01), dnorm(seq(-4,4,0.01)), col=2, lwd=4) # Plot the density of a standard normal random variable.
This looks pretty good.
Assessing type I and II error
Estimate the type I error
We simulate some Poisson point processes according to the null hypothesis.
set.seed(134)
<- rpoispp(2,
pptype1 win = owin(c(0,10), c(0,10)),
nsim = 1000)
The following function returns the estimated quantile of the test statistics for a given point process. Taking their average gives the estimated p-value.
<- function(X){
pval_teststat <- abs(tda0_scaled(X, rc= 0.1) - Avstat) / sdstat
z_val 2 * (1 - pnorm(z_val))
}
Estimate the p-value from a sample of Poisson process
<-sapply(pptype1, pval_teststat)
Amean(A)
[1] 0.5116972
Let’s see how many time we will reject the test at \(5\%\).
mean(A<0.05) # Shoult return 5% approximately
[1] 0.044
We find approximately the correct type I error.
Estimate type II error
To estimate type II error, we need to restrict ourselve. Let’s simulate matern cluster processes and see if our test rejects the assumption that they are Poisson.
set.seed(78)
<- 1e3
nsim
<- rMatClust(kappa=2,
mctypeII scale=.5,
mu= 1,
win = owin(c(0,10), c(0,10)),
nsim = nsim) # this process should have intensity 2. Can you check it?
Can you empirically checked that these point processes have intensity 2?
Let’s look at a plot.
plot(mctypeII[[1]]$x, mctypeII[[1]]$y, xlim=c(0,10), ylim=c(0,10), pch=20, cex=2, xlab="", ylab="",main="")
The type II error when the type I error is \(5\%\) is given by estimating how many time we reject the null hypothesis when it should be rejected.
mean(sapply(mctypeII, pval_teststat)< 0.05)
[1] 0.69
Here we estimate the type II error to approximately \(69\%\).
Conclusion
- We have deviced a statistical test based on our central limit theorem.
- Note that we could have used another statistics based on the loop for example.
- For example, we could have used the accumulated persistence function based on the loop: \[\mbox{APF}_k(m) = \sum_{p \in D} l_p \mathbf{1}_{\lbrace b_p \leq m \rbrace}.\]
- It can be computed with the following code:
<- function(coords, r){
APF1 <- alphaComplexDiag(coords)$diagram # compute the persistence diagram
diag
# Generate the lifetime and meanage for all features
= diag[,3]-diag[,2]
lifetime = (diag[,3]+diag[,2])/2
meanage
# Gather the lifetime and meanage in a dataframe, together with the corresponding dimension of the feature
<- data.frame(dimension = diag[,1],
diagrot_df lifetime = lifetime,
meanage = meanage )
# Extract the 1 dimensional features
<- diagrot_df[diagrot_df[,1]==1,]
diagrot1_df
# In order for the function APF1 to apply on a vector of values for r, I use
# sapply. For each element i in each r, I extract from the data frame
# diagrot1_df the row with meanage smaller than i
# and take the value on the second column which is the lifetime.
# I then sum these values to obtain the APF evaluated at i.
sapply(r, FUN=function(i){
<- diagrot1_df[diagrot1_df[,3]<i,2]
extracted_lifetime return(sum(extracted_lifetime))
}) }
Let’s see if it works on our toy example.
=seq(0,0.5,0.001)
mplot(m, APF1(coords=cbind(PPtoy$x, PPtoy$y), m), type="l")