PLNmodels: Poisson lognormal models
The Poisson lognormal model and its variants can be used for the analysis of multivariate count data. This package implements efficient algorithms to fit such models.
Installation
PLNmodels is available on pypi. The development version is available on Gitlab.
Usage and main fitting functions
Description of the package
The package implements three different classes that fit a Poisson-Log-Normal model.
- The fastPLN class fits a PLN model with full covariance matrix, using a variational approximation.
- The fastPLNPCA class fits a PLN-PCA model, using a variational approximation.
- The IMPS_PLN class fits a PLN-PCA model using importance sampling.
IMPS_PLN is always slower than fastPLN, and fastPLNPCA is faster than fastPLN only for datasets with a very large number of genes (p > 5000). However, fastPLNPCA is convenient since it gives the Principal Components (PCs) directly, contrary to fastPLN. To get the PCs using fastPLN, you first need to fit the model and then perform a PCA on the estimated covariance matrix (see the sketch in the parameter-retrieval section below).
All of these classes are aggregated into the PLNmodel class, so that you don't need to deal with multiple classes: it automatically fits the data with one of them.
How to use the package?
First, you need to pip install the package. We recommend creating a new environment before installing the package.
pip install pyPLNmodels
The package comes with an artificial dataset to showcase its functionality. You can load it as follows:
import pandas as pd
Y = pd.read_csv('example_data/Y_test')
O = pd.read_csv('example_data/O_test')
cov = pd.read_csv('example_data/cov_test')
You can then fit a model with the PLNmodel class, specifying the number of principal components:
from pyPLNmodels.models import PLNmodel
nbpcs = 5 # number of principal components
mypln = PLNmodel(q= nbpcs)
mypln.fit(Y,O,cov)
print(mypln)
Note that a fast model is fitted if you do not specify fast = False in the .fit() method. You can set fast = False, but it will take much more time:
nbpcs = 5
mypln = PLNmodel(nbpcs)
mypln.fit(Y,O,cov, fast = False)
print(mypln)
How to get the model parameters back?
You can get the model parameters back by running:
beta = mypln.get_beta()
C = mypln.get_C()
Sigma = mypln.get_Sigma()
This class automatically picks the right model among fastPLN, fastPLNPCA and IMPS_PLN. If you want to know more about each of these algorithms, which are quite different, see the details below.
How to fit each model?
Fit the fastPLN model
You have to call:
from pyPLNmodels.models import fastPLN
fast = fastPLN()
fast.fit(Y,O,cov)
print(fast)
Hyperparameters
Here are the main hyperparameters of the .fit() method of the fastPLN object:
- N_iter_max: the maximum number of iterations. If the algorithm has not converged, you can try to increase it. Default is 200.
- tol: the tolerance of the model. The algorithm stops if the ELBO (the approximated likelihood of the model) has not increased by more than tol. Try to decrease it if the algorithm has not converged. Default is 1e-1.
- good_init: if set to True, the algorithm performs an initialization that can take some time, especially for large datasets. You can set it to False if you want a much faster but random initialization. Default is True.
These three parameters are important, but they won't change the asymptotic behavior of the algorithm: if you run it long enough (i.e. tol is small enough and N_iter_max is big enough), it will converge to the right parameters whatever the hyperparameters. Moreover, the default arguments are convenient for most datasets.
If you want to see the progress of the algorithm in real time, you can set verbose = True in the .fit() method.
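As an illustration, here is a hedged sketch of a fit call that overrides these hyperparameters; it assumes (as for N_iter_max and tol in the example further below) that they are accepted as keyword arguments of .fit():
from pyPLNmodels.models import fastPLN
fast = fastPLN()
# Quick, rough fit: random initialization, loose tolerance, live progress.
# The keyword names are taken from the list above and are assumptions.
fast.fit(Y, O, cov, N_iter_max=200, tol=1e-1, good_init=False, verbose=True)
print(fast)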
How to be sure the algorithm has converged?
Basically, if the ELBO reaches a plateau, the algorithm has converged. If it has not reached a plateau, you can try to increase the number of iterations N_iter_max or to lower the tolerance tol.
Note that you don't need to restart the algorithm from the beginning: you can resume from where it stopped by calling:
fast.fit(Y,O,cov, N_iter_max = 500, tol = 1e-5)
The numerical complexity is quadratic with respect to the number of genes p.
Fit the fastPLNPCA model
To fit the fastPLNPCA object, you first need to declare the number of PCs you want, and then you can fit the object:
from pyPLNmodels.models import fastPLNPCA
nbpcs = 5
fastpca = fastPLNPCA(q=nbpcs)
fastpca.fit(Y,O,cov)
print(fastpca)
The hyperparameters of the .fit() method are the same as for the fastPLN object. Only the default values of N_iter_max and tol are different:
- N_iter_max: default is 5000.
- tol: default is 1e-3.
You can check if the algorithm has converged following the same guidelines as for fastPLN.
The numerical complexity is linear with respect to the number of genes p.
Fit the IMPS_PLN model
To fit the IMPS-based model, you need to declare the number of principal components, and then you can fit the model:
from pyPLNmodels.models import IMPS_PLN
nbpcs = 5
imps = IMPS_PLN(nbpcs)
imps.fit(Y,O,cov)
print(imps)
Hyperparameters
The hyperparameters of the .fit() method of the IMPS_PLN are more complicated and technical. We suggest taking a look at the mathematical description of the package to gain intuition. Basically, the IMPS_PLN estimates the gradients of the log-likelihood with importance sampling. Here are the main hyperparameters and their impact:
- acc: the accuracy of the approximation. The lower it is, the better the gradient approximation, but the slower the algorithm. Default is 0.005. You can try increasing it if you want a faster fit; however, reducing it won't gain much accuracy and will significantly increase the convergence time.
- N_epoch_max: the maximum number of epochs. If the algorithm has not converged, you can try to increase it. Default is 500.
- lr: the learning rate of the gradient ascent. You can try to raise or lower it and see if the final likelihood improves. Default is 0.1.
- batch_size: the batch size of the gradient ascent. The larger it is, the more accurate the gradients, but the slower the algorithm. If you have very large datasets, you can try to increase it. If you decrease it, then you should also decrease the learning rate. Default is 40. It should not exceed the number of samples in your dataset.
- optimizer: the optimizer used for the gradient ascent. You can try torch.optim.RMSprop, which is more robust to inappropriate learning rates; however, lower the learning rate to 0.01 if using it. Default is torch.optim.Adagrad.
- nb_plateau: the algorithm stops if the likelihood of the model has not increased during nb_plateau epochs. Default is 15.
- nb_trigger: since the likelihood is approximated and random, we consider that it has not increased if, during nb_trigger iterations, it has not improved on the maximum likelihood computed so far. This parameter deals with the randomness of the criterion. Default is 5.
- good_init: if set to True, the algorithm performs a precise initialization (which takes some time). You can skip this step by setting good_init = False. Default is True.
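As an illustration, here is a hedged sketch of a fit call overriding some of these hyperparameters; the keyword names are taken from the list above and are assumed to be accepted by .fit():
import torch
from pyPLNmodels.models import IMPS_PLN
imps = IMPS_PLN(5)
# Use RMSprop with a lowered learning rate, as advised above.
imps.fit(Y, O, cov, lr=0.01, optimizer=torch.optim.RMSprop, batch_size=40, N_epoch_max=1000)
print(imps)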
You can see the progress of the algorithm in real time by setting verbose = True in the .fit() method.
The numerical complexity is linear with respect to the number of genes p.
How to be sure the algorithm has converged?
Unfortunately, there is no heuristic to know whether the algorithm has converged. Indeed, even if the likelihood reaches a plateau, it is possible that a much better plateau could be reached with other hyperparameters. This is in fact due to the choice of torch.optim.Adagrad as optimizer: if it has converged, the solution will be very good, but to get a (fast) convergence you need to take the learning rate in the right interval. Fortunately, this interval is quite large, about [0.01, 0.3] in many cases.
If the algorithm still has not converged, you can try to change the optimizer to torch.optim.RMSprop, but lower the learning rate to 0.02 or less. You can also increase the batch_size and the number of epochs. If your dataset is not too big, as a last resort, you can set the learning rate to 0.1, take torch.optim.Rprop as optimizer and set the batch_size to the number of samples in your dataset.
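For concreteness, this last-resort configuration could look as follows (a hedged sketch: the keyword names come from the hyperparameter list above, and the batch size is set to the number of samples):
import torch
n_samples = Y.shape[0]  # number of samples in the dataset
imps.fit(Y, O, cov, lr=0.1, optimizer=torch.optim.Rprop, batch_size=n_samples)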
How to retrieve the parameters of the model?
After fitting the model, you can retrieve its parameters.
To retrieve the regression coefficients beta:
beta_chap = model.get_beta()
To retrieve the covariance matrix Sigma:
Sigma_chap = model.get_Sigma()
Note that for the PCA models, this matrix won't be invertible.
To retrieve the matrix C:
C_chap = model.get_C()
For the fastPLN object, you will get a matrix of size
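As mentioned in the description of the package, you can recover Principal Components from a fitted fastPLN object by performing a PCA on the estimated covariance matrix. Here is a minimal numpy sketch, assuming fast is a fitted fastPLN object and that get_Sigma() returns the p x p covariance matrix (convertible to a numpy array):
import numpy as np
Sigma_chap = np.asarray(fast.get_Sigma())
# Eigendecomposition of the symmetric covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma_chap)
# Keep the leading principal components (eigh sorts eigenvalues in increasing order).
nbpcs = 5
pcs = eigenvectors[:, ::-1][:, :nbpcs]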
Quick mathematical description of the package.
The package tries to infer the parameters of two models:
- Poisson Log Normal-Principal Component Analysis model (PLN-PCA)
- Poisson Log Normal model (PLN) (special case of PLN-PCA model)
We consider the following PLN-PCA model:
- Consider n samples (i = 1, \ldots, n).
- Measure the covariates X_i \in \mathbb R^d for each sample i (altitude, temperature, categorical covariate, ...).
- Consider p features (genes) (j = 1, \ldots, p).
- Measure Y_{ij}, the number of times the feature j is observed in sample i.
- Associate a latent random vector Z_i with each sample i.
- Assume that the unknown (W_i)_{1 \leq i \leq n} are independent and live in a space of dimension q \leq p, such that:
Z_i = O_i + \beta^{\top} X_i + C W_i, \quad W_i \sim \mathcal N(0, I_q),
and
Y_{ij} \mid Z_{ij} \sim \mathcal P\left(\exp(Z_{ij})\right),
where O_i \in \mathbb R^p is a known offset, \beta \in \mathbb R^{d \times p} is the matrix of regression coefficients and C \in \mathbb R^{p \times q}.
We can see that Z_i \sim \mathcal N(O_i + \beta^{\top} X_i, \Sigma) with \Sigma = CC^{\top}, a matrix of rank (at most) q.
The unknown parameter is \theta = (\beta, C) (or, equivalently, \theta = (\beta, \Sigma)).
- When q = p, we call this model the Poisson-Log Normal (PLN) model. In this case, Z_i is a non-degenerate Gaussian with mean O_i + \beta^{\top} X_i and covariance matrix \Sigma.
- When q < p, we call this model Poisson-Log Normal-Principal Component Analysis (PLN-PCA). Indeed, we are doing a PCA in the latent layer, estimating \Sigma with a rank-q matrix: \Sigma = CC^{\top}, C \in \mathbb R^{p \times q}.
The goal of this package is to retrieve \theta from the observed data (Y, O, X). However, almost any integral involving the law of the complete data is intractable, so that we can perform neither a gradient ascent nor an EM algorithm directly. We adopt two different approaches to circumvent this problem:
- Variational approximation of the latent layer (Variational EM)
- Importance sampling based algorithm, using a gradient ascent method.
Variational approach
We would like to use the EM algorithm here, but the E step is intractable, since the law of W_i given Y_i is not known in closed form. We therefore approximate it with a variational distribution \phi_i, chosen in the family of diagonal Gaussian distributions:
\phi_i \in \mathcal{Q}_{\text{diag}}=\left\{ \mathcal{N}\left(M_{i}, \operatorname{diag}(S_{i}\odot S_i)\right), M_i \in \mathbb{R}^q, S_i \in \mathbb{R}^q\right\}
We choose such a Gaussian approximation since W_i is Gaussian, so that the law of W_i given Y_i may be well approximated. However, taking a diagonal covariance matrix breaks the dependency induced by Y_i.
We can prove that J_{Y_i}(\theta, \phi_i) \leq \operatorname{log} p_{\theta}(Y_i) \; \forall \phi_i. The quantity J_{Y}(\theta, \phi) is called the ELBO (Evidence Lower BOund).
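This bound follows from the standard decomposition of the log-likelihood (written here in generic form, with the notation of the variational family above):
\operatorname{log} p_{\theta}(Y_i) = J_{Y_i}(\theta, \phi_i) + \mathrm{KL}\left(\phi_i \,\|\, p_{\theta}(\cdot \mid Y_i)\right), \qquad J_{Y_i}(\theta, \phi_i) = \mathbb{E}_{\phi_i}\left[\operatorname{log} p_{\theta}(Y_i, W_i)\right] - \mathbb{E}_{\phi_i}\left[\operatorname{log} \phi_i(W_i)\right],
and the Kullback-Leibler divergence is non-negative.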
Variational EM
Given an initialization (\theta^0, q^0), the variational EM aims at maximizing the ELBO by alternating between two steps:
- VE step: update q: q^{t+1}=\underset{q \in \mathcal{Q}_{\text{diag}}}{\arg \max } J_Y(\theta^{t}, q)
- M step: update \theta: \theta^{t+1}=\underset{\theta}{\arg \max } J_Y(\theta, q^{t+1})

Each step is an optimisation problem that needs to be solved using analytical forms or gradient ascent. Note that q is completely determined by M = (M_i)_{1 \leq i \leq n } \in \mathbb R ^{n\times q} and S = (S_i)_{1 \leq i \leq n } \in \mathbb R ^{n\times q}, so that J is a function of (M, S, \beta, \Sigma). Here q = (M, S) are the variational parameters and \theta = (\beta, \Sigma) are the model parameters.
Case q = p
The case q = p does not perform dimension reduction, but is very fast to fit. Indeed, computations show that the M step is straightforward in this case, as we can update \Sigma and \beta in closed form:
\begin{aligned} \Sigma^{(t+1)} & = \frac{1}{n} \sum_{i}\left((M^{(t)}-X\beta)_{i} (M^{(t)}-X\beta)_{i}^{\top}+\operatorname{diag}\left(S^{(t)}_{i}\odot S^{(t)}_{i}\right)\right)\\ \beta^{(t+1)} &= (X^{\top}X)^{-1}X^{\top}M^{(t)} \\ \end{aligned}
This results in a fast algorithm, since we only need to run a gradient ascent on the variational parameters M and S. Practice shows that we only need to do one gradient step on M and S, update \beta and \Sigma with their closed forms, then perform another gradient step on M and S, and so on.
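As an illustration, here is a minimal numpy sketch of these closed-form updates, assuming M and S hold the variational means and standard deviations (n x p arrays in the q = p case) and X is the n x d covariate matrix:
import numpy as np

def m_step(X, M, S):
    # Closed-form M step of the variational EM in the q = p case.
    n = M.shape[0]
    beta = np.linalg.solve(X.T @ X, X.T @ M)   # (d, p) regression coefficients
    resid = M - X @ beta                       # (n, p) centered variational means
    Sigma = (resid.T @ resid + np.diag((S ** 2).sum(axis=0))) / n
    return beta, Sigma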
Case q < p
When q < p, we do not have any analytical form and are forced to perform gradient ascent on all the parameters. Practice shows that we can perform a gradient step on all the parameters at once (solving each VE step and M step exactly is quite inefficient).
Importance sampling based algorithm
In this section, we try to estimate the gradients with respect to $\theta = (C, \beta) $.
One can use importance sampling to estimate the likelihood:
p_{\theta}(Y_i) = \int \tilde p_{\theta}^{(u)}(W) \mathrm dW \approx \frac 1 {n_s} \sum_{k=1}^{n_s} \frac {\tilde p_{\theta}^{(u)}(V_k)}{g(V_k)}, ~ ~ ~(V_{k})_{1 \leq k \leq n_s} \overset{iid}{\sim} g
where g is the importance law, n_s is the sampling effort and
\begin{array}{ll} \tilde p_{\theta}^{(u)}\ :& \mathbb R^{q} \to \mathbb R^+ \\ & W \mapsto p_{\theta}(Y_i| W) p(W) = \exp \left( - \frac 12 \| W\|^2 - \mathbf{1}_p^{\top} \exp(O_i + \beta^{\top}X_i + CW) + Y_i^{\top}(O_i + \beta^{\top}X_i +CW)\right) \\ \end{array}
To learn more about the (crucial) choice of g, please see Memoire PLN, section 3.2.3.
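To make the estimator concrete, here is a small self-contained numpy sketch of the importance-sampling approximation of log p_theta(Y_i), using a standard Gaussian proposal g for simplicity (the package uses a more careful choice of g, as discussed above); all dimensions and parameter values are arbitrary illustrations:
import numpy as np

rng = np.random.default_rng(0)
n_s, p, q, d = 1000, 10, 3, 2                 # sampling effort and dimensions
Y_i = rng.poisson(1.0, size=p)                 # counts for one sample
O_i = np.zeros(p)                              # known offsets
X_i = rng.normal(size=d)                       # covariates of sample i
beta = rng.normal(size=(d, p))                 # regression coefficients
C = rng.normal(size=(p, q))                    # loading matrix

def log_p_tilde(W):
    # log of the unnormalized integrand p_tilde_theta^(u)(W)
    Z = O_i + beta.T @ X_i + C @ W
    return -0.5 * W @ W - np.exp(Z).sum() + Y_i @ Z

V = rng.normal(size=(n_s, q))                  # V_k iid from g = N(0, I_q)
log_g = -0.5 * q * np.log(2 * np.pi) - 0.5 * (V ** 2).sum(axis=1)
log_w = np.array([log_p_tilde(v) for v in V]) - log_g
# log of (1/n_s) sum_k p_tilde(V_k) / g(V_k), computed stably (log-sum-exp)
log_p_hat = np.log(np.exp(log_w - log_w.max()).mean()) + log_w.max()
print(log_p_hat)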
One can do the following approximation:
\begin{equation}\label{one integral} \nabla _{\theta} \operatorname{log} p_{\theta}(Y_i) \approx \nabla_{\theta} \operatorname{log}\left(\frac 1 {n_s} \sum_{k=1}^{n_s} \frac {\tilde p_{\theta}^{(u)}(V_k)}{g(V_k)}\right)\end{equation}
and derive the gradient formulas:
\nabla_{\beta} \operatorname{log} p_{\theta}(Y_i)\approx X_iY_i^{\top} -\frac{\sum_{k = 1}^{n_s}\frac{\tilde p_{\theta}(V_k)}{g(V_k)}X_i\operatorname{exp}(O_i + \beta^{\top}X_i + CV_k)^{\top}}{\sum_{k = 1}^{n_s}\frac{\tilde p_{\theta}(V_k)}{g(V_k)}}
\nabla_{C} \operatorname{log} p_{\theta}(Y_i)\approx \frac{\sum_{k = 1}^{n_s}\frac{\tilde p_{\theta}(V_k)}{g(V_k)}\left[Y_{i}- \exp \left(O_i + \beta^{\top} X_{i}+CV_{k}\right)\right] V_{k}^{\top}}{\sum_{k = 1}^{n_s}\frac{\tilde p_{\theta}(V_k)}{g(V_k)}}
Given the estimated gradients, we can run a gradient ascent to increase the likelihood. We use variance-reduction algorithms such as SAGA, SAG or SVRG, implemented in the VR.py file.