Prepare
Installation
SELINA is available for macOS, Linux and Windows. To use SELINA, first create and activate a conda environment.
conda create -n Selina
conda activate Selina
The following command installs SELINA together with all of its dependencies except presto, which must be installed separately (see presto). The R package devtools, which is needed to install presto, is already included in SELINA, so you do not need to install it again.
conda install -c conda-forge -c r -c bioconda -c pfren selina
If your machine has a GPU and you want to use it, additionally run the following command to install cudatoolkit together with matching versions of pytorch and cudnn.
conda install pytorch cudatoolkit cudnn
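After the GPU packages are installed, one quick way to confirm that PyTorch can see the device is to query it from Python. The sketch below is not part of SELINA: the helper name `gpu_available` is hypothetical, and it relies only on PyTorch's standard `torch.cuda.is_available()` call.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    # If PyTorch cannot be found in this environment, there is nothing to query.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("GPU visible to PyTorch:", gpu_available())
```

If this prints `False` on a machine with a GPU, the installed pytorch, cudatoolkit and cudnn versions are a likely place to look first.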
Prepare data
Preprocess of training data
Before running SELINA, prepare the reference datasets in the format shown below. Each reference dataset consists of two paired files: xx_expr.txt, which contains the gene expression profile, and xx_meta.txt, which contains the meta information for each cell. In the expression file, the first column holds the gene names and each following column holds the expression values of one cell; gene names must be gene symbols that match hg38. In the meta file, the first column is the cell type of each cell and the second column is its sequencing platform. To train on a mixture of normal and disease cells, add an additional column named Disease that gives the source of each cell (Normal or the name of the disease). We recommend using the original count data as the reference data. If you choose to use our pre-trained models, you can skip this step.
File tree
reference/
├── train1_expr.txt
├── train1_meta.txt
├── train2_expr.txt
└── train2_meta.txt
Expression matrix
| Gene | Cell1 | Cell2 |
|---|---|---|
| NOC2L | 7 | 3 |
| ISG15 | 10 | 2 |
Meta files
Meta file for normal data
| Celltype | Platform |
|---|---|
| CD8T | 10x |
| CD4T | 10x |
Meta file for disease data
| Celltype | Platform | Disease |
|---|---|---|
| CD8T | 10x | T2D |
| CD4T | 10x | Normal |
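The layout above can be produced and sanity-checked with a few lines of Python. This is a minimal sketch under stated assumptions, not SELINA code: the helper names `write_reference` and `check_reference` are hypothetical, the files are assumed to be tab-separated, the header names are taken from the example tables, and the meta rows are assumed to follow the same order as the cell columns of the expression matrix.

```python
import csv
import os
import tempfile


def write_reference(directory, name, genes, cells, counts, meta_rows, with_disease=False):
    """Write <name>_expr.txt and <name>_meta.txt (tab-separated is an assumption)."""
    expr_path = os.path.join(directory, f"{name}_expr.txt")
    meta_path = os.path.join(directory, f"{name}_meta.txt")
    with open(expr_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["Gene"] + list(cells))      # first column holds gene symbols
        for gene, row in zip(genes, counts):
            writer.writerow([gene] + list(row))      # raw counts, one column per cell
    header = ["Celltype", "Platform"] + (["Disease"] if with_disease else [])
    with open(meta_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(meta_rows)                  # one row per cell (assumed order)
    return expr_path, meta_path


def check_reference(expr_path, meta_path):
    """Check that the pair has one meta row per cell column; return (n_cells, header)."""
    with open(expr_path, newline="") as fh:
        n_cells = len(next(csv.reader(fh, delimiter="\t"))) - 1
    with open(meta_path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    header, body = rows[0], rows[1:]
    if header[:2] != ["Celltype", "Platform"]:
        raise ValueError("meta file must start with Celltype and Platform columns")
    if len(body) != n_cells:
        raise ValueError(f"{n_cells} cells in expression file but {len(body)} meta rows")
    return n_cells, header


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        expr, meta = write_reference(
            tmp, "train1",
            genes=["NOC2L", "ISG15"], cells=["Cell1", "Cell2"],
            counts=[[7, 3], [10, 2]],
            meta_rows=[["CD8T", "10x", "T2D"], ["CD4T", "10x", "Normal"]],
            with_disease=True,
        )
        print(check_reference(expr, meta))
```

Running a check of this kind before training catches the most common mismatch, a meta file whose row count differs from the number of cells in the paired expression file.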
Preprocess of query data
This step normalizes the query data, converts gene names to hg38 gene symbols, and performs dimension reduction and clustering. SELINA supports three input formats: plain, h5 and mtx. In the plain format, the input is a gene-by-cell matrix. The full list of options for the preprocess command is shown below:
usage: selina preprocess [-h] --format {h5,mtx,plain} [--matrix MATRIX]
                         [--separator {tab,space,comma}] [--feature FEATURE]
                         [--gene-column GENE_COLUMN] [--barcode BARCODE]
                         [--gene-idtype {symbol,ensembl}]
                         [--assembly {GRCh38,GRCh37}]
                         [--count-cutoff COUNT_CUTOFF]
                         [--gene-cutoff GENE_CUTOFF]
                         [--cell-cutoff CELL_CUTOFF] [--mito]
                         [--mito-cutoff MITO_CUTOFF]
                         [--variable-genes VARIABLE_GENES] [--npcs NPCS]
                         [--cluster-res CLUSTER_RES] --directory DIRECTORY
                         [--outprefix OUTPREFIX]

optional arguments:
  -h, --help            show this help message and exit

Input files arguments:
  --format {h5,mtx,plain}
                        Format of the count matrix file.
  --matrix MATRIX       Location of count matrix file. If the format is 'h5'
                        or 'plain', users need to specify the name of the
                        count matrix file.If the format is 'mtx', the
                        'matrix' should be the name of .mtx formatted matrix
                        file, such as 'matrix.mtx'.
  --separator {tab,space,comma}
                        The separating character (only for the format of
                        'plain').Values on each line of the plain matrix file
                        will be separated by the character. DEFAULT: tab.
  --feature FEATURE     Location of feature file (required for the format of
                        'mtx'). Features correspond to row indices of count
                        matrix. DEFAULT: features.tsv.
  --gene-column GENE_COLUMN
                        If the format is 'mtx', please specify which column
                        of the feature file to use for gene names. DEFAULT:
                        2.
  --barcode BARCODE     Location of barcode file (required for the format of
                        'mtx'). Cell barcodes correspond to column indices of
                        count matrix. DEFAULT: barcodes.tsv.
  --gene-idtype {symbol,ensembl}
                        Type of gene name, 'symbol' for gene symbol and
                        'ensembl' for ensembl id. DEFAULT: symbol.
  --assembly {GRCh38,GRCh37}
                        Assembly (GRCh38/hg38 and GRCh37/hg19). DEFAULT:
                        GRCh38.

Quality control arguments:
  --count-cutoff COUNT_CUTOFF
                        Cutoff for the number of count in each cell. DEFAULT:
                        1000.
  --gene-cutoff GENE_CUTOFF
                        Cutoff for the number of genes included in each cell.
                        DEFAULT: 500.
  --cell-cutoff CELL_CUTOFF
                        Cutoff for the number of cells covered by each gene.
                        DEFAULT: 10.
  --mito                This flag should be used when you want to filter out
                        cells with a high percentage of mitochondria genes.
  --mito-cutoff MITO_CUTOFF
                        Cutoff for the percentage of mitochondria genes in
                        each cell. DEFAULT: 0.2.

Process arguments:
  --variable-genes VARIABLE_GENES
                        Number of variable genes used in PCA. DEFAULT: 2000.
  --npcs NPCS           Number of dimensions after PCA. DEFAULT: 30.
  --cluster-res CLUSTER_RES
                        Clustering resolution. DEFAULT: 0.6.

Output arguments:
  --directory DIRECTORY
                        Path to the directory where the result file shall be
                        stored.
  --outprefix OUTPREFIX
                        Prefix of output files. DEFAULT: query.
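When preprocessing is driven from a script, it can help to assemble the command line programmatically. The sketch below uses only flags listed in the help text above; the helper `build_preprocess_command` and the output directory name `results` are hypothetical. It only builds the argument list and does not run SELINA.

```python
import shlex


def build_preprocess_command(fmt, matrix, directory, outprefix="query", **options):
    """Assemble a `selina preprocess` argument list from the documented flags."""
    if fmt not in {"h5", "mtx", "plain"}:
        raise ValueError(f"unsupported format: {fmt}")
    argv = ["selina", "preprocess", "--format", fmt, "--matrix", matrix]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")      # e.g. gene_idtype -> --gene-idtype
        if value is True:
            argv.append(flag)                     # boolean switch such as --mito
        elif value is not False and value is not None:
            argv += [flag, str(value)]            # flag followed by its value
    argv += ["--directory", directory, "--outprefix", outprefix]
    return argv


if __name__ == "__main__":
    cmd = build_preprocess_command(
        "mtx", "matrix.mtx", "results",
        feature="features.tsv", barcode="barcodes.tsv",
        gene_idtype="ensembl", mito=True, mito_cutoff=0.2,
    )
    # Print a copy-pasteable shell command.
    print(shlex.join(cmd))
```

Once SELINA is installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.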
This step generates four output files:
- query_res.rds: a Seurat object storing the normalized data, dimension reduction and clustering results
- query_expr.txt: expression matrix of query data with the first column as genes
- query_cluster.png: UMAP plot with cluster labels
- query_cluster_DiffGenes.tsv: differentially expressed genes for each cluster
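A small check after preprocessing can confirm that all four files were written. This is a sketch, not part of SELINA: the helpers `expected_outputs` and `missing_outputs` are hypothetical, and the code assumes that each file name is the value of `--outprefix` followed by the suffixes seen in the default `query_*` names.

```python
from pathlib import Path

# Suffixes inferred from the default output names (query_res.rds, ...).
OUTPUT_SUFFIXES = ("_res.rds", "_expr.txt", "_cluster.png", "_cluster_DiffGenes.tsv")


def expected_outputs(directory, outprefix="query"):
    """List the four paths the preprocessing step is expected to produce."""
    return [Path(directory) / f"{outprefix}{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(directory, outprefix="query"):
    """Return the expected output paths that do not exist as files."""
    return [path for path in expected_outputs(directory, outprefix) if not path.is_file()]


if __name__ == "__main__":
    # "results" is a placeholder for the value passed to --directory.
    for path in missing_outputs("results"):
        print("missing:", path)
```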