TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

Article Outline

Research Article | Volume 4 | Issue 1 | DOI: 10.36959/584/457 Open Access

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

Peng Li

Peng Li 1*

OrigiMed Inc., Shanghai, China

Peng L (2020) TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next- Generation Transcriptome Profiling. Insights Biomed Res 4(1):117-119

Accepted: November 23, 2020 | Published Online: November 25, 2020

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

Abstract

Cancer of unknown primary site (CUP) accounts for 5% of all cancer diagnoses. These patients may benefit from more precise treatment when primary cancer site was identified. Advances in high-throughput sequencing have enabled cost-effective sequencing the transcriptome for clinical application. Here, I present a free, scalable and extendable software for CUP predication called TRANSCUP, which enables (1) Raw data processing, (2) Read mapping, (3) Quality re-port, (4) Gene expression quantification, (5) Random forest machine learning model building for cancer type classification. TRANSCUP achieved high accuracy, sensitivity and specificity for tumor type classification based on external RNA-seq datasets. It has potential for broad clinical application for solving the CUP problem. TRANSCUP is open-source and freely available at https://github.com/plsysu/TRANSCUP

Keywords

CUP, Classification, Machine learning, RNA-seq, Workflow, Transcriptome

Introduction

The Cancer of unknown primary site (CUP) is a heterogeneous group of cancers in the absence of an identifiable primary tumor despite a standard diagnostic approach and accounts for approximately 5% of all cancer diagnoses [1]. The biologic mechanisms underlying this phenomenon are unknown [2]. CUP patients can be given site specific therapy with significant improvement in clinical outcome compared with empirical chemotherapy when cancer primary site was identified [3,4].

Currently, many molecular diagnostic methods have been widely applied in clinic, including RT-PCR [5], microRNA RT-PCR [6], gene expression microarray [7] and DNA methylation microarray [8]. Advances in high-throughput sequencing have enabled cost-effective sequencing the transcriptome for clinical application [9]. RNA-Seq based predication algorithm have been proposed using TCGA's RNA-Seq RSEM expression value and validated on both RNA-Seq and microarray dataset [10]. However, data analysis pipelines from raw FASTQ data to final tumor type predication are currently not fully implemented and validated computationally and are not publicly available [11].

Here, I present a free, scalable and extendable software for CUP predication called TRANSCUP, which comprises modules for raw data processing, read mapping, quality report, gene expression quantification and building a random forest model for cancer type classification. It achieved high accuracy, sensitivity and specificity for tumor type classification based on external RNA-seq datasets. It has potential for broad clinical application for solving the CUP problem. TRANSCUP is open-source and freely available at https://github.com/plsysu/TRANSCUP

Methods

The TRANSCUP workflow is illustrated in Figure 1. To build CUP classifier, I use TCGA RNA-Seq data as training dataset, followed by sample and feature selection, data transformation, data normalization and random forest model building. To predict new tumor samples, I build a bioinformatics pipeline to process data from raw FASTQ reads to tumor type predication result. The details for implementation are the followings:

Training data source

TCGA harmonized gene expression quantification data and clinical data were retrieved from GDC portal (https://portal.gdc.cancer.gov/) via TCGAbiolinks [12]. Totally, 32 cancer types (COAD and READ were combined to CRC) and 10,363 tumor samples were used for model building.

Preprocess and feature selection

All FPKM expression data was log2-transformed, and only genes with (a) Maximum log2 expression greater than 2.5 and (b) Variance in log2 expression greater than 0.1 were retained. After filtering, the genes in each dataset were scaled to zero mean expression and unit variance. The mean and standard variance learned from training dataset for each gene was used to normalize the new data. Feature genes were selected by two criterion: 1) Those genes were different expressed by one tumor type to other tumor types; 2) Those genes were expressed more higher in one tumor type than other tumor types, which evaluated by t-test and log2-fold change, respectively. The top 100, 150, 200 genes' log2 fold change value or log2 fold change value larger than 2.5 or 3 were compared to find the best feature selection method.

Random forest model

The random forest algorithm was used for machine learning model building. To avoid overfitting, model was validated by 10-fold cross-validation. The final model was trained on all training data with the best hyper-parameters.

Bioinformatics pipeline

The raw reads were cleaned prior to following analysis by Trimmomatic [13]. Clean reads were mapped to the human genome GRCh38 by STAR [14] using the 2-pass model. Gene read counts were calculated using HTSeq-count [15]. GENCODE v22 GTF file was used to alignment and HTSeq-count. To quantify gene expression, the fragments per kilobase of transcript per million mapped reads (FPKM) values of each gene were calculated. To evaluate the RNA-seq data quality, multiple metrics include yield, alignment, GC bias, rRNA content, regions of alignment (exon, intron and intragenic), continuity of coverage, 3'/5' bias and count of detectable genes, among others were calculated by RNA-SeQC [16].

Snakemake workflow

To make this software extendable and scalable, I adopted snakemake [17,18] workflow engine to chain the tools, databases, and config files together. This enables users to easily make use of any cluster environment for processing and conveniently manage tools, databases and pipelines.

Results

I found that TOP200 was the best feature selection method after compared with other 4 methods mentioned above (Supplemental Table S1). 3817 genes were retained as feature genes (Supplemental Table S2). In the training phase, we performed 10-fold cross-validation on all training data to avoid overfitting and find best hyper-parameters. The accuracy, kappa, standard deviation of accuracy and standard deviation of kappa were 0.961, 0.959, 0.003292 and 0.003457, respectively. The median of sensitivity and specificity across 32 cancer types were 0.969 and 0.999, respectively (Supplemental Table S3). I used five external public RNA-seq datasets (Supplemental Table S4) to evaluate the performance of TRANSCUP. These datasets totally contain 557 samples and across 4 different cancer types including OSCC (HNSC), CRC, lung cancer (LUAD/LUSC) and BRCA. Notably, the overall accuracy of TRANSCUP was 98.7%, totally 7 samples were misclassified (Supplemental Table S5).

Conclusions

In this article, I described the TRANSCUP package for tumor type predication. TRANSCUP has been validated using 557 external samples and was more accurate than other methods for cancer type classification. Furthermore, TRANSCUP can analyze data from raw FASTQ to final cancer type predication results, and is more scalable and extendable than other methods. Users can train other kinds of models like deep learning models to extend its capability. However, this package needs to be validated on more external RNA-Seq datasets which including more diverse cancer types when data are available. Its actual clinical effect has to be verified by further experiments and clinical trials in the future.

Funding

None.

Conflict of Interest

None declared.

References

Corresponding Author

Peng Li, OrigiMed Inc., 5^th Floor, Building 3, No.115 Xin Jun Huan Road, Minhang District, Shanghai-201114, China.

Copyright

© 2020 Peng. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

References

Download PDF

View PDF

Views and Downloads

Article views
1173

PDF downloads
1006

Insights of Biomedical Research

Article Outline

Table of Contents

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

Abstract

Keywords

Introduction

Methods

Training data source

Preprocess and feature selection

Random forest model

Bioinformatics pipeline

Snakemake workflow

Results

Conclusions

Funding

Conflict of Interest

References

Corresponding Author

Copyright

Abstract

References

Download PDF

View PDF

Views and Downloads

Article views
1173

PDF downloads
1006

Download

Figures

Supplemental Tables

Insights of Biomedical Research

Article Outline

Table of Contents

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

TRANSCUP: A Scalable Workflow for Predicting Cancer of Unknown Primary Based on Next-Generation Transcriptome Profiling

Abstract

Keywords

Introduction

Methods

Training data source

Preprocess and feature selection

Random forest model

Bioinformatics pipeline

Snakemake workflow

Results

Conclusions

Funding

Conflict of Interest

References

Corresponding Author

Copyright

Abstract

References

Download PDF

View PDF

Views and Downloads

Article views 1173

PDF downloads 1006

Download

Figures

Supplemental Tables

Article views
1173

PDF downloads
1006