Genomics
Oct 16, 2023
Nature volume 616, pages 543–552 (2023)
Intratumour heterogeneity (ITH) fuels lung cancer evolution, which leads to immune evasion and resistance to therapy1. Here, using paired whole-exome and RNA sequencing data, we investigate intratumour transcriptomic diversity in 354 non-small cell lung cancer tumours from 347 of the first 421 patients prospectively recruited into the TRACERx study2,3. Analyses of 947 tumour regions, representing both primary and metastatic disease, alongside 96 tumour-adjacent normal tissue samples, implicate the transcriptome as a major source of phenotypic variation. Gene expression levels and ITH relate to patterns of positive and negative selection during tumour evolution. We observe frequent copy number-independent allele-specific expression that is linked to epigenomic dysfunction. Allele-specific expression can also result in genomic–transcriptomic parallel evolution, which converges on cancer gene disruption. We extract signatures of RNA single-base substitutions and link their aetiology to the activity of the RNA-editing enzymes ADAR and APOBEC3A, thereby revealing otherwise undetected ongoing APOBEC activity in tumours. Characterizing the transcriptomes of primary–metastatic tumour pairs, we combine multiple machine-learning approaches that leverage genomic and transcriptomic variables to link metastasis-seeding potential to the evolutionary context of mutations and increased proliferation within primary tumour regions. These results highlight the interplay between the genome and the transcriptome in influencing ITH, lung cancer evolution and metastasis.
An understanding of the causes of cell-to-cell variation in cancer is essential for making sense of tumour evolution. Recent work has emphasized that much of this variation is transcriptomic, arising from diverse mechanisms that either relate to or are independent of genomic variation4. In mouse models of non-small cell lung cancer (NSCLC), transcriptomic plasticity has been shown to underpin ITH5. Whereas genomic variation reflects the relics of past somatic events acquired during the evolutionary history of a tumour, transcriptomic variation may provide an accurate approximation of the phenotypic state of a tumour at the time of sampling1. To date, most studies of tumour evolution in humans have focused on the impact of genomic alterations in cancer. Transcriptomic studies that use bulk tumour RNA sequencing (RNA-seq) data tend to focus on the breadth of gene expression within a single biopsy taken at a single time point. This approach may fail to capture poorly understood transcriptomic processes, including allele-specific expression (ASE) and RNA editing, that may exert important effects on cancer evolution1,4.
Here we leverage multi-region sequencing data from patients recruited into the TRACERx study2 to better understand the impact of various transcriptomic features, and their interplay with genomic and phenotypic diversity, on NSCLC evolution at different spatial and temporal scales.
We analysed matched RNA-seq and whole-exome sequencing data from 347 patients recruited into the prospective TRACERx study (TRACERx 421 cohort). The samples in the cohort comprised 947 tumour regions from 354 NSCLC tumours (6 patients harboured multiple primaries at the time of diagnosis), as well as 96 tumour-adjacent normal lung tissue regions (see the consolidated standards of reporting trials (CONSORT) diagram in the Supplementary Information)6,7. Of these patients, 344 had 886 primary tumour regions, 21 also had 29 lymph node (LN) metastatic regions sampled at surgical resection of the primary tumour and 24 patients had 30 metastatic tumour regions sampled at relapse or progression. In total, 168 primary tumour regions and 4 LN regions from 64 patients in this cohort were previously described in the TRACERx 100 cohort (ref. 8). The cohort of paired primary–metastatic regions analysed here (and reported in a companion article6) comprises 61 metastatic regions, including LN regions and intrapulmonary metastases resected at surgery (hereafter termed primary LN/satellite lesions) and LN and metastatic regions at recurrence or progression.
[…] 1) was most readily observed within truncating mutations in genes in the highest expression tertile. Notably, within non-cancer genes, signals of negative selection (dN/dS ± 95% confidence intervals of <1) were identified within truncating mutations in genes within the highest expression tertile only (242 truncating mutations, relative to 3,932 observed truncating mutations, were estimated to have been lost through negative selection in these genes). Similar patterns were observed when dividing the data by different expression quantiles (Extended Data Fig. 1i).
[…] >8 reads (Methods). It was possible to evaluate ASE in a total of 16,378 different genes across all samples within the cohort at an average of 3,809 (s.d. ± 885) and 4,064 (s.d. ± 485) genes per tumour and normal tissue sample, respectively.
[…] A>G substitutions, in keeping with ADAR-linked RNA editing, which deaminates adenosine to inosine, a nucleotide that is then read as guanosine by the translation machinery26 and sequencing platforms. Of these substitutions, 65% were present in the REDIportal database27 of known A>G editing events in human tissues. C>T substitutions28 represented 11.8% of the total substitutions detected. Of all the RNA substitutions detected, 67% were tumour specific (not present within a TRACERx panel of samples of normal tissue), and of these, 29.4% were shared between two or more tumours.
[…] A>G transitions, whereas RNA-SBS2 consisted mainly of C>T transitions. RNA-SBS3 consisted mainly of A>G and T>C transitions, RNA-SBS4 of G>A transitions and RNA-SBS5 of G>T transversions. RNA-SBS1 and RNA-SBS3 were identified in most tumours (RNA-SBS1 in 98% and RNA-SBS3 in 85%).
RNA-SBS1 exhibited the lowest ITH and was detected within all regions of 87.4% of multiregion tumours.
[…] A>G sites from REDIportal was highly similar to RNA-SBS1 (cosine similarity = 0.97), consistent with the A>G activity of ADAR underpinning RNA-SBS1.
[…] C>T transitions at TpC sites (67%), a motif consistent with the RNA editing activity of APOBEC3A (ref. 30). In keeping with this, an unbiased analysis showed that RNA-SBS2 correlated more strongly with APOBEC3A expression than with any other gene in the transcriptome (Pearson's r = 0.73, FDR = 4.7 × 10^−108; Fig. 3d). A multiple linear regression considering all APOBEC enzymes revealed that the expression of APOBEC3A was the strongest independent predictor of RNA-SBS2 activity, although APOBEC3F was also significant (P = 2.6 × 10^−57 and P = 0.01 for APOBEC3A and APOBEC3F, respectively, linear mixed-effects model). Investigating the link between RNA-SBS2 and C>T enrichment at APOBEC3A-specific motifs30,31 further confirmed that RNA-SBS2 was strongly influenced by APOBEC3A expression (Extended Data Fig. 3c,d). Associations between gene expression or genomic features and the activity of the three remaining RNA-SBS signatures did not produce any obvious explanations for their aetiology.
[…] >40% of all genes with zero counts (estimated using the QoRTS output Genes_WithZeroCounts) were excluded. Additionally, samples with <20% of reads mapping to a genomic area covered by exactly one gene in a coding sequence genomic region (estimated using the QoRTS output ReadPairs_UniqueGene_CDS) were excluded. Next, RNA coverage was calculated for single nucleotide variants (SNVs) detected in matched whole-exome sequencing data per tumour region using SAMtools (v.1.9)61 mpileup. Mutation expression was used to further quality check the mapping of RNA reads.
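The two count-based exclusion criteria above can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the study's pipeline: the opening of the first criterion is truncated in the text, so reading it as "samples with >40% of genes with zero counts" is an assumption, as are the denominators used to turn the QoRTS counts into fractions. The `SampleQC` class, its field names and `passes_qc` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical container for per-sample QC counts (field names are assumptions)."""
    sample_id: str
    genes_total: int
    genes_with_zero_counts: int       # cf. QoRTS output Genes_WithZeroCounts
    read_pairs_total: int
    read_pairs_unique_gene_cds: int   # cf. QoRTS output ReadPairs_UniqueGene_CDS


def passes_qc(sample: SampleQC,
              max_zero_count_fraction: float = 0.40,
              min_unique_cds_fraction: float = 0.20) -> bool:
    """Return True if the sample survives both count-based filters.

    Assumed reading of the text: exclude samples with >40% of genes at zero
    counts, and samples with <20% of reads in a unique-gene CDS region.
    """
    zero_fraction = sample.genes_with_zero_counts / sample.genes_total
    cds_fraction = sample.read_pairs_unique_gene_cds / sample.read_pairs_total
    return (zero_fraction <= max_zero_count_fraction
            and cds_fraction >= min_unique_cds_fraction)


# Made-up example samples, for illustration only.
samples = [
    SampleQC("region_ok", 20_000, 5_000, 1_000_000, 400_000),
    SampleQC("region_sparse", 20_000, 9_000, 1_000_000, 400_000),
    SampleQC("region_low_cds", 20_000, 5_000, 1_000_000, 150_000),
]
retained = [s.sample_id for s in samples if passes_qc(s)]
```

Here only the first made-up sample is retained; the second fails the zero-count filter and the third fails the coding-sequence filter.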
The expression of SNVs exclusive to a given tumour region was used to detect potential instances of within-patient mislabelling of RNA–DNA matched tumour regions as well as to exclude normal adjacent lung tissue regions that expressed mutations present in paired tumour regions. A similar approach was applied to germline SNPs to further assess potential sample swaps based on patterns of CN variation from matched DNA per tumour region. Tumour regions in which fewer than 10 mutations, or fewer than 25% of the total mutation count, had evidence of expression, and/or less than 10% of SNPs had evidence of biallelic expression, were excluded. Finally, tumour regions clustering with tumour-adjacent normal tissue regions (see the section 'UMAP clustering') and tumour regions with a low purity were also excluded from further analyses. To ensure the reproducibility and portability of the above pipeline, all steps described were implemented through the Nextflow (v.20.07.1)62 pipeline manager.
[…] >0) were evaluated for an enrichment in driver mutations more commonly associated with LUADs.
[…] > 0.5 as not significantly ASE. In the case of CN-dependent ASE, genes were required to show no significant ASE, irrespective of CN, to be categorized as not significantly ASE. Genes with no phasing information were not tested for ASE.
[…] > 0.2). For each of these, we computed the number of CpGs that were significantly hypomethylated and hypermethylated in tumour samples compared to the normal samples, taking only loci that had coverage in all samples (min_normal = 10, min_tumour = 3). We then calculated the fraction of differentially methylated positions that were hypomethylated. Using a linear mixed effects model, with tumour identity as random effect, we then compared this metric to the percentage of genes showing evidence of CN-independent ASE per sample (separately for LUAD and LUSC).
[…] C>T events at known RNA-editing APOBEC motifs.
APOBEC enzymes typically edit C>T variants at the fourth position of 4-nucleotide-long RNA hairpin loops. In particular, APOBEC3A favours the CAT[C>T] motif30,31.
[…] C>T variant site, a Fisher's test was performed to test whether C>T changes within 20 upstream or downstream nucleotides occurred more than expected by chance at specific motifs (CAT[C>T]) in either strand.
[…] >0.2 CCF were considered as seeding for this analysis. In total, 516 primary tumour regions from 206 tumours for which seeding status could be established and for which all metrics tested could be measured (307 non-seeding regions, 209 seeding) were analysed. The following features were also considered for the classifier:
[…] > 0.75, n = 11). We one-hot-encoded categorical features using get_dummies from Pandas (v1.3.3)106 and then split the data into training and test datasets (75/25 split). After encoding, we had a total of 60 features. We scaled the continuous features using MinMaxScaler from sklearn.preprocessing (v.0.0)107 and used SMOTENC from imblearn.over_sampling (v.0.8.0)105 to improve the balance of the dataset in terms of numbers of seeding and non-seeding regions. Finally, we used the sklearn (v.0.0)105 framework to perform additional variable selection before training using a LinearSVC model (penalty = "l1"), keeping those features with importance ≥0.015. This threshold removed 15 out of 60 features. Following this initial pre-processing, we generated different subsets of the dataset depending on the source of the input features, thus downstream processes within the pipeline operated on three datasets: (1) genomic only features, (2) transcriptomic only features, and (3) all features.
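The pre-processing steps described above (one-hot encoding, a 75/25 split, min-max scaling and L1-penalized LinearSVC feature selection at a 0.015 threshold) can be sketched as follows. This is an illustrative sketch on synthetic data, not the study's code. The column names, the synthetic label, the random seeds, the `C` and `max_iter` settings, fitting the scaler on the training split only, and the use of `SelectFromModel` to apply the threshold to absolute coefficients are all assumptions. The SMOTENC oversampling step is omitted here to keep the dependencies to pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_regions = 200

# Synthetic stand-in data: column names are hypothetical, not the study's features.
features = pd.DataFrame({
    "proliferation_score": rng.normal(size=n_regions),
    "clonal_mutation_count": rng.poisson(40, size=n_regions).astype(float),
    "noise_feature": rng.normal(size=n_regions),
    "histology": rng.choice(["LUAD", "LUSC", "other"], size=n_regions),
})
# Synthetic seeding label, loosely tied to one feature so selection has a signal.
noise = rng.normal(scale=0.5, size=n_regions)
seeding = ((features["proliferation_score"] + noise) > 0.3).astype(int)

# 1) One-hot encode the categorical feature.
encoded = pd.get_dummies(features, columns=["histology"], dtype=float)

# 2) 75/25 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    encoded, seeding, test_size=0.25, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# 3) Min-max scale the continuous features (fitted on the training split only,
#    which is an assumption about where the scaler was fitted).
continuous = ["proliferation_score", "clonal_mutation_count", "noise_feature"]
scaler = MinMaxScaler()
X_train[continuous] = scaler.fit_transform(X_train[continuous])
X_test[continuous] = scaler.transform(X_test[continuous])

# 4) L1-penalized LinearSVC; keep features whose |coefficient| >= 0.015.
selector = SelectFromModel(
    LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000),
    threshold=0.015,
)
selector.fit(X_train, y_train)
kept = list(X_train.columns[selector.get_support()])
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```

In a pipeline that includes oversampling, it would normally be applied to the training split only, so that the test split keeps its original class balance.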