
Low-coverage whole genome sequencing for a highly selective cohort of severe COVID-19 patients

Background: Despite advances in the identification of genetic markers associated with severe COVID-19, the full genetic characterisation of the disease remains elusive. Imputation of low-coverage whole genome sequencing (lcWGS) data has emerged as a competitive method for studying such disease-related genetic markers because it enables genotyping of most of the common genetic variants used in genome-wide association studies. This study explores the potential use of lcWGS imputation in a highly selected cohort of severe COVID-19 patients.

Findings: We generated an imputed dataset of 79 patient variant call format (VCF) files using the GLIMPSE1 tool, each containing, on average, 9.5 million single nucleotide variants. The validation assessment of imputation accuracy yielded a squared Pearson correlation of approximately 0.97 across sequencing platforms, showing that GLIMPSE1 can be used to confidently impute variants with minor allele frequency up to approximately 2% in individuals of Spanish ancestry. We conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex- and age-based differences, and clinical phenotypes, using a standardised set of medical terms developed specifically to characterise severe COVID-19 symptoms in this cohort.

Conclusion: This dataset highlights the utility and accuracy of lcWGS imputation in the study of COVID-19 severity, setting a precedent for other applications in resource-constrained environments. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights into health challenges like COVID-19.
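The imputation accuracy in the Findings is reported as a squared Pearson correlation. As a rough illustration of what that metric measures, the sketch below computes r² between validation genotypes and imputed dosages at the same sites. The function name `squared_pearson` and the toy values are hypothetical and are not taken from the study, which may have computed the metric differently (for example, per minor allele frequency bin or with a dedicated concordance tool).

```python
def squared_pearson(truth, imputed):
    """Squared Pearson correlation (r^2) between two equal-length sequences.

    truth   -- validation genotypes coded as alternate-allele counts (0, 1, 2)
    imputed -- imputed dosages in the range [0, 2] for the same sites
    """
    if len(truth) != len(imputed) or len(truth) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(truth)
    mean_t = sum(truth) / n
    mean_i = sum(imputed) / n

    # Sums of squared deviations and of cross-products about the means.
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((i - mean_i) ** 2 for i in imputed)

    if var_t == 0 or var_i == 0:
        raise ValueError("r^2 is undefined when either input is constant")

    return (cov * cov) / (var_t * var_i)


# Toy example (made-up values, for illustration only).
truth_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0, 2.0]
print(f"r^2 = {squared_pearson(truth_genotypes, imputed_dosages):.3f}")
```

A value close to 1 means the imputed dosages track the validation genotypes almost linearly. Monomorphic sites have to be excluded or handled separately, since the correlation is undefined when either input has zero variance.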

Type: Other
Archiver: European Genome-phenome Archive (EGA)

1 Dataset

Click on a Dataset ID in the table below to learn more, and to find out who to contact about access to these data

Dataset ID	Description	Technology	Samples
EGAD00001011363	This dataset consists of 79 VCF files imputed with the GLIMPSE1 algorithm, using the 1000 Genomes Project Phase 3 dataset as the reference panel of haplotypes, together with the corresponding FASTQ and CRAM files. In total, it comprises approximately 325 GB of FASTQ data, 156 GB of CRAM data, and 6 GB of VCF data. The samples were derived from sequenced DNA from a highly selective cohort of patients who presented severe COVID-19 symptoms during the initial wave of the SARS-CoV-2 pandemic in Madrid, Spain. The cohort is composed mostly of Iberian Populations in Spain (IBS) individuals but also includes some individuals with other genetic backgrounds. On average, each VCF file contains 9.49 million high-confidence single nucleotide variants (95% CI: 9.37 million to 9.61 million).	unspecified	80