Pooling pre-processed data from public studies sucks! It takes time and way too much brain energy. When I first started in bioinformatics a couple of years ago, I spent much of my time doing two things:
1.) cleaning -omics data matrices, e.g. mapping between gene IDs (HGNC, Ensembl, UCSC, etc.) for pre-processed data matrices, trying all sorts of bioinformatics pipelines that yield basically the same results, and investigating what exactly is being counted when pulling pre-processed data from public databases.
2.) cleaning metadata annotations, which usually involves extracting free-text labels and aliasing them to a consistent set of categories.
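To make the second chore concrete, here is a minimal sketch of what "aliasing labels to the same categories" looks like in practice. The alias table and labels below are made up for illustration; real cleanup involves far larger lookup tables (or an ontology, as described later).

```python
# Hypothetical alias table collapsing free-text labels into canonical
# categories. Every key/value here is illustrative, not from Skymap.
ALIASES = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
}

def canonicalize(label: str) -> str:
    """Map a raw free-text label to a canonical category, or keep it as-is."""
    key = label.strip().lower()
    return ALIASES.get(key, label.strip())

labels = ["Human", " homo sapiens ", "Mouse", "zebrafish"]
print([canonicalize(l) for l in labels])
# → ['Homo sapiens', 'Homo sapiens', 'Mus musculus', 'zebrafish']
```

The tedious part is not the code: it is building and maintaining the alias table across thousands of studies, each with its own labeling habits.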
Nonetheless, one of the best things that journals have done is forcing authors to submit their raw sequencing data to the Sequence Read Archive (SRA), making the SRA a centralized resource that covers over a million sequencing runs.
Here is the solution I am proposing:
An automated pipeline to generate a single data matrix of sequence read counts for each species and -omic layer, which can also fit on your hard drive (< 1 TB). I believe that “science started with counting” (from “The Emperor of All Maladies” by Siddhartha Mukherjee), and thus I offer counts for all the features. Over 80% of bioinformatic analyses involve counting over some -omic data layer. For example, variant calling mainly requires computing the A/C/G/T counts at each base, expression quantification requires counting the reads covering each gene, and peak calling in ChIP-seq involves counting read coverage over defined regions. Since most findings can be “validated” by simply counting reaction cycles in qPCR, it strikes me as odd that we can’t do the same in omics.
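The "everything is counting" point can be sketched in a few lines. Below, reads are reduced to start positions and regions to half-open intervals; the reads, region names, and coordinates are toy data, not from any real pipeline.

```python
# Minimal sketch of counting over a data layer: tally the reads whose
# start position falls inside each defined region (toy data).
reads = [5, 12, 18, 42, 43, 90]                   # read start positions
regions = {"peak1": (0, 20), "peak2": (40, 50)}   # half-open intervals

counts = {
    name: sum(start <= r < end for r in reads)
    for name, (start, end) in regions.items()
}
print(counts)  # → {'peak1': 3, 'peak2': 2}
```

Real tools use interval trees and handle paired ends, strands, and overlaps, but the output is still a table of counts like this one.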
With the raw counts, you can normalize the data however you want. One thing I noticed is that most normalizations can be derived post hoc from the read counts. For example, FPKM or TPM in RNA-seq can be obtained from the transcript counts with some simple multiplication and division operations, given the transcript lengths and total reads.
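As a sketch of that post hoc derivation, here are the standard TPM and FPKM formulas applied to illustrative counts and transcript lengths (the gene names and numbers are made up):

```python
# Derive FPKM and TPM post hoc from raw counts and transcript lengths.
# All values below are illustrative.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

total_reads = sum(counts.values())  # 1000 in this toy example

# FPKM: fragments per kilobase of transcript per million mapped reads
fpkm = {g: counts[g] * 1e9 / (lengths_bp[g] * total_reads) for g in counts}

# TPM: length-normalized rates rescaled to sum to one million
rates = {g: counts[g] / lengths_bp[g] for g in counts}
rate_sum = sum(rates.values())
tpm = {g: rates[g] / rate_sum * 1e6 for g in counts}

# TPM always sums to 1e6 across genes, which makes samples comparable
assert abs(sum(tpm.values()) - 1e6) < 1e-6
```

The reverse is not true: once a matrix has been normalized to FPKM or TPM, the raw counts cannot generally be recovered, which is why shipping counts is the more flexible choice.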
Also, the metadata table consists of controlled vocabulary (NCI Thesaurus terms) extracted from free-text experiment annotations. For this project, I used the NLM MetaMap engine to extract keywords from free text. The nice thing about this is that the UMLS ecosystem from NLM allows the IDs (Concept Unique Identifiers) to be mapped onto different ontologies that relate the terms. Incidentally, NCIT is by far the cleanest general-purpose biomedical ontology I have seen: it has low term redundancy, encodes medical knowledge from many domains, and is well maintained.
The goal of this pipeline and the resulting read count tables is to suit the most common use cases. To use an analogy, most bioinformatics pipelines out there are like sports cars, with custom flavors for specific groups of drivers. If you have very particular requirements, what I am offering is probably not going to work. What I am trying to create is more like a train system: a general-purpose backbone that supports a wide variety of high-throughput data analytics.
Why Skymap, when so many groups out there are also trying to unify the public data?
To the best of my knowledge, Skymap is the first that offers both the unified -omic data and the metadata. The other important aspect is that the data extraction is fully automated, so it is scalable to new data and systematic, without the biases introduced by manual curation.
Why Skymap offers a local copy instead of a web API:
Again, this project is geared towards bioinformaticians and data scientists who want to quickly test a hypothesis on a truly vast amount of data. Speaking for myself, I really hate having to recover a simple table by requesting each row from a REST API repeatedly, when it should have only required one click on an FTP link. As it turns out, the entire raw metadata from SRA fits in a typical amount of RAM (3 GB), making complex client-server APIs unnecessary.
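To illustrate the difference, here is what "slice it locally" looks like once the metadata is a file on disk. The table below is a toy stand-in built in memory; the column names and run IDs are made up for illustration, not the real SRA schema.

```python
# Hedged sketch: with a local metadata table, filtering is one boolean
# mask in pandas instead of a loop of REST calls. Toy data throughout.
import io
import pandas as pd

tsv = io.StringIO(
    "run_id\tspecies\ttissue\n"
    "SRR001\tHomo sapiens\tliver\n"
    "SRR002\tMus musculus\tbrain\n"
    "SRR003\tHomo sapiens\tbrain\n"
)
meta = pd.read_csv(tsv, sep="\t")

# One vectorized filter over the whole table, no network round-trips.
human_brain = meta[(meta.species == "Homo sapiens") & (meta.tissue == "brain")]
print(human_brain.run_id.tolist())  # → ['SRR003']
```

At 3 GB, the real table loads the same way; the only change is pointing `read_csv` at the downloaded file instead of an in-memory buffer.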
TL;DR: The premise of Skymap is this: with a couple of clicks, all of the publicly available omic data sits on your computer. After that it’s up to you: slice and dice it however you want.