What percentage of the population provides their DNA for research? Are there efforts/estimates of how much it would cost to provide cash incentives to grab more data, and what the projected benefits would be? Personally I'd be happy to provide my DNA if I was asked by my primary care, but I've never been asked. Isn't my DNA valuable? Is this something government would have to legislate, or is there a private means of coordination?
These are all live questions. We have a lot of DNA for research purposes (23andMe: 12M; All of Us: 1M; UK Biobank: 500k; Regeneron: 1M exomes; gnomAD; etc.), but DNA by itself is only so useful if we don't have family sampling and good phenotyping data. We only have a modest amount of this relative to what we could have. It is usually expensive to collect rigorous phenotype data, especially at the scale needed to power GWAS for complex traits. It's why easy measures like height, educational attainment (EA), etc. have been published on first.
The other issue is that genotyping approaches vary in meaningful ways. We may actually need whole genomes assessed with long-read sequencing to answer some important questions. Most genotyping, like that done by Ancestry or 23andMe, is sparse SNP data (optimized for ancestry info), while some of the better medical biobanks have exomes (~1.5% of the genome), which were assessed with short-read sequencing.
There have been pilot programs where newborns are prospectively provided with whole genome sequencing. The results have shown these to be medically valuable and cost-effective, but this hasn't resulted in a campaign to do this for all newborns born every year. There are a number of ethical, regulatory, economic, and logistical issues that would have to be resolved before such a program could take off. A population newborn program is how we would start to make genomic medicine a regular part of healthcare (most family med doctors do not have expertise in genetics and so it still wouldn't happen in that setting).
In general, the rate of participation in clinical research from the broader public is quite low despite strong incentives from payers to move patients onto trial rolls (especially in oncology). I do know that the participation rate of cancer patients is approximately 5% of the eligible population, though it is probably logistical hurdles and lack of know-how that hold down the participation rate. Nothing is centralized about the system (except ClinicalTrials.gov), and for the longest time trial rules required being on-site to participate. Not fun to be a cancer patient in Mississippi who needs to get to California for a trial.
Collecting DNA can be decentralized (it is easy to get a blood draw or spit into a tube and mail it), but that doesn't solve the phenotype collection issue. Researchers have tried to turn to wearables, telemetry, etc., but this hasn't scaled yet.
Theoretically, it would be the job of a primary care doctor to collect phenotypic data -- family history of cancer, family history of heart disease, incidence of injury, incidence of surgery. Then, this data could be sent for central processing, where this phenotypic data would be mapped onto genotypic data to look for correlates. You say that 23andMe has optimized SNP data, and we need exomes or even full sequencing to be effective. How much does full sequencing cost per person?
Yes, so there have been updates to medical record regulations and laws, and the portability of the medical record has improved some of late, but there are still enormous regulatory and technical hurdles to turning health record data into something usable for research purposes. For example, there is a lot of useful data stored in PDF or other unstructured formats, but it is not particularly easy to move data from a PDF into an executable and standardized file format (even using AI-based approaches like optical character recognition and natural language processing) at scale and be confident in its quality.
Also, the above discussion just entirely sets aside the legal privacy issues. Every individual would need to be consented for the research use of their data. Resolving these issues would probably require some changes to existing federal legislation and agency policies (HIPAA, GINA, etc.).
So yes, theoretically some public entity like the NIH or a private institution (the Broad) could do something like this where the electronic health record (EHR) is synced with genomic data. All of Us and the UK Biobank already enable this to some degree (more feasible for UKB due to the centralized national health service), but even then, this is not necessarily the level of phenotypic resolution that I mean (though it's better than what currently is cutting edge). Longitudinally collected, super high-resolution quantitative phenotypes will be much more useful than categorical things like simply having a diagnosis or a particular health event. Some stuff like heart rate, blood pressure, blood lipids, and other labs is in the EHR, but other stuff like the total volume of one's ventromedial prefrontal cortex or how dense one's myofibrils are is not.
The oft-quoted cost per clinical-grade genome is $100-$200, but this is a cost of goods sold (COGS) perspective and doesn't include the analysis portion of the process, which is the real bottleneck. There are a couple of direct-to-consumer companies that offer exome or genome sequencing, but I believe they just return customers a variant call format (VCF) file, which is basically just a list of all the locations in their genome that differ from the reference. This is not some summarizing laboratory report (which would trigger a bunch of regulatory standards they'd need to meet) and is pretty much useless to someone without bioinformatic expertise (though AI could probably provide some clever people a good assist). There is no other way for a healthy person to access exome or genome sequencing unless they have a medical need (e.g. rare undiagnosed disease or a clear family history of a hereditary disease), and then the cost will run $3-10k (mostly covered by payers). But a lot of this sequencing is done by private laboratories, and the only data returned to clinicians and patients is the laboratory report.
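To make the VCF point concrete, here is a minimal sketch in Python of what such a file contains. The two records are made up for illustration (hypothetical positions, alleles, and sample name, not real variants), and the helper `read_variants` is my own name; it only handles the fixed columns of a simple single-sample file, so real work would use a dedicated library instead.

```python
import io

# A tiny, made-up VCF: header lines start with '#', then one row per variant.
# The positions, alleles, and sample name below are hypothetical.
EXAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1
chr2\t67890\t.\tCT\tC\t99\tPASS\t.\tGT\t1/1
"""

def read_variants(handle):
    """Yield one dict per variant row, skipping header and blank lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Eight fixed columns, then FORMAT, then one column for the sample.
        chrom, pos, _id, ref, alt, _qual, _filt, _info, _fmt, sample = line.split("\t")
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,   # allele in the reference genome
            "alt": alt,   # allele observed in this person
            "genotype": sample.split(":")[0],  # e.g. 0/1 = one copy of alt
        }

variants = list(read_variants(io.StringIO(EXAMPLE_VCF)))
for v in variants:
    print(f'{v["chrom"]}:{v["pos"]} {v["ref"]}>{v["alt"]} genotype {v["genotype"]}')
```

Each row says only "at this position the sample differs from the reference in this way." Nothing in the file says whether a difference matters medically, which is the interpretation step a laboratory report would supply.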
What I am imagining is some kind of opt-in feature, like when you get a driver's license they ask if you want to be an organ donor. In this case, it's a privacy agreement to share all your genetic and phenotypic data with the database.
It also sounds like we need to force doctors to input phenotypic data into Excel. I think this would have to be done through Medicare/Medicaid eligibility regulations.
Yes, it would be great to have real population-scale infrastructure in America that lets people quickly opt in to genotypic and phenotypic data gathering for both research and medical purposes (there are indeed clinically valid polygenic risk scores (PRS), but they are used only by a few experts at a few major hospitals).
I also don't want to make it sound like there isn't progress. Recently, there was this paper in Nature on the UKB (https://doi.org/10.1038/s41586-025-09272-9). This is whole-genome sequencing (WGS) on 490,640 participants (diverse ancestry but mostly British) with 764 ICD-10 codes (basically diagnoses) and 71 selected quantitative phenotypes (phenotype coverage was variable across the cohort though). There were some interesting findings and the work can generally be said to show that genome sequencing is superior to exome sequencing.
Thanks for writing always so accessibly, Stetson!
Of course, thank you! Glad to hear it was accessible!