Future of Identifying the Causes of Disease


I watched a documentary called "Toxic Puzzle" about a theory of the cause of neurodegenerative diseases such as ALS, Alzheimer's, and Parkinson's. The theory is that these diseases are caused, or partially caused, by BMAA, a chemical produced by cyanobacteria (blue-green algae). It is based on the observation that areas with more algal blooms have had a higher incidence of these diseases. According to the theory, the BMAA produced by the cyanobacteria can reach humans through drinking contaminated water, swimming in contaminated water, or even through the air, as is thought to be the case for the US soldiers of Operation Desert Storm. Supposedly the desert in Kuwait where the operation took place has a lot of BMAA in the air from dried cyanobacteria in the dirt, and some scientists think this could be why those US soldiers had double the rate of neurodegenerative disease.


There are a few questions this theory raises for me, both about this particular theory of the cause of a set of diseases, and about finding the causes of other diseases. Even with modern science we still don't know what causes a lot of the most common diseases that claim lives. From cancer and heart disease to the common cold, there are still many unknowns and expert disagreements as to their origins.



If one is to find the cause of a disease, a logical first step would be to look at regions with a higher than normal incidence of the disease. This is the approach taken by the scientists studying the link between neurodegenerative diseases and cyanobacteria. However, there is a fundamental problem with this approach. The incidence of disease per population will naturally vary across the world. This can be due to millions of factors that we simply do not understand, or it can be due to a highly discounted phenomenon at work across the universe: randomness.


Suppose one is given only one piece of information: the average rate of disease X across the world. The scenario in which the rate of disease X in every individual population is uniform and equal to that average is the MOST probable single scenario, but it is still NOT probable. There will be some populations that naturally have higher than average rates of disease X simply due to randomness and no other factors.
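Here is a rough sketch of that idea as a simulation. Everything in it is made up for illustration: the number of populations, the population size, and the 0.4% underlying rate are assumptions, not real epidemiological figures. Every simulated person has exactly the same chance of getting disease X, so any difference between populations comes from randomness alone.

```python
import random


def simulate_case_counts(num_populations, population_size, true_rate, seed=0):
    """Return the number of cases in each population when every person
    has the same independent chance `true_rate` of getting the disease."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < true_rate for _ in range(population_size))
        for _ in range(num_populations)
    ]


# Hypothetical numbers, chosen only to keep the example small.
TRUE_RATE = 0.004          # same underlying risk everywhere
POPULATION_SIZE = 5_000
NUM_POPULATIONS = 100

counts = simulate_case_counts(NUM_POPULATIONS, POPULATION_SIZE, TRUE_RATE)
rates = [count / POPULATION_SIZE for count in counts]

# How many populations land exactly on the expected number of cases?
expected_cases = round(TRUE_RATE * POPULATION_SIZE)
exactly_average = sum(1 for count in counts if count == expected_cases)

print(f"lowest observed rate:  {min(rates):.4f}")
print(f"highest observed rate: {max(rates):.4f}")
print(f"populations exactly at the average: {exactly_average} of {NUM_POPULATIONS}")
```

Even though nothing distinguishes these populations, some should come out looking like "hot spots" and others like unusually healthy places, and only a minority should sit exactly on the average.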


The question epidemiologists seek to answer is: is disease X dependent on certain risk factors, or is it just randomness? So for the theory above, they looked at the groups of people with the highest rates of ALS, then determined what else is common among these people. They settled on exposure to cyanobacteria and formed their theory. Of course, human populations and individuals are so diverse and complex that there are potentially millions of connections that can be made between two data sets. It is possible that the rate of ALS and the presence of cyanobacteria are both high in certain populations and have absolutely no causal connection. It could be due to complete randomness. Then, in a dubious practice called "P-hacking", scientists can bolster their theory by only looking at data that confirms it or supports their points. Often P-hacking is not done maliciously; it's an inadvertent result of confirmation bias.
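To illustrate how easily such chance connections turn up, here is a minimal sketch that compares one set of random "disease rates" against many random "risk factors" that have nothing to do with it. All of the data is pure noise, and the region and factor counts are hypothetical. The cutoff of 0.444 is approximately the correlation strength that would pass a conventional two-sided 5% significance test with 20 data points.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)

NUM_REGIONS = 20       # hypothetical number of regions studied
NUM_FACTORS = 1_000    # hypothetical number of candidate risk factors
R_THRESHOLD = 0.444    # approx. |r| needed for p < 0.05 with 20 points

# Disease rates and every candidate factor are independent random noise.
disease_rates = [rng.gauss(0, 1) for _ in range(NUM_REGIONS)]
correlations = [
    pearson_r(disease_rates, [rng.gauss(0, 1) for _ in range(NUM_REGIONS)])
    for _ in range(NUM_FACTORS)
]

# Factors that look "statistically significant" despite being unrelated.
false_leads = [r for r in correlations if abs(r) >= R_THRESHOLD]
print(f"{len(false_leads)} of {NUM_FACTORS} unrelated factors pass the cutoff")
```

By construction, around 5% of the unrelated factors would be expected to pass the cutoff. A researcher who reports only those, and not the hundreds that failed, ends up with a convincing-looking but meaningless result.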


So if it's extremely difficult for humans to identify patterns among millions of data sets that may or may not be related, how are we supposed to derive accurate connections between diseases and their causes? Even if the ALS and cyanobacteria theory is correct, why isn't the rate of disease even higher in populations with high levels of exposure to cyanobacteria? Is there a genetic connection as well? Is it a combination of 2, 3, 5, or 157 risk factors that determines whether a person gets neurodegenerative disease?
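To get a feel for how quickly those combinations multiply, here is a small sketch that counts the distinct groups of risk factors one could test. The pool of 1,000 candidate factors is an assumption picked for illustration, not a figure from any study.

```python
import math


def count_factor_combinations(num_factors, group_size):
    """Number of distinct groups of `group_size` risk factors that can be
    drawn from `num_factors` candidates (order does not matter)."""
    return math.comb(num_factors, group_size)


NUM_CANDIDATE_FACTORS = 1_000  # hypothetical pool of candidate risk factors

for group_size in (2, 3, 5):
    combos = count_factor_combinations(NUM_CANDIDATE_FACTORS, group_size)
    print(f"groups of {group_size}: {combos:,}")
```

Pairs alone already number in the hundreds of thousands, and each step up in group size multiplies the count enormously, which is why checking every possibility by hand is out of the question.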


Here are my thoughts on finding the causes of diseases:

The sheer number of possible connections relating to the causes of diseases is too great for humans to compute. There are a lot of new technologies emerging that can make getting these answers a reality. I believe our best chance at determining the causes of disease lies at the intersection of genomics and specialized artificial intelligence. If hundreds of thousands of people are exposed to the same environmental factors and some get cancer and some don't, then there must be a genetic and/or randomness component as well. Suppose we map the genomes of humans with and without certain diseases and store that information in a database along with thousands of other data sets relating to lifestyle and environmental factors. A powerful specialized AI system could then use that data to search, organize, form hypotheses, test those hypotheses against other data points, and generate a series of conclusions complete with probabilities and statistical significance. I believe this is our best chance at identifying the causes of complex diseases.
