
Genomics and proteomics: the emerging role of machine learning


Computing and communication technologies have already had a massive impact on today's healthcare system. Even so, there is great potential for these technologies to not only improve, but profoundly change healthcare in several ways.

Sensors and mobile devices have entered our everyday lives, collecting information about our environment and ourselves. For people who are ill, elderly or disabled, wireless sensors in the environment or attached to the body enable analysis of vital information, rapid detection of emergencies, and increased independence and personal freedom.

The amount of data collected through sensors, clinical tests and scientific experiments is vast. Most of the information hidden in this data cannot be seen by the human eye. Extracting it requires sophisticated statistics, and computers that apply these statistics and learn the specifics of a particular data set as they process it. This is what we call 'machine learning'.

My genes – my health?

The Human Genome Project resulted in the mapping of the human genome. One of the most surprising outcomes of this project was that the number of genes that carry information for each person is only about 25,000 – it is hard to believe that this relatively small number of genes can explain the complexity of a person's body and mind. And yet it is small differences in these genes, and in other regions of the genome that control how they are used, that create the many variations of mankind. Unfortunately, some of these genetic variations can lead to a person having a much higher risk of getting certain diseases.

“Common diseases such as cardiovascular disease, cancer, obesity, diabetes and psychiatric illnesses are caused by multiple genetic and environmental factors,” says John Winn from Microsoft Research in Cambridge (UK). “Understanding how these factors interact with each other would allow better prevention, diagnosis and treatment of these diseases. It would allow personalised treatment based on the genetic make-up of the patients. Almost all existing approaches for studying the genetic causes of disease look at the effects of a single gene, and thus cannot capture the subtle interactions of many genes and the environment.”

Microsoft Research Cambridge is collaborating with researchers at the Wellcome Trust Sanger Institute to investigate how genetic variation affects disease. Data is key: the genetic information from the International HapMap project and from experiments at the Sanger Institute drives the understanding of the biological mechanisms involved.

“We are integrating two data sources: on the one hand, the variation of the gene sequences and, on the other hand, gene expression data, which means how active the genes are in the cell,” says John Winn. In fact, recent results from the project show that more subtle relationships between gene variation and gene expression can be detected than was previously possible.
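To make the idea of relating gene variation to gene expression concrete, here is a minimal sketch of the simplest possible check: correlating one genetic variant with the activity of one gene across several people. It is a single-variant baseline, not the machine learning method used in the project, which aims to detect subtler relationships than this. The genotype coding (0, 1 or 2 copies of a variant), the function name `pearson` and all numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in one input, so no measurable association
    return cov / sqrt(var_x * var_y)


# Hypothetical data for six people (not taken from HapMap or the Sanger Institute):
# number of copies of one variant, and the measured activity of one gene.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

r = pearson(genotypes, expression)
print(f"correlation between variant and expression: {r:.2f}")
```

A value of `r` close to 1 or -1 suggests that the variant and the gene's activity move together. A test like this looks at one variant and one gene at a time, so it cannot capture interactions between many genes and the environment.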

John Winn is very enthusiastic about the value of this research: “Our goal in this joint project with the Sanger Institute is to analyse these multiple sources of genomic data using machine learning tools developed at Microsoft Research. By combining the expertise of the Sanger Institute and Microsoft Research Cambridge, we aim to gain new insights into genetic networks and the pathogenesis, diagnosis and treatment of human disease, whilst also driving the development of machine learning tools usable by the wider scientific community.”

The Case of the Bloody Fingerprints

The blood contains a very diverse mix of proteins – and this mix varies across the population. There are many individual differences; however, there are also protein patterns that correlate with specific diseases.

This is the basis for a new type of diagnosis. A new test would check for these protein patterns, often referred to as ‘fingerprints’, and could then detect diseases at a very early stage, long before symptoms appear. Detection this early would avoid the costs and negative effects of late-stage treatment and increase the likelihood of the patient's recovery.

In 2003, an article in Nature on such a test for early-stage cancer diagnosis drew much attention and we saw the rise of the age of ‘Clinical Proteomics’. However, the results at that point were unreliable and Nature commented “Running before we can walk?” The reason for the setback was that the specific fingerprints of diseases were still unknown. Identifying these fingerprints means comparing the common protein patterns of those who are affected by a disease and those who are not, and then extracting the difference.
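The comparison described above can be sketched in a few lines. The example assumes that each blood sample has already been reduced to a list of signal intensities at the same fixed positions (bins), and it ranks the bins by Welch's t-statistic, a standard way of measuring how strongly two groups differ. This is an illustrative stand-in, not the statistical technique developed by the research teams, and the names `welch_t` and `candidate_fingerprint` as well as all values are made up.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t-statistic for two groups of intensity values."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def candidate_fingerprint(cases, controls, top=1):
    """Return the indices of the bins where the two groups differ most."""
    scores = []
    for i in range(len(cases[0])):
        t = welch_t([s[i] for s in cases], [s[i] for s in controls])
        scores.append((abs(t), i))
    scores.sort(reverse=True)  # largest absolute difference first
    return [i for _, i in scores[:top]]


# Hypothetical intensities for four bins, three samples per group.
cases = [[1.0, 5.1, 2.0, 0.9], [1.1, 4.9, 2.1, 1.0], [0.9, 5.3, 1.9, 1.1]]
controls = [[1.0, 2.0, 2.0, 1.0], [1.2, 2.2, 2.1, 0.9], [0.8, 1.9, 1.9, 1.1]]

print(candidate_fingerprint(cases, controls, top=1))
```

In this toy data only the second bin (index 1) separates the two groups. With only a few samples, a ranking like this easily picks up chance differences, which is one reason a high number of patients matters.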

To accelerate research in identifying fingerprints in blood serum, mathematicians from Freie Universität Berlin and DFG Research Center MATHEON, medical scientists from University Hospital in Leipzig and computer scientists from Microsoft Research have joined forces.

“We think that reliable identification can be achieved along two dimensions: latest statistical techniques and a high number of patients,” says Christof Schuette from Freie Universität Berlin. Doctors collect health status data of thousands of patients, including the protein mix in blood serum through new and standardised methods. This creates lots of new data, and one could imagine the value of a worldwide repository that medical scientists would contribute to and statistical analysis would benefit from.

Prof Schuette and part of his team

The data on the protein mix is collected by highly complex machines: mass spectrometers that fire a laser at blood serum samples. The fingerprints that the researchers are looking for are overlaid and hidden by many tiny molecules, some of which are fragments that emerge when the laser is fired. These are often referred to as ‘noise’, but some of this ‘noise’ includes molecules that are worth detecting.

“The most important point is: we don't accept noise, we still search for information in that region which others reject as noise,” says Christof Schuette.

Reliably identifying such a pattern in research, and later detecting it in clinical tests, is not easy, but the results of this joint research project are very promising: the resolution at which patterns can be identified could be increased 40-fold. With this improved ‘magnifying glass’, researchers will continue looking for fingerprints that show whether a person is affected by a disease.

“We are planning to extend this approach to include even more data sources, such as genomic information, to create what we call a BioPrint,” adds Tim Conrad, a researcher in Schuette's group. “With this ‘multidimensional fingerprint’ it might become possible to detect even more types of diseases in early stages than possible with current state-of-the-art techniques.”
