Machines help scientists spot new patterns in Alzheimer’s, cancer and more


During the past several decades, a string of highly promising therapies to treat Alzheimer’s disease has failed in clinical trials. Thousands of patients and families, at UAB and other trial sites around the world, have seen their hopes dashed. Researchers, who watched years of work and carefully developed hypotheses crumble, were devastated, too. But Richard Kennedy, M.D., Ph.D., suspects the wreckage of these failed trials may yet hold hidden clues.

Richard Kennedy, M.D., Ph.D., is using machine-learning techniques to search for new Alzheimer's disease therapies among the 10,000 medications being taken by patients who participated in failed trials.

In 2017, with a multi-year, multimillion-dollar commitment from the National Institute on Aging and a database of some 8,000 Alzheimer’s drug trial participants going back 30 years, Kennedy began testing a novel idea. He knows exactly how quickly the disease progressed in each of those participants. And he knows which medications (other than the test therapy) they were taking — drugs for heart disease, high blood pressure, diabetes and on and on. “That’s 10,000 or so medications that could be associated with slower progression of Alzheimer’s disease,” said Kennedy, an associate professor in the Division of Gerontology, Geriatrics and Palliative Care. “It’s a huge natural experiment.” The study participants’ primary care doctors “put them on these medications — not for Alzheimer’s, but they were taking them when they came into the Alzheimer’s study,” Kennedy said. And because “you get these cognitive assessments all along the way,” it’s possible to see if any of those medications was associated with better assessment scores.

This is a pattern-matching problem on a vast scale — akin to finding matching stalks of hay in a colossal haystack. To do that, you would need to compare each stalk to every other stalk, across multiple dimensions. That makes it an ideal use case for machine learning — the subfield of artificial intelligence that has already revolutionized image search, fraud detection and many other applications. Machine learning is catching on with scientists, too. Between 2015 and 2018, the number of biomedical papers involving machine learning jumped from fewer than 1,000 per year to more than 8,000. Machine learning is at the core of Kennedy’s approach to Alzheimer’s and a second NIH R01 grant he received in 2019. In fact, Kennedy was one of three UAB researchers to receive significant NIH funding for machine-learning applications in clinical care this past year. Here, they discuss their work and the ways in which machine learning could revolutionize treatment for millions.

 

Finding the hay in a haystack

Just down the street from Kennedy’s lab, Brittany Lasseigne, Ph.D., is chasing another haystack problem. Lasseigne, a recipient of the NIH’s prestigious K99/R00 Pathway to Independence award, was recently recruited to UAB from the HudsonAlpha Institute for Biotechnology in Huntsville. She runs a hybrid lab; there is a gene sequencer in one corner, a high-powered computer in another and plenty of beakers, tubes and other typical wet-lab glassware in between. Her goal: to perfect “cell-free” sequencing tests that can track DNA and RNA shed from tumors and other dysfunctional cells — all in a single milliliter of patient blood. “Our question is, How can we learn something new about how a disease starts or how it progresses?” said Lasseigne, an assistant professor in the Department of Cell, Developmental and Integrative Biology. “Our inputs are genomic datasets — DNA, RNA or how they interact with proteins — and our outputs are biological signals or clinical outcomes. And we use machine learning to look across those datasets to find patterns. That lets us identify sets of biomarkers, so when we see those patterns in the future we know how a patient will respond.”

“When you have a set of records where you know the classification, such as whether someone became delirious or not, especially if it’s a very large dataset where there are many possible predictors, that’s an indication that machine learning may be a good choice.”

Meanwhile, one block away, Surya Bhatt, M.D., Ph.D., is looking for patient patterns of a different sort. A pulmonary physician specializing in chronic obstructive pulmonary disease (COPD), he is training a machine-learning model to pore over CT scans and recognize an often overlooked and untreated subset of patients. His work, which is funded by an R21 Exploratory/Developmental Research grant from the NIH’s National Institute of Biomedical Imaging and Bioengineering, “could lead to a paradigm shift in the field,” potentially improving treatment for millions, said Bhatt, an associate professor in the Division of Pulmonary, Allergy and Critical Care Medicine and director of the UAB Lung Imaging Core.

 

What is machine learning, anyway?

“There are a lot of debates about what’s machine learning and what’s not,” Lasseigne said. “If the computer is helping me with the equation, I call it machine learning.” It’s just that the help is a little different from what we’re accustomed to when using machines. “I think of it this way: A calculator takes a set of numbers — say 2 and 3 — and an operator — the ‘plus’ sign — and gives you 5,” Lasseigne said. “Machine learning turns that on its head. You give it the inputs 2 and 3 and the output 5 and ask the computer to help calculate it.”
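To make the analogy concrete, here is a minimal sketch in Python — written with the open-source scikit-learn library and made-up numbers, not code from Lasseigne’s lab — in which the computer is handed example inputs and outputs and left to recover the “plus” rule on its own:

# A toy version of the calculator analogy: the model is never told the "plus"
# operator; it is given example inputs and outputs and asked to find the rule.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 2))    # pairs of numbers, e.g. (2, 3)
y = X.sum(axis=1)                          # their outputs, e.g. 5

model = LinearRegression().fit(X, y)
print(model.predict([[2, 3]]))             # ~5.0 — the model has "learned" addition
print(model.coef_)                         # ~[1.0, 1.0] — the weights it discovered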

“There are a lot of debates about what’s machine learning and what’s not. If the computer is helping me with the equation, I call it machine learning.”

That, of course, is an equation you can solve in your head. But try this instead: Choose 20-30 words at random out of the 10,000 or so in an average electronic medical record, then look through more than 50,000 medical records to see whether those words have any correlation with each patient’s eventual delirium status during a hospital stay. Then do it again, and again, until you’ve covered the whole lot. That’s the subject of Kennedy’s latest grant, another multi-year, multimillion-dollar R01 from the National Institute on Aging built around machine learning.

 

Investing in computational power

“When you have a set of records where you know the classification, such as whether someone became delirious or not, especially if it’s a very large dataset where there are many possible predictors, that’s an indication that machine learning may be a good choice,” Kennedy said. “This type of discovery process is very difficult using traditional tools. With a desktop computer it would take years.” But a multimillion-dollar institutional investment in UAB Research Computing resources during the past several years means Kennedy now has the power to run this type of experiment on UAB’s 4,000-plus-core Cheaha supercomputer in a few hours or days. “Cheaha is a lot of what makes this possible,” Kennedy said. “We’re fortunate at UAB that we have a really good resource in our Research Computing department.”

Kennedy has worked closely with Research Computing staff such as William Monroe, who says that during the past year he has been partnering with a growing number of UAB researchers from the schools of Medicine, Engineering, Dentistry and more.

Kennedy also has tapped into the expertise of the UAB Informatics Institute, including John Osborne, Ph.D., who specializes in natural language processing (see Watching his words, below). “You have to be able to pick up on the different way things are phrased in a patient chart,” Kennedy said. A doctor might write “patient has altered mental status,” while a nurse’s comment might say a patient was “acting weird.” “That’s not a diagnosis of delirium, but it can be an indicator of it,” Kennedy said. And, by the way, the computer needs to understand that “patient has altered mental status” and “patient’s mental status is altered” refer to the same thing, while “patient’s mental status is not altered” is the opposite.
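A toy example hints at why that is hard. The short Python snippet below is purely illustrative — it is not Osborne’s or Kennedy’s software — but it shows how even a naive keyword matcher has to contend with both paraphrase and negation in chart notes:

# Toy illustration only: flag notes with a crude keyword search, then check
# for negation words. Real clinical NLP systems handle this far more robustly.
import re

notes = [
    "patient has altered mental status",
    "patient's mental status is altered",
    "patient's mental status is not altered",
    "family reports patient acting weird overnight",
]

def flags_possible_delirium(note):
    # Look for suggestive phrasings, then check for a negation word.
    suggestive = re.search(r"altered mental status|mental status is \w*\s*altered|acting weird", note)
    negated = re.search(r"\b(not|no|denies)\b", note)
    return bool(suggestive) and not negated

for note in notes:
    print(note, "->", flags_possible_delirium(note))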

 


 


UAB does compute

14.2 million — that’s the number of compute hours on UAB’s Cheaha supercomputer used by 1,134 university researchers from 23 academic units in fiscal year 2019, up from 3.6 million compute hours used by 621 researchers in 11 units in fiscal year 2017, according to figures from Research Computing.

Ralph Zottola, Ph.D., assistant vice president for Research Computing at UAB, shared those figures in a recent talk sponsored by the UAB Informatics Institute, and he noted that the surge in interest was due in part to the launch of Open OnDemand, a web-based interface to the Cheaha cluster that enables users to interact through any internet browser. Zottola also pointed out the continuous upgrades made to Cheaha, which now has 528 teraflops of processing power in its 3,744 CPUs. Research Computing also added $1.6 million worth of additional storage in the past year.

Another indication of chip-backed productivity: UAB researchers supported by the Cheaha resource published 2,331 peer-reviewed articles from FY2017 through FY2019.

 

What you can do at UAB

The university’s collaborative atmosphere was a key element in Lasseigne’s recruitment. “UAB is very team-science-oriented,” she said. “I can go have coffee with a clinician in the Cancer Center and talk about problems they’re seeing in patients — that is something you can only do at a major medical center.”

UAB also is willing to invest in innovative thinking, Bhatt said. His project grew out of his time in the Deep South Mentored Career Development scholars program in the UAB Center for Clinical and Translational Science. The program, also known by its NIH grant designation, KL2, is designed for junior faculty with a passion for translational research. Bhatt went on a CCTS-funded sabbatical to Auburn University to acquire new skills in cardiac MRI with a biomedical engineering faculty member there. In the process, he met a visiting professor from Georgia Tech who was an expert on fluid mechanics in the heart. “That was the genesis for this idea to study fluid mechanics related to major airway collapse in COPD,” Bhatt said.

Learn more about how Bhatt, Lasseigne and Kennedy are tackling their problems with computing power:

 

 

1. Finding confused patients before it’s too late

More than 50,000 patients age 65 and older have received delirium screenings at UAB Hospital as part of the innovative Virtual Acute Care for Elders Unit program. With a machine-learning algorithm, Richard Kennedy is mining these records to find patterns that could allow doctors to predict which patients are at highest risk for delirium.

The project: Automating Delirium Identification and Risk Prediction in Electronic Health Records, launched in spring 2019 with a $403,449 grant from the National Institute on Aging. Principal investigator: Richard Kennedy, M.D., Ph.D. Investigators: Kellie Flood, M.D., John Osborne, Ph.D.

The problem: Delirium, marked by confused thinking and a lack of environmental awareness, is extremely common in hospitals. According to a 2012 study, up to 25% of patients ages 65 and older who are hospitalized have delirium on admission, and another 30% will develop delirium during their hospital stays. But the true numbers may be far higher. “Delirium is not well recognized,” Kennedy said. “Research studies in hospitals found that 75% of the time, doctors were unaware their patients had delirium.” Identifying delirium is very important, he added. “It is associated with a higher risk of death and dementia.”

The data: For the past several years, every patient 65 and older admitted to UAB Hospital has received a delirium screening on admission and every 12 hours during their stay. It’s an idea borrowed from the hospital’s innovative Acute Care for Elders (ACE) Unit, which pioneers new approaches to improving care and quality of life for older patients. More than 50,000 patients have received screenings as a part of this Virtual ACE Unit program.

Kennedy is leveraging those 50,000 patient records, which give him an accurate picture of which patients did and didn’t receive a delirium diagnosis during their stay. Other researchers have attempted to isolate delirium cases by searching medical records for some two dozen keywords identified by experts. “They did terribly,” Kennedy said. “When they looked in the charts, they only found 10-11% of the cases had these keywords.”

Into the random forest: Kennedy is taking a different approach, using a machine-learning algorithm known as random forest (learn more about machine learning algorithms below). “It doesn’t go in looking for anything specific,” Kennedy said. “It takes a bunch of people who were diagnosed with delirium through a clinical interview, and it looks for patterns in the medical records, rather than testing hypotheses we already have.”

The average medical record contains 10,000 words. Kennedy’s algorithm picks out a tiny portion of that chart — 20-30 words — and checks to see whether they have any association with the patient’s eventual delirium diagnosis. “Then it picks another 20-30, then another 20-30, picking at random so eventually it covers all the words in the chart.”

Kennedy will repeat the procedure for all 50,000 medical records in his study. Then he plans to verify his findings in a new sample of patients admitted to the hospital.
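In spirit, the approach resembles the small Python sketch below, which uses the open-source scikit-learn library on a handful of invented chart notes — an illustration of the general technique, not Kennedy’s code or UAB data. Each note is reduced to word counts, and a random forest repeatedly tests small random handfuls of those words against the known delirium labels:

# Illustrative sketch only: synthetic chart notes and labels (1 = delirium
# found on screening), not real records from the Virtual ACE Unit program.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

notes = [
    "patient alert and oriented, no acute events overnight",
    "altered mental status, pulling at lines, disoriented to place",
    "acting weird per family, intermittently confused and agitated",
    "ambulating in halls, tolerating diet, mentation at baseline",
]
delirium = [0, 1, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(notes)     # one column per word appearing in any note
forest = RandomForestClassifier(
    n_estimators=500,   # thousands or millions of trees in a real run
    max_features=5,     # each split sees only a small random handful of words,
                        # standing in for the 20-30 drawn from a ~10,000-word chart
    random_state=0,
).fit(X, delirium)

# Words the forest found most associated with the delirium label.
ranked = sorted(zip(forest.feature_importances_, vectorizer.get_feature_names_out()), reverse=True)
print(ranked[:5])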

Automated alerts: “This is down the road, but we could have a delirium risk assessment built into the electronic health record so when a person comes in to the hospital it will look through their chart and see their risk,” Kennedy said. “Then it can update that prediction during their hospital stay based on lab values or diagnoses or words that start popping up in their chart.”

 

 

2. Bringing it all together in -omics

“UAB is very team science-oriented,” said Brittany Lasseigne, Ph.D., whose lab uses machine learning to uncover insights about disease initiation and progression. “I can go have coffee with a clinician in the Cancer Center and talk about problems they’re seeing in patients — that is something you can only do at a major medical center.”

The project: Integrating Multidimensional Genomic Data to Discover Clinically Relevant Predictive Models, funded by a three-year, $747,000 R00 grant awarded in June 2019 from the National Human Genome Research Institute. Principal investigator: Brittany Lasseigne, Ph.D.

The problem: Genomic studies are producing a data deluge that grows exponentially with each passing year; then there are the additional avalanches generated by sequencing the transcriptome, proteome, microbiome and other -omic datasets. “Sequencers are much faster now,” Lasseigne said. “It used to take 10 days to run a batch of exomes, which only represent 3-5% of the genome. Now you can do several genomes in two days.” That’s great, but the trouble is, “you get all this data back,” Lasseigne explained. “Nobody knew what to do with it all 10 years ago and it’s worse now.

“We want to have the best picture we can have,” she continued. “Looking at only one [data type] is not the silver bullet. DNA is awesome, but outside of cancer there aren’t a lot of mutations to find. My money is on things like methylation that help us understand the tissue of origin and the broad changes we see in disease.”

The plan: “Part of the grant we’re working on is to identify other molecular patterns that are indicative of these broad changes,” Lasseigne said. “One of the reasons it’s so challenging to prevent, diagnose and treat diseases like cancer and Alzheimer’s is because everyone essentially has their own version of the disease. We’re using machine learning to see if we can group together patients where we think a common therapy will help them. We employ computers to find patterns that are meaningful for the clinic.”

Doing more with less: Another area of research in Lasseigne’s lab is cell-free sequencing. “With many types of cancer, we can’t always take a biopsy or get tissue samples of the primary tumor and we certainly can’t get samples from all the metastases,” Lasseigne said. “But they are all shedding DNA and RNA in the blood. We can take a milliliter of patient blood or urine, extract the nucleic acids and do a lot of big data analysis.”

That analysis will include a focus on the minimum requirements for diagnosis. “I don’t think we need 50 million sequencing reads to tell if a patient has cancer,” Lasseigne said. “By using smart analysis techniques and big data, we could do a profile with a million reads instead. That keeps costs down and you would need smaller samples as well.”

Lasseigne is collaborating with Eddy Yang, M.D., Ph.D., professor and vice chair for translational sciences in the Department of Radiation Oncology, senior scientist at UAB’s O’Neal Comprehensive Cancer Center and associate director for Precision Oncology in the Hugh Kaul Precision Medicine Institute. “He has patients with advanced-stage kidney cancer, and there aren’t great treatments for that,” Lasseigne said. Yang’s team collects blood samples from these patients over time, and Lasseigne looks for changes in the epigenome — “chemical modification tags on DNA,” she said. “We sequence everything there that we can and look for patterns indicating whether or not a patient is progressing or whether they are responding to a drug.”

 

 

3. Whose airways will collapse? 

Surya Bhatt, M.D., is using deep learning to diagnose central airway collapse in patients with COPD. “If this is successful, it could lead to a paradigm shift” in the field, he said.

The project: Deep Learning and Fluid Dynamics-Based Phenotyping of Expiratory Central Airway Collapse, funded by a $226,659 R21 grant awarded in September 2019 from the National Institute of Biomedical Imaging and Bioengineering. Principal investigator: Surya Bhatt, M.D. Investigators: Sandeep Bodduluri, Ph.D., Young-il Kim, Ph.D.

The problem: More than 16 million Americans have been diagnosed with chronic obstructive pulmonary disease (COPD), according to the Centers for Disease Control and Prevention. Millions more suffer from the disease, which is characterized by blocked airflow and breathing-related problems. Alabama has some of the highest rates of COPD in the United States.

For most people with COPD, the main site of blockage is in the small airways of the lungs. But for some 5% of patients, breathing problems may be caused by collapse of the large airways when they exhale. The distinction is crucial, because bronchodilators, the first-line treatment for COPD, are effective in expanding the small airways but do not help when the problem is collapsed large airways. “Unfortunately, the symptoms are almost exactly the same,” Bhatt said. “It’s very difficult to identify these patients in clinical practice. We estimate that there are 3.5 to 4 million people in the United States with this condition. It is significantly underappreciated.”

Currently, when physicians suspect large airway collapse, they can investigate with bronchoscopy or with two CT scans — one taken as the patient holds a full breath in and another after the patient has completely exhaled. The first method is invasive, and the second involves a substantial amount of radiation exposure to the patient.

Seeking a third way with deep learning: With his R21 exploratory grant, Bhatt hopes to find one. “We want to know, can we identify these patients with just a single CT scan, by training a machine to look at the geometry of airways and identify the probability that their airways will collapse when they breathe out,” he said.

The plan: Bhatt has a large database of CT images that includes individuals with large airway collapse.  “We’re going to feed the raw images into the computer and use deep learning to identify features that humans haven’t been able to pick up: subtle changes in branching patterns or geometry of the airways or differences in the wall,” Bhatt said. “That’s the hope.”

The work is done on Cheaha and also with an Nvidia computer in the lab, Bhatt said. Much of the programming is done by his trainee, Sandeep Bodduluri, Ph.D., an instructor in the division.

The endgame: “If this is successful, it could lead to a paradigm shift” in the field, Bhatt said. “Right now we don’t even identify most of these patients. If you have a high clinical suspicion with unexplained symptoms, this algorithm could give a probability score for someone to have this problem and then the physician could investigate further.”

 

 

Four machine-learning algorithms to know: Meet your matchers

Machine learning has a bewildering array of approaches to choose from. But “ultimately it all comes down to trees or regression — making a tree out of the data or fitting a line to the data,” explains Brittany Lasseigne, Ph.D. Here are some of the most popular algorithms used in the field and in UAB labs.

Random forest, a form of the decision tree, is one of the most popular machine-learning algorithms in current life sciences research, according to a recent paper. The idea behind all decision trees is to narrow down a range of options using some criteria: think of a flowchart or the game 20 Questions. Any one decision tree is likely to have poor predictive value. But if you group thousands or millions of trees — hence the forest — each created by randomly choosing elements, and average their answers, the estimate becomes much more accurate.
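A few lines of Python using the scikit-learn library show the effect on synthetic data — an illustrative sketch, not an example from the paper mentioned above: a lone tree is a mediocre predictor, while a forest of randomized trees, averaged together, usually does noticeably better.

# Synthetic comparison of one decision tree vs. a forest of randomized trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 50 candidate predictors, only 10 of which carry real signal.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

one_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

print("single decision tree:", one_tree.score(X_test, y_test))   # typically the weaker score
print("random forest:       ", forest.score(X_test, y_test))     # averaging many trees usually wins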

Lasso regression, a type of linear regression, takes its name from “least absolute shrinkage and selection operator.” It shrinks the weights of irrelevant features — often all the way down to zero — leaving behind only the most predictive elements of the dataset.
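Here, too, a short Python sketch with scikit-learn and synthetic numbers shows the shrinkage at work: 100 candidate features go in, and only the genuinely predictive ones come out with nonzero weights.

# Illustrative lasso example on synthetic data (not from any UAB study).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))                                   # 100 candidate features...
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)   # ...but only the first two matter

lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)           # indices of features not shrunk to zero
print("features kept:", kept)                # expect roughly [0, 1]
print("their weights:", lasso.coef_[kept])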

“Instead of having just one algorithm we use, we’ll try multiple algorithms and see which works best,” Lasseigne said. In a recent paper, “we did lasso regression, random forests and support vector machines,” she noted. “We reported on the lasso regression, but the performance was almost indistinguishable across models.”

Deep learning, a more complex method, may be the most discussed machine-learning approach because of its spectacular successes. Deep-learning models are improving language translation, automatic image identification (and image modification, including the so-called deepfakes) and reinforcement learning, such as the AlphaZero model that taught itself to play chess — and beat the best human-designed chess software — over millions of simulated games. Deep learning uses multiple (hence, deep) layers of learning algorithms, with each node of a layer trained to recognize one specific feature of its input and each layer wired into the next along a host of connections, reminiscent of how the brain is built from billions of neurons with trillions of connections between them. When dealing with photos, for example, each node in a deep-learning model will aim to pick up on a salient part of an image. The process involves a vast number of prediction attempts, with the results used to reshape the model’s connections: connections to nodes that make successful predictions are strengthened, while unsuccessful nodes are eventually cut out of the loop.
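The systems described above are enormous, but the layered idea itself can be sketched in a few lines of Python using scikit-learn’s small built-in neural network and its bundled handwritten-digit images — an illustration of the principle, not a production deep-learning system:

# Tiny multi-layer network on 8x8-pixel digit images (illustration only).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)                    # each image flattened to 64 numbers
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(128, 64),      # two hidden layers of nodes
                    max_iter=500, random_state=0)
net.fit(X_train, y_train)                              # repeated predict-and-adjust passes tune the connections
print("held-out accuracy:", net.score(X_test, y_test)) # typically well above 0.9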

The trouble with deep-learning approaches is they generally produce black boxes: the output may be accurate, but the methods are inscrutable and that complicates their use in science, where understanding mechanisms is often the whole point. “I’m a big fan of trying the simplest methods first” — such as random forest and lasso regression — “and then going to the more complex,” Lasseigne said. “They are much more interpretable than deep learning, where most of the methods are buried in the steps.”

Lasseigne’s lab is “getting into transfer learning,” she said. “If I have 10,000 images of a dog and feed them into a computer, it can learn to identify a dog. But if I only have 100 pictures of a cat, that’s much harder. Transfer learning starts on the 10,000 dog pictures and then finishes on cats, transposing between one group and the other. We’re starting to work on that with small, heterogeneous or rare cohorts. You’re never going to have a cohort of 10,000 patients to work on in these areas. But you can leverage big data collected on common diseases to give you insight into rare ones.”
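A rough sketch of that recipe, written in Python with the open-source PyTorch library on invented data — the “dogs” and “cats” here are just two synthetic tasks, and none of this is Lasseigne’s code: pretrain a small network on the big cohort, freeze the layers that learned general-purpose features, and retrain only the final layer on the small cohort.

# Hedged transfer-learning sketch on synthetic data (not real cohorts).
import torch
from torch import nn

def make_data(n, shift):
    # Synthetic stand-ins for a big, common-disease cohort and a small, rare one.
    X = torch.randn(n, 20) + shift
    y = (X[:, :5].sum(dim=1) > 5 * shift).long()
    return X, y

big_X, big_y = make_data(10_000, 0.0)      # the "10,000 dog pictures"
small_X, small_y = make_data(100, 0.5)     # the "100 cat pictures"

features = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
head = nn.Linear(16, 2)
model = nn.Sequential(features, head)

def train(params, X, y, steps=200):
    opt = torch.optim.Adam(params, lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()

train(model.parameters(), big_X, big_y)    # 1. pretrain everything on the big cohort
for p in features.parameters():            # 2. freeze the general-purpose feature layers
    p.requires_grad_(False)
head.reset_parameters()
train(head.parameters(), small_X, small_y) # 3. retrain only the last layer on the small cohort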

 

 

Watching his words

John Osborne, Ph.D., assistant professor in the Department of Medicine, created a natural language processing (NLP) algorithm to assist registrars in identifying reportable cancers in clinical text.

In order to comply with federal guidelines, hospitals such as UAB must regularly report on the number of cancer cases they treat and the subtypes involved. The process requires a lot of reading of patient charts and can be onerous for staff to keep up with. Several years ago, when he was still a doctoral student, John Osborne, Ph.D., created a natural language processing (NLP) algorithm to assist registrars in identifying reportable cancers in clinical text. NLP is a blend of linguistics and computer science focused on developing software that can understand the highly unstructured concepts and relationships contained in human writing, such as the free text fields in a patient’s electronic health record.

UAB Hospital “is still using that,” Osborne said proudly from his office at the UAB Informatics Institute, where he now is an assistant professor in the Department of Medicine. Similar concepts lie behind an Informatics Institute project called the Phenotype Detection Registry System (PheDRS). The system, which Osborne and colleagues described in a 2018 paper, is being used to enable UAB pulmonary nurses to identify and connect with patients who have COPD — even when they are in the hospital for a seemingly unrelated condition.

Natural language processing has made great leaps forward in recent years, Osborne said. Most of today’s best NLP algorithms — “things like BERT and ELMo and other weird-sounding tools,” Osborne said — are neural networks that use graphics processing units (GPUs) to accelerate the repetitive vector multiplication behind their calculations. Recent upgrades to UAB’s Cheaha supercomputer, which now boasts 72 Nvidia GPUs, have made a significant difference in his work, Osborne added.

“The research question I spend a lot of time on is information extraction — pulling contexts from text,” Osborne said. Often, these are medical records, as in his project with Richard Kennedy, M.D., Ph.D. (see main article). But other projects have sifted data from PDF documents and other intransigent files. Another project focuses on automatically identifying patients with cancer or other diseases whose molecular profiles, diagnoses or other criteria make them eligible for specific clinical trials. “Unlike most normal NLP researchers, my work is heavily clinically focused,” Osborne said. “The problem I’m usually solving is, ‘Find me this set of patients’ — using machine learning.”


 

 

Learn the basics of machine learning with hands-on videos from UAB’s Data Science Club


Interested in data science but don’t know where to begin? A new program from Research Computing offers a step-by-step intro to “one of the most in-demand skillsets today.”


Machine learning helps predict which cancer patients are most likely to enter the fog


Noha Sharafeldin, MBBCh, Ph.D., of the Institute for Cancer Outcomes and Survivorship, used UAB’s supercomputer to identify biomarkers linked with cognitive impairment in patients who received a blood or marrow transplant. She’s also testing a way to repair the damage.


DREAM Challenge to automate assessment of radiographic damage from rheumatoid arthritis


UAB is helping develop automated technology to assess joint damage from rheumatoid arthritis.


More storage, better data, new partners: Zottola’s people-first approach to research computing


UAB’s inaugural AVP for Research Computing explains how he went from lab Ph.D. to IT guru and charts the next moves to accelerate science through technology.
