Interview: 'Everyone, Whether a Historian or a Geologist, Should Learn Mathematics'

In conversation with Sanghamitra Bandyopadhyay on microRNA, machine learning, computational biology and open science.

Bengaluru: Sanghamitra Bandyopadhyay is the director of the Indian Statistical Institute (ISI). She won the Infosys Prize in 2017 in the ‘Engineering and Computer Science’ category for her work on algorithmic optimisation in biological data analysis. Her work involves pattern recognition using machine learning in large datasets. She has identified a genetic marker for breast cancer, determined the co-occurrence of HIV and cancers, and helped understand the significance of the brain’s white matter in Alzheimer’s disease. She is a recipient of the Shanti Swarup Bhatnagar Prize, again in the ‘engineering sciences’ category. Sandhya Ramesh caught up with Bandyopadhyay in Bengaluru for an interview for The Wire. It has been edited for clarity.

Sanghamitra Bandyopadhyay. Source: Infosys Science Foundation

SR: How did you get interested in biology when your training has been in physics and computer science?

SB: When I was in school, I was quite scared of biology. Later, when I studied computer science after a bachelor’s in physics, I was working on pattern recognition: finding patterns in large amounts of data. One day, a client company came to my team with a problem in biology, which we could work on easily as it was a straightforward application of the work we were doing in pattern recognition. Then, I got a student who had a biology background. He came to me and suggested that I should look at data on small RNA molecules called microRNAs.

These microRNAs are responsible for fine-tuning a cell’s regulation. The amounts of the various molecules in a cell are what decide whether the cell is healthy or diseased. These levels go up and down, and the number of molecules of each type generated in a cell is called its “expression”. MicroRNAs essentially fine-tune the level of expression of all the molecules in a cell. I don’t do the biology here. A lot of biologists publish data from their research, so I work on freely available data for the most part.

Then we tried to develop algorithms that can make sense of the data and recognise patterns in it. For example, if there is a small molecule, like a microRNA, it tends to target a certain other molecule and interfere with its working. Through computational methods, looking at this tiny RNA molecule and a target molecule, I work on predicting whether the former would meddle with the latter or not. These are fully computational predictions made by algorithms that extract features from both these molecules, study them, and learn when an interaction or interference would occur. Thus, we want to learn how to distinguish between such interacting pairs and non-interacting pairs in our data sets.
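As a rough illustration of what learning to separate interacting from non-interacting pairs can look like, here is a minimal sketch. It is not the method used in the research described in this interview: the single feature (how many bases of the microRNA "seed" pair with the target site), the threshold rule and the tiny labelled data set are all assumptions made for illustration, and the function names are hypothetical.

```python
# Toy illustration only. The feature, the threshold rule and the labels below
# are assumptions for this sketch, not the published method.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_matches(mirna: str, site: str) -> int:
    """Count complementary base pairs between microRNA positions 2-8
    and a target site, with the site read in the opposite direction."""
    rev = site[::-1]  # the two strands pair antiparallel
    return sum(
        1
        for i in range(1, 8)
        if i < len(mirna) and i < len(rev) and COMPLEMENT.get(mirna[i]) == rev[i]
    )


def fit_threshold(examples) -> int:
    """Pick the score cut-off that best separates the labelled pairs.

    examples: list of (mirna, site, interacts) tuples.
    Returns the smallest threshold t with the highest training accuracy,
    where a pair is predicted to interact when its score is >= t.
    """
    scored = [(seed_matches(m, s), label) for m, s, label in examples]
    best_t, best_acc = 0, -1.0
    for t in range(0, 9):
        acc = sum((score >= t) == label for score, label in scored) / len(scored)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def predict(mirna: str, site: str, threshold: int) -> bool:
    """Predict whether the microRNA would interfere with the target site."""
    return seed_matches(mirna, site) >= threshold


# A microRNA sequence and hand-made sites; the labels are invented for the example.
LET7 = "UGAGGUAGUAGGUUGUAUAGUU"
EXAMPLES = [
    (LET7, "CUACCUCA", True),
    (LET7, "AUACCUCA", True),
    (LET7, "CUACGGGA", False),
    (LET7, "GGGGGGGG", False),
]

if __name__ == "__main__":
    t = fit_threshold(EXAMPLES)
    print("learned threshold:", t)
    print("prediction for a new site:", predict(LET7, "CUACCUCG", t))
```

Real predictors extract many more features from both molecules and use far larger training sets; the point here is only the shape of the task: features in, a learned decision rule out.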

Our data sets are huge! Biological experiments generate very, very large data sets. Sifting through them and finding what we are looking for can only be done computationally, not manually. This is true for many other domains as well.

I’ve primarily been developing algorithms, and it so happens that I’ve been applying them to biological data. While it does require some understanding of biology and how these molecules work, my primary work is in making predictions and communicating my results to biologists. My algorithms can be applied anywhere, although biology does feel very real as it happens around us and in the cells of our bodies every day. So it’s great to use these sophisticated algorithms to gain insights into how our body is functioning.

SR: I’ve heard doctors say that these days we collect so much data 24×7, thanks to devices and trackers, that the future of biology and medicine lies in machine-learning more than anything else.

SB: Yes, exactly! There is so much to look into. Where should a biologist look and where should they not look? Here’s where computational predictions and mathematical models are handy.

In fact, it is said that drugs will no longer be discovered, they will be designed. But despite predictions and designing being done computationally, at the end you have to go to a biological laboratory to experiment. Predictions often go wrong and one must keep that in mind.

What we do on the computer isn’t the end. It’s just the beginning.

SR: Could you explain in layperson’s terms the work you did that resulted in identifying a breast cancer marker and what it means?

SB: Well, this was the end of our work. We had started with an algorithm that was looking at graphs – mathematical structures that had entities. You and I could be entities on this graph, for example, represented as nodes. Connections between us are called edges and could mean different things. If you made a phone call to me, we would be connected and the edge would mean that. No edge between two nodes means they haven’t interacted.

In our graph, these nodes were molecules and edges were interactions between two of them. It also had data on when a certain molecule controlled the expression of another. A molecule’s expression could be controlled by one or several other molecules. This is a big, complex network.
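The graph described here can be sketched in a few lines. This is a minimal illustration under assumed names: the class `RegulationGraph`, its methods and the molecule labels are hypothetical and do not come from the research itself.

```python
from collections import defaultdict


class RegulationGraph:
    """Directed graph: an edge regulator -> target means the regulator
    controls the expression of the target."""

    def __init__(self):
        self._controllers = defaultdict(set)  # target -> molecules controlling it
        self._targets = defaultdict(set)      # regulator -> molecules it controls

    def add_edge(self, regulator: str, target: str) -> None:
        self._controllers[target].add(regulator)
        self._targets[regulator].add(target)

    def controllers_of(self, molecule: str) -> list:
        """All molecules that control the expression of `molecule`."""
        return sorted(self._controllers.get(molecule, ()))

    def interacts(self, a: str, b: str) -> bool:
        """True if there is an edge between a and b in either direction."""
        return b in self._targets.get(a, ()) or a in self._targets.get(b, ())


# Hypothetical molecule names, used only to show the structure.
graph = RegulationGraph()
graph.add_edge("miR-A", "gene-X")
graph.add_edge("miR-B", "gene-X")
graph.add_edge("gene-X", "gene-Y")

if __name__ == "__main__":
    print(graph.controllers_of("gene-X"))
    print(graph.interacts("miR-A", "gene-Y"))
```

A node with several incoming edges, like `gene-X` here, corresponds to a molecule whose expression is controlled by more than one other molecule.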

We were trying to analyse this network with the information we had from well-established biological experiments on molecular control. However, this existing data is not sufficient for making inferences. So we plugged in high-confidence predictions we had that hadn’t yet been validated at the time. Then we created new algorithms to analyse this network.

We noticed one specific microRNA whose expression is dramatically altered between a healthy person and a diseased one. A diseased person had a much higher level of it. It indicated that this molecule played an important role in the development of the disease. This is the marker that we identified.
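One common way to quantify "dramatically altered" expression is a log fold change between the two groups. The sketch below is an assumption about how such a comparison could be done, not a description of the analysis behind this finding; the molecule names and expression values are invented for illustration.

```python
import math
from statistics import mean


def log2_fold_change(healthy, diseased, pseudocount: float = 1.0) -> float:
    """log2 ratio of mean expression in diseased vs healthy samples.
    The pseudocount avoids division by zero for unexpressed molecules."""
    return math.log2((mean(diseased) + pseudocount) / (mean(healthy) + pseudocount))


def rank_candidates(expression: dict) -> list:
    """Order molecules by how strongly their expression differs
    between the groups, largest absolute change first."""
    changes = {
        name: log2_fold_change(healthy, diseased)
        for name, (healthy, diseased) in expression.items()
    }
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented values: (healthy samples, diseased samples) per molecule.
TOY_EXPRESSION = {
    "miR-A": ([3, 3, 3], [15, 15, 15]),
    "miR-B": ([7, 7], [7, 7]),
    "miR-C": ([7, 7], [3, 3]),
}

if __name__ == "__main__":
    for name, change in rank_candidates(TOY_EXPRESSION):
        print(f"{name}: {change:+.2f}")
```

A large positive value, as for the hypothetical `miR-A`, flags a molecule that is far more abundant in diseased samples and is therefore a candidate marker worth validating in the lab.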

This isn’t the only marker. Several markers implicated in certain diseases are well known and can be found in the experimental literature. Markers are important for understanding and curing diseases. Targeting these markers through designer drugs can inhibit and interfere with markers wreaking havoc in the cell. Bacterial and viral infections already have good medicines. But viruses mutate and change themselves to be drug-resistant. Then we need new medicines that target different molecules, so we are in constant need of new markers. Identifying a marker and designing a drug is not the end of it.

SR: Have you been able to experimentally verify your findings?

SB: We aren’t equipped to do that as we don’t have the infrastructure or expertise to do biological validations. However, we have collaborators who work on the data we’ve published. Typically, publicly published findings like ours are picked up by scientists all over the world, who then choose to experiment and validate our results for us. This gives us confidence about our own work. Some of our markers have already been validated by people we’ve never met, but such validations are not very common.

The best solution is to have collaborations, but in India, enabling collaborations between computer scientists, biologists, chemists and physicists has proved to be difficult. This is something that needs to be worked out in our Indian culture because appreciating others’ work and solving problems together are necessary for scientific findings to progress.

SR: Collaborative processes in Indian academia are still nascent, right?

SB: Yes, true. And it is for us to change.

SR: Can your evolutionary algorithms be applied to fields other than disease research and microbiology?

SB: Yes, definitely. Some of our algorithms are good at clustering of data, or finding groups in data. Several domains that work with large data sets use clustering as a fundamental process. For example, in financial markets, they can be used to know which stocks behave similarly over a period of time. Our algorithms have been used in portfolio management as well.
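To make the stock example concrete, here is a minimal sketch that groups series whose movements are strongly correlated. It uses a simple correlation threshold with single-linkage merging as a stand-in, not the evolutionary clustering algorithms discussed in the interview; the series names, values and the 0.9 threshold are assumptions.

```python
import math
from statistics import mean


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def group_similar(series: dict, threshold: float = 0.9) -> list:
    """Put two series in the same group when their correlation is at least
    `threshold`; groups that share a member are merged."""
    names = list(series)
    group_of = {name: i for i, name in enumerate(names)}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(series[a], series[b]) >= threshold:
                old, new = group_of[b], group_of[a]
                for n in names:  # merge b's whole group into a's group
                    if group_of[n] == old:
                        group_of[n] = new
    groups = {}
    for n in names:
        groups.setdefault(group_of[n], []).append(n)
    return sorted(groups.values())


# Invented price series for four hypothetical stocks.
TOY_PRICES = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8.5],
    "C": [4, 3, 2, 1],
    "D": [8, 6, 4.5, 2],
}

if __name__ == "__main__":
    print(group_similar(TOY_PRICES))
```

Here the rising series end up in one group and the falling series in another, which is the kind of "behaves similarly over a period of time" structure a portfolio manager might look for.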

Our algorithms can also be used wherever optimisation is required. Chip design is an example. An easy application is a problem like the ‘travelling salesman’ problem, which optimises time and cost when a salesman has to visit several cities and come back to the starting point. This is a multi-objective problem: time and cost can’t be optimised simultaneously; when one goes down, the other goes up. Traditionally, people have attempted the primary approach of combining multiple objectives and optimising them together, but these problems can’t be answered by a single solution. They need a set of solutions involving trade-offs, like less cost and more time. Our algorithms optimise these trade-offs and provide the best possible outcome to such problems.
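The "set of solutions involving trade-offs" is usually called a Pareto front: the solutions that no other solution beats on every objective at once. The sketch below only filters a given list of candidates; it does not search for routes, and it is not the evolutionary algorithm described here. The (time, cost) values are invented.

```python
def dominates(a, b) -> bool:
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (all objectives are minimised)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(solutions: list) -> list:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Invented candidate routes as (time in hours, cost) pairs.
TOY_ROUTES = [(10, 500), (12, 300), (15, 200), (13, 400), (16, 250)]

if __name__ == "__main__":
    for time, cost in pareto_front(TOY_ROUTES):
        print(f"time={time}h cost={cost}")
```

In this toy set, (13, 400) is dropped because (12, 300) is both faster and cheaper, while the three remaining routes each trade time against cost and none is strictly better than the others.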

SR: How often do you get students from biology to work on computational problems compared to students from computer science?

SB: Quite a few, actually. There are many biology students these days who want to work on computational problems as nearly all fields today use software. Through this work, some biologists develop a deeper interest and want to work on the software itself. However, it is typically people from computer science and mathematics who get interested in biological problems. They need to learn some biology as well and can’t treat their work as working on just any other data.

Once again, collaborations are needed for interdisciplinary work like this. We have six institutes in and around Kolkata, funded by the Department of Biotechnology, working on systems biology: Indian Institute of Science Education and Research, Kolkata; National Institute of Biomedical Genomics; ISI – which performs the computation; Bose Institute; Indian Institute of Chemical Biology; and the Tata Medical Centre.

SR: What do you think is the future of machine learning in biology, both globally and in India?

SB: Oh, it’s big and will remain very important in the years to come because of the sheer complexity and volume of data being generated. There will be newer challenges as well. Medicine is a very empirical science where it’s been very difficult to create a mathematical model. But we’re understanding better and the pieces are coming together. Everything will require machine learning and making machines think like humans. Artificial neural networks that mimic human learning capabilities, quantum computing, DNA computing – all will have a bright future and newer areas will emerge.

SR: Where will it pick up in a big way in India, say over the next five years?

SB: Five years is too short a time for India. But in 20 years, machine learning will be routine. Today we get blood tests done when we have a disease, but then we will be getting our genome sequenced, which could provide a goldmine of data about you. Once this massive amount of data comes about, more computational interventions would be required because this data needs to be stored somewhere. And storing isn’t enough – the data needs to be indexed to enable quick searching and retrieving as well. So newer ways of optimal indexing will be born. Then there’s data compression, summarisation, data analysis – all will pick up. Then privacy will also start playing a big role because, just like biometrics, a genome is specific to an individual.

SR: So exponentially cascading growth for machine learning is to be expected.

SB: Yes, yes. Costs are coming down and awareness is increasing already.

SR: At what point in someone’s education should computational biology be introduced?

SB: Definitely at a masters level or above, not before that. Every student should first build up a core strength in one subject very well, whether it is biology or mathematics or computer science. Then at masters or PhD, they can step into other fields.

But I am of the opinion that biologists – and in fact everyone, whether they are a historian or a geologist – should learn mathematics and computing. I really wish that in the years to come, the powers that be ensure that till class four or five, children are taught only mathematics and languages, and nothing else!

SR: What do you think about open science and open data?

SB: That’s where the world is heading. All research done with taxpayer funds is open, and this is essentially how biology works already. A lot of biological data is available online free of cost, which helps researchers from countries like ours who cannot buy data. Same with software, too. The open source movement is prevalent, important and will continue. Healthcare especially can’t grow unless it’s global and open. But I’m curious to see how businesses will work around this.

SR: There has been a lot of investment in skilled research in India in recent times, especially in biology. Accompanying this are funds and grants like the Wellcome Trust/DBT. However, most academic positions are closed to people over the age of 35. What is your perspective on age limits in Indian academia, especially for women?

SB: Yes, this age limitation exists at the entry level. If you’d like to be an assistant professor, typically 35 is the limit, but things are changing now. In several institutes, this limit is being relaxed for women candidates by five years or so, which is required as a lot of academically productive time is lost for women. In academia and research, 30 to 45 is a highly productive age for excellent research, beyond which people tend to move to management and start setting the path for the next batch of youngsters. To some extent, this is prevalent, although it is often relaxed, like when older industry experts become professors upon entry. Relaxing these restrictions is gradually becoming the norm, at least in the top schools.

Sandhya Ramesh is a science writer focusing on astronomy and earth science.