How scientists at the Bioinformatics Platform are helping capture, integrate, visualize and understand biological data
Virtually every laboratory involved in the life sciences or health-related fields has had to find ways to cope with the “data deluge” that has swelled exponentially over the past few years. The amount of information routinely produced in experiments using today’s standard technologies is so vast that it can’t be understood – or even captured and examined – without dedicated support based on computational methods. Usually these approaches have to be custom-fit to particular scientific questions and experimental set-ups. The results give different views of life and are often difficult to compare and combine.
One response has been the growth of the number of groups and scientists specialized in bioinformatics. Individual laboratories have often retasked scientists to broaden their horizons by gaining expertise in data analysis and the development of algorithms, or added members with this type of expertise. Additionally, on the institutional level, the MDC and its Berlin Institute of Medical Systems Biology (BIMSB) have responded by creating a Scientific Platform for Bioinformatics, headed by Altuna Akalin.
The group spends 50 percent of its time providing services to other labs, mainly within BIMSB, and the other half of its efforts goes to research. “The two go hand in hand,” Altuna says. “Nothing is carved in stone; you have to do research just to keep up with ongoing developments. The process of customization itself could be a form of research, as you adapt methods that are available in some form to new questions, experimental procedures, and techniques.”
On the service side, in addition to developing tools and software for data analysis in a range of projects, the 11-member group provides regular training to groups and “walk-in clinics” for individuals. They have just conducted a workshop in computational genomics for 20 participants, and plan another soon that will cover the “Galaxy workbench” that they operate at the MDC (http://galaxy.mdc-berlin.net).
“Galaxy is a framework that permits users to do bioinformatics with minimal specialized knowledge of the field and no programming skills,” Altuna says. “Users simply link or upload their data to their Galaxy session and use a web browser-based interface to select the tools and parameters they want to run.”
An ongoing task has been to maintain the servers hosting applications developed by BIMSB scientists. “We maintain the machines for the circular RNA database from Nikolaus Rajewsky’s lab,” Altuna says. The group has also set up a number of other BIMSB-specific tools, software packages and databases, which run on their own servers, in addition to projects with other labs throughout the MDC.
For example, an ongoing project with the groups of Nikolaus and Uwe Ohler is to provide bioinformatics infrastructure regarding all sorts of data related to RNA molecules – a main focus of BIMSB: from data on transcription to the structure, binding partners, targets, and classification of RNAs. The group works on the development and maintenance of a public database called DoRiNA (http://dorina.mdc-berlin.de), whose focus is interactions between proteins, microRNAs and other molecules that have an effect on the regulation of RNAs once they have been transcribed.
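The kind of question such a database answers can be sketched in a few lines of Python. The function below is a hypothetical illustration, not DoRiNA’s actual interface: given a list of regulator binding sites on a transcript, it returns the proteins or microRNAs whose sites overlap a region of interest.

```python
def overlapping_regulators(binding_sites, query_start, query_end):
    """Return the names of regulators (proteins or microRNAs) whose
    binding sites overlap the half-open query interval [start, end)."""
    return sorted({regulator for regulator, start, end in binding_sites
                   if start < query_end and query_start < end})

# Toy binding sites on a transcript: (regulator, start, end)
sites = [("miR-124", 100, 122), ("PUM2", 150, 180), ("miR-7", 300, 322)]
print(overlapping_regulators(sites, 110, 160))  # → ['PUM2', 'miR-124']
```

A real resource of this kind indexes millions of such intervals across the genome, but the underlying query, finding overlaps between regions, is the same.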
The list goes on and on, and includes a “Help Desk” available to groups throughout the institute. A particular emphasis is placed on the development of tools to visualize data – basically, to render huge amounts of information in a form that allows a scientist to extract information and make sense of it all.
When should a lab approach the facility for help? “As early as possible,” Altuna says. “The best case is right from the beginning of a project, before any experiments have been performed. This allows us to help people design the experiments themselves in a way that will produce the most meaningful data. A problem we have often seen is that data that has already been collected has ‘batch effects’: systematic differences that arise when results produced under different conditions are combined and compared. In many cases, for example, experiments haven’t been randomized. Failing to mix samples from patients and controls in the same round of sequencing can introduce biases. You might not know whether the differences you find come from the biology of a disease or from some small difference or error in the handling of the samples between groups.”
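The randomization Altuna describes can be sketched in a few lines. This is a generic illustration of the principle, not the platform’s own procedure: it shuffles patients and controls independently, then deals them round-robin into sequencing batches, so that no batch contains only one group.

```python
import random

def stratified_batches(patients, controls, n_batches, seed=0):
    """Shuffle each group independently, then deal samples round-robin
    so every sequencing batch mixes patients and controls."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_batches)]
    for group in (patients, controls):
        shuffled = list(group)
        rng.shuffle(shuffled)  # random order within the group
        for i, sample in enumerate(shuffled):
            batches[i % n_batches].append(sample)
    return batches

patients = [f"patient_{i}" for i in range(1, 7)]
controls = [f"control_{i}" for i in range(1, 7)]
for i, batch in enumerate(stratified_batches(patients, controls, 3)):
    print(f"batch {i}: {batch}")
```

With six patients and six controls in three batches, every batch receives two of each, so a sequencing-run artifact affects both groups equally instead of masquerading as a disease signal.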
Quality assessment is equally crucial, he says: experiments that are not of high quality produce flawed data. When analytical tools are then applied, such problems may be detected – in some cases mistakes can be corrected analytically, but sometimes no data can be salvaged.
Computational genomics and epigenomics have interested Altuna throughout his career. Born in Turkey, he pursued undergraduate studies at Sabanci University in Istanbul. “My original intent was to study molecular biology, but I got enthusiastic about bioinformatics after talking to a professor there.” He received his PhD from the University of Bergen, in Norway, where he worked on gene regulation, then held postdoctoral positions at Cornell Medical College in New York and at the FMI in Switzerland.
The group includes three postdocs and three PhD students who are engaged in diverse biomedical research projects. A central interest lies in developing original computational models for genomics and epigenomics – relating particular gene variants and the epigenetic changes they undergo (such as DNA methylation) to biological processes and disease.
One project in this area, the focus of PhD students Katarzyna Wreczycka and Jonathan Ronen, concerns epigenetic changes found in cancer. While most cases of the disease exhibit features that are unique to individuals, patterns have emerged. Alongside the “usual suspects” – mutations in oncogenes known to disrupt essential biological processes – a number of tumors exhibit mutations in molecules involved in pathways that carry out epigenetic modifications of DNA. The students and their colleagues have further developed and maintained a popular software package called “methylKit” to approach the problem using statistical and machine-learning methods. They are particularly interested in more carefully characterizing the nervous-system cancers grouped under the term “neuroblastoma”, the most common cancer in infancy. In clinical practice the disease is generally classified into multiple risk categories: intermediate- and low-risk tumors can frequently be treated or in some cases disappear on their own; high-risk tumors, however, are aggressive, difficult to treat, and often relapse after therapy. The scientists hope that their work will lead to a better understanding of the differences between risk categories – and likely to the identification of more sub-types.
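The core comparison behind such an analysis can be illustrated with a minimal sketch. methylKit itself is an R package that applies proper statistical tests; the hypothetical Python function below only compares raw methylation fractions between a tumor and a normal sample, requiring a minimum read coverage at each CpG site before trusting the estimate.

```python
def methylation_level(methylated, covered):
    """Fraction of sequencing reads supporting methylation at a CpG site."""
    return methylated / covered

def differential_sites(tumor, normal, min_diff=0.25, min_cov=10):
    """Flag CpG sites whose methylation level differs by at least
    `min_diff` between the samples. `tumor` and `normal` map a site
    to (methylated_reads, total_reads). A crude stand-in for the
    statistical testing a package like methylKit performs."""
    hits = []
    for site in sorted(tumor.keys() & normal.keys()):
        t_meth, t_cov = tumor[site]
        n_meth, n_cov = normal[site]
        if t_cov < min_cov or n_cov < min_cov:
            continue  # too few reads to estimate methylation reliably
        diff = methylation_level(t_meth, t_cov) - methylation_level(n_meth, n_cov)
        if abs(diff) >= min_diff:
            hits.append((site, round(diff, 2)))
    return hits

tumor  = {"chr1:1050": (18, 20), "chr1:2300": (5, 20), "chr1:4100": (3, 5)}
normal = {"chr1:1050": (4, 20),  "chr1:2300": (6, 20), "chr1:4100": (0, 12)}
print(differential_sites(tumor, normal))  # → [('chr1:1050', 0.7)]
```

A real analysis replaces the simple threshold with a statistical test and multiple-testing correction, and scales to millions of CpG sites across whole genomes.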
Another student, Inga Patarcic, is working on data concerning “long-range” gene regulation: the processes by which a gene’s activity is controlled by sequences that may lie far away from it on a strand of DNA, or even on another chromosome altogether. Inga and other members of the group are trying to develop methods that integrate data about many types of gene regulation into a single database.
Katarzyna and scientist Vedran Franke are carrying out a parallel project to process, integrate, analyze and visualize many different types of data through the construction of a tool called “genomation”. This will permit information obtained from diverse experiments to be mapped onto regions of the genome to discover patterns related to complex biological processes and diseases. The effort aims to “translate computer science and statistics research to biological problems” in a way that helps researchers visualize relationships in extremely complex biological data of different types.
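To give a rough sense of what mapping data onto genomic regions means, the sketch below – a generic illustration, not genomation’s actual API, which belongs to an R package – averages a per-base signal track, such as read coverage from one experiment, over a set of annotated regions. Repeating this for many experiments over the same windows is one way to line up heterogeneous data for comparison.

```python
def region_means(signal, regions):
    """Average a per-base signal track over named genomic regions
    given as half-open (start, end) intervals."""
    return {name: sum(signal[start:end]) / (end - start)
            for name, (start, end) in regions.items()}

# Toy coverage track for a 12-base stretch of the genome
coverage = [0, 0, 4, 8, 8, 4, 0, 0, 2, 2, 2, 0]
regions = {"promoter": (2, 6), "enhancer": (8, 11)}
print(region_means(coverage, regions))  # → {'promoter': 6.0, 'enhancer': 2.0}
```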