Visualization to Support
Advanced Analysis of Genomic Data
Kumar Aman
Thesis submitted for the degree of
Master in Informatics: Technical and Scientific Applications
60 credits
Department of Informatics
Faculty of mathematics and natural sciences
UNIVERSITY OF OSLO
© 2018 Kumar Aman
Visualization to Support Advanced Analysis of Genomic Data http://www.duo.uio.no/
Printed: Reprosentralen, University of Oslo
Abstract
In the past few years, advances in genome sequencing technologies and their falling cost have produced dependable data sources for analysing human and animal genomes to find relationships between diseases and traits. In the field of bioinformatics, the availability of computer-ready data brings enormous possibilities. Annotation tracks, which are numerical representations of genomic data, allow bioinformaticians to evaluate and process these complex and extremely large datasets with different algorithms and to draw useful conclusions according to their needs.
With the growth of multiple ways of analysis and visualization in different settings, visual analytics has become one of the most popular approaches, giving users the freedom to interact with the data while visualizing it in real time.
This thesis presents ways to visualize genomic track data interactively, based on a number of analyses. These analyses help to provide insight into the input data. The large size of a genome can thereby be broken down into smaller, simpler pieces, so that interesting features can be observed more intuitively and useful results can be obtained.
A tool is made available through The Genomic Hyperbrowser, an open-source, web-based analysis platform developed and maintained by the Biomedical Informatics research group at the University of Oslo. The tool aims to provide an easier and more user-friendly way to explore given genomic data. Workflows, use cases and datasets related to the tool are provided.
Acknowledgement
This thesis is the result of several months of hard work and would not have been possible without the help and guidance of many people. First and foremost, I would like to thank my supervisors, Geir Kjetil Sandve and Diana Domanska, for their candid and helpful guidance, support and inspirational advice, from the first day until the last, whenever I was in need throughout the term of the thesis. I would also like to thank the HyperBrowser development team and my colleagues at the Biomedical Informatics group, who always helped me when I got stuck working with the new environment or technologies. I would also like to thank Sveinung Gundersen for his continued support with the development environment in the Hyperbrowser and for always resolving troublesome HB situations, be it version control or implementation issues. On a personal note, I am deeply grateful for the support from my family and friends. In particular, the friendships made at the Biomedical Informatics group and the Department of Informatics have made the past year unforgettable. A special thank you goes to my parents and friends, who were a source of support at all times.
Last but not least, I would also like to thank God for giving me the opportunity to work on this wonderful project.
Contents
Acknowledgement
I Introduction
1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Chapter Overview
2 Background
  2.1 Visual Analytics
    2.1.1 Visual Analytic Process
    2.1.2 Application of Visual Analytics
  2.2 Visualization
    2.2.1 Visualization Tasks
    2.2.2 Visualization Techniques
    2.2.3 Use of Multiple Visual Interface Resolutions
  2.3 Background from Biology
    2.3.1 Genome
    2.3.2 The global Human Reference Genome
    2.3.3 Annotation Tracks
  2.4 The Genomic Hyperbrowser
    2.4.1 Tool Prototypes in HB
    2.4.2 GSuite
    2.4.3 Statistics
    2.4.4 Visualize track elements relative to anchor regions tool
    2.4.5 Getting preprocessed tracks from the external files
  2.5 Programming Languages Used
    2.5.1 Python
    2.5.2 JavaScript
    2.5.3 FetchAPI
    2.5.4 HTML
    2.5.5 CSS
    2.5.6 R
  2.6 API
    2.6.1 HTTP Request
    2.6.2 HTTP Response
    2.6.3 Data Formats
  2.7 ChromeDev Tools
  2.8 Existing Genome Browsers
    2.8.1 The UCSC Genome Browser
    2.8.2 The Galaxy Track Browser
    2.8.3 Trackster
  2.9 Pilot Projects
II Project
3 Introduction to Analyze Data Tool
4 User Interface
5 Implementation
  5.1 Framework and Languages
    5.1.1 Genomic Hyperbrowser
    5.1.2 Python
    5.1.3 HTML, CSS, JavaScript
    5.1.4 Choice of Chart type for each visualization
III Conclusion
6 Results
  6.1 A Visual Analysis Tool in Hyperbrowser
    6.1.1 Main Purpose
    6.1.2 Use Case
7 Discussion
  7.1 Data Representation
    7.1.1 Data Formats
  7.2 Developed Tool
    7.2.1 Source Code
    7.2.2 Using the Hyperbrowser Framework
    7.2.3 Alternative Path for Development
  7.3 Using fetchAPI instead of XMLHttpRequest
  7.4 Design Principles
    7.4.1 Reproducibility
    7.4.2 Transparency
    7.4.3 Usability
  7.5 Weaknesses in the Implementations
    7.5.1 Lack of Automated tests
    7.5.2 Time and Space Complexity
    7.5.3 Visual Features can be improved
    7.5.4 Broken Implementations
    7.5.5 Inconsistent API response
8 Conclusion and Future Work
  8.1 Conclusion
  8.2 Future Work
A Source Code
B Creating a GSuite
  B.1 Creating a GSuite in Hyperbrowser
List of Figures
2.1 The visual analytics process
2.2 overview plus detail view
2.3 focus plus context view
2.4 zooming/panning
2.5 Double helical structure of DNA showing the 4 bases.
2.6 Figure showing all chromosomes in a human being
2.7 Generic figure explaining Tracks and bins as implemented in Hyperbrowser
2.8 Galaxy allows processing of large datasets using powerful infrastructure that the user never sees or directly interacts with.
2.9
2.10 Figure explaining the flow of information with API in a client-server model
2.11 Figure explaining the HTTP request-response cycle
2.12 output showing frequency of points present per bin (binsize=10 for better visibility here)
2.13 output showing frequency of segments present per bin
2.14 Result obtained with varying binsize
4.1 Locating Analyse data tool
4.2 Input screen of the tool
4.3 Input for Single track file
4.4 Input for two track files
4.5 Input for GSuite track file
4.6 Result screen for single track file scenario
4.7 Result screen for two track file scenario
4.8 Result screen for GSuite file scenario
4.9 Results for chr4 for Single track file scenario
4.10 Results for "chr7:4562464-7634525" for Single track file scenario
6.1 Input for Single track file
6.2 Overview of Single track file
6.3 Base pair count
6.4 Proportional Base pair coverage
6.5 Distribution of Length of segments in the genome
6.6 Distribution of distances between segments in the genome
6.7 Distribution of average lengths of segments in the genome
6.8 Selecting only chr12 using the text box
6.9 base pair count for chr12
6.10 proportional base pair count for chr12
6.11 distribution of distances between segments for chr12
6.12 updated results for the region "chr12:112000001-117000000"
6.13 input form for two track scenario
6.14 visualization for two track scenario
6.15 Input form for GSuite
6.16 Result button showing multiple track files available to be visualized by button click
7.1 2 API calls got error in first call.
B.1 Creating a GSuite
Part I
Introduction
Chapter 1
Introduction
The field of genetics has been a subject of interest for scientists and learners since the discovery of DNA in 1869. Since then, great technological advances have been made in studying the complex structure and function of the genome. In particular, the development of several biotechnological tools in the last few decades has had a very significant impact on the understanding of biology, and specifically human biology. This chapter gives a short introduction to the why and what of this thesis project. The background details needed to understand the project are also explained.
1.1 Motivation
Data analysis and visualization have become important means of gaining insight into large datasets, which matter in every sphere of life. In genomics, which deals with DNA data consisting of up to 3 billion base pairs, proper analysis and visualization can bring a significant level of understanding of concepts that are still hidden from the human intellect.
At the University of Oslo, we already have a functioning genome browser called the Hyperbrowser, or HB. It utilizes the computing power of the Abel supercomputer at UiO and has tools for storing data, computing various statistical analyses and creating visualizations as needed.
With large datasets, one problem is deciding the scale of visualization. At times, many important features in a visualization cannot be seen because of the limits of the screen and the human eye. Being able to adjust the visualization as needed can make a big difference when interpreting it.
While several genome browsers are available, such as the UCSC Genome Browser or the IGV genome browser, they might not return or visualize all the results a user requires. One example is the ability to see the overall structure of the different chromosomes in a genomic track while at the same time analyzing different perspectives of the dataset, such as the average length of segments per chromosome or the proportional base-pair counts per chromosome. Also, working with a collection of genomic datasets (called a GSuite in the context of HB) is currently not supported by these browsers.
With the Analyse Data tool, we want to add a feature to the Hyperbrowser which enables interaction with visualizations, also termed visual analytics. This tool, while using the features already available in the HB, will also enable the user to zoom in and out of the selected visualization ranges while the respective analysis results are updated in real time. The tool works for a single track, two tracks or a collection of tracks (GSuite) and uses suitable statistical analyses for each scenario. These features will enhance the capabilities of the Hyperbrowser, and further work in this direction can make it a better alternative to other available genome browsers.
1.2 Goals
With this thesis, we tried to achieve the following goals:
• The tool should incorporate visual analytics capabilities.
• The tool should be able to handle large as well as small datasets and should provide results efficiently and quickly.
• The tool should be able to analyze data from a BED file and produce multiple visualizations using various statistical analyses.
• The tool should be able to handle single as well as multiple datasets.
• The user should be able to select a smaller subset of the initial dataset and to select a different visualization range, as a way to zoom in and out interactively as needed.
• The tool should be intuitive for the relevant user group and should use good visual features, so that users can find what they need without any external help.
• The resulting tool should be extensible, so that future work can be started without much hassle.
1.3 Chapter Overview
Chapter 2
Background
This work is based on the implementation of a new tool in the Genomic Hyperbrowser environment to find better ways of visualizing genomic data.
The background of the work is based on current technologies used in genomic data visualization. This thesis tries to address the shortcomings of currently available tools and to find ways to make them more efficient. One idea is to take inspiration from the tools and methods currently used for different purposes of visual analytics, such as the UCSC Genome Browser, the Ensembl Genome Browser, etc.
2.1 Visual Analytics
With the growth of data in the age of digital technologies and the growing dependence on computing and storage media, numerous methods have been developed over the years for making better use of data. Approaches such as data mining and machine learning are increasingly used to analyze data quickly and efficiently.
Although these methods have evolved significantly, they still face challenges related to algorithm scalability, increasing data dimensionality and data heterogeneity. Moreover, these methods cannot be generalized to all analysis scenarios. Users often need to use their experience and knowledge to refine algorithms to meet these challenges. It is also common to find complex, interesting patterns without being able to interpret them in a meaningful and intuitive manner.
To deal with such issues, an approach has become popular in recent times which enables users to interact actively with visualizations by adjusting their settings. This is termed Visual Analytics. The first widely accepted work on Visual Analytics was presented by Thomas and Cook (Thomas and Cook 2006). According to Thomas and Cook, visual analytics is a multidisciplinary field that includes the following focus areas:
• analytical reasoning techniques that let users obtain deep insights that directly support assessment, planning, and decision making;
• visual representations and interaction techniques that exploit the human eye’s broad bandwidth pathway into the mind to let users see, explore, and understand large amounts of information simultaneously;
• data representations and transformations that convert all types of conflicting and dynamic data in ways that support visualization and analysis; and
• techniques to support production, presentation, and dissemination of analytical results to communicate information in the appropriate context to a variety of audiences.
2.1.1 Visual Analytic Process
The visual analytics process combines automatic and visual analysis methods through human interaction in order to gain knowledge from data. If the data sources are heterogeneous, they need to be integrated before visual or automatic analysis methods are applied.
This can be done by data cleaning, normalization, grouping, etc. In automated analysis, data mining techniques are used to generate models of the original data, which can be visualized for evaluation and refinement.
Figure 2.1: The visual analytics process
Source: http://www.vismaster.eu/book/chapter-2-visual-analytics/
2.1.2 Application of Visual Analytics
Visual Analytics has been growing into one of the primary tools in application areas where large information spaces are processed and analyzed. Some of the fields where Visual Analytics is particularly popular are astronomy, biology, medicine, physics and climate monitoring. A visual approach usually helps to interpret massive amounts of data and to gain insights into the dependencies of output variable(s) that would otherwise be harder to identify.
2.2 Visualization
Although many genome data analysis tasks can already be carried out with automated processes, some steps still depend on human judgement (or produce better results with it). A better visualization can amplify our capability to make sense of complex data, hence improving the efficiency of manual analysis. In many cases, an appropriate pictorial depiction of the data makes the solution obvious. The visual and automated approaches are particularly powerful when combined, so that users can seamlessly inspect and perform computations on their data, iteratively refining their analysis.
2.2.1 Visualization Tasks
Usually, users visualize collections of items, each with multiple attributes, irrespective of the data type of the items, and a basic search task is to select all items whose attributes satisfy a set of values. An example task would be to find all students in a class whose grade in Maths was higher than ’C’. Based on this concept, there are certain tasks (Shneiderman 1996) that can be performed to visualize the data in a better way, chosen by the user according to their requirements:
1. Overview
This task gives an overview of the entire collection of data. It can be considered as viewing objects on a larger scale, for example a whole chromosome or multiple chromosomes, before digging into specific regions.
2. Zoom
This task visualizes a specific region of interest at a higher resolution to get a better understanding of that region of the genome. It can also be considered as changing the input parameters to visualize a small area of the genome in a high-resolution setting. The functionality to go back to a previous zoom setting (zoom out) is also expected. Example: the zoom in and out actions in Google Maps.
3. Filter
Sometimes, not all of the information present in a display is relevant for analysis. A filtering technique that leaves out the information not required in a certain context helps to produce a less dense output with only the needed details highlighted.
4. Details-on-Demand
Once an item has been zoomed and filtered down to smaller units, it should be possible to get the known information about these units by clicking on or hovering over them. This can be accompanied by links/URLs to places (other web pages) where more information can be found.
5. Relate
The relationship between items can be made visible through parameters such as appearance. For example, the different regions of DNA which are found to be related to a certain function can be displayed in a single color.
6. History
The actions performed in the course of an analysis can be stored in a history to support undo, replay or storing the parameters used in a certain simulation. The Genomic Hyperbrowser provides an easy way of storing the history for every user.
7. Extraction
Once relevant information has been found, it is useful to be able to extract it, for example to a file that can be shared for further actions or analysis outside of the environment.
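The overview, zoom, filter and details-on-demand tasks above can be illustrated on a toy list of genomic segments. The following is a minimal sketch and not code from the Hyperbrowser; the Segment type, the function names and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical record for one genomic segment (end is exclusive)."""
    chrom: str
    start: int
    end: int
    name: str


def overview(segments: List[Segment]) -> Counter:
    """Overview: number of segments per chromosome."""
    return Counter(seg.chrom for seg in segments)


def zoom(segments: List[Segment], chrom: str, start: int, end: int) -> List[Segment]:
    """Zoom: keep only the segments that overlap chrom:start-end."""
    return [s for s in segments
            if s.chrom == chrom and s.start < end and s.end > start]


def filter_by_length(segments: List[Segment], min_length: int) -> List[Segment]:
    """Filter: drop segments shorter than min_length base pairs."""
    return [s for s in segments if s.end - s.start >= min_length]


def details(segment: Segment) -> str:
    """Details-on-demand: text that could be shown when hovering over a segment."""
    length = segment.end - segment.start
    return f"{segment.name} {segment.chrom}:{segment.start}-{segment.end} ({length} bp)"


# Made-up sample data.
segments = [
    Segment("chr1", 100, 250, "a"),
    Segment("chr1", 900, 1000, "b"),
    Segment("chr2", 50, 60, "c"),
]
zoomed = zoom(segments, "chr1", 0, 500)
```

Each task is a small, independent function over the same list, so the tasks can be chained, for example zooming first and then filtering the zoomed result.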
2.2.2 Visualization Techniques
In most computer applications, users need to interact with more information and interface components than can be viewed appropriately on a single screen. Traditionally, paging, scrolling, panning and windowing were used to deal with such constraints. However, these methods can cause a discontinuity between information displayed at different times and places. This places an additional burden on users, who must mentally assimilate the overall structure of the information space and their location within it, and must manipulate controls in order to navigate through it. As a result, alternative visualization techniques (Baudisch et al. 2002), which support features such as multiple views at varying levels of detail, are applied based on the tasks mentioned above. They have been found efficient in different contexts, and with better and faster technologies they are becoming more productive over time. They are:
1. Overview plus detail
This visualization is a multi-window arrangement. In this interface, one window is the overview, which always displays the entire scope.
The other window is the detail view, which shows a close-up of a specific portion of the scope. The overview usually contains a visual marker highlighting which portion of the overview is currently shown in the detail view. This marker helps to reduce the time required for reorienting when switching from detail to overview.
For example, in the figure below from the UCSC browser, the top part is the chromosome view, which is the entire scope (overview), whereas the bottom window (in white) is the close-up of the red-marked section of the chromosome (detail view).
Figure 2.2: overview plus detail view
Source: https://genome.ucsc.edu/
2. Focus plus context screens
Focus plus context techniques such as fisheye views allow users to view a selected region in high resolution together with its surroundings in a single view, without requiring a second window. However, fisheye views introduce distortion, which interferes with any task that requires precise judgement about scale, distance, direction or alignment. Focus plus context screens are similar to fisheye views in that they display content in the same window, but they do not suffer from this limitation, because the high-resolution display of the region of interest is embedded in a surrounding wall-size low-resolution display.
This method has an advantage over the overview plus detail (o+d) view because it avoids the effort of switching between views. A good example of a focus plus context (f+c) view is a magnifier, which shows a detailed view of the region of interest on top of the same result page, as shown in Figure 2.3 using the Windows magnifier.
Figure 2.3: focus plus context view
Source: https://genome.ucsc.edu/
Note: This is just for reference; the UCSC browser does not actually use an f+c view as a functionality in this case.
3. Zooming/panning
This strategy enables the user to use zooming and panning to display the required information sequentially. Maps can be considered as an example, where the user can click on the zoom in/zoom out button to see more or less detail in a stepwise manner. Panning, which is the horizontal sliding of a visual that is wider than the screen, helps to bring only the more relevant information onto the screen. The resulting display can be used for other tasks such as filtering, details-on-demand or extraction. In Figure 2.4, we can see the options available in the UCSC Genome Browser for zooming into a specific location and moving within the region using buttons.
Figure 2.4: zooming/panning
Source: https://genome.ucsc.edu/
2.2.3 Use of Multiple Visual Interface Resolutions
When working with a human genome, which represents a large dataset, it is quite common to encounter both highly dense and highly sparse data. Such regions lead to complex visuals, which need to change in real time so that the data can be analyzed properly at all scales.
Implementing multiple visual interface resolutions (Lam, Munzner, and Kincaid 2007) can be highly useful in such contexts. It enables the data to be visualized according to the data density in the selected region.
A lower-density region would be rendered with a low-resolution visual, but as soon as a certain data-density threshold for the region is crossed, a higher-resolution visual is triggered, so that the data is comfortable to analyze in both scenarios. An example would be using scatter plots for datasets of low density and switching the plot to a bar plot as soon as a dataset with a density above the threshold is chosen.
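The threshold-based switch between plot types described above can be sketched as a small function. This is only an illustration: the function name, the definition of density as elements per base pair and the default threshold value are assumptions made here, not values taken from the tool.

```python
def choose_chart_type(n_elements: int, region_length_bp: int,
                      density_threshold: float = 1e-4) -> str:
    """Return 'scatter' for sparse regions and 'bar' for dense ones.

    Density is measured as elements per base pair. The default threshold
    is an arbitrary illustrative value.
    """
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    density = n_elements / region_length_bp
    return "bar" if density > density_threshold else "scatter"


# 50 elements in 1 Mbp is sparse; 500 elements in the same region is dense.
sparse_choice = choose_chart_type(50, 1_000_000)
dense_choice = choose_chart_type(500, 1_000_000)
```

Re-evaluating such a function every time the user selects a new region would let the plot type follow the data density of the current selection.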
2.3 Background from Biology
2.3.1 Genome
The complete set of genetic information of an organism is referred to as its genome. This genetic information is usually stored in DNA (deoxyribonucleic acid). Every living being has a unique genome.
DNA is the source of hereditary information in humans and most other living organisms. A small number of viruses have RNA (ribonucleic acid) as their genetic material. The information stored in DNA is coded as four chemical bases, referred to as Adenine (A), Thymine (T), Guanine (G) and Cytosine (C), which are arranged sequentially. The order of this sequence determines the characteristics of the genetic information of an organism. The bases A and T, and G and C, pair together to form what is called a base pair. Each base is attached to a sugar molecule and a phosphate molecule. This arrangement is called a nucleotide. The nucleotides are arranged in two long strands that form a double helical structure, which was discovered by Watson and Crick in 1953 (Watson and Crick 1953).
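The base-pairing rule above (A with T, G with C) means that one strand determines the other. A minimal sketch, with a hypothetical function name and sample sequence:

```python
# Base-pairing rule: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(sequence: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


paired = complement_strand("ATGC")
```

Applying the function twice returns the original sequence, which reflects that the pairing is symmetric.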
Inside the nucleus of a cell, DNA is packed into thread-like structures called chromosomes. Human beings have 23 pairs of chromosomes.
Of these, 22 pairs are numbered from 1 to 22, and the last pair consists of the sex chromosomes, X and Y: females typically have two X chromosomes and males one X and one Y, which determines the biological sex of a person.
Figure 2.5: Double helical structure of DNA showing the 4 bases.
Figure 2.6: Figure showing all chromosomes in a human being
2.3.2 The global Human Reference Genome
The human reference genome is a benchmark against which all human assemblies are compared (“Initial sequencing and analysis of the human genome” 2001). One of the major challenges in sequencing eukaryotic chromosomes is their size: because of the limitations of available technologies, they cannot be processed from one end to the other in a single run. To overcome this, researchers commonly isolate genomic DNA from its biological sample and fragment the DNA into small pieces that can be sequenced individually. The sequenced fragments are called reads and are about 100 to 1000 nucleotides long, depending on the technology used. These reads are then assembled into progressively larger contiguous pieces, which finally result in full chromosome sequences. With the advancements in DNA sequencing technologies over the years, it has become possible to sequence a human genome quickly, cost-efficiently and with a high level of accuracy. The Genome Reference Consortium (GRC) is the organization responsible for managing and curating the human reference genome. The first draft of the human reference genome, produced by the Human Genome Project, was released in 2001; it was later improved and reported as nearly complete in 2004. Even though the reference genome was incomplete, it served as the basis for understanding the human body at the genetic level. The latest major human reference genome release, introduced in December 2013, is GRCh38, also referred to as hg38. However, much research still uses GRCh37, also referred to as hg19, which was released in 2009. The primary reason may be that not all databases have been updated yet.
Challenges with the Human Genome Project(HGP)
The quality of the assembly model has improved greatly since the first draft from the Human Genome Project in 2001, but there are still significant challenges (Consortium 2004) with the generated genome assemblies. Some of these are:
1. Certain regions of the DNA are more difficult to sequence than others.
For example, the centromeres of the chromosomes contain tightly packed DNA called heterochromatin, which is hard to sequence because of the high concentration of G and C in the region.
2. Genomic DNA contains many repetitive sequences, which complicates the process of assembling sequence reads.
3. The HGP DNA samples come from multiple people which makes the resulting genome a randomly mixed conglomerate which in some cases might make it impossible to represent a single sequence correctly.
2.3.3 Annotation Tracks
Annotation tracks, also referred to as genomic tracks or simply tracks, are collections of objects representing specific genomic features, such as genes, and contain information like base-pair locations. Tracks in the Hyperbrowser are implemented using ndarrays, which are stored in binary files on disk using memmap. Tracks are provided to the Hyperbrowser as files in a supported format such as BED, WIG or GTrack. The file is parsed by an internal parser and written to disk as ndarrays. The track file formats supported by the Hyperbrowser are GTrack, BED, WIG, bedGraph, GFF and FASTA.
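Storing an ndarray in a binary file through a memory map, as mentioned above, can be sketched with NumPy's memmap class. This is a minimal illustration and not the Hyperbrowser's own storage code; the file name, the int32 data type and the sample positions are assumptions.

```python
import os
import tempfile

import numpy as np

# Hypothetical start positions of track elements on one chromosome.
starts = np.array([100, 900, 4200], dtype=np.int32)

path = os.path.join(tempfile.mkdtemp(), "starts.int32")

# Write the array to a raw binary file through a writable memory map.
writer = np.memmap(path, dtype=np.int32, mode="w+", shape=starts.shape)
writer[:] = starts
writer.flush()
del writer  # release the writable map

# Re-open read-only; data is paged in from disk only when it is accessed.
reader = np.memmap(path, dtype=np.int32, mode="r", shape=starts.shape)
loaded = reader.tolist()
```

Because a raw memmap file carries no header, the data type and shape have to be known when the file is re-opened.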
A bin can be thought of as one of the equal-sized slices of a track. Tracks are divided into bins during computation. The length of a bin is determined by the user in the analysis specification and is expressed as the number of base pairs the bin covers. For each bin in a track, a statistic is computed, before the results from each bin are combined to yield a global result.
Figure 2.7: Generic figure explaining Tracks and bins as implemented in Hyperbrowser
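The division of a track into bins can be sketched for the simple case of counting points per bin. The function below is a hypothetical illustration using NumPy, not the Hyperbrowser's binning code; positions are assumed to be 0-based offsets within a single region.

```python
import numpy as np


def count_points_per_bin(positions, region_length, bin_size):
    """Count points in consecutive bins of bin_size base pairs.

    Bin i covers [i * bin_size, (i + 1) * bin_size); the last bin may be
    shorter when region_length is not a multiple of bin_size.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    bin_index = np.asarray(positions) // bin_size
    return np.bincount(bin_index, minlength=n_bins)


# Five made-up point positions in a 30 bp region, with 10 bp bins.
local_counts = count_points_per_bin([3, 7, 12, 25, 29],
                                    region_length=30, bin_size=10)
# Combining the per-bin results gives the global result.
global_count = int(local_counts.sum())
```

Changing bin_size changes only the granularity of the local results; the combined global count stays the same.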
2.4 The Genomic Hyperbrowser
The Genomic Hyperbrowser (Sandve et al. 2010), or Hyperbrowser for short, is an open-ended system for managing and analyzing collections of genome-wide datasets. It provides a web-based interface built on the Galaxy platform.
The Galaxy software framework (Afgan et al. 2016) is an open-source application designed to enable researchers without expertise in programming or informatics to perform computational analyses using easy-to-use tools. A user interacts with Galaxy by uploading data and using options to analyze it. Galaxy is built to interact with existing computational components without exposing them to the user.
Figure 2.8: Galaxy allows processing of large datasets using powerful infrastructure that the user never sees or directly interacts with.
Source: https://galaxyproject.org/tutorials/g101/
The Hyperbrowser identifies five different types of genomic datasets. Three of these are:
• points are features which can be located at specific base pairs.
• segments are features which are collections of base pairs extending over an area of the genome.
• functions represent values in datasets that are assigned to each base pair.
In addition to the above, there are valued versions of points and segments, which represent a point or a segment with an attached mark or expression value.
The results obtained from an analysis can be global, i.e., available for the whole genome, or local, representing a set of bins giving values for isolated parts along the genome.
The Hyperbrowser inherits Galaxy features such as History, which supports reproducible research, as the datasets and test runs can easily be shared with collaborators and external audiences (Afgan et al. 2016).
2.4.1 Tool Prototypes in HB
The Hyperbrowser provides an easy and fast way to create new tools by following specific code patterns, as described in the templates of the Hyperbrowser documentation or wiki (Hyperbrowser Wiki 2017). A Hyperbrowser tool usually takes a genomic dataset as input. The input fields in the web interface can be customized using the template code. These inputs can be considered parameters of the tool. The actual analysis is performed when the user clicks the execute button, which invokes the execute function in the tool class file. A developer can choose to implement all the functionality in the tool class or to reuse already implemented classes and methods, such as statistics, in their tool.
2.4.2 GSuite
GSuite (Simovski et al. 2017) is a simple text file representation of a collection of datasets. Each line in this text file contains the URL (Uniform Resource Locator) of one of the contained datasets. To support further relevant analyses, the format also allows metadata for the datasets, represented as tab-separated values, such as headers indicating whether a file is available locally or remotely.
A tool which uses a GSuite collection file iterates through the source GSuite, downloads each referenced file and replaces the URLs with the respective paths to the locally stored files.
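The iterate-and-replace step described above can be sketched on a deliberately simplified collection file. The layout assumed here (one dataset per line, URL first, followed by tab-separated metadata values, with '#' lines skipped) is a hypothetical stand-in and not the actual GSuite specification; the function names and example URLs are also made up, and the download itself is left out.

```python
import os
from typing import Dict, List
from urllib.parse import urlparse


def parse_collection(text: str) -> List[Dict[str, object]]:
    """Parse lines of 'url<TAB>meta1<TAB>meta2...'; skip blank and '#' lines."""
    tracks = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        url, *metadata = line.split("\t")
        tracks.append({"uri": url, "metadata": metadata})
    return tracks


def to_local(tracks, download_dir):
    """Replace each URL with the local path its file would be stored under.

    The actual download is omitted; only the path rewriting is shown.
    """
    localized = []
    for track in tracks:
        filename = os.path.basename(urlparse(track["uri"]).path)
        localized.append({**track, "uri": os.path.join(download_dir, filename)})
    return localized


example = (
    "# hypothetical collection\n"
    "http://example.org/data/genes.bed\tgenes\thg19\n"
    "http://example.org/data/peaks.bed\tpeaks\thg19\n"
)
tracks = parse_collection(example)
local_tracks = to_local(tracks, "downloads")
```

Keeping the metadata next to each rewritten path means that later analysis steps can still group or label the tracks after they have been stored locally.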
2.4.3 Statistics
A statistics is a module in Hyperbrowser which takes one or more genomic datasets called tracks as input and applies mathematical or statistical operation on it to provide respective result(s). When a statistic is run on one or multiple tracks, the data is first divided into smaller predefined and equal parts called bins. Bins never overlap for two chromosomes. Largest possible bin size is equal to the length of individual chromosomes (denoted by ’*’ in HB). The statistics in Hyperbrowser work on the principle of map and reduce. Which means applying the statistic to smaller bins first, called as local result, and then the list of local results is combined to get a global result. There are several statistics defined in the Hyperbrowser. Every statistic has a similar interface which is:
1. A main public class whose name ends with Stat.
2. A class in which the main computation is performed over each bin. The name of this class ends with Unsplittable.
3. An optional class for computing a combined global result. The name of this class ends with Splittable.
The Unsplittable class contains the functions:
(i) _createChildren() : to define data available in the stat for the track which is being used for analysis, and
(ii) _compute() : to compute analysis based on the binned data.
The Splittable class is optional, but it is required when a global result is needed from the statistic. It contains a function _combineResults() which can access all local results and combines (reduces) them into a global result.
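The map and reduce principle described above can be sketched independently of the Hyperbrowser classes. The example below is a minimal illustration only: the function names split_into_bins, compute_local and combine_results are hypothetical stand-ins for the roles of binning, _compute() and _combineResults(), and the point positions are made up.

```python
def split_into_bins(start, end, bin_size):
    """Split the half-open region [start, end) into consecutive bins."""
    return [(s, min(s + bin_size, end)) for s in range(start, end, bin_size)]


def compute_local(points, bin_region):
    """'Map' step: count the points that fall inside one bin."""
    bin_start, bin_end = bin_region
    return sum(1 for p in points if bin_start <= p < bin_end)


def combine_results(local_results):
    """'Reduce' step: merge the per-bin counts into one global count."""
    return sum(local_results)


points = [5, 12, 18, 40, 41, 77]        # toy point positions on one chromosome
bins = split_into_bins(0, 100, 25)      # four bins of 25 bp each
local = [compute_local(points, b) for b in bins]
global_count = combine_results(local)
```

Because each bin is processed on its own, the local results can be computed in any order before the final combination.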
class StartEndStat(MagicStatFactory):
    '''
    Returning starts and ends position of the track
    '''
    pass

class StartEndStatUnsplittable(Statistic):
    def _compute(self):
        # Track view of the raw data for the current bin
        tv = self._children[0].getResult()

        return {'Result': [(tv.startsAsNumpyArray()).tolist(), (tv.endsAsNumpyArray()).tolist()]}

    def _createChildren(self):
        self._addChild(RawDataStat(self._region, self._track, TrackFormatReq(dense=False)))
Listing 2.1: A code snippet of StartEndStat
Listing 2.1 shows an example of a statistic in the Hyperbrowser environment: the class definition for StartEndStat, which was written specifically for this thesis. The Unsplittable class defines the analysis and data representation for the local (per-bin) results. Sometimes Splittable classes are also used (for example in RawOverlapStat); these perform the global analysis by combining all the local results. In HB, the execute function of a tool is usually where the different statistics are called. This is done using AnalysisSpec(stat), which takes the statistic name as a parameter; the result is returned as a dictionary of numpy arrays.
Several statistics have been implemented in the Hyperbrowser for different types of analysis of genomic data. This thesis uses the statistics listed below.
1. CountStat: It returns the number of points in every bin of a track. The bin size can be set explicitly. The default bin size is '*', which treats a complete chromosome as one bin. In this thesis, a column chart is used to visualize this.
2. StartEndStat: This stat was added for the purpose of this thesis. It returns the start and end points of each segment in the genome. It is used to give an overview of how the points and segments are arranged in a chromosome, and it is visualized using x-range charts.
3. SegmentDistancesStat: It returns the distances between consecutive segments in a bin.
4. SegmentLengthsStat: It returns the length of each segment in the genome, separated into bins.
5. AvgSegLenStat: It gives the mean segment length in each bin.
6. SegmentCountStat: It provides the number of segments in each bin of the genome.
7. RawOverlapStat: This stat is executed for a pair of tracks, the first being the reference track and the second the query track. It provides results on the overlap between the two tracks.
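To make the segment-based statistics in the list above concrete, the sketch below computes comparable quantities for a toy track in plain Python. It assumes sorted, non-overlapping segments with half-open coordinates as in the BED format; the Hyperbrowser implementations may differ in detail, and the function names are hypothetical.

```python
def segment_lengths(starts, ends):
    """Length of every segment (compare SegmentLengthsStat)."""
    return [e - s for s, e in zip(starts, ends)]


def segment_distances(starts, ends):
    """Gap between the end of each segment and the start of the next
    (compare SegmentDistancesStat)."""
    return [starts[i + 1] - ends[i] for i in range(len(starts) - 1)]


def average_segment_length(starts, ends):
    """Mean segment length, or None for an empty bin (compare AvgSegLenStat)."""
    lengths = segment_lengths(starts, ends)
    return sum(lengths) / float(len(lengths)) if lengths else None


# Toy data: three segments inside one bin
starts = [10, 50, 120]
ends = [30, 90, 125]

lengths = segment_lengths(starts, ends)
distances = segment_distances(starts, ends)
mean_length = average_segment_length(starts, ends)
segment_count = len(starts)             # compare SegmentCountStat
```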
During this thesis, some pre-existing features and functions have been used from the Hyperbrowser. Below is a short description of these:
2.4.4 Visualize track elements relative to anchor regions tool
This tool is used to visualize all the points and segments of a genome. With it, data of any size can be used to get an overview of how the points are arranged in each chromosome of the genome. A run of this tool on a sample bed file looks like this:
Figure 2.9
As can be seen in the image above, this visualization does not give further details about the genome, but it gives an impression of how the points are located on a specific genome.
2.4.5 Getting preprocessed tracks from the external files
The method getPreProcessedTrackFromGalaxyTN() of the ExternalTrackManager class takes a Galaxy track name as input, pre-processes the data that the Galaxy track name refers to, and returns an external track name that refers to the pre-processed data and can be used as a normal track name.
ExternalTrackManager.getPreProcessedTrackFromGalaxyTN(
    kwd.get('genome'), kwd.get('dataset'),
    printErrors=False, printProgress=False)
2.5 Programming Languages Used
Several programming, scripting and markup languages have been used in this thesis. This section briefly describes these languages and the important libraries or features used from each.
2.5.1 Python
Python (Python Programming Language 2017) is a powerful high-level programming language that is easy to learn and understand. It has a simple and effective object-oriented approach and efficient data structures. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and application development on most platforms. Python files have the extension .py.
Python was first released in 1991, and the latest version available for download at the time of writing is 3.6.4. However, the Hyperbrowser is developed in Python 2.7, which is therefore used in this thesis.
Some important libraries of Python which have been used in this thesis are:
Numpy
Numpy (Description about Numpy 2017) is a fundamental package for scientific computing in Python. It provides a powerful N-dimensional array object and sophisticated functions.
2.5.2 JavaScript
JavaScript (Description about JavaScript 2017), or JS, is a high-level, lightweight, interpreted programming language with first-class functions. It is popularly known as the scripting language for web pages. It has good support for object-oriented and functional programming. JS scripts have the extension .js.
JavaScript first appeared in 1995 and has been through many changes since. The language specification is standardized as ECMAScript. A more recent edition of the standard is ECMAScript 2016. The standard used in the Hyperbrowser is ECMAScript 2015, also known as ES6. This is the most commonly found standard and is supported by almost all web browsers.
Highcharts
Highcharts (Description about Highcharts 2017) is a pure JavaScript library that adds interactive charting capability to web applications. A wide variety of charts is available in Highcharts, such as line charts, pie charts, area charts, and column and bar charts.
jQuery
jQuery (Description about JQuery 2017) is a popular library for JavaScript. It is lightweight and wraps many common tasks into methods which can be called directly from JavaScript code. It contains features for HTML/DOM manipulation, HTML event handling, effects and animations, AJAX etc.
2.5.3 FetchAPI
The Fetch API (Description about FetchAPI 2018) is a recent browser feature which provides an interface for making network requests to a resource identified by a URL. It provides a more flexible and powerful feature set than XMLHttpRequest or AJAX (Description about AJAX 2017), which are the other popular ways of getting data from a location. Fetch is based on promises, which avoids the callbacks common with XMLHttpRequest. The fetch() method takes one mandatory argument, the URL or path of the resource to be fetched. It returns a promise that resolves to the response to that request, whether the request was successful or not; the promise is rejected only when the request itself fails, for example because of a network error.
A basic fetch request looks like this:
fetch('http://example.com/data.json')
  .then(function(response) {
    return response.json();
  })
  .then(function(myJson) {
    console.log(myJson);
  });
The code above fetches the JSON data file from the specified location and prints it to the console. Fetch also supports cross-origin requests.
2.5.4 HTML
Hyper Text Markup Language (HTML) (Description about HTML 2017) is the standard markup language for creating web pages and web applications. HTML elements are the building blocks of any web page. It has a tag-based structure in which tags mark an element or a section of the page. HTML files usually end in .html or .htm.
HTML was first released in 1993. The latest version of HTML supported by all web browsers is HTML5.
2.5.5 CSS
Cascading Style Sheets, or CSS (Description about CSS 2017), is a style sheet language used for describing the presentation of elements in a markup language such as HTML. The initial release of CSS was in 1996. CSS files have the extension .css.
HTML, CSS and JavaScript together form the triad of cornerstone technologies for the World Wide Web. Almost everything that we see in a web page contains one or more of these three technologies.
In this thesis, HTML, CSS and JavaScript have been used predominantly for visualizations.
2.5.6 R
R (Description about R 2017) is a programming language and environment for statistical computing and graphics. It is a very flexible and extensible language. R first appeared around 1993 and is one of the popular tools for statistical analysis and data mining because of its extensibility through user-created packages.
2.6 API
An API, or Application Programming Interface, is a software intermediary that allows two applications to talk to each other. In a client-server model, an API is the part of the server which receives requests from clients and sends responses back to them. By using an API, a client such as a web browser can view and edit a website's data.
Figure 2.10: Figure explaining the flow of information with API in a client-server model
Source: https://i1.wp.com/www.robert-drummond.com/wp-content/uploads/2013/05/web20.png
2.6.1 HTTP Request
Communication in HTTP centers around a concept called the Request-Response Cycle.
An HTTP request made by a client is said to be valid if it has the following four components:
1. URI (Uniform Resource Identifier)
It tells the server which resource the client wants to interact with.
Figure 2.11: Figure explaining the HTTP request-response cycle
Source: https://restful.io/an-introduction-to-api-s-cee90581ca1b
2. Method
It tells the server what kind of action the client expects the server to take. There are 4 common methods:
• GET: Asks the server to retrieve information
• POST: Sends data to the server. In contrast to a GET request, any additional information is sent as part of the body rather than as part of the URI.
• PUT: It is usually a way to upload files to the server. Because of its security implications, most servers do not allow PUT requests.
• DELETE: It deletes a resource on the server but, like PUT, it is not commonly used.
3. Headers
The headers provide a list of meta-information about the current request.
4. Body
The request’s body is the data that the client is sending to the server.
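The four components listed above can be seen together in a small Python 3 sketch that builds, but does not send, a request object with the standard library. The URL and the payload are hypothetical and only serve as an illustration.

```python
import json
from urllib.request import Request

# 4. Body: the data sent to the server, here a JSON document encoded as bytes
payload = json.dumps({'genome': 'hg19', 'stat': 'CountStat'}).encode('utf-8')

request = Request(
    url='http://example.com/api/stats',             # 1. URI (hypothetical)
    method='POST',                                  # 2. Method
    headers={'Content-Type': 'application/json'},   # 3. Headers
    data=payload,                                   # 4. Body
)

# Inspect the request without sending it over the network
print(request.get_method())
print(request.full_url)
print(request.header_items())
print(request.data)
```

Passing the object to urllib.request.urlopen() would send it and start the request-response cycle.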
2.6.2 HTTP Response
It refers to the response sent by the server to the client. A response has the following components:
1. Status Code
The status code (HTTP Status Codes 2018) indicates the status of the response. The most common status codes encountered in this thesis are as follows:
• 200: OK (Success)
• 400: Bad Request (Client Error)
• 401: Unauthorized (Client Error)
• 403: Forbidden (Client Error)
• 404: Not Found (Client Error)
• 500: Internal Server Error (Server Error)
2. Headers
They contain metadata about the current response.
3. Body
It contains the data that is sent to the client in response to its request.
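The grouping of the status codes listed above into success, client error and server error follows from their numeric range. A minimal Python 3 sketch, using the standard http module for the reason phrases, could look as follows; the function name describe_status is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (reason phrase, category) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'Success'
    elif 400 <= code < 500:
        category = 'Client Error'
    elif 500 <= code < 600:
        category = 'Server Error'
    else:
        category = 'Other'
    return phrase, category


# The codes mentioned in the list above
for code in (200, 400, 401, 403, 404, 500):
    print(code, *describe_status(code))
```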
2.6.3 Data Formats
The most common formats for transferring information are JSON and XML. In this project, JSON is used as the transfer format. JSON, or JavaScript Object Notation, is a simple text format built from key-value pairs. Values can be nested, so complex structures can be represented while remaining easy to read.
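The nesting of key-value pairs can be illustrated with Python's standard json module. The structure below is made up for this example and is not the actual output format of the API developed in this thesis.

```python
import json

# A nested structure: keys map to strings, numbers, lists and further objects
result = {
    'stat': 'CountStat',
    'region': 'chr1',
    'bins': [
        {'start': 0, 'end': 1000, 'count': 4},
        {'start': 1000, 'end': 2000, 'count': 0},
    ],
}

text = json.dumps(result, sort_keys=True)   # Python object -> JSON text
restored = json.loads(text)                 # JSON text -> Python object
```

The round trip returns an equal structure, which is what makes JSON convenient for exchanging data between the Python back end and the JavaScript front end.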
2.7 Chrome DevTools
Chrome DevTools (Chrome Dev Tools, Google 2017) is a set of web development tools built into the Google Chrome browser. It offers an easy and quick way to diagnose and debug problems that occur during web page development. HTML, CSS and JS can be inspected and changed on the fly without changing the underlying code. It has several tabs which provide different functionality, such as:
• Elements: A developer can easily manipulate the HTML DOM structure and CSS styling of any web page using this tab. The changes are not persistent, so once the page is refreshed, the original settings are loaded again.
• Console: This tab can be used to log and interact with data using JavaScript, to check the flow of the code or to monitor outputs, variables etc. The console can also be used as a playground for JavaScript.
• Network: It can be used to optimize page load performance and debug request issues. The files or resources loaded during a web page load can also be monitored here.
• Sources: JavaScript files can be debugged in this tab and, if needed, it can be used as a code editor for local files as well.
• Performance: This tab helps to optimize the runtime performance of a web page.
• Memory: It is used to monitor memory usage and track down leaks.
• Application: This tab shows all the resources which are loaded, including databases, local and session storage, cookies, application cache, images, fonts and stylesheets.
• Security: It can be used to debug mixed content issues, certificate problems etc.
2.8 Existing Genome Browsers
There are multiple genome browsers available. Some of them are mentioned here:
2.8.1 The UCSC Genome Browser
It is one of the big players in genomic data visualization. The browser (Kent et al. 2002) represents annotations as a series of horizontal tracks laid over the genome. Every track can be viewed in different modes, such as dense or fully expanded, or can be hidden. The user can select a dense track to open it in full mode. Many scales are possible for the track display. The lowest scale is a single chromosome and the highest is the sequence of individual base pairs.
2.8.2 The Galaxy Track Browser
Visual analytics is the science of using interactive visualizations to support analytic reasoning. The Galaxy Track Browser (J. Goecks et al. 2011) overcomes some of the shortcomings of other genome browsers by applying this concept. One such shortcoming is that genome browsers and their analysis tools are not integrated, which makes it hard to change a parameter value of a tool and observe how the change affects the tool output in the browser. In the Galaxy Track Browser, this can be done repeatedly to tune a tool's parameters until the desired output is obtained, without leaving the browser.
The Galaxy Track Browser lets the user change a parameter's value and rerun the tool as often as needed. Moreover, this can be done interactively because the tool runs only on the subset of the data that is visible to the user. This is useful because users receive feedback while manipulating data in real time. It also provides a multi-resolution data model and, by building on the Galaxy framework, makes it easy to share visualizations and analysis results, all within a web browser.
2.8.3 Trackster
Trackster (Jeremy Goecks et al. 2012) is another visual analysis environment based on the Galaxy platform. It is targeted at analyzing subsets of next-generation sequencing data by enabling the user to try different analysis settings. All the outputs can then be visualized together interactively, making it easier to compare them and find the setting which works best. This also reduces the computational time by a large margin. It allows dynamic integration of tools which are incorporated in the Galaxy framework. The tight coupling of tool settings and visualization enables rapid exploration of the tool parameter space and dynamic data filtering.
2.9 Pilot Projects
Before beginning the thesis work, a set of pilot projects was undertaken to understand the basic functionality and implementation of HB. These small pilot projects were helpful for gaining know-how as a new HB developer. Several implementations from these pilot projects were later used in the main thesis work as well.
Note: The images used here are based on randomly generated data used as a sample bed file.
1. Bed Frequency Plotter Tool
The task here is to plot the frequency of points (from a point bed file) for the selected part of a chromosome. The bin size is calculated as the size of the analysis region divided by 100. The statistic used here is CountPointStat, which counts the points in every bin; the output is one hundred values, which are plotted as a simple line plot in R.
Figure 2.12: output showing frequency of points present per bin (bin- size=10 for better visibility here)
Input: A point bed File, start and end positions for the region of chromosome to look for.
Output: A simple line R-plot is generated which shows frequency of points across each bin within the specified range.
2. Bed Sequence Frequency Plotter Tool
The task is to extend the functionality from point bed files to plain bed files.
Figure 2.13: output showing frequency of segments present per bin
The bins here are calculated based on the start and end points of the segments taken as input. If the start and end points are undefined, the start and end of the chromosome are used as input values. The statistic used in this case is CountSegmentStat, which counts the number of segments per bin.
Input: It accepts a plain bed file and takes input from the user in different formats such as chr1:10-200, or chr3, or '*', where * denotes the whole genome.
Output: A simple line R-plot is generated which shows frequency of segments within the specified range.
3. Point Segment Frequency Plotter Tool
This is an extension of the second pilot. The implementation is based on multiple resolutions: if the bin size is less than or equal to 1000, an R plot with vertical lines of equal height is shown for every bin whose count is greater than zero. If the bin size is greater than 1000, the output is the same plot as in the second tool. The tool therefore has two possible outputs. In the first case, the plot image is stored and a link to that plot is generated on the output page.
(a) If binsize<=1000, Same length lines in R-plot
(b) If binsize>1000, R-plot for frequency of segments
Figure 2.14: Result obtained with varying binsize
Input: A plain bed file, and chromosome, start and end positions.
Output: R-plot with equal sized vertical lines if binsize less than 1000.
4. The fourth tool is an extension of the Point Segment Frequency Plotter Tool and shows the average length of the segments in the same bins. The statistic used here is AvgSegLenStat, which calculates the average segment length in each bin. The resulting values are used to generate an R plot.
Input: A plain bed file, and chromosome, start and end positions.
Output: R-plot with equal sized vertical lines if binsize less than 1000.
Figure 2.15: output showing frequency of segments present per bin
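The resolution switch described for the third pilot tool can be sketched as follows. The threshold of 1000 is taken from the text, and the variant "less than or equal to 1000" is assumed here; the function name values_to_plot and the sample counts are hypothetical, and the actual plotting in R is left out.

```python
RESOLUTION_THRESHOLD = 1000  # bin size at which the plot style changes


def values_to_plot(counts, binsize):
    """Return the y-values to plot for the given bin size.

    Small bins (<= threshold): one line of equal height per non-empty bin.
    Large bins: the number of segments per bin, as in the second pilot tool.
    """
    if binsize <= RESOLUTION_THRESHOLD:
        return [1 if c > 0 else 0 for c in counts]
    return list(counts)


counts = [0, 3, 1, 0, 7]                    # toy segment counts per bin
fine = values_to_plot(counts, 500)          # presence/absence lines
coarse = values_to_plot(counts, 5000)       # frequency plot
```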
Part II
Project
Chapter 3
Introduction to Analyze Data Tool
This thesis focuses on the development of the Analyze Data Tool. The tool is built within the HyperBrowser and is designed with an emphasis on visual analytics of genomic data. At its core is an API which was written for this Master's thesis. The API takes input data in the form of a track file (bed file) and performs different statistical analyses on it, after which the results are visualized with Highcharts to emphasize several characteristics of the data. Efforts have been made to make the visualizations interactive and intuitive. The following chapters describe the implementation of the tool.
Chapter 4
User Interface
This chapter provides a step-by-step walkthrough of the user interface of the Analyze Data Tool.
1. On opening an instance of the Hyperbrowser, the Analyse Data Tool can be found in the tool list in the left pane, as shown in Figure 4.1. The user needs to select a history in which the relevant bed or GSuite files have been uploaded.
Figure 4.1: Locating Analyse data tool
2. Next, click on the Analyze Data Tool to proceed to the input page. Here, the genome build and the bed file(s) needed for the analysis have to be selected according to the single-track, two-track or GSuite scenario, as shown in Figures 4.2 to 4.5. Then click on the Execute button. This reloads the page, and a new instance of the Analyse Data Tool can be seen running in the right-hand pane.
Figure 4.2: Input screen of the tool
Figure 4.3: Input for Single track file
Figure 4.4: Input for two track files
Figure 4.5: Input for GSuite track file
3. To see the results of the tool run, click the 'eye' icon in front of the running instance of the tool. This loads the results. By default, the initial visualizations are for the parameter '*', which means the whole genome.
Figure 4.6: Result screen for single track file scenario
Figure 4.7: Result screen for two track file scenario
Figure 4.8: Result screen for GSuite file scenario
4. Visualizations for smaller parts of the genome can be obtained as well. The required region is entered in the text box at the top of the screen. For example, to analyze only chr4 of this genome, the user writes 'chr4' in the text box and clicks the Submit button.
Figure 4.9: Results for chr4 for Single track file scenario
5. Analysis and visualization at an even finer level is also possible. Suppose a user wants to analyze the part of chr7 which extends from base pair 4562464 to 7634525. This is written in the HB format as "chr7:4562464-7634525". Clicking on Submit shows visualizations restricted to this range, and every chart is adjusted accordingly. However, certain visualizations always show the global context and remain the same.
Figure 4.10: Results for "chr7:4562464-7634525" for Single track file scenario
6. For a single track file, the user can choose any part of the genome by entering the chromosome and, optionally, coordinates within it, and obtain the corresponding visualizations repeatedly without executing another instance of the tool.
7. A special case arises when the input track file is large. In this case the first visualization, which represents the probable arrangement of base pairs over the genome, is replaced by a visualization from an existing HB tool to keep data loading in the web browser robust. The rest of the visualizations work as usual.
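The three kinds of region input used in the steps above ('*', 'chr4' and "chr7:4562464-7634525") can be told apart with a small parser. The sketch below is an illustration only and is not the parsing code of the Hyperbrowser; the function name parse_region and the chosen return format are assumptions.

```python
import re

# "chr4" or "chr7:4562464-7634525"; the coordinate part is optional
REGION_PATTERN = re.compile(r'^(chr\w+)(?::(\d+)-(\d+))?$')


def parse_region(text):
    """Parse a region string into (chromosome, start, end).

    None stands for 'not restricted', so '*' gives (None, None, None).
    """
    text = text.strip()
    if text == '*':
        return (None, None, None)
    match = REGION_PATTERN.match(text)
    if match is None:
        raise ValueError('Unrecognised region: %r' % text)
    chrom, start, end = match.groups()
    if start is None:
        return (chrom, None, None)
    start, end = int(start), int(end)
    if start >= end:
        raise ValueError('Region start must be smaller than end: %r' % text)
    return (chrom, start, end)


whole_genome = parse_region('*')
one_chromosome = parse_region('chr4')
sub_region = parse_region('chr7:4562464-7634525')
```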
Chapter 5
Implementation
The Analyse Data Tool has been developed, on both the front end and the back end, as a tool for the Hyperbrowser. This chapter presents the implementation choices, such as frameworks, programming and scripting languages, and data representations. Further, each code file created for this thesis is explained in terms of how control flows during execution of the tool.
1. Back end: code for data input and analysis -> primarily in Python
2. Front end: code for visualization -> primarily in JavaScript, also using HTML and CSS.
5.1 Framework and Languages
5.1.1 Genomic Hyperbrowser
The tool is available in the Hyperbrowser, which is developed primarily in the Python programming language (Python 2.7). Hence, Python was chosen as the back-end programming language for this tool.
5.1.2 Python
Python is a popular programming language within the scientific community and the field of bioinformatics. It is an open-source, interpreted, object-oriented and highly extensible high-level programming language. Being an interpreted language, it is slower than compiled, lower-level languages such as C and C++. However, extensive library support across domains and a huge user community make Python a very popular choice among data scientists and researchers.
Python Libraries
The functionality of the Python language can be greatly extended through the numerous libraries created and made available by its large user base. This is one of the reasons why Python is highly popular in scientific computing. In this thesis, the Numpy library is used in the implementation of several methods. The output of a statistical analysis in the Hyperbrowser is usually a dictionary of numpy arrays. Such output cannot be used directly as the JSON output that the APIs have to generate, so a data transformation is needed, which the numpy library facilitates.
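The transformation mentioned above can be sketched as a small helper that turns array values into plain lists before serialization. This is an assumption about how such a conversion could look, not the code used in the tool. To keep the example free of dependencies, array.array from the standard library stands in for a numpy array; both offer a tolist() method.

```python
import json
from array import array


def to_jsonable(value):
    """Recursively replace array-like values by plain Python lists."""
    if isinstance(value, dict):
        return dict((key, to_jsonable(item)) for key, item in value.items())
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    if hasattr(value, 'tolist'):        # numpy arrays expose tolist() as well
        return value.tolist()
    return value


# Stand-in for a statistic result: a dictionary holding two arrays
stat_result = {'Result': [array('l', [10, 50, 120]), array('l', [30, 90, 125])]}

payload = json.dumps(to_jsonable(stat_result))
```

Calling json.dumps(stat_result) directly raises a TypeError, which is the inconsistency the text refers to.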
5.1.3 HTML, CSS, JavaScript
A major part of this tool is visualization in a browser window, which is a web development task. The primary technologies for web development are HTML, CSS and JavaScript. As this tool needs customized visualizations, the default HTML features provided within the Hyperbrowser are not used. Instead, a custom, dynamic HTML page is generated, styled with CSS according to the user's choices. The scripts that extract data from the API output and plot the charts are written in JavaScript.
JavaScript Extensions and Libraries
JavaScript is also supported by many libraries which serve numerous purposes. Highcharts.js is a popular charting library which is used in this thesis for plotting the data. The Fetch API is a JavaScript interface for fetching resources from the web and is used for retrieving the data from the API outputs.
The code and code changes for the Analyse Data Tool are implemented in the following five files located in the Hyperbrowser:
1. buildapp.py - For API configuration (additions for APIs)
2. newAPI.py - For data processing in API
3. AnalyseDataTool.py - For Hyperbrowser interface and integration
4. usingAPIScripts.js - For visualization logic using Highcharts
5. usingAPIStyle.css - CSS styling of the tool page.
The basic flow of execution when a single track file is used as input is described below. Similar steps are followed when two tracks or multiple tracks (GSuite) are used as input.
1. Taking inputs for executing the tool
The entry point of the tool is the Hyperbrowser interface. Hence, the first file to be executed is AnalyseDataTool.py. This file contains the basic configuration code for the Analyse Data Tool, including the input options. It also contains the HTML and JavaScript code for the page that is generated in the Hyperbrowser's context to display the results. Three methods are defined for generating HTML:
(a) resultPrintGeneric() has the web page structure for visualisation in the single-track case.
(b) resultPrintOverlap() contains the HTML code for the two-track scenario.
(c) resultPrintGSuite() displays the visualisations when a GSuite input is taken.
The execute method in this file specifies the actions to be performed when the tool is executed. The execute() method currently serves the following purposes:
(a) Initialization of the variables for dataset, genome build, bin and reg values.
(b) A call to preprocessTrack(), which handles preprocessing of the input file if it is a bed file. If the input is a GSuite, the preprocessing has already been done implicitly when the GSuite was created in HB, so the method only extracts the required data, such as trackName and trackTitle, which are later used as identifiers for the individual tracks.
def preprocessTrack(genome, dataset):
    suffixForFile = extractFileSuffixFromDatasetInfo(dataset)

    if (suffixForFile == 'bed'):
        # A bed file has to be pre-processed before it can be used as a track
        trackName = ExternalTrackManager.getPreProcessedTrackFromGalaxyTN(
            genome, dataset, printErrors=False, printProgress=False)
        return trackName

    elif (suffixForFile == 'gsuite'):
        # A GSuite is already pre-processed; only collect names and titles
        trackName = []
        trackTitle = []
        gSuite = getGSuiteFromGalaxyTN(dataset)
        for i, iTrack in enumerate(gSuite.allTracks()):
            trackName.append(iTrack.trackName)
            trackTitle.append(iTrack.title)
        return trackName, trackTitle
(c) A conditional statement that checks whether the user wants to visualize the single-track or the two-track case.
(d) A conditional statement that checks whether the input bed file is large or small. The limit is currently set to 30000 lines of the track file, which keeps the web browser running smoothly when handling bigger input files. If the input file exceeds this limit, an instance of an existing Hyperbrowser tool, VisualizeTrackPresenceOnGenome, is executed to generate an image of the probable chromosome structures within the genome. This condition is currently implemented only for the single-track scenario. If the input file has fewer than 30000 lines, a plot with similar features is constructed using x-range charts from Highcharts.
(e) Calls to the methods that generate the web page for the single-track, two-track or multi-track scenario, depending on the inputs.
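The size check in step (d) can be sketched as follows. The limit of 30000 lines is taken from the text; how the tool actually counts lines (for instance, whether header lines are included) is not stated, so skipping blank and header lines is an assumption, and all names in the sketch are hypothetical.

```python
MAX_LINES_FOR_XRANGE = 30000  # limit quoted in the text


def count_data_lines(bed_text):
    """Count lines that hold track elements, skipping blanks and headers."""
    count = 0
    for line in bed_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(('#', 'track', 'browser')):
            continue
        count += 1
    return count


def choose_overview(bed_text, limit=MAX_LINES_FOR_XRANGE):
    """Pick the overview rendering for a single-track input."""
    if count_data_lines(bed_text) > limit:
        return 'static-image'   # image produced by an existing HB tool
    return 'xrange-chart'       # interactive Highcharts x-range chart


small_track = "track name=example\nchr1\t10\t20\nchr1\t30\t40\n"
overview = choose_overview(small_track)
```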
During the generation of the web pages in their respective methods, a set of variables is appended as options to the URL, which is then passed as an argument to the JavaScript method plotFinalChart(). The HTML-generating methods mentioned above call the JavaScript methods in this way.
plotFinalChart('newAPI?genome=' + genome + '&stat=' + stat + '&bin=' + bin +
    '&reg=' + reg + '&firstTrack=' + firstTrack, url_path, container);
The code excerpt above shows the URL construction for the single-track visualization scenario. When the tool is executed, an instance of it runs in the Hyperbrowser and can be tracked in the History pane. On clicking the view results button (with the picture of an eye), the appropriate web page is generated as an iframe in the Hyperbrowser's context, which in turn calls plotFinalChart(), defined in usingAPIScripts.js, for the single-track scenario. From there, a separate visualisation method is called for each statistical analysis.
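For illustration, the same query string can be built and decoded in Python 3 with the standard library, which shows what an API endpoint would receive. The parameter names genome, stat, bin, reg and firstTrack come from the excerpt above; the values, the helper name build_api_url and the use of percent-encoding are assumptions of this sketch, since the original JavaScript concatenates the values without encoding them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_api_url(genome, stat, bin_size, region, first_track):
    """Build the relative API URL with its query options."""
    params = [
        ('genome', genome),
        ('stat', stat),
        ('bin', bin_size),
        ('reg', region),
        ('firstTrack', first_track),
    ]
    # urlencode percent-encodes characters such as '*' and ':'
    return 'newAPI?' + urlencode(params)


url = build_api_url('hg19', 'StartEndStat', '*',
                    'chr7:4562464-7634525', 'myTrack')

# What the receiving side sees after decoding the query string
received = parse_qs(urlsplit(url).query)
```

Encoding the values avoids problems when a track name contains characters such as '&' or spaces.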
2. API calls and Visualization
When plotFinalChart() is called, one method for each statistic is executed from usingAPIScripts.js:
let plotFinalChart = function (par, path, container) {
  let url = acceptURL(par, path);
  allStats = [
    getStartEndStat(url, container),
    getProportionCountStat(url, container),
    getCountStat(url, container),
    getSegmentLengthsStat(url, container),
    getSegmentDistancesStat(url, container),
    getAvgSegLenStat(url, container)
  ];
};
Each of these methods contains a Fetch API call which requests the given URL and waits for the API response. The response is then passed on to the Highcharts visualization methods. The code excerpt below shows the Fetch call for StartEndStat.
function getStartEndStat(url, container) {
  if (url.includes('stat=StartEndStat')) {
    return myFetch(url)
      .then((response) => response.json())
      .then(function(data) {
        myData = data[0];
        labels = data[1];