Three Dimensional High Throughput GoMiner (HTGM3D)

Barry Zeeberg [aut, cre]

2025-05-27




Motivation

‘Three Dimensional High Throughput GoMiner (HTGM3D)’ is the last of the five CRAN packages that together comprise the GoMiner suite. The other four are ‘minimalistGODB’, ‘GoMiner’, ‘High Throughput GoMiner (HTGM)’, and ‘Two Dimensional High Throughput GoMiner (HTGM2D)’.

The relationships between the functions in this set of packages can be shown using the package ‘foodwebWrapper’.

The calling functions are designated along the column on the left.

The Gene Ontology (GO) Consortium https://geneontology.org/ organizes genes into hierarchical categories based on biological process (BP), molecular function (MF) and cellular component (CC, i.e., subcellular localization). Tools such as GoMiner (see Zeeberg, B.R., Feng, W., Wang, G. et al. (2003) doi:10.1186/gb-2003-4-4-r28) can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. Microarray studies are usually analyzed with BP (Figure 2), whereas proteomics researchers often prefer CC (Figure 3).

Figure 2. GoMiner BP heatmap for cluster52 gene list


Figure 3. GoMiner CC heatmap for cluster52 gene list

To capture the benefit of both of those ontologies simultaneously, I developed a two-dimensional version of High-Throughput GoMiner (HTGM2D). It generates a 2D heat map whose axes are any two of BP, MF, or CC; the value in each cell of the heat map reflects the Jaccard metric p-value for the number of genes in common for the corresponding pair of categories (Figure 4).

Figure 4. HTGM2D BP versus CC heatmap (containing clickable hyperlinks to reveal hidden gene lists) for cluster52 gene list

I had been under the impression that HTGM2D would be the final version. It seems to provide substantial additional insight into the biological relationships underlying a gene set, compared with the original GoMiner or HTGM.

More or less as a thought experiment, I was imagining how one would extend HTGM2D to a three-dimensional version. After all, there are three branches of the Gene Ontology, so it seemed like a natural idea. But the Jaccard metric, which I had used for HTGM2D, seemed to be inherently limited to two dimensions: I could not find any theoretical discussion or implementation that would apply to three dimensions. I also had trouble imagining how to present the (hypothetical) three-dimensional version graphically.

It seemed that in principle I could extend the Jaccard idea to three dimensions simply by counting how many genes are in common in any triple of categories, one drawn from each of the three ontology branches. It was not at all clear to me that there would ever be a substantial number of genes common to three categories, but I was pleasantly surprised to see that this did occur.
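
The counting idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used inside HTGM3D: the gene symbols, the category names, and the helper countTriple() are all hypothetical.

```r
# Hypothetical gene-to-category mappings, one list per ontology branch.
bp <- list(BP_a = c("g1", "g2", "g3", "g4"), BP_b = c("g5", "g6"))
mf <- list(MF_a = c("g2", "g3", "g4", "g7"))
cc <- list(CC_a = c("g3", "g4", "g8"), CC_b = c("g1", "g5"))

# Number of genes shared by one category from each branch (hypothetical helper).
countTriple <- function(x, y, z) length(Reduce(intersect, list(x, y, z)))

# Enumerate every (BP, MF, CC) triple and record its intersection size.
triples <- expand.grid(bp = names(bp), mf = names(mf), cc = names(cc),
                       stringsAsFactors = FALSE)
triples$lint <- mapply(
  function(b, m, k) countTriple(bp[[b]], mf[[m]], cc[[k]]),
  triples$bp, triples$mf, triples$cc,
  USE.NAMES = FALSE
)
print(triples)
```

In this toy example only the triple (BP_a, MF_a, CC_a) has a non-empty three-way intersection; most triples are empty, which is why it was not obvious in advance that real data would show substantial overlaps.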

HTGM3D

The HTGM3D study was invoked by

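# load the GOGOA3 database (this path is specific to the author's machine)
load("~/GODB_RDATA/GOGOA3.RData")
# write output files to a temporary directory
dir <- tempdir()
# cluster52 is the gene list used throughout this vignette
geneList <- cluster52
mat3d <- HTGM3Ddriver(dir, geneList, GOGOA3, nrand = 1)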

Columns ‘n1’,‘n2’,‘n3’ in Figure 5 show substantial numbers of genes mapping to individual categories in the three ontologies, and column ‘lint’ shows substantial numbers of genes in the three-way intersection. This was in stark contrast to randomized controls (not shown). In fact, the randomized results are so anemic that I decided to forgo a statistical analysis, and just take the real data at face value.

Figure 5. mat3d

The Fun Part: 3D Graphics

After rejecting many imaginative but impractical approaches to the graphics, I came upon the package ‘rgl’, which provides very nice three-dimensional scatter plots along with facilities for interacting with them. This turned out to provide a fairly complete solution.

The graphics program is invoked (Figure 6) by

interactWithGraph3D(l3$mat3d)
Figure 6. HTGM3D 3D graphic for cluster52 gene list

Upon invoking the program, the function plot3d() is called first, and the user can rotate the image as much as desired. A menu then appears in the R Console that lets the user either select a point interactively or specify a row number within mat3d (see Figure 5). Either way, annotations appear on the 3D graph (Figure 7), and a list of the common genes mapping to the categories of that triplet is displayed (not shown).
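
For readers unfamiliar with ‘rgl’, a bare plot3d() call on toy data looks roughly like the sketch below. The data frame and its values are hypothetical, and the sketch does not reproduce the menu or annotation logic of interactWithGraph3D(); it only shows the kind of scatter plot and text label involved.

```r
# Hypothetical triplet coordinates and colors (not real HTGM3D output).
pts <- data.frame(
  x = c(1, 1, 2, 3),
  y = c(1, 2, 1, 2),
  z = c(1, 1, 3, 2),
  col = c("red", "orange", "yellow", "yellow"),
  stringsAsFactors = FALSE
)

# Draw only when rgl is installed and the session is interactive.
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  # 3D scatter plot; the window can be rotated with the mouse.
  rgl::plot3d(pts$x, pts$y, pts$z, col = pts$col, type = "p", size = 8,
              xlab = "x", ylab = "y", zlab = "z")
  # Place a text label next to the first point.
  rgl::text3d(pts$x[1], pts$y[1], pts$z[1], texts = "row 1",
              adj = c(-0.5, 0.5))
}
```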

Figure 7. Annotated HTGM3D 3D graphic for cluster52 gene list

Figure 8. Information about the selected triplet

You can add one or two more annotations, but the graphic tends to become cluttered if you add too many.

You can subsequently overwrite the graphic using

interactWithGraph3D(l3$mat3d,newWindow=FALSE)

or retain the graphic and start working in a new window using

interactWithGraph3D(l3$mat3d,newWindow=TRUE)

Nitty Gritty

Each point represents one triplet of categories (one each from “molecular_function”, “cellular_component”, and “biological_process”). The triplet is characterized by the number of genes in the intersection of the three categories, which is given in column ‘lint’ in Figure 5. For instance, one of the bright red points corresponds to the high ‘lint’ value given in the first line of Figure 5. That is, the three categories of this triplet contain 42, 15, and 11 genes, of which 8 are in common.

Each triplet may be assigned any (x, y, z) in an entirely arbitrary manner, as long as no two triplets are assigned the same (x, y, z). This freedom is what allows us to choose an assignment that yields a meaningful visualization. The one thing that is not arbitrary is the fundamental principle that each triplet must remain associated with its proper ‘lint’ value.

Each triplet is assigned a Cartesian coordinate (x, y, z) in the following manner: all categories within, say, “molecular_function” are sorted by the number of triplets they are associated with (Figure 9). The highest-ranking category is given x = 1, the next x = 2, and so forth. That way, the triplets closer to (1, 1, 1) are the ‘heavy hitters’ that we might want to focus on. Figure 9 shows that GO:0005515, being associated with a whopping 38 triplets, is the heavy hitter within “molecular_function.”

The “molecular_function” category GO:0005515 will therefore have x = 1 in all of its triplets (see column ‘x’ in Figure 5). The coordinate x = 1 defines a plane within the graphic, so all triplets containing GO:0005515 lie within the x = 1 plane.
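
The ranking scheme just described can be sketched as follows. The toy table, its category labels, and the helper rankByFrequency() are hypothetical; the real mat3d is produced by HTGM3Ddriver() and may, for example, break ties differently.

```r
# Toy triplets: one category per ontology branch (labels are made up).
trip <- data.frame(
  mf = c("MF_a", "MF_a", "MF_a", "MF_b", "MF_c", "MF_b"),
  cc = c("CC_a", "CC_b", "CC_a", "CC_a", "CC_c", "CC_b"),
  bp = c("BP_a", "BP_a", "BP_b", "BP_c", "BP_a", "BP_b"),
  stringsAsFactors = FALSE
)

# Rank the categories of one branch by how many triplets contain them:
# the most frequent category gets coordinate 1, the next gets 2, and so on.
rankByFrequency <- function(categories) {
  freq <- sort(table(categories), decreasing = TRUE)
  coord <- seq_along(freq)
  names(coord) <- names(freq)
  unname(coord[categories])
}

trip$x <- rankByFrequency(trip$mf)
trip$y <- rankByFrequency(trip$cc)
trip$z <- rankByFrequency(trip$bp)

# Distinct triplets receive distinct (x, y, z), so no two points collide.
stopifnot(!anyDuplicated(trip[, c("x", "y", "z")]))
print(trip)
```

Every triplet containing the most frequent “mf” category (MF_a in this toy table) ends up with x = 1, which is the plane property described above.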

Figure 9. Frequency of Appearance of Each Category in Triplets

In addition to its position within the 3D graphic, each triplet is also associated with a color that reflects its ‘lint’ value (i.e., the number of genes in the intersection for that triplet). Bright red marks the highest values, orange the middle of the road, and yellow the low end (Figures 6, 7).
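
One way to obtain such a yellow-to-red scale in base R is colorRampPalette() from ‘grDevices’. This is a sketch only: the ‘lint’ values and the helper lintToColor() are hypothetical, and HTGM3D may compute its colors differently.

```r
# Hypothetical 'lint' values (genes in the three-way intersection).
lint <- c(8, 5, 3, 3, 6)

# Map each value to a color: low = yellow, middle = orange, high = red.
lintToColor <- function(v, n = 100) {
  pal <- grDevices::colorRampPalette(c("yellow", "orange", "red"))(n)
  rng <- range(v)
  # If all values are equal there is nothing to scale; use the top color.
  if (rng[1] == rng[2]) return(rep(pal[n], length(v)))
  # Linearly rescale v onto palette indices 1..n.
  idx <- 1 + round((v - rng[1]) / (rng[2] - rng[1]) * (n - 1))
  pal[idx]
}

cols <- lintToColor(lint)
print(data.frame(lint = lint, color = cols))
```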

Exercise for the Student

Run the example for interactWithGraph3D(). Click on any point to bring up its annotation. You can rotate the graphic if needed, and you should be able to see that the point is in fact at the position given by the annotation, which will look like Figure 8. The annotations for the position and category names should match those given in the Safari rendering of mat3d (like Figure 5).