Baseline Detection of Potential Cancer Biomarkers with Linear Models from Microarray Experiments || Rivera Publications || Cancer Studies

helpdesk@riverapublications.com

Home / Journal / Cancer Studies

Baseline Detection of Potential Cancer Biomarkers with Linear Models from Microarray Experiments

cancer biomarkers microarray experiments

Eulisa M. Rivera, Zahira I. Irizarry, Matilde L. Sánchez-Peña, M. Cabrera-Ríos and C.E. Isaza

DOI: https://doi.org/10.31532/CancerStud.1.1.004 31 Dec 2017

Download

View Peer Review History

Article

Abstract

High throughput biological experiments such as DNA microarrays are very powerful tools to understand and characterize multiple illnesses. These types of experiments, however, have also been described as large, complex, expensive, and hard to analyze. For these reasons, analyses with linear assumptions are frequently bypassed for more sophisticated procedures with higher complexity from the onset. In this work, a search procedure for potential biomarkers using data from microarray experiments is proposed under purely linear assumptions. Linearity offers the possibility of verifiability, convergence and compactness in the selection of a set of potential biomarkers. The method shows a high discrimination rate and does not require the adjustment of parameters by the user, thus preserving analysis objectivity and repeatability. A case study in the identification of potential biomarkers for cervix cancer, as well as a validation study for colon cancer is presented to illustrate the application of the proposed procedure.

Keywords

Cancer biomarkers, microarray experiments

Introduction

An important research objective in biology and the medical sciences is the search for genes whose change in relative expression is an indication of a particular state of an organism such as cancer. These genes are known as biomarkers. Microarray experiments have played an important role in the identification of this type of genes. The successful identification of potential biomarker genes can lead to an eventual clinical confirmation and thus to enhanced disease diagnosis and prognosis capabilities.

Based on our own experience with microarray data, the following challenges regarding microarray experiments can be identified: (1) the available data is highly dimensional in terms of the number of genes to be studied (~10⁴) while showing a scarce number of replicates, (2) there is a rather large variation across replicates, (3) the data is not normally distributed and does not exhibit similar variances, (4) there is a considerable number of missing observations in the majority of experiments, (5) the data is commonly found already being normalized or nonlinearly transformed. All of these complicate the detection of potential biomarkers.

Furthermore, when it comes to data analysis, the following are also important challenges: (i) there is no standard way to compare results for gene selection or identification between studies, (ii) even with the same data (and sometimes with the same technique) different researchers end up with different sets of genes¹, thereby leading to a large number of potential biomarkers to be investigated, the research of which could prove lengthy and very expensive.

Truly integrated work across disciplines is not frequent in most microarray analysis works. Biology and Medicine experts are usually left with the burden of using coded analysis tools with a series of parameters of statistical, computational or mathematical nature that significantly affect the outcome of the software packages². This leads to issues in results reproducibility and comparability between studies.

These challenges motivate the search for microarray analysis techniques from which consistent results can be achieved across several experiments and analysts, particularly for the identification of potential biomarkers.

The purpose of this work is to introduce an approach to identify potential biomarkers from the analysis of microarray experiments based solely on linear models and assumptions. One of the uses proposed for this method is to establish a baseline of comparison for the many sophisticated methods with underlying nonlinear assumptions. The use of linearity throughout the analysis makes the method verifiable, repeatable across different analysts, and computationally tractable. In the following sections, the general analysis strategy is described and two illustrative case studies are presented, the first one in cervix cancer and the second one in colon cancer.

Analysis Strategy

Fig. 1 schematically shows the strategy proposed in this work. Each step is explained below.
Step 1: Microarray Experiment. The process begins with a microarray experiment with m₁ tissues in state one (Control – Cancer-free) and m₂ tissues in state two (Cancer) characterized in n genes. The intersection of each of the n genes with each of the m1+m2 tissues quantifies the relative expression of that particular gene in the selected tissue.

Step 2: Represent each gene with multiple performance measures. Different analysts measure the difference in relative genetic expression across two states differently. P_values are common examples of these measures. A p_value can be computed from the application of a statistical comparison test of a gene’s relative expression in the Control and Cancer states. The Mann-Whitney nonparametric test for the difference of medians of two populations is an instance of such comparisons. Finding an additional p_value for the same gene –since at least two measures are required for each gene- can be done through the removal of a couple of tissues from the microarray experiment under analysis.

Step 3: Apply Data Envelopment Analysis. Data Envelopment Analysis (DEA) finds the convex envelop of a particular data set consistently and without the need of varying parameters manually. If, for example, two p_values are used to represent each of the n genes in the experiment, then DEA can be used to find the envelope conformed by the dominating genes following the minimization direction of both p_values. Finding such envelope is done through the application of a linear programming formulation, which is the first instance where linearity becomes useful. Fig. 2 schematically shows how this envelope is formed by a series of lines, or in more mathematically correct terms, a series of hyperplanes. Indeed, if only two performance measures are used, it is possible to identify the said envelope graphically with the use of a simple ruler.

Step 4: Select genes in a series of efficient frontiers. The envelopes found through DEA are formally known as efficient frontiers. When an efficient frontier is found, then the solutions lying on it can be removed (as a layer of an onion), to then find the efficient frontier right underneath it. Following this scheme, several layers can be chosen containing different numbers of genes. These genes, having been found through the simultaneous optimization of the chosen performance measures, are the most likely candidates to be biomarkers. These will be referred to as efficient genes.

Step 5: Create an experimental design to vary the presence of the efficient genes. An experimental design that uses the presence of the genes as controllable variables is built in this step. Each variable can take a value of 0 or 1 (0 for absence of the gene). In this case, one run within the experiment corresponds to a combination of efficient genes. Later in the method, this design is used to record the effect of the presence of the said genes in a linear classifier.

Step 6: At each experimental design point, measure classification performance through linear discriminant analysis. Using the experimental design from the previous step, at each combination of efficient genes it is possible to obtain a measure of classification performance using a linear classifier through linear discriminant analysis. A linear classifier of this kind will always converge to the same position, thus preserving results consistency. At the end of this step, then, a complete experimental design relating classification rate with the absence or presence of the potential biomarkers is available.

Step 7: Fit a 1st order linear regression model. With the complete experimental design, it is possible to fit a 1st order linear regression model. This model will relate classification performance (response) to the absence or presence of the efficient genes (independent variables) in an empirical function.

Step 8: Apply integer linear programming to choose the potential biomarkers that maximize classification performance. An optimization problem can be set up in this stage. This problem entails finding the combination of efficient genes –recall that each gene is represented by a variable that can take values of 0 or 1 to indicate absence or presence of that gene- that maximizes the classification performance predicted by the regression model from the previous step.

The linear models and methods explained previously favor repeatability and auditability of results. Furthermore, the set of selected genes do not depend upon the setting of any parameters by the user.

Figure 1. Analysis Strategy based on Linear Models.

Case Study on Cervix Cancer

This case study helps to illustrate the application and the performance of the proposed procedure.

Step 1. The microarray database under analysis is related to cervix cancer and was compiled by Wong et al.³. The database consists of 8 control tissues and 25 cervix cancer tissues, all of them with relative expression level readings for 10,690 genes.

Step 2. The Mann-Whitney nonparametric two-sided test for comparison of medians of two populations was used to generate two different p_values per gene, following a leave-one-tissue-out strategy. This strategy focuses on extracting a particular tissue associated with one state. By removing a tissue, a statistical replicate is effectively deleted from the set, thereby forcing a p_value that is different to the original one. Thus, two different p_values are effectively created to represent each gene. The selection of the tissue to be removed to create a distinct matrix is performed randomly in this instance, however, it can be selected based on the magnitude of its variance (e.g. remove the tissue with the largest variance).

Step 3. The Data Envelopment Analysis model used for this case study was the Banks-Charnes-Cooper (BCC) model⁴. This is a linear programming model with the following associated formulations (P1 and P2):

The optimal values of the decision variables correspond to the interceptor and the partial first derivatives (with respect of each performance measure involved) of a supporting hyper plane lying on top of extreme points of the data set under analysis. At the end of the analysis, a piece-wise frontier is distinguishable as shown in Fig. 2. Do notice that, if the analysis requires only two performance measures, this piece-wise frontier can be detected through the superposition of a ruler directly onto the graph.

Figure 2. Representation of genes characterized through two different p_values. Only the case with 2 p_values has a convenient graphical representation, but the analysis can be extended to as many dimensions as performance measures selected.

Step 4. The first ten frontiers were kept for this analysis containing a total of 28 genes. It is important to note the discrimination rate shown by the method already at this point: a reduction of four orders of magnitude in the number of genes to analyze.

Step 5. In this step a composite experimental design involving 28 binary variables (one per gene in the shortlist from the previous step), was created and used. The experimental design consisted of 122 different runs in which each gene could take a value of 0 (absence) or 1 (presence). In the next step, at each run the analyst will build a linear classifier using only the genes identified with a value of 1 and measure its classification performance. Three different experimental designs form the total of 122 runs. The first design (DOE 1) is an orthogonal array consisting on 47 runs with 10 to 18 genes each; the second design (DOE 2) has 47 runs with 1 to 28 genes generated randomly; and the third design (DOE 3) consisted of 28 runs, each with only one gene. Fig. 3 shows the composite experimental design.

Figure 3. Composite Experimental Design. Values of 1 to indicate the presence of the gene in the associated column are shown shaded.

Step 6. Linear discriminant analysis was carried out using the combination of genes prescribed by each run of the composite design to record the classification performance of a linear classifier. Depending on the absence and presence of certain genes in each run, a range of different classification performances results. It is expected that classification performance improves with more efficient genes present. Using an analogy, a person has a better chance to be identified correctly if more distinctive characteristics of such person are provided. It is also important, however, to search for the minimal number of distinctive characteristics that allows correct classification. This last objective is aided by the next step.

Tables 1-3. Classification performance (CP) for each experimental design

DOE 1		DOE 2		DOE 3
No.	CP	No.	CP	No.	CP
1	0.37	48	0.37	96	0.909
2	1	49	1	97	0.879
3	1	50	1	98	0.879
4	1	51	1	99	0.909
5	1	52	0.909	100	0.879
6	1	53	0.939	101	0.909
7	1	54	0.939	102	0.818
8	1	55	0.939	103	0.879
9	1	56	1	104	0.818
10	1	57	0.939	105	0.879
11	1	58	1	106	0.879
12	1	59	1	107	0.818
13	1	60	1	108	0.848
14	1	61	1	109	0.879
15	1	62	0.97	110	0.879
16	1	63	0.909	111	0.848
17	1	64	1	112	0.939
18	1	65	1	113	0.879
19	1	66	1	114	0.848
20	1	67	1	115	0.788
21	1	68	1	116	0.879
22	1	69	1	117	0.909
23	1	70	1	118	0.758
24	1	71	1	119	0.818
25	1	72	1	120	0.879
26	1	73	1	121	0.818
27	1	74	1	122	0.818
28	1	75	1	123	0.879
30	1	77	1
31	1	78	1
32	1	79	1
33	1	80	1
34	1	81	1
35	1	82	1
36	1	83	1
37	1	84	1
38	1	85	1
39	1	86	1
40	1	87	1
41	1	88	1
42	1	89	1
43	1	90	1
44	1	91	1
45	1	92	1
46	1	93	1
47	1	94	1
47	1	95	1

Step 7. With the experimental design complete, a linear regression of the classification performance as a function of the presence or absence of the 28 genes is built. Noticing that most of the classification performances give a 100% correct classification, two different regression models were constructed (from two different samples). The first one included all of the 122 original runs (DOE 1 + DOE 2 + DOE 3). For the second model, the sample was reduced to only 38 runs, using the 28 runs from DOE 3 and 10 random ones from DOE 2. This was done with the idea of having samples showing a higher sensitivity to the presence of the different genes, as opposed as having mostly runs with a 100% classification performance. With the regression, the relationship between each important gene and the classification performance can be seen as in Equation (1), with variables defined as in Equation (2).

Table 4. Linear regression model using 122 experimental designs

Variable	Coefficient Symbol	Regression Coefficient
	β₀	0.8868
g₁	β₁	0.0152
g₂	β₂	0.0027
g₃	β₃	0.0097
g₄	β₄	0.0146
g₅	β₅	0.003
g₆	β₆	0.0083
g₇	β₇	-0.0034
g₈	β₈	0.0051
g₉	β₉	0.0001
g₁₀	β₁₀	0.0054
g₁₁	β₁₁	0.0008
g₁₂	β₁₂	-0.002
g₁₃	β₁₃	0.012
g₁₄	β₁₄	-0.0027
g₁₅	β₁₅	0.0138
g₁₆	β₁₆	0.0089
g₁₇	β₁₇	0.0166
g₁₈	β₁₈	0.0145
g₁₉	β₁₉	0.0089
g₂₀	β₂₀	0.012
g₂₁	β₂₁	0.0137
g₂₂	β₂₂	0.0105
g₂₃	β₂₃	-0.0068
g₂₄	β₂₄	-0.0025
g₂₅	β₂₅	0.0093
g₂₆	β₂₆	0.005
g₂₇	β₂₇	0.0079
g₂₈	β₂₈	0.0158

Table 4 represents the first regression model. A positive regression coefficient indicates that the gene associated to it increases classification performance. Twenty-three different genes have positive coefficients, and thus, contribute to improve classification performance when using all 122 runs to build this first regression model.

Table 5. Linear regression model using 38 experimental designs

Variable	Coefficient Symbol	Regression Coefficient
	β₀	0.854962
g₁	β₁	0.02688
g₂	β₂	0.01723
g₃	β₃	0.04222
g₄	β₄	0.02565
g₅	β₅	0.02926
g₆	β₆	0.05826
g₇	β₇	-0.0345
g₈	β₈	-0.00439
g₉	β₉	-0.03323
g₁₀	β₁₀	0.0265
g₁₁	β₁₁	0.0189
g₁₂	β₁₂	-0.01878
g₁₃	β₁₃	0.03906
g₁₄	β₁₄	0.02826
g₁₅	β₁₅	0.04174
g₁₆	β₁₆	-0.02205
g₁₇	β₁₇	0.04799
g₁₈	β₁₈	0.00947
g₁₉	β₁₉	-0.00714
g₂₀	β₂₀	-0.01286
g₂₁	β₂₁	0.01577
g₂₂	β₂₂	0.02561
g₂₃	β₂₃	-0.08053
g₂₄	β₂₄	-0.02284
g₂₅	β₂₅	0.00324
g₂₆	β₂₆	-0.03274
g₂₇	β₂₇	-0.02898
g₂₈	β₂₈	0.0265

After reducing the dataset to only 38 runs the model becomes more targeting. The model represented in Table 5, shows that 17 genes can be used to improve classification performance.

Step 8. Using the two linear regression models described before, the optimization model is to find the combination of genes (through the use of binary variables) to maximize the predicted classification performance using Linear Integer Programming. Formulation P3 shows the resulting optimization problem.

Table 6. Optimization models

Index	Accession Number	Optimization
1	AA488645	x
2	H22826	x
3		x
4	T71316	x
5	AA243749	x
6	AA460827	x
7	AA454831
8	AA913408, AA913864	x
9	AA487237	x
10	AA446565	x
11	H23187	x
12	AI221445	x
13	R36086	x
14	AA282537
15	N93686	x
16	R91078	x
17	R44822	x
18	AI334914	x
19	R93394	x
20	AA621155	x
21	AA705112	x
22	R52794	x
23	AA424344
24	H69876
25	H55909	x
26	W74657	x
27	AI017398	x
28	H99699	x

Such optimization resulted in the identification of 17 important genes, that is, potential cervix cancer biomarkers. These are shown on Table 6, marked with an ‘x’. Also, a hierarchy of genes can be found using this procedure by sorting them by decreasing magnitude of their regression coefficients. This helps visualize what genes and in which order are most necessary and significant when it comes to detect Cervix Cancer, as shown in Table 7. In this table, CP_Predicted comes from the Classification Performance estimated using the Linear Regression from model 2. CP_Real is the Classification Performance obtained by applying the linear discriminant analysis method.

Table 7. Hierarchy of genes to obtain a maximized classification performance

Hierarchy	Genes	Classification Performance- Predicted	Classification Performance Real
1	6	0.91	0.91
2	17	0.96	0.97
3	3	1.00	0.94
4	15	1.00	0.97
5	13	1.00	0.97
6	5	1.00	1.00
7	14	1.00	1.00
8	1	1.00	1.00
9	10	1.00	1.00
10	28	1.00	1.00
11	4	1.00	1.00
12	22	1.00	1.00
13	11	1.00	1.00
14	2	1.00	1.00
15	21	1.00	1.00
16	18	1.00	1.00
17	25	1.00	1.00

Case Study on Colon Cancer

As part of our study and to reinforce the performance of the proposed method we proceeded to validate the steps illustrated before with another microarray database. The microarray database under analysis for this study was related to colon cancer, made available by Alon, et al.⁵. This database consisted of 22 healthy tissues and 40 colon cancer tissues, all of them with expression level readings of 2000 genes. We kept only 27 genes of the 2000 original genes prevenient of the first ten frontiers founded through DEA. Table 8 shows the resulting efficient genes selected through DEA analysis.

Table 8. Efficient genes prevenient of DEA analysis.

Genes	Frontier	Accession
g₁	1	M22382
g₂	1	R87126
g₃	2	H08393
g₄	2	R36977
g₅	3	J05032
g₆	3	M26383
g₇	4	X63629
g₈	4	H40095
g₉	4	Z50753
g₁₀	4	M63391
g₁₁	5	J02854
g₁₂	5	X12671
g₁₃	6	U09564
g₁₄	6	H43887
g₁₅	6	M76378
g₁₆	7	M36634
g₁₇	7	T86473
g₁₈	8	H06524
g₁₉	8	R84411
g₂₀	8	X14958
g₂₁	9	T92451
g₂₂	9	M26697
g₂₃	9	T71025
g₂₄	10	X86693
g₂₅	10	T47377
g₂₆	10	U30825
g₂₇	10	D31885

After the identification of these 27 efficient genes, a composite experimental design was created to explore the presence/absence effect of these on a linear classifier. The experimental design involved 27 binary variables and formed 37 runs. The first design consists of 10 runs using between 18 to 24 genes generated randomly; and the second design consisted of 27 runs, each with only one gene activated. Fig. 4 shows the resulting designs.

Figure 4. Composite Experimental Design (10 runs using between 18 to 24 genes generated randomly and 27 runs with only one gene activated). Values of 1 to indicate the presence of the gene in the associated column are shown shaded.

Using the composite experimental design we measured and recorded the classification performance through linear discriminant analysis as shown in Table 9.

Table 9. Classification performances for experimental designs

Run	Classification Performance
1	0.968
2	0.952
3	0.952
4	0.919
5	0.919
6	0.919
7	0.952
8	0.887
9	0.919
10	0.935
11	0.694
12	0.823
13	0.71
14	0.726
15	0.629
16	0.597
17	0.71
18	0.661
19	0.758
20	0.855
21	0.79
22	0.71
23	0.694
24	0.758
25	0.823
26	0.774
27	0.661
28	0.726
29	0.677
30	0.694
31	0.79
32	0.677
33	0.774
34	0.774
35	0.742
36	0.677
37	0.645

After the completion of the experimental designs a linear regression model was used to relate the classification performance with the absence or presence of each of the 27 efficient genes. Fifteen of these genes had a positive regression coefficient, thereby becoming potential colon cancer biomarkers. Table 10 shows a list of these potential colon cancer biomarkers.

Table 10. Genes needed to obtain a 100% classification performance identified through their positive regression coefficients.

Variable	Coefficient Symbol	Regression Coefficient
1	β₀	0.716
g₁	β₁	-0.00379
g₂	β₂	0.081
g₃	β₃	0.01
g₄	β₄	-0.02
g₅	β₅	-0.1
g₆	β₆	-0.08
g₇	β₇	-0.01
g₈	β₈	-0.03
g₉	β₉	0.027
g₁₀	β₁₀	0.134
g₁₁	β₁₁	0.068
g₁₂	β₁₂	0.01
g₁₃	β₁₃	-0.05
g₁₄	β₁₄	0.077
g₁₅	β₁₅	0.074
g₁₆	β₁₆	0.036
g₁₇	β₁₇	-0.05
g₁₈	β₁₈	0.01
g₁₉	β₁₉	-0.05
g₂₀	β₂₀	0
g₂₁	β₂₁	0.016
g₂₂	β₂₂	-0.01
g₂₃	β₂₃	0.035
g₂₄	β₂₄	0.01
g₂₅	β₂₅	0.028
g₂₆	β₂₆	0

Once the linear regression analysis was completed, this equation was used to find the combination of genes that maximize the classification performance applying integer linear programming. The use of this tool allowed us to obtain the hierarchy of genes as in the previous example. This hierarchy is shown in Table 11.

Table 11. Hierarchy of genes to obtain a maximized classification performance

Hierarchy	Genes	Prediction	Classification Performance- Real
1	10	0.85	0.855
2	2	0.931	0.855
3	14	1	0.855
4	15	1	0.855
5	11	1	0.855
6	16	1	0.855
7	23	1	0.839
8	25	1	0.871
9	9	1	0.903
10	21	1	0.903
11	3	1	0.919
12	12	1	0.935
13	24	1	0.935
14	18	1	0.935
15	27	1	0.935

Conclusions

In this work, a strategy to detect potential biomarkers from the analysis of microarray experiments is proposed. The strategy is based solely on linear models and assumptions and, as such, is intended as a solid baseline to explore if more complex non-linear based methods bring better results. Its consistent convergence and lack of parameter setting by the users make this method a very competitive and attractive one for repeatability and auditability. This is especially important in high throughput experiments and in a highly interdisciplinary field like bioinformatics. A case study involving the analysis of a microarray database on cervix cancer and validation study on colon cancer were presented to demonstrate the capabilities of the strategy. Indeed, in the studies it was possible to discriminate among more than 10,000 genes to converge to less than 30 potential biomarkers in each case.

Funding

This work was made possible thanks to the NIH-MARC grant “Assisting Bioinformatics Efforts at Minority Institutions” PAR-03-026 and BioSEI UPRM grant 330103080301.

References

Ein-Dor L, Kela I, Getz G, et al. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics. 2015;21: 171–178.
Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002;18: 546–554.
Wong YF, Selvanayagam ZE, Wei N, et al. Expression Genomics of Cervical Cancer: Molecular Classification and Prediction of Radiotherapy Response by DNA Microarray. Clin Cancer Res. 2003;9: 5486–5492.
Charnes A, Cooper WW, Lewin AY, Seiford LM. Data Envelopment Analysis: Theory, Methodology, and Applications. Boston, MA: Kluwer Academic Publishers;2015.
Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999;96: 6745–6750.

Baseline Detection of Potential Cancer Biomarkers with Linear Models from Microarray Experiments

Article Type

Publication history

Copyright

Corresponding author

Abstract

Cite this article

Abstract

Keywords

Introduction

Analysis Strategy

Case Study on Cervix Cancer

Case Study on Colon Cancer

Conclusions

Funding

References

Authors

Editors and Reviewers

Health and Medicine

Social Sciences

Baseline Detection of Potential Cancer Biomarkers with Linear Models from Microarray Experiments

Article Type

Publication history

Copyright

Corresponding author

Abstract

Cite this article

Abstract

Keywords

Introduction

Analysis Strategy

Case Study on Cervix Cancer

Case Study on Colon Cancer

Conclusions

Funding

References