
# MIXTURE REGRESSION MODEL FOR INCOMPLETE DATA

VOLUME 1, NUMBER 3, DECEMBER 2018

ISSN: 2595-8402

DOI: 10.5281/zenodo.2528978


Loc Nguyen1, Anum Shafiq2

[email protected]

[email protected]

ABSTRACT

The Regression Expectation Maximization (REM) algorithm, a variant of the Expectation Maximization (EM) algorithm, uses in parallel a long regression model and many short regression models to solve the problem of incomplete data. Experimental results proved the resistance of REM to incomplete data, in which the accuracy of REM decreases insignificantly when the data sample is made sparse with loss ratios up to 80%. However, like traditional regression analysis methods, the accuracy of REM can decrease if data varies in complicated ways with many trends. In this research, we propose a so-called Mixture Regression Expectation Maximization (MREM) algorithm. MREM is the full combination of REM and the mixture model, in which we use two EM processes in the same loop. MREM uses the first EM process, for the exponential family of probability distributions, to estimate missing values as REM does. Consequently, MREM uses the second EM process to estimate parameters as the mixture model method does. The purpose of MREM is to take advantage of both REM and the mixture model. Unfortunately, experimental results show that MREM is less accurate than REM. However, MREM is still essential because a different approach to the mixture model can be derived by fusing the linear equations of MREM into a unique curve equation.

Keywords: Regression Model, Mixture Regression Model, Expectation Maximization Algorithm, Incomplete Data

1. INTRODUCTION

1.1. Main work

As a convention, a regression model is a linear regression function Z = α0 + α1X1 + α2X2 + … + αnXn in which the variable Z is called the response variable or dependent variable whereas each Xi is called a regressor, predictor, regression variable, or independent variable. Each αi is called a regression coefficient. The essence of regression analysis is to calculate the regression coefficients from a data sample. When the sample is complete, these coefficients are determined by the least squares method [1, pp. 452-458]. When the sample is incomplete, there are some approximation approaches to estimate regression coefficients, such as the complete case method, the ad-hoc method, multiple imputation, maximum likelihood, the weighting method, and the Bayesian method. We focus on applying the expectation maximization (EM) algorithm to constructing a regression model in case of missing data, noting that the EM algorithm belongs to the maximum likelihood approach. In previous research, we proposed a so-called Regression Expectation Maximization (REM) algorithm to learn a linear regression function from incomplete data in which some values of Z and Xi are missing. REM is a variant of the EM algorithm, which is used to estimate regression coefficients. Experimental results in the previous research proved that the accuracy of REM decreases insignificantly even as loss ratios increase significantly. We hope that REM will be accepted as a new standard method for regression analysis in case of missing data, alongside the six current standard approaches: the complete case method, the ad-hoc method, multiple imputation, maximum likelihood, the weighting method, and the Bayesian method. Here we combine REM and the mixture model with the expectation that accuracy is improved, especially when data is incomplete and has many trends. Our proposed algorithm is called the Mixture Regression Expectation Maximization (MREM) algorithm. The purpose of MREM is to take advantage of both REM and the mixture model.
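To ground the least squares step that the later methods extend, here is a minimal sketch of fitting the regression coefficients αi on a complete sample. The synthetic data, the true coefficients (1.0, 2.0, -0.5), and the use of NumPy are illustrative assumptions, not part of the paper's experiments.

```python
import numpy as np

# Hypothetical complete sample: Z = 1.0 + 2.0*X1 - 0.5*X2 + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Z = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Least squares: augment with an intercept column and solve min ||A·alpha - Z||.
A = np.column_stack([np.ones(len(X)), X])
alpha, *_ = np.linalg.lstsq(A, Z, rcond=None)
print(alpha)  # close to [1.0, 2.0, -0.5]
```

This is the complete-data baseline; the rest of the section is about what to do when entries of Z or the Xi are missing and this direct solve is no longer applicable.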
Unfortunately, experimental results show that MREM is less accurate than REM. However, MREM is still essential because a different approach to the mixture model can be derived by fusing the linear equations of MREM into a unique curve equation, as discussed later. Because this research follows directly from our previous research, the two share some common content related to the research survey and experimental design, but we emphasize that their methods do not coincide, although MREM is derived from REM.

Because MREM is the combination of REM and the mixture model whereas REM is a variant of the EM algorithm, we need to survey some works related to the application of the EM algorithm to regression analysis. Kokic proposed an excellent method to calculate the expectation of errors for estimating the coefficients of a multivariate linear regression model. In Kokic's method, the response variable Z has missing values. Ghitany, Karlis, Al-Mutairi, and Al-Awadhi calculated the expectation of a function of a mixture random variable in the expectation step (E-step) of the EM algorithm and then used such expectation for estimating the parameters of a multivariate mixed Poisson regression model in the maximization step (M-step). Anderson and Hardin used the reject inference technique to estimate the coefficients of a logistic regression model when the response variable Z is missing but the characteristic variables (regressors Xi) are fully observed. Anderson and Hardin replaced missing Z by its conditional expectation on the regressors Xi, where such expectation is a logistic function. Zhang, Deng, and Su used the EM algorithm to build up a linear regression model for studying glycosylated hemoglobin from partially missing data. In other words, Zhang, Deng, and Su aimed to discover the relationship between the independent variables (predictors) and diabetes.

Besides the EM algorithm, there are other approaches to solving the problem of incomplete data in regression analysis. Haitovsky stated that there are two main approaches to this problem. The first approach is to ignore missing data and to apply the least squares method to the remaining observations. The second approach is to calculate the covariance matrix of the regressors and then to apply such covariance matrix to constructing the system of normal equations. Robins, Rotnitzki, and Zhao proposed a class of inverse probability of censoring weighted estimators for estimating the coefficients of a regression model. Their approach is based on the dependency of the mean vector of the response variable Z on the vector of regressors Xi when Z has missing values. Robins, Rotnitzki, and Zhao assumed that the probability λit(α) of existence of Z at time point t is dependent on the existence of Z at the previous time point t–1 but independent from Z. Even though Z is missing, the probability λit(α) is still determined, and so regression coefficients are calculated based on the inverse of λit(α) and Xi. The inverse of λit(α) is considered as a weight for the complete case. Robins, Rotnitzki, and Zhao used additional time-dependent covariates Vit to determine λit(α).

In the article "Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models", Horton and Kleinman classified regression analysis in case of missing data into six methods: the complete case method, the ad-hoc method, multiple imputation, maximum likelihood, the weighting method, and the Bayesian method. The EM algorithm belongs to the maximum likelihood method. According to the complete case method, the regression model is learned from only the non-missing values of incomplete data [2, p. 3]. The ad-hoc method sets missing values to some common value, creates an indicator of missingness as a new variable, and finally builds the regression model from both the existent variables and such new variable [2, p. 3]. The multiple imputation method has three steps. Firstly, missing values are replaced by possible values; the replacement is repeated until a sufficient number of complete datasets is obtained. Secondly, regression models are learned from these complete datasets as usual [2, p. 4]. Finally, these regression models are aggregated together. The maximum likelihood method aims to construct the regression model by maximizing the likelihood function. The EM algorithm is a variant of the maximum likelihood method, which has two steps: the expectation step (E-step) and the maximization step (M-step). In the E-step, multiple entries are created in an augmented dataset for each observation with missing values and then the probability of the observation is estimated based on the current parameter [2, p. 6]. In the M-step, the regression model is built from the augmented dataset. The REM algorithm proposed in our previous research is different from the traditional EM for regression analysis because we replace missing values in the E-step by the expectation of sufficient statistics via a mutual balance process instead of estimating the probability of the observation. The weighting method determines the probability of missingness and then uses such probability as a weight for the complete case.
The aforementioned research of Robins, Rotnitzki, and Zhao belongs to the weighting approach. Instead of replacing missing values by possible values as the imputation method does, the Bayesian method imputes missing values via estimation with a prior distribution on the covariates, exploiting the close relationship between the Bayesian approach and the maximum likelihood method [2, p. 7].
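As a concrete illustration of two of the six methods surveyed above, the sketch below fits a simple regression by the complete case method and by single mean imputation on a synthetic sample whose regressor values are missing completely at random. The data, the 30% loss ratio, and the true coefficients (intercept 1.0, slope 3.0) are hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
Z = 1.0 + 3.0 * X[:, 0] + rng.normal(scale=0.2, size=200)

# Make about 30% of the regressor values missing completely at random.
mask = rng.random(200) < 0.3
X_obs = X.copy()
X_obs[mask, 0] = np.nan

def fit(X, Z):
    """Least squares fit with an intercept column; returns [intercept, slope]."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, Z, rcond=None)[0]

# Complete case method: drop every row with a missing value.
keep = ~np.isnan(X_obs[:, 0])
alpha_cc = fit(X_obs[keep], Z[keep])

# Single mean imputation: replace missing values by the observed mean.
X_imp = X_obs.copy()
X_imp[mask, 0] = np.nanmean(X_obs[:, 0])
alpha_imp = fit(X_imp, Z)

print(alpha_cc, alpha_imp)
```

Under this missing-completely-at-random setup both slope estimates land near the true value; the methods diverge more when missingness depends on the data, which is where the maximum likelihood (EM) and Bayesian approaches become important.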

1.2. Related Studies

Recall that MREM is the combination of REM and the mixture model, so we need to survey other works related to regression models supported by mixture models. As a convention, such a regression model is called a mixture regression model. In the literature, there are two approaches to the mixture regression model:

• The first approach is to use logistic function to estimate the mixture coefficients.

• The second approach is to construct a joint probability distribution as product of the probability distribution of response variable Z and the probability distribution of independent variables Xi.

According to the first approach, the mixture probability distribution is formulated as follows:

P(Z|X, Θ) = Σk ck·Pk(Z|αkTX, σk2)    (1)

where Θ = (αk, σk2)T is the compound parameter whereas αk and σk2 are the regression coefficients and variance of the partial (component) probability distribution Pk(Z|αkTX, σk2). Note that the mean of Pk(Z|αkTX, σk2) is αkTX and the mixture coefficient is ck. In the first approach, the regression coefficients αk are estimated by the least squares method whereas the mixture coefficients ck are estimated by the logistic function as follows [11, p. 4]:

ck = exp(βkTX) / Σl exp(βlTX)    (2)

where βk are the coefficients of the logistic function.

The mixture regression model is:

Ẑ = Σk ck·αkTX    (3)
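A small numeric sketch of the first approach may help: hypothetical coefficients alpha for two component regressions and gating weights beta (an assumed parameterization of the logistic function producing the mixture coefficients ck) yield the prediction as the ck-weighted sum of the component outputs. All values are illustrative, not taken from [11].

```python
import numpy as np

def softmax(v):
    """Logistic (softmax) function, numerically stabilized."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical two-component mixture regression (first approach).
# alpha[k] = per-component regression coefficients (intercept first);
# beta[k]  = assumed gating weights of the logistic function for c_k.
alpha = np.array([[0.0, 2.0], [5.0, -1.0]])
beta = np.array([[0.0, 1.0], [0.0, -1.0]])

def predict(x):
    xa = np.concatenate([[1.0], x])  # augmented input (1, x1, ...)
    c = softmax(beta @ xa)           # mixture coefficients c_k(x)
    return c @ (alpha @ xa)          # sum_k c_k * alpha_k^T x

print(predict(np.array([2.0])))  # about 3.98: component 1 dominates at x = 2
```

Because the gating depends on x, each component regression dominates in its own region of the input space, which is how the mixture captures several trends at once.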

According to the second approach, the joint distribution is defined as follows [12, p. 4]:

P(Z, X|Θ) = Σk ck·Pk(Z|αkTX, σk2)·Pk(X|μk, Σk)    (4)

where αk are the regression coefficients and σk2 is the variance of the conditional probability distribution Pk(Z|αkTX, σk2) whereas μk and Σk are the mean vector and covariance matrix of the prior probability distribution Pk(X|μk, Σk), respectively. The mixture regression model is [12, p. 6]:

Ẑ = E(Z|X) = Σk πk(X)·αkTX    (5)

where

πk(X) = ck·Pk(X|μk, Σk) / Σl cl·Pl(X|μl, Σl)    (6)
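The second approach's prediction rule can be sketched for a one-dimensional regressor: with hypothetical component priors ck, Gaussian densities Pk(X|μk, σk), and linear component means, the code computes the posterior mixture weights πk(X) of equation (6) and the weighted prediction of equation (5). All parameter values are assumptions for illustration.

```python
import numpy as np

# Hypothetical second-approach mixture: component k has prior weight c_k,
# a 1-D Gaussian regressor density P_k(X | mu_k, sigma_k), and a linear
# conditional mean alpha_k^T X (intercept first).
c = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.0])
alpha = np.array([[1.0, 1.0], [0.0, -1.0]])

def gauss(x, m, s):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def predict(x):
    # pi_k(x) = c_k P_k(x | mu_k, sigma_k) / sum_l c_l P_l(x | mu_l, sigma_l)
    w = c * gauss(x, mu, sigma)
    pi = w / w.sum()
    # E[Z|x] = sum_k pi_k(x) * alpha_k^T (1, x)
    return pi @ (alpha @ np.array([1.0, x]))

print(predict(-2.0), predict(2.0))  # near -1.0 and -2.0 respectively
```

Near x = -2 the first component dominates, so the prediction follows its line 1 + x; near x = 2 the second component's line -x takes over. Unlike the first approach, the weights here come from the regressor densities rather than from separately fitted gating parameters.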

The joint probability can also be defined in a different way as follows [13, p. 21], [14, p. 24], [15, p. 4]: