@article{Li:2019, author = {Li, Y and Bellotti, A and Adams, N}, journal = {Foundations of Data Science}, title = {Issues using logistic regression with class imbalance, with a case study from credit risk modelling}, year = {2019} }
TY - JOUR
AB - The class imbalance problem arises in two-class classification problems, when the less frequent (minority) class is observed much less than the majority class. This characteristic is endemic in many problems such as modeling default or fraud detection. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen’s results to show the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. In a simulation and a real mortgage dataset, we show that logistic regression is not able to provide the best out-of-sample predictive performance and that an approach that is able to model underlying structure in the minority class is often superior.
AU - Li,Y
AU - Bellotti,A
AU - Adams,N
PY - 2019///
TI - Issues using logistic regression with class imbalance, with a case study from credit risk modelling
T2 - Foundations of Data Science
ER -