Abstract

The term “Big Data” emphasizes data quantity, not quality. A 5-element, Euler-formula-like identity shows that for any dataset of size n, probabilistic or not, the difference between the sample and population averages is the product of three measures: (1) data quality, (2) data quantity, and (3) problem difficulty. This decomposition tells us: (I) probabilistic sampling ensures high data quality by controlling a data defect index at the level of 1/√N, where N is the population size; (II) when we lose this control, the estimation error, relative to the benchmarking rate 1/√n, increases with √N, forming the Law of Large Populations; (III) the “bigness” of Big Data (for population inferences) should be measured by the relative size n/N, not the absolute size n. The formula shows that once data quality is taken into account, the effective sample size of a “Big Data” set can be vanishingly small. Without understanding this phenomenon, “Big Data” can do more harm than good, because a drastically inflated assessment of precision leads to gross overconfidence, setting us up to be caught by surprise when reality unfolds, as we experienced during 2016. Data from the Cooperative Congressional Election Study (CCES, conducted by Stephen Ansolabehere, Douglas Rivers, and others, and analyzed by Shiro Kuriwaki) are used to assess the data quality of 2016 US presidential election polls, with the aim of gaining a clearer vision for the 2020 election and beyond.
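
The three-factor decomposition described in the abstract can be checked numerically. The sketch below is illustrative only: it follows the notation of the cited paper rather than the abstract itself, writing the error as rho * sqrt((1 - f) / f) * sigma_G, where f = n/N, rho is the finite-population correlation between the recording indicator R and the variable G (data quality), sqrt((1 - f) / f) is the data quantity term, and sigma_G is the population standard deviation (problem difficulty). The function names, the toy population, and the response probabilities are assumptions made for this example, and the effective-sample-size function uses a large-population approximation, not an exact result.

```python
import math
import random


def decompose_error(values, recorded):
    """Return (actual_error, rho, quantity, sigma_G) for a finite population.

    values   -- G_j for all N population units
    recorded -- 0/1 indicators R_j (1 if unit j ended up in the dataset)
    """
    N = len(values)
    n = sum(recorded)
    f = n / N                                   # relative size n/N
    mean_G = sum(values) / N                    # population average
    mean_sample = sum(g for g, r in zip(values, recorded) if r) / n
    # Finite-population (divide-by-N) moments
    sigma_G = math.sqrt(sum((g - mean_G) ** 2 for g in values) / N)
    sigma_R = math.sqrt(f * (1 - f))
    cov = sum((r - f) * (g - mean_G) for g, r in zip(values, recorded)) / N
    rho = cov / (sigma_R * sigma_G)             # data quality
    quantity = math.sqrt((1 - f) / f)           # data quantity
    return mean_sample - mean_G, rho, quantity, sigma_G


def effective_sample_size(f, rho):
    """Approximate size of a simple random sample with the same squared error.

    Large-population approximation: n_eff ~ f / (1 - f) / rho**2.
    """
    return f / (1 - f) / rho ** 2


# Toy population (hypothetical): binary G, with G = 1 units slightly more
# likely to be recorded than G = 0 units, i.e. a non-probabilistic sample.
rng = random.Random(0)
values = [1 if rng.random() < 0.5 else 0 for _ in range(10_000)]
recorded = [1 if rng.random() < (0.06 if g else 0.05) else 0 for g in values]

err, rho, qty, sigma = decompose_error(values, recorded)
print("actual error:", err)
print("rho * quantity * sigma:", rho * qty * sigma)
print("n_eff for f = 0.01, rho = 0.005:", effective_sample_size(0.01, 0.005))
```

The first two printed numbers agree up to floating-point rounding, since the decomposition is an algebraic identity rather than an approximation. The last line illustrates point (III): under the approximation used here, a dataset covering 1% of a population with a data defect correlation of only 0.005 carries roughly the information of a simple random sample of about 400 units, whatever the absolute size n.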

(This talk is based on Meng, X.-L. (2018) Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election. Annals of Applied Statistics, Vol. 12, No. 2, 685–726, available at https://statistics.fas.harvard.edu/people/xiao-li-meng)

Professor Xiao-Li Meng, Dean of the Harvard University Graduate School of Arts and Sciences (GSAS), Whipple V. N. Jones Professor and former chair of Statistics at Harvard, is well known for his depth and breadth in research, his innovation and passion in pedagogy, and his vision and effectiveness in administration, as well as for his engaging and entertaining style as a speaker and writer. Meng has received numerous awards and honors for the more than 120 publications he has authored in at least a dozen theoretical and methodological areas, as well as in areas of pedagogy and professional development; he has delivered more than 400 research presentations and public speeches on these topics, and he is the author of “The XL-Files,” a regularly appearing column in the IMS (Institute of Mathematical Statistics) Bulletin.
