## Publications


Newson RB, Post-parmest peripherals: fvregen, invcise, and qqvalue

The parmest package is used with Stata estimation commands to produce output datasets (or results-sets) with one observation per estimated parameter, and data on parameter names, estimates, confidence limits, p-values, and other parameter attributes. These results-sets can then be input to other Stata programs to produce tables, listings, plots, and secondary results-sets containing derived parameters. Three recently added packages for post-parmest processing are fvregen, invcise, and qqvalue. fvregen is used when the parameters belong to models containing factor variables, introduced in Stata version 11. It regenerates these factor variables in the results-set, enabling the user to plot, list, or tabulate factor levels with estimates and confidence limits of parameters specific to these factor levels. invcise calculates standard errors inversely from confidence limits produced without standard errors, such as those for medians and for Hodges-Lehmann median differences. These standard errors can then be input, with the estimates, into the metaparm module of parmest to produce confidence intervals for linear combinations of medians or of median differences, such as those used in meta-analysis or interaction estimation. qqvalue inputs the p-values in a results-set and creates a new variable containing the quasi-q-values, which are calculated by inverting a multiple-test procedure designed to control the familywise error rate (FWER) or the false discovery rate (FDR). The quasi-q-value for each p-value is the minimum FWER or FDR for which that p-value would be in the discovery set if the specified multiple-test procedure was used on the full set of p-values. fvregen, invcise, qqvalue, and parmest can be downloaded from SSC.
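The inversion that defines a quasi-q-value can be sketched in a few lines of Python. This is an illustrative sketch only, not the qqvalue code: the function names are hypothetical, and only two procedures are shown, the one-step Bonferroni procedure (FWER) and the step-up Simes procedure (FDR).

```python
def bonferroni_qvalues(pvalues):
    """Quasi-q-values from inverting the one-step Bonferroni procedure (FWER)."""
    m = len(pvalues)
    # p is rejected at level alpha when p <= alpha / m, so the minimum alpha is m * p.
    return [min(1.0, m * p) for p in pvalues]


def simes_qvalues(pvalues):
    """Quasi-q-values from inverting the step-up Simes procedure (FDR)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, carrying the smallest m * p / rank seen so far.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.001, 0.02, 0.03, 0.5]  # hypothetical p-values from a results-set
print(bonferroni_qvalues(pvals))
print(simes_qvalues(pvals))
```

Each returned value is the smallest FWER or FDR at which the corresponding p-value would fall in the discovery set, so comparing it with a chosen level reproduces the decision of the underlying procedure.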

Newson RB, Homoskedastic adjustment inflation factors in model selection

Insufficient confounder adjustment is viewed as a common source of "false discoveries", especially in the epidemiology sector. However, adjustment for "confounders" that are correlated with the exposure, but which do not independently predict the outcome, may cause loss of power to detect the exposure effect. On the other hand, choosing confounders based on "stepwise" methods is subject to many hazards, which imply that the confidence interval eventually published is likely not to have the advertized coverage probability for the effect that we wanted to know. We would like to be able to find a model in the data on exposures and confounders, and then to estimate the parameters of that model from the conditional distribution of the outcome, given the exposures and confounders. The haif package, downloadable from SSC, calculates the homoskedastic adjustment inflation factors (HAIFs), by which the variances and standard errors of coefficients for a matrix of X-variables are scaled (or inflated), if a matrix of unnecessary confounders A is also included in a regression model, assuming equal variances (homoskedasticity). These can be calculated from the A- and X-variables alone, and can be used to inform the choice of a set of models eventually fitted to the outcome data, together with the usual criteria involving causality and prior opinion. Examples are given of the use of HAIFs and their ratios.
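For intuition, the simplest special case can be written down directly. With one exposure X, one candidate confounder A, an intercept, and equal variances, adding A to the model multiplies the variance of the X coefficient by 1/(1 - r^2), where r is the sample correlation between X and A. The Python sketch below covers only this single-variable case; the function names are hypothetical, and it is not the haif package's own computation, which the abstract does not spell out.

```python
import math


def pearson_r(x, a):
    """Sample correlation between exposure x and candidate confounder a."""
    n = len(x)
    mean_x, mean_a = sum(x) / n, sum(a) / n
    sxa = sum((xi - mean_x) * (ai - mean_a) for xi, ai in zip(x, a))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    saa = sum((ai - mean_a) ** 2 for ai in a)
    return sxa / math.sqrt(sxx * saa)


def inflation_factors(x, a):
    """Return (variance factor, standard-error factor) for adding a to a model with x."""
    r = pearson_r(x, a)
    variance_factor = 1.0 / (1.0 - r * r)
    return variance_factor, math.sqrt(variance_factor)


# Hypothetical data: a is strongly correlated with x, so the inflation is large.
x = [0, 1, 2, 3]
a = [0, 1, 1, 2]
print(inflation_factors(x, a))
```

Note that the outcome variable never appears: the factors depend on the A- and X-variables alone, which is why they can inform model choice before the outcome data are analyzed.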

Newson RB, parmest and extensions

The parmest package creates output datasets (or results sets) with one observation for each of a set of estimated parameters, and data on the parameter estimates, standard errors, degrees of freedom, t or z statistics, p-values, confidence limits, and other parameter attributes specified by the user. It is especially useful when parameter estimates are "mass-produced", as in a genome scan. Versions of the package have existed on SSC since 1998, when it contained the single command parmest. However, the package has since been extended with additional commands. The metaparm command allows the user to mass-produce confidence intervals for linear combinations of uncorrelated parameters. Examples include confidence intervals for a weighted arithmetic or geometric mean parameter in a meta-analysis, or for differences or ratios between parameters, or for interactions, defined as differences (or ratios) between differences. The parmcip command is a lower-level utility, inputting variables containing estimates, standard errors, and degrees of freedom, and outputting variables containing confidence limits and p-values. As an example, we can input genotype frequencies and calculate confidence intervals for geometric mean homozygote/heterozygote ratios for genetic polymorphisms, measuring the size and direction of departures from Hardy-Weinberg equilibrium.
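The arithmetic behind these two commands can be illustrated in Python. The sketch below is not the parmest code: the function names are hypothetical, and it assumes a Normal (z) reference distribution throughout, whereas parmcip can also use degrees of freedom for a t distribution.

```python
from math import sqrt
from statistics import NormalDist


def conf_limits_and_p(estimate, stderr, level=0.95):
    """Confidence limits and two-sided p-value from an estimate and standard error.

    Assumes a Normal (z) reference distribution; a t distribution is not covered.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = estimate / stderr
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate - z_crit * stderr, estimate + z_crit * stderr, p_value


def linear_combination(estimates, stderrs, weights):
    """Estimate and standard error of a weighted sum of uncorrelated parameters."""
    est = sum(w * b for w, b in zip(weights, estimates))
    # With no correlation between parameters, the variances simply add.
    se = sqrt(sum((w * s) ** 2 for w, s in zip(weights, stderrs)))
    return est, se


# Hypothetical example: difference between two independently estimated parameters.
est, se = linear_combination([1.0, 3.0], [0.3, 0.4], [1, -1])
print(est, se, conf_limits_and_p(est, se))
```

Ratios and geometric means fit the same pattern once the parameters are expressed on a log scale, with the results exponentiated afterwards.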

Newson RB, Robust confidence intervals for Hodges–Lehmann median difference

The cendif module is part of the somersd package, and calculates confidence intervals for the Hodges–Lehmann median difference between values of a variable in two subpopulations. The traditional Lehmann formula, unlike the formula used by cendif, assumes that the two subpopulation distributions are different only in location, and that the subpopulations are therefore equally variable. The cendif formula therefore contrasts with the Lehmann formula as the unequal-variance t-test contrasts with the equal-variance t-test. In a simulation study, designed to test cendif to destruction, the performance of cendif was compared to that of the Lehmann formula, using coverage probabilities and median confidence interval width ratios. The simulations involved sampling from pairs of Normal or Cauchy distributions, with subsample sizes ranging from 5 to 40, and between-subpopulation variability scale ratios ranging from 1 to 4. If the sample numbers were equal, then both methods gave coverage probabilities close to the advertized confidence level. However, if the sample numbers were unequal, then the Lehmann coverage probabilities were over-conservative if the smaller sample was from the less variable population, and over-liberal if the smaller sample was from the more variable population. The cendif coverage probability was usually closer to the advertized level, if the smaller sample was not very small. However, if the sample sizes were 5 and 40, and the two populations were equally variable, then the Lehmann coverage probability was close to its advertized level, while the cendif coverage probability was over-liberal. The cendif confidence interval, in its present form, is therefore robust both to non-Normality and to unequal variability, but may be less robust to the possibility that the smaller sample size is very small. Possibilities for improvement are discussed.
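As a reminder of what is being estimated, the Hodges–Lehmann median difference is the median of all pairwise differences between values from the two groups. The Python sketch below computes only this point estimate; the function name and the direction of subtraction are assumptions, and the confidence interval formulas compared in the abstract are not reproduced.

```python
from statistics import median


def hodges_lehmann_difference(group_a, group_b):
    """Median of all pairwise differences a - b between the two samples."""
    return median(a - b for a in group_a for b in group_b)


# Hypothetical samples from two subpopulations.
print(hodges_lehmann_difference([5, 7, 9], [1, 2, 3]))
```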

Newson R, On the central role of Somers' D

Somers' D and Kendall's tau-a are parameters behind rank or nonparametric statistics, interpreted as differences between proportions. Given two bivariate data pairs (X1, Y1) and (X2, Y2), Kendall's tau-a parameter tau-XY is the difference between the probability that the two X–Y pairs are concordant and the probability that the two X–Y pairs are discordant, and Somers' D parameter DYX is the difference between the corresponding conditional probabilities, given that the X-values are ordered. The somersd package computes confidence intervals for both parameters. The Stata 9 version of somersd uses Mata to increase computing speed and greatly extends the definition of Somers' D, allowing the X and/or Y variables to be left- or right-censored and allowing multiple versions of Somers' D for multiple sampling schemes for the X–Y pairs. In particular, we may define stratified versions of Somers' D, in which we compare only X–Y pairs from the same stratum. The strata may be defined by grouping a Rubin–Rosenbaum propensity score, based on the values of multiple confounders for an association between exposure variable X and an outcome variable Y. Therefore, rank statistics can have not only confidence intervals but also confounder-adjusted confidence intervals. Usually, we either estimate DYX as a measure of the effect of X on Y, or we estimate DXY as a measure of the performance of X as a predictor of Y, compared with other predictors. Alternative rank-based measures of the effect of X on Y include the Hodges–Lehmann median difference and the Theil–Sen median slope, both of which are defined in terms of Somers' D.
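The two definitions translate directly into sample statistics. The Python sketch below is a naive all-pairs illustration with hypothetical function names, not the somersd implementation: it handles neither censoring, strata, nor confidence intervals. It uses the identity that DYX equals tau-XY divided by tau-XX, the proportion of pairs whose X-values are ordered.

```python
from itertools import combinations


def sign(value):
    return (value > 0) - (value < 0)


def kendall_tau_a(x, y):
    """Proportion of concordant pairs minus proportion of discordant pairs."""
    pairs = list(combinations(range(len(x)), 2))
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j]) for i, j in pairs)
    return total / len(pairs)


def somers_d(y, x):
    """Somers' D of y with respect to x (DYX): tau-XY divided by tau-XX."""
    return kendall_tau_a(x, y) / kendall_tau_a(x, x)


# Hypothetical data: x is a binary exposure, y a continuous outcome.
x = [0, 0, 1, 1]
y = [1, 2, 3, 4]
print(kendall_tau_a(x, y), somers_d(y, x), somers_d(x, y))
```

With a binary X, as here, DYX is the difference between the probability that a unit from the second group has the higher Y-value and the probability that it has the lower one.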

Newson R, Generalized confidence interval plots using commands or dialogs

Confidence intervals may be presented as publication-ready tables or as presentation-ready plots. -eclplot- produces plots of estimates and confidence intervals. It inputs a dataset (or resultsset) with one observation per parameter and variables containing estimates, lower and upper confidence limits, and a fourth variable, against which the confidence intervals are plotted. This resultsset can be used for producing both plots and tables, and may be generated using a spreadsheet or using -statsby-, -postfile- or the unofficial Stata -parmest- package. Currently, -eclplot- offers 7 plot types for the estimates and 8 plot types for the confidence intervals, each corresponding to a -graph twoway- subcommand. These plot types can be combined to produce 56 combined plot types, some of which are more useful than others, and all of which can be either horizontal or vertical. -eclplot- has a -plot()- option, allowing the user to superimpose other plots to add features such as stars for P-values. -eclplot- can be used either by typing a command, which may have multiple lines and sub-suboptions, or by using a dialog, which generates the command for users not fluent in the Stata graphics language.

Newson R, Multiple test procedures and smile plots

Scientists often have good reasons for wanting to calculate multiple confidence intervals and/or p-values, especially when scanning a genome. However, if we do this, then the probability of not observing at least one "significant" difference tends to fall, even if all null hypotheses are true. A skeptical public will rightly ask whether a difference is "significant" when considered as one of a large number of parameters estimated. This presentation demonstrates some solutions to this problem, using the unofficial Stata packages parmest and smileplot. The parmest package allows the calculation of Bonferroni-corrected or Sidak-corrected confidence intervals for multiple estimated parameters. The smileplot package contains two programs, multproc (which carries out multiple test procedures) and smileplot (which presents their results graphically by plotting the p-value on a reverse log scale on the vertical axis against the parameter estimate on the horizontal axis). A multiple test procedure takes, as input, a set of estimates and p-values, and rejects a subset (possibly empty) of the null hypotheses corresponding to these p-values. Multiple test procedures have traditionally controlled the family-wise error rate (FWER), typically enabling the user to be 95% confident that all the rejected null hypotheses are false, and that all the corresponding "discoveries" are real. The price of this confidence is that the power to detect a difference of a given size tends to zero as the number of measured parameters becomes large. Therefore, recent work has concentrated on procedures that control the false discovery rate (FDR), such as the Simes procedure and the Yekutieli-Benjamini procedure. FDR-controlling proced

Newson R, Team TALSPACS, Multiple-test procedures and smile plots, *Stata Journal*, Vol: 3, Pages: 109-132

multproc carries out multiple-test procedures, taking as input a list of p-values and an uncorrected critical p-value, and calculating a corrected overall critical p-value for rejection of null hypotheses. These procedures define a confidence region for a set-valued parameter, namely the set of null hypotheses that are true. They aim to control either the family-wise error rate (FWER) or the false discovery rate (FDR) at a level no greater than the uncorrected critical p-value. smileplot calls multproc and then creates a smile plot, with data points corresponding to estimated parameters, the p-values (on a reverse log scale) on the y-axis, and the parameter estimates (or another variable) on the x-axis. There are y-axis reference lines at the uncorrected and corrected overall critical p-values. The reference line for the corrected overall critical p-value, known as the parapet line, is an informal "upper confidence limit" for the set of null hypotheses that are true and defines a boundary between data mining and data dredging. A smile plot summarizes a set of multiple analyses just as a Cochrane forest plot summarizes a meta-analysis. Copyright 2003 by Stata Corporation.
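The step from an uncorrected to a corrected overall critical p-value can be sketched in Python for four well-known procedures. This is an illustration under stated assumptions, not the multproc code: the function name and method labels are hypothetical, and returning the smallest threshold when a step procedure rejects nothing is a choice made here so that "reject every p-value at or below the returned value" always reproduces the procedure's decision.

```python
def corrected_critical_pvalue(pvalues, alpha=0.05, method="bonferroni"):
    """Corrected overall critical p-value for a list of p-values.

    Null hypotheses with p-values at or below the returned value are rejected.
    """
    m = len(pvalues)
    ordered = sorted(pvalues)
    if method == "bonferroni":  # one-step, controls the FWER
        return alpha / m
    if method == "sidak":  # one-step, controls the FWER for independent tests
        return 1 - (1 - alpha) ** (1 / m)
    if method == "holm":  # step-down, controls the FWER
        critical = alpha / m
        for i, p in enumerate(ordered, start=1):
            threshold = alpha / (m - i + 1)
            if p > threshold:
                break  # stop at the first p-value that fails its threshold
            critical = threshold
        return critical
    if method == "simes":  # step-up, controls the FDR
        critical = alpha / m
        for i, p in enumerate(ordered, start=1):
            if p <= alpha * i / m:
                critical = alpha * i / m  # keep the largest threshold that is met
        return critical
    raise ValueError(f"unknown method: {method}")


pvals = [0.001, 0.02, 0.03, 0.5]  # hypothetical p-values
for name in ("bonferroni", "sidak", "holm", "simes"):
    print(name, corrected_critical_pvalue(pvals, 0.05, name))
```

On a smile plot, the returned value would be the height of the parapet line, with the uncorrected critical p-value drawn as a second reference line.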

Newson R, RGLM: Stata module to estimate robust generalized linear models

rglm fits generalized linear models and calculates a Huber (sandwich) estimate of the variance-covariance matrix of estimates. It can be used alone or called without arguments after a previous call to glm. As with other "robust" commands, the units may be considered to fall into clusters. This version was posted on 28 February 1999.

Newson R, Resultssets, resultsspreadsheets, and resultsplots in Stata

Most Stata users make their living producing results in a form accessible to end users. Most of these end users cannot immediately understand Stata logs. However, they can understand tables (in paper, PDF, HTML, spreadsheet, or word processor documents) and plots (produced by using Stata or non-Stata software). Tables are produced by Stata as resultsspreadsheets, and plots are produced by Stata as resultsplots. Sometimes (but not always), resultsspreadsheets and resultsplots are produced using resultssets. Resultssets, resultsspreadsheets and resultsplots are all produced, directly or indirectly, as output by Stata commands. A resultsset is a Stata dataset, which is a table whose rows are Stata observations and whose columns are Stata variables. A resultsspreadsheet is a table in generic text format, conforming to a TeX or HTML convention, or to another convention with a column separator string and possibly left and right row delimiter strings. A resultsplot is a plot produced as output, using a resultsset or a resultsspreadsheet as input. Resultsset-producing programs include statsby, parmby, parmest, collapse, contract, xcollapse, and xcontract. Resultsspreadsheet-producing programs include outsheet, listtex, estout, and estimates table. Resultsplot-producing programs include eclplot and smileplot. There are two main approaches (or dogmas) for generating resultsspreadsheets and resultsplots. The resultsset-centered dogma is followed by parmest and parmby users and states: “Datasets make resultssets, which make resultsplots and resultsspreadsheets”. The resultsspreadsheet-centered dogma is followed by estout and estimates table users and states: “Datasets make resultsspreadsheets, which make resultssets, which make resultsplots”. The two dogmas are complementary, and each dogma has its advantages and disadvantages. The resultsspreadsheet dogma is much easier for the casual user to learn to apply in a hurry and is therefore probably preferred

Newson R, Splines with parameters that can be explained in words to non-mathematicians

This contribution is based on my programs bspline and frencurv, which are used to generate bases for Schoenberg B-splines and splines parameterized by their values at reference points on the X-axis (presented in STB-57 as insert sg151). The program frencurv ("French curve") makes it possible for the user to fit a model containing a spline, whose parameters are simply values of the spline at reference points on the X-axis. For instance, if I am modeling a time series of daily hospital asthma admissions counts to assess the effect of acute pollution episodes, I might use a spline to model the long-term time trend (typically a gradual long-term increase superimposed on a seasonal cycle), and include extra parameters representing the short-term increases following pollution episodes. The parameters of the spline, as presented with confidence intervals, might then be the levels of hospital admissions, on the first day of each month, expected in the absence of pollution. The spline would then be a way of interpolating expected pollution-free values for the other days of the month. The advantage of presenting splines in this way is that the spline parameters can be explained in words to a non-mathematician (e.g., a medic), which is not easy with other parameterizations used for splines.

Newson R, Creating plots and tables of estimation results using parmest and friends

Statisticians make their living mostly by producing confidence intervals and p-values. However, those supplied in the Stata log are not in any fit state to be delivered to the end user, who usually at least wants them tabulated and formatted, and may appreciate them even more if they are plotted on a graph for immediate impact. The parmest package was developed to make this easy, and consists of two programs. These are parmest, which converts the latest estimation results to a data set with one observation per estimated parameter and data on confidence intervals, p-values and other estimation results, and parmby, a "quasi-byable" front end to parmest, which is like statsby, but creates a data set with one observation per parameter per by-group instead of a data set with one observation per by-group. The parmest package can be used together with a team of other Stata programs to produce a wide range of tables and plots of confidence intervals and p-values. The programs descsave and factext can be used with parmby to create plots of confidence intervals against values of a categorical factor included in the fitted model, using dummy variables produced by xi or tabulate. The user may easily fit multiple models, produce a parmby output data set for each one, and concatenate these output data sets using the program dsconcat to produce a combined data set, which can then be used to produce tables or plots involving parameters from all the models. For instance, the user might tabulate or plot unadjusted and adjusted regression parameters side by side, together with their confidence limits and/or p-values. The parmest team is particularly useful when dealing with large volumes of results derived from multiple multi-parameter mode

Newson R, Stata tip 13: generate and replace use the current sort order, *Stata Journal*, Vol: 4, Pages: 484-485

Newson R, Generalized power calculations for generalized linear models and more, *Stata Journal*, Vol: 4, Pages: 379-401

The powercal command can compute any one of the five quantities involved in power calculations from the other four. These quantities are power, significance level, detectable difference, sample number, and the standard deviation (SD) of the influence function, which is equal to the standard error multiplied by the square root of the sample number. powercal can take arbitrary expressions (involving constants, scalars, or variables) as input and calculate the output as a new variable. The user can therefore plot input variables against output variables, and this often communicates the tradeoffs involved better than a point calculation as output by the sampsi command. General formulas are given for calculating the SD of the influence function when the detectable difference is a linear combination of link functions of subpopulation means for an outcome variable distributed according to a generalized linear model (GLM). This general case includes a very broad range of special cases, where the parameters to be estimated are differences between subpopulation proportions, arithmetic means and algebraic means, or ratios between subpopulation proportions, arithmetic means, geometric means, and odds. However, powercal is not limited to GLMs and can even be used with rank methods. Copyright 2004 by StataCorp LP.
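The relationship linking the five quantities can be illustrated in Python. The sketch below is not the powercal code: the function names are hypothetical, and it assumes a two-sided test with a Normal approximation, under which the detectable difference equals the sum of the two Normal quantiles multiplied by the SD of the influence function and divided by the square root of the sample number.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def detectable_difference(power, alpha, n, sdinf):
    """Difference detectable with the given power (two-sided test, Normal approximation)."""
    return (_NORMAL.inv_cdf(1 - alpha / 2) + _NORMAL.inv_cdf(power)) * sdinf / sqrt(n)


def achieved_power(delta, alpha, n, sdinf):
    """Power to detect difference delta, solving the same equation for power."""
    return _NORMAL.cdf(delta * sqrt(n) / sdinf - _NORMAL.inv_cdf(1 - alpha / 2))


def sample_number(delta, power, alpha, sdinf):
    """Sample number needed (not rounded), solving the same equation for n."""
    return ((_NORMAL.inv_cdf(1 - alpha / 2) + _NORMAL.inv_cdf(power)) * sdinf / delta) ** 2


# Hypothetical inputs: 90% power, 5% significance level, n = 100, SD of influence function 2.
delta = detectable_difference(0.9, 0.05, 100, 2.0)
print(delta, achieved_power(delta, 0.05, 100, 2.0), sample_number(delta, 0.9, 0.05, 2.0))
```

Evaluating one of these functions over a range of inputs gives the kind of input-against-output curve that the abstract recommends over a single point calculation.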

Newson R, Stata tip 5: Ensuring programs preserve dataset sort order, *Stata Journal*, Vol: 4, Pages: 94-94

Newson R, Stata tip 1: The eform() option of regress, *Stata Journal*, Vol: 3, Pages: 445-445

Newson R, Confidence intervals and p-values for delivery to the end user, *Stata Journal*, Vol: 3, Pages: 245-269

Statisticians make their living producing confidence intervals and p-values. However, those in the Stata log are not ready for delivery to the end user, who usually wants to see statistical output either as a plot or as a table. This article describes a suite of programs used to convert Stata results to one or other of these forms. The eclplot package creates plots of estimates with confidence intervals, and the listtex package outputs a Stata dataset in the form of table rows that can be inserted into a plain TeX, LaTeX, HTML, or word processor table. To create a Stata dataset that can be output in these ways, we can use the parmest, dsconcat, and lincomest packages to create datasets with one observation per estimated parameter; the sencode, tostring, ingap, and reshape packages to process these datasets into a form ready to be output; and the descsave and factext packages to reconstruct, in the output dataset, categorical predictor variables represented by dummy variables in regression models. Copyright 2003 by StataCorp LP.

Newson R, Review of Generalized Linear Models and Extensions by Hardin and Hilbe, *Stata Journal*, Vol: 1, Pages: 98-100

The new book Hardin and Hilbe (Stata Press, 2001) is reviewed. Copyright 2001 by Stata Corporation.

Newson R, PARMEST: Stata module to create new data set with one observation per parameter of most recent model

The parmest package has 4 modules: parmest, parmby, parmcip and metaparm. parmest creates an output dataset, with 1 observation per parameter of the most recent estimation results, and variables corresponding to parameter names, estimates, standard errors, z- or t-test statistics, P-values, confidence limits and other parameter attributes. parmby is a quasi-byable extension to parmest, which calls an estimation command, and creates a new dataset, with 1 observation per parameter if the by() option is unspecified, or 1 observation per parameter per by-group if the by() option is specified. parmcip inputs variables containing estimates, standard errors and (optionally) degrees of freedom, and computes new variables containing confidence intervals and P-values. metaparm inputs a parmest-type dataset with 1 observation for each of a set of independently-estimated parameters, and outputs a dataset with 1 observation for each of a set of linear combinations of these parameters, with confidence intervals and P-values, as for a meta-analysis. The output datasets created by parmest, parmby or metaparm may be listed to the Stata log and/or saved to a file and/or retained in memory (overwriting any pre-existing dataset). The confidence intervals, P-values and other parameter attributes in the dataset may be listed and/or plotted and/or tabulated.

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.