Post-parmest peripherals: fvregen, invcise, and qqvalue
Roger B. Newson
National Heart and Lung Institute, Imperial College London
The parmest package is used with Stata estimation commands to produce
output datasets (or results-sets) with one observation per estimated
parameter, and data on parameter names, estimates, confidence limits,
p-values, and other parameter attributes. These results-sets can then
be input to other Stata programs to produce tables, listings, plots, and
secondary results-sets containing derived parameters. Three recently added
packages for post-parmest processing are fvregen, invcise, and qqvalue.
fvregen is used when the parameters belong to models containing
factor variables, introduced in Stata version 11. It regenerates these
factor variables in the results-set, enabling the user to plot, list, or
tabulate factor levels with estimates and confidence limits of parameters
specific to these factor levels.
invcise calculates standard errors inversely from confidence limits
produced without standard errors, such as those for medians and for
Hodges–Lehmann median differences. These standard errors can then be
input, with the estimates, into the metaparm module of parmest to produce
confidence intervals for linear combinations of
medians or of median differences, such as those used in meta-analysis or
interaction estimation.
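The idea of recovering a standard error from confidence limits can be sketched as follows. This is only an illustration in Python, not the invcise implementation: it assumes symmetric limits based on the normal distribution, and the function names, the example limits, and the independence assumption used for the linear combination are all hypothetical.

```python
import math
from statistics import NormalDist


def se_from_limits(lower, upper, level=0.95):
    """Back out a standard error from symmetric normal-theory limits.

    Assumption: the limits are estimate -/+ z * SE, so SE = width / (2 * z).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (upper - lower) / (2.0 * z)


def se_linear_combination(coefs, ses):
    """Standard error of sum(c_i * estimate_i), assuming independent estimates."""
    return math.sqrt(sum((c * s) ** 2 for c, s in zip(coefs, ses)))


# Hypothetical median difference with 95% confidence limits 1.0 and 5.0.
se = se_from_limits(1.0, 5.0)
```

With standard errors recovered in this way for two groups, `se_linear_combination([1, -1], [se_a, se_b])` would give the standard error of the difference between the two estimates under the stated independence assumption.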
qqvalue inputs the p-values in a results-set and creates a new
variable containing the quasi-q-values, which are calculated by inverting a
multiple-test procedure designed to control the familywise error rate (FWER)
or the false discovery rate (FDR). The quasi-q-value for each p-value is the
minimum FWER or FDR for which that p-value would be in the discovery set if
the specified multiple-test procedure were used on the full set of p-values.
fvregen, invcise, qqvalue, and parmest can be downloaded from SSC.
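To make the inversion concrete, here is a minimal Python sketch for two familiar procedures: the one-step Bonferroni procedure (FWER) and the Simes step-up procedure, also known as Benjamini–Hochberg (FDR). It is not the qqvalue code; the function names and the example p-values are hypothetical, and no claim is made that these correspond to particular qqvalue options.

```python
def bonferroni_q(pvalues):
    """Smallest FWER at which each p-value is rejected by Bonferroni."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def simes_q(pvalues):
    """Smallest FDR at which each p-value is rejected by the Simes
    (Benjamini-Hochberg) step-up procedure.

    A p-value of rank i (ascending) is rejected at level a if some p-value
    of rank j >= i satisfies p_(j) <= a * j / m, so the minimum such level
    is min over j >= i of m * p_(j) / j, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, carrying the running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvalues[i] / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for four parameters.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = simes_q(example_p)
```

The results are returned in the original order of the p-values, so they can sit alongside the estimates as an extra column, much as a new variable would in a results-set.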
Additional information
A Stata program for calibration weighting
John D'Souza
National Centre for Social Research, London
Although survey data are sometimes weighted by their selection weights, it
is often preferable to use auxiliary information available on the whole
population to improve estimation. Calibration weighting (Deville and
Särndal, 1992,
Journal of the American Statistical Association 87:
376–382) is one of the most common methods of doing this. This method
adjusts the selection weights so that known population totals for the
auxiliary variables are reproduced exactly, while ensuring that the
calibrated weights are as close as possible to the original sampling weights.
The simplest example of calibration is poststratification. This is the
special case where the auxiliary variable is a single categorical variable.
General calibration extends this to deal with more than one auxiliary
variable and allows the user to include both categorical and numerical variables.
A typical example might occur in a population survey, where the selection
weights could be calibrated to ensure that the sample weighted by the
calibration weights has exactly the same distribution as the population on
variables such as age, sex, and region.
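For the poststratification special case, the adjustment amounts to rescaling the selection weights within each category so that they sum to the known population total. The Python sketch below illustrates only that special case, not the program described in this talk; the function name, the weights, and the population totals are hypothetical.

```python
from collections import defaultdict


def poststratify(weights, strata, population_totals):
    """Scale selection weights so each stratum's weights sum to its known total."""
    weighted_totals = defaultdict(float)
    for w, h in zip(weights, strata):
        weighted_totals[h] += w
    return [
        w * population_totals[h] / weighted_totals[h]
        for w, h in zip(weights, strata)
    ]


# Hypothetical sample of four respondents and known population counts by sex.
selection_weights = [10.0, 10.0, 20.0, 20.0]
sex = ["m", "m", "f", "f"]
calibrated = poststratify(selection_weights, sex, {"m": 30.0, "f": 50.0})
```

General calibration on several auxiliary variables cannot be done stratum by stratum in this way and typically needs an iterative or matrix-based solution.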
Many packages have routines for calibration. SAS has the macro CALMAR;
GenStat has the procedure SVCALIBRATE; and R has the function
calibrate. However, no such routine is publicly available in Stata. I
will introduce a user-written Stata program for calibration and will also
discuss a simple extension to show how it can incorporate a nonresponse
correction. I will also briefly discuss the program’s strengths and limitations when
compared to rival packages.
Additional information
Estimating and modeling cure within the framework of flexible parametric survival models
Therese M.-L. Andersson
Department of Medical Epidemiology and Biostatistics,
Karolinska Institutet, Stockholm
Cure models can be used to simultaneously estimate the proportion of cancer
patients who are eventually cured of their disease and the survival of
those who remain “uncured”. One limitation of parametric cure models is
that the functional form of the survival of the “uncured” has to
be specified. It can sometimes be hard to fit survival functions flexible
enough to capture high mortality rates within a few months from a diagnosis
or a high cure proportion (e.g., over 90%).
If instead the flexible parametric survival models implemented in stpm2
could be used, then these
problems could potentially be avoided. Flexible parametric survival models
are fit on the log cumulative hazard scale using restricted cubic splines
for the baseline. When cure is reached, the excess hazard rate (the
difference in the observed all-cause mortality rate among the patients
compared with that expected in the general population) is zero, and the
cumulative excess hazard is constant. By incorporating an extra constraint
on the log cumulative excess hazard after the last knot so that we force it
not only to be linear but also to have zero slope, we are able to estimate
the cure proportion. The flexible parametric survival model can be written
as a special case of a nonmixture cure model, but with a more flexible
distribution, which also enables estimation of the survival of
“uncured” patients.
We have updated the user-written stpm2 command for flexible
parametric models and added a cure option as well as postestimation
predictions of the cure proportion and survival of the “uncured”. We will
compare the use of flexible parametric cure models implemented in
stpm2 with standard parametric cure models implemented in
strsmix and
This is joint work with Sandra Eloranta and Paul W. Dickman (same
institution) and Paul C. Lambert (same institution and Department of Health
Sciences, University of Leicester).
Additional information
Simulation of “forward-backward” multiple-imputation technique in longitudinal clinical dataset
Catherine Welch
Department of Primary Care & Population Health, University College London
Most standard missing-data techniques have been designed for cross-sectional
data. A “forward-backward” multiple-imputation algorithm has
been developed to impute missing values in longitudinal data (Nevalainen,
Kenward, and Virtanen, 2009,
Statistics in Medicine 28:
3657–3669). This technique will be applied to The Health
Improvement Network (THIN), a longitudinal primary-care database, to impute
variables associated with incidence of cardiovascular disease (CVD).
A sample of 483 patients was extracted from THIN to test the performance of
the algorithm before it was applied to the whole dataset. This dataset
included individuals with information available on age, sex, deprivation
quintile, height, weight, systolic blood pressure, and total serum
cholesterol for each age from 65 to 69 years. CVD was identified if the
patient was diagnosed with one of a predefined list of conditions at any of
these ages. They were then considered to have CVD at each subsequent age.
In this sample, measurements of weight, systolic blood pressure, and
cholesterol were replaced with missing values such that the probability that
data are missing decreases as age increases; i.e., the data are missing at
random and the overall percentage of missing data is equivalent to that in
THIN. We then applied the forward-backward algorithm, which
imputes values at each time point by using measurements before and after the
one of interest and updates values sequentially. Ten complete datasets were
created. A Poisson regression was performed using data in each dataset, and
estimates were combined using Rubin’s rules. These steps were repeated 200
times and the coefficients were averaged.
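The combination step can be sketched in Python as follows. This is a generic illustration of Rubin's rules for a single coefficient, not the code used in the study; the function name and the numbers are hypothetical.

```python
import math
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine one coefficient across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical log rate ratios and variances from three imputed datasets.
pooled, total = rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
pooled_se = math.sqrt(total)
```

In a simulation such as the one described, this combination would be applied once per repetition, and the pooled coefficients would then be averaged over repetitions.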
I will explain in more detail how the forward-backward algorithm works and
also will demonstrate the results following multiple imputation using this
algorithm. I will compare these results with the analysis before data were
replaced with missing values and a complete case analysis to assess the
performance of the algorithm.
This is joint work with Irene Petersen (same institution) and James
Carpenter (Medical Statistics Unit, London School of Hygiene and Tropical Medicine).
Additional information
Thirty graphical tips Stata users should know
Nicholas J. Cox
Department of Geography, Durham University
Stata’s graphics were completely rewritten for Stata 8, with further
key additions in later versions. Its official commands have, as usual, been
supplemented by a variety of user-written programs. The resulting variety
presents even experienced users with a system that undeniably is large,
often appears complicated, and sometimes seems confusing. In this talk, I
provide a personal digest of graphics strategy and tactics for Stata users,
emphasizing details large and small that, in my view, deserve to be known by
Additional information
Mata, the missing manual
William Gould
StataCorp, College Station, Texas
Mata is Stata’s matrix programming language. StataCorp provides
detailed documentation on it, but so far has failed to give users—and
especially users who add new features to Stata—any guidance in when and
how to use the language. This talk provides what has been missing. In
practical ways, this talk shows how to include Mata code in Stata ado-files,
it reveals when to include Mata code and when not to, and it provides an
introduction to the broad concepts of Mata, the concepts that will make the
Mata Reference Manual approachable.
Additional information
Hunting for genes with longitudinal phenotype data using Stata
J. Charles Huber Jr.
Texas A&M Health Science Center School of Rural Public Health, College Station, Texas
Project Heartbeat! was a longitudinal study of metabolic and morphological
changes in adolescents aged 8–18 years and was conducted in the 1990s.
A study is currently being conducted to consider the relationship between a
collection of phenotypes (including BMI, blood pressure, and blood lipids) and
a panel of 1,500 candidate SNPs (single nucleotide polymorphisms).
Traditional genetics software such as PLINK and HelixTree lacks the ability
to model longitudinal phenotype data.
This talk will describe the use of
Stata for a longitudinal genetic association study from the early stages of
data checking (allele frequencies and Hardy–Weinberg equilibrium),
modeling of individual SNPs, the use of false discovery rates to control
for the large number of comparisons, exporting and importing data
through PHASE for haplotype reconstruction, selection of tagSNPs in Stata,
and the analysis of haplotypes. We will also discuss strategies for scaling
up to an Illumina 100k SNP chip using Stata. All SNP and gene names will be
de-identified, because this is a work in progress.
This is joint work with Michael Hallman, Ron Harrist, Victoria Friedel,
Melissa Richard, and Huandong Sun (same institution).
Additional information
Haplotype analysis of case–control data using haplologit: New features
Yulia Marchenko
StataCorp, College Station, Texas
In haplotype-association studies, the risk of a disease is often determined
not only by the presence of certain haplotypes but also by their
interactions with various environmental factors. The detection of such
interactions with case–control data is a challenging task and often requires
very large samples. This prompted the development of more efficient
estimation methods for analyzing case–control genetic data. The
haplologit command implements efficient semiparametric methods, recently
proposed in the literature, for fitting haplotype-environment models in the
very important special cases of 1) a rare disease, 2) a single candidate
gene in Hardy–Weinberg equilibrium, and 3) independence of genetic and
environmental factors. In this presentation, I will describe new
features of the haplologit command.
Additional information
DIY fractional polynomials
Patrick Royston
MRC Clinical Trials Unit, London
Fractional polynomial models are a simple yet very useful extension of
ordinary polynomials. They greatly increase the available range of
nonlinear functions and are often used in regression modeling, both in
univariate format (using Stata’s fracpoly command) and in
multivariable modeling (using mfp). The standard implementation in
fracpoly supports a wide range of single-equation regression models
but cannot cope with the more complex and varied syntaxes of other types of
multi-equation models. In this talk, I show that if you are willing to do
some straightforward do-file programming, you can apply fractional
polynomials in a bespoke manner to more complex Stata regression commands
and get useful results. I illustrate the approach in multilevel modeling
of longitudinal fetal-size data using
xtmixed and in a seemingly
unrelated regression analysis of a dataset of academic achievement using
Additional information
Forecast evaluation with Stata
Robert A. Yaffee
Silver School of Social Work, New York University
Forecasters are expected to provide evaluations of their forecasts along with
their forecasts. The forecast assessments demonstrate comparative,
adequate, or optimal accuracy by common forecasting criteria to
provide acceptable credence in the forecasts. To assist the Stata user in
this process, Robert Yaffee has written Stata programs to evaluate ARIMA and
GARCH models. He explains how these assessment programs are applied to
one-step-ahead and dynamic forecasts, ex post and ex ante
forecasts, conditional and unconditional forecasts, as well as combinations
of forecasts. In his presentation, he will also demonstrate how assessment
can be applied to rolling origin forecasts of time-series models.
Additional information
An overview of meta-analysis in Stata
Jonathan Sterne
Department of Social Medicine, University of Bristol
Roger Harbord
Department of Social Medicine, University of Bristol
Ian White
MRC Biostatistics Unit, Cambridge
A comprehensive range of user-written commands for meta-analysis is
available in Stata and documented in detail in the recent book
Meta-Analysis in Stata (Sterne, ed., 2009, [Stata Press]). The purpose of this session
is to describe these commands, with a focus on recent developments and
areas in which further work is needed. We will define systematic reviews and
meta-analyses and will introduce the metan command, which is the main
Stata meta-analysis command. We will distinguish between meta-analyses of
randomized controlled trials and observational studies, and we will discuss the
additional complexities inherent in systematic reviews of the latter.
Meta-analyses are often complicated by heterogeneity, variation between the
results of different studies beyond that expected due to sampling variation
alone. Meta-regression, implemented in the metareg command, can be
used to explore reasons for heterogeneity, although its utility in medical
research is limited by the modest numbers of studies typically included in
meta-analyses and the many possible reasons for heterogeneity.
Heterogeneity is a striking feature of meta-analyses of diagnostic-test
accuracy studies. We will describe how to use the midas and metandi
commands to display and meta-analyse the results of such studies.
Many meta-analysis problems involve combining estimates of more than one
quantity: for example, treatment effects on different outcomes or contrasts
among more than two groups. Such problems can be tackled using
multivariate meta-analysis, implemented in the mvmeta command. We
will describe how the model is fit and when it may be superior to a set of
univariate meta-analyses. We will also illustrate its application in a variety of
Additional information
Evaluating one-way and two-way cluster–robust covariance matrix estimates
Christopher F. Baum
Department of Economics, Boston College, Chestnut Hill, Massachusetts
In this presentation, I update Nichols and Schaffer’s 2007 UK Stata Users
Group talk on clustered standard errors. Although cluster–robust standard
errors are now recognized as essential in a panel-data context, official
Stata only supports clusters that are nested within panels. This
requirement rules out the possibility of defining clusters in the time
dimension and modeling contemporaneous dependence of panel units’
error processes. I build upon recent analytical developments that define
two-way (and conceptually, n-way) clustering and upon the 2010
implementation of two-way clustering in the widely used ivreg2 and
xtivreg2 packages. I present examples of the utility of one-way and
two-way clustering using Monte Carlo techniques, I present a comparison with
alternative approaches to modeling error dependence, and I consider tests
for clustering of errors.
This is joint work with Mark E. Schaffer (Heriot-Watt University) and Austin
Nichols (Urban Institute).
Additional information
An introduction to matching methods for causal inference and their implementation in Stata
Barbara Sianesi
Institute for Fiscal Studies, London
Matching, especially in its propensity-score flavors, has become an
extremely popular evaluation method. Matching is, in fact, the
best-available method for selecting a matched (or reweighted) comparison
group that looks like the treatment group of interest.
In this talk, I will introduce matching methods within the general problem
of causal inference, highlight their strengths and weaknesses, and offer a
brief overview of different matching estimators. Using psmatch2, I
will then step through a practical example in Stata that is based on real
data. I will then show how to implement some of these estimators, as well as
highlight a number of implementational issues.
Additional information
Report to users, followed by Wishes and grumbles
William Gould
StataCorp, College Station, Texas
William Gould, as President of StataCorp and Chief of Development, will
report on StataCorp activity over the last year. This will morph into the
traditional voicing from the audience of users’ wishes and grumbles
regarding Stata.
Scientific organizers
Nicholas J. Cox, Durham University
Patrick Royston, MRC Clinical Trials Unit
Logistics organizers
Timberlake Consultants, the official distributor
of Stata in the United Kingdom, Brazil, Ireland, Poland, Portugal, and Spain.