Data transformations and forecasting models: what to use and when
Transformation Properties When to use Points to keep in mind
Deflation by CPI or another price index
  Properties: Converts data from nominal dollars (or other currency) to constant dollars; usually helps to stabilize variance
  When to use: When data are measured in nominal dollars (or other currency) and you want to explicitly show the effect of inflation--i.e., uncover "real growth"
  Points to keep in mind: To generate a true forecast for the future in nominal terms, you will need to make an explicit forecast of the future value of the price index--i.e., you will need to forecast the inflation rate (but this is easy if you're in a period of steady inflation)
Deflation at a fixed rate
  Properties: Merely applies a constant discount factor to past data
  When to use: When you only need to approximately model the effect of past inflation and/or you wish to impose an assumption about the current and future inflation rate--you can twiddle the inflation rate to see what value does the best job of flattening out the trend and/or stabilizing the variance
  Points to keep in mind: When used with a zero-trend model like simple exponential smoothing or random walk without growth, the assumed inflation rate is precisely the percentage growth in the future forecasts.
Logarithm
  Properties: Converts multiplicative patterns to additive patterns and/or linearizes exponential growth; converts absolute changes to percentage changes; often stabilizes the variance of data with compound growth, regardless of whether deflation is also used
  When to use: When compound growth is not due to inflation (e.g., when data are not measured in currency); when you do not need to separate inflation from real growth; when data distribution is positive and highly skewed (e.g., exponential or log-normal distribution); when variables are multiplicatively related
  Points to keep in mind: Logging is not the same as deflating: it linearizes growth but does not remove a general upward trend; if logged data still have a consistent upward trend, then you should use a model that includes a trend factor (e.g., random walk with growth, ARIMA, linear exponential smoothing).
First difference
  Properties: Converts "levels" to "changes"
  When to use: When you need to stationarize a series with a strong trend and/or random-walk behavior (often useful when fitting regression models to time series data)
  Points to keep in mind: Differencing is an explicit option in ARIMA modeling and it is implicitly a part of random walk and exponential smoothing models; therefore you would not manually difference the input variable (using the DIFF function) when specifying model type as "random walk" or "exponential smoothing" or "ARIMA"; first difference of LOG(Y) is approximately the percentage change in Y (expressed as a fraction, and closest when changes are small)
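
The relationship between the first difference of LOG(Y) and the percentage change in Y can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `diff` and `pct_change` and the sample series are hypothetical, chosen for illustration, and are not Statgraphics functions.

```python
import math

def diff(series):
    """First difference: change from each period to the next."""
    return [b - a for a, b in zip(series, series[1:])]

def pct_change(series):
    """Fractional change from each period to the next."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

# Hypothetical series that grows by 2%, 3%, and 2% in successive periods.
y = [100.0, 102.0, 105.06, 107.1612]

log_diffs = diff([math.log(v) for v in y])  # first difference of LOG(Y)
pct = pct_change(y)                         # fractional change in Y

for ld, p in zip(log_diffs, pct):
    print(f"diff of log: {ld:.5f}   fractional change: {p:.5f}")
```

The two columns agree closely for changes of a few percent, and the gap grows as the changes get larger. The exact relationship is that exp of the log difference, minus one, equals the fractional change.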
Seasonal difference
  Properties: Converts "levels" to "seasonal changes"
  When to use: When you need to remove the gross features of seasonality from a strongly seasonal series without going to the trouble of estimating seasonal indices
  Points to keep in mind: Seasonal differencing is an explicit option in ARIMA modeling; you MUST include a seasonal difference (as a modeling option, not an SDIFF transformation of the input variable) if the seasonal pattern is consistent and you wish it to be maintained in long-term forecasts
Seasonal adjustment
  Properties: Removes a constant seasonal pattern from a series (either multiplicative or additive)
  When to use: When you wish to separate out the seasonal component of a series and then fit what's left with a nonseasonal model (regression, smoothing, or trend line); normally use the multiplicative version unless data has been logged
  Points to keep in mind: Adds a lot of parameters to the model--one for each season of the year. (In Statgraphics, the seasonal indices are not explicitly shown in the output of the Forecasting procedure--you must separately run the Descriptive Methods procedure to display the seasonal indices.)
Model type Properties When to use Points to keep in mind
Random walk
  Properties: Predicts that "next period equals this period" (perhaps plus a constant); a.k.a. ARIMA(0,1,0) model
  When to use: As a baseline against which to compare more elaborate models; when applied to logged data, it is a "geometric" random walk--the default model for stock market data
  Points to keep in mind: Plot of forecasts looks exactly like a plot of the data, except lagged by one period (and shifted slightly up or down if a growth term is included); long-term forecasts follow a straight line (horizontal if no growth term is included); confidence intervals for long-term forecasts widen according to a square-root law (sideways-parabola shape); logically equivalent to MEAN model fitted to DIFF(Y)
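
The straight-line forecasts and the square-root law for the confidence intervals can be illustrated with a short sketch. This is a simplified illustration, not the Statgraphics procedure: the function name `random_walk_forecast` and the sample data are hypothetical, the 1.96 multiplier assumes roughly normal errors for an approximate 95% interval, and the extra uncertainty from estimating the growth term is ignored.

```python
import math

def random_walk_forecast(y, horizon, with_growth=True, z=1.96):
    """Return (point, lower, upper) tuples for 1..horizon steps ahead."""
    diffs = [b - a for a, b in zip(y, y[1:])]  # DIFF(Y)
    n = len(diffs)
    # The growth term is the mean of the differences (MEAN model on DIFF(Y)).
    growth = sum(diffs) / n if with_growth else 0.0
    dof = n - 1 if with_growth else n
    # Standard deviation of the one-step-ahead errors.
    sigma = math.sqrt(sum((d - growth) ** 2 for d in diffs) / dof)
    result = []
    for k in range(1, horizon + 1):
        point = y[-1] + k * growth                # straight-line forecast
        half_width = z * sigma * math.sqrt(k)     # square-root law
        result.append((point, point - half_width, point + half_width))
    return result

y = [10.0, 12.0, 11.0, 14.0, 15.0, 15.0, 18.0]  # hypothetical data
forecasts = random_walk_forecast(y, horizon=4)
for k, (point, lower, upper) in enumerate(forecasts, start=1):
    print(f"step {k}: {point:.2f}  [{lower:.2f}, {upper:.2f}]")
```

Under these assumptions the interval at four steps ahead is exactly twice as wide as the interval at one step ahead, which is the sideways-parabola shape described above. Passing `with_growth=False` gives a horizontal forecast line at the last observed value.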
Linear trend
  Properties: Regression of Y on the time index
  When to use: Rarely the best model for forecasting--use only when you have very few data points and no obvious pattern in data other than a slight trend; can be used in conjunction with seasonal adjustment--but if you have enough data to seasonally adjust, you probably should use another model
  Points to keep in mind: Forecasts follow a straight line whose slope equals the average slope over the whole estimation period but whose intercept is anchored in the distant past; short-term forecasts therefore may miss badly and confidence intervals for long-term forecasts are usually not reliable; other models that extrapolate a linear trend into the future (random walk with growth, linear exponential smoothing, ARIMA models with 1 difference with constant or 2 differences without constant) often do a better job by "reanchoring" the trend line on recent data
Simple moving average
  Properties: Simple (equally weighted) average of recent data
  When to use: When data are in short supply and/or highly irregular
  Points to keep in mind: Primitive but relatively robust against outliers and messy data; long-term forecasts are a horizontal line extrapolated from the most recent average; a long-term trend can be incorporated via fixed-rate deflation at an assumed interest rate
Simple exponential smoothing
  Properties: Exponentially weighted average of recent data; "average age" of data in forecast (amount by which forecasts lag behind turning points) is 1/alpha; same as an ARIMA(0,1,1) model without constant
  When to use: When data are nonseasonal (or deseasonalized) and display a time-varying mean without a consistent trend
  Points to keep in mind: Long-term forecasts are a horizontal line extrapolated from the most recent smoothed value; same as a random walk model without growth if alpha=0.9999 (i.e., effectively 1); forecasts get smoother and slower to respond to turning points as alpha approaches zero; confidence intervals widen less rapidly than in the random walk model; a long-term trend can be incorporated via fixed-rate deflation at an assumed interest rate or by fitting an ARIMA(0,1,1) model with constant
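
The role of alpha can be seen in a bare-bones version of the smoothing recursion. This is a sketch under stated assumptions: the function names are hypothetical, the level is initialized at the first observation (real software may initialize differently), and alpha is supplied by hand rather than estimated.

```python
def ses_level(y, alpha):
    """Return the smoothed level after processing every value in y."""
    level = y[0]  # assumption: start the level at the first observation
    for obs in y[1:]:
        # New level = weighted average of the latest observation and old level.
        level = alpha * obs + (1 - alpha) * level
    return level

def ses_forecast(y, alpha, horizon):
    """All future forecasts equal the last smoothed level (horizontal line)."""
    return [ses_level(y, alpha)] * horizon

y = [20.0, 22.0, 21.0, 25.0, 24.0]  # hypothetical data

responsive = ses_forecast(y, alpha=0.9999, horizon=3)  # near random walk
smooth = ses_forecast(y, alpha=0.1, horizon=3)         # slow to respond
print(responsive)
print(smooth)
```

With alpha close to 1 the forecast sits essentially on the last observation, as in the random walk model without growth. With a small alpha the forecast stays closer to the older data and responds slowly to recent changes.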
Linear exponential smoothing (Brown's or Holt's)
  Properties: Assumes a time-varying linear trend as well as a time-varying level (Brown's uses 1 parameter, Holt's uses separate smoothing parameters for level and trend); essentially an ARIMA(0,2,2) model without constant
  When to use: When data are nonseasonal (or deseasonalized) and display time-varying local trends (usually applicable to data that are "smoother" in appearance--i.e., less noisy--than what would be well fitted by simple exponential smoothing)
  Points to keep in mind: Long-term forecasts follow a straight line whose slope is the estimated local trend at the end of the series; confidence intervals for long-term forecasts widen rapidly--the model assumes that the future is VERY uncertain because of time-varying trends; often does not outperform simple exponential smoothing, even for data with trends, because extrapolation of time-varying trends is risky
Seasonal random walk
  Properties: Predicts that "next period equals same period last year" (plus constant); an ARIMA(0,0,0)x(0,1,0) model with constant
  When to use: As a baseline against which to compare fancier seasonal models; as a foundation for seasonal ARIMA models (e.g., (1,0,0)x(0,1,1))
  Points to keep in mind: Long-term forecasts have same seasonal pattern as last year; long-term trend is equal to the average trend over whole past history of series; confidence intervals widen slowly; slow to respond to cyclical upturns and downturns; logically equivalent to MEAN model fitted to SDIFF(Y,s)
Seasonal random trend
  Properties: Predicts that change from this period to next period will be the same as change observed at this time last year; an ARIMA(0,1,0)x(0,1,0) model without constant
  When to use: As a baseline against which to compare fancier seasonal models; as a foundation for seasonal ARIMA models (e.g., (0,1,1)x(0,1,1) without constant)
  Points to keep in mind: Long-term forecasts have same seasonal pattern as last year; long-term trend is equal to the most recently observed annual trend; confidence intervals widen rapidly; quick to respond to cyclical upturns and downturns; logically equivalent to MEAN model fitted to DIFF(SDIFF(Y)) (with no constant--i.e., mean is assumed to be zero)
Winters' seasonal smoothing
  Properties: Assumes time-varying level, trend, and seasonal indices (either multiplicative or additive seasonality)
  When to use: When data are trended and seasonal and you wish to decompose them into local level/trend/seasonal factors; normally you use the multiplicative version unless data are logged
  Points to keep in mind: Initialization of seasonal indices and joint estimation of three smoothing parameters is sometimes tricky--watch to see that parameter estimates converge and that forecasts and confidence intervals look reasonable; a popular choice for "automatic" forecasting because it does a little of everything, but has a lot of parameters and sometimes overfits the data or is unstable
Multiple regression
  Properties: A general linear forecasting equation involving other variables
  When to use: When data are correlated with other explanatory or causal variables (e.g., price, advertising, promotions, interest rates, indicators of general economic activity, etc.); the key is to choose the right variables and the right transformations of those variables to justify the assumption of a linear model and to take into account the time dimension in the data
  Points to keep in mind: Forecasts cannot be extrapolated into the future unless and until values are available for the independent variables; for this reason the independent variables must often be lagged by one or more periods--but when only lagged variables are used, a regression model may fail to outperform a time series model which relies only on the history of the original series; regressions of nonstationary variables often have high "R-squared" but poor performance compared to time series models; it often helps to stationarize the dependent variable and/or add lags of the dependent and independent variables to the model; "automatic" model selection techniques such as stepwise regression and all-possible regressions are available, but beware of overfitting; it is important to validate the model by testing it on hold-out data and by computing its "effective R-squared" (percent of variance explained) relative to a random walk model or other appropriate time series model
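
One plausible way to compute an "effective R-squared" on hold-out data is sketched below. The formula used here, one minus the ratio of the model's mean squared error to the random walk's mean squared error, is an assumption about what the term means, and the function names and numbers are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared forecast error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def effective_r_squared(actual, model_pred, previous_actual):
    """Fraction of random-walk error variance that the model removes.

    previous_actual holds the value observed one period before each
    hold-out value, which is what a random walk without growth predicts.
    """
    mse_model = mean_squared_error(actual, model_pred)
    mse_random_walk = mean_squared_error(actual, previous_actual)
    return 1.0 - mse_model / mse_random_walk

# Hypothetical hold-out period.
actual = [105.0, 103.0, 108.0, 110.0]
previous_actual = [100.0, 105.0, 103.0, 108.0]  # random walk forecasts
model_pred = [104.0, 104.0, 107.0, 109.0]       # regression forecasts

score = effective_r_squared(actual, model_pred, previous_actual)
print(f"effective R-squared vs. random walk: {score:.3f}")
```

A value near zero (or negative) would suggest that the regression adds little beyond the random walk baseline on the hold-out data, however high its ordinary R-squared may be.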
ARIMA
  Properties: A general class of models that includes random walk, random trend, seasonal and non-seasonal exponential smoothing, and autoregressive models; forecasts for the stationarized dependent variable are a linear function of lags of the dependent variable and/or lags of the errors
  When to use: When data are relatively plentiful (4 seasons or more) and can be satisfactorily stationarized by differencing and other mathematical transformations; when it is not necessary to explicitly separate out the seasonal component (if any) in the form of seasonal indices
  Points to keep in mind: ARIMA models are designed to squeeze all autocorrelation out of the original time series; a systematic procedure exists for identifying the best ARIMA model for any given time series; features of ARIMA and multiple regression models can be combined in a natural way; ARIMA models often provide a good fit to highly aggregated, highly plentiful data; they may perform relatively less well on disaggregated, irregular, and/or sparse data