Seemingly Unrelated Regression (SUR/SURE)
-----------------------------------------

Seemingly unrelated regression is a system regression estimator which
jointly estimates multiple models. This allows joint hypothesis testing
of parameters across models since the parameter covariance is robust to
correlation of residuals across models. It can also lead to more precise
parameter estimates if some residuals are conditionally homoskedastic
and regressors differ across equations.

Basic Notation
~~~~~~~~~~~~~~

Assume :math:`K` series are observed for :math:`N` time periods. Denote
the stacked observations of series :math:`i` as :math:`Y_{i}` and its
corresponding regressor matrix :math:`X_{i}`. The complete model
spanning all series can be specified as

.. math::

   \begin{aligned}
   Y & =X\beta+\epsilon\\
   \left[\begin{array}{c}
   Y_{1}\\
   Y_{2}\\
   \vdots\\
   Y_{K}
   \end{array}\right] & =\left[\begin{array}{cccc}
   X_{1} & 0 & 0 & 0\\
   0 & X_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & X_{K}
   \end{array}\right]\left[\begin{array}{c}
   \beta_{1}\\
   \beta_{2}\\
   \vdots\\
   \beta_{K}
   \end{array}\right]+\left[\begin{array}{c}
   \epsilon_{1}\\
   \epsilon_{2}\\
   \vdots\\
   \epsilon_{K}
   \end{array}\right].\end{aligned}

The OLS estimator of :math:`\beta` is then

.. math:: \hat{\beta}=\left(X^{\prime}X\right)^{-1}X^{\prime}Y.

A GLS estimator can be similarly defined as

.. math:: \hat{\beta}_{GLS}=\left(X^{\prime}\Omega^{-1}X\right)^{-1}X^{\prime}\Omega^{-1}Y

where :math:`\Omega=\Sigma\otimes I_{N}` is the joint covariance of the
residuals. In practice :math:`\Sigma` is not known as so a feasible GLS
(FGLS) is implemented in two steps. The first uses OLS to estimate
:math:`\hat{\epsilon}` and then estimates the residual covariance as

.. math::

   \hat{\Sigma}=N^{-1}\left[\begin{array}{cccc}
   \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right]^{\prime}\left[\begin{array}{cccc}
   \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right].

The feasible GLS estimator is then

.. math:: \hat{\beta}_{FGLS}=\left(X^{\prime}\hat{\Omega}^{-1}X\right)^{-1}X^{\prime}\hat{\Omega}^{-1}Y

where :math:`\hat{\Omega}=\hat{\Sigma}\otimes I_{N}`.

Covariance Estimation
---------------------

**Homoskedastic Data**

When residuals are assumed to be homoskedastic, the covariance can be
consistently estimated by

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}\left(X^{\prime}\Delta^{-1}\hat{\Omega}\Delta^{-1}X\right)\left(X^{\prime}\Delta^{-1}X\right)^{-1}

where :math:`\Delta` is the weighting matrix used in the parameter
estimator. For example, in OLS :math:`\Delta=I_{NK}` while in
FGLS\ :math:`\Delta=\hat{\Omega}`. The estimator supports using FGLS
with an assumption that :math:`\hat{\Sigma}` is diagonal or using a
user-specified value of :math:`\Sigma`. When the FGLS estimator is used,
this simplifies to

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}

**Heteroskedastic Data**

When residuals are heteroskedastic, the covariance can be consistently
estimated by

.. math:: \left(X^{\prime}\Delta^{-1}X\right)^{-1}\hat{S}\left(X^{\prime}\Delta^{-1}X\right)^{-1}

where :math:`\hat{S}` is a covariance estimator that is clustered by
time. Define the matrix of scores as

.. math::

   \hat{g}=\left[\begin{array}{c}
   \hat{g}_{1}\\
   \hat{g}_{2}\\
   \vdots\\
   \hat{g}_{N}
   \end{array}\right]=\Delta^{-\frac{1}{2}}X\odot\left(\hat{\epsilon}\iota_{M}^{\prime}\right)

where :math:`\iota_{M}` is a vector of ones with :math:`M` elements
where :math:`M` is the number of columns in :math:`X` and :math:`\odot`
is element by element multiplication. The clustered-by-time score
covariance estimator is then

.. math:: \hat{S}=N^{-1}\sum_{i=1}^{N}\Psi_{i}^{\prime}\Psi_{i}

where

.. math:: \Psi_{i}=\sum_{j=1}^{K}\hat{g}_{ij}

is the sum of the scores across models :math:`\left(j\right)` that have
the same observation index :math:`i`. This estimator allows arbitrary
correlation of residuals with the same observation index.

**Debiased Estimators**

When the debiased flag is set, a small sample adjustment if applied so
that element :math:`ij` of :math:`\hat{\Sigma}` is scaled by

.. math:: \frac{T}{\sqrt{\left(T-P_{i}\right)\left(T-P_{j}\right)}}.

Other Statistics
~~~~~~~~~~~~~~~~

**Goodness of fit**

The reported :math:`R^{2}` is always for the data, or weighted data is
weights are used, for either OLS or GLS. The means that the reported
:math:`R^{2}` for the GLS estimator may be negative

**F statistics**

When the debiased covariance estimator is used (small sample adjustment)
the reported :math:`F` statistics use :math:`K\left(T-\bar{P}\right)`
where :math:`\bar{P}=K^{-1}\sum_{i=1}^{K}P_{i}` where :math:`P_{i}` is
the number of variables including the constant in model :math:`i`. When
models include restrictions it may be the case that the covariance is
singular. When this occurs, the :math:`F` statistic cannot be
calculated.

Memory efficient calculations
-----------------------------

The parameter estimates in the SUR model can are

.. math:: \left(X^{\prime}\Omega^{-1}X\right)^{-1}\left(X^{\prime}\Omega^{-1}Y\right)^{-1}

where

.. math::

   \Omega^{-1}=\left[\begin{array}{cccc}
   \sigma^{11}I_{T} & \sigma^{12}I_{T} & \ldots & \sigma^{1k}I_{T}\\
   \sigma^{21}I_{T} & \sigma^{22}I_{T} & \dots & \sigma^{2k}I_{T}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}I_{T} & \sigma^{k2}I_{T} & \dots & \sigma^{kk}I_{T}
   \end{array}\right]

where :math:`\sigma^{ij}=\left[\Sigma^{-1}\right]_{ij}`. Since :math:`X`
is block diagonal,

.. math::

   X^{\prime}\Omega^{-1}=\left[\begin{array}{cccc}
   \sigma^{11}X_{1}^{\prime} & \sigma^{12}X_{1}^{\prime} & \ldots & \sigma^{1k}X_{1}^{\prime}\\
   \sigma^{21}X_{2}^{\prime} & \sigma^{22}X_{2}^{\prime} & \dots & \sigma^{2k}X_{2}^{\prime}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}X_{k}^{\prime} & \sigma^{k2}X_{k}^{\prime} & \dots & \sigma^{kk}X_{k}^{\prime}
   \end{array}\right]

and

.. math::

   X^{\prime}\Omega^{-1}X=\left[\begin{array}{cccc}
   \sigma^{11}X_{1}^{\prime}X_{1} & \sigma^{12}X_{1}^{\prime}X_{2} & \ldots & \sigma^{1k}X_{1}^{\prime}X_{k}\\
   \sigma^{21}X_{2}^{\prime}X_{1} & \sigma^{22}X_{2}^{\prime}X_{2} & \dots & \sigma^{2k}X_{2}^{\prime}X_{k}\\
   \vdots & \vdots & \ddots & \vdots\\
   \sigma^{k2}X_{k}^{\prime}X_{1} & \sigma^{k2}X_{k}^{\prime}X_{2} & \dots & \sigma^{kk}X_{k}^{\prime}X_{k}
   \end{array}\right]

Similarly

.. math::

   X^{\prime}\Omega^{-1}Y=\left[\begin{array}{c}
   \sigma^{11}X_{1}^{\prime}Y_{1}+\sigma^{12}X_{1}^{\prime}Y_{2}+\ldots\sigma^{1k}X_{1}^{\prime}Y_{k}\\
   \sigma^{21}X_{2}^{\prime}Y_{1}+\sigma^{22}X_{2}^{\prime}Y_{2}+\dots+\sigma^{2k}X_{2}^{\prime}Y_{k}\\
   \vdots\\
   \sigma^{k2}X_{k}^{\prime}Y_{1}+\sigma^{k2}X_{k}^{\prime}Y_{2}+\dots+\sigma^{kk}X_{k}^{\prime}Y_{k}
   \end{array}\right]

This suggests that constructing the high dimensional :math:`\Omega^{-1}`
can be avoided and many needless multiplications can be avoided by
directly computing these two components and the solution can be found
using ``solve``.

Three Stage Least Squares (3SLS)
--------------------------------

Three-stage least squares extends SUR to the case of endogenous
variables. It is the system extension of two-stage least squares (2SLS).
Like SUR, 3SLS jointly estimates multiple models which allows joint
hypothesis testing of parameters. It can also lead to more precise
parameter estimates if some residuals are conditionally homoskedastic
and regressors differ across equations.

.. _basic-notation-1:

Basic Notation
~~~~~~~~~~~~~~

Assume :math:`K` series are observed for :math:`N` time periods. Denote
the stacked observations of series :math:`i` as :math:`Y_{i}` and its
corresponding regressor matrix :math:`X_{i}`. The complete model
spanning all series can be specified as

.. math::

   \begin{aligned}
   Y & =X\beta+\epsilon\\
   \left[\begin{array}{c}
   Y_{1}\\
   Y_{2}\\
   \vdots\\
   Y_{K}
   \end{array}\right] & =\left[\begin{array}{cccc}
   X_{1} & 0 & 0 & 0\\
   0 & X_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & X_{K}
   \end{array}\right]\left[\begin{array}{c}
   \beta_{1}\\
   \beta_{2}\\
   \vdots\\
   \beta_{K}
   \end{array}\right]+\left[\begin{array}{c}
   \epsilon_{1}\\
   \epsilon_{2}\\
   \vdots\\
   \epsilon_{K}
   \end{array}\right].\end{aligned}

where :math:`X_{i}` contains both exogenous and endogenous variables.
For each equation denote the available instruments as :math:`Z_{i}`. The
instrument include both exogenous regressors and excluded instrumental
variables. Define the fitted values of the exogenous variables as

.. math:: \hat{X}_{i}=Z_{i}\left(Z_{i}^{\prime}Z_{i}\right)^{-1}Z_{i}^{\prime}X_{i}.

The IV estimator of :math:`\beta` is then

.. math:: \hat{\beta}_{IV}=\left(\hat{X}^{\prime}\hat{X}\right)^{-1}\hat{X}^{\prime}Y

where

.. math::

   \hat{X}=\left[\begin{array}{cccc}
   \hat{X}_{1} & 0 & 0 & 0\\
   0 & \hat{X}_{2} & 0 & 0\\
   \vdots & \vdots & \ddots & \vdots\\
   0 & 0 & 0 & \hat{X}_{N}
   \end{array}\right].

A GLS estimator can be similarly defined as

.. math:: \hat{\beta}_{GLS}=\left(\hat{X}^{\prime}\Omega^{-1}\hat{X}\right)^{-1}\hat{X}^{\prime}\Omega^{-1}Y.

Inference is identical to the SUR replacing :math:`X` with
:math:`\hat{X}` where model residuals are estimated using :math:`X` and
:math:`Y`.

System Generalized Method of Moments (GMM)
------------------------------------------

The system GMM estimator uses a set of moment conditions of the form

.. math::

   \begin{aligned}
   E\left[g_{i}\right] & =0,i=1,\ldots,T\\
   E\left[\begin{array}{c}
   g_{1i}\\
   g_{2i}\\
   \vdots\\
   g_{Ki}
   \end{array}\right] & =0\end{aligned}

where :math:`g_{ji}=z_{ji}^{\prime}\left(y_{ji}-x_{ji}\beta\right)`
where :math:`j` is the equation index, :math:`z_{ji}` is a set of
instruments for equation :math:`j` at time period i and :math:`x_{ji}`
and :math:`y_{ji}` are similarly defined. Define

.. math:: g_{N}\left(\beta\right)=N^{-1}\sum_{i=1}^{N}g_{i}\left(\beta\right),

then the parameters are estimated to minimize

.. math:: \min_{\beta}g_{N}\left(\beta\right)^{\prime}W^{-1}g_{N}\left(\beta\right).

The solution to this minimization problem is

.. math:: \hat{\beta}=\left(X^{\prime}ZW^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}ZW^{-1}Z^{\prime}Y\right).

:math:`W` is the moment weighting matrix and for efficiency should be
set to the covariance of the moment conditions. In practice a 2-step
estimator is required since estimators of :math:`W` depend on
:math:`\beta`. In the first step, :math:`W=N^{-1}Z^{\prime}Z` is used to
construct an initial estimate of :math:`\hat{\beta}`. This would be
efficient if residuals were homoskedastic and uncorrelated. This choice
will produce estimates that are identical to system 2SLS. Using the
initial parameter estimated, the covariance of the moment conditions is
computed used one of two forms using the estimated model residuals
:math:`\hat{\epsilon}`.

Homoskedastic Weighting
~~~~~~~~~~~~~~~~~~~~~~~

When residuals are assumed to be conditionally homoskedastic, then

.. math:: \hat{W}=N^{-1}Z^{\prime}\left(\hat{\Sigma}\otimes I_{N}\right)Z

where :math:`\hat{\Sigma}`\ is the :math:`K` by :math:`K` covariance of
the model residuals. This specification assumes that residuals are not
correlated across observation index (only across models)

Heteroskedastic Weighting
~~~~~~~~~~~~~~~~~~~~~~~~~

When residuals are heteroskedastic, the weighting matrix is estimated
from

.. math:: \hat{W}=N^{-1}\sum\hat{g}_{i}\hat{g}_{i}^{\prime}

which allows cross-sectional dependence but no dependence across
observation index.

The efficient estimates are computed using the same expression only with
:math:`\hat{W}`,

.. math:: \hat{\beta}=\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}X\right)^{-1}\left(X^{\prime}Z\hat{W}^{-1}Z^{\prime}Y\right).

Both weighting estimators can be centered which will enforce an exact
mean 0 to the moment conditions. When models are over-identified, this
can improve accuracy of 2-step estimators.

Parameter Covariance
~~~~~~~~~~~~~~~~~~~~

Covariance estimators are available for either homoskedastic or
heteroskedastic data. The only difference is in how the covariance of
the moments is computed. In either case, the parameter covariance can be
estimated by

.. math:: \widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\hat{\Omega}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.

The covariance of the scores :math:`\hat{\Omega}` estimated using either
the homoskedastic weighting formula or the heteroskedastic weighting
formula immediately above. This allow the choice of weighting matrix to
be set separately from the score covariance estimator, although in most
cases these should be the same, and so the covariance of the estimated
parameters will simplify to

.. math:: \widehat{Var\left(\hat{\beta}\right)}=N^{-1}\left(\frac{X^{\prime}Z}{N}\hat{W}^{-1}\frac{Z^{\prime}X}{N}\right)^{-1}.

Testing Covariance and Correlations
-----------------------------------

Two tests are available to test whether the residual covariance is
diagonal. These are useful diagnostics when considering GLS estimation.
If the tests reject the null, then the data suggest that GLS estimation
should improve efficiency as long as the regressors are not all common.
If the null is not rejected, then the covariance is not statistically
different from a diagonal covariance and there are unlikely to be gains
to using GLS. The Breusch-Pagan test directly examines the correlations
of the residuals, and is defined as

.. math:: N\left(\sum_{i=1}^{K}\sum_{j=i+1}^{K}\hat{\rho}\right)\sim\chi_{K\left(K-1\right)/2}^{2}.

The likelihood ratio is defined as the difference between the log
determinants of a diagonal covariance matrix and the full unrestricted
covariance matrix,

.. math:: N\left(\sum_{i=1}^{K}\ln\hat{\sigma}_{i}^{2}-\ln\left|\hat{\Sigma}\right|\right)=N\left(\sum_{i=1}^{K}\ln\left|\hat{\Sigma}\odot I_{K}\right|-\ln\left|\hat{\Sigma}\right|\right)\sim\chi_{K\left(K-1\right)/2}^{2}.

The asymptotic distribution of the likelihood ratio test requires
homoskedasticity.

System Measures of Fit (:math:`R^{2}`)
--------------------------------------

Most measures of fit for systems of equations assume that all equations
contains a constant (or equivalent). Caution is needed when interpreting
if equations exclude constant terms.

Overall :math:`R^{2}`
~~~~~~~~~~~~~~~~~~~~~

The overall :math:`R^{2}` is defined as

.. math:: R^{2}=1-\frac{\sum_{i=1}^{K}SSR_{i}}{\sum_{i=1}^{K}TSS_{i}}

where :math:`TSS_{i}` is centered if equation :math:`i` contains a
constant and uncentered if it does not. When all equations contain
constants, it is identical to Judge’s measure.

McElroy
~~~~~~~

McElroy’s (1977) measure is defined as

.. math:: R^{2}=1-\frac{\epsilon^{\prime}\Omega^{-1}\epsilon}{Y^{\prime}\left(\Sigma^{-1}\otimes\left(I_{N}-\frac{\iota\iota^{\prime}}{N}\right)\right)Y}

where :math:`\iota` is a :math:`N` by 1 vector of 1s. This is
implemented as

.. math:: R^{2}=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{\xi}_{ij}^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{\eta}_{ij}^{2}}

where

.. math::

   \begin{aligned}
   \hat{\xi} & =\hat{E}\hat{\Sigma}^{-\frac{1}{2}}\\
   \hat{E} & =\left[\begin{array}{cccc}
   \hat{\epsilon}_{1} & \hat{\epsilon}_{2} & \ldots & \hat{\epsilon}_{N}\end{array}\right]\end{aligned}

and

.. math::

   \begin{aligned}
   \hat{\eta} & =\tilde{Y}\hat{\Sigma}^{-\frac{1}{2}}\\
   \tilde{Y} & =\left[\begin{array}{cccc}
   Y_{1}-\hat{\mu}_{1} & Y_{2}-\hat{\mu}_{2} & \ldots & Y_{N}-\hat{\mu}_{N}\end{array}\right].\end{aligned}

where the vector of mean parameters is estimated by fitting a SURE to
the data (using user specified weights, if provided) where
:math:`X_{i}=\iota` contains only a constant. Greene provides an
alternative formulation of this measure as

.. math:: R^{2}=1-\frac{K}{\mathrm{tr}\left(\hat{\Sigma}^{-1}\hat{\Psi}\right)}

where :math:`\hat{\Psi}=N^{-1}\tilde{Y}^{\prime}\tilde{Y}` is the
covariance of the demeaned data.

Berndt
~~~~~~

Berndt’s measure is defined as

.. math:: R^{2}=1-\frac{\left|\hat{\Sigma}\right|}{\left|\hat{\Psi}\right|}.

Judge
~~~~~

Judge’s measure is the naive OLS :math:`R^{2}` for the system,

.. math:: R^{2}=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{K}\hat{E}_{ij}^{2}}{\sum_{i=1}^{N}\sum_{j=1}^{K}\tilde{Y}_{ij}^{2}}.

Dhrymes
~~~~~~~

Dhrymes’ measure of fit is a weighted average of the :math:`R^{2}` of
each equation,

.. math:: R^{2}=\sum_{i=1}^{K}R_{i}^{2}\frac{\hat{\Psi}_{ii}}{\mathrm{tr}\left(\hat{\Psi}\right)}

where :math:`R_{i}^{2}` is the coefficient of determination from
equation :math:`i`.
