# Cox proportional-hazards regression

## Description

Whereas the Kaplan-Meier method with log-rank test is useful for comparing survival curves in two or more groups, Cox proportional-hazards regression allows you to analyze the effect of several risk factors on survival simultaneously.

The instantaneous risk of reaching the endpoint (death, or any other event of interest, e.g. recurrence of disease) at time t, given survival up to that time, is called the hazard. The hazard is modeled as:

$$H(t) = H_{0}(t) \times \exp(b_{1}x_{1} + b_{2}x_{2} + \dots + b_{k}x_{k})$$

where x_{1} ... x_{k} are a collection of predictor variables and H_{0}(t) is the baseline hazard at time t, representing the hazard for a person with the value 0 for all the predictor variables.

By dividing both sides of the above equation by H_{0}(t) and taking logarithms, we obtain:

$$\ln\left(\frac{H(t)}{H_{0}(t)}\right) = b_{1}x_{1} + b_{2}x_{2} + \dots + b_{k}x_{k}$$

We call H(t) / H_{0}(t) the hazard ratio. The coefficients b_{1} ... b_{k} are estimated by Cox regression, and can be interpreted in a similar manner to those of multiple logistic regression.

Suppose the covariate (risk factor) is dichotomous and is coded 1 if present and 0 if absent. Then the quantity exp(b_{i}) can be interpreted as the instantaneous relative risk of an event, at any time, for an individual with the risk factor present compared with an individual with the risk factor absent, given both individuals are the same on all other covariates.

If the covariate is continuous, then the quantity exp(b_{i}) is the instantaneous relative risk of an event, at any time, for an individual with an increase of 1 in the value of the covariate compared with another individual, given both individuals are the same on all other covariates.
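The interpretation of exp(b) for dichotomous and continuous covariates can be illustrated with a minimal Python sketch. The coefficient values and the function name `hazard_ratio` are hypothetical and chosen only for illustration; they do not come from any fitted model.

```python
import math

def hazard_ratio(b: float, delta: float = 1.0) -> float:
    """Hazard ratio for a change of `delta` in one covariate,
    with all other covariates held equal: exp(b * delta)."""
    return math.exp(b * delta)

# Hypothetical coefficients, for illustration only.
b_risk_factor = 0.693  # dichotomous covariate coded 0 (absent) / 1 (present)
b_age = 0.05           # continuous covariate, e.g. age in years

hr_risk_factor = hazard_ratio(b_risk_factor)  # present vs absent
hr_age_1 = hazard_ratio(b_age)                # per 1-unit increase
hr_age_10 = hazard_ratio(b_age, delta=10)     # per 10-unit increase

print(f"Risk factor present vs absent: {hr_risk_factor:.2f}")
print(f"Age, per 1 year:  {hr_age_1:.3f}")
print(f"Age, per 10 years: {hr_age_10:.3f}")
```

Because the model is multiplicative on the hazard scale, the ratio for an increase of 10 units equals the per-unit ratio raised to the 10th power.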

## Required input

**Survival time**

The name of the variable containing the time to reach the event of interest, or the time of follow-up.

**Endpoint**

The name of a variable containing code 1 for the cases that have reached the endpoint, and code 0 for the cases that have not reached the endpoint, either because they withdrew from the study or because the end of the study was reached.

**Predictor variables**: names of variables that you expect to predict survival time.

The Cox proportional-hazards regression model assumes that the effects of the predictor variables are constant over time. Furthermore, there should be a linear relationship between the predictor variables and the logarithm of the hazard. Predictor variables that have a highly skewed distribution may require logarithmic transformation to reduce the effect of extreme values. Logarithmic transformation of a variable *var* can be obtained by entering LOG(*var*) as predictor variable.

**Filter**

(Optionally) enter a data filter in order to include only a selected subgroup of cases in the analysis.

**Options**

Method: select the way independent variables are entered into the model.

- Enter: enter all variables in the model in one single step, without checking their significance
- Forward: enter significant variables sequentially
- Backward: first enter all variables into the model and next remove the non-significant variables sequentially
- Stepwise: enter significant variables sequentially; after entering a variable in the model, check and possibly remove variables that became non-significant.

Enter variable if P<

A variable is entered into the model if its associated significance level is less than this P-value.

Remove variable if P>

A variable is removed from the model if its associated significance level is greater than this P-value.

**Graph options**

Graph:

- Survival probability (%): plot Survival probability (%) against time (descending curves)
- 100 - Survival probability (%): plot 100 - Survival probability (%) against time (ascending curves)

Graph subgroups: here you can select one of the predictor variables. The graph will display a separate survival curve for each value of this covariate (which must be categorical, and may not contain more than 8 categories). If no covariate is selected here, then the graph will display the survival at the mean of the covariates in the model.

## Results

### Cases summary

This table shows the number of cases that reached the endpoint (Number of events), the number of cases that did not reach the endpoint (Number censored), and the total number of cases.

### Overall Model Fit

The Chi-squared statistic tests the relationship between time and all the covariates in the model.

### Coefficients and Standard Errors

The program lists the covariates included in the model, their regression coefficient b with standard error (SE), Wald statistic (b/SE)^{2} and associated P-value, Exp(b) and the 95% confidence interval for Exp(b).

Suppose the covariate is **dichotomous** and is coded 1 if present and 0 if absent. Then the quantity exp(b) can be interpreted as the instantaneous relative risk of an event, at any time, for an individual with the risk factor present compared with an individual with the risk factor absent, given both individuals are the same on all other covariates.

If the covariate is **continuous**, then the quantity exp(b) is the instantaneous relative risk of an event, at any time, for an individual with an increase of 1 in the value of the covariate compared with another individual, given both individuals are the same on all other covariates.
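The quantities in the coefficients table can be reproduced from b and SE alone. The sketch below assumes that the P-value comes from comparing the Wald statistic with a Chi-squared distribution with 1 degree of freedom, and that the confidence interval is the usual Wald interval exp(b ± 1.96 × SE); the text does not state which interval the program uses, so treat this as an illustration. The function name `wald_summary` and the input values are hypothetical.

```python
import math

def wald_summary(b: float, se: float, z: float = 1.959964) -> dict:
    """Wald statistic, P-value, Exp(b) and Wald-type 95% CI for one coefficient.

    Assumes a Chi-squared reference distribution with 1 degree of freedom,
    whose upper tail equals erfc(sqrt(W / 2)).
    """
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return {
        "wald": wald,
        "p": p_value,
        "exp_b": math.exp(b),
        "ci_low": math.exp(b - z * se),
        "ci_high": math.exp(b + z * se),
    }

# Hypothetical coefficient and standard error, for illustration only.
row = wald_summary(b=0.5, se=0.25)
for name, value in row.items():
    print(f"{name:8s} {value:.4f}")
```

The interval is computed on the coefficient scale and then exponentiated, which is why it is not symmetric around Exp(b).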

### Variables not included in the model

This table lists the variables (if any) that were not found to contribute significantly to the prediction of time, and that were therefore not included in the model.

### Baseline cumulative hazard function

Finally, the program lists the baseline cumulative hazard H_{0}(t), with the cumulative hazard and survival at the mean of all covariates in the model.

The baseline cumulative hazard can be used to calculate the survival probability S(t) for any case at time t:

$$S(t) = \exp\left(-H_{0}(t) \times \exp(PI)\right)$$

where PI is a prognostic index:

$$PI = b_{1}x_{1} + b_{2}x_{2} + \dots + b_{k}x_{k}$$
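As a minimal sketch of this calculation, the Python code below assumes that the prognostic index is the linear predictor PI = b1·x1 + ... + bk·xk and that S(t) = exp(−H0(t) × exp(PI)), which is the form consistent with the hazard model given in the Description. The coefficients, covariate values and baseline cumulative hazard are hypothetical.

```python
import math
from typing import Sequence

def prognostic_index(coefs: Sequence[float], values: Sequence[float]) -> float:
    """Linear predictor: sum of coefficient * covariate value (assumed form of PI)."""
    if len(coefs) != len(values):
        raise ValueError("coefs and values must have the same length")
    return sum(b * x for b, x in zip(coefs, values))

def survival_probability(h0_t: float, pi: float) -> float:
    """S(t) = exp(-H0(t) * exp(PI)), with H0(t) the baseline cumulative hazard."""
    return math.exp(-h0_t * math.exp(pi))

# Hypothetical values, for illustration only.
coefs = [0.693, 0.05]   # b1, b2
case = [1, 10]          # x1 (risk factor present), x2 (10 units)
h0_t = 0.10             # baseline cumulative hazard at the time of interest

pi = prognostic_index(coefs, case)
print(f"PI = {pi:.3f}, S(t) = {survival_probability(h0_t, pi):.3f}")
```

A case with all covariates equal to 0 has PI = 0, so its survival reduces to exp(−H0(t)), the baseline survival.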

### ROC curve analysis

Another method to evaluate the Cox proportional-hazards regression model makes use of ROC curve analysis (Harrell et al., 1996; Pencina & D'Agostino, 2004). In this analysis, the power of the model's prognostic indices to discriminate between positive and negative cases is quantified by the Area under the ROC curve (AUC). The AUC, sometimes referred to as the C-statistic (or concordance index) (Harrell et al., 1996), is a value that varies from 0.5 (discriminating power not better than chance) to 1.0 (perfect discriminating power).
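To make the meaning of the AUC concrete, the sketch below computes it as the proportion of (positive, negative) case pairs in which the positive case has the higher prognostic index, counting ties as one half. This is the standard pairwise definition of the area under the ROC curve; it ignores survival times and censoring, so it is only an illustration and not necessarily the exact computation performed by the program. The function name and data are hypothetical.

```python
from typing import Sequence

def auc_from_scores(events: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    events: 1 = reached the endpoint (positive), 0 = did not (negative)
    scores: prognostic index of each case (higher = higher predicted risk)
    """
    pos = [s for e, s in zip(events, scores) if e == 1]
    neg = [s for e, s in zip(events, scores) if e == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical data, for illustration only.
events = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc_from_scores(events, scores))
```

Time-dependent variants of the C-statistic, as discussed by Pencina & D'Agostino (2004), additionally restrict the comparison to pairs whose ordering in time can be determined despite censoring.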

### Sample size considerations

Based on the work of Peduzzi et al. (1995) the following guideline for a minimum number of cases to include in a study can be suggested.

Let **p** be the smaller of the proportions of positive cases (cases that reached the endpoint) and negative cases (cases that did not reach the endpoint) in the population, and **k** the number of predictor variables. The minimum number of cases to include is then:

N = 10 k / p

For example: you have 3 predictor variables to include in the model and the proportion of positive cases in the population is 0.20 (20%). The minimum number of cases required is

N = 10 × 3 / 0.20 = 150

If the resulting number is less than 100 you should increase it to 100 as suggested by Long (1997).
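The guideline can be written as a small helper. This is a sketch of the rule as stated above, N = 10 k / p with a minimum of 100; the function name `minimum_cases` and the choice to round up to a whole number of cases are assumptions.

```python
import math

def minimum_cases(k: int, p: float, floor: int = 100) -> int:
    """Minimum number of cases: N = 10 * k / p, raised to `floor` if smaller.

    k: number of predictor variables
    p: smaller of the proportions of positive and negative cases (0 < p <= 0.5)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 < p <= 0.5:
        raise ValueError("p must be in (0, 0.5]")
    # round() guards against floating-point noise before rounding up
    n = math.ceil(round(10 * k / p, 9))
    return max(n, floor)

print(minimum_cases(k=3, p=0.20))  # the example from the text
print(minimum_cases(k=1, p=0.50))  # formula gives 20, raised to 100
```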

## Graph

The graph displays the survival curves for all categories of the categorical variable selected for Graph - Subgroups, at the mean values of all other covariates in the model.

If no covariate was selected for Graph - Subgroups, or if the selected variable was not included in the model, then the graph displays a single survival curve at the mean of all covariates in the model.

A survival curve represents the probability (Y-axis) of surviving a given length of time (X-axis).

## Literature

- Christensen E (1987) Multivariate survival analysis using Cox's regression model. Hepatology 7:1346-1358.
- Harrell FE Jr, Lee KL, Mark DB (1996) Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 15:361-387.
- Long JS (1997) Regression Models for categorical and limited dependent variables. Thousand Oaks, CA: Sage Publications.
- Pampel FC (2000) Logistic regression: A primer. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-132. Thousand Oaks, CA: Sage Publications.
- Peduzzi P, Concato J, Feinstein AR, Holford TR (1995) Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. Journal of Clinical Epidemiology 48:1503-1510.
- Pencina MJ, D'Agostino RB (2004) Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Statistics in Medicine 23:2109-2123.
- Rosner B (2006) Fundamentals of Biostatistics. 6th ed. Pacific Grove: Duxbury.