Kaplan-Meier estimator

From Ganfyd


The Kaplan-Meier estimator is a statistical tool used to describe the time to the occurrence of an event among a set of subjects, i.e. a method of survival analysis.[1][2][3] Its most common application is estimating the survival function, that is, describing survival data, for instance in cancer patients, where death is the event being examined. It can equally apply to any event, e.g. relapse of a disease or failure of an implanted device.

[Figure: example Kaplan-Meier plot]

The data are used to estimate the probability of remaining free of the event over time. The estimate is most commonly displayed as a graphical plot, with time on the x-axis and the cumulative proportion of subjects who have not yet experienced the event of interest on the y-axis. The plot consists of steps, each reflecting the occurrence of one or more events at a particular time point. Removal of an individual due to censoring does not itself produce a step, but it reduces the number at risk and so enlarges subsequent steps.

The method is used for data where the length of follow-up varies; other methods, such as the t-test, are more appropriate where follow-up is complete and of a fixed length. Varying follow-up is handled by censoring data in instances where the follow-up of a subject is lost or terminated for reasons unrelated to the outcome measure, e.g. patient withdrawal from the study. Censored observations are indicated on the plot with a vertical tick.
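For reference, the estimate behind the plot is usually written as the product-limit formula below. The symbols are not from this article and are introduced here for illustration only: t_i are the distinct times at which events occur, d_i is the number of events at t_i, and n_i is the number of subjects still at risk (neither censored nor having had the event) just before t_i.

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

A censored subject contributes to n_i up to the time of censoring and is then dropped from it, which is how varying follow-up is accommodated.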

While it is possible to have a single plot, it is more common to plot two different groups on the same axes, e.g. patients treated with adjuvant chemotherapy versus a control group, thus permitting both a graphical comparison and a statistical comparison using methods such as the Logrank test.

The graph usually starts at 1 (i.e. all patients alive) and decreases over time. However, in some situations, for instance in a study of time taken to achieve a particular skill, it is more appropriate to look at the cumulative probability of the event, i.e. (1 - cumulative survival).
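As a rough illustration of how the curve and its complement are obtained, the sketch below works through the standard product-limit calculation in base R: at each event time, the running survival is multiplied by (1 - events / number at risk). The six follow-up times and the variable names are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical follow-up times (months) and event indicator
# (1 = event occurred, 0 = censored)
time   <- c(2, 3, 3, 5, 7, 8)
status <- c(1, 1, 0, 1, 0, 1)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[status == 1]))

# Number still at risk just before each event time
n_risk <- sapply(event_times, function(tt) sum(time >= tt))

# Number of events at each event time
n_event <- sapply(event_times, function(tt) sum(time == tt & status == 1))

# Product-limit (Kaplan-Meier) estimate of cumulative survival
surv <- cumprod(1 - n_event / n_risk)

# Cumulative probability of the event, i.e. 1 - cumulative survival
cum_event <- 1 - surv

print(data.frame(time = event_times, n_risk, n_event, surv, cum_event))
```

The subject censored at 3 months counts as at risk at 3 months but not at 5 months, so the curve does not drop at the moment of censoring, although later steps become larger.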



The calculations assume that:[4]

  1. Censored subjects behave in a similar way to those still under follow-up. There could be a biologically significant reason why some patients are lost to follow-up.
  2. Subjects recruited early and late in a study are similar. The longer a study, the less likely this is to hold true. For instance, in cancer, case mix may change and earlier detection may mean biologically less aggressive tumours.
  3. The event is assumed to occur at the time point recorded. This is clearly true in the case of death, but may not hold for other events, for instance cancer recurrence, where the recorded time depends on the time of detection.

How to do it?

SPSS (Statistical Package for the Social Sciences)

  • To compare 2 groups, need a minimum of 3 columns:
    • Time to event, or if event did not occur, the length of follow-up.
    • Status, i.e. did the event occur? Can be specified as 1/0, Y/N, etc. Values can be defined in the 'data' mode.
    • Factor, i.e. what is different between the two groups, e.g. adjuvant chemotherapy vs no chemotherapy (again indicated by 1/0 or Y/N).
  • Select from menu: Analyze -> Survival -> Kaplan-Meier
  • Transfer the data columns to the appropriate boxes with the arrows.
  • Define the Status event, e.g. if survival analysis and death=1, then specify '1'.
  • Specify options if required (allows Logrank test).
  • Press OK


R

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and macOS. Download from a mirror of R-Project.org. R is an open-source implementation of the S language.

Use the survival package. Load the data as a data frame with headings time, status and x (where x is the differing factor, e.g. chemotherapy vs no chemotherapy).
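As a minimal sketch of the expected layout, the data frame below is typed in by hand. All values are invented for illustration, and the group labels "chemo" and "control" are assumptions. In practice the data would more likely be read from a file, as in the commented-out line, where the file name is hypothetical.

```r
# Hypothetical data: follow-up time in months, status (1 = event, 0 = censored)
# and x, the factor that differs between the two groups
mydata <- data.frame(
  time   = c(6, 13, 21, 30, 31, 37, 9, 12, 18, 24, 27, 40),
  status = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  x      = factor(rep(c("chemo", "control"), each = 6))
)

# Alternative: read the same three columns from a CSV file (hypothetical name)
# mydata <- read.csv("mydata.csv")

str(mydata)
```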

The functions used can have several parameters, but for the basic plot, the default settings will suffice. For detailed information, see the R manual.[5][6]

  • Load library. Type: library(survival)
  • The Surv function combines the time and status data into a survival object, printed as a sequence of time values in which censored values are suffixed with a +. Usage:[7] Surv(mydata$time, mydata$status)
  • The survfit function then takes the object from Surv and calculates the Kaplan-Meier estimate of the cumulative proportion surviving.[8]
  • Plot using the plot function.
  • Combining this into one line:
    plot(survfit(Surv(mydata$time, mydata$status) ~ mydata$x)) or
    plot(survfit(Surv(time, status) ~ x, data = mydata))
  • For cumulative probability of the event, use:
    plot(survfit(Surv(time, status) ~ x, data = mydata), fun="event")
  • For the Logrank test, use the survdiff function: survdiff(Surv(time, status) ~ x, data = mydata)
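The steps listed above can be combined into a short script along the following lines. This is a sketch only: the data are invented, and the object names s, fit and lr are arbitrary choices, not names required by the survival package.

```r
library(survival)

# Hypothetical data: time in months, status (1 = event, 0 = censored), group x
mydata <- data.frame(
  time   = c(6, 13, 21, 30, 31, 37, 9, 12, 18, 24, 27, 40),
  status = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  x      = factor(rep(c("chemo", "control"), each = 6))
)

# Survival object; censored times are printed with a trailing "+"
s <- Surv(mydata$time, mydata$status)
print(s)

# Kaplan-Meier estimate for each group
fit <- survfit(Surv(time, status) ~ x, data = mydata)
print(summary(fit))   # table of time, number at risk, events and survival

# Survival curves, then the cumulative probability of the event
plot(fit)
plot(fit, fun = "event")

# Logrank test comparing the two groups
lr <- survdiff(Surv(time, status) ~ x, data = mydata)
print(lr)
```

With only six subjects per group the Logrank result here has no real meaning; the point is the order of the calls.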

The type of line and colour can be changed using:

  • plot(survfit(Surv(time, status) ~ x, data = mydata), lty=c(1,2), col=c("red", "blue"))
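A slightly fuller version of the styled plot might look like the sketch below, which also adds axis labels, a legend and tick marks at censored times. The data and label text are hypothetical. mark.time, xlab and ylab are arguments accepted by the plot method for survfit objects, and legend is the base R graphics function.

```r
library(survival)

# Hypothetical data: time in months, status (1 = event, 0 = censored), group x
mydata <- data.frame(
  time   = c(6, 13, 21, 30, 31, 37, 9, 12, 18, 24, 27, 40),
  status = c(1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1),
  x      = factor(rep(c("chemo", "control"), each = 6))
)

fit <- survfit(Surv(time, status) ~ x, data = mydata)

# One line type and colour per group, in the order of the factor levels;
# mark.time = TRUE draws a tick at each censored observation
plot(fit,
     lty = c(1, 2),
     col = c("red", "blue"),
     mark.time = TRUE,
     xlab = "Time (months)",
     ylab = "Cumulative survival")

# Legend entries follow the same order as the factor levels
legend("topright",
       legend = levels(mydata$x),
       lty = c(1, 2),
       col = c("red", "blue"))
```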