Can you calculate confidence intervals for cluster sampling in OpenEpi?
OpenEpi (www.OpenEpi.com) works like a calculator: you enter summary data and it presents the results of some calculations. Calculating confidence limits for a proportion or percentage from a cluster sample requires information on each cluster in order to calculate the design effect (DEFF). OpenEpi does not perform these calculations.

However, if you already have an estimate of the design effect, there is a link towards the bottom of the OpenEpi menu that leads to a web-based program that can do this calculation: "OpenEpi Prototypes", which will take you to www.sph.emory.edu/~cdckms. At this website click on "Confidence intervals for a proportion with DEFF". This module calculates various confidence limits for a proportion if you provide the following: numerator, denominator, number of clusters, and design effect (DEFF).
Kevin Sullivan

Answered:

12 years ago
The term "complex sampling" is very broad, covering both stratified and cluster sampling (and stratified cluster sampling). The procedures required to analyse (e.g.) a PPS cluster sample (such as SMART) are quite different from those required to analyse a stratified sample. There are a number of packages that can handle complex sample survey data including EpiInfo (CSAMPLE), SPSS (Complex Samples module), STATA ("svy" commands), R/S-Plus ("survey" library), SUDAAN, &c. These tend to implement model-based procedures and yield approximate results. An alternative (non-parametric) approach is to use bootstrap / jack-knife estimators. OpenEpi does not provide procedures for complex sample data. The EpiTable module in the MSDOS version of EpiInfo, which is available [url=http://www.brixtonhealth.com/epi4dos.html]here[/url], does provide some simple facilities for estimating proportions from cluster sampled surveys. I hope this is of some help.
Mark Myatt
Technical Expert

Answered:

12 years ago
I forgot to mention that the SMART software (ENA for SMART) also handles two-stage cluster sampled survey data, producing CIs with acceptable coverage and efficiency. This software is, however, specifically designed for SMART-type surveys and is not a general survey analysis package. If your need is to estimate the prevalence of undernutrition using two-stage cluster sampled surveys then ENA is a useful tool. You can find it [url=http://nutrisurvey.net/ena2011]here[/url].
Mark Myatt
Technical Expert

Answered:

12 years ago
Dear Dr. Mark Myatt and Dr. Kevin Sullivan, Thank you both very much for the detailed responses. I would like to ask a follow-up question. Could you please advise whether the calculation of the CI is the only difference between surveys using a simple/systematic sampling method and those using a complex (stratified or cluster) sampling method? Are other estimates such as the odds ratio, rate ratio, etc., and the chi-square test, the same between the two designs?
Anonymous

Answered:

12 years ago
As I wrote above ... "The term 'complex sampling' is very broad, covering both stratified and cluster sampling (and stratified cluster sampling). The procedures required to analyse (e.g.) a PPS cluster sample (such as SMART) are quite different from those required to analyse a stratified sample".

In general terms ... for estimation:

Cluster / PPS : Point estimates (odds ratios, risk ratios, means, &c.) derived from PPS cluster samples will be the same as those calculated as if the data came from a simple random sample. The confidence interval around the estimate will not be the same (it will usually be wider). This is due to loss of sampling variation. It is possible to reduce this loss by careful sample design (i.e. increasing the number of clusters and / or using a within-cluster sampling scheme that helps to maintain sampling variation) although there will be a point at which cost-savings (the main reason for cluster sampling) are lost.

Stratified sample : Point estimates will usually be different from those calculated as if the data came from a simple random sample. This is because stratum-specific results must be weighted by some function of stratum population before being combined to form an overall estimate. The confidence interval around the estimate will not be the same (it will usually be narrower).

In the case of hypothesis testing (e.g. chi-square tests), most testing procedures are not optimal when data are autocorrelated. This is often the case with complex samples. Errors associated with a test may be different from those specified (i.e. p < 0.05 may not be p < 0.05). There are special cases such as a chi-square test for twinned observations (e.g. one person has two eyes) - you may be lucky enough to find a special case that applies.

There are a number of approaches to dealing with this problem. The most common is, probably, to ignore it and treat the data as if from a simple random sample.
One approach (modelling) uses procedures to correct for correlation. These procedures vary in complexity and differ between tests. Another approach is to use resampling (e.g. the bootstrap). The resampling approach is consistent and simple and works well in most cases. Both modelling and resampling require familiarity to do properly.

It is probably easier to recast a hypothesis testing problem as an estimation problem. Most problems are amenable to this approach. For example, testing the difference between two proportions (for which a chi-square test is commonly used) may be recast as estimating a risk ratio (or odds ratio) with a 95% CI (90% for a single-sided test) and checking that the interval does not include one. You can do this sort of analysis with (e.g.) CSAMPLE. I hope this is of some use.
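As a rough illustration of the resampling idea, the sketch below resamples whole clusters (not individuals) with replacement and takes a percentile interval for a prevalence. It is a minimal sketch only: the function name `cluster_bootstrap_ci`, its parameters, and the example counts are hypothetical, the input is assumed to be one (cases, sample size) pair per cluster, and this plain percentile bootstrap is not claimed to be what any of the packages named in this thread implement.

```python
import random

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=12345):
    """Percentile bootstrap CI for a proportion, resampling whole clusters.

    clusters: list of (cases, sample_size) tuples, one per cluster
              (assumed layout; every sample_size must be > 0).
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    k = len(clusters)
    estimates = []
    for _ in range(n_boot):
        # Draw k clusters with replacement; each cluster keeps all its members.
        resample = [clusters[rng.randrange(k)] for _ in range(k)]
        cases = sum(c for c, _ in resample)
        total = sum(m for _, m in resample)
        estimates.append(cases / total)
    estimates.sort()
    # Percentile limits: the alpha/2 and 1 - alpha/2 quantiles of the replicates.
    lo_idx = max(0, int(round((alpha / 2) * n_boot)))
    hi_idx = min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical example: six clusters of 10 children each, made-up case counts.
example = [(0, 10), (5, 10), (1, 10), (3, 10), (2, 10), (4, 10)]
lower, upper = cluster_bootstrap_ci(example)
print(f"Bootstrap 95% CI: {lower:.3f} to {upper:.3f}")
```

Resampling clusters as units is what lets the interval reflect the between-cluster variation; resampling individual children would treat the data as a simple random sample and give an interval that is too narrow.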
Mark Myatt
Technical Expert

Answered:

12 years ago

Dear Dr. Mark Myatt and Dr. Kevin Sullivan, 
My name is Odundo, a nutrition researcher working in Kenya. I am doing an acute malnutrition hotspot analysis, looking at historical data spanning more than 10 years.
The data are mainly from SMART surveys, and I would like to estimate the CIs taking the sampling methodology (cluster sampling) into account. Ideally, I want to produce graphs with error bars. The data are stored in MS Excel flat files and I would like to set up a formula in Excel to give me the CIs.
Any assistance with this?
Using MS Excel I have calculated the CIs for proportions, but these are obviously narrower than the CIs obtained from the ENA for SMART software. I would like to factor in the additional margin of error due to clustering. I have parameters such as the number of clusters, DEFF, etc. I am avoiding the ENA software because using it would be too manual and time-consuming given the need to disaggregate the findings.
Thanks in advance!

Elijah Odundo

Answered:

3 years ago

Dear Odundo:

Since neither Mark nor Kevin has submitted an answer to your question, let me give it a try. The simplest formula to calculate 95% confidence intervals assuming simple random sampling (or its equivalent) is:

Lower confidence limit = Mean or proportion - (1.96 x standard error)

Upper confidence limit = Mean or proportion + (1.96 x standard error)

However, as you correctly state, the confidence intervals must account for complex sampling. The design effect (DEFF) is the factor by which the sample size must be inflated to maintain the same precision if complex sampling is used rather than simple random sampling. But when calculating measures of precision, such as confidence intervals, we use the square root of DEFF. Therefore, the equations to calculate confidence intervals for complex sampling surveys are:

Lower confidence limit = Mean or proportion - (1.96 x standard error x square root of DEFF)

Upper confidence limit = Mean or proportion + (1.96 x standard error x square root of DEFF)

So if you have the mean or proportion, the standard error calculated assuming simple random sampling, and the DEFF, you can calculate the appropriate 95% confidence intervals for estimates derived from data obtained by complex sampling.
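As a minimal sketch of the calculation just described, for a proportion: the standard error is computed as for a simple random sample, sqrt(p x (1 - p) / n), and the half-width is then multiplied by the square root of DEFF. The function name `deff_adjusted_ci`, its parameters, and the example figures are hypothetical, and clipping the limits to the 0 to 1 range is an added assumption rather than part of the formula above. Other software may use different formulas, so results need not match them exactly.

```python
import math

def deff_adjusted_ci(proportion, n, deff, z=1.96):
    """Confidence limits for a proportion, widened by sqrt(DEFF).

    proportion: estimated proportion (0 to 1)
    n:          total sample size (denominator)
    deff:       design effect, assumed already estimated elsewhere
    z:          1.96 for a 95% interval
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if n <= 0 or deff <= 0:
        raise ValueError("n and deff must be positive")
    # Standard error assuming simple random sampling.
    se_srs = math.sqrt(proportion * (1.0 - proportion) / n)
    # Inflate by the square root of the design effect.
    half_width = z * se_srs * math.sqrt(deff)
    # Assumption: keep the limits inside the valid range for a proportion.
    lower = max(0.0, proportion - half_width)
    upper = min(1.0, proportion + half_width)
    return lower, upper

# Hypothetical example: prevalence 10%, 600 children, DEFF of 1.5.
lower, upper = deff_adjusted_ci(0.10, 600, 1.5)
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```

With these made-up inputs the interval is roughly 7.1% to 12.9%. In Excel, assuming (hypothetically) the proportion is in A2, the sample size in B2 and DEFF in C2, the lower limit could be written as =A2-1.96*SQRT(A2*(1-A2)/B2)*SQRT(C2), with a plus sign in place of the first minus sign for the upper limit.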

There are other formulas to calculate confidence intervals which may or may not be more accurate in your situation. My general recommendation would be to use a statistical analysis software package which can account for cluster sampling and automatically calculate the appropriate 95% confidence intervals. Such software includes Epi Info, SAS, SPSS, STATA, and R.

I hope this is helpful. 

Bradley A. Woodruff
Technical Expert

Answered:

3 years ago

Many thanks Bradley
This is profoundly helpful!
 

Elijah Odundo

Answered:

3 years ago