I tend to use the WHO "biological plausibility" flagging criteria with anthropometric data. These are very similar to the older CDC / NCHS flagging criteria.

I have been asked (for two separate projects) to apply both the WHO and SMART flagging criteria to some very large datasets and look at the consequences of using each of these. I am anxious to get this right.

I have a couple of questions about the flagging criteria used in SMART surveys.

The SMART manual has:

"In the plausibility report, the program will list and query any value that is ± three standard deviations of the survey mean. After one or two clusters have been entered, or if there has been a previous survey, it is useful to enter in the variable view sheet the limits as the mean ± three standard deviations (or 3 z-scores) during data entry. This enables potential errors to be picked up as early as possible during data entry." (p 83)

This suggests that a flag for (e.g.) WHZ would be raised if:

WHZ < mean(WHZ) - 3 * sd(WHZ)

or:

WHZ > mean(WHZ) + 3 * sd(WHZ)


This suggests that the sample SD is the primary flagging criterion. This approach makes a great deal of sense if we assume normality in the distribution of the indicator (whether that assumption holds is a separate issue).
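
Concretely, in R I read this rule as follows (whz here is a hypothetical numeric vector holding the WHZ values from one survey):

# flag records lying outside mean +/- 3 * observed (sample) SD
flag <- abs(whz - mean(whz, na.rm = TRUE)) > 3 * sd(whz, na.rm = TRUE)
table(flag)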

Later the SMART manual has:

"Most children with wrongly measured data give values that are within the plausible range. Inclusion of such errors can be surmised from examination of the standard deviation, and other statistical checks on the data. The standard deviation should be between 0.8 and 1.2 z-score units for WFH in all well-conducted surveys (in 80% of surveys the standard deviation is between 0.9 and 1.1 z-score units). The standard deviation increases as the proportion of erroneous results in the dataset increases; this has a very dramatic effect upon the computed prevalence of wasting. For this reason, if a value is more likely to be an error than a real measurement, it should be removed from the analysis. We do this by taking the mean of the WFH data as the fixed point for describing the status of the population we are surveying. Statistically about 2.5 children out of 900 will lie outside the limits of ±3 z-score units of the mean. Less than 0.5 out of 1,000 will lie outside ±3.5 z-score units from the mean. This forms the basis for deciding if a value is more likely to be an error than a real measurement. The software will list children with these extreme values in the plausibility check list." (p 86)

I find this confusing. The estimates of case-numbers quoted in this paragraph are true only if the sample SD (or 'z') is 1. For SD = 1 and mean z-score = -1 we would expect about 2.43 cases out of 900 to lie outside mean ± 3 (that is, below -4 or above +2). If the sample SD is 1.2 but we still flag at mean ± 3 then we would expect 11.18 out of 900 to lie outside those limits. I would not characterise 11.18 as "about 2.5" (it is about 4.5 times larger). If instead we flag at mean ± 3 times the sample SD then we would expect 2.43 cases whatever the SD.
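
These expectations are simple to check in R (assuming normality throughout):

# expected numbers outside mean +/- 3 in a sample of 900
2 * pnorm(-3) * 900        # true SD = 1.0 : about 2.43
2 * pnorm(-3 / 1.2) * 900  # true SD = 1.2 : about 11.18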

It seems to me that the SMART manual is confused (it confuses me!) about the use of "standard deviation" and "z". Following the numbers and working back ... the rationale of there being "about 2.5 children out of 900" requires the method to use the sample SD.

Later in the SMART manual we have:

"As explained in the section on extreme values, this tells you if there is substantial random error in the measurements. If the standard deviation is high (over 1.2), it is likely that there are a lot of extreme values and values more than ±3 z-scores of the mean. " (p 87)

This can only be the case if the sample SD is not used and the flagging criterion assumes SD = 1, so that a flag for (e.g.) WHZ would be raised if:

WHZ < mean(WHZ) - 3

or:

WHZ > mean(WHZ) + 3


I'm not sure that this approach to flagging makes much sense.
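
To make the contrast with the earlier rule explicit, the two can be compared directly in R (whz again being a hypothetical vector of survey WHZ values):

# fixed rule (assumes SD = 1) vs. rule using the observed SD
flag_fixed  <- abs(whz - mean(whz, na.rm = TRUE)) > 3
flag_sample <- abs(whz - mean(whz, na.rm = TRUE)) > 3 * sd(whz, na.rm = TRUE)
# when sd(whz) > 1 the fixed rule flags more records; when sd(whz) < 1, fewer
table(fixed = flag_fixed, sample = flag_sample)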

The SMART manual is self-contradictory about this matter and I find myself confused.

My questions are:

(1) Do we use:

WHZ = mean(WHZ) +/- 3 * sd(WHZ)


or:

WHZ = mean(WHZ) +/- 3

(2) If we are to use:

WHZ = mean(WHZ) +/- 3


What is the rationale for this when we can easily get the sample SD? Why assume a known variance?

The first question should be easy enough to answer. The second might be more difficult.

All help gratefully received.

Hi Mark,

My understanding is that the SMART criterion is

WHZ = mean(WHZ) +/- 3

and not

WHZ = mean(WHZ) +/- 3 * sd(WHZ)

The reason, as you quote from the manual, is that the SD is expected to be between 0.8 and 1.2; hence the mid-value of 1 is used.

There are studies looking into this issue elsewhere. See these two:

http://www.ncbi.nlm.nih.gov/pubmed/17639241
https://peerj.com/articles/380/

I hope this is useful

Carlos

Carlos Grijalva-Eternod

Answered:

9 years ago

Thanks.

You may be correct ... I find this hard to get from the SMART manual.

I see little rationale in using an "assumed" variance (i.e. SD = 1) when we have a large (i.e. n >> 60) sample from which we can make a good estimate of the variance. Expecting the SD to be in the range 0.8 to 1.2 is a very different thing from assuming the SD to equal one. What am I missing?

Is this just a rule? I suppose that it doesn't have to make sense as long as we all do the same peculiar thing.

Mark Myatt
Technical Expert

Answered:

9 years ago

Hi Mark,

I know that there is one document written by Mike Golden and others where they explored this issue in more detail, and I think it was the document that led to the recommendation of SD = 1. I do not have this document, but others in this forum might have it and might share it with you.

I agree with you that it is better to use the observed variance rather than an idealised one. But, just like painting, if you have the skill to paint, you paint freehand; if you do not, you can paint by numbers. It won't give you the same result, but some might argue it will give you comparable results.

Given that we are now witnessing the emergence of the double burden of malnutrition (i.e. individuals at both extremes of the distribution), I am unsure how much longer this suggested SD range of 0.8 to 1.2 will remain valid.

My tuppence worth,

Carlos

Carlos Grijalva-Eternod

Answered:

9 years ago

I hope that someone can send a link to that document. Then all we need to know is whether the recommendation was adopted.

I find it a very odd thing to do ... if we have the data to calculate a mean we also have the data to calculate the SD. We know that, in SMART, we have to calculate the SD anyway as it is used as a quality check.

I think using the sample SD would be better as then we would have symmetry. If we have a population that is mixed with regard to risk of malnutrition then we might expect a long left tail and a high SD. I guess the SMART rules censor some true cases (and underestimate prevalence) because of this ... a separate issue.

Mark Myatt
Technical Expert

Answered:

9 years ago

Dear Carlos,

I guess you are referring to this document, which suggests that even in famine situations the SD of WFH is not very different from 1.

http://www.nutrisurvey.de/ena2011/Golden_Population_nutritional_status_during_famile_surveywhzdis.pdf

But this does not mean it is always exactly equal to 1.

André Briend
Technical Expert

Answered:

9 years ago

Hi André,

I agree with you, and by no means am I encouraging the use of SD = 1 in all situations. I think we made this point clear in our paper assessing this issue (https://peerj.com/articles/380/).

In our paper, we put out a call to this community to discuss this issue in more detail, so as to build consensus about whether or not a standardised cleaning criterion should be applied widely and, if so, which one it should be.

BW,
Carlos

Carlos Grijalva-Eternod

Answered:

9 years ago

André,

I did see this note (I think it remains unpublished). It makes some grand claims based on a weak statistical test that is no longer recommended.

Golden & Grellety (2002) tested for normality in 228 nutritional anthropometry surveys using the one-sample Kolmogorov-Smirnov (KS) test and found that the distribution of WHZ did not differ significantly from normality in 225 (98.6%) surveys.

The problem with this approach is that the one-sample KS test uses parameters estimated from the sample to specify the null hypothesis. This reduces the power of the test. Alternative tests such as the Shapiro-Wilk or Anderson-Darling tests are now preferred over the KS test.

I repeated the analysis of Golden & Grellety (2002) using a superset of the data they used (n = 560 surveys) and found:

KS test: 491 / 560 = 88% tested normal
Shapiro-Wilk test: 118 / 560 = 21% tested normal

There are issues with the Shapiro-Wilk test in that it tends to be overly sensitive when used with large sample sizes. It is common to investigate positive tests using a Q-Q plot as in Figure 4 of Golden & Grellety (2002). Deviations from normality can be seen (particularly in the tails of the distribution) in that figure.
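
For reference, this is how the two tests (with a Q-Q plot as a visual check) can be run in R; whz is a hypothetical vector of WHZ values from a single survey:

# one-sample KS test with parameters estimated from the sample
# (it is this estimation that weakens the test)
ks.test(whz, "pnorm", mean = mean(whz), sd = sd(whz))
# Shapiro-Wilk test (preferred, though sensitive at large n)
shapiro.test(whz)
# visual check of the tails
qqnorm(whz)
qqline(whz)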

Just to be clear ... here is an example:

[Image removed: distribution of WHZ from a single survey, showing a clear left tail.]

In this example there is a clear tail to the left. The test results are:

KS test: p = 0.1638
Shapiro-Wilk test: p < 0.0001


The distribution shown would be classified as normal by the methods employed by Golden & Grellety (2002). BTW ... the SD in the illustration was 1.17 and within the zone of "plausibility" in SMART.
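
For anyone wishing to experiment, a distribution like the one illustrated can be simulated as a mixture; this is purely illustrative (the mixture parameters are my invention and the p-values will vary with the seed):

# a majority population plus a small high-risk subgroup gives a left tail
set.seed(1)
whz <- c(rnorm(800, mean = -0.8, sd = 1), rnorm(100, mean = -2.5, sd = 1))
sd(whz)                                               # typically about 1.1
ks.test(whz, "pnorm", mean = mean(whz), sd = sd(whz)) # often fails to reject
shapiro.test(whz)                                     # usually rejects normality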

I think this puts Carlos' comment about the dual burden in perspective. If we had a dual burden then we would expect to see some symmetry.

I think this also brings into question the fetishisation / reification of normality in the SMART method. If, as the Shapiro-Wilk results suggest, the majority of surveys return a non-normal dataset similar to that illustrated, then the use of a normal assumption to decide "plausibility" will exclude cases in the left tail and may bias prevalence downwards.

This is all beside the point. André is correct. There is a world of difference between something often being within +/- 20% of a given value and assuming that it must then take that given value. That is nonsense to me.

The fact that I find a couple of things nonsensical is also beside the point. What is the "correct" procedure for SMART?

Carlos,

I agree that there needs to be a debate about this issue. I think we can standardise for surveys of the general population (clinic populations will be different) but the criteria need to be ones of biological plausibility, as recommended by the WHO and CDC and as we used to apply before SMART. I don't think it sensible to use rules that assume normality when that assumption is very often untrue.

The fact that this debate is needed is also beside the point. What is the "correct" procedure for SMART?

Mark Myatt
Technical Expert

Answered:

9 years ago

Hi Mark,
Very nice analysis; I think you present a convincing argument for evaluating our assumptions further. One comment for future posts: could you try to make your figures smaller so that I can properly strain my eyes?

Re: your question. I think that there are a lot of SMART people in this forum who could provide you with an answer; I am not sure I qualify as one of them. However, if what the ENA for SMART software does is follow the SMART recommendation, then the "correct" SMART procedure is:

Indicator = observed mean +/- 3 z-scores

BW,
Carlos

Carlos Grijalva-Eternod

Answered:

9 years ago

Sorry about the figure size. The forum software does not seem to autoscale images, so I try to make them small enough to be seen on (e.g.) a netbook type device and not so big as to make access to the forums impossibly slow for those with poor network speeds. I will write to the forum manager to see if we can fix this. In the meantime, try "View -> Zoom In" (or what-have-you) in your favourite browser.

I think we could try to repeat the analysis on the new bigger-better survey database we have collected (permissions needed).

WRT SMART people (I will resist the obvious joke), I am hoping that someone from the SMART organisation will give me the definitive answer.

Thanks for the clarification WRT ENA for SMART. Juergen is a fine chap but may have got the wrong end of the stick (as I may have) because the SMART documentation is far from clear. If it is mean +/- 3 then the SMART manual is wrong in a number of places.

Mark Myatt
Technical Expert

Answered:

9 years ago

Hello all,

Indeed the calculation of SMART flags applies +/- 3 Z-scores from the observed mean (SD=1, not the observed SD).

While an updated version of the SMART Manual has not been released since 2006, there have been updates to certain sections to keep up with changes in SMART guidance, including Sampling for SMART (June 2012) http://smartmethodology.org/survey-planning-tools/smart-methodology/smart-methodology-paper/ and the ENA Manual (August 2012) http://smartmethodology.org/survey-planning-tools/smart-emergency-nutrition-assessment/ena-software-versions/. New sections are forthcoming!

For further reference, see the 1995 WHO Technical Report on Anthropometry (p 218) regarding methods for calculating flexible exclusion criteria: http://apps.who.int/iris/bitstream/10665/37003/1/WHO_TRS_854.pdf

Alina Michalska

Answered:

9 years ago

Alina,

Thank you for the clarification.

I am still at a loss to understand the "mean(z) +/- 3" exclusion criterion. Can anyone explain the rationale for this? The rationale given in the SMART manual supports only a "mean(z) +/- 3 * sd(z)" rule using the observed SD.

BTW: The WHO reference has a "mean(z) +/- 4" rule with some additional rules relating to extreme values as a "flexible exclusion range".

Thanks again. I have what I need to do the job.

Mark Myatt
Technical Expert

Answered:

9 years ago

Hi Mark,

Not sure if you have seen Michael Golden's document "SMART: Ensuring data quality - is the survey result usable?". You may already have seen it.

It can be downloaded from the SMART website here:
http://smartmethodology.org/survey-planning-tools/smart-capacity-building-toolbox/ then, in the survey manager training section, click on "complementary tools and resources" and download the "handouts" zip file. There you will find a PDF file titled "Ensuring data quality_Michael Golden.pdf".

It has some explanations regarding the use of the SD. Sorry for this primitive way of describing the location of the file; I couldn't upload it here or find a direct online link.

Regards,

Sameh

Sameh Al-Awlaqi

Answered:

9 years ago

Sameh,

Thanks for the link. I had not seen it.

This confuses me (again).

This paragraph from pages 2 and 3:

Normally, when we examine survey data we find many more children who have extreme values than is given in this table. It would clearly be incorrect to set the limits at ± 2 SD from the mean because a large number of correct measurements would be excluded. If we set it at ±3.0 we will exclude just over one out of a thousand children incorrectly, and the other children we will have excluded correctly because they were bad measurements. If the boundaries are set at -3.3 or even -3.5, then almost no child will be incorrectly excluded. Excluding one of a thousand measurements when we should include that measurement makes almost no difference to the final result, but including a lot of children outside these boundaries can have a major effect upon the result.


and the associated footnote:

To be absolutely correct, if the flags are set ±3 SD then we should select 1.3 children per thousand from those that have been excluded from the sample ...


The text states clearly that the sample SD is used. The footnote confirms this. For example, with mean = -1 and SD = 1.1 the lower flagging limit would be -1 - 3 * 1.1 = -4.3, and we'd have:

> pnorm(-4.3, -1, 1.1) * 1000
[1] 1.349898


That is the "1.3 children per thousand" from the footnote.

This has left me unsure (again) as to what is the correct flagging procedure.

Mark Myatt
Technical Expert

Answered:

9 years ago
Hi all,

To clarify, and to go a bit back to basics: flags refer to outliers, extreme values that are so far from the mean that they are unlikely to be correct measurements. The errors in the Plausibility Check of the ENA for SMART software are always identified based on SMART exclusion cut-offs. SMART flags (cut-offs) are based on statistical plausibility and exclude all values outside ±3 z-scores from the mean of the surveyed population.

If measurements from a surveyed population form a normal distribution (bell-shaped curve) with an SD equal to 1 z-score, then, based on statistical principles, 99.7% of observations should lie within ±3 z-scores of the mean. Thus, on average, about one in 1,000 measurements may be excluded incorrectly (this will have a negligible effect upon the results; there is no penalty in the Plausibility Check summary table until there are 25 flagged data per 1,000 measurements).

The upper and lower cut-off points for SMART flags will be different in each survey since the survey mean is used as the reference point (surveyed children are compared to their own population when using SMART flags). An example: if the survey mean is -1.12, the range of acceptability based on SMART is (-4.12; +1.88). The analysis in the plausibility check (each test) will be performed on the values that lie inside this range.

Data (measurements) that lie outside the SMART flag cut-offs should never be manually removed from the dataset (i.e., from the Data Entry Anthropometry module of ENA); they will be automatically excluded during the analysis depending upon the flags chosen and the range set in the Options tab. Anyone who has access to the original dataset must be able to re-examine all of the data (measurements), including those that are excluded when using SMART flags.

As the standard deviation of the distribution increases, more subjects will be erroneously excluded from the analysis when SMART flags are applied. A high proportion of flagged data usually means that measurements collected by at least one (or more) survey teams have been poorly taken or recorded. Each of the records listed in the flagged data section should be double-checked against the original copy of the paper questionnaire to ensure there are no data entry mistakes.

Once again, please do keep posted for the launch of the updated Plausibility Check chapter, which will hopefully quell any more confusion!
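
A minimal sketch of this rule in R (whz is a hypothetical vector of z-scores; the sketch is illustrative, not taken from the SMART documentation or the ENA code):

# SMART flags: survey mean +/- 3 z-scores, ignoring the observed SD
limits <- mean(whz, na.rm = TRUE) + c(-3, 3)
limits                                  # a mean of -1.12 gives -4.12 and +1.88
flagged <- whz < limits[1] | whz > limits[2]
sum(flagged)  # records excluded from the analysis (never from the dataset)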
Alina Michalska

Answered:

9 years ago
I am not sure that you answer my question. I am not asking about the ENA software. I am asking about the SMART methodology as described in key documents. It seems that the ENA software does not implement the methods described in the documents that describe the SMART methodology.

The 2006 SMART manual is a little confused about whether to use the sample SD or to use Z = 1, but all examples point to the use of the sample SD. The 2014 "Ensuring data quality_Michael Golden.pdf" from the SMART website is very clear that the sample SD is used for flagging / exclusion. Please can you confirm that the 2014 "Ensuring data quality_Michael Golden.pdf" is not correct?

I would like to "go a bit back to basics" ... I can understand the use of the sample SD but I cannot understand the use of Z = 1. Can you explain why Z = 1 is used?
Mark Myatt
Technical Expert

Answered:

9 years ago
Z=1 is clearly an error
Blessing Mureverwi

Answered:

9 years ago
I think so too. I can see no rationale for the Z = 1 approach (there may be one).
Mark Myatt
Technical Expert

Answered:

9 years ago

One additional point not mentioned yet: it is relevant to note that for SMART flags (calculated using SD = 1), exclusions are calculated separately for each indicator. A child does not automatically get excluded for all three indices (WHZ, HAZ, and WAZ) if only one or two of the three z-scores are flagged. We understand this to be a distinction from the procedures used by MICS and DHS.
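
A sketch of per-indicator flagging in R (the data frame svy with columns whz, haz, and waz is hypothetical; this is an illustration, not the ENA code):

# each index is flagged against its own survey mean with SD = 1; a child
# flagged on WHZ is excluded from the wasting analysis only
smart_flag <- function(z) abs(z - mean(z, na.rm = TRUE)) > 3
svy$flag_whz <- smart_flag(svy$whz)
svy$flag_haz <- smart_flag(svy$haz)
svy$flag_waz <- smart_flag(svy$waz)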

Alina Michalska

Answered:

9 years ago

Dear Colleagues,

The new chapter of the SMART survey plausibility check for anthropometry has now been published. Please download it by first opening the following link:

http://smartmethodology.org/survey-planning-tools/smart-methodology/plausibility-check/

Then click on the download button.

Regards,

Sameh

Sameh Al-Awlaqi

Answered:

9 years ago

Sameh,

Thanks for that.

I think it is:

sample mean z-score +/- 3


adding a "z" here just confuses.

I think I prefer:

sample mean z-score +/- 3 * the sample SD


but that brings with it problems WRT the source of the mean and SD in wider area surveys such as MICS and DHS. A national level SD could be very large.

Even with the SMART:

sample mean +/- 3


method, we have trouble deciding at which level to calculate the mean. If it is calculated at the national level we might reduce prevalence by censoring cases from the highest-risk areas.

I guess the correct approach would be to apply censoring on a district-by-district (stratum-by-stratum) basis, along the lines of the sketch below.
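
Something like this, in R (the data frame svy with columns whz and district is hypothetical):

# apply mean +/- 3 within each district (stratum) rather than
# against a single national mean
stratum_mean <- ave(svy$whz, svy$district,
                    FUN = function(z) mean(z, na.rm = TRUE))
svy$flag_whz <- abs(svy$whz - stratum_mean) > 3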

Is there a guideline for this? What do they do in MICS and DHS?

Mark Myatt
Technical Expert

Answered:

9 years ago