Wow, just wow. If you think Psychological Science was bad in the 2010-2015 era, you can’t imagine how bad it was back in 1999

Shane Frederick points us to this article from 1999, “Stereotype susceptibility: Identity salience and shifts in quantitative performance,” about which he writes:

This is one of the worst papers ever published in Psych Science (which is a big claim, I recognize). It is old, but really worth a look if you have never read it. It’s famous (like 1400 citations). And, mercifully, only 3 pages long.

I [Frederick] assign the paper to students each year to review. They almost all review it glowingly (i.e., uncritically).

That continues to surprise and disappoint me, but I don’t know if they think they are supposed to (a politeness norm that actually hurts them given that I’m the evaluator) or if they just lack the skills to “do” anything with the data and/or the many silly things reported in the paper? Both?

I took a look at this paper and, yeah, it’s bad. Their design doesn’t seem so bad (low sample size aside):

Forty-six Asian-American female undergraduates were run individually in a laboratory session. First, an experimenter blind to the manipulation asked them to fill out the appropriate manipulation questionnaire. In the female-identity-salient condition, participants (n = 14) were asked [some questions regarding living on single-sex or mixed floors in the dorm]. In the Asian-identity-salient condition, participants (n = 16) were asked [some questions about foreign languages and immigration]. In the control condition, participants (n = 16) were asked [various neutral questions]. After the questionnaire, participants were given a quantitative test that consisted of 12 math questions . . .

The main dependent variable was accuracy, which was the number of mathematical questions a participant answered correctly divided by the number of questions that the participant attempted to answer.

And here were the key results:

Participants in the Asian-identity-salient condition answered an average of 54% of the questions that they attempted correctly, participants in the control condition answered an average of 49% correctly, and participants in the female-identity-salient condition answered an average of 43% correctly. A linear contrast analysis testing our prediction that participants in the Asian-identity-salient condition scored the highest, participants in the control condition scored in the middle, and participants in the female-identity-salient condition scored the lowest revealed that this pattern was significant, t(43) = 1.86, p < .05, r = .27. . . .

The first thing you might notice is that a t-score of 1.86 is not usually associated with “p < .05”--in standard practice you’d need the t-score to be at least 1.96 to get that level of statistical significance--but that’s really the least of our worries here. If you read through the paper, you’ll see lots and lots of researcher degrees of freedom, also lots of comparisons of statistical significance to non-significance, which is a mistake, and even more so here, given that they’re giving themselves license to decide on an ad hoc basis whether to count each particular comparison as “significant” (t = 1.86), “the same, albeit less statistically significant” (t = 0.89), or “no significant differences” (they don’t give the t or F score on this one). This is perhaps the first time I’ve ever seen a t score less than 1 included in the nearly-statistically-significant category. This is stone-cold Calvinball, of which it’s been said, “There is only one permanent rule in Calvinball: players cannot play it the same way twice.”
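
As a quick sanity check on that t statistic (my own calculation, not anything from the paper), here is what t(43) = 1.86 corresponds to under the usual t distribution, assuming scipy is available:

```python
# Tail areas for the reported t(43) = 1.86 (my own check, not from the paper).
from scipy import stats

t, df = 1.86, 43
p_one_sided = stats.t.sf(t, df)      # upper tail only
p_two_sided = 2 * stats.t.sf(t, df)  # conventional two-sided p-value

print(f"one-sided p = {p_one_sided:.3f}")  # roughly 0.035
print(f"two-sided p = {p_two_sided:.3f}")  # roughly 0.070
```

So the reported “p < .05” only holds as a one-sided test; under the usual two-sided convention it would come out around .07.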

Here’s the final sentence of the paper:

The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures.

Huh? They could’ve saved themselves a few bucks and not run any people at all in the study, just rolled some dice 46 times and come up with some stories.

But the authors were from Harvard. I guess you can get away with lots of things if you’re from Harvard.

Why do we say this paper is so bad?

There’s no reason to suspect the authors are bad people, and there’s no reason to think that the hypothesis they’re testing is wrong. If they could do a careful replication study with a few thousand students at multiple universities, the results could very well turn out to be consistent with their theories. Except for the narrow ambit of the study and the strong generalizations made from just two small groups of students, the design seems reasonable. I assume the experiments were described accurately, the data are real, and there were no pizzagate-style shenanigans going on.

But that’s my point. This paper is notably bad because nothing about it is notable. It’s everyday bad science, performed by researchers at a top university, supported by national research grants, published in a top journal, cited 1069 times when I last checked—and with conclusions that are unsupported by the data. (As I often say, if the theory is so great that it stands on its own, fine: just present the theory and perhaps some preliminary data representing a pilot study, but don’t do the mathematical equivalent of flipping a bunch of coins and then using the pattern of heads and tails to tell a story.)

Routine bad science using routine bad methods, the kind that fools Harvard scholars, journal reviewers, and 1600 or so later researchers.

From a scientific standpoint, things like pizzagate or that Cornell ESP study or that voodoo doll study (really) or Why We Sleep or beauty and sex ratio or ovulation and voting or air rage or himmicanes or ages ending in 9 or the critical positivity ratio or the collected works of Michael Lacour—these miss the point, as each of these stories has some special notable feature that makes them newsworthy. Each has some interesting story, but from a scientific standpoint each of these cases is boring, involving some ridiculous theory or some implausible overreach or some flat-out scientific misconduct.

The case described above, though, is fascinating in its utter ordinariness. Scientists just going about their job. Cargo cult at its purest, the blind peer-reviewing and citing the blind.

I guess the Platonic ideal of this would be a paper publishing two studies with two participants each, and still managing to squeeze out some claims of statistical significance. But two studies with N=46 and N=19, that’s pretty close to the no-data ideal.

Again, I’m sure these researchers were doing their best to apply the statistical tools they learned—and I can only assume that they took publication in this top journal as a signal that they were doing things right. Don’t hate the player, hate the game.

P.S. One more thing. I can see the temptation to say something nice about this paper. It’s on an important topic, their results are statistically significant in some way, three referees and a journal editor thought it was worth publishing in a top journal . . . how can we be so quick to dismiss it?

The short answer is that the methods used in this paper are the same methods used to prove that Cornell students have ESP, or that beautiful people have more girls, or embodied cognition, or all sorts of other silly things about which the experts used to tell us we “have no choice but to accept that the major conclusions of these studies are true.”

To say that the statistical methods in this paper are worse than useless (useless would be making no claims at all; worse than useless is fooling yourself and others into believing strong and baseless claims) does not mean that the substantive theories in the paper are wrong. What it means is that the paper is providing no real evidence for its theories. Recall the all-important distinction between truth and evidence. Also recall the social pressure to say nice things, the attitude that by default we should believe a published or publicized study.

No. This can’t be the way to do science: coming up with theories and then purportedly testing them by coming up with random numbers and weaving a story based on statistical significance. It’s bad when this approach is used on purpose (“p-hacking”) and it’s bad when done in good faith. Not morally bad, just bad science, not a good way of learning about external reality.

38 Comments

  1. Dan Bowman says:

    “The main dependent variable was accuracy, which was the number of mathematical questions a participant answered correctly divided by the number of questions that the participant attempted to answer.”

    Wow! So, a participant who only attempts the problem(s) he or she knows how to do and then gets it (them) right scores 100%. That seems like it might be tricky to interpret….
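
    A toy illustration of that point (hypothetical students, not data from the paper):

    ```python
    # accuracy = correct / attempted rewards skipping questions you are unsure of.
    def accuracy(correct, attempted):
        return correct / attempted

    cautious = accuracy(correct=3, attempted=3)    # attempts only 3 of 12, gets them right
    ambitious = accuracy(correct=7, attempted=12)  # attempts all 12, solves 7

    print(cautious)   # 1.0
    print(ambitious)  # about 0.58
    ```

    The student who solved seven problems scores far “worse” than the one who solved three, so the measure confounds math ability with willingness to attempt questions one might miss.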

  2. garnett says:

    “…just rolled some dice 46 times….”

    D6? D12? D20? You can’t leave your readers hanging like this….

  3. garnett says:

    “Not morally bad, just bad science,…”

    I’m ambivalent about this. These researchers spend other people’s money that was given to them in good faith.

    • I’m not ambivalent at all. It’s morally bad to do this kind of stuff. It’s equivalent to the “snake oil salesman” of the late 1800’s selling bullshit medicines with harmful ingredients to unsuspecting frontier townspeople.

      • Garnett says:

        I can’t disagree, but is it morally bad if they don’t know any better?

        • It’s morally bad if they *should have known better*: if a relatively simple search of, say, Paul Meehl’s papers from the 1960s would have informed them (and there are many similar papers from others), if simple simulation studies within reach of even 1999-era computers could have shown them the folly of their ways, and so on and so forth… then yes. It’s like the snake oil salesman just listening to what the snake oil company says and purposely not doing any real external research to see if the snake oil is in fact doing what the company says.

          That there are whole fields built up of just such snake oil salesmen, and whole university departments devoted to promoting such snake oil and their salespeople is also a serious **moral** failing.

          The problem here is that although what I’m saying above seems obvious, it’s not at all accepted. Even Andrew shies away from coming out and saying “The Emperor Has No Clothes, and has been actively harming his citizens for fun and profit”

          • Garnett says:

            Do you suspect that such a call-out would indict colleagues? That’s made me very careful about what I say.
            I wonder about everyone else.

            • I personally find it takes a lot of effort to avoid working with this type of project if you’re in academia. My own personal collaborators have been chosen carefully and I don’t think it would indict them. But I left academia after my PhD precisely because I didn’t think it was morally acceptable to participate in these kinds of shenanigans and yet it appeared to be nearly required by granting agencies and dept promotional policies.

              I know there are MANY students these days who are opting out of academia after PhDs or postdocs for similar reasons.

              There are many parallels with Serpico in my opinion.

  4. Matt Skaggs says:

    More context from the last paragraph of the paper:

    “Finally, finding that academic performance can be helped as well as hindered through implicit shifts in identification raises important challenges to notions of academic performance and intelligence. Although there is considerable debate about the nature of intelligence (Fraser, 1995; Neisser et al., 1996), strong supporters of genetic differences in IQ assume that ability is fixed and can be quantified through testing (Herrnstein & Murray, 1994). The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures.”

    The reference to “Herrnstein and Murray, 1994” refers to the book “The Bell Curve.” I get the impression that the authors tortured the data until it confessed to a refutation of the claims in the book. That was the goal and that is why it has been cited so many times.

    I’m not endorsing the book, just sayin’.

    • muddles says:

      “The results presented here clearly indicate”

      **clearly**! :)

      But amazingly after all the contrivance they still weasel-word the statement with “indicate” instead of “show”! What’s the difference between “clearly indicate” and “clearly show” or “clearly demonstrate”?

  5. Magpie says:

    This basically all makes sense to me, but I'm slightly confused about this line:

    > The first thing you might notice is that a t-score of 1.86 is not usually associated with “p < .05"–in standard practice you'd need the t-score to be at least 1.96 to get that level of statistical significance–but that's really the least of our worries here.

    Isn't this normal practice for a one-sided hypothesis test? (And z=1.96 is simply the p < .05 bar for a two-sided hypothesis). Is it simply that setting a one-sided hypothesis to weaken the requirements for p < .05 is ill-advised?

    To be clear, I get that this is FAR from the point. The issue isn't the precise t-score they computed, it's the entire setup. The "hypothesis test" they examine is gibberish, for all the listed reasons (& the endless researcher degrees of freedom involved), so a result of t=2.5 would hardly be much better. (I also note the weird choice to study accuracy with a denominator of "questions attempted"–if that's actually what they did, that doesn't make much sense, imagine if you structured a typical math test that way…).

    • Andrew says:

      Magpie:

      The point here is that if you’re working within the hypothesis testing framework (which I don’t like), it’s standard practice to use the 2-sided test. The article states, “The results presented here clearly indicate that test performance is both malleable and surprisingly susceptible to implicit sociocultural pressures,” which among other things implies that effects could go in any direction. As you say, this is the least of the problems, as once you talk about “malleable and surprisingly susceptible” etc., you’re way deep into piranha territory.

  6. Bob says:

    Apologies if the answer to this question seems obvious, but I am not familiar with this problem. What was wrong with psychological science in 2010-2015? How did the problem start and end?

  7. Abyss Lander says:

    This article is full of contradictions. What are you saying? Are the research paper’s methods flawed and that’s why it’s bad, or is it bad because its results are not “notable” enough (whatever that means)? You then claim that a lot of notable studies are bad because they are too ridiculous or, reasonably but in no way an aid to your argument, unscientific? There’s no actual argument against the paper aside from an appeal to statistical significance and a low population sample, both of which you excuse in the face of the seemingly reasonable design of the study. Plus, excuse me if I’ve been told wrong, but I’m pretty confident stereotype threat is a real, well documented phenomenon. Of course that does not mean the study is good, but it can explain why it is able to get away with its supposed flaws.

    • Andrew says:

      Abyss:

      To answer your questions in order:

      1. What I’m saying is that I agree with Shane Frederick that the paper is bad.

      2. Yes, I’m saying the paper’s methods are flawed. I wouldn’t say “that’s why it’s bad”; rather, I’d say that the paper has both methodological and conceptual flaws.

      3. No, I don’t think the paper is bad because its results are not notable. What I wrote is, “This paper is notably bad because nothing about it is notable.” What I meant was that the most notable thing about this story is how commonplace the problems were with this paper. I was writing this in a paradoxical way so I can see how this could have been confusing. When I said “nothing about it is notable,” I wasn’t talking about whether it is making notable claims; rather, I was talking about how there was nothing notable about its failures, which are shared by many papers.

      4. You ask for an “actual argument against the paper.” I just don’t think they demonstrated anything they claimed to demonstrate. To put it another way, whatever positive claims they are making are based on statistical significance, which in turn is based on the idea that, if there were nothing going on and the data were pure noise, such low p-values could not be found. But that reasoning is incorrect. As we’ve known for many years (at least since the famous 2011 paper by Simmons, Nelson, and Simonsohn on researcher degrees of freedom and p-hacking) it’s easy to get apparently statistically significant p-values from pure noise. Thus, they’re offering no real evidence for their claims. The low sample size is not a problem on its own; it’s just a clue that results will be noisy, which allows researchers to find apparently large patterns from pure noise. Given all this, it’s not particularly relevant if the design of the study is seemingly reasonable. It can be seemingly reasonable but just too noisy for anything useful to be learned.

      5. Stereotype threat may well be a real, well documented phenomenon. This is a topic of debate that we don’t need to get into here. As always in this sort of study, the lack of evidence for a claim does not imply that the claim is false. If someone wants to write a review article about stereotype threat or just a personal statement of belief in the idea, that’s fine; my problem is with a claim of evidence when there is none.

      6. I think the main reason the article was published despite its fatal flaws is that, back in 1999, researchers were not so aware of the problems with this sort of approach to research in the face of uncertainty. As I wrote, “I’m sure these researchers were doing their best to apply the statistical tools they learned—and I can only assume that they took publication in this top journal as a signal that they were doing things right. Don’t hate the player, hate the game.” Also see the last paragraph of the above post.

    • A.L says:

      It is not a well-documented topic at all: every preregistered study of this effect fails, and the last meta-analysis (done by Witchers) said it does not exist.

  8. Andy says:

    “As we’ve known for many years (at least since the famous 2011 paper by Simmons, Nelson, and Simonsohn on researcher degrees of freedom and p-hacking) it’s easy to get apparently statistically significant p-values from pure noise. Thus, they’re offering no real evidence for their claims.”

    You are implying that the authors engaged in p-hacking, but you don’t seem to have provided evidence for this, or is your position that as long as p-hacking is a *possible* explanation for the results in a paper, then we should simply presume guilt?

    You did mention that there were lots of researcher degrees of freedom available, but is there any evidence that they took advantage of those degrees of freedom in order to p-hack a significant result (e.g. that they performed unreported comparisons, or failed to perform multiple comparison corrections, etc)?

    • Andrew says:

      Andy:

      There’s no “guilt,” especially considering that this paper was published over a decade before Simmons et al., and the authors were using standard practices! As I wrote above, “It’s bad when this approach is used on purpose (‘p-hacking’) and it’s bad when done in good faith.” Actually I prefer the term “forking paths” to “p-hacking” for exactly the reason that you say, that p-hacking sounds intentional whereas forking paths can occur with no intent. For further discussion see my paper with Loken. As we discuss in that paper, these problems arise even if the researchers reported every comparison they did. No intent is required; all that is needed is that the authors used the standard approach to data analysis, which was to go through the data and look for interesting things. The background is that when variation is high, it will be easy to find interesting-looking results from pure noise. This doesn’t mean that the substantive claims in the paper are wrong (or that they’re right); it’s just that these sorts of p-values are easy to obtain from noise, even with researchers following standard (for 1999) practice and with no hiding of comparisons or malign intent. I think that speaking of “guilt” is really the wrong thing to do here.
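
      To make the “easy to find interesting-looking results from pure noise” point concrete, here is a rough simulation sketch. It is not the paper’s actual analysis; it just assumes three noise-only groups of the reported sizes and a handful of comparisons an analyst might reasonably look at:

      ```python
      # Forking-paths sketch: pure-noise outcomes, group sizes 14/16/16 as in the
      # paper's first study, and a few comparisons an analyst might try.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(0)
      n_sims, hits = 2000, 0

      for _ in range(n_sims):
          female_id, control, asian_id = (rng.normal(size=n) for n in (14, 16, 16))
          pvals = [
              stats.ttest_ind(asian_id, female_id).pvalue,                        # high vs low salience
              stats.ttest_ind(asian_id, control).pvalue,                          # vs control
              stats.ttest_ind(female_id, control).pvalue,                         # vs control
              stats.ttest_ind(asian_id, female_id, alternative='greater').pvalue, # one-sided version
          ]
          hits += any(p < 0.05 for p in pvals)

      print(f"'p < .05' somewhere in {hits / n_sims:.0%} of pure-noise datasets")
      ```

      In runs like this the share comes out well above the nominal 5 percent, and that is with only four candidate comparisons; alternative outcome measures, subgroups, and covariates push it higher still.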

      • Andy says:

        “these problems arise even if the researchers reported every comparison they did. No intent is required; all that is needed is that the authors used the standard approach to data analysis, which was to go through the data and look for interesting things.”

        If they really went through the data to “look for interesting things”, then the implication is that they did additional *informal* comparisons or assessments, and that those were unreported. But again, you don’t seem to be offering evidence that they actually made this kind of forking-path error, as opposed to merely saying that it’s *possible* that they did so? You say it was standard practice, but even at the time researchers were well aware that adjusting their hypothesis based on the data could be problematic, even if it was more common than it should have been, and under-discussed, and possible to do it unintentionally, etc.

        “There’s no ‘guilt’… I think that speaking of ‘guilt’ is really the wrong thing to do here.”

        Well, you said the paper was “notably bad” and “bad science”, which suggests that the authors were guilty of committing a scientific error, though not necessarily intentionally. Whether we want to use the word “guilty” for this seems somewhat semantic, but “bad science” is a serious allegation, particularly if your only evidence for the forking-path claim is that they *might* have adjusted their hypothesis based on the data.

        (I’m reposting this comment, since it didn’t seem to go through last time)

        • Andy says:

          By the way, even if you are only intending a counterfactual claim that they *would* have tested a different hypothesis if the data had been different (per your link), then this seems even less reasonable as a basis for alleging “bad science” or forking-path errors, since this is pure speculation, i.e. you really don’t know what they would have done in those counterfactuals (e.g. perhaps they would have included qualifiers in that case about having shifted their initial hypothesis based on the data, etc).

          It’s one thing to say that these kinds of possibilities are a reason to be skeptical of non-preregistered studies (which is true), but it’s another to simply assert scientific errors as fact in cases where the errors are really just being assumed or speculated about. And my perception is that you often do this when critiquing alleged errors in studies. That said, if you have evidence that they actually did commit these kinds of forking-path errors, I’m curious to hear it.

        • Andrew says:

          Andy:

          Thanks for following up. These are tricky questions that have confused generations of social scientists and statisticians (including myself), so it’s good to have the opportunity to clarify. Perhaps I should write a full post on the topic. But for now I will answer briefly:

          1. Regarding how the researchers did their analyses: All things are possible but I see no evidence that the researchers decided all their analyses before seeing the data, nor do I have any reason to believe that they did so, given that (a) they did different analyses for different parts of their study, (b) they found statistically significant p-values despite having very noisy data, and (c) it was standard practice (and continues to be standard practice, outside of preregistered studies) to decide on the analyses after seeing the data. If the raw data were available, it should not be too difficult to do a multiverse analysis and look at various possible results that could’ve been found with these data.

          But the real point here is that doing analysis after seeing the data is the default. It’s what I’ve done in almost every applied project I’ve ever worked on. It’s standard practice in just about every non-preregistered study out there. There’s nothing weird about me assuming that these researchers did what just about everybody else does!

          2. Actually, though, by talking about p-hacking or forking paths or whatever, we’re focusing on the wrong thing. Let’s put it this way: Suppose the researchers had done a preregistered analysis, deciding on all their coding and analyses before seeing their data. That’s fine, they could do that—it’s not really what I’d recommend, but they could do it—but, the point is, that would not make this into a good study. Had they preregistered, the most likely result is not that they’d’ve obtained these statistically significant results; rather, the most likely result is that they would not have obtained statistical significance, and they would’ve either had to report null results or dip into post-hoc analysis. Now, there’s nothing wrong with null results—good science can lead to null results—it’s just that the study wouldn’t really be advancing our understanding of psychology.

          To put it another way, this study was dead on arrival—or, I should say, dead before data collection. It’s just too noisy a study. The signal-to-noise ratio is too low. It would be like, ummm, here’s an analogy I’ve used before . . . suppose I decide to measure the speed of light by taking a block of wood and a match, weighing them, and then I use the match to set the wood on fire and I carefully measure all the heat released by the fire and I also very carefully collect all the ash and weigh it. I can then estimate the speed of light as c = sqrt(E/m), where E is the energy released and m is the loss of mass (the mass of the original wood and match minus the mass of the ash). In practice, this will not give me a useful estimate of c, because there’s too much noise in the measurements. (Rough numbers for this analogy are sketched at the end of this comment.)

          What I’m saying is, the fundamental problem with this study is not the statistics, it’s the measurement, or, we could say, the measurement and the theory. So why talk about forking paths at all? I’m only talking about forking paths because the published p-values are presented as representing strong evidence. The point of forking paths is that this helps us understand how researchers can routinely obtain “statistically significant” p-values from such noisy data. In the period around 2010, Greg Francis, Uri Simonsohn, and other researchers wrote a lot about how this could happen. Here’s a particularly charming example from Nosek et al.

          3. Regarding the word “guilt”: I have no problem saying that the authors were doing bad science—or, to be more precise, using scientific procedures that had essentially no chance of improving our understanding—but I’d rather not call them “guilty” as if they committed a crime. Recall that honesty and transparency are not enough. And, to put it in contrapositive form, just cos someone did bad science, it doesn’t mean they were dishonest; it just means they were using methods that didn’t work. I’m not making “a serious allegation”; we’re not talking about fraud or anything; they were just unfortunately going about things wrong.

          You don’t have to be a bad person to do bad science, just as you don’t have to be a good person to do good science. Science is a product of individual researchers and research teams, and it’s also a product of society. It happens sometimes that a scientific subfield gets stuck in a bad place where bad science gets done. That’s too bad, and sometimes it takes awhile for the subfield to get unstuck. To say this is not to assign guilt or to make allegations, it’s just a description. See this paper with Simine Vazire for a discussion of how this happened in psychology.

          If we want to think in terms of meta-science, we could think of the null hypothesis significance testing paradigm as being useful in some settings and not in others. The key statistical point is that null hypothesis significance testing can work well in a high-signal, low-noise setting but not so well when signal is low and noise is high. Discussions of p-hacking and forking paths are separate from this concern. It’s important to understand p-hacking and forking paths, because these help us understand how researchers can consistently produce those magic “p less than 0.05” results, but then we need to step back and think more carefully about what is being studied and how it’s being measured.

          I didn’t say all this in the above post because I’ve said it in bits and pieces in other places over the years (for example, this paper from 2015). Not every blog post is self contained. So thanks again for giving me the opportunity to elaborate.
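
          To put rough numbers on the wood-burning analogy in point 2 (the energy density of wood, roughly 16 MJ/kg, is an assumed ballpark figure, not anything measured):

          ```python
          # Mass deficit from burning 1 kg of wood, assuming ~16 MJ/kg released.
          E = 16e6            # joules released (assumed ballpark)
          c = 3.0e8           # m/s, the value the "experiment" is trying to recover
          delta_m = E / c**2  # mass actually converted to energy

          print(f"mass deficit: {delta_m:.1e} kg")  # about 1.8e-10 kg, i.e. ~0.2 micrograms
          ```

          The “signal” is a couple hundred nanograms of missing mass, while the measurement is dominated by far larger uncertainties from escaping smoke and gases (as Adede notes below) and ordinary scale precision, so any estimate of c backed out this way is essentially all noise.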

          • Adede says:

            Your block of wood analogy would not just be noisy but biased, because mass would be lost due to smoke and CO2 leaving the burning block.

          • Andy says:

            Thanks for the detailed reply. Here are a couple of followup comments/questions:

            “doing analysis after seeing the data is the default. It’s what I’ve done in almost every applied project I’ve ever worked on. It’s standard practice in just about every non-preregistered study out there. There’s nothing weird about me assuming that these researchers did what just about everybody else does!”

            You suggested above that they engaged in (possibly unintentional) p-hacking or forking-paths, so presumably you were saying they did more than just look at the data before analysis. In particular, it’s one thing to do the analysis after seeing the data, it’s another to meaningfully adjust your analysis approach based on the data and not make any mention of this in the study (particularly if some potential approaches were passed over due to appearing less promising from the data). Are you really suggesting that it’s standard practice to inflate statistical significance with these practices, and not make any acknowledgement of it in the study (and that you do this in your own studies)?

            “To put it another way, this study was dead on arrival—or, I should say, dead before data collection. It’s just too noisy a study. The signal-to-noise ratio is too low… What I’m saying is, the fundamental problem with this study is not the statistics, it’s the measurement… I’m only talking about forking paths because the published p-values are presented as representing strong evidence.”

            Presumably if the effect size had been large enough they could have convincingly detected a stat sig effect even with this “noisy” small sample-size study, and it’s difficult to know what the effect-size is going to be before conducting a novel experiment, so I’m confused how this is a basis for calling it a “bad” study? On other hand, if they engaged in p-hacking or forking-paths then I can understand how that would be a concern, but it sounds like you’re saying these latter issues are not really your main objection to the study.

            “I’m not making “a serious allegation”; we’re not talking about fraud or anything; they were just unfortunately going about things wrong.”

            If a prominent researcher publicly accuses you of “bad science” and says your paper is “so bad”, then I’m surprised you don’t see that that’s a serious allegation, from the standpoint of the author’s reputation and status as a competent researcher. (even if it’s a bit mitigating that you’re saying these kinds of errors were more common back then).

            @jim: I noticed your reply just as I was about to post this comment. As for your selection-bias concerns, I’m not sure how big of an issue it was in this case, but the study would still be valuable even if it turned out that stereotype threat only applied to a particular subset of asian women (even if non-representative, they were still randomized into the experimental conditions). Also, I’m actually sympathetic to your prior that the effect is somewhat implausible, but that’s not really a basis for claiming the study is “bad science”.

            • Andrew says:

              Andy:

              1. When you write, “it’s one thing to do the analysis after seeing the data, it’s another to meaningfully adjust your analysis approach based on the data and not make any mention of this in the study . . . ,” your mistake is to think there is a pre-existing “analysis approach” that would be adjusted. It’s more that there’s a general hypothesis that is not precisely tied to any specific hypothesis test or set of hypothesis tests. Recall that they also compare significance to non-significance, which is itself a statistical error.

              When you write, “Are you really suggesting that it’s standard practice to inflate statistical significance with these practices, and not make any acknowledgement of it in the study (and that you do this in your own studies)?”, again, you’re making the mistake of thinking there’s a pre-existing or Platonic “statistical significance” that can be inflated. A more accurate description is that there is a pile of data which the authors see, and then they do some analyses. Some of these analyses can be anticipated ahead of time; others can’t; but unless we are told otherwise I strongly doubt there’s any pre-analysis plan. To put it another way, if they had had a specific pre-analysis plan all decided before seeing the data, I think they would’ve acknowledged this in the study.

              In my own studies, yes, I gather data and then do analyses, which include some analyses I’d planned to do ahead of time and some new analyses, but there’s no sharp dividing line, and even the analyses that I’d anticipated doing are not precisely specified beforehand. There are lots of data coding decisions too. Again, I refer you to our multiverse study which is of somebody else’s paper but which lays out just some of the many many researcher degrees of freedom that come up with real data. I’m not inflating statistical significance with these practices, because statistical significance is not the product of my analyses; I’m not doing hypothesis tests.

              Yes, had the effect size been large enough the study could’ve been reasonable. But there’s no way the effect size could realistically have been this large. This is also a point discussed by Greg Francis, Uri Simonsohn, and others, that in the presence of truly large effect sizes we’d expect to see lots of tiny p-values such as 0.0001 etc, with occasional large p-values—not the range of p-values between 0.005 and 0.05 that typically show up in social science studies. (A small simulation of this point appears at the end of this comment.) This also comes up when people attempt to replicate such studies. Again, yes, given that these studies are hopelessly noisy, it would be better to see some preregistered replications to reveal the problems (I again point you to the Nosek et al. paper on 50 shades of gray), but the real problem here is a feedback loop of poorly-thought-through hypotheses, noisy designs for data collection, and data coding and analysis procedures that allow statistical significance from any data. That third step of the loop just provides encouragement for more poorly-thought-through hypotheses, etc. There’s a reason there was a replication crisis in social psychology, and there’s a reason that the Simmons et al. paper and related work by others around that time were so important.

              2. You write, “If a prominent researcher publicly accuses you of “bad science” and says your paper is ‘so bad,’ then I’m surprised you don’t see that that’s a serious allegation, from the standpoint of the author’s reputation and status as a competent researcher.” Sure, I’ve published some mistakes. I’ve published 4 papers that were wrong enough that I published correction notices! One of these was a false theorem, another was a data coding problem that invalidated the empirical claims in the paper. The other two were smaller errors that had more minor effects on the claims in the papers. I wouldn’t say this makes me an incompetent researcher, but it does mean I made some mistakes that got published. It is what it is. The authors of the paper we’re discussing here had the bad luck to have been working in a fatally flawed research paradigm. It happens. To say this is not a condemnation of them. It’s like, ummm, I dunno, suppose there’s some subfield of cancer research studying a particular mechanism of cancer, and this subfield involves hundreds or thousands of researchers working for 20 years on this mechanism, and then it turns out that the whole thing was a dead end, that this particular phenomenon does not cause or influence cancer. For any given researcher I’d have to say, yeah, it’s too bad, but none of their ideas panned out, that they were working a nonexistent seam and there was actually no gold at all to be found on that particular mountain. But . . . that’s how science goes sometimes. In the grand scheme of things, they were doing their best. I’ve had some ideas that led nowhere. If this sort of dead end happens to represent a large chunk of someone’s career, them’s the breaks. I don’t condemn people for having this sort of bad luck. Not everyone can be at the right place at the right time when it comes to scientific progress. One of the reasons I’ve written so much on these topics is to help future researchers not get stuck in this way! And one reason I write about these older articles is to provide some historical perspective, as well as some encouragement that things are getting better. I do think it’s less likely nowadays that a top journal would publish a paper like this. Top journals make other errors (to see some examples, search this blog for “regression discontinuity”), but I do think that by recognizing these errors and looking at them carefully, we can help researchers do better and stop them from wasting so much of their time and effort.
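
              Here is the small simulation referred to in point 1 (the effect sizes are illustrative assumptions, not estimates from anyone's data): the distribution of two-sample t-test p-values when the effect is genuinely large versus when there is no effect at all.

              ```python
              # Distribution of p-values at n = 15 per group, for an assumed large
              # effect (d = 1.5) and for no effect at all.
              import numpy as np
              from scipy import stats

              rng = np.random.default_rng(1)

              def simulate_pvalues(effect, n=15, sims=5000):
                  a = rng.normal(effect, 1.0, size=(sims, n))
                  b = rng.normal(0.0, 1.0, size=(sims, n))
                  return stats.ttest_ind(a, b, axis=1).pvalue

              for label, d in [("large effect (d = 1.5)", 1.5), ("no effect (d = 0)", 0.0)]:
                  p = simulate_pvalues(d)
                  print(label,
                        f" p < .001: {np.mean(p < 0.001):.0%},",
                        f" .005 < p < .05: {np.mean((p > 0.005) & (p < 0.05)):.0%}")
              ```

              With a genuinely large effect, most replications land far below .001; with no effect, p-values are spread roughly uniformly. The .005-to-.05 window that fills so many papers is exactly the zone that forking paths can harvest from noise.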

        • jim says:

          Andy:

          I was going to respond to this earlier but got called away. Andrew pointed out many things I was going to say that he would have said :). In particular:

          “So why talk about forking paths at all? I’m only talking about forking paths because the published p-values are presented as representing strong evidence. “

          But Andrew is speaking in generalities and I’d like to highlight some of the specifics. First, we have a study that’s testing these hypotheses:

          1) when undergraduate Asian American females are reminded of the fact that they are female (and thus by implied stereotype bad at math) by some priming text, do they perform worse on math tests?

          2) when undergraduate Asian American females are reminded of the fact that they are Asian (and thus by implied stereotype good at math) by some priming text, do they perform better on math tests?

          There’s no indication of who these women are other than undergraduate Asian American women, nor where they come from, their academic background or anything else. Presumably, they’re somewhere in the upper two thirds of “ability” among undergraduate Asian American females (because they’re in college), but that’s just a guess. They could be all engineering students or all Asian American Studies majors. So there’s already huge potential for selection bias.

          That potential is amplified dramatically when we see how many of these undergraduate Asian American females participated in the various questions on the first “experiment”:

          “In the female-identity-salient condition, participants, (n = 14)”
          “In the Asian-identity-salient condition, participants (n = 16)”
          “In the control condition, participants (n = 16)”

          There were 16 or less undergraduate Asian American females in each prong of the experiment.

          They were tested by 12 problems / 20 min from the Canadian math competition for high school students. We have no information about the problems, or the relative ability of the women tested, even though they were asked to voluntarily provide their SAT math scores.

          Remember that we’re not trying to find out if undergraduate Asian American females perform better or worse on math tests. We’re trying to find out if they perform worse on math tests **because they were recently reminded of the fact that they are women**, and this reminding is presumed to automatically trigger some psychological response regarding stereotypes about women and math. And we’re trying to find out if they perform better on math tests **because they were recently reminded of the fact that they are Asian**, and this reminding is presumed to automatically trigger some psychological response regarding stereotypes about Asians and math.

          I personally believe this is a patently ridiculous idea. However, if the experimenters had carefully selected thousands of study participants to reflect a cross-section of Asian women in American society; if they had tested them all independently first with a significant math test, then later retested them with priming, using a math test of the same difficulty, I might be interested in the results.

          As it stands, however, the results hardly matter. Such a complicated psychological phenomenon as response to stereotypes tested by an insignificant number of people by a basically unknown math test with no effort to select a representative cross-section of some particular group?

          I’m not that worried about judging the individuals. But would I, for example, stake millions of dollars on some policy that depended on this study being right? Noooo.

          • Brent Hutto says:

            jim,

            In other words, it’s a research hypothesis not worth the effort of exploring on anything other than a no-cost convenience sample orders of magnitude too small to supply any meaningful evidence.

            I was vaguely aware in my undergraduate days that Psych classes routinely had to participate in little toy “studies” like this, presumably to illustrate the basics of performing experiments on human subjects. Until I started reading this blog I honestly had no idea such trivialities were actually published as though they were legitimate research.

            • jim says:

              ‘Psych classes routinely had to participate in little toy “studies” like this’

              Ha, funny you say that, I was going to suggest this “research” was probably done in a single 50 minute lecture session! :)

  9. Anonymous says:

    Andrew,

    Since this is true

    > it’s easy to get apparently statistically significant p-values from pure noise

    how do I know if my study is actually telling me something real? If I try to regularize using a prior based on the literature, that won’t help much since the literature seems to think these effects are common and large. So how do I know when to believe my own study’s results?

    • Andrew says:

      Anon:

      High-quality data is a start. It’s tough to learn from noisy measurements. If there’s a lot of variability between people, you’ll want within-person comparisons or else you’ll need to gather lots of data. Beyond that, you can replicate your study under different conditions. “When to believe” is a continuum; the more you learn, the more you can modify your understanding.
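
      A small sketch of the within-person point, with made-up numbers: when person-to-person variation is large, letting each person serve as their own control shrinks the standard error far more than a between-person comparison of the same size.

      ```python
      # Within- vs between-person comparisons when person-to-person variation is large.
      import numpy as np

      rng = np.random.default_rng(2)
      n, effect = 50, 0.2
      sigma_person, sigma_noise = 1.0, 0.3  # assumed: people differ far more than repeat measurements do

      person = rng.normal(0, sigma_person, size=n)
      control = person + rng.normal(0, sigma_noise, size=n)
      treated = person + effect + rng.normal(0, sigma_noise, size=n)

      # Within-person: person-level variation cancels in the paired difference.
      within_se = np.std(treated - control, ddof=1) / np.sqrt(n)

      # Between-person: compare the treated group to a separate control group.
      other_group = rng.normal(0, sigma_person, size=n) + rng.normal(0, sigma_noise, size=n)
      between_se = np.sqrt(np.var(treated, ddof=1) / n + np.var(other_group, ddof=1) / n)

      print(f"within-person s.e.:  {within_se:.2f}")   # roughly 0.06 here
      print(f"between-person s.e.: {between_se:.2f}")  # roughly 0.21 here
      ```

      With these made-up numbers the between-person design would need roughly ten times as many people to match the paired design’s precision, which is the “gather lots of data” tradeoff.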

    • AllanC says:

      There is a confluence of factors that affect whether or not one admits new facts / beliefs / theories into their corpus of belief. Many of these are domain- and situation-specific, so it would be impossible to sum it all up in these short comments! That said, a very common strategy employed in good science is to make “risky predictions”* and to concoct experiments to find out whether those predictions pan out. As more and more “risky” predictions result in success, one naturally (and rationally) starts to believe that they really have something going with the theory or whatever else they are testing.

      An alternative preferred by many (most?) researchers in some fields: gathering some data for some vague reason (that has multiple paths to map to the data) and checking for a direction of effect using significance. Now, which one seems more persuasive to you: A) making risky predictions and confirming that they pan out in the lab, or B) a vague collection regime combined with a flimsy significance test for direction?

      *The “risky” part meaning that the prediction is unlikely sans the theory or whatever you are testing. Note: this is covered in great detail by the writings of Paul E. Meehl. One of his best is here and I recommend it to you: https://meehl.umn.edu/sites/meehl.umn.edu/files/files/113theoreticalrisks.pdf

  10. Nigel says:

    Andrew:

    I’ve learned a lot from your posts on psychology’s reproducibility crisis and the shoddy methods that permeate so much of the psych literature. What would you say to students who hope to become productive psych researchers without making the same mistakes as people like Cuddy, Kanazawa, and the many researchers who routinely produce papers like this one? Is a thorough schooling in stats the best preparation for hopeful future scientists? The message I’ve taken away from what I’ve learned about bad science can be summed up as “Don’t do what those guys did.” What would you add to that?
