4. Our Summary Statistics Show Important Institutional Differences

The new central bank dataset contains around 15,000 hypothesis tests from 190 discussion papers that were released from 2000 to 2019 (Table 1). Around 13,000 hypothesis tests are of primary interest, meaning they represent a main result of a paper. The rest pertain to control variables. The dataset also contains other information about the papers, such as authorship, data and code availability, and journal publication.

Table 1: Summary Statistics

|  | RBA | RBNZ | Minneapolis Fed | Central banks combined | Top journals |
| --- | --- | --- | --- | --- | --- |
| Results by paper |  |  |  |  |  |
| Total number | 75 | 59 | 56 | 190 | 641 |
| Share of which (%): |  |  |  |  |  |
| Make data and code available | 15 | 0 | 14 | 10 | 46 |
| Had authors other than central bankers | 11 | 32 | 68 | 34 | na |
| Were also published in peer-reviewed journal | 28 | 31 | 64 | 39 | 100 |
| Average number of authors per paper | 2.1 | 1.8 | 2.2 | 2.0 | 2.2 |
| Results by test statistic: main |  |  |  |  |  |
| Total number | 4,901 | 5,589 | 2,569 | 13,059 | 50,078 |
| Share of which (%): |  |  |  |  |  |
| Use ‘eye catchers’ for statistical significance | 77 | 89 | 57 | 78 | 64 |
| Portrayed as ‘forward causal’ research | 67 | 68 | 93 | 73 | na |
| Disclose using data-driven model selection | 20 | 7 | 2 | 11 | na |
| Results by test statistic: control |  |  |  |  |  |
| Total number | 957 | 185 | 607 | 1,749 | 0 |

Notes: Papers that make data and code available are only those for which the central bank website contains the data and code or a link to them; we do not count cases in which code is accessible by other means. By ‘forward causal’ we mean research that studies the effects of a pre-specified cause, as per Gelman and Imbens (2013). Excluded from this category are ‘reverse causal’ research questions, which search for possible causes of an observed outcome. We also excluded a few papers that are straight forecasting work and a few instances of general equilibrium macroeconometric modelling. Data-driven model selection includes, for example, a general-to-specific variable selection strategy. ‘Eye catchers’ include stars, bold face, or statistical significance comments in the text. We use ‘Top journals’ as loose shorthand for The American Economic Review, the Journal of Political Economy, and The Quarterly Journal of Economics.

Sources: Authors' calculations; Brodeur et al (2016); Federal Reserve Bank of Minneapolis; Reserve Bank of Australia; Reserve Bank of New Zealand

The central banks released a further 540 discussion papers in this time period, but we excluded them because they did not have the right types of hypothesis tests. We required that hypothesis tests apply to single model coefficients, use t-statistics, be two-sided, and have zero nulls. We excluded 83 per cent of papers from the Minneapolis Fed, 66 per cent of papers from the RBA and 68 per cent of papers from the RBNZ. The Minneapolis Fed has the highest exclusion rate because it produced more theoretical research.
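As a minimal sketch of that screening rule, the code below applies the four eligibility conditions to a hypothetical test-level dataset. The file name, column names and `paper_id` identifier are illustrative assumptions, not the coding scheme actually used for the paper.

```python
# Minimal sketch of the eligibility screen described above.
# The file and column names are hypothetical; flag columns are assumed to be boolean.
import pandas as pd

# One row per coded hypothesis test
tests = pd.read_csv("coded_hypothesis_tests.csv")

eligible = tests[
    tests["single_coefficient"]           # applies to a single model coefficient
    & (tests["statistic_type"] == "t")    # reported as a t-statistic
    & tests["two_sided"]                  # two-sided alternative
    & (tests["null_value"] == 0)          # zero null hypothesis
]

# Papers left with no eligible tests are the ones excluded from the dataset
excluded_papers = set(tests["paper_id"]) - set(eligible["paper_id"])
print(f"{len(excluded_papers)} papers excluded; {len(eligible)} eligible tests retained")
```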

Five aspects of the summary statistics might prompt productive institutional introspection:

  1. Compared with articles in top journals, fewer central bank discussion papers were released with access to data and code. We collected this information because, when contracts allow it, providing data and code sufficient to replicate a paper's results is best practice (Christensen and Miguel 2018). The top journals all have policies requiring this form of transparency; The American Economic Review introduced its policy in 2005, the Journal of Political Economy in 2006, and The Quarterly Journal of Economics in 2016. Of the central banks, only the RBA has a similar policy for discussion papers, which it introduced in 2018.

    A benign explanation for central banks' lower research transparency levels might be the preliminary status of discussion papers. Central bank researchers might also rely more on confidential datasets.[6]

  2. The central banks show large differences in their levels of collaboration with external researchers. External collaboration was most prevalent at the Minneapolis Fed, where many of the collaborators were academics.
  3. A far larger share of Minneapolis Fed papers were published in peer-reviewed journals, consistent with its higher collaboration levels and our subjective view of differences in (informal) publication incentives.[7]
  4. The RBA and RBNZ papers emphasised statistical significance more often than the Minneapolis Fed papers or articles in top journals. The emphasis typically took the form of stars or bold-faced numbers in results tables, or comments about significance in the body text. Current American Economic Association guidelines ask authors not to use stars, and elsewhere there are calls to retire statistical significance thresholds altogether, partly to combat researcher bias (e.g. Amrhein, Greenland and McShane 2019).
  5. Staff at the RBA and RBNZ focused more on research that was reverse causal, and less on research that was forward causal, than did staff at the Minneapolis Fed. As explained by Gelman and Imbens (2013), both types of questions have important roles in the scientific process, but the distinction matters. While reverse causal questions are a natural way for policymakers to think and are important for hypothesis generation, they are also inherently ambiguous: finding evidence favouring one set of causes can never rule out others, so complete answers are impossible. Another challenge is that reverse causal research often generates its hypotheses from the same datasets on which those hypotheses are then tested, which makes it unusually prone to returning false positives.

The top journals dataset contains about 50,000 test statistics from the 641 papers that were published between 2005 and 2011 in The American Economic Review, the Journal of Political Economy and The Quarterly Journal of Economics. It includes many of the same variables as the central bank dataset, and it excludes many papers for the same reasons the central bank dataset does. A potentially important difference is that the central bank dataset covers a longer time window, to ensure an informative sample size.

Simonsohn et al (2014) advise that, to implement the hypothesis test from the p-curve method properly, the sample should contain statistically independent observations. So, for the p-curve, we use one randomly chosen result from each paper; doing so leaves us with 185 central bank observations and 623 top journal observations.[8] Simonsohn et al use far smaller sample sizes. For the z-curve, we use complete samples.
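As a rough sketch of this sampling step, the snippet below draws one main result per paper with a fixed seed, continuing from the hypothetical `eligible` table in the earlier sketch; the `is_main_result` flag and the seed values are assumptions for illustration, not our registered choices. The grouped `sample` call requires pandas 1.1 or later.

```python
# Continues from the hypothetical `eligible` table above.
# Restrict to main results (an assumption for this sketch; `is_main_result` is a hypothetical flag)
main = eligible[eligible["is_main_result"]]

# p-curve: one randomly chosen main result per paper, drawn with a fixed seed
p_curve_sample = main.groupby("paper_id").sample(n=1, random_state=2020)  # seed value is illustrative

# Robustness check along the lines of footnote [8]: redraw under several seeds and re-run the test
alternative_draws = [
    main.groupby("paper_id").sample(n=1, random_state=seed) for seed in range(10)
]

# z-curve: uses the complete sample of main results rather than one per paper
z_curve_sample = main
```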

Footnotes

Even in this case, researchers can attain high levels of transparency by releasing synthetic data. One such example is Obermeyer et al (2019). [6]

Our measures understate true publication rates because papers released towards the end of our sample window might be published in journals only after we collected the data (January 2020). This problem is unlikely to drive the relative standings, though. [7]

Our plan specifies a random seed for our pseudo-random number generator. A reader has since suggested that we test over several different random samples. We do not present the results of that exercise, because the test statistics (we used p-values) in every sample were indistinguishable from one another at two decimal places. [8]