Citations

Every empirical claim on this site links to an entry here. The list is auto-generated from the paper bibliographies. Download BibTeX

  1. Ambroise, McLachlan (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. Proceedings of the National Academy of Sciences, 99(10), 6562--6566. doi:10.1073/pnas.102102699
  2. Apicella, Isgrò (2025). Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning. Artificial Intelligence Review. doi:10.1007/s10462-025-11326-3
  3. Arlot, Celisse (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys, 4, 40--79. doi:10.1214/09-SS054
  4. Austin, Odena, Nye, Bosma, Michalewski, Dohan, Jiang, Cai, Terry, Le, Sutton (2021). Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
  5. Ballarin, Dellaportas, Grigoryeva, Hirt, van Huellen (2024). Reservoir Computing for Macroeconomic Forecasting with Mixed-Frequency Data. International Journal of Forecasting, 40(3), 1206--1237. doi:10.1016/j.ijforecast.2023.10.009
  6. Bates, Hastie, Tibshirani (2024). Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 119(546), 1434--1445. doi:10.1080/01621459.2023.2197686
  7. Baylor, Breck, Cheng, Fiedel, Foo, Haque, Haykal, Ispir, Jain, Koc, et al. (2017). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In KDD.
  8. Becker, Recamonde-Mendoza (2025). Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models. In Brazilian Conference on Intelligent Systems (BRACIS).
  9. Bengio, Grandvalet (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 5, 1089--1105. https://jmlr.org/papers/v5/grandvalet04a.html
  10. Benjamini, Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289--300. doi:10.1111/j.2517-6161.1995.tb02031.x
  11. Bergstra, Bengio (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281--305. https://jmlr.org/papers/v13/bergstra12a.html
  12. Bertin (1967). Sémiologie Graphique. Mouton/Gauthier-Villars.
  13. Binder, Pfisterer, Lang, Schneider, Kotthoff, Bischl (2021). mlr3pipelines: Flexible Machine Learning Pipelines in R. Journal of Machine Learning Research, 22(184), 1--7. https://jmlr.org/papers/v22/21-0206.html
  14. Bischl, Binder, Lang, Pielok, Richter, Coors, Thomas, Ullmann, Becker, Boulesteix, Deng, Lindauer (2023). Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges. WIREs Data Mining and Knowledge Discovery, 13(2), e1484. doi:10.1002/widm.1484
  15. Blum, Hardt (2015). The Ladder: A Reliable Leaderboard for Machine Learning Competitions. In Proceedings of the 32nd International Conference on Machine Learning (ICML). https://arxiv.org/abs/1502.04585
  16. Bousquet, Elisseeff (2002). Stability and Generalization. Journal of Machine Learning Research, 2, 499--526. https://jmlr.org/papers/v2/bousquet02a.html
  17. Buitinck, Louppe, Blondel, Pedregosa, Mueller, Grisel, Niculae, Prettenhofer, Gramfort, Grobler, Layton, VanderPlas, Joly, Holt, Varoquaux (2013). API Design for Machine Learning Software: Experiences from the scikit-learn Project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. https://arxiv.org/abs/1309.0238
  18. Cawley, Talbot (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079--2107. https://jmlr.org/papers/v11/cawley10a.html
  19. Chawla, Bowyer, Hall, Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321--357. doi:10.1613/jair.953
  20. Chen, Tworek, Jun, Yuan, Pinto, Kaplan, Edwards, Burda, Joseph, Brockman, et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
  21. Chomsky (1957). Syntactic Structures. Mouton.
  22. Codd (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377--387. doi:10.1145/362384.362685
  23. Dietterich (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895--1923. doi:10.1162/089976698300017197
  24. Drobnjaković (2024). Abstract Interpretation for Data Leakage Detection in Machine Learning Pipelines. In Theoretical Aspects of Software Engineering (TASE 2024). doi:10.1007/978-3-031-64626-3_7
  25. Drobnjaković (2025). Static Analysis by Abstract Interpretation Against Data Leakage in Machine Learning. Science of Computer Programming. doi:10.1016/j.scico.2025.103338
  26. Drori, Krishnamurthy, Rampin, de Paula Lourenco (2021). AlphaD3M: Machine Learning Pipeline Synthesis.
  27. Du, Liu, Wang, Wang, Liu, Chen, Feng, Sha, Peng, Lou (2023). ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-Level Code Generation. arXiv preprint arXiv:2308.01861.
  28. Dwork, Feldman, Hardt, Pitassi, Reingold, Roth (2015). The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science, 349(6248), 636--638. doi:10.1126/science.aaa9375
  29. Gumbel (1958). Statistics of Extremes. Columbia University Press.
  30. Guyon, Elisseeff (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157--1182. https://jmlr.org/papers/v3/guyon03a.html
  31. Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://hastie.su.domains/ElemStatLearn/
  32. Jesse, Ahmed, Devanbu, Morgan (2023). Large Language Models and Simple, Stupid Bugs. arXiv preprint arXiv:2303.11455.
  33. Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.
  34. Kapoor, Narayanan (2022). Leakage and the Reproducibility Crisis in ML-Based Science. arXiv preprint arXiv:2207.07048.
  35. Kapoor, Narayanan (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9), 100804. doi:10.1016/j.patter.2023.100804
  36. Kapoor, Cantrell, Peng, Pham, Bail, Gundersen, Hofman, Hullman, Lones, Malik, Nanayakkara, Poldrack, Raji, Roberts, Salganik, Serra-Garcia, Stewart, Vandewiele, Narayanan (2024). REFORMS: Consensus-Based Recommendations for Machine-Learning-Based Science. Science Advances, 10(18), eadk3452. doi:10.1126/sciadv.adk3452
  37. Kapoor, Narayanan (2025). Leakage and the Reproducibility Crisis in ML-Based Science. https://reproducible.cs.princeton.edu
  38. Kaufman, Rosset, Perlich, Stitelman (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 1--21. doi:10.1145/2382577.2382579
  39. Kruschke (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270--280. doi:10.1177/2515245918771304
  40. Kuhn, Silge (2022). Tidy Modeling with R. O'Reilly Media. https://www.tmwr.org
  41. Lakens (2013). Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs. Frontiers in Psychology, 4, 863. doi:10.3389/fpsyg.2013.00863
  42. Li, Choi, Chung, Kushman, Schrittwieser, Leblond, Eccles, Keeling, Gimeno, Dal Lago, et al. (2022). Competition-Level Code Generation with AlphaCode. Science, 378(6624).
  43. Lones (2024). Avoiding Common Machine Learning Pitfalls. Patterns, 5(10), 101046. doi:10.1016/j.patter.2024.101046
  44. Myers (1999). JFlow: Practical Mostly-Static Information Flow Control. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). doi:10.1145/292540.292561
  45. Nadeau, Bengio (2003). Inference for the Generalization Error. Machine Learning, 52(3), 239--281. doi:10.1023/A:1024068626366
  46. Olsson, Elhage, Nanda, Joseph, DasSarma, Henighan, Mann, Askell, Bai, Chen, et al. (2022). In-Context Learning and Induction Heads. Transformer Circuits Thread.
  47. Pearce, Ahmad, Tan, Dolan-Gavitt, Karri (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In IEEE Symposium on Security and Privacy.
  48. Pedregosa, Varoquaux (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825--2830. https://jmlr.org/papers/v12/pedregosa11a.html
  49. Pierce (2002). Types and Programming Languages. MIT Press.
  50. Raschka (2020). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv preprint arXiv:1811.12808. doi:10.48550/arXiv.1811.12808
  51. Riley, Snell, Ensor, Burke, Harrell, Moons, Collins (2019). Minimum Sample Size for Developing a Multivariable Prediction Model: Part I. Statistics in Medicine, 38(7), 1262--1275. doi:10.1002/sim.7993
  52. Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort (2017). Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure. Ecography, 40, 913--929. doi:10.1111/ecog.02881
  53. Romano, Le, La Cava, Gregg, Goldberg, Chakraborty, Ray, Himmelstein, Fu, Moore (2021). PMLB v1.0: An Open-Source Dataset Collection for Benchmarking Machine Learning Methods. Bioinformatics, 38(3), 878--880. doi:10.1093/bioinformatics/btab347
  54. Rosenblatt, Tejavibulya, Jiang, Noble, Scheinost (2024). Data Leakage Inflates Prediction Performance in Connectome-Based Machine Learning Models. Nature Communications, 15, 1829. doi:10.1038/s41467-024-46150-w
  55. Roth (2022). Biased Machines in the Realm of Politics. Universität Konstanz. https://kops.uni-konstanz.de/handle/123456789/59732
  56. Roth (2026). A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time. doi:10.5281/zenodo.19406355
  57. Roth (2026). Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets. doi:10.5281/zenodo.19406148
  58. Roth (2026). The Shortest Path Leaks: How LLM.
  59. Sasse, Nicolaisen-Sobesky, Dukart, Eickhoff, Gotz, Hamdan, Komeyer, Kulkarni, Lahnakoski, Love, Raimondo, Patil (2025). Overview of Leakage Scenarios in Supervised Machine Learning. Journal of Big Data, 12, 41. doi:10.1186/s40537-025-01193-8
  60. Scriven (1967). The Methodology of Evaluation. In Perspectives of Curriculum Evaluation.
  61. Simpson (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2), 238--241. doi:10.1111/j.2517-6161.1951.tb00088.x
  62. Smith, Sala, Kanter, Veeramachaneni (2020). An Interactive Pipeline for Cross-Domain AutoML. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. doi:10.1145/3318464.3384693
  63. Stone (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2), 111--133. doi:10.1111/j.2517-6161.1974.tb00994.x
  64. Strom, Yemini (1986). Typestate: A Programming Language Concept for Enhancing Software Reliability. IEEE Transactions on Software Engineering, SE-12(1), 157--171. doi:10.1109/TSE.1986.6312929
  65. Tampu, Eklund, Haj-Hosseini (2022). Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT Images. Scientific Data, 9, 580. doi:10.1038/s41597-022-01618-6
  66. Truong, Zhang, Marchareddy, Lee, Busold, Socas, AlOmar (2025). LeakageDetector.
  67. Tsamardinos, Greasidou, Borboudakis (2018). Bootstrapping the Out-of-Sample Predictions for Efficient and Accurate Cross-Validation. Machine Learning, 107(12), 1895--1922. doi:10.1007/s10994-018-5714-4
  68. Valavi, Elith, Lahoz-Monfort, Guillera-Arroita (2019). blockCV: An R Package for Generating Spatially or Environmentally Separated Folds for k-Fold Cross-Validation of Species Distribution Models. Methods in Ecology and Evolution, 10, 225--232. doi:10.1111/2041-210X.13107
  69. van de Mortel (2025). Data Leakage in Machine Learning Studies Creep into Meta-Analytic Estimates. Molecular Psychiatry. doi:10.1038/s41380-025-03336-y
  70. van der Ploeg (2014). Modern Modelling Techniques Are Data Hungry: A Simulation Study for Predicting Dichotomous Endpoints. BMC Medical Research Methodology, 14, 137. doi:10.1186/1471-2288-14-137
  71. Vandewiele, Dehaene, Kovács, et al. (2021). Overly Optimistic Prediction Results on Imbalanced Data: A Case Study of Flaws and Benefits when Applying Over-Sampling. Artificial Intelligence in Medicine, 111, 101987. doi:10.1016/j.artmed.2020.101987
  72. Vanschoren, van Rijn, Bischl, Torgo (2013). OpenML: Networked Science in Machine Learning. ACM SIGKDD Explorations Newsletter, 15(2), 49--60. doi:10.1145/2641190.2641198
  73. Varma, Simon (2006). Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics, 7, 91. doi:10.1186/1471-2105-7-91
  74. Varoquaux (2018). Cross-validation Failure: Small Sample Sizes Lead to Large Error Bars. NeuroImage, 180, 68--77. doi:10.1016/j.neuroimage.2017.06.061
  75. Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
  76. Wickham (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3--28. doi:10.1198/jcgs.2009.07098
  77. Wilkinson (1999). The Grammar of Graphics. Springer.
  78. Yang, Brower-Sinning, Lewis, Kaestner (2022). Data Leakage in Notebooks: Static Detection and Better Processes. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1145/3551349.3556918
  79. Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.