Citations
Every empirical claim on this site links here. Auto-generated from the paper bibliographies.
- Ambroise, McLachlan (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. Proceedings of the National Academy of Sciences, 99(10), 6562--6566. doi:10.1073/pnas.102102699
- Arlot, Celisse (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys, 4, 40--79. doi:10.1214/09-SS054
- Austin, Odena, Nye, Bosma, Michalewski, Dohan, Jiang, Cai, Terry, Le, Sutton (2021). Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
- Ballarin, Dellaportas, Grigoryeva, Hirt, van Huellen (2024). Reservoir Computing for Macroeconomic Forecasting with Mixed-Frequency Data. International Journal of Forecasting, 40(3), 1206--1237. doi:10.1016/j.ijforecast.2023.10.009
- Bates, Hastie, Tibshirani (2024). Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 119(546), 1434--1445. doi:10.1080/01621459.2023.2197686
- Baylor, Breck, Cheng, Fiedel, Foo, Haque, Haykal, Ispir, Jain, Koc, others (2017). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In KDD.
- Becker, Recamonde-Mendoza (2025). Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models. In Brazilian Conference on Intelligent Systems (BRACIS).
- Bengio, Grandvalet (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 5, 1089--1105. https://jmlr.org/papers/v5/grandvalet04a.html
- Benjamini, Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289--300. doi:10.1111/j.2517-6161.1995.tb02031.x
- Bergstra, Bengio (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281--305. https://jmlr.org/papers/v13/bergstra12a.html
- Bertin (1967). Sémiologie Graphique. Mouton/Gauthier-Villars.
- Binder, Pfisterer, Lang, Schneider, Kotthoff, Bischl (2021). mlr3pipelines: Flexible Machine Learning Pipelines in R. Journal of Machine Learning Research, 22(184), 1--7. https://jmlr.org/papers/v22/21-0206.html
- Bischl, Binder, Lang, Pielok, Richter, Coors, Thomas, Ullmann, Becker, Boulesteix, Deng, Lindauer (2023). Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges. WIREs Data Mining and Knowledge Discovery, 13(2), e1484. doi:10.1002/widm.1484
- Blum, Hardt (2015). The Ladder: A Reliable Leaderboard for Machine Learning Competitions. In Proceedings of the 32nd International Conference on Machine Learning (ICML). https://arxiv.org/abs/1502.04585
- Bousquet, Elisseeff (2002). Stability and Generalization. Journal of Machine Learning Research, 2, 499--526. https://jmlr.org/papers/v2/bousquet02a.html
- Buitinck, Louppe, Blondel, Pedregosa, Mueller, Grisel, Niculae, Prettenhofer, Gramfort, Grobler, Layton, VanderPlas, Joly, Holt, Varoquaux (2013). API Design for Machine Learning Software: Experiences from the scikit-learn Project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. https://arxiv.org/abs/1309.0238
- Cawley, Talbot (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079--2107. https://jmlr.org/papers/v11/cawley10a.html
- Chawla, Bowyer, Hall, Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321--357. doi:10.1613/jair.953
- Chen, Tworek, Jun, Yuan, Pinto, Kaplan, Edwards, Burda, Joseph, Brockman, others (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
- Chomsky (1957). Syntactic Structures. Mouton.
- Codd (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377--387. doi:10.1145/362384.362685
- Dietterich (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895--1923. doi:10.1162/089976698300017197
- Drobnjaković (2024). Abstract Interpretation for Data Leakage Detection in Machine Learning Pipelines. In Theoretical Aspects of Software Engineering (TASE 2024). doi:10.1007/978-3-031-64626-3_7
- Drobnjaković (2025). Static Analysis by Abstract Interpretation Against Data Leakage in Machine Learning. Science of Computer Programming. doi:10.1016/j.scico.2025.103338
- Drori, Krishnamurthy, Rampin, de Paula Lourenco (2021). AlphaD3M.
- Du, Liu, Wang, Wang, Liu, Chen, Feng, Sha, Peng, Lou (2023). ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. arXiv preprint arXiv:2308.01861.
- Dwork, Feldman, Hardt, Pitassi, Reingold, Roth (2015). The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science, 349(6248), 636--638. https://www.science.org/doi/10.1126/science.aaa9375
- Gumbel (1958). Statistics of Extremes. Columbia University Press.
- Guyon, Elisseeff (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157--1182. https://jmlr.org/papers/v3/guyon03a/guyon03a.pdf
- Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://hastie.su.domains/ElemStatLearn/
- Jesse, Ahmed, Devanbu, Morgan (2023). Large Language Models and Simple, Stupid Bugs. arXiv preprint arXiv:2303.11455.
- Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.
- Kapoor, Narayanan (2022). Leakage and the Reproducibility Crisis in ML. arXiv:2207.07048
- Kapoor, Narayanan (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9), 100804. doi:10.1016/j.patter.2023.100804
- Kapoor, Cantrell, Peng, Pham, Bail, Gundersen, Hofman, Hullman, Lones, Malik, Nanayakkara, Poldrack, Raji, Roberts, Salganik, Serra-Garcia, Stewart, Vandewiele, Narayanan (2024). REFORMS: Consensus-Based Recommendations for Machine-Learning-Based Science. Science Advances, 10(18), eadk3452. doi:10.1126/sciadv.adk3452
- Kapoor, Narayanan (2025). Leakage and the Reproducibility Crisis in ML. https://reproducible.cs.princeton.edu
- Kaufman, Rosset, Perlich, Stitelman (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 1--21. doi:10.1145/2382577.2382579
- Kruschke (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270--280. doi:10.1177/2515245918771304
- Kuhn, Silge (2022). Tidy Modeling with R. O'Reilly Media. https://www.tmwr.org
- Lakens (2013). Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs. Frontiers in Psychology, 4, 863. doi:10.3389/fpsyg.2013.00863
- Li, Choi, Chung, Kushman, Schrittwieser, Leblond, Eccles, Keeling, Gimeno, Dal Lago, others (2022). Competition-Level Code Generation with AlphaCode. Science, 378(6624).
- Lones (2024). Avoiding Common Machine Learning Pitfalls. Patterns, 5(10), 101046. doi:10.1016/j.patter.2024.101046
- Myers (1999). JFlow: Practical Mostly-Static Information Flow Control. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). doi:10.1145/292540.292561
- Nadeau, Bengio (2003). Inference for the Generalization Error. Machine Learning, 52(3), 239--281. doi:10.1023/A:1024068626366
- Olsson, Elhage, Nanda, Joseph, DasSarma, Henighan, Mann, Askell, Bai, Chen, others (2022). In-Context Learning and Induction Heads. Transformer Circuits Thread.
- Pearce, Ahmad, Tan, Dolan-Gavitt, Karri (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot. In IEEE Symposium on Security and Privacy.
- Pedregosa, Varoquaux (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825--2830. https://jmlr.org/papers/v12/pedregosa11a.html
- Pierce (2002). Types and Programming Languages. MIT Press.
- Raschka (2020). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv preprint arXiv:1811.12808. doi:10.48550/arXiv.1811.12808
- Riley, Snell, Ensor, Burke, Harrell, Moons, Collins (2019). Minimum Sample Size for Developing a Multivariable Prediction Model: Part I. Statistics in Medicine, 38(7), 1262--1275. doi:10.1002/sim.7993
- Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40, 913--929. https://nsojournals.onlinelibrary.wiley.com/doi/10.1111/ecog.02881
- Romano, Le, La Cava, Gregg, Goldberg, Chakraborty, Ray, Himmelstein, Fu, Moore (2021). PMLB v1.0: An Open-Source Dataset Collection for Benchmarking Machine Learning Methods. Bioinformatics, 38(3), 878--880. doi:10.1093/bioinformatics/btab347
- Rosenblatt, Tejavibulya, Jiang, Noble, Scheinost (2024). Data Leakage Inflates Prediction Performance in Connectome-Based Machine Learning Models. Nature Communications, 15, 1829. doi:10.1038/s41467-024-46150-w
- Roth (2022). Biased Machines in the Realm of Politics. Universität Konstanz. https://kops.uni-konstanz.de/handle/123456789/59732
- Roth (2026). A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time. doi:10.5281/zenodo.19406355
- Roth (2026). Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets. doi:10.5281/zenodo.19406148
- Roth (2026). The Shortest Path Leaks: How LLM.
- Sasse, Nicolaisen-Sobesky, Dukart, Eickhoff, Gotz, Hamdan, Komeyer, Kulkarni, Lahnakoski, Love, Raimondo, Patil (2025). Overview of Leakage Scenarios in Supervised Machine Learning. Journal of Big Data, 12, 41. doi:10.1186/s40537-025-01193-8
- Scriven (1967). The Methodology of Evaluation. In Perspectives of Curriculum Evaluation.
- Simpson (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2), 238--241. doi:10.1111/j.2517-6161.1951.tb00088.x
- Smith, Sala, Kanter, Veeramachaneni (2020). An Interactive Pipeline for Cross-Domain AutoML. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. doi:10.1145/3318464.3384693
- Stone (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2), 111--133. doi:10.1111/j.2517-6161.1974.tb00994.x
- Strom, Yemini (1986). Typestate: A Programming Language Concept for Enhancing Software Reliability. IEEE Transactions on Software Engineering, SE-12(1), 157--171. doi:10.1109/TSE.1986.6312929
- Tampu, Eklund, Haj-Hosseini (2022). Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT. Scientific Data, 9, 580. doi:10.1038/s41597-022-01618-6
- Truong, Zhang, Marchareddy, Lee, Busold, Socas, AlOmar (2025). LeakageDetector.
- Tsamardinos, Greasidou, Borboudakis (2018). Bootstrapping the Out-of-Sample Predictions for Efficient and Accurate Cross-Validation. Machine Learning, 107(12), 1895--1922. doi:10.1007/s10994-018-5714-4
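- Valavi, Elith, Lahoz-Monfort (2019). blockCV: An R Package for Generating Spatially or Environmentally Separated Folds for k-Fold Cross-Validation of Species Distribution Models. Methods in Ecology and Evolution, 10, 225--232. https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13107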
- van de Mortel (2025). Data Leakage in Machine Learning Studies Creep into Meta-Analytic Estimates. Molecular Psychiatry. doi:10.1038/s41380-025-03336-y
- van der Ploeg (2014). Modern Modelling Techniques Are Data Hungry: A Simulation Study for Predicting Dichotomous Endpoints. BMC Medical Research Methodology, 14, 137. doi:10.1186/1471-2288-14-137
- Vandewiele, Dehaene, Kovács, others (2021). Overly Optimistic Prediction Results on Imbalanced Data: A Case Study of Flaws and Benefits when Applying Over-Sampling. Artificial Intelligence in Medicine, 111, 101987. doi:10.1016/j.artmed.2020.101987
- Vanschoren, van Rijn, Bischl, Torgo (2013). OpenML: Networked Science in Machine Learning. ACM SIGKDD Explorations Newsletter, 15(2), 49--60. doi:10.1145/2641190.2641198
- Varma, Simon (2006). Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics, 7, 91. doi:10.1186/1471-2105-7-91
- Varoquaux (2018). Cross-validation Failure: Small Sample Sizes Lead to Large Error Bars. NeuroImage, 180, 68--77. doi:10.1016/j.neuroimage.2017.06.061
- Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
- Wickham (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3--28. doi:10.1198/jcgs.2009.07098
- Wilkinson (1999). The Grammar of Graphics. Springer.
- Yang, Brower-Sinning, Lewis, Kaestner (2022). Data Leakage in Notebooks: Static Detection and Better Processes. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1145/3551349.3556918
- Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan (2024). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.