Citations

Every empirical claim on this site links to an entry here. The list is auto-generated from the paper bibliographies. Download BibTeX

  1. Ambroise, McLachlan (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. Proceedings of the National Academy of Sciences, 99(10), 6562--6566. doi:10.1073/pnas.102102699
  2. Apicella, Isgrò (2025). Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning. Artificial Intelligence Review. doi:10.1007/s10462-025-11326-3
  3. Arlot, Celisse (2010). A Survey of Cross-Validation Procedures for Model Selection. Statistics Surveys, 4, 40--79. doi:10.1214/09-SS054
  4. Austin, Odena, Nye, Bosma, Michalewski, Dohan, Jiang, Cai, Terry, Le, Sutton (2021). Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732.
  5. Ballarin, Dellaportas, Grigoryeva, Hirt, van Huellen (2024). Reservoir Computing for Macroeconomic Forecasting with Mixed-Frequency Data. International Journal of Forecasting, 40(3), 1206--1237. doi:10.1016/j.ijforecast.2023.10.009
  6. Bates, Hastie, Tibshirani (2024). Cross-Validation: What Does It Estimate and How Well Does It Do It? Journal of the American Statistical Association, 119(546), 1434--1445. doi:10.1080/01621459.2023.2197686
  7. Baylor, Breck, Cheng, Fiedel, Foo, Haque, Haykal, Ispir, Jain, Koc, et al. (2017). TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In KDD.
  8. Becker, Recamonde-Mendoza (2025). Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models. In Brazilian Conference on Intelligent Systems (BRACIS).
  9. Bengio, Grandvalet (2004). No Unbiased Estimator of the Variance of K-Fold Cross-Validation. Journal of Machine Learning Research, 5, 1089--1105. https://jmlr.org/papers/v5/grandvalet04a.html
  10. Benjamini, Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289--300. doi:10.1111/j.2517-6161.1995.tb02031.x
  11. Bergstra, Bengio (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281--305. https://jmlr.org/papers/v13/bergstra12a.html
  12. Bertin (1967). Sémiologie Graphique. Mouton/Gauthier-Villars.
  13. Binder, Pfisterer, Lang, Schneider, Kotthoff, Bischl (2021). mlr3pipelines: Flexible Machine Learning Pipelines in R. Journal of Machine Learning Research, 22(184), 1--7. https://jmlr.org/papers/v22/21-0206.html
  14. Bischl, Binder, Lang, Pielok, Richter, Coors, Thomas, Ullmann, Becker, Boulesteix, Deng, Lindauer (2023). Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges. WIREs Data Mining and Knowledge Discovery, 13(2), e1484. doi:10.1002/widm.1484
  15. Blum, Hardt (2015). The Ladder: A Reliable Leaderboard for Machine Learning Competitions. In Proceedings of the 32nd International Conference on Machine Learning (ICML). https://arxiv.org/abs/1502.04585
  16. Bousquet, Elisseeff (2002). Stability and Generalization. Journal of Machine Learning Research, 2, 499--526. https://jmlr.org/papers/v2/bousquet02a.html
  17. Buitinck, Louppe, Blondel, Pedregosa, Mueller, Grisel, Niculae, Prettenhofer, Gramfort, Grobler, Layton, VanderPlas, Joly, Holt, Varoquaux (2013). API Design for Machine Learning Software: Experiences from the scikit-learn Project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. https://arxiv.org/abs/1309.0238
  18. Cawley, Talbot (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079--2107. https://jmlr.org/papers/v11/cawley10a.html
  19. Chawla, Bowyer, Hall, Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321--357. doi:10.1613/jair.953
  20. Chen, Tworek, Jun, Yuan, Pinto, Kaplan, Edwards, Burda, Joseph, Brockman, et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
  21. Chomsky (1957). Syntactic Structures. Mouton.
  22. Codd (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377--387. doi:10.1145/362384.362685
  23. Dietterich (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895--1923. doi:10.1162/089976698300017197
  24. Drobnjaković (2024). Abstract Interpretation for Data Leakage Detection in Machine Learning Pipelines. In Theoretical Aspects of Software Engineering (TASE 2024). doi:10.1007/978-3-031-64626-3_7
  25. Drobnjaković (2025). Static Analysis by Abstract Interpretation Against Data Leakage in Machine Learning. Science of Computer Programming. doi:10.1016/j.scico.2025.103338
  26. Drori, Krishnamurthy, Rampin, de Paula Lourenco (2021). AlphaD3M: Machine Learning Pipeline Synthesis.
  27. Du, Liu, Wang, Wang, Liu, Chen, Feng, Sha, Peng, Lou (2023). ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-Level Code Generation. arXiv preprint arXiv:2308.01861.
  28. Dwork, Feldman, Hardt, Pitassi, Reingold, Roth (2015). The Reusable Holdout: Preserving Validity in Adaptive Data Analysis. Science, 349(6248), 636--638. doi:10.1126/science.aaa9375
  29. Gumbel (1958). Statistics of Extremes. Columbia University Press.
  30. Guyon, Elisseeff (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157--1182. https://jmlr.org/papers/v3/guyon03a.html
  31. Hastie, Tibshirani, Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. https://hastie.su.domains/ElemStatLearn/
  32. Jesse, Ahmed, Devanbu, Morgan (2023). Large Language Models and Simple, Stupid Bugs. arXiv preprint arXiv:2303.11455.
  33. Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770.
  34. Kapoor, Narayanan (2022). Leakage and the Reproducibility Crisis in ML-Based Science. arXiv preprint arXiv:2207.07048.
  35. Kapoor, Narayanan (2023). Leakage and the Reproducibility Crisis in Machine-Learning-Based Science. Patterns, 4(9), 100804. doi:10.1016/j.patter.2023.100804
  36. Kapoor, Cantrell, Peng, Pham, Bail, Gundersen, Hofman, Hullman, Lones, Malik, Nanayakkara, Poldrack, Raji, Roberts, Salganik, Serra-Garcia, Stewart, Vandewiele, Narayanan (2024). REFORMS: Consensus-Based Recommendations for Machine-Learning-Based Science. Science Advances, 10(18), eadk3452. doi:10.1126/sciadv.adk3452
  37. Kapoor, Narayanan (2025). Leakage and the Reproducibility Crisis in ML-Based Science. https://reproducible.cs.princeton.edu
  38. Kaufman, Rosset, Perlich, Stitelman (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 1--21. doi:10.1145/2382577.2382579
  39. Kruschke (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270--280. doi:10.1177/2515245918771304
  40. Kuhn, Silge (2022). Tidy Modeling with R. O'Reilly Media. https://www.tmwr.org
  41. Lakens (2013). Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs. Frontiers in Psychology, 4, 863. doi:10.3389/fpsyg.2013.00863
  42. Li, Choi, Chung, Kushman, Schrittwieser, Leblond, Eccles, Keeling, Gimeno, Dal Lago, et al. (2022). Competition-Level Code Generation with AlphaCode. Science, 378(6624).
  43. Lones (2024). Avoiding Common Machine Learning Pitfalls. Patterns, 5(10), 101046. doi:10.1016/j.patter.2024.101046
  44. Myers (1999). JFlow: Practical Mostly-Static Information Flow Control. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). doi:10.1145/292540.292561
  45. Nadeau, Bengio (2003). Inference for the Generalization Error. Machine Learning, 52(3), 239--281. doi:10.1023/A:1024068626366
  46. Olsson, Elhage, Nanda, Joseph, DasSarma, Henighan, Mann, Askell, Bai, Chen, et al. (2022). In-Context Learning and Induction Heads. Transformer Circuits Thread.
  47. Pearce, Ahmad, Tan, Dolan-Gavitt, Karri (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In IEEE Symposium on Security and Privacy.
  48. Pedregosa, Varoquaux (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825--2830. https://jmlr.org/papers/v12/pedregosa11a.html
  49. Pierce (2002). Types and Programming Languages. MIT Press.
  50. Raschka (2020). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv preprint arXiv:1811.12808. doi:10.48550/arXiv.1811.12808
  51. Riley, Snell, Ensor, Burke, Harrell, Moons, Collins (2019). Minimum Sample Size for Developing a Multivariable Prediction Model: Part I. Statistics in Medicine, 38(7), 1262--1275. doi:10.1002/sim.7993
  52. Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort (2017). Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure. Ecography, 40, 913--929. doi:10.1111/ecog.02881
  53. Romano, Le, La Cava, Gregg, Goldberg, Chakraborty, Ray, Himmelstein, Fu, Moore (2021). PMLB v1.0: An Open-Source Dataset Collection for Benchmarking Machine Learning Methods. Bioinformatics, 38(3), 878--880. doi:10.1093/bioinformatics/btab347
  54. Rosenblatt, Tejavibulya, Jiang, Noble, Scheinost (2024). Data Leakage Inflates Prediction Performance in Connectome-Based Machine Learning Models. Nature Communications, 15, 1829. doi:10.1038/s41467-024-46150-w
  55. Roth (2022). Biased Machines in the Realm of Politics. Universität Konstanz. https://kops.uni-konstanz.de/handle/123456789/59732
  56. Roth (2026). A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time. doi:10.5281/zenodo.19406355
  57. Roth (2026). Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets. doi:10.5281/zenodo.19406148
  58. Roth (2026). The Shortest Path Leaks: How LLM.
  59. Sasse, Nicolaisen-Sobesky, Dukart, Eickhoff, Gotz, Hamdan, Komeyer, Kulkarni, Lahnakoski, Love, Raimondo, Patil (2025). Overview of Leakage Scenarios in Supervised Machine Learning. Journal of Big Data, 12, 41. doi:10.1186/s40537-025-01193-8
  60. Scriven (1967). The Methodology of Evaluation. In Perspectives of Curriculum Evaluation.
  61. Simpson (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society: Series B (Methodological), 13(2), 238--241. doi:10.1111/j.2517-6161.1951.tb00088.x
  62. Smith, Sala, Kanter, Veeramachaneni (2020). An Interactive Pipeline for Cross-Domain AutoML. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. doi:10.1145/3318464.3384693
  63. Stone (1974). Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2), 111--133. doi:10.1111/j.2517-6161.1974.tb00994.x
  64. Strom, Yemini (1986). Typestate: A Programming Language Concept for Enhancing Software Reliability. IEEE Transactions on Software Engineering, SE-12(1), 157--171. doi:10.1109/TSE.1986.6312929
  65. Tampu, Eklund, Haj-Hosseini (2022). Inflation of Test Accuracy Due to Data Leakage in Deep Learning-Based Classification of OCT Images. Scientific Data, 9, 580. doi:10.1038/s41597-022-01618-6
  66. Truong, Zhang, Marchareddy, Lee, Busold, Socas, AlOmar (2025). LeakageDetector.
  67. Tsamardinos, Greasidou, Borboudakis (2018). Bootstrapping the Out-of-Sample Predictions for Efficient and Accurate Cross-Validation. Machine Learning, 107(12), 1895--1922. doi:10.1007/s10994-018-5714-4
  68. Valavi, Elith, Lahoz-Monfort, Guillera-Arroita (2019). blockCV: An R Package for Generating Spatially or Environmentally Separated Folds for k-Fold Cross-Validation of Species Distribution Models. Methods in Ecology and Evolution, 10, 225--232. doi:10.1111/2041-210X.13107
  69. van de Mortel (2025). Data Leakage in Machine Learning Studies Creep into Meta-Analytic Estimates. Molecular Psychiatry. doi:10.1038/s41380-025-03336-y
  70. van der Ploeg (2014). Modern Modelling Techniques Are Data Hungry: A Simulation Study for Predicting Dichotomous Endpoints. BMC Medical Research Methodology, 14, 137. doi:10.1186/1471-2288-14-137
  71. Vandewiele, Dehaene, Kovács, et al. (2021). Overly Optimistic Prediction Results on Imbalanced Data: A Case Study of Flaws and Benefits when Applying Over-Sampling. Artificial Intelligence in Medicine, 111, 101987. doi:10.1016/j.artmed.2020.101987
  72. Vanschoren, van Rijn, Bischl, Torgo (2013). OpenML: Networked Science in Machine Learning. ACM SIGKDD Explorations Newsletter, 15(2), 49--60. doi:10.1145/2641190.2641198
  73. Varma, Simon (2006). Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics, 7, 91. doi:10.1186/1471-2105-7-91
  74. Varoquaux (2018). Cross-validation Failure: Small Sample Sizes Lead to Large Error Bars. NeuroImage, 180, 68--77. doi:10.1016/j.neuroimage.2017.06.061
  75. Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
  76. Wickham (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3--28. doi:10.1198/jcgs.2009.07098
  77. Wilkinson (1999). The Grammar of Graphics. Springer.
  78. Yang, Brower-Sinning, Lewis, Kaestner (2022). Data Leakage in Notebooks: Static Detection and Better Processes. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE). doi:10.1145/3551349.3556918
  79. Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.