Conclusion#

This paper presents ten rules for leveraging AI coding tools effectively in scientific computing while maintaining methodological rigor and code quality. These rules are organized around four key themes: preparation and understanding, context engineering, testing and validation, and code quality assurance. However, we acknowledge a fundamental reality from our experience: even when following these rules, flawless start-to-finish interactions are the exception rather than the norm. The value of these rules lies not in guaranteeing immediate perfection, but in providing a framework that helps you focus on what matters most for successful interactions while also enabling you to quickly diagnose what went wrong when interactions fail, so you can iterate more effectively on your next attempt.

Ethical Considerations and Responsibility#

The use of AI-assisted coding raises fundamental questions about scientific accountability. When code that generates published results is partly AI-generated, who bears responsibility for errors, methodological flaws, or irreproducible outcomes? The answer must be unequivocal: the scientist. AI tools are instruments, and like any instrument in science, the researcher using them remains fully accountable for validating their outputs and ensuring methodological soundness. This responsibility cannot be delegated to the AI, regardless of how sophisticated the tool or how confident its outputs appear. Researchers must ensure their AI-assisted code is reproducible, well-documented, and scientifically appropriate. When AI generates code that implements a statistical method or analytical pipeline, the researcher must understand that implementation well enough to defend its appropriateness, explain its limitations, and troubleshoot unexpected results. “AI wrote it” is not a valid defense for flawed methodology or incorrect results. Transparency about AI usage in methods sections, while important, does not diminish this responsibility.

Beyond individual accountability, broader ethical concerns demand serious consideration. The environmental costs of training and running large language models are substantial and measurable [3, 4]. These systems consume enormous amounts of energy and computational resources, raising questions about the sustainability of widespread AI adoption. Further, intellectual property questions surrounding AI training on open-source code and the ownership of AI-generated code remain legally and ethically unsettled [5, 6, 7, 8]. Courts have yet to definitively rule on whether training on copyrighted code constitutes fair use, whether AI-generated code can be copyrighted, and who owns the rights to such code when models have been trained on proprietary or licensed material. These are fundamental ethical and legal challenges that the scientific community must grapple with as AI tools become embedded in research infrastructure. While these complex issues merit a dedicated treatment beyond our scope here, researchers should recognize that using AI coding tools involves participating in systems with significant unresolved ethical dimensions.

Guardrails for Autonomous Agents#

Autonomous coding agents can make extensive changes across a codebase with minimal human intervention, dramatically accelerating development but introducing risks if not properly constrained. The primary danger lies in granting agents too much control without appropriate safeguards. An agent given broad permissions might break existing functionality, introduce security vulnerabilities, or violate architectural principles while reporting success.

We recommend several guardrails. First, use containerized or sandboxed environments for agent-driven development, isolating agent operations from production systems [9]. Second, commit working code before allowing agent changes, enabling easy rollback. Third, learn how to properly configure agents with explicit constraints about what they can modify and what actions require human approval. Fourth, maintain active monitoring rather than allowing unsupervised operation, as discussed in Rule 8. For individual projects, consider project-specific containers where each agent operates in an isolated environment with restricted file access. As autonomous agents become more capable, developing clear and safe practices for constraining and monitoring their behavior will become increasingly important for maintaining scientific rigor and system safety.
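The checkpoint-and-rollback guardrail can be sketched with ordinary git commands. The repository contents, file names, and validation step below are hypothetical, and the commands stand in for whatever agent and test suite you actually use:

```shell
# Sketch: commit a known-good state before an agent session,
# then restore the tree if the agent's changes fail validation.
set -e
workdir=$(mktemp -d)            # isolated scratch repo for this sketch
cd "$workdir"
git init -q
git config user.email "you@example.com"
git config user.name "Example User"

# Known-good state (hypothetical analysis script).
echo 'print("validated result")' > analysis.py
git add -A
git commit -q -m "checkpoint: known-good state before agent session"

# ...an agent edits files here; suppose it introduces a regression...
echo 'print("broken")' > analysis.py

# Validation failed, so discard everything since the checkpoint.
git reset --hard -q HEAD
```

After the `git reset`, `analysis.py` is back to its checkpointed contents. In practice you would run your test suite between the agent session and the reset, keeping the agent's changes only when validation passes.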

Limitations and Future Directions#

We acknowledge that we are operating in a rapidly evolving technological landscape. For reference, GPT-3 (2020) had a context window of 2,048 tokens [10], GPT-4 (2023) expanded this to variants with tens of thousands of tokens [11], and current state-of-the-art models like Google Gemini 2.5 Pro (2025) can operate with context windows of millions of tokens [12]. In light of this rapid evolution, we have intentionally focused on principles and practices that remain relevant across different AI capabilities. We believe our proposed rules emphasize fundamental skills (domain knowledge, problem decomposition, critical review) and strategies (context management, test-driven development, incremental refinement) that apply regardless of specific tools, and that have proven useful throughout the evolution of AI models to date. We have deliberately avoided prescriptive recommendations and strategies tied to specific models, as these would quickly become outdated. Future advances may change which practices prove most valuable, but we believe these rules provide a useful framework for current practice that will remain adaptable as technology matures.

We also anticipate substantial evolution in how science acknowledges AI-assisted work. As AI coding becomes standard practice, we expect clearer community expectations for documenting AI tool usage and validating AI-generated code; this may include citations of specific systems, disclosure of prompting approaches, detailed validation procedures in methods sections, and heightened expectations regarding testing, validation, and reproducibility of AI-derived code. The practices we recommend (systematic context building, comprehensive testing, and critical validation) may provide a foundation for informing and meeting these emerging accountability standards.

Further Reading#

The rules presented in this paper provide a framework for using AI tools effectively in scientific computing, but they build upon foundational concepts in software development, reproducibility, and design. The following resources introduce key concepts and practices that support the development of programming skills necessary for effective AI-assisted coding, and may help readers deepen their understanding of principles underlying the rules.

  1. LeVeque, Randall J., et al. “Reproducible Research for Scientific Computing: Tools and Strategies.” Computing in Science & Engineering, vol. 14, no. 4, 2012, pp. 13-17, doi:10.1109/MCSE.2012.38. This article establishes best practices for reproducible scientific computing that inform our emphasis on documentation, testing, and validation throughout the rules [13].

  2. Ousterhout, John. A Philosophy of Software Design. 2nd ed., Yaknyam Press, 2021. This book articulates core principles of software design including abstraction and modularity that inform our guidance on how to think about programmatic solutions and what to specify as useful context for scientific problems in Rules 4 and 5, as well as how to think through incremental refinement stages for Rule 10 [14].

  3. Poldrack, Russell A. Better Code, Better Science. https://poldrack.github.io/BetterCodeBetterScience/frontmatter.html. Accessed Sept 10, 2025. This comprehensive guide introduces AI tools in scientific workflows and provides practical guidance that complements the principles outlined in our rules [1].

  4. Felleisen, Matthias, et al. How to Design Programs: An Introduction to Programming and Computing. 2nd ed., MIT Press, 2018, https://htdp.org. This text emphasizes systematic problem decomposition and design principles that underpin the distinction between programming and coding discussed in Rule 2 [15].

  5. Beck, Kent. Test-Driven Development: By Example. Addison-Wesley, 2003. This book provides the foundational methodology for test-driven development discussed in Rule 6, demonstrating how writing tests before implementation can help you develop more robust and maintainable code [2].

  6. Wiebels, Kristina, and David Moreau. “Leveraging Containers for Reproducible Psychological Research.” Advances in Methods and Practices in Psychological Science, vol. 4, no. 2, 2021, doi:10.1177/25152459211017853. This paper demonstrates how containerization supports reproducible research, directly relevant to the guardrails for autonomous agents discussed in our Discussion section [9].

Acknowledgements#

An initial framework of 20 rules was developed by EWB and RAP (10 rules each). These rules were streamlined into 10 rules with assistance from Claude (Anthropic) [16], which were then authored and iteratively refined by the research team into the content, examples, and recommendations presented herein. In the complementary examples of this Jupyter Book, the Claude (Anthropic) chatbot and Claude Code were used as the reference tools for the working examples of AI interactions. This work was supported by a grant from the Sloan Foundation to RAP (G-2025-25270).


References#

[1]

Russell A. Poldrack. Better code, better science. https://poldrack.github.io/BetterCodeBetterScience/, 2024. Accessed: 2025-09-10. doi: 10.5281/zenodo.17407478.

[2]

Kent Beck. Test-Driven Development: By Example. Addison-Wesley, Boston, MA, 2003.

[3]

Ahmad Faiz, Sotaro Kaneda, Ruhan Wang, Rita Chukwunyere Osi, Prateek Sharma, Fan Chen, and Lei Jiang. Llmcarbon: modeling the end-to-end carbon footprint of large language models. In The Twelfth International Conference on Learning Representations (ICLR). 2024.

[4]

Shaolei Ren, Bill Tomlinson, Rebecca W Black, and Andrew W Torrance. Reconciling the contrasting narratives on the environmental impact of large language models. Scientific Reports, 14(1):26310, 2024. doi:10.1038/s41598-024-76682-6.

[5]

Jan Bernd Nordemann and Jonathan Pukas. Copyright exceptions for AI training data—will there be an international level playing field? Journal of Intellectual Property Law & Practice, 17(12):973–974, 2022. doi:10.1093/jiplp/jpac106.

[6]

Adam Buick. Copyright and AI training data—transparency to the rescue? Journal of Intellectual Property Law & Practice, 20(3):182–192, 2025. doi:10.1093/jiplp/jpae102.

[7]

Matt Blaszczyk, Geoffrey McGovern, and Karlyn D. Stanley. Artificial intelligence impacts on copyright law. Perspective PEA3243-1, RAND Corporation, Santa Monica, CA, 2024.

[8]

Martin Kretschmer, Thomas Margoni, and Pinar Oruç. Copyright law and the lifecycle of machine learning models. IIC - International Review of Intellectual Property and Competition Law, 55:110–138, 2024. doi:10.1007/s40319-023-01419-3.

[9]

Kristina Wiebels and David Moreau. Leveraging containers for reproducible psychological research. Advances in Methods and Practices in Psychological Science, 2021. doi:10.1177/25152459211017853.

[10]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and others. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, 1877–1901. 2020.

[11]

OpenAI and others. GPT-4 technical report. Technical Report, OpenAI, 2023. arXiv:2303.08774.

[12]

Google DeepMind. Introducing Gemini 2.0: our new AI model for the agentic era. Google Blog, December 2024. URL: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/.

[13]

Randall J LeVeque, Ian M Mitchell, and Victoria Stodden. Reproducible research for scientific computing: tools and strategies for changing the culture. Computing in Science & Engineering, 14(4):13–17, 2012. doi:10.1109/MCSE.2012.38.

[14]

John Ousterhout. A Philosophy of Software Design. Yaknyam Press, 2 edition, 2021.

[15]

Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, and Shriram Krishnamurthi. How to Design Programs: An Introduction to Programming and Computing. MIT Press, 2 edition, 2018. URL: https://htdp.org.

[16]

Anthropic. Claude sonnet 4.5. Anthropic Product Release, 2025. URL: https://www.anthropic.com/claude/sonnet.