Introduction

The integration of artificial intelligence into scientific computing represents one of the most significant shifts in research methodology since the advent of personal computers. Large language models (LLMs) trained on vast corpora of code can now generate syntactically correct, functionally appropriate programs from natural language descriptions, a capability that was inconceivable just a few years ago [1]. Tools like GitHub Copilot, ChatGPT, and Claude have democratized access to sophisticated programming assistance, enabling researchers with limited coding experience to implement complex analyses and build robust scientific software [2]. Agentic coding tools like Claude Code and Cursor have extended this assistance to entire coding workflows by invoking tools outside the language model itself.

AI-assisted coding tools have demonstrated measurable productivity gains in some controlled studies, with benefits spanning development speed, code quality, and maintainability [2]. However, the evidence for these benefits remains contested and situation-dependent. While some enterprise studies and developer surveys report significant productivity increases and improved code quality [3], recent randomized controlled trials with experienced developers found that AI tools actually slowed completion times, despite developers believing they were working faster [4]. Additional concerns about code quality have emerged, with research analyzing over 200 million lines of code showing substantial increases in copy-pasted code and decreases in refactoring as AI use has become more prominent [5]. These contradictory findings suggest that productivity effects are far from well understood, and may vary based on developer experience, task complexity, and codebase characteristics. These questions become particularly acute in scientific computing, where code is not merely a means to an end but often embodies scientific reasoning, methodological decisions, and domain expertise. The validity, reproducibility, and interpretability of scientific software directly impact research integrity and the reliability of scientific findings [6].

The implications for scientific computing are profound. Programming involves complex problem decomposition, algorithmic thinking, and domain-specific reasoning, cognitive skills that can atrophy with excessive AI dependence. Furthermore, scientific code often requires deep understanding of mathematical models, statistical methods, and domain-specific conventions that cannot be adequately captured by AI tools trained on general programming corpora. These challenges are compounded by the technical limitations inherent to current AI systems: their context windows constrain how much code they can process simultaneously, their stateless nature means they forget previous interactions, and their tendency toward “context rot” [7] can cause them to lose track of important details even within their processing limits.

Effective use of AI coding tools requires understanding how to work within and around these constraints. Techniques like strategic prompting, test-driven development, and externally-managed context files (such as memory files and constitution files) can help maintain consistency across AI interactions. Different tools (from conversational interfaces to interactive coding assistants to autonomous coding agents) each offer distinct capabilities and limitations that must be matched to specific development tasks. (For readers unfamiliar with these concepts, we provide detailed definitions in Background concepts and vocabulary.)
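To make the idea of an externally-managed context file concrete, the sketch below shows what such a file might contain. This is a minimal, hypothetical example: the filename conventions (e.g., `CLAUDE.md`) and supported features vary by tool, and the project details shown are invented for illustration.

```markdown
# Project memory file (e.g., CLAUDE.md; the expected filename varies by tool)

## Project conventions
- Python 3.11; dependencies pinned in `requirements.txt`
- Analysis code lives in `analysis/`; tests live in `tests/` and run with `pytest`
- Every new function requires a unit test before merging

## Constraints the assistant must preserve
- Input data are CSV files keyed by a `subject_id` column; never reorder rows
- Statistical thresholds were pre-registered; do not change them without asking
```

Because the file lives in the repository rather than in the model's conversation history, it persists across sessions and gives each new AI interaction the same project-specific grounding.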

The rules presented in this paper emerge from our collective experience using AI-assisted coding tools, and highlight both the substantial promise and documented risks of AI-assisted coding in scientific settings. We hope they provide a framework for harnessing AI’s transformative potential while preserving the methodological rigor and domain expertise essential for high-quality scientific computing. These guidelines emphasize the importance of maintaining human agency in the coding process, establishing robust testing and validation procedures, and strategically managing the interaction between human expertise and AI assistance.

Who is this paper for? These guidelines are intended for anyone who develops scientific software that will be used more than once, whether by themselves, their collaborators, or the broader research community. This includes both scientists who primarily use code to generate research outputs and developers who build reusable tools and packages. Our focus is on creating maintainable, reliable software rather than one-off scripts. If you write code that needs to work reliably and repeatably, be understood by others, or be built upon in the future, we believe these rules are for you.


References#

[1]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, and others. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[2]

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: evidence from GitHub Copilot. arXiv preprint arXiv:2302.06590, 2023.

[3]

Eirini Kalliamvakou, Albert Ziegler, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. Measuring GitHub Copilot's impact on productivity. Communications of the ACM, 67(3):54–63, 2024.

[4]

Joel Becker and Nate Rush. Measuring the impact of early-2025 AI on experienced open-source developer productivity. arXiv preprint arXiv:2507.09089, 2025.

[5]

Bill Harding, Matthew Kloster, and GitClear. AI Copilot code quality: 2023 data suggests downward pressure on code quality. GitClear Research Report, 2024. URL: https://gwern.net/doc/ai/nn/transformer/gpt/codex/2024-harding.pdf.

[6]

Russell A. Poldrack. Better code, better science. https://poldrack.github.io/BetterCodeBetterScience/, 2024. Accessed: 2025-09-10. doi: 10.5281/zenodo.17407478.

[7]

Chroma. Context rot: how increasing input tokens impacts LLM performance. Technical Report, Chroma, 2024. URL: https://research.trychroma.com/context-rot.