Background concepts and vocabulary

To navigate the challenges of AI-assisted coding effectively, researchers should be familiar with several key concepts that underpin these tools:

Large Language Models (LLMs) are neural networks trained on vast text corpora that generate text by predicting sequences of tokens. Tokens are the basic units of text processing and typically represent words, parts of words, or individual characters [2]. For instance, a WordPiece-style tokenizer might split the word “unhappily” into “un”, “##happi”, and “##ly”, where ## marks tokens that do not start a word.
Context windows define the maximum number of tokens an LLM can consider when generating a response. State-of-the-art models typically handle hundreds of thousands to millions of tokens, which limits how much code and documentation they can process at once. When the context limit is exceeded, models lose track of earlier information. Even when information fits within the context window, attention to mid-document details can degrade (“lost in the middle”), especially for models with very large context windows; this degradation is known as context rot. For an example of context rot, see [1].
In-context learning allows models to adapt their behavior based on examples and instructions provided within the current conversation, without permanent changes to the underlying model. This makes it possible to steer model behavior by choosing examples carefully and formatting instructions clearly.
Prompting encompasses techniques for structuring inputs to elicit desired outputs, including specifying requirements clearly, providing well-chosen examples, and using structured formatting. Effective prompting can substantially improve the quality and relevance of generated code.
Test-driven development involves writing tests before implementation to specify expected behavior and validate correctness, a practice that becomes even more critical when AI generates the implementation code. Test-driven development is detailed in [3].
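
As a minimal sketch of how some of these concepts fit together, the Python example below follows a test-first workflow for two small helpers. The tests are written first and state the expected behavior; the implementation that follows is the kind of code an AI assistant might be asked to produce. The names `TOY_VOCAB`, `wordpiece_tokenize`, and `fits_in_context` are hypothetical and chosen for illustration only. The vocabulary is a toy assumption rather than any real model's vocabulary, and the greedy longest-match split is a simplified WordPiece-style procedure, not the tokenizer of any particular LLM.

```python
# Toy vocabulary: an assumption for illustration, not a real model's vocabulary.
TOY_VOCAB = {"un", "##happi", "##ly", "happy"}


# Step 1 (written first): tests that specify the expected behavior.
def test_splits_word_into_subword_tokens():
    assert wordpiece_tokenize("unhappily", TOY_VOCAB) == ["un", "##happi", "##ly"]


def test_whole_word_in_vocabulary_is_one_token():
    assert wordpiece_tokenize("happy", TOY_VOCAB) == ["happy"]


def test_unknown_word_maps_to_unk():
    assert wordpiece_tokenize("xyz", TOY_VOCAB) == ["[UNK]"]


def test_context_window_check():
    tokens = ["un", "##happi", "##ly"]
    assert fits_in_context(tokens, max_tokens=3)
    assert not fits_in_context(tokens, max_tokens=2)


# Step 2 (written second): an implementation intended to satisfy the tests.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword tokens by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining piece first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that do not start the word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched, so treat the whole word as unknown
        tokens.append(match)
        start = end
    return tokens


def fits_in_context(tokens, max_tokens):
    """Return True if the token count does not exceed the context window."""
    return len(tokens) <= max_tokens


if __name__ == "__main__":
    for test in (
        test_splits_word_into_subword_tokens,
        test_whole_word_in_vocabulary_is_one_token,
        test_unknown_word_maps_to_unk,
        test_context_window_check,
    ):
        test()
    print("all tests passed")
```

Because the test functions follow the `test_` naming convention, a test runner such as pytest could also collect them. A real project would count tokens with the tokenizer that belongs to the model in use, since token counts differ between models.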

References

[1] Chroma. Context rot: how increasing input tokens impacts LLM performance. Technical Report, Chroma, 2024. URL: https://research.trychroma.com/context-rot.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. 2017.

[3] Kent Beck. Test-Driven Development: By Example. Addison-Wesley, Boston, MA, 2003.
