Humans do it better: GitClear analyzes 153M lines of code, finds risks of AI

Summary:

GitClear analyzes AI’s influence on code quality by examining over 153 million changed lines of code from 2020 to 2023. Highlighting key shifts in code churn, duplication, and age, the report explores the impact of AI tools like GitHub Copilot on programming practices and delves into the challenges and implications for future coding standards, stimulating discussion on maintaining code quality in the age of AI.

This study was conducted by analyzing how code was authored in 3400 repositories between 2020 and 2023. Results show that three key metrics have seen significant change in the second half of the studied interval: code churn, code duplication, and code age.

The paper studies the variation of those factors and how their change correlates with the advent of AI programming assistants. The increase in code churn (the percentage of code that gets removed or significantly altered soon after integration) suggests that more “bad code” is being authored. The higher percentage of duplicated code indicates that developers applied “quick patches” more often than 3 years ago, decreasing the clarity of the project. Finally, the shorter average “refactor time” points to more time spent on fixing recent bad code versus refactoring legacy modules.

Perhaps unsurprisingly, the output quality of AI-generated code resembles that of a developer unfamiliar with the project they are altering. Just like a developer assigned to a brand-new repository, code generation tools are prone to corrupting the DRY-ness of the project. To see the full details, view the paper here.


The GitHub Copilot Context

2023 marked the mainstream launch of GitHub Copilot and a dramatic increase in code written by AI programming assistants. Regarding its impact, GitHub’s CEO Thomas Dohmke highlights in his blog post a gain of 15 million “developers,” a global economic impact of 1.5 trillion U.S. dollars, 55% “faster coding,” and 46% more “code written.”

Building on the AI-adoption statistics provided by GitHub, GitClear’s study aims to measure the implications of this phenomenon. Are there measurable side effects to committing AI-generated code? What are the implications of the widespread adoption of AI programming assistants?

The Problem with AI-Generated Code

To echo the take of Adam Tornhill (code researcher and author of Your Code as a Crime Scene) on the subject, the first “AI-generated problem” stems from the fact that, on average, developers spend 10x more time reading code than writing it (according to Robert Martin, author of Clean Code: A Handbook of Agile Software Craftsmanship). If AI assistance helps write code 55% faster, that means all code will be written faster, including the bad code (or the code that shouldn’t be written in the first place).

From the same lens of “code maintainability,” AI-generated suggestions are skewed towards adding new code as opposed to moving, updating, or deleting existing code. Similarly, the suggestion algorithms favor the results that are most likely to be accepted (which, by itself, does not predict “codebase health”). The complex list of implications culminates with the increased time the developer has to spend reading and evaluating said suggestions.

How to Approximate Code Quality?

GitClear classifies code changes into seven main code operations: additions, deletions, moves, updates, string substitutions, duplicates and no-op code. More details on GitClear’s code operations can be found in the Diff Delta documentation.

By analyzing the contents of authored code, GitClear is able to approximate developer intention. For example, “additions” (completely new lines of code being added) usually correlate with the creation of new features. Meanwhile, “moves” (existing lines being transferred to other files or functions within a file) usually correlate with code refactoring. Similarly, “deletions” tend to coincide with cleanup and increased codebase health, while “duplicates” typically achieve the opposite.
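To make these categories concrete, here is a minimal Python sketch that classifies line-level changes between two versions of a file into adds, deletes, duplicates, and moves. This is a rough illustrative heuristic under assumed inputs, not GitClear’s actual Diff Delta algorithm, and the function name is hypothetical.

```python
# Toy per-line diff classifier, loosely mirroring the operation categories
# described above. Illustrative only -- NOT GitClear's Diff Delta algorithm.
from collections import Counter

def classify_operations(before: list[str], after: list[str]) -> dict[str, int]:
    """Rough operation counts for a file's before/after line lists."""
    b, a = Counter(before), Counter(after)
    ops = {
        # Brand-new lines count as "add"; extra copies of lines that
        # already existed count as "duplicate".
        "add": sum(n for line, n in (a - b).items() if line not in b),
        "duplicate": sum(n for line, n in (a - b).items() if line in b),
        "delete": sum((b - a).values()),
    }
    # Lines kept in both versions but in a different relative order are
    # treated as "moved" -- a very rough stand-in for refactoring.
    kept_before = [line for line in before if line in a]
    kept_after = [line for line in after if line in b]
    ops["move"] = sum(x != y for x, y in zip(kept_before, kept_after))
    return ops
```

For example, swapping two lines while copy-pasting a third and appending a new one yields one add, one duplicate, and two moves under this heuristic.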

In addition to code operations, GitClear also measures a metric called “Churned Code”: code that a developer writes and then reverts or significantly alters within the following two weeks. Churn is best understood as “changes that were either incomplete or erroneous when they were authored.”
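The churn metric itself is straightforward to sketch. The snippet below assumes each changed line carries an authored timestamp and an optional revised timestamp; the 14-day window follows the definition above, but the field names and data model are hypothetical, not GitClear’s schema.

```python
# Minimal sketch of the "churned code" metric: a line counts as churned if
# it is reverted or significantly altered within 14 days of being authored.
# Field names (authored_at, revised_at) are illustrative assumptions.
from datetime import datetime, timedelta

CHURN_WINDOW = timedelta(days=14)

def churn_rate(lines: list[dict]) -> float:
    """Fraction of authored lines revised within the churn window.

    Each dict needs an 'authored_at' datetime and a 'revised_at' datetime
    (None if the line was never changed again).
    """
    churned = sum(
        1 for line in lines
        if line["revised_at"] is not None
        and line["revised_at"] - line["authored_at"] <= CHURN_WINDOW
    )
    return churned / len(lines) if lines else 0.0
```

A line revised on day 3 counts as churn; one revised on day 40, or never revised at all, does not.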

Finally, GitClear also measures “code provenance”: the length of time between the moment a piece of code is written and the moment it is subsequently updated or deleted.

With this framework for understanding code change in mind, let’s look at the results.

Study Results: Trends in Code Operations and Churn

GitClear analyzed the number of different code operations and aggregated their distributions for every year. Here are the results:

The table above illustrates how much each code operation contributed to code changes in a given year. Additionally, the “Churn” column shows the percentage of code that was committed and then removed or updated within two weeks. The last row of the table is a projection for 2024 extrapolated from the previous four years. Here is the same data plotted as a graph:

Furthermore, here are the trends in the “revised code age”:

The table above displays how long it took for code to be “revised” every year between 2020 and 2023, as well as a projection for 2024.

The Real Impact of AI-Generated Code

Looking at the variation in operation frequency and churn between 2020 and 2023, we find three red flags for code quality.

The most significant changes correlated with GitHub Copilot’s rise are “churn,” “moves,” and “duplicates.” Let’s explore the implications of each.

Developers Commit more “Breaking Code”

Recall that “churn” is the percentage of code that was pushed to the repo, then reverted, removed, or updated within two weeks. This was a relatively infrequent outcome when developers authored all their own code: only 3-4% of code was churned each year between 2020 and 2022. By contrast, in 2023 the figure grew to an average of 5.5%.

The data strongly correlates “using Copilot” with “mistake code” being pushed to the repository more frequently. Taking Copilot’s prevalence as 0% in 2021, 5-10% in 2022, and 30% in 2023 (per GitHub and O’Reilly), the Pearson correlation coefficient between prevalence and churn is 0.98.

The more churn becomes commonplace, the greater the risk of mistakes being deployed to production. If the current pattern continues into 2024, more than 7% of all code changes will be reverted within two weeks, double the rate of 2021.
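The 2024 figure can be approximated by naively continuing the most recent year-over-year trend. The churn values below are assumptions chosen for illustration; the study’s own projection is presumably fit on its full dataset.

```python
# Naive trend projection for 2024 churn, using assumed yearly figures
# (the study reports 3-4% for earlier years and 5.5% for 2023).
churn_by_year = {2021: 3.5, 2022: 3.8, 2023: 5.5}  # % of changes churned

# Repeat the most recent year-over-year delta one more year forward.
delta = churn_by_year[2023] - churn_by_year[2022]
projected_2024 = churn_by_year[2023] + delta
```

Under these assumed numbers the projection exceeds 7%, roughly double the 2021 rate, consistent with the claim above.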


Projects See Less Refactoring

“Moved code” is typically observed when refactoring an existing code system. As a product grows in scope, developers traditionally rearrange existing code into new modules and files that can be reused by new features. Code reuse means developers are often employing code that has already been tested and documented. Therefore, this type of operation often translates into less time spent understanding existing code and less cognitive effort updating it.

The 17% decrease in “move” operations compared to 2021 hints at a built-in tendency of AI assistants to discourage code reuse. Instead of refactoring and working to keep code DRY (“Don’t Repeat Yourself”), they offer a one-keystroke temptation to repeat existing code.

Programmers Push More Duplicated Code

By re-adding code instead of reusing it, the chore is left to future maintainers to consolidate parallel code paths. The problem is aggravated by most developers’ subjective preference for writing new code from scratch vs. reading existing code. Even in teams where there are senior developers with the skills and authority for refactoring, the willpower cost of understanding code well enough to delete it is hard to overstate.

In the absence of a CTO or VP of Engineering who actively schedules time to reduce “tech debt,” “copy/pasted code” often never gets consolidated into the appropriate component libraries.

Especially next to the decrease in “moved code,” the 11% increase in the proportion of duplicated code confirms the drop in overall code quality in 2023 when compared to 2021. Furthermore, since GitClear operations only include code that is duplicated within a single commit, it is likely that the real percentage of commits that have duplicate code is significantly larger.
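To make that measurement limitation concrete, here is a minimal sketch of commit-level duplicate detection, i.e., flagging only lines copied within a single commit’s additions. The function name and its length threshold are illustrative assumptions, not GitClear’s implementation.

```python
# Sketch of detecting duplication *within a single commit*, mirroring the
# limitation noted above: copies of code that already exists elsewhere in
# the repository are invisible to this check.
from collections import Counter

def duplicated_added_lines(added_lines: list[str], min_len: int = 10) -> int:
    """Count added lines that appear more than once in the same commit.

    Very short lines (braces, blanks) are ignored via min_len so that
    syntactic boilerplate is not counted as duplication.
    """
    counts = Counter(
        line.strip() for line in added_lines
        if len(line.strip()) >= min_len
    )
    return sum(n for n in counts.values() if n > 1)
```

A commit that pastes the same non-trivial line twice registers two duplicated lines; closing braces repeated across the diff do not.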

More Time Is Spent Changing Recent Code

Finally, the Code Provenance data corroborates the patterns observed in the code operation analysis. Namely, the amount of code replaced less than two weeks after it was written has jumped by 10%. Meanwhile, code older than one month was changed 24% less frequently in 2023 than in 2022.

The shift in “revised code age” suggests that less time is being spent on refactoring legacy code and more time is being spent fixing recent “mistake code.”
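The “revised code age” breakdown above can be sketched as a simple bucketing of how old a line was when it got changed; the bucket boundaries mirror the two-week and one-month thresholds mentioned in the article, but the function itself is a hypothetical illustration.

```python
# Bucket a changed line by its age at modification time, using the
# two-week and one-month thresholds discussed above (illustrative only).
from datetime import timedelta

def age_bucket(age: timedelta) -> str:
    """Classify how old a line of code was when it was modified."""
    if age <= timedelta(weeks=2):
        return "< 2 weeks"
    if age <= timedelta(days=30):
        return "2 weeks - 1 month"
    return "> 1 month"
```

Aggregating these buckets over every change in a year yields exactly the kind of distribution the study compares between 2022 and 2023.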

Conclusion and Open Questions

Recent trends in code changes point to an “add it and forget it” philosophy of programming. The specific metrics underlying this claim (code duplication, code churn, etc.) suggest a strong correlation between this drop in code quality and the adoption of code-generating AI tools.

While it is undeniable that code does get written faster, the side effects translate into (1) more time spent understanding existing code and (2) more resources directed at fixing bad code.

One potential avenue for alleviating the effects of the GitHub Copilot era is to train the available algorithms to favor refactoring code. While not an all-encompassing solution by itself, this could prove a favorable first step toward slowing the decline in code quality.

That said, any technical solution needs to be accompanied by a shift in public discourse. Developers and programming thought leaders alike ought to reframe the discussion towards code quality instead of code-writing speed; towards generating code that is easier to understand instead of more code. Provided with the relevant data and an industry consensus, most experienced developers are bound to opt for a more conservative use of AI programming assistants.

Finally, GitClear proposes a set of open questions around how to best measure and interpret the negative impact of faster code-writing:

  1. At what rate does development progress become inhibited by additional code? Does more code (and especially more copy/pasted code) inversely correlate with “the velocity at which developers can modify the code”? Knowing the rate at which slowdown takes hold would allow future tools to highlight when a manager should consider cutting back on new features.
  2. What is the total percentage of “duplicated code” that is actually occurring? Since GitClear currently measures only copy-pasted code within the context of an individual commit, the total volume of duplicated code might be much larger than the quoted numbers in this study.

GitClear will look to address these questions in future research and encourage other researchers in the field to contribute their own data.


Written by
Arc Team