fossabot’s Strategic Updates Keep Getting Smarter

The fossabot Preview was launched in early October with two foundational capabilities that unlock the ability to make strategic dependency upgrades:

1. Detect important breaking changes that affect your app

2. Fix or adapt the app in order to complete the upgrade

This post dives into how we detect and interpret changes, and how we have improved the planning system that fixes them. Finally, we'll cover how we score ourselves on the outcomes through evaluation.

Driving Down Time-To-Merge

Since launch, fossabot has analyzed many Dependabot, Renovate, and Snyk Pull Requests and has helped organizations transition from worse than average to better than average at merging updates in just a few short weeks.

Organizations using fossabot merge updates over 50% faster

Our goal is to be able to return a verdict of “Safe” or “Safe with Fixes” (we fixed your code) for the majority of dependency upgrades. To drive down non-Safe verdicts, we’ve improved our agent with better planning and more efficient model usage, which makes fixes faster and drives down their cost.

Fixing Breaking Changes Like a Senior Engineer

fossabot now explains what it’s attempting to adapt in your code and why those steps need to be taken, and gives you more control over whether to proceed with that change or handle it yourself.

Example 1: Fixing Updated Lint Rules

For example, a frustratingly common issue is an eslint update that brings new default lint rules, which immediately start failing. fossabot will fix these for you:

4 call sites identified for lint-related fixes

After you trigger @fossabot fix, the bot makes a commit and waits for a CI signal to confirm the fix.

Fixes committed and CI signal is passing

Example 2: Adapting to New Patterns

Here's another example, this time a syntax adaptation fix. The bot already has context about the migration, saving both the research time and the toil related to this change.

Fixes applied and skipped after further research — 2 breaking changes fixed across 74 locations

Another enhancement: the new planning engine now consumes your Continuous Integration (CI) results, whether a fix is needed or not.

Stronger CI Signals and Custom Tools for GitHub Actions

Key to any dependency upgrade is a Continuous Integration (CI) test result that exercises key parts of your application.

fossabot automatically looks for signals in your GitHub Actions-based checks. We have added documentation for teams that use tools like Storybook or Cypress, along with other tips for using CI successfully.

Every organization does testing differently and we’d love to hear how to best support your team.

Measuring Accuracy, Consistency, and Correctness

Trust in fossabot is built on a foundation of accuracy, consistency, and correctness. How? Let’s dig into the dataset.

The dataset contains real-world dependency upgrades against complex applications. This allows fossabot’s evaluation framework to see all kinds of dependencies in action:

Dependencies with tons of changes per release
Major jumps between versions
Many direct and transitive dependencies being upgraded at once
Dependencies without changelogs at all

When those dependencies are used, behaviors differ as well:

Simple usage of high-level APIs
Targeted use of a small fraction of available functionality
Large-scale usage across the app
Dev-only tools that are still exposed to significant breaking changes

Determining the Verdict

fossabot’s verdict of “safe” or “not safe” for each upgrade is the key item that we test in our evaluations. It’s a roll-up of every fact, change, and behavior about each package being upgraded, and that matrix can get large very quickly. That means a single wrong analysis of one fact out of hundreds will affect these results:

Accuracy	Precision	Recall	F1 Score
72.6%	77.6%	86.8%	82.0%
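For readers who want to see how the four scores relate, here is a minimal sketch of the standard definitions, assuming a “safe” verdict is treated as the positive class. The class and function names are illustrative, and the counts in the usage note are made up; they are not fossabot's evaluation data.

```python
from dataclasses import dataclass


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary verdict where "safe" is the positive class."""

    tp: int  # predicted safe, actually safe
    fp: int  # predicted safe, actually not safe
    fn: int  # predicted not safe, actually safe
    tn: int  # predicted not safe, actually not safe

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        predicted_safe = self.tp + self.fp
        return self.tp / predicted_safe if predicted_safe else 0.0

    def recall(self) -> float:
        actually_safe = self.tp + self.fn
        return self.tp / actually_safe if actually_safe else 0.0

    def f1(self) -> float:
        return f1_score(self.precision(), self.recall())


# Illustrative counts only (hypothetical, not from the evaluation dataset).
example = ConfusionMatrix(tp=8, fp=2, fn=2, tn=8)
```

With these made-up counts, `example.accuracy()`, `example.precision()`, `example.recall()`, and `example.f1()` all come out to 0.8. Plugging the published precision and recall into `f1_score(0.776, 0.868)` gives roughly 0.82, which agrees with the F1 column above to within rounding.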

The verdict is deliberately conservative: when fossabot encounters uncertainty, it always favors “not safe”. If the upgrade was in fact safe, that counts as a false negative. These false negatives occur in about 10% of the evaluations today.
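The conservative roll-up can be pictured with a small sketch. This is an assumption about the shape of the logic, not fossabot's actual implementation; the `Finding` states and the `roll_up` function are hypothetical names used only for illustration.

```python
from enum import Enum
from typing import Iterable


class Finding(Enum):
    """Hypothetical outcome of analyzing one fact about an upgrade."""

    NOT_AFFECTED = "not_affected"  # confident the app is unaffected
    AFFECTED = "affected"          # confident the app would break
    UNCERTAIN = "uncertain"        # analysis could not decide


def roll_up(findings: Iterable[Finding]) -> str:
    """Return "safe" only when every finding is confidently NOT_AFFECTED.

    Any AFFECTED or UNCERTAIN finding tips the whole upgrade to "not safe",
    which is why one wrong fact among hundreds can change the verdict.
    """
    if all(f is Finding.NOT_AFFECTED for f in findings):
        return "safe"
    return "not safe"


# 99 confident findings plus a single uncertain one still yields "not safe".
mostly_fine = [Finding.NOT_AFFECTED] * 99 + [Finding.UNCERTAIN]
verdict = roll_up(mostly_fine)
```

Here `verdict` is `"not safe"` even though 99 of the 100 findings were clean. That trade-off lowers recall while keeping “safe” verdicts trustworthy.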

While that harms the scores above, it protects customers who trust our “safe” verdicts. We are comfortable with these scores sitting at the lower end of the range, and you can see below how major version upgrades with hundreds (or thousands) of changes affect the results.

Finding Changes

Understanding the content of a release is critical, and we have a whole section of the evaluation framework dedicated to evaluating this task. The numbers speak for themselves:

Category	Accuracy	Precision	Recall	F1 Score
routine_minor_updates	100%	100%	100%	100%
dev_dependencies	90.0%	100%	90.0%	94.7%
multi_dependency_updates	79.7%	85.0%	79.7%	82.3%
major_version_upgrades	58.1%	62.1%	58.1%	60.0%

For completeness, we need to ingest all human-readable content like release notes, changelogs, and migration guides. Are there cases where something important isn’t mentioned? Absolutely. In that case, fossabot will analyze the code changes to build up its own version of the changelog.

A good example of this is the popular pg, a non-blocking PostgreSQL client for Node, which is tagged on GitHub but has neither releases nor release notes. Maintaining an OSS library is hard work, so we’re not here to throw shade for the lack of notes, but we do need to understand what’s changing so the database client doesn’t break apps.

When fossabot displays the changelog for review, the entries are ranked by importance, labeled with the source of the data, and linked to repo issues or sources when available.
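As a rough sketch of what such a ranked, source-labeled changelog could look like in code: the `ChangeEntry` fields, the integer importance scale, the source label strings, and the example.com links below are all assumptions made for illustration, not fossabot's data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChangeEntry:
    """One discovered change (hypothetical structure)."""

    summary: str
    importance: int              # higher means more important (assumed scale)
    source: str                  # e.g. "release_notes", "changelog", "code_diff"
    link: Optional[str] = None   # repo issue or source, when available


def rank_for_review(entries: List[ChangeEntry]) -> List[ChangeEntry]:
    """Order entries from most to least important; ties keep input order."""
    return sorted(entries, key=lambda e: e.importance, reverse=True)


def render(entry: ChangeEntry) -> str:
    """Format one entry with its source label and optional link."""
    line = f"[{entry.source}] {entry.summary}"
    if entry.link:
        line += f" ({entry.link})"
    return line


# Placeholder entries for illustration only.
entries = [
    ChangeEntry("Docs typo fixed", 1, "changelog"),
    ChangeEntry("Default export removed", 9, "code_diff",
                "https://example.com/issues/1"),
    ChangeEntry("New optional flag", 4, "release_notes"),
]
ranked_lines = [render(e) for e in rank_for_review(entries)]
```

The first rendered line is `[code_diff] Default export removed (https://example.com/issues/1)`, so the change most likely to break an app is what a reviewer sees first, along with where the information came from.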

Understanding Changes

How fossabot explains its findings is just as important as their accuracy, so there is an evaluation for that as well. Each discovered change is measured on the validity and correctness of its explanation.

For example, here’s a specific fact that the analysis of Material UI 6.x to 7.x produced:

The update includes 38 new features adding missing root slots to multiple components

The evaluation did not pass, but as you can see, the judgment is very subjective:

The v7 release notes show root slots were added to components as part of broader changes, but not as 38 separate 'new features'.

Is it one feature that touches 38 implementations or 38 separate new features?

Try Out fossabot

fossabot’s public preview is available as a GitHub app. Every user gets $15 of analysis credit, replenished every month. Let loose the updates!

Try Now: Install the GitHub app

Reach out to get a demo of fossabot and let's figure out how to get your teams caught up on updates.