How safe are your passwords? How secure are your API keys? Are you sure your CI pipeline is configured according to security best practices?
One of the easiest ways for malicious actors to infiltrate systems and abuse data is to scan for secrets that have accidentally leaked into the public space. Why go through the effort of hacking when someone has left the keys to the kingdom sitting on the doormat?
For organizations, the cost can be considerable. According to IBM’s Cost of a Data Breach Report, the average cost of a data breach in 2020 was $3.86 million.
Put simply: Can your organization afford a data breach? If maintaining continued stability and security is your responsibility, read on.
The growth of collaborative development via shared code repositories has given malicious actors new and imaginative attack vectors to exploit.
In early 2021, a misconfigured Bitbucket server operated by Nissan NA was breached using a default administrator password.
The Git-based code repository contained code used across Nissan’s North American operations, allowing anyone with a little computing experience to clone the repository and plant backdoors into the existing codebase to exploit at a later date.
One of the biggest data breaches in recorded history, the SolarWinds attack, has been linked in public reporting to a weak password that a SolarWinds intern exposed in a public GitHub repository.
With no secret scanning tool in the CI/CD pipeline to catch it, the password leaked into a public repository, where anyone could find it. Once in control, the attackers used SolarWinds’ platform to compromise many of its high-profile clients in a classic supply-chain attack.
In January 2020, an Amazon cloud engineer accidentally committed almost a gigabyte of sensitive data to his personal GitHub repository.
Within 30 minutes, the leak was detected by automated tools run by a third-party security firm, demonstrating the speed and ease with which leaked secrets can be found when the right tools are in place. Without that quick detection and notification, Amazon’s AWS could have suffered additional data leaks and service disruption.
Today’s secret scanning solutions use one or more of the following scanning algorithms:
Entropy scanning is the simplest secret detection method. It works on the assumption that secrets are made up of random-looking characters, whereas ordinary source code is more structured and predictable.
An example of a high-entropy string would be “Gj12_34xAaQ2p01oV”. On its own, entropy scanning often produces false positives and is not sufficient for large projects.
Gitleaks is an open-source secret scanning solution that combines regular expressions with entropy checks, integrates into the CI/CD pipeline, scans Git commit history, and can report its findings in JSON, SARIF, or CSV format.
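To make the entropy idea concrete, here is a minimal sketch that computes the Shannon entropy of a string in bits per character and flags long, random-looking tokens. The function names, the 3.5-bit threshold, and the 16-character minimum are assumptions chosen for illustration, not values taken from any particular tool.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token: str, threshold: float = 3.5, min_length: int = 16) -> bool:
    """Flag tokens that are long enough and exceed the assumed entropy threshold."""
    return len(token) >= min_length and shannon_entropy(token) > threshold


# The example string from the text is flagged; a repetitive string is not.
print(looks_random("Gj12_34xAaQ2p01oV"))
print(looks_random("aaaaaaaaaaaaaaaaaaaa"))
```

Ordinary identifiers, hashes, and long lines of code can also score high on this measure, which is one source of the false positives mentioned above.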
Regular expression scanning looks for specific patterns that may point to an exposed secret. For example, all YouTube Data API v3 keys begin with the string “AIza” and use a fixed number of characters.
Regular expressions are especially useful for detecting API keys and tokens that follow a fixed string structure. However, they are not well suited to scanning for other secrets, such as passwords, that follow no predictable format.
git-secrets is an open-source solution that uses regular expressions to scan code for secrets and can integrate into the CI/CD pipeline to catch accidental commits.
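The pattern-matching approach can be sketched in a few lines. The example below assumes the commonly observed format for these keys, the “AIza” prefix followed by 35 URL-safe characters; the exact tail length and character set are assumptions for illustration, and `find_api_keys` is a hypothetical helper name.

```python
import re

# Assumed format: "AIza" followed by 35 characters from [0-9A-Za-z_-].
# The lookarounds stop the pattern from matching inside a longer token.
GOOGLE_API_KEY = re.compile(
    r"(?<![0-9A-Za-z_\-])AIza[0-9A-Za-z_\-]{35}(?![0-9A-Za-z_\-])"
)


def find_api_keys(text: str) -> list[str]:
    """Return every substring of text that matches the assumed key format."""
    return GOOGLE_API_KEY.findall(text)


# A made-up key with the right shape, used only to exercise the pattern.
fake_key = "AIza" + "A" * 35
print(find_api_keys(f'API_KEY = "{fake_key}"'))
```

A pattern like this is precise for one key format, which is why regular-expression scanners typically ship with many such rules rather than one.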
The rise of AI and machine learning was a complete game-changer.
Unlike entropy checks and regular expressions, which are essentially educated guesses, machine learning works by training an algorithm on a large, curated data set of previously discovered secret leaks.
The machine learning algorithm evolves through additional data sets and user-generated feedback. As the algorithm is trained, false-positive reports are reduced, and previously hidden secrets may be revealed.
Spectral is a powerful commercial solution that performs intelligent secret scans using AI and machine learning algorithms. Spectral easily integrates into the CI/CD pipeline and offers a clean and intuitive user interface.
When selecting a secret scanning solution, the first thing to consider is whether the solution meets your organization’s needs and specifications.
The last thing a developer wants is to be interrupted. Interrupting a developer’s workflow reduces the developer’s output and may even affect their morale if the interruption is perceived as a ‘waste of time’ (e.g., a false-positive report). A secret scanning solution should provide an intuitive UX (user experience) with minimal disruption.
CI/CD integration is a mandatory feature for any serious secret scanning solution. It ensures that code is scanned for secrets in real time as developers commit it to a repository. With CI/CD integration, developers are notified as soon as an accidental leak is detected, so the leak can be blocked well before it has an opportunity to spread beyond control.
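As a rough illustration of how a pipeline step might block a commit, the sketch below scans only the added lines of a unified diff and returns a non-zero exit code when one of them matches. The function names `scan_diff` and `exit_code` and the single detection pattern are assumptions made for this example; real scanners apply much larger rule sets.

```python
import re

# Assumed rule for illustration: "AIza" followed by 35 URL-safe characters.
SECRET_PATTERN = re.compile(r"AIza[0-9A-Za-z_\-]{35}")


def scan_diff(diff_text: str) -> list[int]:
    """Return the 1-based diff line numbers of added lines that match the rule."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Skip file headers ("+++ b/path") and anything that is not an addition.
        if line.startswith("+++") or not line.startswith("+"):
            continue
        if SECRET_PATTERN.search(line):
            hits.append(number)
    return hits


def exit_code(diff_text: str) -> int:
    """Return 1 (fail the pipeline step) if any added line matches, else 0."""
    return 1 if scan_diff(diff_text) else 0
```

A CI job or pre-commit hook could pass the output of `git diff --cached` to `exit_code` and fail the step when the result is 1. Reporting line numbers rather than the matched text keeps the secret itself out of the build logs.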
Most secret scanning tools are designed to scan for secrets in code. More advanced tools expand coverage by scanning Git commit history, Gists (shared code), Git server configuration, Git Wiki (shared knowledge), logs, and more. Make sure the solution you select offers comprehensive coverage that is suited to your organization.
One of the most important issues when scanning for secrets is accuracy. False positives may hurt developer performance and morale as developers waste their time handling non-existent cases. At the same time, false negatives mean that secrets are leaking, and you simply have no idea how or where.
A scanning solution must use machine learning combined with user feedback to significantly reduce the number of false positives while evolving to detect other secrets that may still be lurking in the wild.
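The trade-off between false positives and false negatives is usually summarized with two ratios, precision and recall. The sketch below shows the arithmetic; the counts in the example are invented purely for illustration.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of reported findings that were real secrets."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real secrets that the scanner actually reported."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Invented example: 8 real secrets reported, 2 false alarms, 8 secrets missed.
print(precision(true_positives=8, false_positives=2))
print(recall(true_positives=8, false_negatives=8))
```

Low precision means developers spend time dismissing false alarms; low recall means secrets are still leaking unnoticed.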
Not all CI/CD integration solutions are built with developer experience and resource allocation in mind.
A secret scanning solution should offer a fast, smooth CI/CD integration that does not introduce artificial delays into the pipeline. Such delays often frustrate developers and slow down development as a whole.
Secrets may leak in areas beyond your control. Whether it’s Slack, Microsoft Teams, email, or public platforms such as GitHub, Gists, or Pastebin, monitoring and secret leak alerts should extend beyond your organization’s internal systems.
Any secret scanning solution you choose should detect and alert you to any accidental secret leak involving your staff and the external services they may use.
Your organization may use unique templates to store secrets. Such templates may remain undetected by default.
A secret scanning solution that allows you to train the algorithm by importing or creating custom detection rules can elegantly resolve detection issues and enhance secret discovery.
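One simple way to picture custom detection rules is a registry of named patterns that a team can extend with its own formats. The sketch below uses plain regular expressions rather than a trained model, and everything in it is hypothetical: the `Rule` and `RuleSet` classes, the rule name, and the “ACME-” token format are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern


class RuleSet:
    """A small registry of named detection rules."""

    def __init__(self) -> None:
        self._rules: list[Rule] = []

    def add(self, name: str, regex: str) -> None:
        """Register a custom rule from a regular expression string."""
        self._rules.append(Rule(name, re.compile(regex)))

    def scan(self, text: str) -> list[tuple[str, int]]:
        """Return (rule name, 1-based line number) for every matching line."""
        findings = []
        for number, line in enumerate(text.splitlines(), start=1):
            for rule in self._rules:
                if rule.pattern.search(line):
                    findings.append((rule.name, number))
        return findings


rules = RuleSet()
# Hypothetical in-house format: "ACME-", 4 uppercase letters or digits, "-", 32 hex characters.
rules.add("acme-internal-token", r"ACME-[A-Z0-9]{4}-[0-9a-f]{32}")
```

Calling `rules.scan(source_text)` then reports which custom rule fired and on which line, so organization-specific secret formats no longer slip past the default rules.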
Your company’s source code is sacred. Exposing proprietary code may result in data theft, secret leakage, security breach, ransomware, regulatory concerns, and other nasty predicaments.
For these reasons, it is important to select a secret scanning solution that does not expose your intellectual property by remotely scanning the code on a server beyond your control.
Very few secret scanning solutions combine expansive coverage, a developer-first approach, machine learning detection, smooth CI/CD integration, an intuitive user experience, secure code privacy, and enhanced monitoring.
Spectral’s proprietary technology is the only solution that delivers a fast, accurate, and developer-friendly option that checks all the boxes. Furthermore, Spectral is well-suited to businesses and enterprises working with a large codebase and corporate DevSecOps in mind.