
Why generating an SBOM based on your code is far from enough

This isn’t yet another blog giving the SBOM 101. There is an abundance of those. This is a deep dive into things we need to consider to generate the most accurate SBOM.

Authors

Rotem Bar, Head of Research @ Cider Security
Daniel Krivelevich, CTO @ Cider Security

The more our industry learns about SBOM (Software Bill of Materials), the deeper our understanding becomes of both the crucial importance of possessing an accurate SBOM and the complexities and intricacies associated with generating one. Covering all development contexts, languages, frameworks and package managers is far from trivial – especially when we are required to know the exact versions used by both direct dependencies and transitive dependencies. This blog takes a deep dive into the different sources and artifacts that exist across our entire CI/CD ecosystem – and that allow us to generate the most accurate SBOM.

Contents:

  • SBOM – Intro
  • Motivations for generating the most accurate SBOM
  • Artifacts which are relevant for deriving SBOM
    • Source code
      • Configuration files
      • Lock files
    • Container images
    • Build logs
  • Recommended measures
  • How Cider helps organizations with SBOM

SBOM – Intro

3rd party dependencies increasingly make up a significant portion of the application code being developed by organizations. This, together with the growing diversity of development languages and frameworks – each with its own method for working with dependencies – is drawing more and more attention to the importance of SBOM – our Software Bill of Materials.

In short, an SBOM is a list of the components in a piece of software. It is a description of all the code dependencies used – directly and indirectly – by a dev organization or any subset of it. But this blog isn’t about what SBOM is, which different formats exist and what its benefits are. If you are interested in learning more about these topics – you can find some great resources here, here or here.

Motivations for generating the most accurate SBOM

  1. Legal – For many years, the main use case for having an SBOM was legal. Security teams were often required by legal teams to have accurate visibility over all dependencies in use by the R&D organization, to make sure that none of these dependencies uses a license that violates the organization’s legal policies.
    Here’s one example of the challenges related to this objective, from our own experience developing code at Cider Security: We have about 90k dependencies in different areas of our code. After a request from our legal department to export all of our dependencies, we saw that one of our dependencies was problematic; the dependency itself was using an MIT license (which is great, because MIT is a very permissive license), but one of its dependencies was using a copyleft license, which is obviously problematic.
    Understanding these possible complications and preemptively avoiding them may save hours of coding, rather than detecting after the fact that a problematic package was introduced into production.
  2. Risk assessment – Engineers adopt new dependencies on a daily basis. This means hundreds or thousands of new dependencies and sub-dependencies are constantly being added to our codebase. In parallel, new vulnerabilities in these dependencies are continuously being exposed and published. And to top it off – we are witnessing frequent takeovers of the accounts of maintainers of some of the most widely used dependencies (COA, RC, UA Parser), with the intent of embedding malware into these packages. This malware is designed to execute on the endpoints of unsuspecting developers using these packages – both as direct and indirect dependencies.
    In either of these scenarios – a new known vulnerability or a compromise of a package – defenders need accurate capabilities to detect if – and exactly where within the dependency tree – the vulnerable/compromised package exists within the ecosystem (see the example just after this list).
    This is the only way to accurately assess the risk, determine the extent of compromise, and plan the right remediation activities.
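Most package managers can answer that question directly from the resolved dependency tree. As a minimal illustration with npm – ua-parser-js stands in here for whichever package is suspect:

# List every occurrence of a suspect package anywhere in the dependency tree (npm 7+)
npm ls ua-parser-js --all

# npm 7+ can also explain exactly which dependency chains pulled a package in
npm explain ua-parser-js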

So how do we make sure that we have the most accurate and comprehensive visibility over all of the dependencies used within our ecosystem?

Let’s generate some SBOM!

To start generating our full bill of materials, we first need to map the artifacts that are relevant for generating the most accurate SBOM.

Artifact #1: Generating SBOM from source code

The first step in the SBOM creation process can start at the code level, without the need to access any other resources. While our software is made up of thousands of code files, each language has its own package manager to help developers easily manage all their dependencies.

The package manager gets its instructions by parsing configuration files that specify which packages are needed to compile and run the software. It then proceeds to download them from the different (public and private) package repositories.

These files can come in a variety of formats and syntaxes, but all contain the same basic information: (a) which package and (b) what version (either a fixed version or a range of versions) to download.
In Python, for instance, it is common for projects to have a “requirements.txt” file with the following sample syntax (using SemVer – Semantic Versioning):

appdata==2.1.1
click==8.0.*
requests>=2,<3

This file is what enables projects to be fetched and executed on any host/endpoint. Any new developer/environment using the project will download the source code of the project and then request the needed dependencies stated in the file above.
The instructions contained in these files can be an excellent starting point from which to generate our SBOM.

However, this source, while very useful and convenient, has some important drawbacks:

  1. In the Python example above, the dependencies requested are not “locked.” This means that at the point in time when the package manager requests the dependencies, it will fetch the most up-to-date package that satisfies the defined constraints. In this case, the version of the package “requests” that is downloaded can be any version between 2 and 3 (requests>=2,<3).
  2. This file doesn’t contain the full list of dependencies, since each dependency listed has its own set of dependencies (which of course have their own transitive dependencies) – that are also downloaded and used by the project.

    For example, the selected “requests” package version “2.27.1” has the following transitive dependencies, which will also be downloaded even though they do not appear in the “requirements.txt” file:
charset_normalizer~=2.0.0
chardet>=3.0.2,<5
idna>=2.5,<3
idna>=2.5,<4
urllib3>=1.21.1,<1.27
certifi>=2017.4.17

In order to solve these problems, package managers started to create and use “lock” files. These files hold the full list of packages and the specific resolved versions of each package that were used to create the software. They are created when the developer responsible for the project installs new packages and are then used in the compilation process to download the correct package versions for any subsequent installation of the software.

For example, the Python command “pip freeze” will output all the dependencies and their specific resolved versions, which can be saved as a “lock” file:

# Direct Dependencies
appdata==2.1.1
requests==2.27.1
click==8.0.4
# Transitive Dependencies
attrs==21.4.0
certifi==2021.10.8
charset-normalizer==2.0.12
coverage==6.3.2
coveralls==3.3.1
docopt==0.6.2
idna==3.3
iniconfig==1.1.1
loguru==0.4.1
packaging==21.3
pluggy==1.0.0
py==1.11.0
pyparsing==3.0.7
pytest==6.2.5
toml==0.10.2
urllib3==1.26.8

This lock file can be used when compiling and installing the software for production usage to ensure that these specific dependencies will be used. The intent of the lockfile is to replicate the exact collection of packages used by the package maintainer to make sure that installation of the software is successful. But one of the additional benefits of the lockfile is that it can also serve as a more reliable source for generating SBOM.
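As a minimal sketch of that flow with pip (the file name “requirements.lock” is simply a convention chosen for this example):

# Resolve and record the exact versions currently installed
pip freeze > requirements.lock

# Any later installation reproduces exactly those versions
pip install -r requirements.lock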

The first principle of generating an accurate SBOM is therefore to lock your dependencies to specific versions whenever possible.

Unfortunately, deciding to use lock files is just the beginning. Even when using lock files, the code does not necessarily provide an accurate depiction of the actual packages installed in the build process and deployed to production. Let’s look at some examples based on JavaScript’s most popular dependency manager – npm (Node Package Manager).
npm lists the directly required dependencies in a file called “package.json,” but in order to support reproducible builds and the locking of dependencies to specific versions, it also introduced a lock file format: package-lock.json.

First pitfall: Small errors can cause big problems

When installing and building our source code, sometimes, for various reasons, the package-lock file might not be transferred to the build server. In the example below, for instance, a typo is what leads to this situation:

Good example dockerfile
—
FROM node:latest

ADD package*.json ./
RUN npm ci

CMD ["npm", "start"]

Bad example dockerfile
—
FROM node:latest

ADD package.json ./
RUN npm i

CMD ["npm", "start"]

In this example, the “*” (which adds all files named package* to the container, including package-lock.json) was accidentally left out. This means that the Dockerfile does not copy the package-lock file, and as a result the dependencies in the built container will differ from the dependencies listed in the source code’s package-lock file.

I found and disclosed exactly such a problem in AWS’ lambda templates, which they then corrected as you see below:
https://github.com/aws/aws-sam-cli-app-templates/commit/a3e77b3d7b27e34c294a46b927cd7f71a30515d7
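A lightweight safeguard against this class of mistake is to fail the build as early as possible when the lock file never reached the build context. A minimal sketch for npm (the same check could live inside the Dockerfile itself):

# Abort the build if the lock file was not copied alongside package.json
if [ ! -f package-lock.json ]; then
  echo "package-lock.json is missing - dependencies will not be locked" >&2
  exit 1
fi
npm ci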

Second pitfall: The configurations don’t always include all dependencies

The package configuration files may not actually document all the dependencies used by our software. In the Dockerfile below, for example, we can see that the developer installed other dependencies after package.json was used. If we relied only on the package configuration files, we would not see these added packages.

FROM node:latest

ADD package*.json ./
RUN npm ci

RUN npm install requests

CMD ["npm", "start"]
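One way to catch this kind of drift is to compare the lock file that ended up inside the built image against the one committed to source control – a minimal sketch, assuming the application lives in /app inside the image (both the image name and the path are placeholders):

# Extract the lock file that actually ended up inside the built image
docker run --rm my-app:latest cat /app/package-lock.json > built-lock.json

# Any difference from the committed lock file points at out-of-band installs
diff built-lock.json package-lock.json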

Third pitfall: Package manager confusion

With the multiplicity of package managers and versions continually arriving on the scene, it is difficult to be certain that your package manager is behaving as you expect it to.

One problem is that certain versions of various package managers may not work as they are intended to (see for example this issue from our previous article).

Another problem is that sometimes different package managers exist for managing the same resources – and their clients/the developers are not necessarily aware of that. In the scenario below, for instance, we are using npm to install source code that the developer generated with Yarn (an alternative package manager for JavaScript), which supports the same ecosystem as npm but generates its own lock file: yarn.lock.

Because npm does not know about yarn.lock files, it will download dependencies from the package.json file only, and no lock file will be used to fetch the dependencies while installing.

Dockerfile
–
FROM node:latest

ADD main.js ./
ADD package.json ./
ADD yarn.lock ./

RUN npm i
# Correct command should be - yarn install

CMD ["npm", "start"]

You might ask yourself why the Dockerfile would run the npm command (npm i) even though Yarn was the tool used by the developers. But in reality this happens quite often, as the Dockerfiles themselves are frequently maintained by DevOps engineers, who are not always aware of the intricacies of how developers build packages – which leads to inconsistencies and mistakes like the one described above.
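A simple defense against this mismatch is to let the lock file that is actually present choose the installer. A minimal sketch (Yarn classic syntax; Yarn 2+ uses “yarn install --immutable” instead):

# Choose the package manager based on which lock file exists
if [ -f yarn.lock ]; then
  yarn install --frozen-lockfile
elif [ -f package-lock.json ]; then
  npm ci
else
  echo "no lock file found - refusing to guess" >&2
  exit 1
fi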

Fourth pitfall: Inconsistencies between package configuration and lock files can cause problems

Ideally, when using lock files, we want any client installing the package to download dependencies according to the lock file. But if a developer changes the package configuration file without generating an updated lock file to reflect these changes, there will be a “drift” between the configuration and lock files. For these scenarios, we need to be aware of the difference between the “npm i” and “npm ci” commands. When there is an inconsistency between the configuration and lock file, “npm i” will use the configuration as the source of truth, whereas “npm ci” will install strictly from the lock file (recent npm versions even exit with an error when the two files are out of sync).

Good example dockerfile
—
FROM node:latest

ADD package*.json ./

RUN npm install \
    --no-package-lock color

RUN npm ci

CMD ["npm", "start"]

Bad example dockerfile
—
FROM node:latest

ADD package.json ./

RUN npm install \
    --no-package-lock color

RUN npm i

CMD ["npm", "start"]
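Because recent npm versions make “npm ci” exit with an error when package.json and package-lock.json disagree, running it in CI doubles as a drift gate:

# npm ci (recent versions) refuses to run when package.json and
# package-lock.json are out of sync, so a non-zero exit flags the drift
npm ci || { echo "package.json / package-lock.json drift detected" >&2; exit 1; }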

SBOM from source code – summary

Source code can be a valuable artifact for generating SBOM, but to rely on source code alone, we need to meet the following conditions:

  1. A lock file must be committed to the code repository
  2. Use of that lock file must be verified while building the software
  3. No packages may be installed during the build outside of the configuration file
  4. The correct – and up-to-date – package manager must be used when installing the code
  5. Only “npm ci” or its equivalent in other frameworks should be used, to avoid drifts and inconsistencies

Keep in mind, however, that every language has its own package managers, each with its own set of issues that must be accounted for when relying on source code for SBOM. Python alone has several package management methods – such as pip (requirements.txt), setup.py, and conda – each with its own individual areas of concern. It is therefore prudent to never rely only on source code, but to draw on additional artifacts as well.

Artifact #2: Deriving SBOM from the operating system (containers)

The major advantage of this option is its accuracy. Unlike source code, which is susceptible to all sorts of possible oversights, scanning containers is the closest we can get to the actual packages that exist in the final installation and deployments of our services.
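Several open-source scanners can generate an SBOM straight from a built image. As one illustrative option, Anchore’s syft (the image name is a placeholder):

# Generate an SPDX-formatted SBOM directly from a built container image
syft my-app:latest -o spdx-json > sbom.spdx.json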

However, scanning containers also comes with several drawbacks:

  1. The container layers holding the packages are large
    Compared to a configuration file, which weighs roughly 1-2 KB, the packages stored inside docker layers can weigh hundreds of MBs. Continuously downloading and processing these packages costs resources and time.
  2. Scanning containers is not an option with compiled languages (e.g., Go, C)
    These languages do not store the packages themselves in the final distribution, which makes it hard or even impossible to determine the packages that were used when building the software.
  3. Time to action
    Scanning containers will usually happen in the late stages of the CI/CD pipeline, in contrast to code scanning, which can happen even on the developer’s machine. This means that scanning containers cannot provide immediate feedback on potential issues.
  4. Identifying the origin/source
    When deriving SBOM from code, we identify exactly which code is responsible for installing each and every package. This means that when there is a known vulnerability/compromise in a specific package, we immediately know where to implement the fix in order to prevent risk. When using containers, however, we only know where we are vulnerable – we still need to identify the relevant piece of code in order to remediate.

Even with these drawbacks, it is still crucial to scan containers – both to make up for the drawbacks we mentioned earlier around deriving SBOM from code, and to cover scenarios in which we don’t have access to the code.

Artifact #3: Scanning build logs

Another source of data, which is crucial but often underused, is log scanning.
When our build systems actually build and download the packages, they usually print logs describing how each package was retrieved. This information can be used to understand which packages were downloaded and used – even in testing and in intermediate layers that don’t show up in the final squashed container images.

Logs, however, are free-text, and therefore very hard to parse and analyze. Each build system behaves differently and sometimes doesn’t include all required data. There is no standard for log production, or enforcement to ensure that the wording of logs will not suddenly change, which makes analyzing them a challenge.

Here, for instance, is a log produced by “bundle install,” and another created by Conda. As you can see, different package managers can phrase their logs very differently. They can also change their phrasing from version to version. The Conda example highlights another potential problem – the possibility that important information is truncated from the log. On lines 19 and 23, for instance, we can see that there should be additional text in the package name, but it has been cut off by the log’s table format. 

Unfortunately, while good tools for parsing source code and containers have already been developed, to the best of our knowledge there are currently no reliable tools for parsing logs.
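To illustrate the kind of ad-hoc parsing this requires, here is a minimal sketch for pip, which prints a summary line of the form “Successfully installed requests-2.27.1 urllib3-1.26.8 ...” – wording that, as noted above, may change between versions:

# Keep the raw build log while installing
pip install -r requirements.txt 2>&1 | tee build.log

# Extract the resolved package-version pairs from pip's summary line
grep 'Successfully installed' build.log | sed 's/.*Successfully installed //' | tr ' ' '\n'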

To fully know and understand what packages and 3rd party dependencies we are really using in all stages of the service (development, testing and production), we need to combine all three methods, choosing the best method for each need.

How do we know what our needs are?

In repositories that contain a lock file, we can limit ourselves to scanning the code: if we can verify that no changes were made inside the CI/CD (like installing new packages) and that the correct commands were used to install the packages, then we can safely rely on code scanning to be accurate.

When this is not possible, we should also scan the containers and make sure there are no drifts between our configuration files for each deployment. We can also use a sampling method to save on scanning resources – scanning containers only after major changes, or at certain intervals.

With compiled languages, we cannot rely on the container to hold all the packages, in which case we can scan the build logs for them instead.

How Cider helps organizations with SBOM

At Cider, we’ve spent quite some time figuring out the intricacies and specifics of generating an accurate SBOM across all of the different frameworks, artifacts, package managers, and use cases. As mentioned, different solutions are relevant for different needs – but ultimately, organizations need an effective solution for each use case – one that is tailored to the unique technical characteristics of their development ecosystem.


Cider’s unique value stems from our connectivity to all CI/CD systems, all the way from code to deployment – obtaining all relevant artifacts and data sources for securing the engineering ecosystem, including SBOM.
