Package repositories were never something I thought about as a developer unless something didn’t work: it was slow, wouldn’t connect, wouldn’t install, or was overly complicated to configure. Mostly I wanted something I barely noticed. Something simple and easy to use.
When I started my career, I was developing Windows apps in the late 2000s. My packages were the exes and dlls produced by my lovely big Visual Studio IDE. Our dependencies were either official Windows libraries or proprietary, licensed 3rd party libraries. The exes and dlls were stored in folders on an FTP server and could be accessed using Microsoft access controls, so I suppose the FTP server was my package manager. The software ran on on-premise hardware, and the users were an in-house team that I knew well. I owned the application and the entire software workflow.
That software development workflow is very different from a more modern workflow where:
- I work on a small part of a bigger project which uses multiple frameworks, languages, and containers.
- My code is dependent on open source software (OSS) hosted externally.
- Multiple developers work on the same codebase simultaneously and make frequent commits to the code repository.
- Code could be built, scanned, and deployed many times a day via a CI/CD pipeline.
- The output of the build (the package) is deployed to the cloud.
- Distribution of the software package to customers is tracked and controlled.
As a developer, I want my package repository:
- To store my packages somewhere securely
- To have one place to view and control all my software packages in all formats- a single source of truth
- To push/pull packages super fast in all regions where my teams are based
- To handle distributing my software to customers.
- To be able to sign my packages and have that managed for me
- To integrate easily with my other build and security tools in an automated way.
- To be simple and easy to use, with extensive documentation
- Good support when something goes wrong
This was meant to be one blog- but it turns out package management is a dark horse and needed two blogs!
This blog will explain a few terms that we in the package management biz throw around, and our next blog on the topic will go into the things that have influenced package management.
A software package, artifact, or image is the output of building software- it groups together files containing your software along with the metadata about the software and dependencies in a well-defined format. Packages are typically versioned to provide a better and more manageable understanding of what software is being deployed.
Packages promote the reuse of code because a package can be dropped into another application and used easily. Packages are created using a package manager and are usually stored in a repository, like Cloudsmith. The table below details some common packages.
Package metadata describes the package with information about the author, repository location, repository version, file type, license, package dependencies, and more. Metadata can also have information about the CI/CD build like who triggered the build, the build time, approval information, vulnerability information, or user-created metadata.
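To make that a little more concrete, here is a minimal sketch of what a metadata record might look like. The `PackageMetadata` class and all of its field names are my own illustration, not the schema of any real package format.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class PackageMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    version: str
    author: str
    license: str
    file_type: str
    dependencies: list[str] = field(default_factory=list)
    # Free-form build details, e.g. who triggered the CI/CD build and when.
    build_info: dict[str, str] = field(default_factory=dict)


def to_json(meta: PackageMetadata) -> str:
    """Serialize the record so it can be stored next to the package file."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


example = PackageMetadata(
    name="example-lib",
    version="1.4.2",
    author="Example Team",
    license="MIT",
    file_type="wheel",
    dependencies=["requests>=2.0"],
    build_info={"triggered_by": "ci-bot", "build_time": "2021-06-01T12:00:00Z"},
)
print(to_json(example))
```

Real formats each define their own fields, but the idea is the same: the metadata travels with the package so tools can read it without unpacking the software itself.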
A package manager is software that creates, uploads, installs, upgrades, and configures software packages for a language, container, or OS. Most package types have their own package manager: Debian’s deb packages use apt-get, Node’s packages use npm, Python’s packages use pip, etc.
Some packages have more than one package manager to choose from. For example, .NET’s NuGet packages can use Chocolatey or the native NuGet package manager. Similarly, Java’s Maven packages can use the native Maven package manager, Gradle, or Ivy. Some package managers are more stable, are easier to use, have faster build times, or have access to different packages in their repositories.
A package repository, registry, or feed is a place to store all of your packages.
Package repositories work closely with package managers, and the terms can get muddled when talking about software tools, like Cloudsmith or JFrog’s Artifactory, that support most software packages. The terms blur because these tools have the functionality of a package manager to upload, download, and configure packages, and they also host all the packages on repositories.
It is quite a task to provide support for many formats, as every language/OS has its own package manager.
Dependencies and Dependency resolution
If a package uses another package, that package is called a dependency. Almost every project uses third-party packages as libraries and/or frameworks.
Resolving dependencies in a package is no joke: specifying and resolving the dependencies and relationships between libraries and packages is NP-hard in the general case. Version constraints on packages mean package managers have to solve a problem equivalent to SAT solving.
On top of dependency resolution being a complex problem to solve, different packages have different ways to resolve conflicts or missing dependencies and some package types have deep dependency trees (I’m looking at you NPM). All in all, dependency solving is a toughie.
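To give a feel for why this is hard, here is a toy backtracking resolver. It is a sketch only: the `INDEX` contents, the integer version numbers, and the `resolve` function are all made up for illustration, and real package managers use far more sophisticated solvers.

```python
from typing import Optional

# Hypothetical index: package -> version -> {dependency: allowed versions}.
INDEX: dict[str, dict[int, dict[str, set[int]]]] = {
    "app": {1: {"web": {1, 2}, "db": {2}}},
    "web": {1: {"json": {1}}, 2: {"json": {2}}},
    "db": {1: {"json": {1}}, 2: {"json": {1}}},
    "json": {1: {}, 2: {}},
}


def resolve(todo: list[tuple[str, set[int]]],
            chosen: dict[str, int]) -> Optional[dict[str, int]]:
    """Return one consistent {package: version} assignment, or None."""
    if not todo:
        return chosen
    (name, allowed), rest = todo[0], todo[1:]
    if name in chosen:
        # Already pinned: the pinned version must satisfy this constraint too.
        return resolve(rest, chosen) if chosen[name] in allowed else None
    # Try newest versions first; backtrack if a choice leads to a conflict.
    for version in sorted(INDEX[name], reverse=True):
        if version not in allowed:
            continue
        deps = list(INDEX[name][version].items())
        result = resolve(rest + deps, {**chosen, name: version})
        if result is not None:
            return result
    return None


solution = resolve([("app", {1})], {})
print(solution)  # {'app': 1, 'web': 1, 'db': 2, 'json': 1}
```

The resolver first tries the newest `web`, discovers that it needs a `json` version that `db` cannot live with, and has to back up and pick the older `web` instead. With four packages that is trivial; with thousands of packages and many versions each, the number of combinations to explore can explode.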
Multi-format repositories allow you to store packages of different types in one repository. Many package managers’ native repositories accept only their own package format. Multi-format repositories are especially useful if your tech stack uses multiple languages and containers.
Multi-format repositories mean fewer repositories to manage, and I think that is a good thing.
Public vs. Private Package Repositories
Many languages and containers provide a public repository to host your packages. Node, for example, has the npm public registry, and Python has the PyPI repository.
Publicly available packages have made it so much easier to use Open Source Software (OSS) and have changed how software is built and deployed forever.
The benefits of OSS for organizations are numerous. Still, community-controlled public repositories cannot guarantee availability, bring an increased risk of introducing security vulnerabilities, generally host only one package format, and do not let you control who downloads your package.
Many organizations need a private repository for their packages for security, compliance, availability, or reliability. On top of that, private repositories provide additional features required by enterprises, such as:
- Single Sign-On
- Custom Domains
- Access Controls for Teams and Entitlement tokens
- Multiformat packaging
- Software Distribution
- Logging and metric data
- Integration with CI/CD and Security tools
- Tech Support
- Service Level Agreements guaranteeing uptimes
Package upstreams allow users to consume packages hosted elsewhere, in public repositories like Maven Central, PyPI, NuGet.org, npmjs.com, or Debian’s package registry. When a repository has an upstream configured, the service regularly checks for new upstream packages and stores them in the private repository.
The rules around the order and precedence of what repositories to search and what packages to select will determine what packages are used. Generally, packages in the repository itself supersede packages from the upstream.
Upstreams allow dependencies to be isolated from untrusted 3rd party sources to protect you from outages and slowness of external services.
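The lookup order described above can be sketched in a few lines. Everything here is hypothetical: the repository contents, the upstream names, and the `find_package` function are my own illustration of the general idea, not how any particular service implements it.

```python
from typing import Optional

# Hypothetical contents: package name -> available versions.
PRIVATE_REPO: dict[str, list[str]] = {
    "internal-utils": ["1.0.0"],
    "requests": ["2.25.0-patched"],
}
# Upstreams in their configured order of precedence.
UPSTREAMS: list[tuple[str, dict[str, list[str]]]] = [
    ("pypi-mirror", {"requests": ["2.26.0"], "flask": ["2.0.1"]}),
    ("partner-feed", {"flask": ["2.0.0"], "partner-sdk": ["0.3.0"]}),
]
# Copies of upstream packages kept locally after the first fetch.
CACHE: dict[str, list[str]] = {}


def find_package(name: str) -> Optional[tuple[str, list[str]]]:
    """Return (source, versions) from the first source that has the package."""
    # Packages in the private repository win over anything upstream.
    if name in PRIVATE_REPO:
        return ("private", PRIVATE_REPO[name])
    # A cached copy keeps working even if the upstream is slow or down.
    if name in CACHE:
        return ("cache", CACHE[name])
    # Otherwise search the upstreams in order and remember what was found.
    for source, packages in UPSTREAMS:
        if name in packages:
            CACHE[name] = packages[name]
            return (source, packages[name])
    return None


print(find_package("requests"))  # ('private', ['2.25.0-patched'])
print(find_package("flask"))     # ('pypi-mirror', ['2.0.1'])
```

Note that `requests` resolves to the private, patched copy even though an upstream offers a newer version, and `flask` comes from the first upstream that lists it.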
The whole point of signing a package is to be able to trust that a package is safe to download or use as a dependency.
Many software organizations use their own GPG/RSA key for signing their metadata and packages, and that key is usually managed by the private repository. Signing a package with your organization’s key lets developers know that this package was written and approved by your organization.
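The sign-then-verify flow looks roughly like the sketch below. One big caveat: real package signing uses asymmetric keys (GPG/RSA) through dedicated tooling, while this example uses an HMAC with a shared secret purely as a standard-library stand-in. The key, the package bytes, and the function names are all assumptions for illustration.

```python
import hashlib
import hmac

# Assumption: a shared secret stands in for an organization's private key.
ORG_KEY = b"example-signing-key-not-for-real-use"


def sign(package_bytes: bytes, key: bytes = ORG_KEY) -> str:
    """Return a hex signature computed over the package contents."""
    return hmac.new(key, package_bytes, hashlib.sha256).hexdigest()


def verify(package_bytes: bytes, signature: str, key: bytes = ORG_KEY) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign(package_bytes, key), signature)


package = b"contents of example-lib-1.4.2.tar.gz"
signature = sign(package)
print(verify(package, signature))                # True
print(verify(package + b"tampered", signature))  # False
```

The point is the shape of the workflow: the publisher signs the exact bytes of the package, and any change to those bytes makes verification fail on the consumer’s side.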
Lately, the software community has been coming to grips with how OSS can be the entry point for supply chain attacks. It’s a lot harder to trust code that was not written and signed by your organization. Signing OSS packages can help, but even if an OSS package is signed, it is not clear whether you should trust it.
In the absence of using a trusted signed OSS package, package repositories can scan OSS packages for known vulnerabilities and extract metadata information like version, who wrote the code, results from scans, or license information which can provide insight into the provenance of the software package.
Recently, the Linux Foundation’s Sigstore project has been working to improve trust in OSS by improving transparency and simplifying the signing process, providing a service for package signing similar to the one Let’s Encrypt provides for enabling HTTPS.
Private package repositories need to be able to manage signing packages and maintain keys, work with a package’s native signing tools, and work towards integrating with new tools to sign and trust OSS packages.
SBOM and SPDX
The Software Bill of Materials (SBOM) is essentially a list of all the components in a software product, including their licenses and dependencies. Most software includes dependencies sourced from the open-source community or from commercial vendors. New US regulations require an SBOM to be provided for software used by US government departments.
The end-user of software can use the SBOM to perform vulnerability and license analysis of their software packages, which helps them evaluate the risk in a software product.
The Software Package Data Exchange (SPDX) is an open standard for communicating the SBOM, and it has become internationally recognized for that purpose. Using SPDX makes it easier to generate an SBOM that can be shared and processed automatically.
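As a rough illustration, an SBOM in SPDX’s JSON form is just structured data listing each component. The sketch below is heavily simplified: the field names follow my understanding of the SPDX 2.x JSON layout and have not been validated against the official schema, and the `build_sbom` helper, the namespace URL, and the component names are all hypothetical.

```python
import json
from datetime import datetime, timezone


def build_sbom(product: str, components: list[dict[str, str]]) -> dict:
    """Build a simplified SPDX-style document listing each component."""
    return {
        "spdxVersion": "SPDX-2.3",
        "dataLicense": "CC0-1.0",
        "SPDXID": "SPDXRef-DOCUMENT",
        "name": product,
        # Hypothetical namespace; a real document needs its own unique URI.
        "documentNamespace": f"https://example.com/spdx/{product}",
        "creationInfo": {
            "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "creators": ["Tool: example-sbom-sketch"],
        },
        "packages": [
            {
                "name": c["name"],
                "SPDXID": f"SPDXRef-Package-{c['name']}",
                "versionInfo": c["version"],
                "licenseDeclared": c["license"],
                # Unknown in this sketch, so say so rather than guess.
                "downloadLocation": "NOASSERTION",
            }
            for c in components
        ],
    }


sbom = build_sbom("example-app", [
    {"name": "example-lib", "version": "1.4.2", "license": "MIT"},
    {"name": "example-parser", "version": "0.9.0", "license": "BSD-3-Clause"},
])
print(json.dumps(sbom, indent=2))
```

Because the result is plain structured data, a downstream tool can walk the `packages` list and check every component’s version and license without ever opening the software.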
Package License Compliance
A software license agreement is a legal document, chosen by a software company or developer, that sets out how a user can use the software. It should be included in the package. There are many software licenses with different legal terms, support agreements, limitations, and costs.
Most software licenses fall into one of two groups:
- Free and Open Source Software (FOSS) licenses, e.g., GNU General Public License (GPL), Apache, BSD, and MIT
- Proprietary software licenses, e.g., EULA.
A package manager should match the license defined within a package's metadata as accurately as possible. For example, a BSD license specified within a package's metadata can be checked against the list of valid SPDX license identifiers. This license reporting gives developers more visibility and control across all aspects of package management.
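Here is a minimal sketch of that kind of license matching. The `SPDX_IDS` set is only a tiny subset of the real SPDX license list, and the `ALIASES` table and `match_license` function are hypothetical; in particular, treating a bare "BSD" as the 3-clause variant is an assumption that a real tool would need to handle more carefully.

```python
from typing import Optional

# A small subset of SPDX license identifiers; the full list is much longer.
SPDX_IDS = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "GPL-3.0-only"}

# Hypothetical aliases for free-text license strings found in metadata.
ALIASES = {
    "mit license": "MIT",
    "apache 2.0": "Apache-2.0",
    "apache license 2.0": "Apache-2.0",
    "bsd": "BSD-3-Clause",  # assumption: a bare "BSD" means the 3-clause variant
    "new bsd": "BSD-3-Clause",
    "gplv3": "GPL-3.0-only",
}


def match_license(declared: str) -> Optional[str]:
    """Map a declared license string to an SPDX identifier, or None."""
    cleaned = declared.strip().lower()
    # Exact SPDX identifiers are compared case-insensitively.
    for spdx_id in SPDX_IDS:
        if cleaned == spdx_id.lower():
            return spdx_id
    # Fall back to the alias table for common free-text spellings.
    return ALIASES.get(cleaned)


print(match_license("BSD"))          # BSD-3-Clause
print(match_license(" mit "))        # MIT
print(match_license("Custom EULA"))  # None
```

Anything that comes back as `None` is a license the tool could not recognize, which is exactly the kind of package you would want flagged for a human to review.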
Software vendors can use private package repositories to distribute their software. Private repositories that can distribute software packages eliminate the need to rehost the package elsewhere for customers and all the management associated with that.
Software vendors distributing software may need their private repository to:
- Provide a reliable and fast software package distribution.
- Control who downloads your package. Cloudsmith does this with entitlement tokens.
- Gather metrics on downloads.
- Manage license agreements.
- Create custom domains.
- Provide Service Level Agreements (SLAs) to guarantee service levels.
Package Delivery Network
A content delivery network (CDN) refers to a geographically distributed group of servers that work together to provide fast delivery of Internet content.
At Cloudsmith, we developed what we call the Package Delivery Network, or PDN. It’s like a highly customized CDN that knows that it deals with packages, package authentication, and client package management tooling. It helps deliver packages faster to distributed users.
Package management was a lot simpler at the start of my tech career when it was just me, an exe, and an FTP server!
Package management has always been complex, even in a more straightforward landscape with a small number of dependent packages and only one package format. Modern package repositories need to host many formats and handle complicated dependencies from many feeds, all while coping with the problems of scaling, distribution, and security.
Check out part 2 of this blog, where we delve into trends in the software landscape that have changed what developers and organizations want from a package repository.