Package management: Using repositories in production systems
Data science is characterized, among other things, by the use of open source tools. One advantage of working with open source languages such as R or Python is the large package ecosystem, developed by huge communities, which provides tools for numerous use cases and problems. The packages are organized in online archives, so-called repositories. Data scientists can use these repositories to access current or past package versions and use them in their work.

An important aspect here is the continuous development of many packages. New package versions bring new, improved or extended functionality as well as bug fixes. In some cases, however, a new version also behaves differently for the same code, or introduces new dependencies on other packages, on the programming language itself or on other system components such as the underlying operating system. Such changes require additional work to keep existing code functional: the code must be adapted to the new behavior of the packages, or additional packages must be installed to satisfy the dependencies. Production systems not only have to guarantee almost constant availability, they are also worked on by many developers. It is therefore important that updates to the package landscape are carried out quickly and smoothly.
Package management and collaborative work
Ideally, all developers work in identical environments, i.e. with the same packages and package versions. In practice, however, developers may work with different package versions whose functionality has changed, so the same scripts and analyses do not behave uniformly for everyone and produce either errors or different results.
In addition to the danger that scripts behave differently across development environments, there is the danger that developers' package versions differ from those of the production system, where the analyses are used productively and must therefore run almost continuously. To avoid conflicts between different package versions, a good infrastructure should ensure smooth package management that guarantees equal development conditions and controlled, synchronized updates.
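As an illustration of how such version differences can be made visible, the following minimal Python sketch compares two environment snapshots. It assumes the snapshots are available as `name==version` lines, the format printed by `pip freeze`; the function names `parse_freeze` and `find_drift` and the sample data are hypothetical and not taken from any particular tool.

```python
def parse_freeze(text):
    """Parse 'name==version' lines (the format printed by `pip freeze`) into a dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and entries that are not plain pins.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[name.strip().lower()] = version.strip()
    return versions


def find_drift(dev, prod):
    """Return {package: (dev_version, prod_version)} for every mismatch.

    A version of None means the package is missing from that environment.
    """
    drift = {}
    for name in sorted(set(dev) | set(prod)):
        if dev.get(name) != prod.get(name):
            drift[name] = (dev.get(name), prod.get(name))
    return drift


# Hypothetical snapshots of a developer machine and the production system.
dev_snapshot = parse_freeze("pandas==1.0.3\nnumpy==1.18.1\nrequests==2.23.0\n")
prod_snapshot = parse_freeze("pandas==0.25.3\nnumpy==1.18.1\n")

for name, (dev_v, prod_v) in find_drift(dev_snapshot, prod_snapshot).items():
    print(f"{name}: development={dev_v}, production={prod_v}")
```

An empty result would indicate that both environments list exactly the same packages in the same versions.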
A first step toward good package management is to provide packages in a local, company-wide or team-wide repository. The RStudio Package Manager is suitable for this: it acts as a bridge that integrates different package sources, such as an online repository, a local repository and an external development repository (GitLab). For developers, the local repository works like an online repository, except that only selected packages and package versions are available. Companies with restrictive corporate governance principles, for example, may want only an approved subset of packages in their local repository. This gives all data scientists access to the same central set of packages, while ensuring that the package versions in the repository are largely stable and all dependencies are met. As a result, the developed algorithms and code behave the same in the various development environments and in the production system throughout the company. However, several versions of the same package cannot always coexist in such a repository, since this again carries the danger of different developers working with different package versions, as with an online repository.
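The idea of a local repository that exposes only an approved subset of an upstream repository can be sketched in a few lines of Python. This is a conceptual illustration only and not how the RStudio Package Manager is implemented or configured; the function `build_local_index` and all package data are hypothetical.

```python
def build_local_index(upstream, approved):
    """Keep only approved versions that actually exist upstream.

    upstream: {package: [available versions]}
    approved: {package: [versions cleared for company use]}
    """
    local = {}
    for package, versions in approved.items():
        available = upstream.get(package, [])
        kept = [v for v in versions if v in available]
        if kept:
            local[package] = kept
    return local


# Hypothetical data; a real repository manager would read this from its sources.
upstream_index = {
    "pandas": ["0.25.3", "1.0.0", "1.0.3"],
    "numpy": ["1.17.4", "1.18.1"],
    "leftpad": ["0.1.0"],
}
approved_versions = {
    "pandas": ["0.25.3", "1.0.3"],
    "numpy": ["1.18.1"],
}

local_index = build_local_index(upstream_index, approved_versions)
print(local_index)
```

Packages that were never approved, such as the made-up `leftpad` entry, simply do not appear in the local index, so developers cannot install them from it.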
Package management in practice
To prevent developers from ending up on different versions of the same package, the local repository can hold several package versions, with each project restricted to a specific version. For this purpose, a project environment is defined for each project; it contains a selected part of the packages in the local repository and is limited to fixed package versions. This has the advantage that different projects can use different packages or package versions, while each project has a stable and conflict-free package world. For the data scientists, this means either developing on a central development system (e.g. RStudio Server) or working on their local system with the packages defined for the project (e.g. as an R project or conda environment, optionally within a Docker container). In addition, a production system is operated whose package landscape is identical to that of the development environment. Here the local repository provides an additional level of security: only packages that have proven stable over a certain period, and that already contain initial bug fixes, are used.
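For a Python project, a check that the active environment really matches the fixed versions of the project environment might look like the following sketch. It uses `importlib.metadata` from the standard library (Python 3.8 or later); the function `check_pins` and the pinned versions shown are assumptions made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins):
    """Compare installed package versions with the project's pinned versions.

    Returns {package: (pinned, installed)} for every deviation; installed is
    None when the package is not installed at all.
    """
    deviations = {}
    for package, pinned in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            deviations[package] = (pinned, installed)
    return deviations


# Hypothetical pins for one project; real values would come from the
# project's environment definition.
project_pins = {"pandas": "1.0.3", "numpy": "1.18.1"}

problems = check_pins(project_pins)
if problems:
    for package, (pinned, installed) in problems.items():
        print(f"{package}: pinned {pinned}, installed {installed}")
else:
    print("Environment matches the project pins.")
```

Running such a check at the start of an analysis, or as part of a deployment step, is one way to notice early when a developer or production environment has drifted from the project definition.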
When it is time to update the packages, this should be done almost simultaneously in the development and production environments to keep the period in which the environments behave differently as short as possible. It is especially important that the production system runs stably and without interruptions. It is therefore advisable to set up a test system on which updates are carried out first, to check for missing package dependencies or conflicts between package versions. Once the test system has reached a stable state, the development environments can be updated so that the algorithms and analyses can be adapted to the new package versions if necessary. The package update on the production system can then be rolled out together with the analysis adjustments already tested in the development environments, keeping the risk of errors on the production system as low as possible. A reliable infrastructure is essential to carry out such updates quickly and smoothly on a regular basis. The structure of such an infrastructure depends on many factors, such as the number of projects, the size of the development teams and the length of the update cycles.
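The staged order described here (test system first, then development, then production) can be expressed as a small Python sketch that stops at the first stage that does not reach a stable state. The function `roll_out` and the check function are hypothetical placeholders; in a real setup the check would run dependency resolution and the project's tests.

```python
def roll_out(stages, apply_and_check):
    """Apply an update stage by stage and stop at the first failure.

    stages: environment names in rollout order.
    apply_and_check: callable that updates one environment and returns True
        if it is stable afterwards (hypothetical hook, e.g. a test suite run).
    Returns the list of stages that were updated successfully.
    """
    updated = []
    for stage in stages:
        if not apply_and_check(stage):
            print(f"Update failed on {stage}; later stages stay untouched.")
            break
        updated.append(stage)
    return updated


# Order described in the text: test system first, production last.
STAGES = ["test", "development", "production"]


def simulated_check(stage):
    # Stand-in for real checks such as dependency resolution and test runs.
    return stage != "development"


print(roll_out(STAGES, simulated_check))
```

In this simulated run the update passes on the test system but fails in development, so production is never touched, which is the property the staged approach is meant to provide.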
Good package management in production systems and a fully functional infrastructure are the basis for a trouble-free development environment. We are happy to support and advise you in planning and implementing an IT infrastructure in your company. Learn more about aicon | analytic infrastructure consulting!
Florian Schmoll - Posted on 21.11.2019
Florian Schmoll studied mathematics at the University of Kassel and has worked as a data scientist at eoda since 2017. His main tasks include developing R packages and analyzing data in industrial contexts. Working as a data scientist allows him to apply the theoretical knowledge gained during his studies to solving practical business problems.