Anyone starting out as a developer in 2019, whether in software development, data science or data ops, usually encounters a version management tool early on. Programs such as Git, SVN and BitKeeper are primarily used to move transparently back and forth through the development history or to develop new features on separate development branches. Version control for linear script development and code management is a further key use. The vocabulary needed to get started seems comparatively small, but this should not deceive anyone: people working with these tools for the first time have to adopt an unfamiliar way of thinking.

A handful of commands is enough to work productively with these tools and to ensure a transparent, conflict-free workflow. Beyond these everyday commands, all tools offer a much wider range of functions that are helpful for specific use cases but also increase the complexity of the workflow.

In this article we give a short introduction to the tasks of version management software and examine one of these tools, Git, in more detail.

Logging & archiving – The everyday work of a version management system

The basic operation of version control tools is quite simple. Changes to files are tracked within an entry point, a (project) folder. As a rule, not every file is monitored: only files that the user has marked or indexed are tracked. In addition, changes can be made on separate development branches and thus developed in isolation. This approach avoids conflicts, since other users only see the finished feature and can concentrate on developing their own features without having to take unfinished changes into account. Different development stages of the project therefore remain accessible at any time and can be restored if necessary.

The workflow can be pictured as a tree, and it usually follows the same pattern across projects:

  • All files, including the version tree, are stored in a repository. The developer copies the current version (the symbolic reference, e.g. HEAD) into a local working directory.
  • There is a main branch that always contains an executable, up-to-date version of the project (trunk in SVN, master in Git). Besides master, there is often a develop branch, sometimes called the “nightly build”. This additional branch is also kept operational (ideally at all times); it contains the current production version plus the latest features and bug fixes.
  • If changes are made, they are first implemented on a separate branch. Once the developer is satisfied with the changes, they are committed, i.e. recorded in version control.
  • To give other developers access to the new local changes, the previously local branch is published in the remote repository with a push. In this way, developers can synchronize their development states.
  • Once the change is tested and executable, the development branch can be merged into the main branch if required.
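
The tree described above can be sketched as a small in-memory model. This is only an illustration, not how Git or SVN store data internally: the class `ToyRepository`, its methods and the branch name `feature` are hypothetical names chosen for this example. A branch is modelled as a name pointing at a commit, and a merge as a commit with two parents.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class Commit:
    """One recorded state; its parents link it back into the version tree."""
    message: str
    parents: Tuple["Commit", ...] = ()


class ToyRepository:
    """Hypothetical toy model: a branch is a name that points at a commit."""

    def __init__(self, main: str = "master") -> None:
        self.branches: Dict[str, Commit] = {main: Commit("initial version")}
        self.current = main

    def branch(self, name: str) -> None:
        """Start a new branch at the current commit and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name: str) -> None:
        """Switch to an existing branch."""
        if name not in self.branches:
            raise KeyError(f"unknown branch: {name}")
        self.current = name

    def commit(self, message: str) -> None:
        """Record a change on the current branch only."""
        tip = self.branches[self.current]
        self.branches[self.current] = Commit(message, (tip,))

    def merge(self, other: str) -> None:
        """Join another branch into the current one with a two-parent commit."""
        tips = (self.branches[self.current], self.branches[other])
        label = f"merge {other} into {self.current}"
        self.branches[self.current] = Commit(label, tips)

    def reachable(self, name: str) -> Set[str]:
        """Messages of all commits reachable from the tip of a branch."""
        seen: Set[Commit] = set()
        stack = [self.branches[name]]
        while stack:
            commit = stack.pop()
            if commit not in seen:
                seen.add(commit)
                stack.extend(commit.parents)
        return {commit.message for commit in seen}


repo = ToyRepository()
repo.branch("feature")          # develop in isolation
repo.commit("add feature")      # master does not see this yet
before_merge = repo.reachable("master")
repo.checkout("master")
repo.merge("feature")           # the finished feature arrives on master
after_merge = repo.reachable("master")

if __name__ == "__main__":
    print(sorted(before_merge))
    print(sorted(after_merge))
```

Before the merge, only the initial version is reachable from master; afterwards the feature commit and the merge commit are reachable as well, which is the isolation property the list describes.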

Decentralized version management with Git

In contrast to central version management, where the version tree exists only in one central repository, each developer in decentralized (distributed) version management has their own local repository. Changes can be tracked locally in one's own repository and compared with the repositories of other developers. Conflicts between two or more developers working on the same files only have to be resolved when the different versions are to be merged into one. The following steps introduce one possible workflow with Git, one of the most popular applications for decentralized version management. Note that we describe a variant of distributed version control in which an official repository exists; it is cloned at the beginning of the project, and local changes are brought together in it. In theory such a repository is not necessary, but it makes sense in most projects.
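
To make the idea of independent local repositories concrete, here is a minimal sketch that drives the real `git` command line from Python. It assumes `git` is installed and on the PATH; the helper `git()`, the developer names `alice` and `bob`, the repository name `official.git` and the throwaway identity are all made up for this illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run one git command in cwd and return its trimmed standard output."""
    # Throwaway identity so that commits work without any global git config.
    identity = ["-c", "user.name=Demo", "-c", "user.email=demo@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_local_repositories():
    """Two developers clone one official repository; only one of them commits."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        official = root / "official.git"
        git("init", "--bare", str(official), cwd=root)

        # Every clone is a complete repository of its own.
        for developer in ("alice", "bob"):
            git("clone", str(official), str(root / developer), cwd=root)

        # Alice records a commit locally; nothing is pushed.
        git("commit", "--allow-empty", "-m", "Alice's local work",
            cwd=root / "alice")

        alice_commits = int(git("rev-list", "--count", "HEAD",
                                cwd=root / "alice"))
        official_refs = git("for-each-ref", cwd=official)
        bob_refs = git("for-each-ref", cwd=root / "bob")
        return alice_commits, official_refs, bob_refs


if __name__ == "__main__":
    print(demo_local_repositories())
```

Alice's repository contains one commit, while the official repository and Bob's clone still have no branches at all: local history stays local until it is explicitly published.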

Step 1: Create and synchronize a remote repository

First, an official repository is created that every developer can access. This repository contains a master branch. Its sole purpose is to provide stable versions, which are merged in from the developers' branches. None of the developers should make direct changes to the master branch.

Step 2: Create and synchronize local branches

Each developer creates local branches on which features or the like are developed. The local branches are synchronized with the remote repository via an upstream, i.e. a branch in the remote repository that the local branch tracks.
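
As a rough sketch of this step, the following script creates a local branch and links it to the remote repository with `git push --set-upstream`. It assumes `git` is installed; the helper `git()`, the branch name `feature/login` and the repository paths are hypothetical, and a bare repository in a temporary folder stands in for the official remote.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run one git command in cwd and return its trimmed standard output."""
    # Throwaway identity so that commits work without any global git config.
    identity = ["-c", "user.name=Demo", "-c", "user.email=demo@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_upstream() -> str:
    """Create a local feature branch and connect it to a remote branch."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        official = root / "official.git"   # stand-in for the official remote
        dev = root / "dev"                 # one developer's local clone
        git("init", "--bare", str(official), cwd=root)
        git("clone", str(official), str(dev), cwd=root)

        # Work happens on a separate local branch, not on master.
        git("checkout", "-b", "feature/login", cwd=dev)
        git("commit", "--allow-empty", "-m", "Start login feature", cwd=dev)

        # The first push publishes the branch and records its upstream.
        git("push", "--set-upstream", "origin", "feature/login", cwd=dev)

        # Ask git which remote branch the local branch now tracks.
        return git("rev-parse", "--abbrev-ref", "feature/login@{upstream}",
                   cwd=dev)


if __name__ == "__main__":
    print(demo_upstream())
```

Once the upstream is recorded, a plain `git push` or `git pull` on that branch knows which remote branch to synchronize with.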

Step 3: Stage/commit/push changes

After local changes have been made to documents, files or folders, they must first be staged. Staging marks the changes for the next commit. Once the changes have been committed to the local branch, the branch is pushed to synchronize it with the remote repository.
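
The three stages can be observed with `git status --porcelain`, which prints a short status code per file. The sketch below assumes `git` is installed; the helper `git()`, the file `report.txt` and the branch `feature/report` are invented for the example, and a bare repository in a temporary folder plays the role of the remote.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run one git command in cwd and return its trimmed standard output."""
    # Throwaway identity so that commits work without any global git config.
    identity = ["-c", "user.name=Demo", "-c", "user.email=demo@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_stage_commit_push():
    """Follow one new file from untracked to staged, committed and pushed."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        official = root / "official.git"
        dev = root / "dev"
        git("init", "--bare", str(official), cwd=root)
        git("clone", str(official), str(dev), cwd=root)
        git("checkout", "-b", "feature/report", cwd=dev)

        # A new file is not tracked until the user marks it.
        (dev / "report.txt").write_text("first draft\n")
        untracked = git("status", "--porcelain", cwd=dev)

        # Staging marks the change for the next commit.
        git("add", "report.txt", cwd=dev)
        staged = git("status", "--porcelain", cwd=dev)

        # Committing records the staged change on the local branch.
        git("commit", "-m", "Add first draft of report", cwd=dev)
        clean = git("status", "--porcelain", cwd=dev)

        # Pushing publishes the local branch in the remote repository.
        git("push", "origin", "HEAD", cwd=dev)
        local_tip = git("rev-parse", "HEAD", cwd=dev)
        remote_tip = git("rev-parse", "refs/heads/feature/report",
                         cwd=official)
        return untracked, staged, clean, local_tip == remote_tip


if __name__ == "__main__":
    print(demo_stage_commit_push())
```

The status changes from `?? report.txt` (untracked) to `A  report.txt` (staged) to an empty string (nothing left to commit), and after the push the remote branch points at the same commit as the local one.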

Step 4: Merge the developer branches

If all necessary features for an updated version are in place, the developer branches can be merged into the master branch. Any conflicts between the different branches are resolved at this point.
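
A conflict-free merge can be sketched as follows, again assuming `git` is installed. The helper `git()`, the file names and the branch `feature/export` are hypothetical. The script does not hard-code the name of the main branch, since a fresh repository may call it master or main depending on the local configuration, and it uses `--no-ff` so that the merge is recorded as a commit of its own.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run one git command in cwd and return its trimmed standard output."""
    # Throwaway identity so that commits work without any global git config.
    identity = ["-c", "user.name=Demo", "-c", "user.email=demo@example.com"]
    done = subprocess.run(["git", *identity, *args], cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def tracked_files(repo: Path):
    """Names of the files in the working directory, ignoring .git itself."""
    return sorted(p.name for p in repo.iterdir() if p.name != ".git")


def demo_merge():
    """Develop a feature on its own branch, then merge it into the main branch."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git("init", cwd=repo)
        main = git("symbolic-ref", "--short", "HEAD", cwd=repo)

        # A first stable version on the main branch.
        (repo / "app.txt").write_text("stable v1\n")
        git("add", "app.txt", cwd=repo)
        git("commit", "-m", "Stable version 1", cwd=repo)

        # The feature is developed in isolation.
        git("checkout", "-b", "feature/export", cwd=repo)
        (repo / "export.txt").write_text("export feature\n")
        git("add", "export.txt", cwd=repo)
        git("commit", "-m", "Add export feature", cwd=repo)

        # Back on the main branch the feature is not visible yet.
        git("checkout", main, cwd=repo)
        before = tracked_files(repo)

        # The merge brings the finished feature into the main branch.
        git("merge", "--no-ff", "-m", "Merge feature/export",
            "feature/export", cwd=repo)
        after = tracked_files(repo)

        # Output has the form "<commit> <parent> <parent>" for a merge commit.
        ids = git("rev-list", "--parents", "-n", "1", "HEAD", cwd=repo).split()
        return before, after, len(ids) - 1


if __name__ == "__main__":
    print(demo_merge())
```

If both branches had changed the same lines of the same file, `git merge` would stop with a non-zero exit code (raised here as `CalledProcessError`) and the conflict would have to be resolved by hand before committing.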

Conclusion

Version management is a central tool for project management, and not only in software development. Document versioning is used in almost every area, although not always with tools such as Git or SVN (e.g. when working together on a Word document).

Numerous additional applications offer a GUI for version management (e.g. GitLab for Git) or already provide a complete CI pipeline for the repository (e.g. GitLab CI Runner). Thanks to them, working together on a shared project will probably become easier and more accessible in the future.

Within the framework of our eoda | analytic infrastructure consulting, we support you in setting up a productive data science environment, covering optimal version management and many other aspects that are important to your company.