June 10, 2020
Be rigorous to work faster!¿
In a world where data is omnipresent, always changing, and often messy. You need to have a sane method of work, for yourself but also with your colleagues to achieve your project or analysis. One part of the data analyst job is to make this work understandable, with a coherent environment and future proof. Fortunately, the use of well-maintained technology to manage dependencies can provide all of this.
While doing any development work, switching versions for dependencies is a major issue. Debian 10 (Buster) Linux distribution comes with Python version 2.7 by default. It does provide Python 3.7 though, which can be invoked by running “python3” command in a terminal, but this is not a default version. This implies all your installations and development work will be done on top of the default Python 2.7. However, Python 2.7 will not be maintained past 2020, its obsolescence has made Python 3 a prerequisite for most of the recent libraries. Also, different pieces of code demand for different versions of Python to be executed. Hence, it is a necessity to switch between versions of Python while working, without breaking any of your developments.
And now what? You have a perfectly operational data science project running with Python 3.5, your colleague uses Python 3.8. You think about modifying the default python of your running distribution. Well, if no one has told you before: just don’t! Best case scenario you will break some old system logic, worst case scenario you will break most of your system.
With constant and fast software changes, how to avoid the hassle of maintaining a reliable and stable data-science work environment?
What you need is a modern tool to make your life easier and allow you to focus on your work.
What is esDynamic?
At its core, esDynamic is a web application platform. You can access the core platform and add from a wide selection of tools and modules that match your needs. It was built to leverage Jupyterlab as a collaboration framework, with notebooks complementing complex attack scripts.
The platform has been specifically designed for your own talented experts to work day-to-day, using tools and modules to complete tests and analyses as well as profiting from built-in knowledge share, ease of use, collaboration tools and bespoke online learning modules. Every analysis is recorded, enabling a review of steps during a test, enabling multiple team members to follow and learn methods and techniques. Hard learned expertise stays on the system even if the people leave. esDynamic, the cybersecurity tool built by experts for experts.
Notebooks documents contain the inputs and outputs of an interactive Python session as well as narrative text that accompanies the code. Rich output generated by running code, including HTML, images, video, and plots, is embedded in the notebook, which makes it a complete and self-contained record of a computation. These can be converted to various formats and shared with collaborators using any means of file sharing. Notebooks are the main interface to use esDynamic functionalities. While notebooks can technically execute Julia, R and many other interpreters, the Python interpreter is the more widely used and is used to run esDynamic’s libraries.
We share back
The esDynamic platform has also been a nice opportunity for eShard to contribute and help the community to reach the state of the art side-channels methods and techniques by attending large conferences, giving talks and releasing the first side-channels open-source librairie: scared.
The modern package manager
Python applications can have conflicting dependencies and versions requirements. Consequently, a single Python installation may not meet the necessary conditions to run all software packages within the same system. The solution is to work within virtual environments: independent and self-contained directories providing the required dependencies and versions. Python 3 standard library now provides the virtual environment manager called venv. Up until Python 3.6, a script called pyvenv was also included as a wrapper around venv, but this has been deprecated. It has been completely removed in Python 3.8. You should now use venv directly or virtualenv, these tools work well as long as you don’t have to manage “builds”. Indeed you may want to use different builds to make different things. For example: in data science, Intel MKL built Numpy, Scipy and others are often used. Using Pip, these tools usually need to be re-compiled with custom flags or use some Pip wheels magics.
What is wrong with pip?
Conda and Pip are often considered as being nearly identical. Although some of their functionalities overlap, they were designed for different purposes, and they should be used for their intended purposes. Pip is the Python Packaging Authority’s recommended tool for installing packages from the Python Package Index, PyPI. Pip installs Python software packaged as wheels or source distributions. The latter may require that the system have compatible compilers, and possibly libraries, installed before invoking Pip to succeed.
|Manages||binaries||wheel or source|
|Can require compilers||no||yes|
|Create environment||yes, built-in||no, requires virtualenv or venv|
|Package sources||Anaconda repo,cloud, conda-forge||PyPI|
Conda is a cross-platform package and environment manager that installs and manages Conda packages from the Anaconda repository as well as from the Conda-Forge repository. Conda packages are binaries, meaning there is no compiler requirement at all for your system. Additionally, Conda packages are not limited to Python software. They may also contain C or C++ libraries, R packages or any other software.
Builds, environments, everything in the same tool!
This highlights a key difference between Conda and Pip. Pip installs Python packages whereas Conda installs packages that may contain software written in any language. For example, before using Pip, a Python interpreter must be installed via a system package manager or by downloading and running an installer. On the other hand, Conda can install Python packages as well as the Python interpreter directly, but Conda also installs packages compiled with different methods or flags. This is particularly useful when working with hybrid packages such as Google Tensorflow or Nvidia Cuda.
Another key difference between the two tools is that Conda has the ability to create isolated environments that can contain different versions of Python and/or the packages installed within. This can be extremely useful when working with data science frameworks as different tools may contain conflicting requirements that could prevent them from being installed into a single environment. Pip has no built-in support for environments but rather depends on other tools like virtualenv or venv to create isolated environments.
A little work sanity
The management of numerous libraries, Python or other, can be really hard to deal with. The maintainers need to add more and more functionalities while thinking about not breaking things for the end-user. The library’s users on the other hand, build a lot of great content and do a lot of research using these same tools. They don’t want to lose all their hard work over some update, and yet they want more and more features.
Without a bit of methodology, the code in notebooks tends to grow to the point of being incomprehensible. To overcome this problem, the proper way is to extract parts of it into Python modules once in a while. Since it only makes sense to extract functions and classes into Python modules, I often start cleaning up a messy notebook by thinking about the actual task a group of cells is accomplishing. This helps me to refactor those cells into a proper function which I can then migrate into a Python module.
Everything is module now
At the point where you create custom modules, things get trickier. By default, Python will only allow you to import modules that are installed in your environment or in your current working directory. Due to this behavior, many people start creating their custom modules in the directory holding their notebooks. Since esDynamic is nice enough to set the current working directory to the directory containing your notebook, everything seems fine at the beginning. But as the number of notebooks that share common functionality imported from modules grows, the single directory containing notebooks and modules will get messier as you go. The obvious split of notebooks and modules into different folders or even organizing your notebooks into different folders will not work with this approach since your imports will fail.
This observation brings us to one of the most important best practices: develop your code as a Python package. A Python package will allow you to structure your code nicely over several modules and even sub-packages. With Conda, the creation of a package is as simple as creating a description file called meta.yml. And the best part is that you can integrate it with esDynamic, distribute it and share it with your colleagues. “But creating a Python package is so much overhead; surely not worth! I will do this small little analysis in half a day anyway and then forget about it”, I hear you say. Well, how often is this actually true? Things always start small but then get bigger and messier if you don’t adhere to a certain structure right from the beginning. About half a year later, your boss will ask you about that specific analysis you did back then and if you could repeat it with the new data. But more importantly, coming back to well-structured codes, even old ones, it’s no overhead at all.
Develop inside esDynamic!
At that point, the natural question comes up, how to edit the code within your package. Of course, esDynamic will do the job since it includes a real Integrated Development Environment (IDE) for such tasks: Theia. Eclipse Theia is an extensible platform created with state-of-the-art web technologies offering a great alternative to VS code. This IDE makes a great addition to esDynamic, providing you with an all-in-one platform.
Python and its ecosystem evolve steadily and quickly thus, things that worked today might break tomorrow if a version of one of your dependencies changes. If you consider yourself a data scientist, you should guarantee the reproducibility of whatever you do since it’s the most fundamental pillar of any real science. Reproducibility means that given the same data and code, your future self and, of course, others, should be able to run your analysis or model receiving the same results. To achieve this technically, we need to record all dependencies and their versions. With Conda, it consists in creating an environment declaration file. This file recursively states all dependencies and their version as well as the Python version that was used, allowing anyone to deterministically reproduce this environment in the future. This is as easy as just running a Conda command.
Besides a concrete environment file that exhaustively lists all dependencies, it’s also common practice to define an environment file where you state your abstract dependencies. These abstract dependencies comprise only libraries that are directly imported with no specific version. This file keeps track of all libraries you are directly using. If you added a new library you can use this file to update your current environment. Ex.
dependencies: - python=3.6 - scared=0.8.11 - tensorflow=1.15.0=gpu_py36h5a509aa_0 - mamba - intel::mkl - intel::mkl_random - intel::mkl-service - intel::numpy - intel::scipy
Creating an environment file makes it easy to work in a collaborative way with esDynamic. This allows you to share, improve and work seamlessly with your colleagues or managers. Environment files can also be used as an archive system, you can store these declarative files in order to reproduce a given setup from a given time. Recreating earlier environments has proven to be quite useful to reiterate some old analysis without the need to adapt to newer versions of the software.
Pretty neat ugh?
The esDynamic platform is a powerful tool to work in a collaborative way with your colleagues and managers. With the inclusion of the Conda package manager, you are able to replicate your work at any given time. At eShard we believe that your old work must not be lost nor should it be adapted to each new software release. Your work, your notebooks, your research will run as smoothly as the day you wrote them, even if it was years ago.
Notebooks are the new way to display your analysis report. We use them every day, they are a flexible, multi-purpose tool that can support digital research in many contexts. When paired with Conda, a new and efficient way to manage dependencies, portability, and reproducibility, notebooks are the best way to achieve side-channels analysis.
esDynamic takes advantage of all of these tools to provide you with an expert-oriented tool that will take your side channels skills beyond.