<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[www.GuidoDiepen.nl]]></title><description><![CDATA[Tips and Tricks / Cool new tools / Automating everything / Analytics / Making life easier]]></description><link>https://www.guidodiepen.nl/</link><image><url>https://www.guidodiepen.nl/favicon.png</url><title>www.GuidoDiepen.nl</title><link>https://www.guidodiepen.nl/</link></image><generator>Ghost 3.42</generator><lastBuildDate>Thu, 08 Feb 2024 18:54:02 GMT</lastBuildDate><atom:link href="https://www.guidodiepen.nl/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Include help/instructions in cookiecutter prompts]]></title><description><![CDATA[<p>Although I really like <a href="https://github.com/cookiecutter/cookiecutter">cookiecutter</a> since it enables you to really start a lot of different projects in exactly the same way, one of the things that I think could be improved is an easy way to provide some additional help/information when the questions are being asked. I saw</p>]]></description><link>https://www.guidodiepen.nl/2021/05/include-help-instructions-in-cookiecutter-prompts/</link><guid isPermaLink="false">60769b71ecac4900012eddf0</guid><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Fri, 14 May 2021 12:17:32 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2021/05/cookiecutter_medium.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.guidodiepen.nl/content/images/2021/05/cookiecutter_medium.png" alt="Include help/instructions in cookiecutter prompts"><p>Although I really like <a href="https://github.com/cookiecutter/cookiecutter">cookiecutter</a> since it enables you to really start a lot of different projects in exactly the same way, one of the things that I think could be improved is an easy way to provide some additional help/information when the questions are being asked. I saw that there already exists an <a href="https://github.com/cookiecutter/cookiecutter/issues/794">issue</a> for this on the cookiecutter repo, but this is already quite old. Not too much progress has been made on this area, so not sure when such a new feature would become available, unless you start using forked projects.</p><p>The reason I want to do this is that sometimes it can be useful for new users of a cookiecutter template to get access to inline documentation instead of having to open some other page.</p><p>One thing I have always liked with computers/programs since I was little, is to see how I can use what is provided in such a way that I can do something that it originally was not supposed to do. 
This process requires a lot of puzzling and trying to be very creative with the inputs you can provide.</p><p>After a bit of experimenting and playing around, I was able to at least provide some help text in a cookiecutter.json that can be shown to the user, and this post shows how I got this to work.</p><h3 id="base-scenario-cookiecutter-json">Base scenario cookiecutter.json</h3><p>Let's start with a very basic cookiecutter project, where we have a README.md file in a project directory that will be templated using the information from cookiecutter.</p><p>In the folder <code>{{ cookiecutter.project_slug }}</code> we will have a README.md with the following contents:</p><pre><code># {{ cookiecutter.project_name }}

This is a small project created with the cookiecutter demo template

Relevant settings:
* Project name: {{ cookiecutter.project_name }}
* Author: {{ cookiecutter.full_name }} ({{ cookiecutter.email }})

Debug (complete cookiecutter context):
{% for key, value in cookiecutter.items() -%}
* "{{ key }}": {{ value }}
{% endfor %}
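{# Note: the -%} in the for tag above strips the newline that follows the tag, so each debug line starts directly with the bullet #}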
</code></pre><p>Our <code>cookiecutter.json</code> will have the following contents:</p><pre><code>{
  "full_name": "Your name",
  "email": "your email address",
  "project_name": "Your project name",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}"
}</code></pre><p>If you run this json file with cookiecutter, you will get the following list of questions:</p><pre><code>(testing_cookiecutter)  guido@cartman:~/Programming/cookiecutter-demo$ cookiecutter -o . cookiecutter-demo/
full_name [Your name]: Guido Diepen
email [your email address]: a@b.com
project_name [Your project name]: My Awesome Project
project_slug [my_awesome_project]: 
</code></pre><p>and after answering them as in the above example, cookiecutter will automatically generate a new folder <code>my_awesome_project</code> (since that is the default project_slug) containing a README.md with the following contents:</p><pre><code>(testing_cookiecutter)  guido@cartman:~/Programming/cookiecutter-demo$ cat my_awesome_project/README.md 
# My Awesome Project

This is a small project created with the cookiecutter demo template

Relevant settings:
* Project name: My Awesome Project
* Author: Guido Diepen (a@b.com)

Debug (complete cookiecutter context):
* "full_name": Guido Diepen
* "email": a@b.com
* "project_name": My Awesome Project
* "project_slug": my_awesome_project
* "_template": cookiecutter-demo/
</code></pre><p>This is the way cookiecutter is intended to be used, so this should not be a big surprise.</p><h3 id="adding-help-text">Adding help text</h3><p>Now let's see how we can add some help. After some playing around, I found that I could just add new-line characters in the default values for variables and they would be printed. Since I am not really interested in the actual value later on, I am for now using one space as the key.</p><p>For example, if you change the cookiecutter.json to the following:</p><pre><code>{
  " ": "CookieCutter with inline help demo\n\nThis is some\nmultiline\ntext explaining the options",
  "full_name": "Your name",
  "email": "your email address",
  "project_name": "Your project name",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}"
}</code></pre><p>we would end up with the following set of questions:</p><pre><code>(testing_cookiecutter)  guido@cartman:~/Programming/cookiecutter-demo$ cookiecutter -o . cookiecutter-demo/
  [CookieCutter with inline help demo

This is some
multiline
text explaining the options]: 
full_name [Your name]: Guido Diepen
email [your email address]: a@b.com
project_name [Your project name]: My Awesome Project
project_slug [my_awesome_project]: </code></pre><p>As you can see, this does give us a nice multi-line place where we can put any text. If we want to show some text and make it more informative for an end-user, we can play around a bit with the exact 'default value' of the key. </p><p>From the above example it is clear that the way cookiecutter prints the key-value combination is as follows:</p><p><code>key [value]:</code></p><p>Since our key is a single space (or more spaces, if you later decide to have multiple explanation screens), cookiecutter will print the spaces followed by a <code>[</code> indicating the current value. In order to clean this up a tiny bit, I am setting the value for the one-space key in the cookiecutter.json as follows:</p><pre><code>  " ": "]\nCookieCutter with inline help demo\n\nThis is some\nmultiline\ntext explaining the options\n[Please press enter to continue"</code></pre><p>This will print a closing <code>]</code> right after the opening <code>[</code> displayed by cookiecutter (I have not found any way to disable this...), followed by a new-line, followed by the multi-line text. Finally, I show an additional prompt to the end-user, telling him/her to press the enter key to continue (and with that storing the multi-line default value in the cookiecutter context under the single-space key). </p><p>Note that cookiecutter will add a closing <code>]</code> to the default value automatically, therefore I prefix the final prompt line with an opening <code>[</code>.</p><p>On the screen, this will look as follows:</p><pre><code>(testing_cookiecutter)  guido@cartman:~/Programming/cookiecutter-demo$ cookiecutter -o . cookiecutter-demo/
  []
CookieCutter with inline help demo

This is some
multiline
text explaining the options
[Please press enter to continue]: 
full_name [Your name]: Guido Diepen
email [your email address]: a@b.com
project_name [Your project name]: My Awesome Project
project_slug [my_awesome_project]: 
</code></pre><p>Unfortunately, this way it is not yet possible to print messages before every question without having the user explicitly press enter to move to the next question after the 'help' question.</p><p>Therefore, my current approach when using this is to divide all of my questions into a couple of blocks and have one multi-line explanation key above each block.</p><h3 id="cleaning-up-help-text">Cleaning up help text</h3><p>The only thing you will have to be very careful with is that the cookiecutter context will now have multi-line values. In case you are performing a loop over all of the key-value combinations in the cookiecutter context and using them without triple-quotes in python for example, you could end up with incorrect python code because these strings would be printed over multiple lines.</p><p>In order to solve this, I just add one last item in the cookiecutter.json that updates the values of all of my special space keys from the default multiline value to just an empty string. </p><p>This can be achieved by adding the following key-value combination to your cookiecutter.json as the last line:</p><pre><code>  "  ": "{% set update_result = cookiecutter.update({' ': ''}) %}Finished with all questions - Please press enter to generate your project"
</code></pre><p>Using this additional key-value, we will now get the following flow through our cookiecutter:</p><pre><code>(testing_cookiecutter)  guido@cartman:~/Programming/cookiecutter-demo$ cookiecutter -o . cookiecutter-demo/
  []
CookieCutter with inline help demo

This is some
multiline
text explaining the options
[Please press enter to continue]:    
full_name [Your name]: Guido Diepen
email [your email address]: a@b.com
project_name [Your project name]: My Awesome Project
project_slug [my_awesome_project]: 
   [Finished with all questions - Please press enter to generate your project]: </code></pre><p>After the user presses enter, cookiecutter will generate the new project again.</p><h3 id="taking-it-to-the-limit">Taking it to the limit</h3><p>Now that we are able to show some text, why stop at just boring text? Since we know we can print multiple lines, we can also do some ASCII art here!</p><p>With the <a href="https://patorjk.com/software/taag/">Text to ASCII Art Generator TAAG</a> website, you can create all kinds of cool ASCII art versions of text. Unfortunately, we cannot directly copy/paste that into our cookiecutter.json file, because the special characters are not valid as-is in the UTF-8 encoded JSON.</p><p>In order to overcome this problem, I copy/paste the output of the TAAG website into a UTF-8 encoded text file (e.g., header.txt) and use the following small python script to update the value of the single-space key in the cookiecutter.json file:</p><pre><code class="language-python">import json

with open("cookiecutter.json") as f:
    cc_template = json.load(f)

with open("header.txt", encoding="utf-8") as f:
    cc_template[" "] = f.read()

with open("cookiecutter.json", "w") as f:
    f.write(json.dumps(cc_template, sort_keys=False, indent=4))
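    # Note: json.dumps escapes all non-ASCII characters by default
    # (ensure_ascii=True), which is why the saved cookiecutter.json will be
    # full of \uXXXX escape sequences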
    f.write("\n")</code></pre><p>This will just open the cookiecutter.json file and read in the dictionary. After that, it opens the header.txt file with UTF-8 encoding, reads the contents and updates the value of the single-space key in the dictionary. Finally, we write the dictionary to the cookiecutter.json file again, ensuring that the keys are NOT getting sorted.</p><p>If you open your cookiecutter.json file after the above, you will see a lot of escaped unicode characters, which make it quite difficult/impossible to see how the result will look like.</p><p>However, if you initiate a new run with cookiecutter, this will now be the way it looks:</p><figure class="kg-card kg-image-card"><img src="https://www.guidodiepen.nl/content/images/2021/05/image-3.png" class="kg-image" alt="Include help/instructions in cookiecutter prompts" srcset="https://www.guidodiepen.nl/content/images/size/w600/2021/05/image-3.png 600w, https://www.guidodiepen.nl/content/images/size/w1000/2021/05/image-3.png 1000w, https://www.guidodiepen.nl/content/images/2021/05/image-3.png 1111w" sizes="(min-width: 720px) 720px"></figure><p>I don't think this was the way the creators of cookiecutter intended to have it being used, but at least I am able to provide the end-user of my cookiecutter template with some inline help/information without requiring any other tools than vanilla cookiecutter! Also tried this on multiple platforms and it works under all platforms I have tested: Windows / Mac / Linux.</p><p>I have not tried anything else, but I am assuming it will be possible to do all kinds of cool tricks with ANSI escape sequences also, like colors or other things. With those things, just not 100% sure if that will work cross platform also, as it will depend on support for these sequences of your console.</p><p>Curious to hear if you have any other examples of cool ways to trick an application into doing things outside its normal intended use!</p>]]></content:encoded></item><item><title><![CDATA[Problems with anaconda and pip]]></title><description><![CDATA[<p>For the last couple of years, I have always turned to anaconda/miniconda for creating virtual environments with python, both under windows and linux. Although it is possible to work with just pip and things like virtualenv, especially in the beginning I often had issues trying to install certain packages</p>]]></description><link>https://www.guidodiepen.nl/2021/04/problems-with-anaconda-and-pip/</link><guid isPermaLink="false">6071a2faecac4900012edd5c</guid><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Tue, 13 Apr 2021 10:10:57 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2021/04/conda_logo.svg" medium="image"/><content:encoded><![CDATA[<img src="https://www.guidodiepen.nl/content/images/2021/04/conda_logo.svg" alt="Problems with anaconda and pip"><p>For the last couple of years, I have always turned to anaconda/miniconda for creating virtual environments with python, both under windows and linux. Although it is possible to work with just pip and things like virtualenv, especially in the beginning I often had issues trying to install certain packages under windows that required compilation steps. In the meantime, that might have improved, but just stuck with using conda for the environments.</p><p>So while using conda as the main environment/package manager, it happens sometimes that some package I need is not available using conda, but only via pip. 
Fortunately, using pip within a conda environment should not be a problem: it should use the pip installed within the environment and install the pip package into the currently active conda environment.</p><p>The keyword in that assumption is <strong>should</strong>.....</p><p>On my windows machine, over time I saw that my base conda environment got polluted with all kinds of pip packages, and packages that were installed with pip were sometimes not available in my active environment, even though I had just installed them WHILE the prompt indicated I was working in this project environment.</p><p>As it turns out, the conda environment in the windows command prompt appears to be a bit of an illusion:</p><figure class="kg-card kg-image-card"><img src="https://www.guidodiepen.nl/content/images/2021/04/illusion.gif" class="kg-image" alt="Problems with anaconda and pip"></figure><p>When working, it happens every now and then that I type a command that I did not want to type, because I might have forgotten to add some arguments. For example, I originally typed:</p><pre><code class="language-console">conda search dvc</code></pre><p>while I actually wanted to search on conda-forge, so I quickly press <code>Ctrl-C</code>, modify my command to</p><pre><code class="language-console">conda search -c conda-forge dvc</code></pre><p>and get to see what is the latest version of DVC on conda-forge.</p><p>Unfortunately, it is the pressing of <code>Ctrl-C</code> to abort the running conda search command that seems to mess up the internal state of conda, at least on windows machines.</p><p>I double-checked, and indeed on my windows system (not on my linux system) I can reproduce it as follows:</p><pre><code class="language-console"># Display the location of pip
(my_awesome_project) C:\Users\guido&gt; python -m pip

Usage:
  C:\Users\guido\Miniconda3\envs\my_awesome_project\python.exe -m pip &lt;command&gt; [options]

Commands:
  install                     Install packages.
  download                    Download packages.
...</code></pre><p>As you can see, the output of pip shows that it is running using the python.exe from the selected environment folder.</p><p>If I now run a <code>conda search</code> command in this terminal that I terminate with a <code>Ctrl-c</code> as follows:</p><pre><code class="language-console">conda search dvc
# Press Ctrl-C while the conda command is running
(my_awesome_project) C:\Users\guido&gt; conda search dvc
Loading channels: /
CondaError: KeyboardInterrupt</code></pre><p>After the prompt returns, I still see the (my_awesome_project) in the prompt, leading me to believe I am still in the same active environment.</p><p>However, if I now run the exact same <code>python -m pip</code> command, I see a different output:</p><pre><code class="language-console">(my_awesome_project) C:\Users\guido&gt; python -m pip

Usage:
  C:\Users\guido\Miniconda3\python.exe -m pip &lt;command&gt; [options]

Commands:
  install                     Install packages.
  download                    Download packages.
...</code></pre><p>As you can see, right now the conda environment used by pip is NOT the project environment, but it is the conda base environment. Any package you now install with pip will not be installed in the project environment, but in your base environment....</p><p>I tried the exact same steps on my linux computer and there I could not reproduce this problem.</p><h1 id="workaround">Workaround</h1><p>A very simple workaround for this problem: do not use <code>Ctrl-c</code> on any running conda process....</p><p>I am not sure if this is a problem for conda in general, or whether this is a specific conda on windows problem.</p>]]></content:encoded></item><item><title><![CDATA['Automating' DVC stage generation]]></title><description><![CDATA[<p>So recently, I have started to work with <a href="https://dvc.org">DVC</a> to create reproducible pipelines. In my <a href="https://www.guidodiepen.nl/2021/03/multiple-different-models-in-dvc/">previous post</a> about DVC I wrote about dealing with multiple different types of model in one repository. </p><p>In this article, I would like to look at one other thing I ran into while I started working</p>]]></description><link>https://www.guidodiepen.nl/2021/03/automating-dvc-stage-generation/</link><guid isPermaLink="false">60572efdecac4900012edc36</guid><category><![CDATA[dvc]]></category><category><![CDATA[Python]]></category><category><![CDATA[datascience]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Fri, 26 Mar 2021 08:26:35 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2021/03/automate_writing-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.guidodiepen.nl/content/images/2021/03/automate_writing-1.jpg" alt="'Automating' DVC stage generation"><p>So recently, I have started to work with <a href="https://dvc.org">DVC</a> to create reproducible pipelines. In my <a href="https://www.guidodiepen.nl/2021/03/multiple-different-models-in-dvc/">previous post</a> about DVC I wrote about dealing with multiple different types of model in one repository. </p><p>In this article, I would like to look at one other thing I ran into while I started working with DVC and that is the fact that, at least when setting up the different stages, it requires quite a bit of</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.guidodiepen.nl/content/images/2021/03/typing.gif" class="kg-image" alt="'Automating' DVC stage generation"><figcaption>typing....</figcaption></figure><p>This is especially true if you are creating stages that depend on the outputs of one or more previous stages, have one or more metrics, multiple parameters, and finally also one or more output files. In this case, you will have to either create a very long <code>dvc run -n STAGE_NAME</code> command or provide the details in the <code>dvc.yaml</code> file.</p><p>As you will be able to see in my <a href="https://github.com/gdiepen/dvc_titanic">DVC test project on GitHub</a>, I am using separate python scripts in the <code>dvc_scripts</code> folder. Any of the stages that I add to my <code>dvc.yaml</code> file will be a separate file in this folder. This allows me to easily split the dvc related code that executes the stages from the actual implementation (which can be found in the <code>src</code> folder). 
The actual implementations can then also be used in Jupyter notebooks when I am playing around with different things.</p><p>In order to add a new stage <code>train_test_split</code> that depends on a raw input file to generate two files <code>train.pkl</code> and <code>test.pkl</code> based on parameters indicating the random state and the size of the test set, I need to run the following command:</p><pre><code class="language-bash">dvc run -n train_test_split \
  -d ./dvc_scripts/train_test_split.py \
  -d ./src/train_test_split/train_test_split.py  \
  -d ./data/raw/raw_data.pkl  \
  -p train_test_split.random_state,train_test_split.test_size \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl</code></pre><p>As you can see from the above, the command that I want to use for this stage is the <code>train_test_split.py</code> script from the dvc_scripts folder. Therefore, I should add this script to the list of dependencies for the stage (i.e. if I make any change to this file, it should invalidate the current stage). This CLI script itself will call the main file <code>src/train_test_split/train_test_split.py</code> holding the actual implementation for the train test split stage. So this file for sure also needs to be part of the dependencies.</p><p>Besides that, there are the parameters for this stage, as well as the outputs that it generates.</p><p>As mentioned before, this requires quite a bit of typing AND I should not forget to include any of the python scripts that the stage depends on, which is something you can easily make a mistake with. In order to solve this problem, I go back to what I like doing most: AUTOMATING things :)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.guidodiepen.nl/content/images/2021/03/automation.gif" class="kg-image" alt="'Automating' DVC stage generation"><figcaption>Automating things...</figcaption></figure><p>So the python script for each stage in the <code>dvc_scripts</code> folder will know everything about the input, output, and metrics files, because I provide them with commandline arguments. For example, if I run the <code>dvc_scripts/train_test_split.py</code> with the <code>--help</code> argument, you will see the following:</p><pre><code>$ python dvc_scripts/train_test_split.py  --help 

usage: train_test_split.py [-h] --raw-input-file file --train-output-file file
                           --test-output-file file

optional arguments:
  -h, --help            show this help message and exit
  --raw-input-file file
                        Name of the raw input file
  --train-output-file file
                        Name of the train output file
  --test-output-file file
                        Name of the test output file
  --show-dvc-add-command
                        After the script is done, it will show the command you
                        will need to use for adding stage                      
</code></pre><p>As you can see, I have introduced commandline arguments for each of the items that are relevant:</p><ul><li>The file holding the raw input</li><li>Where to write the training subset to</li><li>Where to write the test subset to</li></ul><h3 id="generating-the-dvc-run-n-stage-command">Generating the dvc run -n STAGE command</h3><p>You also see that I have one more argument, namely the <code>--show-dvc-add-command</code>. This is the automation part: By adding this argument, the python script will use all of the information provided in the commandline arguments to build up the string that you have to add as argument to the <code>dvc run -n STAGE</code> command.</p><p>If we execute the <code>dvc_scripts/train_test_split.py</code> as follows:</p><pre><code class="language-bash">$ python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl \
    --show-dvc-add-command</code></pre><p>we will get the following output:</p><pre><code>Size train set: 712
Size test set: 179


Please copy paste the items below AFTER command to create stage:
    dvc run -n STAGE_NAME \

  -d ./data/raw/raw_data.pkl \
  -d dvc_scripts/train_test_split.py \
  -d src/train_test_split/train_test_split.py \
  \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  \
  -p train_test_split \
  \
  \
  python dvc_scripts/train_test_split.py \
    --raw-input-file ./data/raw/raw_data.pkl \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl
</code></pre><p>The first two lines give information about the split between train and test. After that, you get the text to add after a <code>dvc run</code> statement. As you can see, the first dependency on the raw_data.pkl file is taken from the commandline, but the two other ones are automatically determined when the script is called.</p><p>You can now copy these lines and easily generate your new stage based on the way you call the script in the dvc_scripts folder.</p><h3 id="implementation-of-the-generator-function">Implementation of the generator function</h3><p>The implementation of the function uses the <code>importlib</code> python module and adds all imported modules that are located under the current working directory to the list of dependencies.</p><pre><code class="language-python">import importlib
import os
import sys


def get_dvc_stage_info(deps, outputs, metrics, params, all_args):
    """Build up the commandline that should be added after dvc run -n STAGE
    for the provided dependencies, outputs, metrics, parameters, and arguments
    provided to the script.

    This function will add all python script source files to the list of dependencies
    that are included during the runtime

    Args:
        deps: list of manual dependencies (i.e. not code) that should be added with a
            -d argument to the dvc run commandline. Typically these will be the input
            files that this script depends on, as the python scripts will be
            automatically determined
        outputs: list of all the output files that should be added with a -o argument
            to the dvc run commandline
        metrics: list of the metrics files that should be added with a -M argument
            to the dvc run commandline
        params: list of the parameters that should be added with a -p argument to the
            dvc run commandline.
        all_args: List of all of the commandline arguments that were given to this
            dvc_script script. All of these arguments are used to build up the final
            cmd provided as argument to the dvc run commandline

    Returns:
        String holding the complete text that should be added directly after a
        dvc run -n STAGE_NAME command
    """

    python_deps = []
    _modules = sorted(list(set(sys.modules)))

    # For each unique module, we try to import it and try to get the
    # file in which this is defined. For some of the builtins, this will 
    # fail, hence the except 
    for i in _modules:
        imported_lib = importlib.import_module(i)
        try:
            if not os.path.relpath(imported_lib.__file__, "").startswith(".."):
                python_deps.append(os.path.relpath(imported_lib.__file__, ""))
        except AttributeError:
            pass
        except ValueError:
            pass

    # Now create a unique sorted list and concatenate the user-provided deps
    # and the python module deps
    python_deps = sorted(list(set(python_deps)))
    all_deps = deps + python_deps

    # Start building the dvc run commandline
    # 1. Add all dependencies
    ret = ""
    for i in all_deps:
        ret += f"  -d {i} \\\n"
    ret += f"  \\\n"

    # 2. Add all outputs
    for i in outputs:
        ret += f"  -o {i} \\\n"
    ret += f"  \\\n"

    # 3. Add all parameters
    ret += f"  -p {','.join(params)} \\\n  \\\n"

    # 4. Add all metrics
    for i in metrics:
        ret += f"  -M {i} \\\n"
    ret += f"  \\\n"

    # We want to create newlines at every new argument that was provided to the script
    all_args = map(lambda x: f"\\\n    {x}" if x[0] == "-" else x, all_args)

    # Now build up the final string
    ret += "  python " + " ".join(all_args) 
    return ret
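
# Hypothetical usage sketch (argument names are illustrative, not part of the
# original script): a dvc_script parses its CLI arguments and, when
# --show-dvc-add-command is given, passes everything on to this function:
#
#     args = parser.parse_args()
#     cli_args = [a for a in sys.argv if a != "--show-dvc-add-command"]
#     print("Please copy paste the items below AFTER command to create stage:")
#     print("    dvc run -n STAGE_NAME \\")
#     print(get_dvc_stage_info(
#         deps=[args.raw_input_file],
#         outputs=[args.train_output_file, args.test_output_file],
#         metrics=[],
#         params=["train_test_split"],
#         all_args=cli_args,
#     ))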
</code></pre><h2 id="conclusion">Conclusion</h2><p>Although you normally would not need to write these lines very often, I really want to automate all items that I can! </p><p>Also, instead of writing the component that should be placed after the <code>dvc run</code> statement, the above code could also be adapted to instead show the yaml lines that are to be copied into the <code>dvc.yaml</code> file for this stage. The academic statement for this would be "Left as an exercise for the reader" ;-)</p>]]></content:encoded></item><item><title><![CDATA[Multiple different models with DVC]]></title><description><![CDATA[<p>Recently, started to play around with <a href="https://dvc.org">DVC</a> and although the main idea is very cool and quite clear (and the tutorials were also quite clear), one of the things I did not see that much is how to deal with multiple different model types over time.</p><p>For example, in case</p>]]></description><link>https://www.guidodiepen.nl/2021/03/multiple-different-models-in-dvc/</link><guid isPermaLink="false">6055eed5ecac4900012edb68</guid><category><![CDATA[dvc]]></category><category><![CDATA[Python]]></category><category><![CDATA[datascience]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Sat, 20 Mar 2021 14:05:35 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2021/03/dvc.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.guidodiepen.nl/content/images/2021/03/dvc.png" alt="Multiple different models with DVC"><p>Recently, started to play around with <a href="https://dvc.org">DVC</a> and although the main idea is very cool and quite clear (and the tutorials were also quite clear), one of the things I did not see that much is how to deal with multiple different model types over time.</p><p>For example, in case you start out with a very simple logistic regression model and later on you decide to investigate more complex models like random forest or xgboost models. </p><p>So if you are working with just one type of model, one way I can see this with DVC is the following:</p><ol><li>Create your project</li><li>Initialize DVC on your project</li><li>Implement logistic regression in a python script</li><li>Add a new stage <strong>train_model</strong> to your dvc.yaml that depends on for example the train.pkl output of a <strong>train_test_split</strong> stage and outputs a model.pkl file with the trained model parameters</li><li>Add a new stage <strong>evaluate_model</strong> to your dvc.yaml that depends on the model.pkl file of the <strong>train_model</strong> stage and the test.pkl output of a <strong>train_test_split</strong> stage</li></ol><p>Step 5 could then also write some metrics to a test.json file such that you can keep track of the performance.</p><p>If you make small changes to the model (e.g. changing the feature-set or some hyper parameters in case your doing things with Ridge for example), I would consider this just small variations on one existing model-type. But it becomes different when you now start looking at a new type of model that you would like to compare to your first logistic regression model, for example you want to investigate an xgboost model. 
At the moment, I see two separate ways of dealing with this: always having just one active model, or having multiple models.</p><h3 id="always-one-active-model">Always one active model</h3><p>One way is to create just a new branch in your code, replace the code that is being called by the scripts executed in the DVC stages and keep all other things the same. This means that instead of the logistic regression, your <strong>train_model</strong> and <strong>evaluate_model</strong> stages now will refer to your new xgboost model. Then if you are happy with the performance compared to the previous model, you just merge your branch back into master and this way replace the original logistic regression model with your new xgboost model.</p><p>This also means that you will use the same metrics file, now with the metrics for the new model.</p><p>A big advantage of this approach is that you can easily keep track of the performance of any evolution of your solution for the problem (i.e. could be different model types, different features, different hyper parameters, etc). If each version of your solution always writes to the same test.json metrics file, you can easily see how you were improving over time.</p><p>A disadvantage is that you must determine, based on for example git tags, when you changed model types.</p><h3 id="multiple-simultaneous-models">Multiple simultaneous models</h3><p>Another approach would be to just create two new stages for each type of model you want to investigate. For example, if you want to add the xgboost model, you would add <strong>train_xgboost_model</strong> and <strong>evaluate_xgboost_model</strong> stages to your dvc.yaml file. Also, the original two stages might then also be called <strong>train_logistic_regression</strong> and <strong>evaluate_logistic_regression</strong>.</p><p>If you look at the DAG that would be created, that would then be something like the DAG below:</p><figure class="kg-card kg-image-card"><img src="https://www.guidodiepen.nl/content/images/2021/03/image.png" class="kg-image" alt="Multiple different models with DVC" srcset="https://www.guidodiepen.nl/content/images/size/w600/2021/03/image.png 600w, https://www.guidodiepen.nl/content/images/size/w1000/2021/03/image.png 1000w, https://www.guidodiepen.nl/content/images/2021/03/image.png 1200w" sizes="(min-width: 720px) 720px"></figure><p>The drawback of this approach is that you will always have to keep track of which model you consider to be the active one in each of the tags if you want to track your improvements over time. A big advantage is that whenever you have a new dataset you want to train your models on, you can easily compare all of the different model types at the same time.</p><h3 id="hybrid-approach">Hybrid approach</h3><p>Maybe one approach that combines the best of both would be to have two stages for each of the different model types you would like to consider (train and evaluate) and create two separate stages for the model that you consider to be the selected one: <strong>train_selected_model</strong> and <strong>evaluate_selected_model</strong>. </p><p>These two additional stages would then be a copy of the stages for the model you consider active. This means that if you start out with a logistic regression model, the <strong>train_selected_model</strong> and <strong>evaluate_selected_model</strong> would be copies of the <strong>train_logistic_regression</strong> and <strong>evaluate_logistic_regression</strong>. 
</p><p>If later on you find that an xgboost model is outperforming the logistic regression significantly, you can then update the <strong>train_selected_model</strong> and <strong>evaluate_selected_model</strong> stages to refer to the train and evaluate stages of the xgboost model.</p><p>This approach would allow you to easily keep track of the main performance of your selected model (the type of which could change over time), as well as having access to the performance of all of the other approaches.</p><h3 id="other-approaches-solutions">Other approaches/solutions?</h3><p>Since I only started investigating DVC very recently, I am still in the process of learning all of its capabilities and what it can do. So far, I am very impressed with DVC though, and I really think it is a very good way to make sure that you can easily reproduce any model in the future. I also played a bit with the experiments feature that was recently released and have to say that I really like it!</p><p>I am very curious to hear your opinion about DVC. I am also very interested in hearing opinions about dealing with different model types for the same problem within one project. How do you solve this? Let me know in a comment!</p>]]></content:encoded></item><item><title><![CDATA[Keeping pandas dataframe column names when using Pipeline with OneHotEncoder]]></title><description><![CDATA[<p>In this post, I will show how to create a simple custom scikit-learn Transformer that allows you to easily deal with OneHotEncoders and Pandas Dataframes.</p><p>Recently, I re-started working more with SKLearn again to organize both my preprocessing flows, as well as the flows for estimator training and testing. In</p>]]></description><link>https://www.guidodiepen.nl/2021/02/keeping-column-names-when-using-sklearn-onehotencoder-on-pandas-dataframe/</link><guid isPermaLink="false">6033f60c28aad90001df370a</guid><category><![CDATA[datascience]]></category><category><![CDATA[sklearn]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Wed, 24 Feb 2021 08:43:38 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2021/02/pipeline.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://www.guidodiepen.nl/content/images/2021/02/pipeline.jpg" alt="Keeping pandas dataframe column names when using Pipeline with OneHotEncoder"><p>In this post, I will show how to create a simple custom scikit-learn Transformer that allows you to easily deal with OneHotEncoders and Pandas Dataframes.</p><p>Recently, I re-started working more with SKLearn again to organize both my preprocessing flows, as well as the flows for estimator training and testing. In my previous jobs, I have also used these pipelines, but always ran into the same major frustration when trying to use the combination:</p><ul><li>scikit-learn Pipeline</li><li>Pandas Dataframe</li><li>scikit-learn OneHotEncoder</li></ul><p>This frustration is the fact that after applying a pipeline with a OneHotEncoder in it on a pandas dataframe, I lost all of the column/feature names. And of course, it is possible to fix this afterwards again using the <code>get_feature_names</code> functionality of the Pipeline, but it always felt like a bit of patching afterwards.</p><p>So for example, the following code:</p><pre><code class="language-python">import pandas as pd

cat_column = ['AA', "AA", "AB", "AA", "AA", "AC", "AC"]
df = pd.DataFrame({"categorical_column":cat_column})

print(df)</code></pre><p>will give me a simple dataframe with just one categorical column:</p><pre><code>  categorical_column
0                 AA
1                 AA
2                 AB
3                 AA
4                 AA
5                 AC
6                 AC</code></pre><p>Now if I want to convert this to one-hot encoded data, I have multiple options. One possible option is to make use of the <code>get_dummies</code> functionality provided by pandas dataframes. The main problem I have with this is that I have no standardized way of dealing with unseen values, missing columns when moving from training to test, etc. </p><p>Another way is to make use of the OneHotEncoder preprocessing transformer that is provided by scikit-learn. With the following code, you can easily get an array that is the one hot encoded representation:</p><pre><code class="language-python">from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)

result = ohe.fit_transform(df)

print(ohe.categories_)
print(result)</code></pre><p>So we indicate we want to create a OneHotEncoder with a dense structure (especially since I am going to work with dataframes later on, no need to work with the sparse matrix here). After that, we fit the encoder on the earlier created dataframe and transform the data right away according to the fitted transformer.</p><p>The first print statement will give us the categories found by the OneHotEncoder:</p><pre><code>[array(['AA', 'AB', 'AC'], dtype=object)]</code></pre><p>As you can see, three separate values were found. The second print statement gives us the following output:</p><pre><code>[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]</code></pre><p>This output has three columns (one for each of the categories found by the OneHotEncoder) and 7 rows (one for each of the original rows in the dataframe). As you can see, all information regarding the column names has vanished from this output; the information is only available in the fitted ohe transformer. </p><p>So if you now want to convert this back to a dataframe, you can do this with the following python statement:</p><pre><code class="language-python">print(pd.DataFrame(result, columns=ohe.categories_))</code></pre><p>which will give you the following output:</p><pre><code>    AA   AB   AC
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  0.0  1.0  0.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0
5  0.0  0.0  1.0
6  0.0  0.0  1.0</code></pre><p>The problem now is that this works easily for this one column, but if you have multiple columns, you will have to do a bit more coding to get the column prefix there (because what happens if you have two categorical columns, both having rows with values <code>AA</code>, for example).</p><p>Also, this approach of converting the output of the OneHotEncoder transformer directly after it was run works easily whenever you work directly with this encoder. However, if this encoder is part of an arbitrarily complex pipeline, it becomes more tricky.</p><p>Since I have seen this problem occur more than once and I did not really find any solution that did exactly what I wanted/needed, and because I also like to understand how these things work, I decided to go for the next solution: build your own :-)</p><h2 id="implementation-of-dataframeonehotencoder">Implementation of DataFrameOneHotEncoder</h2><p>To solve this problem, I implemented a new custom scikit-learn transformer, namely <code>DataFrameOneHotEncoder</code>. The arguments for creating one are exactly the same as the arguments for the scikit-learn <code>OneHotEncoder</code>, with the addition of <code>col_overrule_params</code>. This additional argument is a dictionary where you can provide for each column the parameters that need to be overruled (e.g. if you want to apply <code>drop=first</code> for a specific column).</p><p>The <code>fit</code> method of this transformer is implemented as follows:</p><pre><code class="language-python">def fit(self, X, y=None):
    """Fit a separate OneHotEncoder for each of the columns in the dataframe

    Args:
        X: dataframe
        y: None, ignored. This parameter exists only for compatibility with
            Pipeline

    Returns
        self

    Raises
        TypeError if X is not of type DataFrame
    """
    if type(X) != pd.DataFrame:
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    self.onehotencoders_ = []
    self.column_names_ = []

    for c in X.columns:
        # Construct the OHE parameters using the arguments
        ohe_params = {
            "categories": self.categories,
            "drop": self.drop,
            "sparse": False,
            "dtype": self.dtype,
            "handle_unknown": self.handle_unknown,
        }
        # and update it with potential overrule parameters for the current column
        ohe_params.update(self.col_overrule_params.get(c, {}))

        # Regardless of how we got the parameters, make sure we always set the
        # sparsity to False
        ohe_params["sparse"] = False

        # Now create, fit, and store the onehotencoder for current column c
        ohe = OneHotEncoder(**ohe_params)
        self.onehotencoders_.append(ohe.fit(X.loc[:, [c]]))

        # Get the feature names and replace each x0_ with empty and after that
        # surround the categorical value with [] and prefix it with the original
        # column name
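        # (note: newer scikit-learn versions rename get_feature_names() to
        # get_feature_names_out())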
        feature_names = ohe.get_feature_names()
        feature_names = [x.replace("x0_", "") for x in feature_names]
        feature_names = [f"{c}[{x}]" for x in feature_names]

        self.column_names_.append(feature_names)

    return self
</code></pre><p>As you can see, during the fit, we loop over all of the columns in the provided dataframe (after ensuring the X argument is actually a dataframe). For each column, we create a new sklearn OneHotEncoder using the arguments we received and fit this encoder to our current column. After that, we obtain the feature names and replace them with the desired format.</p><p>The <code>transform</code> method is the following:</p><pre><code class="language-python">def transform(self, X):
    """Transform X using the one-hot-encoding per column

    Args:
        X: Dataframe that is to be one hot encoded

    Returns:
        Dataframe with onehotencoded data

    Raises
        NotFittedError if the transformer is not yet fitted
        TypeError if X is not of type DataFrame
    """
    if type(X) != pd.DataFrame:
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    if not hasattr(self, "onehotencoders_"):
        raise NotFittedError(f"{type(self).__name__} is not fitted")

    all_df = []

    for i, c in enumerate(X.columns):
        ohe = self.onehotencoders_[i]

        transformed_col = ohe.transform(X.loc[:, [c]])

        df_col = pd.DataFrame(transformed_col, columns=self.column_names_[i])
        all_df.append(df_col)

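    # Note that each per-column dataframe is created from a plain numpy array
    # and therefore gets a fresh RangeIndex; the index of the input X is not
    # preserved in the concatenated result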
    return pd.concat(all_df, axis=1)
</code></pre><p>(The full code can be found on <a href="https://github.com/gdiepen/PythonScripts/blob/master/dataframe_onehotencoder.py">Github</a>)</p><p>A small example for this <code>DataFrameOneHotEncoder</code> (where I apply it on a similar dataframe as before, but this time with two copies of the categorical column) is the following:</p><pre><code class="language-python"># We want to use default parameters on all columns, except for
# the column categorical_column_2, where we want to drop the first value

cat_column = ['AA', "AB", "AA", "AA", "AB", "AB"]
df = pd.DataFrame({"categorical_column":cat_column, "categorical_column_2": cat_column})

df_ohe = DataFrameOneHotEncoder(col_overrule_params={"categorical_column_2":{"drop":"first"}})


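# Since DataFrameOneHotEncoder follows the regular fit/transform API, it
# should also work inside a Pipeline; a hypothetical sketch (not part of the
# original example):
#
#     from sklearn.pipeline import Pipeline
#     pipe = Pipeline([("onehot", DataFrameOneHotEncoder())])
#     df_transformed = pipe.fit_transform(df)
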
print(df_ohe.fit_transform(df))</code></pre><p>will give the following output:</p><pre><code>   categorical_column[AA]  categorical_column[AB]  categorical_column_2[AB]
0                     1.0                     0.0                       0.0
1                     0.0                     1.0                       1.0
2                     1.0                     0.0                       0.0
3                     1.0                     0.0                       0.0
4                     0.0                     1.0                       1.0
5                     0.0                     1.0                       1.0</code></pre><p>As you can see, both of the categorical columns have been converted to one hot encoded columns, where each new column name is the combination of the original column name followed by the categorical value surrounded with [].</p><p>Furthermore, you can also see that for categorical_column_2, the encoder dropped the first value <code>AA</code>, as requested by the parameters when we created the DataFrameOneHotEncoder.</p><p>There probably are a million other ways this can be implemented. For me it was a nice way to start understanding more about the inner details of the sklearn transformers. As stated before, the full source code for the DataFrameOneHotEncoder can be found on <a href="https://github.com/gdiepen/PythonScripts">Github</a>. I am always interested in hearing whether this helped anybody else, and if you have a better way, I am always interested in hearing that too!</p>]]></content:encoded></item><item><title><![CDATA[Submitting pyspark jobs to Livy with livy_submit]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>For some of the things I am currently working on, I am using a Spark cluster. However, direct access is not possible and the only two ways that I can run spark jobs on the cluster are using Zeppelin notebooks and using the Livy REST API, both installed on an</p>]]></description><link>https://www.guidodiepen.nl/2019/06/submitting-pyspark-jobs-to-livy-with-livy_submit/</link><guid isPermaLink="false">60310ee9085ea20001e01c89</guid><category><![CDATA[livy-submit]]></category><category><![CDATA[pyspark]]></category><category><![CDATA[Python]]></category><category><![CDATA[livy]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Wed, 19 Jun 2019 13:39:47 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2019/06/livy-logo-4.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2019/06/livy-logo-4.png" alt="Submitting pyspark jobs to Livy with livy_submit"><p>For some of the things I am currently working on, I am using a Spark cluster. However, direct access is not possible and the only two ways that I can run spark jobs on the cluster are using Zeppelin notebooks and using the Livy REST API, both installed on an edge node (where Zeppelin is actually using Livy also). So the layout is similar to the one depicted below:</p>
<p><img src="https://www.guidodiepen.nl/content/images/2019/06/livy_overview.png" alt="Submitting pyspark jobs to Livy with livy_submit"></p>
<p>Although I really do like notebooks (mostly Jupyter for normal python stuff, but for some exploratory pyspark stuff I do appreciate Zeppelin), the major problem I have with them is the fact that it is difficult, if not nearly impossible, to have a good Git workflow with them. So for anything that is not exploratory, I typically want to work with just Python scripts that I can put easily into a Git repository. This way, I can keep track of all modifications that I make.</p>
<p>Since I do have access to the Livy REST API, which itself is essentially a wrapper around spark-submit, I decided that we needed to go deeper and add another wrapper layer around Livy again :) <img src="https://www.guidodiepen.nl/content/images/2019/03/inception_deeper-1.jpg" alt="Submitting pyspark jobs to Livy with livy_submit"></p>
<p>The result of this is the python script <strong>livy_submit</strong>, which allows you to easily submit some pyspark code to the cluster for execution. The default is to create a new Livy session for each job that you send, but optionally, you can also connect to an existing Livy session.</p>
<p>Each python script you submit with livy_submit is executed as a separate statement within the Livy session, which means that if you choose to reconnect to an existing livy session, the new script you send is just added as a new statement. This also means that all variables created in the earlier statements will still exist.</p>
<h3 id="basiclivy_submitusage">Basic livy_submit usage</h3>
<p>Suppose you have the extremely basic <code>calculate_pi.py</code> script to calculate an approximation of pi using spark. The contents of this script are the following:</p>
<pre><code class="language-python">import sys
from random import random
from operator import add

# Change this to play around with larger sets
partitions = 2
n = 10000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print(&quot;Pi is roughly %f&quot; % (4.0 * count / n))
</code></pre>
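<p>Note that the script uses the <code>spark</code> variable without creating it anywhere: within a Livy pyspark session, a ready-to-use SparkSession is already available under that name.</p>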
<p>Now to submit this script to your spark cluster using livy_submit, you can use the following code:</p>
<pre><code class="language-bash">livy_submit https://edgenode:port/path_for_livy -s calculate_pi.py
</code></pre>
<p>The above code will execute the following steps:</p>
<ol>
<li>Connect to the Livy server given by the URL</li>
<li>Create a new Livy session and display the session ID</li>
<li>Poll the Livy server to check whether the new session is available (in idle state)</li>
<li>Create a new statement using the contents of the file <code>calculate_pi.py</code> and execute this statement</li>
<li>Poll the Livy server for the status of the job and display this as progress bar</li>
<li>After the statement is finished, retrieve the output (everything printed to stdout) from the driver and display this to the user</li>
<li>Delete the Livy session</li>
</ol>
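<p>For those curious what this looks like on the wire: the steps above map onto a handful of calls against the standard Livy REST API. Below is a rough sketch using the <code>requests</code> library (the URL is the placeholder from above; authentication and error handling are omitted, so treat this as an illustration of the flow rather than the actual livy_submit implementation):</p>
<pre><code class="language-python">import time

import requests

livy_url = 'https://edgenode:port/path_for_livy'  # placeholder URL from above

# Steps 1-3: create a new pyspark session and wait until it is idle
session = requests.post(f'{livy_url}/sessions', json={'kind': 'pyspark'}).json()
session_url = f'{livy_url}/sessions/' + str(session['id'])
while requests.get(session_url).json()['state'] != 'idle':
    time.sleep(1)

# Step 4: submit the contents of the script as a new statement
with open('calculate_pi.py') as f:
    statement = requests.post(f'{session_url}/statements',
                              json={'code': f.read()}).json()
statement_url = f'{session_url}/statements/' + str(statement['id'])

# Steps 5-6: poll the statement until its result is available, then show the
# output that was captured on the driver
while True:
    result = requests.get(statement_url).json()
    if result['state'] == 'available':
        break
    time.sleep(1)
print(result['output'])

# Step 7: delete the session again
requests.delete(session_url)
</code></pre>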
<p>The output of the code will be something like:</p>
<pre><code>(test_env) C:\Users\Guido&gt;livy_submit https://edgenode:port/path_for_livy -s calculate_pi.py
Started session with id = 1028
Waiting for session to become idle before sending statements DONE
Now executing the contents of the script LivySubmit - calculate_pi.py (statement id=0)
available |##################################################| 100.0% Complete

Output of the application on the driver:
--------------- text/plain ---------------
Pi is roughly 3.137200

Finished executing script, now removing the spark session
{'msg': 'deleted'}

(test_env) C:\Users\Guido&gt;
</code></pre>
<h3 id="sourceandinstallation">Source and Installation</h3>
<p>The source for livy_submit can be found on <a href="https://github.com/gdiepen/livy-submit">GitHub</a>. I have also uploaded the project to PyPI.org, which means that you can install it with the following statement:</p>
<pre><code>pip install Livy-Submit
</code></pre>
<h3 id="nextsteps">Next steps</h3>
<p>Livy-Submit can already do more than just the basic usage shown above. For example, you can also set different spark settings like the number of executors to use, the amount of memory per executor, the number of cores per executor, etc.</p>
<p>Additionally, you can also set some default values for the LivySubmit URL using environment variables.</p>
<p>I will provide some more details and examples about these features in future blog posts. Also, the application is still under development. If you have any ideas or run into any problems with it, please drop me a comment or file an issue on the <a href="https://github.com/gdiepen/livy-submit">GitHub</a> page.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Implementing a simple plugin framework in Python]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Recently, I had the need to work with a plugin-based architecture, where the plugins needed to be shareable amongst different projects in an easy way. In order to do so, I decided to create a separate python module for each plugin, where the functionality of the plugin was implemented as</p>]]></description><link>https://www.guidodiepen.nl/2019/02/implementing-a-simple-plugin-framework-in-python/</link><guid isPermaLink="false">60310ee9085ea20001e01c86</guid><category><![CDATA[Python]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Fri, 15 Feb 2019 15:33:55 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2019/02/plugins.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2019/02/plugins.png" alt="Implementing a simple plugin framework in Python"><p>Recently, I had the need to work with a plugin-based architecture, where the plugins needed to be shareable amongst different projects in an easy way. In order to do so, I decided to create a separate python module for each plugin, where the functionality of the plugin was implemented as a class within that module.</p>
<p>Originally, I had the idea that each project should just have one folder in which all of these plugin modules would be located. However, after using an initial implementation for some time, I decided it would be better if each project defined a base folder under which all plugins are located, where each plugin can live in an arbitrary number of subdirectories under this main plugin directory.</p>
<p>Suppose we want to make a really basic system where each plugin will implement a method that applies a function over a given argument. The base Plugin class could then be as follows:</p>
<pre><code class="language-python">class Plugin(object):
    &quot;&quot;&quot;Base class that each plugin must inherit from. Within this class
    you must define the methods that all of your plugins must implement
    &quot;&quot;&quot;

    def __init__(self):
        self.description = 'UNKNOWN'

    def perform_operation(self, argument):
        &quot;&quot;&quot;The method that we expect all plugins to implement. This is the
        method that our framework will call
        &quot;&quot;&quot;
        raise NotImplementedError
</code></pre>
<p>Writing a plugin that performs the identity function would require you to first set the description in the constructor and then provide an implementation of the <code>perform_operation</code> method. An example implementation of this is as follows:</p>
<pre><code class="language-python">import plugin_collection 

class Identity(plugin_collection.Plugin):
    &quot;&quot;&quot;This plugin is just the identity function: it returns the argument
    &quot;&quot;&quot;
    def __init__(self):
        super().__init__()
        self.description = 'Identity function'

    def perform_operation(self, argument):
        &quot;&quot;&quot;The actual implementation of the identity plugin is to just return the
        argument
        &quot;&quot;&quot;
        return argument
</code></pre>
<p>Because we want to create a system that dynamically loads the different plugin modules, we will define a new class <code>PluginCollection</code> that takes care of loading all plugins and provides the functionality to apply the <code>perform_operation</code> method of all plugins to a supplied value. The basic components of this <code>PluginCollection</code> class are as follows:</p>
<pre><code class="language-python">class PluginCollection(object):
    &quot;&quot;&quot;Upon creation, this class will read the plugins package for modules
    that contain a class definition that is inheriting from the Plugin class
    &quot;&quot;&quot;

    def __init__(self, plugin_package):
        &quot;&quot;&quot;Constructor that initiates the reading of all available plugins
        when an instance of the PluginCollection object is created
        &quot;&quot;&quot;
        self.plugin_package = plugin_package
        self.reload_plugins()


    def reload_plugins(self):
        &quot;&quot;&quot;Reset the list of all plugins and initiate the walk over the main
        provided plugin package to load all available plugins
        &quot;&quot;&quot;
        self.plugins = []
        self.seen_paths = []
        print()
        print(f'Looking for plugins under package {self.plugin_package}')
        self.walk_package(self.plugin_package)


    def apply_all_plugins_on_value(self, argument):
        &quot;&quot;&quot;Apply all of the plugins on the argument supplied to this function
        &quot;&quot;&quot;
        print()
        print(f'Applying all plugins on value {argument}:')
        for plugin in self.plugins:
            print(f'    Applying {plugin.description} on value {argument} yields value {plugin.perform_operation(argument)}')
</code></pre>
<p>The final component of this class is the <code>walk_package</code> method. The basic steps in this method are:</p>
<ol>
<li>Look for all modules in the supplied package</li>
<li>For each module found, get all classes defined in the module and check if the class is a subclass of <code>Plugin</code>, but not the <code>Plugin</code> class itself. Each class that satisfies these criteria will be instantiated and added to a list. The advantage of this check is that you can still place some other python modules within the same directories and they will not influence the plugin framework.</li>
<li>Check all sub-folders within the current package and recurse into these packages to recursively search for plugins.</li>
</ol>
<p>Items 1 and 2 of the above steps can be implemented with the following code:</p>
<pre><code class="language-python">def walk_package(self, package):
    &quot;&quot;&quot;Recursively walk the supplied package to retrieve all plugins
    &quot;&quot;&quot;
    imported_package = __import__(package, fromlist=['blah'])

    for _, pluginname, ispkg in pkgutil.iter_modules(imported_package.__path__, imported_package.__name__ + '.'):
        if not ispkg:
            plugin_module = __import__(pluginname, fromlist=['blah'])
            clsmembers = inspect.getmembers(plugin_module, inspect.isclass)
            for (_, c) in clsmembers:
                # Only add classes that are a subclass of Plugin, but NOT Plugin itself
                if issubclass(c, Plugin) and (c is not Plugin):
                    print(f'    Found plugin class: {c.__module__}.{c.__name__}')
                    self.plugins.append(c())
</code></pre>
<p>The recursive part of the method is then implemented by calling the same <code>walk_package</code> method again on each subpackage found within the current package. We first build up the list of all current paths (either just the <code>__path__</code> in case it is a string, or its elements in case it is a <code>_Namespace</code> object). After that, we just initiate the recursive call for each of the elements. The complete implementation of the recursive part is then:</p>
<pre><code class="language-python">    # Now that we have looked at all the modules in the current package, start looking
    # recursively for additional modules in sub packages
    all_current_paths = []
    if isinstance(imported_package.__path__, str):
        all_current_paths.append(imported_package.__path__)
    else:
        all_current_paths.extend([x for x in imported_package.__path__])

    for pkg_path in all_current_paths:
        if pkg_path not in self.seen_paths:
            self.seen_paths.append(pkg_path)

            # Get all subdirectories of the current package path directory
            child_pkgs = [p for p in os.listdir(pkg_path) if os.path.isdir(os.path.join(pkg_path, p))]

            # For each subdirectory, apply the walk_package method recursively
            for child_pkg in child_pkgs:
                self.walk_package(package + '.' + child_pkg)
</code></pre>
<p>If we now implement two additional plugins, DoublePositive (doubles the argument) and DoubleNegative (doubles and negates the argument), and place the modules for these two plugins in a subdirectory <code>double</code> under the <code>plugins</code> directory, our plugin collection should be able to handle these as well. A sketch of these two plugins is shown below.</p>
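<p>For illustration, a minimal sketch of what these two plugin modules could look like (the descriptions are taken from the output shown further below):</p>
<pre><code class="language-python"># plugins/double/double_negative.py
import plugin_collection

class DoubleNegative(plugin_collection.Plugin):
    &quot;&quot;&quot;This plugin doubles and negates the argument
    &quot;&quot;&quot;
    def __init__(self):
        super().__init__()
        self.description = 'Negative double function'

    def perform_operation(self, argument):
        return -2 * argument


# plugins/double/double_positive.py
import plugin_collection

class DoublePositive(plugin_collection.Plugin):
    &quot;&quot;&quot;This plugin doubles the argument
    &quot;&quot;&quot;
    def __init__(self):
        super().__init__()
        self.description = 'Double function'

    def perform_operation(self, argument):
        return 2 * argument
</code></pre>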
<p>In order to test everything, we can write a very simple application:</p>
<pre><code class="language-python">from plugin_collection import PluginCollection

my_plugins = PluginCollection('plugins')
my_plugins.apply_all_plugins_on_value(5)
</code></pre>
<p>This application will first initialize a PluginCollection on the package <code>plugins</code>. When instantiated, this plugin collection will recursively look for all plugins defined under the folder <code>plugins</code>. After the initialization is done, it will call the <code>perform_operation</code> method of each of the plugins with the value 5.</p>
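<p>To make the structure explicit, the directory layout for this test looks as follows (inferred from the plugin class paths in the output below; depending on your setup, the package directories may also need <code>__init__.py</code> files):</p>
<pre><code>main_application.py
plugin_collection.py
plugins/
    identity.py
    double/
        double_negative.py
        double_positive.py
</code></pre>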
<p>And behold, if you now run the application you will see the following output:</p>
<pre><code>$ python main_application.py

Looking for plugins under package plugins
    Found plugin class: plugins.identity.Identity
    Found plugin class: plugins.double.double_negative.DoubleNegative
    Found plugin class: plugins.double.double_positive.DoublePositive

Applying all plugins on value 5:
    Applying Identity function on value 5 yields value 5
    Applying Negative double function on value 5 yields value -10
    Applying Double function on value 5 yields value 10
</code></pre>
<p>Without any direct reference to any of the plugin modules in your code, it automatically found all of the plugins... MAGIC :)<br>
<img src="https://www.guidodiepen.nl/content/images/2019/02/magic.gif" alt="Implementing a simple plugin framework in Python"></p>
<p>A complete working version of the above example code can be found in a separate <a href="https://github.com/gdiepen/python_plugin_example">GitHub repository</a>.</p>
<p>Drop me a comment in case you think this is useful, or if you have tips, comments, or better approaches :)</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Wrapping my head around D3 rotation transitions]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Although the majority of my posts are about more technical things, tips and tricks, and things like Docker containers, I am also interested in visualization. When I was working at Deloitte, I worked together with Nadieh Bremer on a small analytics demo that allowed us to compare persons attending the</p>]]></description><link>https://www.guidodiepen.nl/2018/07/wrapping-my-head-around-d3-rotation-transitions/</link><guid isPermaLink="false">60310ee9085ea20001e01c84</guid><category><![CDATA[d3.js]]></category><category><![CDATA[visualization]]></category><category><![CDATA[rotate]]></category><category><![CDATA[interpolator]]></category><category><![CDATA[javascript]]></category><category><![CDATA[svg]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Wed, 11 Jul 2018 07:17:01 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2018/07/what-if-i-ympncs-l-6.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2018/07/what-if-i-ympncs-l-6.jpg" alt="Wrapping my head around D3 rotation transitions"><p>Although the majority of my posts are about more technical things, tips and tricks, and things like Docker containers, I am also interested in visualization. When I was working at Deloitte, I worked together with Nadieh Bremer on a small analytics demo that allowed us to compare persons attending the Deloitte Ladies Open 2015 to any of the Dutch golf pros based on hitting a golf ball 5 times. Ever since this small (but extremely fun) project I have been amazed by the flexibility and the power of <a href="https://d3js.org/" title="D3.js" target="_blank">D3.js</a> (especially in the hands of a pro like Nadieh, see her <a href="https://www.visualcinnamon.com/" title="website" target="_blank">website</a> for some really cool stuff!)</p>
<p>Recently, I needed to create a visualization and decided to brush up on my D3 knowledge again :) One of the main things I needed in my visualization was an animation that both translated and rotated a shape defined by some path, where the rotation happened around the center point of the shape.</p>
<h3 id="drawingtheshape">Drawing the shape</h3>
<p>For the sake of simplicity, I will not use the complex shape, but just a simple triangle with the three corners (60,100), (30,180), and (90,180) to demonstrate my journey with the rotations:</p>
<pre><code class="language-html">&lt;svg height=&quot;200&quot; width=&quot;280&quot;&gt;
  &lt;path id=&quot;triangle&quot; d=&quot;M60 100 L30 180 L90 180 Z&quot;  fill=&quot;#888&quot;/&gt;
&lt;/svg&gt;
</code></pre>
<p>The above code will draw the following triangle:</p>
<svg height="200" width="280">
  <path id="triangle" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
</svg>
<p>To make the visualization a bit clearer for the next steps, I will also add some additional grid lines and a center point by changing the code into the following:</p>
<pre><code class="language-html">&lt;svg height=&quot;200&quot; width=&quot;280&quot;&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;0&quot; x2=&quot;300&quot; y2=&quot;0&quot; stroke=&quot;black&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;100&quot; x2=&quot;300&quot; y2=&quot;100&quot; stroke=&quot;black&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;200&quot; x2=&quot;300&quot; y2=&quot;200&quot; stroke=&quot;black&quot; /&gt;

  &lt;line x1=&quot;0&quot; y1=&quot;50&quot; x2=&quot;300&quot; y2=&quot;50&quot; stroke=&quot;#222&quot; stroke-dasharray=&quot;3 2&quot;  stroke-width=&quot;1&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;150&quot; x2=&quot;300&quot; y2=&quot;150&quot; stroke=&quot;#222&quot; stroke-dasharray=&quot;3 2&quot;  stroke-width=&quot;1&quot; /&gt;

  &lt;g id=&quot;triangle_and_centerpoint&quot;&gt;
    &lt;path id=&quot;triangle&quot; d=&quot;M60 100 L30 180 L90 180 Z&quot;  fill=&quot;#888&quot;/&gt;
    &lt;circle cx=&quot;60&quot; cy=&quot;140&quot; r=&quot;3&quot; fill=&quot;red&quot; /&gt;
  &lt;/g&gt;
&lt;/svg&gt;
</code></pre>
<p>which now results in the following triangle with the red dot at (60,140) being the center point:</p>
<svg height="200" width="280">
  <line x1="0" y1="0" x2="300" y2="0" stroke="black"/>
  <line x1="0" y1="100" x2="300" y2="100" stroke="black"/>
  <line x1="0" y1="200" x2="300" y2="200" stroke="black"/>
  <line x1="0" y1="50" x2="300" y2="50" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <line x1="0" y1="150" x2="300" y2="150" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <g id="triangle_and_centerpoint_1">
    <path id="triangle_1" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
    <circle cx="60" cy="140" r="3" fill="red"/>
  </g>
</svg>
<h3 id="rotatingtheshape">Rotating the shape</h3>
<p>What I now want to do is use a D3 transition to rotate the triangle -180 degrees around its center point (i.e. (60,140)). Since I want to make use of D3, the first thing I need to do is include the D3.js library, which is accomplished by adding the following code to your HTML:</p>
<pre><code class="language-html">&lt;script src=&quot;//d3js.org/d3.v4.min.js&quot;&gt;&lt;/script&gt;
</code></pre>
<script src="//d3js.org/d3.v4.min.js"></script>
<p>With the following JavaScript code and the above SVG, we can accomplish this:</p>
<pre><code class="language-javascript">var svg = d3.select(&quot;svg&quot;)
var triangle_cx = 60
var triangle_cy = 140

function animateTriangle(){
    svg.select(&quot;#triangle&quot;)
        .transition()
        .duration(2500)
        .attr('transform' , 'rotate(-180, '+triangle_cx+',' +triangle_cy +') ')
        
        .transition() //And rotate back again
        .duration(2500)
        .attr('transform' , 'rotate(0, '+triangle_cx+',' +triangle_cy +') ')
        
        .on(&quot;end&quot;, animateTriangle) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle()
</code></pre>
<p>The above JavaScript code, combined with the SVG element, will result in the following animation:</p>
<svg id="svg2" height="200" width="280">
  <line x1="0" y1="0" x2="300" y2="0" stroke="black"/>
  <line x1="0" y1="100" x2="300" y2="100" stroke="black"/>
  <line x1="0" y1="200" x2="300" y2="200" stroke="black"/>
  <line x1="0" y1="50" x2="300" y2="50" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <line x1="0" y1="150" x2="300" y2="150" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <g id="triangle_and_centerpoint_2">
    <path id="triangle_2" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
    <circle cx="60" cy="140" r="3" fill="red"/>
  </g>
</svg>
<script>
var svg2 = d3.select("#svg2")
var triangle_cx = 60
var triangle_cy = 140


function animateTriangle2(){
    svg2.select("#triangle_2")
        .transition()
        .duration(2500)
        .attr('transform' , 'rotate(-180, '+triangle_cx+',' +triangle_cy +') ')
        
        .transition() //And rotate back again
        .duration(2500)
        .attr('transform' , 'rotate(0, '+triangle_cx+',' +triangle_cy +') ')
        
        .on("end", animateTriangle2) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle2()
</script>
<p>So while the end result is still correct (i.e. the triangle is rotated -180 degrees around the red center point), the transition itself is a bit of a disappointment...</p>
<h3 id="d3jsinterpolators">D3.js interpolators</h3>
<p>So let's see what is actually causing this. It turns out that when you use the transition functionality on the <code>transform</code> attribute as we are doing above, D3 will use the function <a href="https://github.com/d3/d3-interpolate#interpolateTransformSvg" title="d3.interpolateTransformSvg" target="_blank">d3.interpolateTransformSvg</a> to determine the intermediate steps in the animation. Since our rotation is around a different center point than (0,0), the interpolator decomposes the transform and also animates the translation component (which ends at translate(120,280), as the logging further below shows) while rotating.</p>
<p>We can use the <code>d3.interpolateTransformSvg</code> function ourselves to show the way our center point is translated during the animation.</p>
<p>First we store the interpolate function between the start and end state in a new variable and then use this new interpolate function to calculate 25 intermediate steps (the interpolate function takes a value between 0 and 1, where 0 represents the start state and 1 represents the end state):</p>
<pre><code class="language-javascript">var interpol_transform = d3.interpolateTransformSvg( &quot;rotate(0,60,140)&quot;, &quot;rotate(-180,60,140)&quot; )

for (i=0 ; i&lt;25 ; i++){
  svg.append(&quot;circle&quot;)
      .attr(&quot;cx&quot;, 60)
      .attr(&quot;cy&quot;, 140)
      .attr(&quot;r&quot;, 4)
      .style(&quot;fill&quot;, &quot;red&quot;)
      .attr('transform', interpol_transform(i/25))
  console.log(&quot;Transformation attribute value: &quot; + interpol_transform(i/25))
}
</code></pre>
<p>Adding the above code to our original javascript, we end up with the following SVG where the path of the triangle center point is shown:</p>
<svg id="svg3" height="200" width="280">
  <line x1="0" y1="0" x2="300" y2="0" stroke="black"/>
  <line x1="0" y1="100" x2="300" y2="100" stroke="black"/>
  <line x1="0" y1="200" x2="300" y2="200" stroke="black"/>
  <line x1="0" y1="50" x2="300" y2="50" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <line x1="0" y1="150" x2="300" y2="150" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <g id="triangle_and_centerpoint_3">
    <path id="triangle_3" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
    <circle cx="60" cy="140" r="3" fill="red"/>
  </g>
</svg>
<script>
var svg3 = d3.select("#svg3")
var triangle_cx = 60
var triangle_cy = 140

var interpol_transform = d3.interpolateTransformSvg( "rotate(0,60,140)", "rotate(-180,60,140)" )

for (i=0 ; i<25 ; i++){
  svg3.append("circle")
          .attr("cx", 60)
          .attr("cy", 140)
          .attr("r", 4)
          .style("fill", "red")
          .attr('transform', interpol_transform(i/25))
          
  console.log("Transformation attribute value: " + interpol_transform(i/25))
}


function animateTriangle3(){
    svg3.select("#triangle_3")
        .transition()
        .duration(2500)
        .attr('transform' , 'rotate(-180, '+triangle_cx+',' +triangle_cy +') ')
        
        .transition() //And rotate back again
        .duration(2500)
        .attr('transform' , 'rotate(0, '+triangle_cx+',' +triangle_cy +') ')
        
        .on("end", animateTriangle3) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle3()
</script>
<p>If you look in your developer console, you should also see logging statements indicating the 25 intermediate steps, similar to the ones below:</p>
<pre><code>Transformation attribute value: translate(0, 0) rotate(0)
Transformation attribute value: translate(4.8, 11.200000000000001) rotate(-7.2)
Transformation attribute value: translate(9.6, 22.400000000000002) rotate(-14.4)
</code></pre>
<p>Although there is a perfectly good explanation for it, for me this behavior brings back some MS Excel frustrations, from whenever Excel (or Clippy) tries to do something automatically where you don't want it to, e.g. automatically converting some string fields into dates because they look like dates...</p>
<h3 id="solvingtherotationissues">Solving the rotation issues</h3>
<p>Fortunately, there are also some simple solutions:</p>
<ul>
<li>Include the triangle within an SVG <code>g</code> node, where you ensure the center point of the triangle is at (0,0) of the group, and just rotate the group</li>
<li>Define your own interpolate function based on <a href="https://github.com/d3/d3-interpolate#interpolateString" title="interpolateString" target="_blank">interpolateString</a> and use the <code>attrTween</code> function to tell the transition how to calculate the in-between frames of the animation</li>
</ul>
<h3 id="usingcustomattrtweenfunction">Using custom attrTween function</h3>
<p>To use this approach, we just have to define an interpolate function based on the interpolateString function, similar to the <code>interpol_transform</code> we defined earlier to show the path of the center point.</p>
<p>By changing the JavaScript code as follows:</p>
<pre><code class="language-javascript">var svg = d3.select(&quot;#svg&quot;)
var triangle_cx = 60
var triangle_cy = 140

//Define the two interpolations. Note that the interpolateString will 
//do a pairwise interpolate for the numbers that are found and the 
//rotation point is the same for both (i.e. the (60,140) )
var interpol_rotate = d3.interpolateString( &quot;rotate(0,60,140)&quot;, &quot;rotate(-180,60,140)&quot; )
var interpol_rotate_back = d3.interpolateString( &quot;rotate(-180,60,140)&quot;, &quot;rotate(0,60,140)&quot; )



function animateTriangle(){
    svg.select(&quot;#triangle&quot;)
        .transition()
        .duration(2500)
        .attrTween('transform' , function(d,i,a){ return interpol_rotate } )
        
        .transition() //And rotate back again
        .duration(2500)
        .attrTween('transform' ,  function(d,i,a){ return interpol_rotate_back })
        
        .on(&quot;end&quot;, animateTriangle) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle()
</code></pre>
<p>We end up with exactly the result I had in mind when I decided I needed to rotate a path around a given center point:</p>
<svg id="svg4" height="200" width="280">
  <line x1="0" y1="0" x2="300" y2="0" stroke="black"/>
  <line x1="0" y1="100" x2="300" y2="100" stroke="black"/>
  <line x1="0" y1="200" x2="300" y2="200" stroke="black"/>
  <line x1="0" y1="50" x2="300" y2="50" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <line x1="0" y1="150" x2="300" y2="150" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <g id="triangle_and_centerpoint_4">
    <path id="triangle_4" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
    <circle cx="60" cy="140" r="3" fill="red"/>
  </g>
</svg>
<script>
var svg4 = d3.select("#svg4")
var triangle_cx = 60
var triangle_cy = 140

var interpol_rotate = d3.interpolateString( "rotate(0,60,140)", "rotate(-180,60,140)" )
var interpol_rotate_back = d3.interpolateString( "rotate(-180,60,140)", "rotate(0,60,140)" )



function animateTriangle4(){
    svg4.select("#triangle_4")
        .transition()
        .duration(2500)
        .attrTween('transform' , function(d,i,a){ return interpol_rotate } )
        
        .transition() //And rotate back again
        .duration(2500)
        .attrTween('transform' ,  function(d,i,a){ return interpol_rotate_back })
        
        .on("end", animateTriangle4) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle4()
</script>
<h3 id="usingadditionalgroupnodestotranslatearoundorigin">Using additional group nodes to translate around origin</h3>
<p>The alternative to the above is to wrap the path you want to rotate in three separate group nodes:</p>
<ol>
<li>One group node to translate the center point of the path to the origin of this group</li>
<li>One group that envelops group node 1 and does the actual rotation</li>
<li>One group node that envelops group node 2 and cancels out the translation from group node 1</li>
</ol>
<p>In the SVG code, this means we have to have the following definition:</p>
<pre><code class="language-html">&lt;svg height=&quot;200&quot; width=&quot;280&quot;&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;0&quot; x2=&quot;300&quot; y2=&quot;0&quot; stroke=&quot;black&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;100&quot; x2=&quot;300&quot; y2=&quot;100&quot; stroke=&quot;black&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;200&quot; x2=&quot;300&quot; y2=&quot;200&quot; stroke=&quot;black&quot; /&gt;

  &lt;line x1=&quot;0&quot; y1=&quot;50&quot; x2=&quot;300&quot; y2=&quot;50&quot; stroke=&quot;#222&quot; stroke-dasharray=&quot;3 2&quot;  stroke-width=&quot;1&quot; /&gt;
  &lt;line x1=&quot;0&quot; y1=&quot;150&quot; x2=&quot;300&quot; y2=&quot;150&quot; stroke=&quot;#222&quot; stroke-dasharray=&quot;3 2&quot;  stroke-width=&quot;1&quot; /&gt;

  &lt;g id=&quot;g_translate_1&quot; transform=&quot;translate(60, 140)&quot;&gt;
    &lt;g id=&quot;g_rotate&quot;&gt;
      &lt;g id=&quot;g_translate_2&quot; transform=&quot;translate(-60,-140)&quot;&gt;
        &lt;path id=&quot;triangle&quot; d=&quot;M60 100 L30 180 L90 180 Z&quot;  fill=&quot;#888&quot;/&gt;
      &lt;/g&gt;
    &lt;/g&gt;
  &lt;/g&gt;
  &lt;circle cx=&quot;60&quot; cy=&quot;140&quot; r=&quot;3&quot; fill=&quot;red&quot; /&gt;
&lt;/svg&gt;
</code></pre>
<p>And in the JavaScript code, you just have to apply the rotation to the group node with id <code>g_rotate</code> by using the following code:</p>
<pre><code class="language-javascript">function animateTriangle(){
    svg.select(&quot;#g_rotate&quot;)
        .transition()
        .duration(2500)
        .attr('transform' , 'rotate(-180)')
        
        .transition() //And rotate back again
        .duration(2500)
        .attr('transform' , 'rotate(0)')
        .on(&quot;end&quot;, animateTriangle) ;  //at end, call it again to create infinite loop
}

//And just call the animateTriangle function now
animateTriangle()
</code></pre>
<p>And this way, you also end up with the same result:</p>
<svg id="svg5" height="200" width="280">
  <line x1="0" y1="0" x2="300" y2="0" stroke="black"/>
  <line x1="0" y1="100" x2="300" y2="100" stroke="black"/>
  <line x1="0" y1="200" x2="300" y2="200" stroke="black"/>
  <line x1="0" y1="50" x2="300" y2="50" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <line x1="0" y1="150" x2="300" y2="150" stroke="#222" stroke-dasharray="3 2" stroke-width="1"/>
  <g id="g_translate_1" transform="translate(60, 140)">
    <g id="g_rotate">
      <g id="g_translate_2" transform="translate(-60,-140)">
        <path id="triangle_5" d="M60 100 L30 180 L90 180 Z" fill="#888"/>
      </g>
    </g>
  </g>
  <circle cx="60" cy="140" r="3" fill="red"/>
</svg>
<script>
var svg5 = d3.select("#svg5")

function animateTriangle5(){
    svg5.select("#g_rotate")
        .transition()
        .duration(2500)
        .attr('transform' , 'rotate(-180)')
        
        .transition() //And rotate back again
        .duration(2500)
        .attr('transform' , 'rotate(0)')        
        .on("end", animateTriangle5) ;  //at end, call it again to create 
}

//And just call the animateTriangle function now
animateTriangle5()
</script>
<h3 id="conclusions">Conclusions</h3>
<p>At the moment I am not 100% sure which of these approaches is the best way. Personally, I would probably stick with the more JavaScript-like solution (i.e. using <code>attrTween</code> and defining my own interpolation), but that might be influenced by the fact that I am more of a programmer than a designer :)</p>
<p>Let me know if you have any other ideas; I am curious to hear other potential solutions. Also, if you have good reasons why one of the above solutions is better than the other, please share them!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Solving timeouts with IguanaIR transceiver and raspberry pi]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Sometimes you encounter problems for which a variety of solutions might exist. Some of these solutions are clean, and some of these solutions feel like a very duct-tape solution. However, if such a duct-tape solution works, why change it?? :) In this post I will share one such duct-tape solution</p>]]></description><link>https://www.guidodiepen.nl/2018/06/timeouts-with-iguanair-and-raspberry-pi/</link><guid isPermaLink="false">60310ee9085ea20001e01c81</guid><category><![CDATA[raspberrypi]]></category><category><![CDATA[iguanaworks]]></category><category><![CDATA[infrared]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Sat, 23 Jun 2018 10:31:06 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2018/06/ducttape2-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2018/06/ducttape2-1.jpg" alt="Solving timeouts with IguanaIR transceiver and raspberry pi"><p>Sometimes you encounter problems for which a variety of solutions might exist. Some of these solutions are clean, and some of these solutions feel like a very duct-tape solution. However, if such a duct-tape solution works, why change it?? :) In this post I will share one such duct-tape solution for a problem I ran into.</p>
<p>So one of my pet projects is to see if I can make good use of the Raspberry Pi Zero W that I have lying around. I finally decided to use it to control the different media devices in the living room (receiver, XBMC, TV, and cable box). In order to do so, I also bought an <a href="https://www.iguanaworks.net/products/usb-ir-transceiver/">Iguanaworks USB IR transceiver</a>:<br>
<img src="https://www.guidodiepen.nl/content/images/2018/06/USB-IR-Transceiver310square.jpg" alt="Solving timeouts with IguanaIR transceiver and raspberry pi"></p>
<p>The nice thing is that you can attach up to 4 IR emitters to the device, where you control which emitter you want to use for sending commands. Furthermore, it also contains an IR receiver that you can use to record the IR codes from your existing remote controls.</p>
<p>Getting the transceiver to work with LIRC does require some steps. I basically followed the steps from this <a href="http://www.instructables.com/id/Make-your-phonetabletdesktop-a-remote-control-with/">Instructables page</a>. In short, it boils down to adding the iguanaworks apt repository and installing the iguanair package. After that, remove all LIRC packages and build LIRC from source. When built from source, LIRC will detect the iguanaworks driver and include it. Finally, install the created deb files and you're good to go!</p>
<p>The most basic command you can try out is using the statement <code>igclient --get-version</code> to query the version of the device. However, when executing this command I would often see the following timeout message:</p>
<pre><code class="language-bash">pi@raspberrypi:/ $ igclient --get-version
get version: failed: 110: Connection timed out
</code></pre>
<p>Typically, I would get this timeout for the first couple of tries when executing the command, and after that it would work without a problem. This timeout not only happens when you query the version, but would also happen with a command like:</p>
<pre><code class="language-bash">pi@raspberrypi:/ $ irsend send_once Ziggo POWER_OFF
</code></pre>
<p>I did quite some research on it; it might be caused by timeouts happening in the dwc_otg USB driver on the Raspberry Pi. You can try all kinds of different settings by providing kernel parameters in <code>/boot/cmdline.txt</code> (e.g. add <code>dwc_otg.speed=1</code>), but none of these options has worked for me so far.</p>
<p>I noticed that whenever the device is queried regularly, the timeouts do not seem to happen at all. So instead of trying all kinds of different kernel parameters, my current solution is much simpler: start a screen or nohup session with the following statement:</p>
<pre><code class="language-bash">pi@raspberrypi:~ $ while true ; do igclient --get-version ; sleep 1; done 
</code></pre>
<p>This will indefinitely query the device version every second. Since I have been running this command in the background, I have not seen any timeout at all.</p>
<p>As to why this extremely duct-tape-like solution works, my guess is that after some time of inactivity some of the internals in the dwc_otg driver and/or connection are reset, resulting in the timeouts. After a couple of tries, the connection is up again and the timeouts are gone. So if you just ensure that the connection never resets, the timeouts disappear :)</p>
<p>So one problem is solved, but I still don't have the whole thing working: I cannot seem to get the transceiver working with my Ziggo Cisco HDTV cable box; it might have something to do with it using different carrier frequencies. If you know of any solution for this, I will be happy to hear it!! Once that final problem is solved, I will be able to control the complete media center from my Raspberry Pi (and of course connect it to Alexa :) )</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Sharing all your Docker data volumes via Samba]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>So as I once wrote in an <a href="https://www.guidodiepen.nl/2017/03/alternative-way-to-share-persistent-data-between-windows-host-and-containers/">earlier blog article</a>, I am able to use Docker for Windows on my laptop at work, but I am not able to bind mount any Windows directory into my Docker containers because of security settings preventing me from sharing the drive with the</p>]]></description><link>https://www.guidodiepen.nl/2017/08/sharing-all-your-docker-data-volumes-via-samba/</link><guid isPermaLink="false">60310ee9085ea20001e01c7e</guid><category><![CDATA[Docker]]></category><category><![CDATA[Data Volume]]></category><category><![CDATA[Deloitte]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Sun, 20 Aug 2017 16:49:35 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/08/joey_doesnot_share_food.gif" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/08/joey_doesnot_share_food.gif" alt="Sharing all your Docker data volumes via Samba"><p>So as I once wrote in an <a href="https://www.guidodiepen.nl/2017/03/alternative-way-to-share-persistent-data-between-windows-host-and-containers/">earlier blog article</a>, I am able to use Docker for Windows on my laptop at work, but I am not able to bind mount any Windows directory into my Docker containers because of security settings preventing me from sharing the drive with the Hyper-V virtual machine.</p>
<p>The earlier blog article showed how I used the <a href="https://github.com/dperson/samba">dperson/samba</a> image to share a specific data volume. And unlike Joey:</p>
<p><img src="https://www.guidodiepen.nl/content/images/2017/08/joey_doesnot_share_food-1.gif" alt="Sharing all your Docker data volumes via Samba"></p>
<p>who does not want to share food, I actually want to automatically share ALL of the data volumes via Samba instead of having to manually provide them :) The approach in my earlier article does work, but was a bit cumbersome if you wanted to share more than a single data volume.</p>
<p>Recently I had some time for this, and the result is a container image that you can easily use on any computer yourself.</p>
<p>Most importantly, I am now able to easily share this particular image with my colleagues, allowing them to easily access the data in their data volumes as well :) The best thing about this is that I can start working on creating more Docker converts within my department!! :)</p>
<p>The name of the image I created is <strong>gdiepen/volume-sharer</strong> and you can easily run it as follows:</p>
<pre><code class="language-bash">docker run --name volume-sharer \
           --rm \
           -d \
           -v /var/lib/docker/volumes:/docker_volumes \
           -p 139:139 \
           -p 445:445 \
           -v /var/run/docker.sock:/var/run/docker.sock \
           --net=host \
           gdiepen/volume-sharer
</code></pre>
<p>The different arguments provided to Docker are the following:</p>
<ul>
<li><em>--name volume-sharer</em> : Gives the container a recognizable name</li>
<li><em>--rm</em> : Cleans up the container after it is stopped</li>
<li><em>-d</em> : Makes the container run in the background</li>
<li><em>-v /var/lib/docker/volumes:/docker_volumes</em> : Mounts the directory that holds the local Docker data volumes on the Docker host to the local folder /docker_volumes in the running container</li>
<li><em>-p 139:139 -p 445:445</em> : Publishes the ports used for Samba to the host</li>
<li><em>-v /var/run/docker.sock:/var/run/docker.sock</em> : Mounts the Docker socket of the host system within the container to allow the container to access Docker information (i.e. volume creation/deletion)</li>
<li><em>--net=host</em> : When running under windows, this argument is needed to publish the SMB ports to the Hyper-V virtual machine.</li>
<li><em>gdiepen/volume-sharer</em> : The name of the image on DockerHub</li>
</ul>
<p>You can also still provide additional arguments to the container (i.e. after gdiepen/volume-sharer in the command line). These are all the same as the ones you can provide to the <a href="https://github.com/dperson/samba">dperson/samba</a> image.</p>
<p>After the container is started, it will start listening for any docker events that are volume related (i.e. volumes created / removed / mounted). In the case of any such event, the Samba configuration file is rewritten and the running Samba daemon is instructed to reload the updated configuration.</p>
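<p>To give an idea of the mechanism, the event-listening part can be sketched in a few lines with the Docker SDK for Python. This is a simplified illustration of the idea, not the actual implementation inside the image (the rebuild of the Samba configuration is left as a comment):</p>
<pre><code class="language-python">import docker

client = docker.from_env()

# Block on the Docker event stream, waking up only for volume-related
# events (create / destroy / mount / unmount)
for event in client.events(decode=True, filters={&quot;type&quot;: &quot;volume&quot;}):
    print(f&quot;Volume event: {event['Action']} on volume {event['Actor']['ID']}&quot;)
    # This is the point where the container rewrites the Samba configuration
    # with one share per data volume and instructs the Samba daemon to
    # reload it (e.g. via: smbcontrol all reload-config)
</code></pre>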
<h2 id="howtousethisinpractice">How to use this in practice</h2>
<p>Whenever you are on a system that does not allow for bind mounting of the host folders to your Docker container, you can start my volume-sharer container with the command line given above.</p>
<p>After (or before, it does not really matter because the configuration is automagically updated :) ) starting the volume-sharer container, you can start your own containers with either named or unnamed volume mounts.</p>
<p>You can then access and modify the contents of all of your named and unnamed data volumes by going to the SMB server localhost (\\localhost) in case you are not running on Windows (and do not need the --net=host argument), or to the Hyper-V SMB server (\\10.0.75.2) in case you are running under Windows with the --net=host argument.</p>
<p>You will see that any time you start a new container that has new (un)named volumes, automatically the list of shares will be updated in the SMB server, like magic:</p>
<p><img src="https://www.guidodiepen.nl/content/images/2017/08/magic.gif" alt="Sharing all your Docker data volumes via Samba"></p>
<p>A potential future feature I am thinking about is to create one separate share that contains all of the unnamed data volumes as you might have quite some of these if you have played around with Docker for some time ;)</p>
<h2 id="sourcecode">Source-code</h2>
<p>As usual, I have made the source code available in a <a href="https://github.com/gdiepen/volume-sharer">GitHub repository</a>. You can use this to see how the inner details work. I have already linked this repository to my <a href="https://hub.docker.com/u/gdiepen/">DockerHub account</a> with an auto-build so that you can easily have the latest version of the image by just running a <code>docker pull gdiepen/volume-sharer</code>.</p>
<p>If you have any other (hopefully easier :) ) way of accessing the contents of the Docker data volumes, drop it in a comment as I would be happy to hear about it!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Connecting to Qlik Sense Server with HTML5 and Enigma.js]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>At the Analytics &amp; Information Management service line at Deloitte, we are working a lot with QlikView and Qlik Sense. After a recent training on the engine API, one of the things that I have been working on is trying to connect a simple HTML5 web app to a Qlik</p>]]></description><link>https://www.guidodiepen.nl/2017/06/connecting-to-qlik-sense-server-with-html5-and-enigma-js/</link><guid isPermaLink="false">60310ee9085ea20001e01c7d</guid><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Fri, 23 Jun 2017 12:10:05 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/06/cartman_it_all_makes_sense_now-1.gif" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/06/cartman_it_all_makes_sense_now-1.gif" alt="Connecting to Qlik Sense Server with HTML5 and Enigma.js"><p>At the Analytics &amp; Information Management service line at Deloitte, we are working a lot with QlikView and Qlik Sense. After a recent training on the engine API, one of the things that I have been working on is trying to connect a simple HTML5 web app to a Qlik Sense Server using the <a href="https://github.com/qlik-oss/enigma.js/">Enigma.js</a> javascript library.</p>
<p>All of the documentation I found so far seems to deal with the situation where you connect your web app with a locally running Qlik Sense Desktop instance. In this case, you don't have to worry about authentication and getting a qlikTicket to be used in the communication.</p>
<h3 id="connectingtodashboardonqliksensedesktop">Connecting to dashboard on Qlik Sense Desktop</h3>
<p>When using Enigma.js to connect to a running instance of Qlik Sense Desktop, you need to have a copy of enigma.js or enigma.min.js and typically just use the following code:</p>
<pre><code class="language-javascript">//The base config with all details filled in 
var config = {
    //Take the schema object
    schema: schema,
    //The ID of your app on the qlik sense server
    //If not provided, you can manually open an app later on
    appId: &quot;My Demo App.qvf&quot;,
    session:{
        //Use localhost:4848 to connect to local Qlik Sense Desktop
        host:&quot;localhost&quot;,
        port: 4848,
        prefix: &quot;&quot;,
        unsecure: true,
    },
}



//Now that we have a config, use that to connect to the QIX service.
enigma.getService(&quot;qix&quot; , config).then(function(qlik){

    //In case you did not provide appID in the config, now you can open
    //an app by using the openApp function on the qlik object
    //    qlik.global.openApp( [APP_ID] )
    //
    //In case you did provide an appID in the config, you can directly
    //access it via the app object in the qlik object and use this to 
    //create session objects with the createSessionObject function
});
</code></pre>
<p>With the above code and a local Qlik Sense Desktop instance running on your computer, you can within a very short time have your own custom visuals that use libraries like amCharts or D3 with the data from the Qlik Sense dashboard.</p>
<h3 id="connectingtodashboardonqliksenseserver">Connecting to dashboard on Qlik Sense Server</h3>
<p>After successfully being able to create some custom D3 visualizations around a local Qlik Sense dashboard, I wanted to see if I could connect to a dashboard running on a Qlik Sense Server. This turned out to take a little bit more effort... :)</p>
<p>First of all, it is important that you make sure you have a recent version of the enigma.js library. A big part of my struggle to get this working was caused by the fact I was using an older version of the enigma.min.js library, more about this later.</p>
<p>An easy way to get the latest version of the enigma.min.js library file is to use npm as follows:</p>
<pre><code class="language-bash">npm install enigma.js --save
</code></pre>
<p>After that, you can find the enigma.min.js file in the directory <code>node_modules/enigma.js/dist</code> under the directory in which you executed the npm command.</p>
<p>To connect to a dashboard on a Qlik Sense Server, the original config object we used above needs to be extended a bit to deal with the authentication. A full description of the authentication and dealing with tickets can be found on <a href="http://help.qlik.com/en-US/sense-developer/3.2/Subsystems/ProxyServiceAPI/Content/ProxyServiceAPI/ProxyServiceAPI-ProxyServiceAPI-Authentication.htm">the website of Qlik</a>. In short, it boils down to the following process:</p>
<ol>
<li>The user opens your web app in the browser</li>
<li>Enigma uses information from the config object to create a WebSocket connection to the dashboard.</li>
<li>Enigma fires a request to receive data over this WebSocket</li>
<li>In a reply message over the WebSocket, the server indicates that the user must authenticate and provides the URL where the user can do so</li>
<li>The browser will have to redirect to the URL obtained in Step 4. After this new page is loaded, the user will have to provide the credentials on this page.</li>
<li>The authentication service will validate these credentials and if successful, will provide a redirect to the reloadUri that is provided in the config supplied to enigma by the web app. However, the redirect will include an additional GET request parameter in the URL, namely a value for <strong>qlikTicket</strong></li>
<li>Your web app will need to retrieve the value of the qlikTicket from the URL after it is loaded and add this as a new piece of information to the config object that is used by Enigma. It must be added to the urlParams part of the config object.</li>
<li>Using this updated version of the config object (i.e. with the qlikTicket information), you must again make a connection to the Qlik Server. The enigma library will add the qlikTicket value in the urlParams part of the config to the URL of the WebSocket, proving that you are authenticated.</li>
<li>Get all data from the Qlik Sense dashboard and display it will cool D3 visuals :)</li>
</ol>
<p>The problem for me in getting this working was that I could not find any clear explanation about Steps 7 and 8, especially in combination with Enigma.js.</p>
<p>Furthermore, it turned out that when I started I had an older version of the enigma.min.js library file, which did not append the additional items under urlParams to the WebSocket URL. This resulted in a situation where it was not possible to prove I was authenticated, and therefore it went into an endless loop of trying to authenticate.</p>
<p>After getting the latest version of the library, checking the source code and looking at all requests in the developer console of Chrome, I felt like Cartman:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/06/cartman_it_all_makes_sense_now.gif" alt="Connecting to Qlik Sense Server with HTML5 and Enigma.js"></p>
<p>Combining all of the information above, i.e. adding the additional parameters to the config as well as the code required to add the qlikTicket information if available, results in the code block below.</p>
<pre><code class="language-javascript">//Helper function to get the value of a given key in a GET request
//We use this to get the value of the qlikTicket key in the current
//URL (i.e. after the authentication service redirected us back and
//added the qlikTicket parameter)
function getParameterByName(name, url) {
    if (!url) url = window.location.href;
    name = name.replace(/[\[\]]/g, &quot;\\$&amp;&quot;);
    var regex = new RegExp(&quot;[?&amp;]&quot; + name + &quot;(=([^&amp;#]*)|&amp;|#|$)&quot;),
        results = regex.exec(url);
    if (!results) return null;
    if (!results[2]) return '';
    return decodeURIComponent(results[2].replace(/\+/g, &quot; &quot;));
}



//The location of the current app. Note that it would also be possible
//to use location.href for this to just get the current value
//
//We need this URI to be able to tell the authentication service where
//the user needs to be redirected back to after the authentication has
//been done
//Please adjust this to the URL for your own web app
var currentApplicationURI = &quot;http://localhost:8383&quot; ; 


//Based on whether we have the details in the URL, we create the
//url parameters object
if (getParameterByName( &quot;qlikTicket&quot; , location.href) !== null ){
    var myURLParams = {
            reloadUri: currentApplicationURI,
            qlikTicket: getParameterByName( &quot;qlikTicket&quot; , location.href) ,
  
    }
}
else{
    var myURLParams = {
            reloadUri: currentApplicationURI,  
    }
}

//The base config with all details filled in 
var config = {
    //Take the schema object (see enigma.js website for details on how
    //to obtain this)
    schema: schema,
    //The ID of your app on the qlik sense server
    //If not provided, you can manually open an app later on after
    //you created the connection to the qix service
    //Please update this to the ID of your app on the Qlik Sense Server
    appId: &quot;4a53b97c-73ed-48a2-9299-b479999f8791&quot;,
    session:{
        //Hostname and port of your qlik sense server
        //Update this to your own server details
        host:&quot;Hostname.Of.Your.Qlik.Sense.Server&quot;,
        port: 80,
        prefix: &quot;&quot;,
        unsecure: true,


        urlParams: myURLParams,
    },

    //Provide an authentication listener that will automatically redirect
    //to the authentication URL in case authentication is required
    listeners: {
        &quot;notification:OnAuthenticationInformation&quot;:  function ( authInfo ) {

            //If the message on the WebSocket indicates that the user must
            //authenticate, the message also provides a loginUri where the
            //authentication can be done. In order to do the actual 
            //authentication, we will redirect the current browser window
            //to the URI. After the authentication, the authenticator 
            //service will provide a redirect again to the provided 
            //reloadUri in the session urlParams
            if ( authInfo.mustAuthenticate ) {
                location.href = authInfo.loginUri;
            }
        },
    },
};

//Depending on whether the user is authenticated or not, the config will
//contain the qlikTicket information in the urlParams. If not, the call
//to getService will result in a redirect to authenticate. If the 
//qlikTicket is present, it will allow the user to get the data
enigma.getService(&quot;qix&quot; , config).then(function(qlik){
    //In case you did not provide appID in the config, now you can open
    //an app by using the openApp function on the qlik object
    //    qlik.global.openApp( [APP_ID] )
    //
    //In case you did provide an appID in the config, you can directly
    //access it via the app object in the qlik object and use this to 
    //create session objects with the createSessionObject function
});
</code></pre>
<p>Instead of keeping the qlikTicket in the URL, another option for making it a bit cleaner would be to check for a qlikTicket parameter in the URL. If it exists, put this in a session cookie and then redirect again to the same page without the additional parameter in the URL and take the ticket information from the cookie.</p>
<p>Curious to hear whether there are other solutions on how to do the authentication and connect with a Qlik Sense server using Enigma. If you do have another way, let me know via a comment.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Marking all messages as read in Outlook with cached exchange mode]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>The other day I wanted to clean up my mailbox (the number of unread messages had gotten quite high). Though all my recent emails were all read, there was still a large pile of old emails that I did not explicitly mark as read...<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/pile-of-letters-1.jpg" alt="Old emails piling up..."></p>
<p>Since most of these emails were</p>]]></description><link>https://www.guidodiepen.nl/2017/05/marking-all-messages-as-read-in-outlook-with-cached-exchange-mode/</link><guid isPermaLink="false">60310ee9085ea20001e01c7c</guid><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Fri, 05 May 2017 10:32:28 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/04/pile-of-letters.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/04/pile-of-letters.jpg" alt="Marking all messages as read in Outlook with cached exchange mode"><p>The other day I wanted to clean up my mailbox (the number of unread messages had gotten quite high). Though all my recent emails were all read, there was still a large pile of old emails that I did not explicitly mark as read...<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/pile-of-letters-1.jpg" alt="Marking all messages as read in Outlook with cached exchange mode"></p>
<p>Since most of these emails were more than a year old and not relevant at the moment, I just wanted to mark all of the messages as read. So I just right-clicked on the Inbox in Outlook and pressed &quot;Mark All as Read&quot;:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/mark_as_read.png" alt="Marking all messages as read in Outlook with cached exchange mode"></p>
<p>After clicking this, my expectation was that the (1) would disappear. To my surprise, this did not happen, and even after switching to another folder I would still not see the (1) disappear...</p>
<p>Some searching on the Internet revealed it could be due to the connection to the Exchange Server using the cached exchange mode. When I checked the settings for my account, I indeed saw that this was the case.</p>
<p>One solution to this &quot;Mark All as Read&quot; not doing what you expect it to do is to switch off the cached exchange mode, then perform the action, and then switch the cached exchange mode back on.</p>
<p>Another alternative I found is to use the search bar to search for all unread messages, making sure you have selected &quot;All Outlook Items&quot; as the search area.</p>
<p>As with pretty much all computer problems, there are a lot of different ways to solve the problem and also for this case I found another solution :) After I selected my Inbox, I just scrolled down to the first message in my Inbox. Below this first email there will be this message:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/click-here-to-view-more.PNG" alt="Marking all messages as read in Outlook with cached exchange mode"></p>
<p>If you click on the highlighted text, Outlook will download all messages from the Exchange server, even the ones older than the limit indicated for your cached exchange mode. After all messages were downloaded, I right-clicked again on the Inbox, selected &quot;Mark All as Read&quot;, and finally this (1) number disappeared!!</p>
<p>I think Outlook only executes the commands on the items present in the current view and then synchronizes the result with the Exchange server again. In the case of special actions like &quot;Mark All as Read&quot;, I would expect the command to be sent to the Exchange server instead, as the command needs to be executed on the whole folder.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Listing information for all your named/unnamed data volumes]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>After having played with Docker for some time on my computer, I have started to gather quite some images on my hard disk... In his <a href="https://www.brianchristner.io/docker-cleanup-script-comparison/">post</a>, Brian Christner provided a very good comparison of different scripts that can help you to clean up some of the images that are not</p>]]></description><link>https://www.guidodiepen.nl/2017/04/listing-information-for-all-your-named-unnamed-data-volumes/</link><guid isPermaLink="false">60310ee9085ea20001e01c73</guid><category><![CDATA[Docker]]></category><category><![CDATA[Tips]]></category><category><![CDATA[Deloitte]]></category><category><![CDATA[Data Volume]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Tue, 25 Apr 2017 09:36:39 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/04/container_info.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/04/container_info.jpg" alt="Listing information for all your named/unnamed data volumes"><p>After having played with Docker for some time on my computer, I have started to gather quite some images on my hard disk... In his <a href="https://www.brianchristner.io/docker-cleanup-script-comparison/">post</a>, Brian Christner provided a very good comparison of different scripts that can help you to clean up some of the images that are not needed anymore.</p>
<p>From my <a href="https://www.guidodiepen.nl/tag/docker/">previous Docker-related posts</a>, you can see that one of the Docker features I have been looking at most is data volumes. Similar to the growing pile of images on my hard disk, I also had a growing pile of (un)named data volumes.</p>
<p>To get an idea of how large this pile was, I started to investigate. Instead of getting information about the container/image sizes, I just wanted the sizes of the data volumes:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/container_info-1.jpg" alt="Listing information for all your named/unnamed data volumes"></p>
<p>In the spirit of my motto to automate everything I can automate, I, of course, wrote a script to get me the information I needed :)</p>
<p>The trivial approach would be to just check the <code>/var/lib/docker/volumes</code> folder. However, I wanted the script to be able to deal with volume plugins. Furthermore, I also wanted to be able to get some additional information about each data volume.</p>
<p>The current version of the script is rather naive: it just loops over all data volumes and, for each one, starts a container with the data volume mounted, so that we can check the size of that volume from inside the container. I think better and more efficient alternatives are possible using the Docker API :)</p>
<p>After we have retrieved the size of a data volume, we also collect some additional information about the running and stopped containers that have a known link to that data volume. This information can help us determine which data volumes we must not delete ;)</p>
<p>To get the size of a data volume <code>myDataVolume</code>, I make use of the following one-liner:</p>
<pre><code class="language-language-bash">docker_volume_size=$(docker run --rm -t -v myDataVolume:/volume_data ubuntu bash -c &quot;du -hs /volume_data | cut -f1&quot; )
</code></pre>
<p>This will store the size of the data volume in the variable <code>docker_volume_size</code>. To obtain a list of all data volumes present, we can use the command <code>docker volume ls -q</code>. By combining the above one-liner with this command, I can loop over all of the data volumes and query the size of each one.</p>
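<p>A minimal sketch of such a loop could look like this (reusing the ubuntu helper image from the one-liner above):</p>
<pre><code class="language-language-bash"># Print the size of every data volume present on the system
for volume in $(docker volume ls -q); do
    size=$(docker run --rm -t -v &quot;$volume&quot;:/volume_data ubuntu bash -c &quot;du -hs /volume_data | cut -f1&quot;)
    echo &quot;$volume: $size&quot;
done
</code></pre>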
<p>Besides the size, I would also like to get information about the containers that are connected to a data volume: for each container, both the name of the image it was created from as well as its status (running or stopped).</p>
<p>For a known Docker data volume <code>myDataVolume</code>, you can get this information with the following one-liner:</p>
<pre><code class="language-language-bash">docker ps -a --filter=volume=myDataVolume --format &quot;{{.Names}} [{{.Image}}] ({{.Status}})&quot;
</code></pre>
<p>Using this Go template, we extract the name of each container, the image it was created from, and its current status.</p>
<p>By combining all the above into one script, we can get detailed information about all the data volumes that are present on the system. An example output is shown in the image below:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/04/screenshot_data_volumes_info.PNG" alt="Listing information for all your named/unnamed data volumes"></p>
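<p>Putting the size loop and the container lookup together gives a minimal sketch along the following lines (the actual script in the repository mentioned below is more elaborate):</p>
<pre><code class="language-language-bash">#!/bin/bash
# For every data volume: print its size and all containers that use it
for volume in $(docker volume ls -q); do
    size=$(docker run --rm -t -v &quot;$volume&quot;:/volume_data ubuntu bash -c &quot;du -hs /volume_data | cut -f1&quot;)
    echo &quot;Volume: $volume (size: $size)&quot;
    docker ps -a --filter=volume=&quot;$volume&quot; --format &quot;  - {{.Names}} [{{.Image}}] ({{.Status}})&quot;
done
</code></pre>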
<p>I added the script to my <a href="https://github.com/gdiepen/docker-convenience-scripts">Docker Convenience Scripts</a> repository. The current version just prints everything to the screen. Possible features for a future version are performance improvements and structured output (e.g. JSON) instead of the current plain text. If you have ideas for this, you can fork my repository and send me a PR.</p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Accessing container contents from another container]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p><img src="https://www.guidodiepen.nl/content/images/2017/04/Looking_Into_Another_container.jpg" alt="Creating a viewport into another container"><br>
I have had it a couple of times already that I am running a container and quickly want to see the contents of some of the files that are in the container (e.g. verify some settings file or quickly view/edit the source of the scripts in the container)</p>]]></description><link>https://www.guidodiepen.nl/2017/04/accessing-container-contents-from-another-container/</link><guid isPermaLink="false">60310ee9085ea20001e01c77</guid><category><![CDATA[Docker]]></category><category><![CDATA[trick]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Tue, 11 Apr 2017 12:12:31 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/04/Looking_Into_Another_container-1.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/04/Looking_Into_Another_container-1.jpg" alt="Accessing container contents from another container"><p><img src="https://www.guidodiepen.nl/content/images/2017/04/Looking_Into_Another_container.jpg" alt="Accessing container contents from another container"><br>
It has happened to me a couple of times already that I am running a container and quickly want to see the contents of some of the files in it (e.g. to verify a settings file or to quickly view/edit the source of the scripts in the container). Just like in the above image, I only want to cut a hole or quickly open a window into the container contents :)</p>
<p>The first thing that comes to mind is to use a command like</p>
<pre><code class="language-language-bash">docker exec -it &lt;CONTAINER_NAME&gt; /bin/bash
</code></pre>
<p>to get a bash shell in the running container and use this bash shell to move around in the container.</p>
<p>Simple access/viewing works this way, but when I need to go through some of the source files, my editor of choice is, of course, the best editor there is: ViM (Don't worry, the rest of this post will also hold if you have another favorite :) ).</p>
<p>Normally, we should NOT be able to use ViM in the container, because it is good practice to keep your container images as minimal as possible. Additional tools like editors typically do not need to be part of the container image, which means they will not be available in the running containers.</p>
<p>Of course, the action can be anything: instead of using an editor, you might want to run any other program on the contents of the running container.</p>
<p>One way to solve this is to use the shell in the running container to install the additionally required applications there, for example as shown below. However, if you have to do this more often, this approach becomes quite cumbersome.</p>
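<p>For example, in a container based on a Debian/Ubuntu image (an assumption; other base images use different package managers), this could look like:</p>
<pre><code class="language-language-bash"># Inside the shell obtained via docker exec, in a Debian/Ubuntu-based container:
apt-get update &amp;&amp; apt-get install -y vim
</code></pre>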
<h2 id="pidnamespace">PID Namespace</h2>
<p>Fortunately, there is another way to solve this issue, namely using Docker's functionality to share the PID namespace between containers. To see the contents of a container <strong>RunningContainer</strong>, you can start a new container <strong>ContainerB</strong> in the PID namespace of the original container <strong>RunningContainer</strong>. This is done by using the <code>--pid</code> argument of the docker executable as follows:</p>
<pre><code class="language-language-bash">docker run -it --rm --name ContainerB --pid container:RunningContainer alpine /bin/ash
</code></pre>
<p>By adding this argument, the two containers share their processes. You can easily check this by running the <code>ps</code> command in the shell of the new container. This in itself is already very useful, for example for debugging purposes: by giving <strong>ContainerB</strong> access to the processes of <strong>RunningContainer</strong>, you can effectively use a debugger on the process(es) running in <strong>RunningContainer</strong>.</p>
<p>One very cool side-effect of this process sharing, and the one that is extremely helpful for my goal, is that the process(es) running in <strong>RunningContainer</strong> also get a directory entry under <code>/proc</code> inside <strong>ContainerB</strong>, because the containers share the PID namespace. Within the subdirectory for each process under <code>/proc</code>, there is a symlink called <code>root</code> that points to the root of the filesystem for that process. By following this symlink for any of the processes running in <strong>RunningContainer</strong>, we end up at the root of the filesystem of the original container!</p>
<h2 id="example">Example</h2>
<p>You can easily see this cool feature in action by starting the ash shell in a new container based on the alpine image as follows:</p>
<pre><code class="language-language-bash">docker run -it --rm --name RunningContainer alpine /bin/ash
</code></pre>
<p>After the container is started, you can check that the <code>/tmp</code> directory is empty by using the command <code>ls /tmp</code>.</p>
<p>In a second terminal, you can now start another container that shares the PID namespace of the container we just created:</p>
<pre><code class="language-language-bash">docker run -it --rm --pid container:RunningContainer --name ContainerB alpine /bin/ash
</code></pre>
<p>The ash process in RunningContainer will have PID 1 (it was started before all other processes). Using this information, we can now use the shell in ContainerB to create a new file in the <code>/tmp</code> directory of the filesystem belonging to the process with PID 1, with the following command:</p>
<pre><code class="language-language-bash">touch /proc/1/root/tmp/hello_RunningContainer.txt
</code></pre>
<p>If you now go back to the terminal of the original RunningContainer container, you can verify that this new file <code>hello_RunningContainer.txt</code> is now available under <code>/tmp</code> by running the command <code>ls /tmp</code>!</p>
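<p>Coming back to the original goal of running ViM on the files of a minimal container, a minimal sketch could look as follows. Note that <code>/etc/hostname</code> is just an example file to edit, and this assumes (as above) that the main process of RunningContainer has PID 1:</p>
<pre><code class="language-language-bash"># Start a throwaway &quot;toolbox&quot; container in the PID namespace of RunningContainer
docker run -it --rm --pid container:RunningContainer alpine /bin/ash

# Inside the toolbox container: install ViM there, not in the target image
apk add --no-cache vim

# Edit a file that lives in the filesystem of RunningContainer
vim /proc/1/root/etc/hostname
</code></pre>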
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Clever bit tricks with ASCII]]></title><description><![CDATA[<!--kg-card-begin: markdown--><p>Quite some years ago, I started my adventures in the wonderful world of programming. The tool of choice for me at that point of time was QBasic. Besides the obvious starters like printing information to the screen and sending random numbers to the PC speaker, I also started digging through</p>]]></description><link>https://www.guidodiepen.nl/2017/03/clever-bit-tricks-with-ascii/</link><guid isPermaLink="false">60310ee9085ea20001e01c79</guid><category><![CDATA[trick]]></category><dc:creator><![CDATA[Guido Diepen]]></dc:creator><pubDate>Mon, 27 Mar 2017 11:58:50 GMT</pubDate><media:content url="https://www.guidodiepen.nl/content/images/2017/03/ascii-art.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://www.guidodiepen.nl/content/images/2017/03/ascii-art.png" alt="Clever bit tricks with ASCII"><p>Quite some years ago, I started my adventures in the wonderful world of programming. The tool of choice for me at that point of time was QBasic. Besides the obvious starters like printing information to the screen and sending random numbers to the PC speaker, I also started digging through the source files of both gorilla.bas and nibbles.bas which came with every installation of MS-DOS :)</p>
<p>After QBasic, I started to look at Pascal, and after some time at Pascal in combination with assembly. Fortunately, my local library had books about assembly, including the ones by Peter Norton :) I mostly used assembly to do graphical things efficiently. Waaaaaaaay before you had DirectX/OpenGL/nVidia/shaders/etc., you only had Mode 13h / Mode-X and things like paged memory / protected mode / etc. The best part was that you could render the most amazing things on a 386 in real time :) While learning assembly, I became really fascinated with the cool and very clever tricks you can do with bit operations.</p>
<p>The other day I was reading a very <a href="http://reedbeta.com/blog/programmers-intro-to-unicode/">informative article</a> about Unicode that for the first time made all the details about Unicode completely clear to me. If you still have trouble wrapping your head around Unicode, this is definitely an article I can recommend!</p>
<p>A side-effect of reading this article was that it made me think about ASCII and some clever tricks that were built in.</p>
<h3 id="upperandlowercaseinascii">Upper and lower case in ASCII</h3>
<p>If you look at the ASCII values for the letters a-z and A-Z, you will see that for each letter the difference between the upper and lower case variant is exactly 32. For example, the ASCII value of the letter g is 103, while the ASCII value of the letter G is 71.</p>
<p>If you look at the bit representation, this comes down to the following:</p>
<table>
    <tr>
        <th>Letter</th>
        <th>Decimal value</th>
        <th>Binary value</th>
    </tr>
    <tr>
        <td>G</td>
        <td>71</td>
        <td><code>0100 0111</code></td>
    </tr>
    <tr>
        <td>g</td>
        <td>103</td>
        <td><code>0110 0111</code></td>
    </tr>
</table>
<p>Because of the offset of 32, the only difference between these two letters is bit 5: if bit 5 is set, we are dealing with a lowercase letter; if it is not set, we are dealing with an uppercase letter.</p>
<p>This means that if you have a string that you know only consists of letters, you can easily toggle the case of all letters by doing a bitwise XOR with <code>0010 0000</code> for each of the letters.</p>
<p>Creating a lowercase version of the string is just a matter of doing a bitwise OR with <code>0010 0000</code> for each of the letters. All letters that are already lowercase stay lowercase, and all letters that were uppercase now have bit 5 set and therefore become lowercase.</p>
<p>Alternatively, if you want to create an uppercase version of a string, you can do a bitwise AND with the value <code>1101 1111</code> for each of the letters, forcing bit 5 to zero and resulting in uppercase letters.</p>
<p>Note that you will have to ensure you only do this for the characters that represent letters (you cannot convert the &amp; character into an uppercase or lowercase variant :) ).</p>
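<p>Here is a minimal bash sketch of these three operations; bash arithmetic works with decimal values, so the masks <code>0010 0000</code> and <code>1101 1111</code> are written as 32 and 223:</p>
<pre><code class="language-language-bash"># Case tricks on a single letter via bit 5
c='G'
ord=$(printf '%d' &quot;'$c&quot;)                        # ASCII value of the letter: 71

printf &quot;\\$(printf '%03o' $(( ord ^ 32 )))\n&quot;   # XOR toggles the case         -&gt; g
printf &quot;\\$(printf '%03o' $(( ord | 32 )))\n&quot;   # OR sets bit 5 (lowercase)    -&gt; g
printf &quot;\\$(printf '%03o' $(( ord &amp; 223 )))\n&quot;  # AND clears bit 5 (uppercase) -&gt; G
</code></pre>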
<h3 id="integervaluesinascii">Integer values in ASCII</h3>
<p>One of the other cool tricks that I used when playing around with assembly and Pascal relies on the fact that the characters 0-9 were also placed at a deliberate position in the ASCII table, starting at 48:</p>
<table>
    <tr>
<th>Character</th>
        <th>Decimal value</th>
        <th>Binary value</th>
    </tr>
    <tr>
        <td>0</td>
        <td>48</td>
        <td><code>0011 0000</code></td>
    </tr>
    <tr>
        <td>1</td>
        <td>49</td>
        <td><code>0011 0001</code></td>
    </tr>
    <tr>
        <td>2</td>
        <td>50</td>
        <td><code>0011 0010</code></td>
    </tr>
    <tr>
        <td>3</td>
        <td>51</td>
        <td><code>0011 0011</code></td>
    </tr>
    <tr>
        <td>4</td>
        <td>52</td>
        <td><code>0011 0100</code></td>
    </tr>
    <tr>
        <td>5</td>
        <td>53</td>
        <td><code>0011 0101</code></td>
    </tr>
    <tr>
        <td>6</td>
        <td>54</td>
        <td><code>0011 0110</code></td>
    </tr>
    <tr>
        <td>7</td>
        <td>55</td>
        <td><code>0011 0111</code></td>
    </tr>
    <tr>
        <td>8</td>
        <td>56</td>
        <td><code>0011 1000</code></td>
    </tr>
    <tr>
        <td>9</td>
        <td>57</td>
        <td><code>0011 1001</code></td>
    </tr>
</table>
<p>So if you know that a specific character is a digit, you can get its numerical value as an integer very easily by just performing an AND operation on the ASCII value with the mask <code>0000 1111</code>. For example, performing an AND of <code>0011 0110</code> (the binary representation of the ASCII character '6') with the mask <code>0000 1111</code> results in <code>0000 0110</code>, which is exactly the integer value 6.</p>
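<p>The same trick in a small bash sketch (15 is the decimal value of the mask <code>0000 1111</code>):</p>
<pre><code class="language-language-bash"># Convert an ASCII digit character to its integer value
c='6'
ord=$(printf '%d' &quot;'$c&quot;)   # ASCII value: 54 (0011 0110)
echo $(( ord &amp; 15 ))       # prints: 6
</code></pre>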
<h3 id="asciiart">ASCII art</h3>
<p>Besides all of the above clever tricks, one additional cool thing you can do with ASCII is creating ASCII art:<br>
<img src="https://www.guidodiepen.nl/content/images/2017/03/ascii-art.png" alt="Clever bit tricks with ASCII"></p>
<p>I really think it is amazing that they thought about these kinds of clever things when designing the ASCII table. With the abundance of memory and computing power nowadays, I am not sure if these kinds of bit tricks would still be included during a design phase.</p>
<p>Are you aware of any other cool or clever tricks you can do with ASCII, or any other clever bit tricks in general? Let me know in a comment!</p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>