Python - www.GuidoDiepen.nl

'Automating' DVC stage generation

Guido Diepen — Fri, 26 Mar 2021 08:26:35 GMT

So recently, I have started to work with DVC to create reproducible pipelines. In my previous post about DVC I wrote about dealing with multiple different types of model in one repository.

In this article, I would like to look at one other thing I ran into when I started working with DVC: at least when setting up the different stages, it requires quite a bit of typing...

This is especially true for stages that depend on the outputs of one or more previous stages and have one or more metrics, multiple parameters, and one or more output files. In that case, you will have to either create a very long dvc run -n STAGE_NAME command or provide the details in the dvc.yaml file.

As you will be able to see in my DVC test project on GitHub, I am using separate Python scripts in the dvc_scripts folder. Each stage that I add to my dvc.yaml file has its own script in this folder. This allows me to easily separate the DVC-related code that executes the stages from the actual implementation (which can be found in the src folder). The actual implementations can then also be used in Jupyter notebooks when I am playing around with different things.

In order to add a new stage train_test_split that depends on a raw input file and generates two files, train.pkl and test.pkl, based on parameters indicating the random state and the size of the test set, I need to run the following command:

dvc run -n train_test_split \
  -d ./dvc_scripts/train_test_split.py \
  -d ./src/train_test_split/train_test_split.py  \
  -d ./data/raw/raw_data.pkl  \
  -p train_test_split.random_state,train_test_split.test_size \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl

As you can see from the above, the command that I want to use for this stage is the train_test_split.py script from the dvc_scripts folder. Therefore, I should add this script to the list of dependencies for the stage (i.e. if I make any change to this file, it should invalidate the current stage). This CLI script itself will call the main file src/train_test_split/train_test_split.py holding the actual implementation for the train test split stage. So this file for sure also needs to be part of dependencies.

Besides that, there are the parameters for this stage, as well as the outputs that it generates.

As mentioned before, this requires quite a bit of typing AND I should not forget to include any of the python scripts that the stage depends on, in which you can easily make a mistake. In order to solve this problem, I go back to what I like doing most: AUTOMATE things :)

Automating things...

So the python script for each stage in the dvc_scripts folder will know everything about the input, output, and metrics files, because I provide them with commandline arguments. For example, if I run the dvc_scripts/train_test_split.py with the --help argument, you will see the following:

$ python dvc_scripts/train_test_split.py  --help 

usage: train_test_split.py [-h] --raw-input-file file --train-output-file file
                           --test-output-file file

optional arguments:
  -h, --help            show this help message and exit
  --raw-input-file file
                        Name of the train output file
  --train-output-file file
                        Name of the train output file
  --test-output-file file
                        Name of the test output file
  --show-dvc-add-command
                        After the script is done, it will show the command you
                        will need to use for adding stage

As you can see, I have introduced commandline arguments for each of the items that are relevant:

The file holding the raw input
Where to write the training subset to
Where to write the test subset to

Generating the dvc run -n STAGE command

You also see that I have one more argument, namely the --show-dvc-add-command. This is the automation part: By adding this argument, the python script will use all of the information provided in the commandline arguments to build up the string that you have to add as argument to the dvc run -n STAGE command.
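A minimal sketch of how such a stage script could define these arguments with argparse. This is an assumption reconstructed from the help output shown above, not the actual code from the repository; the build_parser name and the help texts are made up for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the --help output shown above."""
    parser = argparse.ArgumentParser(prog="train_test_split.py")
    parser.add_argument(
        "--raw-input-file", metavar="file", required=True,
        help="Name of the raw input file",
    )
    parser.add_argument(
        "--train-output-file", metavar="file", required=True,
        help="Name of the train output file",
    )
    parser.add_argument(
        "--test-output-file", metavar="file", required=True,
        help="Name of the test output file",
    )
    # A plain on/off switch: present means True, absent means False
    parser.add_argument(
        "--show-dvc-add-command", action="store_true",
        help="After the script is done, show the arguments for dvc run",
    )
    return parser


# Parse an explicit list instead of sys.argv to keep the example self-contained
args = build_parser().parse_args([
    "--raw-input-file", "./data/raw/raw_data.pkl",
    "--train-output-file", "./data/train_test_split/train.pkl",
    "--test-output-file", "./data/train_test_split/test.pkl",
    "--show-dvc-add-command",
])
print(args.raw_input_file)
print(args.show_dvc_add_command)
```

Because every file is passed in explicitly, the script has all the information it needs to list its own inputs and outputs afterwards.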

If we execute the dvc_scripts/train_test_split.py as follows:

$ python dvc_scripts/train_test_split.py  \
    --raw-input-file ./data/raw/raw_data.pkl  \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl \
    --show-dvc-add-command

we will get the following output:

Size train set: 712
Size test set: 179


Please copy paste the items below AFTER command to create stage:
    dvc run -n STAGE_NAME \

  -d ./data/raw/raw_data.pkl \
  -d dvc_scripts/train_test_split.py \
  -d src/train_test_split/train_test_split.py \
  \
  -o ./data/train_test_split/train.pkl \
  -o ./data/train_test_split/test.pkl \
  \
  -p train_test_split \
  \
  \
  python dvc_scripts/train_test_split.py \
    --raw-input-file ./data/raw/raw_data.pkl \
    --train-output-file ./data/train_test_split/train.pkl \
    --test-output-file ./data/train_test_split/test.pkl

The first two lines will give the information about the split between train and test. However, after that you will get the information about the text to add after a dvc run statement. As you can see, the first dependency on the raw_data.pkl file is taken from the commandline, but the two other ones are automatically determined during the calling of the script.

You can now copy these lines and easily generate your new stage based on the way you call the script in the dvc_scripts folder.

Implementation of the generator function

The implementation of the function uses the importlib Python module: it goes over all modules that were loaded during the run and adds every module whose source file lives under the current working directory to the list of dependencies.

import importlib
import os
import sys


def get_dvc_stage_info(deps, outputs, metrics, params, all_args):
    """Build up the commandline that should be added after dvc run -n STAGE
    for the provided dependencies, outputs, metrics, parameters, and arguments
    provided to the script.

    This function will add all python script source files to the list of dependencies
    that are included during the runtime

    Args:
        deps: list of manual dependencies (i.e. not code) that should be added with a
            -d argument to the dvc run commandline. Typically these will be the input
            files that this script depends on, as the python scripts will be
            automatically determined
        outputs: list of all the output files that should be added with a -o argument
            to the dvc run commandline
        metrics: list of the metrics files that should be added with a -M argument
            to the dvc run commandline
        params: list of the parameters that should be added with a -p argument to the
            dvc run commandline.
        all_args: List of all of the commandline arguments that were given to this
            dvc_script script. All of these arguments are used to build up the final
            cmd provided as argument to the dvc run commandline

    Returns:
        String holding the complete text that should be added directly after
        dvc run -n STAGE_NAME
    """

    python_deps = []
    _modules = sorted(set(sys.modules))

    # For each unique module, we try to import it and try to get the
    # file in which it is defined. For builtins and some other modules
    # this will fail, hence the except
    for i in _modules:
        try:
            imported_lib = importlib.import_module(i)
            rel_path = os.path.relpath(imported_lib.__file__, "")
        except (ImportError, AttributeError, TypeError, ValueError):
            # ImportError: the sys.modules entry cannot be imported
            # AttributeError / TypeError: no __file__, or __file__ is None
            # ValueError: empty path, or file on another drive (Windows)
            continue
        # Only keep files that live under the current working directory
        if not rel_path.startswith(".."):
            python_deps.append(rel_path)

    # Now create a unique sorted list and concatenate the user-provided deps
    # and the python module deps
    python_deps = sorted(set(python_deps))
    all_deps = deps + python_deps

    # Start building the dvc run commandline
    # 1. Add all dependencies
    ret = ""
    for i in all_deps:
        ret += f"  -d {i} \\\n"
    ret += f"  \\\n"

    # 2. Add all outputs
    for i in outputs:
        ret += f"  -o {i} \\\n"
    ret += f"  \\\n"

    # 3. Add all parameters
    ret += f"  -p {','.join(params)} \\\n  \\\n"

    # 4. Add all metrics
    for i in metrics:
        ret += f"  -M {i} \\\n"
    ret += f"  \\\n"

    # We want to create newlines at every new argument that was provided to the script
    all_args = map(lambda x: f"\\\n    {x}" if x[0] == "-" else x, all_args)

    # Now build up the final string
    ret += "  python " + " ".join(all_args) 
    return ret

Conclusion

Although you normally would not need to write these lines very often, I really want to automate all items that I can!

Also, instead of writing the component that should be placed after the dvc run statement, the above code could also be adapted to instead show the yaml lines that are to be copied into the dvc.yaml file for this stage. The academic statement for this would be "Left as an exercise for the reader" ;-)
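One possible take on that exercise is sketched below. The function name get_dvc_yaml_stage_info and its signature are made up for this illustration, and the sketch assumes that the command and all paths are simple strings that do not need YAML quoting. The metrics entries get cache: false, which is how a -M argument ends up in dvc.yaml.

```python
def get_dvc_yaml_stage_info(stage_name, cmd, deps, outputs, metrics, params):
    """Build the dvc.yaml text for one stage (hypothetical helper).

    Assumes all values are plain strings that are valid unquoted YAML scalars.
    """
    lines = ["stages:", f"  {stage_name}:", f"    cmd: {cmd}"]

    def add_list(key, items):
        # Skip empty sections so we never emit a key without entries
        if items:
            lines.append(f"    {key}:")
            lines.extend(f"    - {item}" for item in items)

    add_list("deps", deps)
    add_list("params", params)
    add_list("outs", outputs)

    if metrics:
        lines.append("    metrics:")
        for metric in metrics:
            # Equivalent of -M: a metrics file that is not cached by DVC
            lines.append(f"    - {metric}:")
            lines.append("        cache: false")

    return "\n".join(lines) + "\n"


print(get_dvc_yaml_stage_info(
    stage_name="train_test_split",
    cmd="python dvc_scripts/train_test_split.py --raw-input-file ./data/raw/raw_data.pkl",
    deps=["./data/raw/raw_data.pkl", "dvc_scripts/train_test_split.py"],
    outputs=["./data/train_test_split/train.pkl", "./data/train_test_split/test.pkl"],
    metrics=[],
    params=["train_test_split.random_state", "train_test_split.test_size"],
))
```

If a dvc.yaml already exists, only the lines below stages: should be copied into it.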

Multiple different models with DVC

Guido Diepen — Sat, 20 Mar 2021 14:05:35 GMT

Recently, I started to play around with DVC, and although the main idea is very cool and quite clear (and the tutorials were also quite clear), one of the things I did not see that much is how to deal with multiple different model types over time.

For example, you might start out with a very simple logistic regression model and later on decide to investigate more complex models like random forest or xgboost models.

If you are working with just one type of model, one way I can see this working with DVC is the following:

1. Create your project
2. Initialize DVC on your project
3. Implement logistic regression in a Python script
4. Add a new stage train_model to your dvc.yaml that depends on, for example, the train.pkl output of a train_test_split stage and outputs a model.pkl file with the trained model parameters
5. Add a new stage evaluate_model to your dvc.yaml that depends on the model.pkl file of the train_model stage and the test.pkl output of a train_test_split stage

Step 5 could then also write some metrics to a test.json file such that you can keep track of the performance.
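As a minimal sketch, the evaluate script could write its metrics like this. The write_metrics helper and the metric names and values are made up for illustration; the only thing taken from the text is that the result ends up in a JSON file such as test.json.

```python
import json
from pathlib import Path


def write_metrics(metrics, metrics_file="test.json"):
    """Write a flat dict of metric name -> value as JSON (hypothetical helper)."""
    path = Path(metrics_file)
    # Make sure the target folder exists before writing
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the file stable between runs, which gives cleaner git diffs
    path.write_text(json.dumps(metrics, indent=2, sort_keys=True) + "\n")
    return path


if __name__ == "__main__":
    # Placeholder values, not results from a real model
    write_metrics({"accuracy": 0.0, "f1": 0.0})
```

Since every version of the model writes to the same file, the values can be compared between commits.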

If you make small changes to the model (e.g. changing the feature set or some hyperparameters, in case you're doing things with Ridge for example), I would consider this just small variations on one existing model type. But it becomes different when you start looking at a new type of model that you would like to compare to your first logistic regression model, for example an xgboost model. At the moment, I see two separate ways of dealing with this: always just one active model, or multiple models.

Always one active model

One way is to create just a new branch in your code, replace the code that is being called by the scripts executed in the DVC stages and keep all other things the same. This means that instead of the logistic regression, your train_model and evaluate_model stage now will refer to your new xgboost model. Then if you are happy with the performance compared to the previous model, you just merge your branch back into master and this way replace the original logistic regression model with your new xgboost model.

This also means that you will use the same metrics file, now with the metrics for the new model.

A big advantage of this approach is that you can easily keep track of the performance of any evolution of your solution for the problem (i.e. could be different model types, different features, different hyper parameters, etc). If each version of your solution always writes to the same test.json metrics file, you can easily see how you were improving over time.

A disadvantage is that you must determine, for example based on git tags, when you changed model types.

Multiple simultaneous models

Another approach would be to create two new stages for each type of model you want to investigate. For example, if you want to add the xgboost model, you would add train_xgboost_model and evaluate_xgboost_model stages to your dvc.yaml file. The original two stages could then be renamed to train_logistic_regression and evaluate_logistic_regression.

If you look at the DAG that would be created, that would then be something like the DAG below:

The drawback of this approach is that you will always have to keep track of which model you consider to be the active one in each of the tags if you want to track your improvements over time. A big advantage is that whenever you have a new dataset you want to train your models on, you can easily compare all of the different model types at the same time.

Hybrid approach

Maybe one approach that combines the best of both would be to have two stages for each of the different model types you would like to consider (train and evaluate) and create two separate stages for the model that you consider to be the selected one: train_selected_model and evaluate_selected_model.

These two additional stages would then be a copy of the stages for the model you consider active. This means that if you start out with a logistic regression model, the train_selected_model and evaluate_selected_model would be copies of the train_logistic_regression and evaluate_logistic_regression.

If later on you find that an xgboost model is outperforming the logistic regression significantly, you can then update the train_selected_model and evaluate_selected_model stages to refer to the train and evaluate stages of the xgboost model.

This approach would allow you to easily keep track of the main performance of your selected model (the type of which could change over time), while still having access to the performance of all other approaches.
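To make the stage layout of this hybrid approach concrete, here is a small sketch that lists which stages would exist and which model type each one runs. The build_stage_plan function is purely illustrative and does not touch DVC; the model labels are chosen so that the resulting names match the stage names used above.

```python
def build_stage_plan(model_types, selected_model):
    """Map each stage name to the model type it trains or evaluates."""
    if selected_model not in model_types:
        raise ValueError(f"Unknown selected model: {selected_model}")

    plan = {}
    for model in model_types:
        plan[f"train_{model}"] = model
        plan[f"evaluate_{model}"] = model

    # The *_selected_model stages are copies of the stages of the active model
    plan["train_selected_model"] = selected_model
    plan["evaluate_selected_model"] = selected_model
    return plan


plan = build_stage_plan(["logistic_regression", "xgboost_model"], "xgboost_model")
for stage, model in plan.items():
    print(f"{stage} -> {model}")
```

Switching the selected model then only changes the last two entries; the per-model stages stay untouched.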

Other approaches/solutions?

Since I only started investigating DVC very recently, I am still in the process of learning all of its capabilities. So far, I am very impressed with DVC though, and I really think it is a very good way to make sure that you can easily reproduce any model in the future. I also played a bit with the experiments feature that recently got released and have to say that I really like it as well!

Very curious to hear your opinion about DVC. Also, very interested in hearing opinions about dealing with different model types for the same problem within one project. How do you solve this? Let me know in a comment!

Keeping pandas dataframe column names when using Pipeline with OneHotEncoder

Guido Diepen — Wed, 24 Feb 2021 08:43:38 GMT

In this post, I will show how to create a simple custom scikit-learn transformer that allows you to easily deal with OneHotEncoders and pandas dataframes.

Recently, I started working more with scikit-learn again to organize both my preprocessing flows and the flows for estimator training and testing. In my previous jobs, I have also used these pipelines, but always ran into the same major frustration when trying to use the combination:

scikit-learn Pipeline
Pandas Dataframe
scikit-learn OneHotEncoder

This frustration is the fact that after applying a pipeline with a OneHotEncoder in it to a pandas dataframe, I lost all of the column/feature names. Of course, it is possible to fix this afterwards using the `get_feature_names` functionality of the fitted OneHotEncoder, but it always felt like a bit of patching afterwards.

So for example, the following code:

import pandas as pd

cat_column = ['AA', "AA", "AB", "AA", "AA", "AC", "AC"]
df = pd.DataFrame({"categorical_column":cat_column})

print(df)

will give me a simple dataframe with just one categorical column:

  categorical_column
0                 AA
1                 AA
2                 AB
3                 AA
4                 AA
5                 AC
6                 AC

Now if I want to convert this to one-hot encoded data, I have multiple options. One possible option is to make use of the get_dummies functionality provided by pandas. The main problem I have with this is that I have no standardized way of dealing with unseen values, missing columns when moving from training to test, etc.
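A small illustration of that problem, using two made-up dataframes where the test set contains a value (AD) that was never seen during training:

```python
import pandas as pd

# Hypothetical train and test sets; AD only occurs in the test set
train = pd.DataFrame({"categorical_column": ["AA", "AB", "AC"]})
test = pd.DataFrame({"categorical_column": ["AA", "AD"]})

# get_dummies only looks at the values present in the frame it is given
train_columns = list(pd.get_dummies(train).columns)
test_columns = list(pd.get_dummies(test).columns)

print(train_columns)
print(test_columns)
```

The two column lists differ: the test frame gets a column for AD and has no columns for AB and AC, so it cannot be fed directly to a model that was trained on the train frame.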

Another way is to make use of the OneHotEncoder preprocessing transformer that is provided by scikit-learn. With the following code, you can easily get an array that is the one hot encoded representation:

from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 and later call this argument sparse_output
ohe = OneHotEncoder(sparse=False)

result = ohe.fit_transform(df)

print(ohe.categories_)
print(result)

So we indicate we want to create a OneHotEncoder with a dense structure (especially since I am going to work with dataframes later on, no need to work with the sparse matrix here). After that, we fit the encoder on the earlier created dataframe and transform the data right away according to the fitted transformer.

The first print statement will give us the categories found by the OneHotEncoder:

[array(['AA', 'AB', 'AC'], dtype=object)]

As you can see, three separate values were found. The second print statement gives us the following output:

[[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

This output has three columns (one for each of the categories found by the OneHotEncoder) and 7 lines (one for each of the original lines in the dataframe). As you can see, all information regarding the column names has vanished in this output; the information is only available in the fitted ohe transformer.

So if you now want to convert this back to a dataframe, you can do this with the following python statement:

print(pd.DataFrame(result, columns=ohe.categories_[0]))

which will give you the following output:

    AA   AB   AC
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  0.0  1.0  0.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0
5  0.0  0.0  1.0
6  0.0  0.0  1.0

The problem is that this works easily for this one column, but if you have multiple columns, you will have to do a bit more coding to get the column prefix in there (because what happens if you have two categorical columns that both contain the value AA, for example?).
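As a rough sketch of that extra coding, the prefixed names can be built by pairing each input column with its entry in categories_. The dataframe and the column[value] naming are chosen for illustration here, and the encoder's default sparse result is converted with toarray() so that no sparsity argument is needed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Two made-up categorical columns that share the value AA
df = pd.DataFrame({"col_a": ["AA", "AB", "AA"], "col_b": ["AA", "AC", "AC"]})

ohe = OneHotEncoder()
result = ohe.fit_transform(df).toarray()

# categories_ holds one array per input column, in the same order as df.columns
column_names = [
    f"{column}[{value}]"
    for column, categories in zip(df.columns, ohe.categories_)
    for value in categories
]

df_ohe = pd.DataFrame(result, columns=column_names, index=df.index)
print(df_ohe)
```

The prefix keeps the two AA columns apart, but this bookkeeping has to be repeated everywhere the encoder is used.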

Also, this approach of converting the output of the OneHotEncoder transformer directly after it was run works easily whenever you work directly with this encoder. However, if this encoder is part of an arbitrarily complex pipeline, it becomes more tricky.

Since I see this problem occur more than once and I did not really find any solution that did exactly what I wanted/needed, and because I also like to understand how these things work, I decided to go for the next solution: build your own :-)

Implementation of DataFrameOneHotEncoder

To solve this problem, I implemented a new custom scikit-learn transformer, namely DataFrameOneHotEncoder. The arguments for creating it are exactly the same as the arguments for the scikit-learn OneHotEncoder, with the addition of col_overrule_params. This additional argument is a dictionary where you can provide, for each column, the parameters that need to be overruled (e.g. if you want to apply drop="first" for a specific column).

The fit method of this transformer is implemented as follows:

def fit(self, X, y=None):
    """Fit a separate OneHotEncoder for each of the columns in the dataframe

    Args:
        X: dataframe
        y: None, ignored. This parameter exists only for compatibility with
            Pipeline

    Returns
        self

    Raises
        TypeError if X is not of type DataFrame
    """
    if type(X) != pd.DataFrame:
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    self.onehotencoders_ = []
    self.column_names_ = []

    for c in X.columns:
        # Construct the OHE parameters using the arguments
        ohe_params = {
            "categories": self.categories,
            "drop": self.drop,
            "sparse": False,
            "dtype": self.dtype,
            "handle_unknown": self.handle_unknown,
        }
        # and update it with potential overrule parameters for the current column
        ohe_params.update(self.col_overrule_params.get(c, {}))

        # Regardless of how we got the parameters, make sure we always set the
        # sparsity to False
        ohe_params["sparse"] = False

        # Now create, fit, and store the onehotencoder for current column c
        ohe = OneHotEncoder(**ohe_params)
        self.onehotencoders_.append(ohe.fit(X.loc[:, [c]]))

        # Get the feature names and replace each x0_ with empty and after that
        # surround the categorical value with [] and prefix it with the original
        # column name
        # Note: scikit-learn 1.0 introduced get_feature_names_out as the
        # replacement for this method
        feature_names = ohe.get_feature_names()
        feature_names = [x.replace("x0_", "") for x in feature_names]
        feature_names = [f"{c}[{x}]" for x in feature_names]

        self.column_names_.append(feature_names)

    return self

As you can see, during the fit, we loop over all of the columns in the provided dataframe (after ensuring the X argument is actually a dataframe). For each column, we create a new sklearn OneHotEncoder using the arguments we received and fit this encoder to our current column. After that, we obtain the feature names and replace them with the desired format.

The transform method is the following:

def transform(self, X):
    """Transform X using the one-hot-encoding per column

    Args:
        X: Dataframe that is to be one hot encoded

    Returns:
        Dataframe with onehotencoded data

    Raises
        NotFittedError if the transformer is not yet fitted
        TypeError if X is not of type DataFrame
    """
    if type(X) != pd.DataFrame:
        raise TypeError(f"X should be of type dataframe, not {type(X)}")

    if not hasattr(self, "onehotencoders_"):
        raise NotFittedError(f"{type(self).__name__} is not fitted")

    all_df = []

    for i, c in enumerate(X.columns):
        ohe = self.onehotencoders_[i]

        transformed_col = ohe.transform(X.loc[:, [c]])

        df_col = pd.DataFrame(transformed_col, columns=self.column_names_[i])
        all_df.append(df_col)

    return pd.concat(all_df, axis=1)

(The full code can be found on GitHub)

A small example for this DataFrameOneHotEncoder, where I apply it to a similar dataframe as before (but this time with two copies of the categorical column), is the following:

# We want to use default parameters on all columns, except for
# the column categorical_column_2, where we want to drop the first value

cat_column = ['AA', "AB", "AA", "AA", "AB", "AB"]
df = pd.DataFrame({"categorical_column":cat_column, "categorical_column_2": cat_column})

df_ohe = DataFrameOneHotEncoder(col_overrule_params={"categorical_column_2":{"drop":"first"}})


print(df_ohe.fit_transform(df))

will give the following output:

   categorical_column[AA]  categorical_column[AB]  categorical_column_2[AB]
0                     1.0                     0.0                       0.0
1                     0.0                     1.0                       1.0
2                     1.0                     0.0                       0.0
3                     1.0                     0.0                       0.0
4                     0.0                     1.0                       1.0
5                     0.0                     1.0                       1.0

As you can see, both the categorical columns have been converted to one hot encoded columns, where each new column name is the combination of the original column followed by the categorical value surrounded with [].

Furthermore, you can also see that for the categorical_column_2, the encoder dropped the first value AA, as requested by the parameters when we created the DataFrameOneHotEncoder.

There probably are a million other ways this can be implemented. For me it was a nice way to start understanding more about the inner details of the sklearn transformers. As stated before, the full source code for the DataFrameOneHotEncoder can be found on GitHub. I am always interested in hearing whether this helped anybody else. Also, if you have a better way, I am always interested in hearing that too!!

Submitting pyspark jobs to Livy with livy_submit

Guido Diepen — Wed, 19 Jun 2019 13:39:47 GMT

For some of the things I am currently working on, I am using a Spark cluster. However, direct access is not possible and the only two ways that I can run spark jobs on the cluster are using Zeppelin notebooks and using the Livy REST API, both installed on an edge node (where Zeppelin is actually using Livy also). So the layout is similar to the one depicted below:

Although I really do like notebooks (mostly Jupyter for normal Python stuff, but for some exploratory pyspark stuff I do appreciate Zeppelin too), the major problem I have with them is that it is difficult, if not nearly impossible, to have a good Git workflow with them. So for anything that is not exploratory, I typically want to work with just Python scripts that I can put easily into a Git repository. This way, I can keep track of all modifications that I make.

Since I do have access to the Livy REST API, which itself is essentially a wrapper around spark-submit, I decided that we needed to go deeper and add another wrapper layer around Livy again :)

The result of this is the python script livy_submit, which allows you to easily submit some pyspark code to the cluster for execution. The default is to create a new Livy session for each job that you send, but optionally, you can also connect to an existing Livy session.

Each Python script you submit with livy_submit is executed as a separate statement within the Livy session, which means that if you choose to reconnect to an existing Livy session, the new script you send is just added as a new statement. This also means that all the variables created in the earlier statements will still exist.

Basic livy_submit usage

Suppose you have the extremely basic calculate_pi.py script to calculate an approximation of pi using Spark. The contents of this script are the following:

from random import random
from operator import add

# Note: the `spark` session object is not created in this script; the Livy
# pyspark session provides it

# Change this to play around with larger sets
partitions = 2
n = 10000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

Now to submit this script to your Spark cluster using livy_submit, you can use the following command:

livy_submit https://edgenode:port/path_for_livy -s calculate_pi.py

The above command will execute the following steps:

Connect to the Livy server given by the URL
Create a new Livy session and display the session ID
Poll the Livy server to check whether the new session is available (in idle state)
Create a new statement using the contents of the file calculate_pi.py and execute this statement
Poll the Livy server for the status of the job and display this as a progress bar
After the statement is finished, retrieve the output (everything printed to stdout) from the driver and display this to the user
Delete the Livy session
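These steps map fairly directly onto the Livy REST endpoints for sessions and statements. The sketch below is not the livy_submit source; it is a minimal illustration using only the standard library. The names http_request, run_script, and the call parameter are made up here, and it assumes a Livy server that needs no authentication and leaves out the progress bar.

```python
import json
import time
import urllib.request


def http_request(base_url, method, path, payload=None):
    """Send one JSON request to the Livy server and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_script(call, code, poll_seconds=1.0):
    """Run `code` as one statement in a fresh pyspark session.

    `call(method, path, payload)` performs a single request; passing it in
    keeps the flow testable without a live server.
    """
    session = call("POST", "/sessions", {"kind": "pyspark"})
    session_path = f"/sessions/{session['id']}"
    try:
        # Wait until the session is idle, i.e. ready to accept statements
        while True:
            state = call("GET", session_path, None)["state"]
            if state == "idle":
                break
            if state in ("error", "dead", "killed", "shutting_down"):
                raise RuntimeError(f"Session ended in state {state}")
            time.sleep(poll_seconds)

        statement = call("POST", f"{session_path}/statements", {"code": code})
        statement_path = f"{session_path}/statements/{statement['id']}"

        # Wait until the statement has finished and its output is available
        while True:
            statement = call("GET", statement_path, None)
            if statement["state"] == "available":
                return statement["output"]
            if statement["state"] in ("error", "cancelled"):
                raise RuntimeError(f"Statement ended in state {statement['state']}")
            time.sleep(poll_seconds)
    finally:
        # Always clean up the session, also when something went wrong
        call("DELETE", session_path, None)
```

To run it against a real server, bind the URL first, for example with call = lambda m, p, d: http_request("https://edgenode:port/path_for_livy", m, p, d), and pass the contents of calculate_pi.py as code.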

The output of the code will be something like:

(test_env) C:\Users\Guido>livy_submit https://edgenode:port/path_for_livy -s calculate_pi.py
Started session with id = 1028
Waiting for session to become idle before sending statements DONE
Now executing the contents of the script LivySubmit - calculate_pi.py (statement id=0)
available |##################################################| 100.0% Complete

Output of the application on the driver:
--------------- text/plain ---------------
Pi is roughly 3.137200

Finished executing script, now removing the spark session
{'msg': 'deleted'}

(test_env) C:\Users\Guido>

Source and Installation

The source for livy_submit can be found on GitHub. I have also uploaded the project to PyPI.org, which means that you can install it with the following statement:

pip install Livy-Submit

Next steps

Livy-Submit can already do more than just the statement provided above. For example, you can also set different spark settings like the number of executors to use, the amount of memory per executor, the number of cores per executor, etc.

Additionally, you can also set some default values for the LivySubmit URL using environment variables.

I will provide some more details and examples about these features in future blog posts. Also, the application is still under development. If you have any ideas or run into any problems with it, please drop me a comment or file an issue on the GitHub page.

Implementing a simple plugin framework in Python

Guido Diepen — Fri, 15 Feb 2019 15:33:55 GMT

Recently, I had the need to work with a plugin-based architecture, where the plugins needed to be shareable amongst different projects in an easy way. In order to do so, I decided to create a separate python module for each plugin, where the functionality of the plugin was implemented as a class within that module.

Originally, I had the idea that each project should just have one folder where all of these modules with plugins could be located in. However, after using an initial implementation for some time, I decided it would be better if each project would define a base folder under which all plugins would be located but that each plugin could be located in an arbitrary number of subdirectories under this main plugin directory.

Suppose we want to make a really basic system where each plugin will implement a method that applies a function over a given argument. The base Plugin class could then be as follows:

class Plugin(object):
    """Base class that each plugin must inherit from. within this class
    you must define the methods that all of your plugins must implement
    """

    def __init__(self):
        self.description = 'UNKNOWN'

    def perform_operation(self, argument):
        """The method that we expect all plugins to implement. This is the
        method that our framework will call
        """
        raise NotImplementedError

Writing a plugin that performs the identity function would require you to first set the description in the constructor and provide an implementation of the perform_operation method. An example implementation of this is as follows:

import plugin_collection 

class Identity(plugin_collection.Plugin):
    """This plugin is just the identity function: it returns the argument
    """
    def __init__(self):
        super().__init__()
        self.description = 'Identity function'

    def perform_operation(self, argument):
        """The actual implementation of the identity plugin is to just return the
        argument
        """
        return argument

Because we want to create a system where we dynamically load the different plugin modules, we will define a new class PluginCollection that will take care of the loading of all plugins and enables the functionality to apply the perform_operation of all plugins on a supplied value. The basic components for this PluginCollection class are as follows:

class PluginCollection(object):
    """Upon creation, this class will read the plugins package for modules
    that contain a class definition that is inheriting from the Plugin class
    """

    def __init__(self, plugin_package):
        """Constructor that initiates the reading of all available plugins
        when an instance of the PluginCollection object is created
        """
        self.plugin_package = plugin_package
        self.reload_plugins()


    def reload_plugins(self):
        """Reset the list of all plugins and initiate the walk over the main
        provided plugin package to load all available plugins
        """
        self.plugins = []
        self.seen_paths = []
        print()
        print(f'Looking for plugins under package {self.plugin_package}')
        self.walk_package(self.plugin_package)


    def apply_all_plugins_on_value(self, argument):
        """Apply all of the plugins on the argument supplied to this function
        """
        print()
        print(f'Applying all plugins on value {argument}:')
        for plugin in self.plugins:
            print(f'    Applying {plugin.description} on value {argument} yields value {plugin.perform_operation(argument)}')

The final component of this class is the walk_package method. The basic steps in this method are:

1. Look for all modules in the supplied package
2. For each module found, get all classes defined in the module and check if the class is a subclass of Plugin, but not the Plugin class itself. Each class that satisfies these criteria will be instantiated and added to a list. The advantage of this check is that you can still place some other python modules within the same directories and they will not influence the plugin framework.
3. Check all sub-folders within the current package and recurse into these packages to recursively search for plugins.

Items 1 and 2 of the above steps can be implemented with the following code:

def walk_package(self, package):
    """Recursively walk the supplied package to retrieve all plugins
    """
    imported_package = __import__(package, fromlist=['blah'])

    for _, pluginname, ispkg in pkgutil.iter_modules(imported_package.__path__, imported_package.__name__ + '.'):
        if not ispkg:
            plugin_module = __import__(pluginname, fromlist=['blah'])
            clsmembers = inspect.getmembers(plugin_module, inspect.isclass)
            for (_, c) in clsmembers:
                # Only add classes that are a sub class of Plugin, but NOT Plugin itself
                if issubclass(c, Plugin) and (c is not Plugin):
                    print(f'    Found plugin class: {c.__module__}.{c.__name__}')
                    self.plugins.append(c())
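
The subclass check in step 2 can be tried in isolation. The following is a minimal sketch, not the code from the repository: the helper name find_plugin_classes and the in-memory demo module are assumptions made purely for illustration, and the Plugin class here is only a stand-in for the real base class.

```python
import inspect
import types


class Plugin:
    """Stand-in for the article's Plugin base class (assumption for this sketch)."""


def find_plugin_classes(module, base=Plugin):
    """Return the classes in `module` that subclass `base`, excluding `base` itself."""
    return [
        cls
        for _, cls in inspect.getmembers(module, inspect.isclass)
        if issubclass(cls, base) and cls is not base
    ]


class Identity(Plugin):
    """A class that should be picked up as a plugin."""


class Helper:
    """An unrelated class that must be ignored by the filter."""


# Build a throwaway module in memory so the sketch runs without a package on disk
demo = types.ModuleType("demo_plugins")
demo.Plugin = Plugin
demo.Identity = Identity
demo.Helper = Helper

print([cls.__name__ for cls in find_plugin_classes(demo)])  # prints ['Identity']
```

Note that the base class itself shows up in the module (because the plugin module imports it), which is exactly why the extra "is not Plugin" condition is needed.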

The recursive part of the method is then implemented by calling the same walk_package method again on each subpackage found within the current package. We first build up the list of all current paths (either just the __path__ in case it is a string, or its elements if it is a _NamespacePath object). After that, we just initiate the recursive call for each of the elements. The complete implementation of the recursive part is then:

    # Now that we have looked at all the modules in the current package, start looking
    # recursively for additional modules in sub packages
    all_current_paths = []
    if isinstance(imported_package.__path__, str):
        all_current_paths.append(imported_package.__path__)
    else:
        all_current_paths.extend([x for x in imported_package.__path__])

    for pkg_path in all_current_paths:
        if pkg_path not in self.seen_paths:
            self.seen_paths.append(pkg_path)

            # Get all subdirectories of the current package path directory
            child_pkgs = [p for p in os.listdir(pkg_path) if os.path.isdir(os.path.join(pkg_path, p))]

            # For each subdirectory, apply the walk_package method recursively
            for child_pkg in child_pkgs:
                self.walk_package(package + '.' + child_pkg)

If we now implement two additional plugins DoublePositive (doubles the argument) and DoubleNegative (doubles and negates the argument) and place the modules for these two plugins in a subdirectory double under the plugins directory, our plugin collection should be able to handle these.
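
The two plugins could look roughly as follows. This is a sketch: the descriptions are taken from the program output shown further down, but the method bodies are my assumption of the obvious implementation, and a minimal stand-in for the Plugin base class is included only to make the snippet runnable on its own. In the real layout, each class would live in its own module (plugins/double/double_positive.py and plugins/double/double_negative.py) and inherit from plugin_collection.Plugin.

```python
class Plugin:
    """Minimal stand-in for plugin_collection.Plugin (only for this sketch)."""

    def __init__(self):
        self.description = 'UNKNOWN'

    def perform_operation(self, argument):
        raise NotImplementedError


class DoublePositive(Plugin):
    """This plugin doubles the argument"""

    def __init__(self):
        super().__init__()
        self.description = 'Double function'

    def perform_operation(self, argument):
        return argument * 2


class DoubleNegative(Plugin):
    """This plugin doubles the argument and negates the result"""

    def __init__(self):
        super().__init__()
        self.description = 'Negative double function'

    def perform_operation(self, argument):
        return -2 * argument
```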

In order to test everything, we can write a very simple application:

from plugin_collection import PluginCollection

my_plugins = PluginCollection('plugins')
my_plugins.apply_all_plugins_on_value(5)

This application will first initialize a PluginCollection on the package plugins. When instantiated, this plugin collection will recursively look for all plugins defined under the folder plugins. After the initialization is done, it will call the perform_operation on each of the plugins with the value 5.

And behold, if you now run the application you will see the following output:

$ python main_application.py

Looking for plugins under package plugins
    Found plugin class: plugins.identity.Identity
    Found plugin class: plugins.double.double_negative.DoubleNegative
    Found plugin class: plugins.double.double_positive.DoublePositive

Applying all plugins on value 5:
    Applying Identity function on value 5 yields value 5
    Applying Negative double function on value 5 yields value -10
    Applying Double function on value 5 yields value 10

Without any direct reference to any of the plugin modules in your code, it automatically found all of the plugins..... MAGIC :)

A complete working version of the above example code can be found in a separate Github repository.

Drop me a comment in case you think this is useful, or if you have tips, comments, or better approaches :)

Tracking multiple faces

Guido Diepen — Mon, 27 Feb 2017 11:07:24 GMT

In my previous blog article I showed how you can use the excellent OpenCV and dlib libraries to easily create a program that can detect a face and track it when the face is moving.

The code in the previous blog article was created in such a way that if it was not tracking any face, it would look for all faces in the current frame. If multiple faces were found, the largest face was selected and used for tracking. As long as the dlib correlation tracker was able to successfully track the face, no other faces would be detected.

In this post, I will show the modifications that are needed to go from tracking the largest face to tracking all faces that are visible. So, if there are three faces on the screen we would like to see that each of the three faces is detected:

The biggest change will be that even if we are tracking one or more faces, we will have to run the HAAR face detection every so many frames to detect any new face that might have become visible in the meantime. When tracking just one face, we don't have to worry about this: only after the tracker lost track of the face, we have to start looking for a new face to track.

An important problem we need to solve for this is that when we run the face detection, we somehow need to determine which of the detected faces match the faces we are already tracking with the correlation trackers. One simple approach for checking if a detected face matches an existing correlation tracker region is to see whether the center point of the detected face is within the region of a tracker AND if the center point of that same tracker is also within the bounds of the detected face.
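
Stripped of all OpenCV and dlib specifics, this matching rule boils down to a small function on two rectangles. The sketch below uses plain (x, y, w, h) tuples; the helper names center_of, contains and is_same_face are made up for this illustration and do not appear in the actual project code.

```python
def center_of(box):
    """Return the center point of an (x, y, w, h) rectangle."""
    x, y, w, h = box
    return x + 0.5 * w, y + 0.5 * h


def contains(box, point):
    """Check whether the point lies within the (x, y, w, h) rectangle."""
    x, y, w, h = box
    px, py = point
    return x <= px <= x + w and y <= py <= y + h


def is_same_face(face_box, tracker_box):
    """A detected face matches a tracker if each rectangle contains the
    center point of the other one."""
    return (contains(tracker_box, center_of(face_box))
            and contains(face_box, center_of(tracker_box)))


# Nearly identical regions: match
print(is_same_face((10, 10, 50, 50), (15, 12, 48, 52)))   # prints True
# Small tracker inside a large face, but away from its center: no match
print(is_same_face((0, 0, 100, 100), (60, 60, 10, 10)))   # prints False
```

The second example shows why both directions are checked: the center of the small tracker region lies inside the detected face, but the center of the face does not lie inside the tracker region.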

So the approach to detect and track multiple faces is to use the following steps within our main loop:

Update all correlation trackers and remove all trackers that are not considered reliable anymore (e.g. too much movement)
Every 10 frames, perform the following:
1. Use face detection on the current frame and find all faces
2. For each found face, check if there exists a tracker for which holds that the center point of the detected face is within the region of the tracker AND whether the center point of that same tracker is within the bounding box of the detected face.
3. If such a tracker exists, the detected face most likely was already being tracked. If no such tracker exists, we are dealing with a new face and we have to start a new tracker for this face.
Use the region information for all trackers (including the trackers for the new faces created in the previous step) to draw the bounding rectangles

Within the main loop, we can update all existing trackers and delete all of the trackers for which the tracking quality falls below the threshold we have set (in the code below the value 7). The lower you set this value, the less often the tracker will lose the object, but the higher the chances are that you are not tracking the original object anymore.

#Update all the trackers and remove the ones for which the update
#indicated the quality was not good enough
fidsToDelete = []
for fid in faceTrackers.keys():
    trackingQuality = faceTrackers[ fid ].update( baseImage )

    #If the tracking quality is not good enough, we must delete
    #this tracker
    if trackingQuality < 7:
        fidsToDelete.append( fid )

for fid in fidsToDelete:
    print("Removing tracker " + str(fid) + " from list of trackers")
    faceTrackers.pop( fid , None )

Every 10 frames, we now want to detect all the faces in the current frame and map each of the found faces to an existing tracked face where possible or create a new tracker for faces we were not yet tracking.

#Now use the haar cascade detector to find all faces
#in the image
faces = faceCascade.detectMultiScale(gray, 1.3, 5)

#Loop over all detected faces and try to match each one
#to an existing tracker
#We need to convert it to int here because of the
#requirement of the dlib tracker. If we omit the cast to
#int here, you will get cast errors since the detector
#returns numpy.int32 and the tracker requires an int
for (_x,_y,_w,_h) in faces:
    x = int(_x)
    y = int(_y)
    w = int(_w)
    h = int(_h)

    #calculate the centerpoint
    x_bar = x + 0.5 * w
    y_bar = y + 0.5 * h

    #Variable holding information which faceid we 
    #matched with
    matchedFid = None

    #Now loop over all the trackers and check if the 
    #centerpoint of the face is within the box of a 
    #tracker
    for fid in faceTrackers.keys():
        tracked_position =  faceTrackers[fid].get_position()

        t_x = int(tracked_position.left())
        t_y = int(tracked_position.top())
        t_w = int(tracked_position.width())
        t_h = int(tracked_position.height())

        #calculate the centerpoint
        t_x_bar = t_x + 0.5 * t_w
        t_y_bar = t_y + 0.5 * t_h

        #check if the centerpoint of the face is within the 
        #rectangle of a tracker region. Also, the centerpoint
        #of the tracker region must be within the region 
        #detected as a face. If both of these conditions hold
        #we have a match
        if ( ( t_x <= x_bar   <= (t_x + t_w)) and 
             ( t_y <= y_bar   <= (t_y + t_h)) and 
             ( x   <= t_x_bar <= (x   + w  )) and 
             ( y   <= t_y_bar <= (y   + h  ))):
            matchedFid = fid

    #If no matched fid, then we have to create a new tracker
    if matchedFid is None:
        print("Creating new tracker " + str(currentFaceID))
        #Create and store the tracker 
        tracker = dlib.correlation_tracker()
        tracker.start_track(baseImage,
                            dlib.rectangle( x-10,
                                            y-20,
                                            x+w+10,
                                            y+h+20))

        faceTrackers[ currentFaceID ] = tracker

        #Increase the currentFaceID counter
        currentFaceID += 1

In order to show that each face is recognized and tracked separately, we also want to write a descriptive text above the bounding rectangle. To make this functionality show a glimpse of what we want to achieve (full face recognition) we first show a text indicating we are in a detection phase, and after that we show the actual description. After a face has been detected, we will use a small background thread that will set the description for the region after about 2 seconds. Now the display can be achieved by checking whether there already exists a description for this tracker and writing the text accordingly:

#Now loop over all the trackers we have and draw the rectangle
#around the detected faces. If we 'know' the name for this person
#(i.e. the recognition thread is finished), we print the name
#of the person, otherwise the message indicating we are detecting
#the name of the person
for fid in faceTrackers.keys():
    tracked_position =  faceTrackers[fid].get_position()

    t_x = int(tracked_position.left())
    t_y = int(tracked_position.top())
    t_w = int(tracked_position.width())
    t_h = int(tracked_position.height())

    cv2.rectangle(resultImage, (t_x, t_y),
                            (t_x + t_w , t_y + t_h),
                            rectangleColor ,2)

    #If we do have a name for this faceID already, we print the name
    if fid in faceNames.keys():
        cv2.putText(resultImage, faceNames[fid] , 
                    (int(t_x + t_w/2), int(t_y)), 
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, (255, 255, 255), 2)
    else:
        cv2.putText(resultImage, "Detecting..." , 
                    (int(t_x + t_w/2), int(t_y)), 
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, (255, 255, 255), 2)
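
The background thread that fills in the description is not shown in the snippets above. A minimal sketch of how it could look is given below, assuming faceNames is a plain dictionary mapping the face id to its description. The function names recognize_person and start_recognition and the delay parameter are assumptions for this illustration; the sleep simply stands in for the real face recognition work.

```python
import threading
import time

#Dictionary holding the description for each face id we 'recognized'
faceNames = {}


def recognize_person(face_names, fid, delay=2.0):
    """Placeholder for real recognition: wait a while, then store a
    dummy name for this face id."""
    time.sleep(delay)
    face_names[fid] = "Person " + str(fid)


def start_recognition(face_names, fid, delay=2.0):
    """Start the fake recognition in a daemon thread, so it does not
    block the main loop and does not keep the program alive on exit."""
    worker = threading.Thread(target=recognize_person,
                              args=(face_names, fid, delay),
                              daemon=True)
    worker.start()
    return worker
```

The call to start_recognition would go right after a new tracker is created, using the same face id as the key under which the tracker is stored. Until the thread has finished, the id is not present in faceNames and the drawing code shows the "Detecting..." text.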

When we combine all of this, we can now detect multiple faces and track each of them separately. As soon as we have detected a face and we are tracking it, we would only have to do face recognition once, because as long as we are tracking a face we know the link between the tracked region and the properties.

A demo of this new version is given in the animation below:

From this animation, you can see that initially my real face is detected as Person 0. From the beginning till the end, it is tracked and stays known as Person 0. The two pictures are later detected and tracked as Person 1 and Person 2. You will see that the tracking of one of the images is lost because it goes outside of the screen. Furthermore, you will see that the white paper partially overlaps with the region of my real face. Depending on the value of the earlier mentioned threshold the tracker will be able to keep on tracking the person even when it is partially occluded.

Like the previous article, you can find the complete source of this project on my Github page.

Detecting and tracking a face with Python and OpenCV

Guido Diepen — Fri, 10 Feb 2017 08:34:54 GMT

At work, I was asked whether I wanted to help out on a project dealing with a robot that could do autonomous navigation and combine this with both speech recognition and most importantly: face recognition. The moment I heard about this project, I knew I wanted to be involved :)

During the course of the project, we got asked whether we would be able to have the robot ready before the 18th of November: Deloitte was sponsoring the TEDx in Amsterdam and it would be great to show the robot at the Deloitte stand. Fortunately, we were able to finish everything before the 18th:

Although the robot is visible in the picture above, being held by Naser Bakhshi, you cannot see it that well, so below a close-up of the robot we demonstrated at the TEDx:

As you can see in the picture of the robot, it consists of a Lego Mindstorms EV3 unit, some additional Lego Technic, and an iPhone. The two components of the robot you cannot see are a laptop (the actual brain of our robot) and a router (connecting the EV3 unit and the iPhone with the laptop).

As mentioned, one of the features of our robot is that it will do face recognition. In order to do this, the first thing we will have to do is to detect faces and keep tracking them. In this blog post, I want to focus on showing how we made use of Python and OpenCV to detect a face and then use the dlib library to efficiently keep tracking the face.

Detecting a face

After we decided to make use of Python, the first feature we would need for performing face recognition is to detect where in the current field of vision a face is present. Using the OpenCV library, you can make use of the HAAR cascade filters to do this efficiently.

During the implementation, we made use of Anaconda with Python 3.5, OpenCV 3.1.0, and dlib 19.1.0. If you want to use the code in this article, please make sure that you have these (or newer) versions.

In order to do the face detection, we first need to perform a couple of initializations:

#Import the OpenCV library
import cv2

#Initialize a face cascade using the frontal face haar cascade provided
#with the OpenCV2 library
faceCascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

#The desired output width and height
OUTPUT_SIZE_WIDTH = 775
OUTPUT_SIZE_HEIGHT = 600

#Open the first webcam device
capture = cv2.VideoCapture(0)

#Create two opencv named windows
cv2.namedWindow("base-image", cv2.WINDOW_AUTOSIZE)
cv2.namedWindow("result-image", cv2.WINDOW_AUTOSIZE)

#Position the windows next to each other
cv2.moveWindow("base-image",0,100)
cv2.moveWindow("result-image",400,100)

#Start the window thread for the two windows we are using
cv2.startWindowThread()

rectangleColor = (0,165,255)

The rest of the code will be an infinite loop that retrieves the latest image from the webcam, detects all faces within the image, draws a rectangle around the largest face, and finally shows the output in a window.

The above can be achieved with the following code within an infinite loop:

#Retrieve the latest image from the webcam
rc,fullSizeBaseImage = capture.read()

#Resize the image to 320x240
baseImage = cv2.resize( fullSizeBaseImage, ( 320, 240))


#Check if a key was pressed and if it was Q, then destroy all
#opencv windows and exit the application
pressedKey = cv2.waitKey(2)
if pressedKey == ord('Q'):
    cv2.destroyAllWindows()
    exit(0)



#Result image is the image we will show the user, which is a
#combination of the original image from the webcam and the
#overlayed rectangle for the largest face
resultImage = baseImage.copy()


#For the face detection, we need to make use of a gray colored
#image so we will convert the baseImage to a gray-based image
gray = cv2.cvtColor(baseImage, cv2.COLOR_BGR2GRAY)
#Now use the haar cascade detector to find all faces in the
#image
faces = faceCascade.detectMultiScale(gray, 1.3, 5)


#For now, we are only interested in the 'largest' face, and we
#determine this based on the largest area of the found
#rectangle. First initialize the required variables to 0
maxArea = 0
x = 0
y = 0
w = 0
h = 0


#Loop over all faces and check if the area for this face is
#the largest so far
for (_x,_y,_w,_h) in faces:
    if  _w*_h > maxArea:
        x = _x
        y = _y
        w = _w
        h = _h
        maxArea = w*h

#If one or more faces are found, draw a rectangle around the
#largest face present in the picture
if maxArea > 0 :
    cv2.rectangle(resultImage,  (x-10, y-20),
                                (x + w+10 , y + h+20),
                                rectangleColor,2)



#Since we want to show something larger on the screen than the
#original 320x240, we resize the image again
#
#Note that it would also be possible to keep the large version
#of the baseimage and make the result image a copy of this large
#base image and use the scaling factor to draw the rectangle
#at the right coordinates.
largeResult = cv2.resize(resultImage,
                         (OUTPUT_SIZE_WIDTH,OUTPUT_SIZE_HEIGHT))

#Finally, we want to show the images on the screen
cv2.imshow("base-image", baseImage)
cv2.imshow("result-image", largeResult)
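
The alternative mentioned in the comment above (detecting on the small image, but drawing on the large one) only requires scaling the rectangle coordinates. A minimal sketch of that calculation is shown below; the function name scale_rectangle and its parameters are made up for this illustration and are not part of the project code.

```python
def scale_rectangle(x, y, w, h, small_size, large_size):
    """Map a rectangle found in the small image to the coordinates of the
    large image. Both sizes are given as (width, height) tuples. Returns
    the top-left and bottom-right corner as integer points."""
    scale_x = large_size[0] / small_size[0]
    scale_y = large_size[1] / small_size[1]

    top_left = (int(x * scale_x), int(y * scale_y))
    bottom_right = (int((x + w) * scale_x), int((y + h) * scale_y))
    return top_left, bottom_right


#A face found at (80, 60) with size 40x30 in the 320x240 image,
#mapped to a 640x480 image
print(scale_rectangle(80, 60, 40, 30, (320, 240), (640, 480)))
#prints ((160, 120), (240, 180))
```

The two returned points can be passed directly as the corner arguments of cv2.rectangle when drawing on the large image.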

Tracking the face

The above code for face detection has some drawbacks:

The code might be computationally expensive
If the detected person is turning his/her head slightly, the haar cascade might not detect the face anymore
Very difficult to keep track of a face between frames (i.e. to later only do face recognition on a detected face once and not in every loop).

A better approach for this is to do the detection of the face once and then use the correlation tracker from the excellent dlib library to just keep track of the relevant region from frame to frame.

For this to work, we need to import another library and initialize additional variables:

import dlib

#Create the tracker we will use
tracker = dlib.correlation_tracker()

#The variable we use to keep track of the fact whether we are
#currently using the dlib tracker
trackingFace = 0

Within the infinite loop, we will now have to determine if the dlib correlation tracker is currently tracking a region in the image. If this is not the case, we will use similar code as before to find the largest face, but instead of drawing the rectangle, we use the found coordinates to initialize the correlation tracker.

#If we are not tracking a face, then try to detect one
if not trackingFace:

    #For the face detection, we need to make use of a gray
    #colored image so we will convert the baseImage to a
    #gray-based image
    gray = cv2.cvtColor(baseImage, cv2.COLOR_BGR2GRAY)
    #Now use the haar cascade detector to find all faces
    #in the image
    faces = faceCascade.detectMultiScale(gray, 1.3, 5)

    #In the console we can show that only now we are
    #using the detector for a face
    print("Using the cascade detector to detect face")


    #For now, we are only interested in the 'largest'
    #face, and we determine this based on the largest
    #area of the found rectangle. First initialize the
    #required variables to 0
    maxArea = 0
    x = 0
    y = 0
    w = 0
    h = 0


    #Loop over all faces and check if the area for this
    #face is the largest so far
    #We need to convert it to int here because of the
    #requirement of the dlib tracker. If we omit the cast to
    #int here, you will get cast errors since the detector
    #returns numpy.int32 and the tracker requires an int
    for (_x,_y,_w,_h) in faces:
        if  _w*_h > maxArea:
            x = int(_x)
            y = int(_y)
            w = int(_w)
            h = int(_h)
            maxArea = w*h

    #If one or more faces are found, initialize the tracker
    #on the largest face in the picture
    if maxArea > 0 :

        #Initialize the tracker
        tracker.start_track(baseImage,
                            dlib.rectangle( x-10,
                                            y-20,
                                            x+w+10,
                                            y+h+20))

        #Set the indicator variable such that we know the
        #tracker is tracking a region in the image
        trackingFace = 1
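
The selection of the largest face appears in both the detect and the detect-and-track code, so it could also be factored out into a small helper. The sketch below is one possible way to do this and is not how the repository does it; the function name largest_face is an assumption. It works on any sequence of (x, y, w, h) rows and also takes care of the conversion to plain int values that the dlib tracker needs.

```python
def largest_face(faces):
    """Return the (x, y, w, h) of the face with the largest area as a
    tuple of plain ints, or None if no faces were found."""
    best = max(faces, key=lambda face: face[2] * face[3], default=None)
    if best is None:
        return None
    return tuple(int(value) for value in best)


print(largest_face([(0, 0, 10, 10), (5, 5, 30, 20), (1, 1, 5, 5)]))
#prints (5, 5, 30, 20)
print(largest_face([]))
#prints None
```

With this helper, the check on maxArea would turn into a check on whether the returned value is None.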

Now the final bit within the infinite loop is to check again if the correlation tracker is actively tracking a face (i.e. it could be that it just detected a face with the above code). If the tracker is actively tracking a face in the image, we will now update the tracker. Depending on the quality of the update (i.e. how confident is the tracker about whether it is still tracking the same face) we either draw a rectangle around the region indicated by the tracker or we indicate we are not tracking a face anymore:

#Check if the tracker is actively tracking a region in the image
if trackingFace:

    #Update the tracker and request information about the
    #quality of the tracking update
    trackingQuality = tracker.update( baseImage )



    #If the tracking quality is good enough, determine the
    #updated position of the tracked region and draw the
    #rectangle
    if trackingQuality >= 8.75:
        tracked_position =  tracker.get_position()

        t_x = int(tracked_position.left())
        t_y = int(tracked_position.top())
        t_w = int(tracked_position.width())
        t_h = int(tracked_position.height())
        cv2.rectangle(resultImage, (t_x, t_y),
                                    (t_x + t_w , t_y + t_h),
                                    rectangleColor ,2)

    else:
        #If the quality of the tracking update is not
        #sufficient (e.g. the tracked region moved out of the
        #screen) we stop the tracking of the face and in the
        #next loop we will find the largest face in the image
        #again
        trackingFace = 0

As you can see in the code, we print a message to the console every time we use the detector again. If you look at the output of the console while running this application, you will notice that even if you move quite a bit around on the screen, the tracker is quite good at following a face once it is detected.

When using the above code, you should see a screen similar to the following, where the program indicates it detected my face:

You can find the source for both the detect and detect-and-track in my [Face Recognition GitHub repository](https://github.com/gdiepen/face-recognition). If you have any updates, please fork and send me a pull request.

Will also do some follow-up posts on how we did the face recognition.

Converting tabs to spaces in ViM with retab

Guido Diepen — Thu, 11 Oct 2012 08:56:23 GMT

Because I wanted to do some things with the Twitter API, lately I started using Python a bit more again. One cause for major headaches when working with Python is the issue of source files mixing tabs and spaces.

Personally, I always like to set up a tabstop and shiftwidth of 4 in my ViM session and expand all tabs to spaces. I achieve this by using these settings in my .vimrc:

set tabstop=4
set shiftwidth=4
set expandtab

Before, when I opened a file that had mixed tabs and spaces, I always used a find-and-replace approach to change all tabs into spaces. However, yesterday I found out that a much easier approach exists in ViM (as usual… ;) ), namely the :retab command. After opening a file that has mixed tabs and spaces, using the :retab command will reformat the current buffer based on the setting of expandtab.

Always happy when I find these little gems that make my life easier

Small update in SynCE-KPM

Guido Diepen — Fri, 12 Feb 2010 17:52:21 GMT

Even though I was not feeling well today (spent day sick at home, having cold shivers all the time…), I did look into one small new feature for SynCE-KPM.

I noticed that on my windows computer, Windows Mobile Device Center was able to show a picture of my device. I was always wondering where this information was taken from. Turns out that this picture is actually taken from the device itself. I guess it must be something prescribed by ActiveSync or Windows Mobile Device Center to have the image /windows/sync.ico present.

After the device is connected, SynCE-KPM will now look if that file exists, and if it does, obtain it and show it to the user in the main window.

Example screen with two different phones I have:

[![SynCE-KPM now shows device (example 1)](/content/images/2010/02/synce-kpm-screenshot-20100212-01.png "SynCE-KPM now shows device (example 1)")](/content/images/2010/02/synce-kpm-screenshot-20100212-01.png)SynCE-KPM now shows device (example 1)

[![SynCE-KPM now shows device (example 2, my HTC HD2)](/content/images/2010/02/synce-kpm-screenshot-20100212-02.png "SynCE-KPM now shows device (example 2, my HTC HD2)")](/content/images/2010/02/synce-kpm-screenshot-20100212-02.png)SynCE-KPM now shows device (example 2, my HTC HD2)

I still have the major new feature left: implementing the remote registry editor. This is still on my TODO list, but unfortunately I don’t have as much spare time as I would like to have to work on this.

Another thing that I need to work on with SynCE-KPM is the usage of layout managers instead of not being able to resize as with the current implementation. Will take me some time also. Will get there though, slowly

Viewing registry (keys only...) from within SynCE-KPM

Guido Diepen — Thu, 03 Sep 2009 06:09:14 GMT

After the previous post I started working on integrating the stand-alone prototype within SynCE-KPM. As mentioned, at the moment I only have a view on the keys, not yet the values. Furthermore, at the moment there is no error checking, switching a device will not work, and some other minor things

However, as you can see from the screencast

it is already possible to view all the registry keys that are present in the registry. Also, from the screencast you can see the Fetching… node that is shown to the user whenever the actual data is being queried for the first time. It can also be clearly seen that the fetching is done in a separate thread, not related to the GUI thread, because even while fetching, the GUI is still responsive. Any further fetch requests the user makes are just queued to the dataserver, which will retrieve them in the order they were requested.

I have already thought about how to incorporate the values and I have come up with a solution that uses the current registry and registry key objects, but with a different model. This means I will have two custom models, both subclasses of QAbstractItemModel, using the same underlying data, but only the parts they need.

One thing I still need to do is determine what to do with updates (i.e. when to refetch keys: when being clicked on, or have the user explicitly request a refetch). Also some GUI things need to be changed. At the moment it appears that the current window width of synce-kpm is not enough for a registry editor, so I might have to make the application a little bit wider. All in all, plenty of things to be done (and sooooo little time… )

Release of SynCE-KPM 0.12

Guido Diepen — Wed, 20 Aug 2008 15:27:35 GMT

Although it has been released for some time already, finally there is the message on my blog ! :)

Unfortunately I did not have a lot of time to do this earlier: first my time was taken up completely by finishing my thesis and getting all the details right. After that there was a server crash at my hosting provider, and then I went on vacation.

The main improvements with SynCE-KPM 0.12 are that some bugs have been fixed and that SynCE-HAL is now supported. This synce-hal is a new connection manager that is intended to replace odccm. The main advantage of synce-hal over odccm is that we no longer require a separate daemon (odccm) running in the background, but everything is taken care of by HAL. The moment a device is plugged in, HAL will fire up the needed stuff. Another advantage of SynCE-HAL is that it is also able to work transparently with the legacy way of connecting (i.e. PPP over USB) and that bluetooth is working as well (though I don’t know whether this feature is enabled in the packaged version of synce-hal, I did have it working when running the SVN version of it some time ago).

When people have new ideas of things that can be added to future versions of SynCE-KPM, please let me know and I will see what I can do to implement these wishes. One thing that is still on my own wish list is to implement a remote registry editor. For this I will have to put some more time in pyqt though !:)