image-dataset-converter release

Based on lessons learned from our wai-annotations library, we simplified and streamlined the design of a data processing library (though limited to image datasets). Of course, it makes use of the latest seppl version, which also simplifies how plugins are located at runtime and during development.

The new kid on the block is called image-dataset-converter and its code is located here:

Whilst it is based on wai-annotations, it already contains additional functionality.

And, of course, we also have resources demonstrating how to use the new library:

XTuner Docker images available

XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM), released under the Apache 2.0 license. The advantage of this framework is that it is not tied down to a specific LLM architecture, but supports multiple ones out of the box. With the just-released version 0.2.0 of our llm-dataset-converter Python library, you can read and write the XTuner JSON format (and apply the usual filtering, of course).
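For reference, XTuner's multi-turn conversation records look roughly like the sketch below (field names as shown in the XTuner documentation; the exact structure llm-dataset-converter reads and writes may differ, so treat this purely as an illustration of the format):

```python
import json

# One training record in the XTuner-style multi-turn conversation layout.
# Field names follow the XTuner docs; this is an illustration only, not
# necessarily the exact output of llm-dataset-converter.
record = {
    "conversation": [
        {
            "system": "You are a helpful assistant.",
            "input": "What is the capital of New Zealand?",
            "output": "The capital of New Zealand is Wellington.",
        },
        {
            "input": "And its largest city?",
            "output": "Auckland is the largest city in New Zealand.",
        },
    ]
}

# A dataset file is typically a JSON list of such records:
serialized = json.dumps([record], indent=2)
```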

Here are the newly added image tags:

  • In-house registry:


  • Docker hub:

    • waikatodatamining/pytorch-xtuner:2024-02-19_cuda11.7

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces as well (gifr-textgen). Just now we released version 0.0.4 of the library, which is more flexible with regard to text generation: it can now send and receive the conversation history and also parse JSON responses.
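The new history/JSON handling can be pictured with a small round-trip sketch; the payload layout and function names below are hypothetical and not gifr's actual wire format:

```python
import json

# Minimal sketch of the mechanism: keep a conversation history, send it
# along with each prompt, and parse a JSON response. The payload layout
# here is made up for illustration, not gifr's actual protocol.
history = []

def ask(prompt, backend):
    history.append({"role": "user", "content": prompt})
    reply = backend(json.dumps({"history": history}))   # e.g. a Redis round-trip
    answer = json.loads(reply)["content"]               # parse the JSON response
    history.append({"role": "assistant", "content": answer})
    return answer

# A stand-in backend that simply echoes the last user message:
def echo_backend(payload):
    last = json.loads(payload)["history"][-1]["content"]
    return json.dumps({"content": f"You said: {last}"})

print(ask("hello", echo_backend))  # -> You said: hello
```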

Text classification support

Large language models (LLMs) for chatbots are all the rage at the moment, but there is plenty of scope for simpler tasks like text classification. Requiring fewer resources and being a lot faster is nice as well.

We turned the HuggingFace example for sequence classification into a Docker image to make it easy to build such classification models.

  • In-house registry:


  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.36.0_cuda11.7_classification

Our gifr Python library for gradio received an interface for text classification (gifr-textclass) in version 0.0.3.

The llm-dataset-converter library obtained native support for text classification formats with version 0.1.1.

Llama-2 Docker images available

Llama-2, despite not actually being open-source as advertised, is a very powerful large language model (LLM), which can also be fine-tuned with custom data. With version 0.0.3 of our llm-dataset-converter Python library, it is now possible to generate data in jsonlines format that the new Docker images for Llama-2 can consume:

  • In-house registry:


  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.31.0_cuda11.7_llama2

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces as well (gifr-textgen).

gifr release

A lot of our Docker images allow the user to make predictions in two ways: using simple file-polling or via a Redis backend. File-polling is great for testing, but unsuitable for a production system due to wear-and-tear on SSDs.
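The file-polling mode can be sketched as follows (the directory names and the processing step are placeholders, not the actual conventions of our Docker images):

```python
import pathlib
import tempfile

# Sketch of the file-polling idea: watch an input directory, "predict" on
# any new file, write the result to an output directory and remove the
# input file. Placeholders only, not the images' actual protocol.
def poll_once(in_dir, out_dir, predict):
    for f in sorted(pathlib.Path(in_dir).glob("*.txt")):
        result = predict(f.read_text())
        (pathlib.Path(out_dir) / f.name).write_text(result)
        f.unlink()  # done, remove so it is not processed again

in_dir = tempfile.mkdtemp()
out_dir = tempfile.mkdtemp()
pathlib.Path(in_dir, "query.txt").write_text("hello")
poll_once(in_dir, out_dir, predict=str.upper)
print(pathlib.Path(out_dir, "query.txt").read_text())  # -> HELLO
```

In a real deployment this loop would run continuously, which is exactly the constant disk churn that makes the Redis backend the better choice in production.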

Initially, I developed a really simple library for sending and receiving data via Redis, called simple-redis-helper:

This library gives you some command-line tools for broadcasting, listening, etc. That is sufficient for someone comfortable with the command line (especially when logged in remotely via a terminal), but not so great for your clients.

Now, there is the brilliant gradio library, which was developed specifically for such scenarios: creating easy-to-use and great-looking interfaces for your machine learning models.

Over the last couple of days, I have put together a new library tailored to our Docker images, called gifr:

With the first release, the following types of models are supported:

  • image classification

  • image segmentation

  • object detection/instance segmentation

  • text generation

llm-dataset-converter release

Over the last couple of months, we have been working on a little command-line tool that allows you to convert LLM datasets from one format into another, appropriately called llm-dataset-converter:

With the first release (0.0.1), you can load data from and save it to various formats (csv/tsv, text, json, jsonlines, parquet). The tool lets you define pipelines using the following format:

reader [filter [filter ...]] [writer]

Each component in the pipeline comes with its own set of command-line parameters. You can even tee off records and process them differently (e.g., writing the same data to different output formats).
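As a sketch of how such a command line can be decomposed, the snippet below splits the arguments into per-plugin groups; the plugin names are invented for illustration and do not match the library's actual readers, filters or writers:

```python
# Any token that matches a known plugin name starts a new group; everything
# up to the next plugin name belongs to that plugin as its own parameters.
# The plugin names here are made up for illustration.
PLUGINS = {"from-csv", "keep-lang", "to-jsonlines"}

def split_pipeline(args):
    groups = []
    for tok in args:
        if tok in PLUGINS:
            groups.append([tok])        # start a new plugin's argument list
        else:
            groups[-1].append(tok)      # parameter of the current plugin
    return groups

cmdline = ["from-csv", "--input", "data.csv",
           "keep-lang", "--lang", "en",
           "to-jsonlines", "--output", "out.jsonl"]
print(split_pipeline(cmdline))
```

Each group can then be handed to the corresponding plugin's own argument parser, which is what makes per-component options possible.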

The library also comes with other tools, e.g., for downloading files or datasets from huggingface or for combining text files.

To make such pipeline-oriented tools simpler to develop, we created a base library called seppl (Simple Entry Point PipeLines) that manages the handling of plugins (and, if necessary, their compatibility):

Thanks to seppl, the llm-dataset-converter library can be easily extended with additional modules, as it uses a dynamic approach to locating plugins: you only need to define which modules to search for which superclass (like Reader, Filter, Writer).
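The dynamic lookup idea can be sketched with plain module inspection; the class and module names below are invented and seppl's real API differs:

```python
import inspect
import types

# Sketch of the dynamic lookup idea: given a module and a superclass,
# collect all classes in that module that derive from the superclass.
# Class/module names are invented; seppl's actual API differs.
class Reader: ...
class Filter: ...

def find_plugins(module, superclass):
    return {
        name: obj
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, superclass) and obj is not superclass
    }

# A stand-in plugin module with two readers and one filter:
mod = types.ModuleType("my_plugins")
mod.CsvReader = type("CsvReader", (Reader,), {})
mod.JsonReader = type("JsonReader", (Reader,), {})
mod.LangFilter = type("LangFilter", (Filter,), {})

print(sorted(find_plugins(mod, Reader)))  # -> ['CsvReader', 'JsonReader']
```

The appeal of this approach is that adding a plugin only requires defining a new subclass in a registered module; no central registry needs to be edited.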