XTuner Docker images available

XTuner is an efficient, flexible and full-featured toolkit for fine-tuning large models (InternLM, Llama, Baichuan, Qwen, ChatGLM), released under the Apache 2.0 license. A key advantage of this framework is that it is not tied to a single LLM architecture, but supports multiple ones out of the box. With the just-released version v0.2.0 of our llm-dataset-converter Python library, you can read and write the XTuner JSON format (and apply the usual filtering, of course).
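To give a feel for the data: a single-turn XTuner-style record looks roughly like the sketch below. The exact keys ("conversation", "system", "input", "output") are an assumption on my part, so verify them against the format your XTuner version actually expects:

```python
import json

# Sketch of an XTuner-style record (keys are an assumption -- check the
# XTuner documentation for the exact schema).
record = {
    "conversation": [
        {
            "system": "You are a helpful assistant.",
            "input": "What is the capital of New Zealand?",
            "output": "Wellington.",
        }
    ]
}

# Serialize and read back, as a converter would when translating
# between this and other formats.
serialized = json.dumps([record], indent=2)
restored = json.loads(serialized)
```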

Here are the newly added image tags:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-xtuner:2024-02-19_cuda11.7

  • Docker hub:

    • waikatodatamining/pytorch-xtuner:2024-02-19_cuda11.7

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces as well (gifr-textgen). We just released version 0.0.4 of the library, which is more flexible with regard to text generation: it can now send and receive the conversation history and also parse JSON responses.
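As a hypothetical sketch of what keeping a conversation history and parsing JSON responses involves (the payload keys "prompt", "history" and "response" here are illustrative, not gifr's actual schema):

```python
import json

# Hypothetical request/response handling; the key names are assumptions,
# not gifr's actual payload schema.
history = []

def build_request(prompt, history):
    """Bundle the new prompt with the conversation so far."""
    return json.dumps({"prompt": prompt, "history": history})

def handle_response(raw, history):
    """Parse a JSON response and record the exchange in the history."""
    data = json.loads(raw)
    history.append({"input": data["prompt"], "output": data["response"]})
    return data["response"]

request = build_request("Hello!", history)
# Simulated JSON response from the model backend:
raw = json.dumps({"prompt": "Hello!", "response": "Hi there!"})
reply = handle_response(raw, history)
```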

Text classification support

Large language models (LLMs) for chatbots are all the rage at the moment, but there is plenty of scope for simpler tasks like text classification, which require fewer resources and run a lot faster.

We turned the HuggingFace example for sequence classification into a Docker image to make building such classification models easy.

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.36.0_cuda11.7_classification

  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.36.0_cuda11.7_classification

Our gifr Python library for gradio received an interface for text classification (gifr-textclass) in version 0.0.3.

The llm-dataset-converter library obtained native support for text classification formats with version 0.1.1.

Llama-2 Docker images available

Llama-2, despite not actually being open-source as advertised, is a very powerful large language model (LLM) that can also be fine-tuned with custom data. With version v0.0.3 of our llm-dataset-converter Python library, it is now possible to generate data in the jsonlines format that the new Docker images for Llama-2 can consume:
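In case jsonlines is unfamiliar: it is simply one JSON object per line. A minimal sketch of writing and reading it follows; the "text" field name is an assumption, so check the Llama-2 image's documentation for the schema it actually expects:

```python
import io
import json

# One JSON object per line; the "text" key is an assumed field name.
records = [
    {"text": "First training example."},
    {"text": "Second training example."},
]

buf = io.StringIO()  # stand-in for a real file
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Reading jsonlines back is just as simple:
restored = [json.loads(line) for line in buf.getvalue().splitlines()]
```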

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.31.0_cuda11.7_llama2

  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.31.0_cuda11.7_llama2

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces as well (gifr-textgen).

gifr release

A lot of our Docker images allow the user to make predictions in two ways: using simple file-polling or via a Redis backend. File-polling is great for testing, but unsuitable for a production system due to wear-and-tear on SSDs.

Initially, I developed a really simple library for sending and receiving data via Redis, called simple-redis-helper:

https://github.com/fracpete/simple-redis-helper

With this library you get some command-line tools for broadcasting, listening, etc. That is sufficient for someone comfortable with the command-line (especially when logged in remotely via a terminal), but not so great for your clients.
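As a rough sketch of the Redis approach, using the widely used redis-py package (the channel names and payload layout below are assumptions for illustration, not the images' actual protocol):

```python
import base64
import json

def encode_image(data: bytes) -> str:
    """Images travel through Redis as base64 text inside a JSON payload."""
    return base64.b64encode(data).decode("ascii")

def decode_image(text: str) -> bytes:
    return base64.b64decode(text)

def make_payload(image_bytes: bytes) -> str:
    return json.dumps({"image": encode_image(image_bytes)})

def run_client(host="localhost", port=6379):
    """Send one image and wait for a prediction. Requires a running Redis
    server and the redis-py package; the channel names ("images",
    "predictions") are hypothetical."""
    import redis
    r = redis.Redis(host=host, port=port)
    sub = r.pubsub()
    sub.subscribe("predictions")          # channel the model publishes to
    r.publish("images", make_payload(b"...raw image bytes..."))
    for msg in sub.listen():              # block until a result arrives
        if msg["type"] == "message":
            return msg["data"]
```

run_client is only defined, not called, since it needs a live Redis instance; the encoding helpers show the round-trip the messages go through.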

Now, there is the brilliant gradio library that was developed specifically for such scenarios: creating easy-to-use, great-looking interfaces for your machine learning models.

Over the last couple of days, I have put together a new library tailored to our Docker images, called gifr:

https://github.com/waikato-datamining/gifr

With the first release, the following types of models are supported:

  • image classification

  • image segmentation

  • object detection/instance segmentation

  • text generation

llm-dataset-converter release

Over the last couple of months, we have been working on a little command-line tool that allows you to convert LLM datasets from one format into another, appropriately called llm-dataset-converter:

https://github.com/waikato-llm/llm-dataset-converter

With the first release (0.0.1), you can not only load data from and save it to various formats (csv/tsv, text, json, jsonlines, parquet), the tool also lets you define pipelines using the following format:

reader [filter [filter ...]] [writer]

Each component in the pipeline comes with its own set of command-line parameters. You can even tee off records and process them differently (e.g., writing the same data to different output formats).
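The reader → filter → writer flow, including the tee idea, can be pictured with a minimal plain-Python sketch (this mirrors the concept only, not llm-dataset-converter's actual classes):

```python
# Plain-Python sketch of the reader -> filter -> writer idea, including a
# tee; illustrative only, not the tool's actual API.
def reader():
    yield from ["Hello World", "FOO BAR", "Good Morning"]

def lowercase(records):
    """A filter: transform each record as it streams past."""
    for rec in records:
        yield rec.lower()

def tee(records, side_effect):
    """Pass records through unchanged while feeding a copy elsewhere."""
    for rec in records:
        side_effect(rec)
        yield rec

copies = []  # e.g. a second writer for a different output format
result = list(tee(lowercase(reader()), copies.append))
```

Because every stage is a generator, records stream through one at a time rather than being held in memory all at once.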

The library also ships other tools, e.g., for downloading files or datasets from huggingface or for combining text files.

To make such pipeline-oriented tools simpler to develop, we created a base library called seppl (Simple Entry Point PipeLines) that manages the handling of plugins (and, if necessary, their compatibility):

https://github.com/waikato-datamining/seppl

Thanks to seppl, the llm-dataset-converter library can be easily extended with additional modules, as it uses a dynamic approach to locating plugins: you only need to declare which modules to search for which superclass (such as Reader, Filter, Writer).
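A minimal sketch of that discovery mechanism, assuming a simple subclass scan (seppl's real implementation differs in detail):

```python
import inspect
import types

# Toy superclass and two "plugins" that would normally live in a module
# declared for discovery.
class Reader: ...
class CsvReader(Reader): ...
class JsonReader(Reader): ...

def find_plugins(module, superclass):
    """Return all classes in `module` that subclass `superclass`."""
    return [
        obj for _, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, superclass) and obj is not superclass
    ]

# Stand-in for a real plugin module:
plugin_module = types.SimpleNamespace(
    Reader=Reader, CsvReader=CsvReader, JsonReader=JsonReader
)
names = sorted(cls.__name__ for cls in find_plugins(plugin_module, Reader))
```

The payoff of this approach is that adding a plugin is just adding a subclass to a registered module; no central registry has to be edited.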

Finetune GPT2-XL Docker images available

The finetune-gpt2xl repository allows fine-tuning and using GPT2-XL and GPT-Neo models (it builds on the Hugging Face transformers library) and is now available via the following Docker images:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.7.0_cuda11.1_finetune-gpt2xl_20220924

  • Docker hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.7.0_cuda11.1_finetune-gpt2xl_20220924

Segment-Anything in High Quality Docker images available

Docker images for Segment-Anything in High Quality (SAM-HQ) are now available.

Just like SAM, SAM-HQ is a great tool for assisting humans with annotating images for image segmentation or object detection, as it can determine a relatively good outline of an object from either a point or a box prompt. Only pre-trained models are available.

The code used by the Docker images is available here:

github.com/waikato-datamining/pytorch/tree/master/segment-anything-hq

The tags for the images are as follows:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-sam-hq:2023-08-17_cuda11.6

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-sam-hq:2023-08-17_cpu

  • Docker hub:

    • waikatodatamining/pytorch-sam-hq:2023-08-17_cuda11.6

    • waikatodatamining/pytorch-sam-hq:2023-08-17_cpu