Llama-2 Docker images available

Llama-2, despite not actually being open-source as advertised, is a very powerful large language model (LLM) that can also be fine-tuned with custom data. With version 0.0.3 of our llm-dataset-converter Python library, it is now possible to generate data in jsonlines format that the new Docker images for Llama-2 can consume (a sample record is shown after the image list):

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.31.0_cuda11.7_llama2

  • Docker Hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.31.0_cuda11.7_llama2
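
Since jsonlines simply stores one JSON object per line, a fine-tuning record could look like the following. The field names used here (instruction, input, output) are only an assumption for illustration; please check the documentation of the Docker images for the exact schema they expect:

    {"instruction": "Summarize the following text.", "input": "Llama-2 is a family of large language models released by Meta...", "output": "Llama-2 is a family of LLMs from Meta."}
    {"instruction": "Translate to French.", "input": "Good morning!", "output": "Bonjour !"}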

Of course, you can use these Docker images in conjunction with our gifr Python library for gradio interfaces as well (gifr-textgen).

gifr release

Many of our Docker images let the user make predictions in two ways: via simple file-polling or via a Redis backend. File-polling is great for testing, but unsuitable for a production system due to the wear and tear on SSDs.

Initially, I developed a really simple library for sending and receiving data via Redis, called simple-redis-helper:

https://github.com/fracpete/simple-redis-helper

This library gives you some command-line tools for broadcasting, listening, etc. These are sufficient for someone who is comfortable with the command line (especially when logged in remotely via a terminal), but not so great for your clients.
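
To illustrate the Redis-based approach: a client publishes its data on one channel and listens for the result on another. Below is a minimal Python sketch that uses the redis library directly; the channel names (images, predictions) are only assumptions here, as the actual channels get configured per container:

    import redis

    # connect to the Redis instance that the Docker container uses
    r = redis.Redis(host="localhost", port=6379, db=0)

    # listen for predictions on the outgoing channel (name is an assumption)
    pubsub = r.pubsub()
    pubsub.subscribe("predictions")

    # broadcast an image on the incoming channel (name is an assumption)
    with open("example.jpg", "rb") as f:
        r.publish("images", f.read())

    # poll until the prediction comes back
    while True:
        message = pubsub.get_message(timeout=0.01)
        if message and message["type"] == "message":
            print(message["data"].decode())
            break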

Now, there is the brilliant gradio library that was developed specifically for such scenarios: creating easy-to-use and great-looking interfaces for your machine learning models.
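
For context, this is roughly all it takes to get a basic gradio interface up and running (the predict function below is just a stand-in for an actual model call):

    import gradio as gr

    # stand-in for an actual model call
    def predict(text: str) -> str:
        return "echo: " + text

    # a simple text-in/text-out interface, served in the browser
    gr.Interface(fn=predict, inputs="text", outputs="text").launch()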

Over the last couple of days, I have put together a new library tailored to our Docker images, called gifr:

https://github.com/waikato-datamining/gifr

With the first release, the following types of models are supported:

  • image classification

  • image segmentation

  • object detection/instance segmentation

  • text generation

llm-dataset-converter release

Over the last couple of months, we have been working on a little command-line tool, appropriately called llm-dataset-converter, that lets you convert LLM datasets from one format into another:

https://github.com/waikato-llm/llm-dataset-converter

With the first release (0.0.1), you can not only load data from and save it to various formats (csv/tsv, text, json, jsonlines, parquet); the tool also lets you define pipelines using the following format:

reader [filter [filter ...]] [writer]

Each component in the pipeline comes with its own set of command-line parameters. You can even tee off records and process them differently (e.g., writing the same data to different output formats).
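
A hypothetical invocation could look like the one below; the tool and plugin names are assumptions for illustration (the project README lists the actual readers, filters and writers), but the overall shape is always a reader, followed by optional filters and an optional writer:

    # hypothetical pipeline: read CSV, write jsonlines
    llm-convert \
      from-csv --input pairs.csv \
      to-jsonlines --output pairs.jsonl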

The library also offers additional tools, e.g., for downloading files or datasets from Hugging Face, or for combining text files.

To make such pipeline-oriented tools simpler to develop, we created a base library that manages the handling of plugins (and, if necessary, their compatibility), called seppl (Simple Entry Point PipeLines):

https://github.com/waikato-datamining/seppl

Thanks to seppl, the llm-dataset-converter library can be easily extended with additional modules, as it uses a dynamic approach to locating plugins: you only need to define which modules to search for which superclasses (like Reader, Filter, Writer).
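
As a minimal sketch of that idea (not seppl's actual API), locating plugins dynamically boils down to scanning a module for subclasses of a given superclass:

    import importlib
    import inspect

    def find_plugins(module_name: str, superclass: type) -> dict:
        """Returns all subclasses of `superclass` defined in the module."""
        module = importlib.import_module(module_name)
        plugins = {}
        for name, obj in inspect.getmembers(module, inspect.isclass):
            if issubclass(obj, superclass) and obj is not superclass:
                plugins[name] = obj
        return plugins

    # e.g., collect all Reader plugins (module and class are placeholders):
    # readers = find_plugins("mylib.readers", Reader)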

Finetune GPT2-XL Docker images available

The finetune-gpt2xl repository allows fine-tuning and using GPT2-XL and GPT-Neo models (it builds on the Hugging Face transformers library) and is now available via the following Docker images:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-huggingface-transformers:4.7.0_cuda11.1_finetune-gpt2xl_20220924

  • Docker Hub:

    • waikatodatamining/pytorch-huggingface-transformers:4.7.0_cuda11.1_finetune-gpt2xl_20220924

Segment-Anything in High Quality Docker images available

Docker images for Segment-Anything in High Quality (SAM-HQ) are now available.

Just like SAM, SAM-HQ is a great tool for assisting a human with annotating images for image segmentation or object detection, as it can determine a relatively good outline of an object from just a point or a box prompt. Only pre-trained models are available.
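
As a rough sketch of the point-prompt workflow, assuming SAM-HQ mirrors the upstream segment-anything predictor API (the checkpoint path, model type and coordinates below are placeholders):

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # load a pre-trained SAM-HQ checkpoint (path/model type are placeholders)
    sam = sam_model_registry["vit_h"](checkpoint="sam_hq_vit_h.pth")
    predictor = SamPredictor(sam)

    # the image to annotate, as an RGB numpy array (H x W x 3)
    image = np.array(Image.open("example.jpg").convert("RGB"))
    predictor.set_image(image)

    # a single foreground point prompt (label 1 = foreground)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )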

The code used by the Docker images is available here:

https://github.com/waikato-datamining/pytorch/tree/master/segment-anything-hq

The tags for the images are as follows:

  • In-house registry:

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-sam-hq:2023-08-17_cuda11.6

    • public.aml-repo.cms.waikato.ac.nz:443/pytorch/pytorch-sam-hq:2023-08-17_cpu

  • Docker Hub:

    • waikatodatamining/pytorch-sam-hq:2023-08-17_cuda11.6

    • waikatodatamining/pytorch-sam-hq:2023-08-17_cpu

Redis-related Docker image updates

The redis-docker-harness Python library, which is used by a lot of our Docker images, has received a number of updates (at the time of writing, version 0.0.4 is in use):

  • the ability to specify a password for the Redis server

  • the ability to specify the timeout parameter for the Redis client, with larger timeouts resulting in lower CPU load (the default is now 0.01 instead of 0.001)
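
The effect on CPU load is easy to see in a typical polling loop: with redis-py, get_message blocks for up to timeout seconds, so a larger value means fewer wake-ups per second. A minimal sketch:

    import redis

    pubsub = redis.Redis().pubsub()
    pubsub.subscribe("some_channel")

    while True:
        # blocks for up to `timeout` seconds: 0.01 wakes up at most
        # 100 times per second, whereas 0.001 polls up to 1000 times
        message = pubsub.get_message(timeout=0.01)
        if message and message["type"] == "message":
            ...  # handle the payload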

Unfortunately, this required re-releasing the most recent images of the following frameworks:

  • detectron2

  • mmdetection

  • mmsegmentation

  • yolov5

  • yolov7

  • Segment Anything (SAM)

  • DEXTR

The images kept their version numbers; you just need to pull them again, or use --pull always in conjunction with docker run (see the example below).
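
For example, to refresh one of the SAM-HQ images listed above:

    # re-download the image under the same tag
    docker pull waikatodatamining/pytorch-sam-hq:2023-08-17_cuda11.6

    # or let docker run check for a newer image automatically
    docker run --pull always waikatodatamining/pytorch-sam-hq:2023-08-17_cuda11.6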