
· 7 min read

TL;DR: I've created a synthetic dataset for training LLMs to use tools, and am training a model using DPO to improve accuracy. The dataset is available here, and I'll be releasing trained models soon.
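For illustration, a single preference record might pair a prompt with a preferred (correct) tool call and a dispreferred one. This is only a sketch following the common DPO `prompt`/`chosen`/`rejected` convention; the field names, tool names, and call format here are assumptions, not the dataset's actual schema.

```python
# Hypothetical example of one DPO preference pair for tool use.
# Field names follow the common prompt/chosen/rejected convention;
# the actual dataset schema may differ.
record = {
    "prompt": "What's the weather in Berlin right now?",
    "chosen": '{"tool": "get_weather", "arguments": {"city": "Berlin"}}',
    "rejected": '{"tool": "search_web", "arguments": {"query": "Berlin"}}',
}
```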

Why tool usage?

When ChatGPT emerged, its initial utility was as a search engine replacement and knowledge engine. Over time, however, it became clear that it, and models of its class, possessed far more interesting skills: some degree of reasoning ability, and even the ability to perform tasks by outputting structured text (i.e. tool calls). This sparked excitement about its potential as an "agent" capable of interacting with the world. Demos followed, from buying things on the internet to more general tasks like making $100k(!). For a brief period after ChatGPT's launch, it seemed that all we needed to do was find the right agent framework and the correct system prompt, and boom, we'd have AGI.
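Concretely, a "tool call" is just model output in an agreed-upon structured format that a harness can parse and execute. A minimal sketch of that loop, with an invented JSON format and tool name purely for illustration:

```python
# A sketch of the tool-call loop: the model emits structured text,
# the harness parses and executes it, and the result is fed back.
# The JSON format and tool name here are illustrative, not a standard.
import json

model_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)  # parse the structured text
if call["tool"] == "get_weather":
    result = {"temp_c": 4, "conditions": "overcast"}  # stubbed tool result

# The result is serialized back into the context for the next model turn.
next_turn = f"Tool result: {json.dumps(result)}"
```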

· 10 min read

Left: DrawBench prompt "A rainbow penguin in a tuxedo". Right: SDXL output with SuperPrompt applied to the same input prompt.

TL;DR: I've trained a 77M T5 model to expand prompts, and it meets or exceeds existing 1B+ parameter LLMs in quality and prompt alignment.

When DALL-E 3 was released, I was a) really impressed, and b) struck by what an obviously good idea its prompt augmentation was. To that end, I've spent some time over the last few months experimenting with various approaches to emulate that functionality.
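As a rough sketch of how a small T5 prompt-expander is used at inference time: generate an expanded prompt, then pass it to the image model in place of the original. The model id and instruction prefix below are assumptions for illustration.

```python
# Minimal sketch: expanding a short prompt with a small T5 model before
# passing it to the image model. Model id and prefix are assumed.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "roborovski/superprompt-v1"  # assumed HF model id
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

prompt = "Expand the following prompt to add more detail: A rainbow penguin in a tuxedo"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True)
expanded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(expanded)  # feed this to SDXL in place of the original prompt
```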

· 5 min read

For the last two years, the improvement in fidelity and control in diffusion models has been incredible; generation results are often indistinguishable from real images, and a huge ecosystem of tools has emerged to offer various forms of control over the generation process. However, for a lot of real-world workflows, you quickly run up against limitations that make things hard for less technical users.

· 4 min read

Update: This was written over winter break 2023, in the pre-Sora days, when a 10-second video generation was impressive. While probably less relevant now, this technique might become useful again in the future.

Background

Text-to-video diffusion models have matured rapidly in the last few months, starting roughly with Gen-1 and Pika, and gaining popularity in open source with AnimateDiff and now Stable Video Diffusion.

In most video diffusion models, including Stable Video Diffusion, this short maximum clip length exists because the diffusion process is applied across the entire batch: you can think of SVD inference as denoising a batch of latents concurrently, as in text-to-image diffusion models, except that the U-Net attends to all latents at once (technically, through separate temporal and spatial layers). As you can imagine, this is very compute-intensive; it also means the model has only seen videos up to a few seconds long (though at different framerates), since training cost grows with batch size. If you try to generate longer videos, i.e. past 14 frames with SVD and 25 with SVD-XT, coherence and motion suffer.
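For reference, here is a minimal SVD-XT inference sketch using the diffusers library; the `num_frames` argument is the frame budget the paragraph describes. The input image path is a placeholder.

```python
# Minimal sketch of SVD-XT image-to-video inference with diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
)
pipe.to("cuda")

image = load_image("input.png")  # placeholder conditioning frame

# SVD-XT was trained on 25-frame clips; pushing num_frames past this
# is where coherence and motion start to degrade.
frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, "output.mp4", fps=7)
```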