JANUARY 30, 2023

The Google Research team has published a paper on MusicLM, a machine learning model that generates high-fidelity music from text prompts, and it works extremely well. But they won't release it to the public, at least not yet.

You can browse and play through the examples to listen to results obtained by the research team for a wide variety of text-to-music tasks, including audio generation from rich captions, long generation, story mode, text and melody conditioning, painting caption conditioning, 10-second audio generation from text, and generation diversity.

I'm particularly surprised by the text and melody conditioning examples, where a text prompt (say, "piano solo," "string quartet," or "tribal drums") can be combined with a melody prompt (say, "bella ciao - humming") to generate accurate results.

Even though they haven't released the model, Google Research has publicly released MusicCaps to support future research, "a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts."

DECEMBER 16, 2022

According to OpenAI, "embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts."

OpenAI introduced a new text and code embeddings API endpoint on January 25, 2022, capable of measuring the relatedness of text strings.
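Relatedness between two embeddings is typically measured with cosine similarity, which OpenAI's embeddings guide recommends. Here's a minimal sketch in Python; the three-dimensional vectors are made-up stand-ins (real embeddings come from the API and have far more dimensions), and the function name is my own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # low: unrelated concepts
```

The score only depends on the direction of the vectors, not their length, which is why it works as a relatedness measure regardless of how long the original texts were.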

Here's a list of common uses of text embeddings, as listed in OpenAI's documentation.

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)
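Search and classification both reduce to comparing vectors. The sketch below ranks hypothetical documents by cosine similarity to a query and assigns a text to its most similar label; the document names, labels, and vector values are invented for illustration and would come from the embeddings endpoint in practice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, documents):
    """Return document names sorted from most to least related to the query."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query_vec, documents[name]),
        reverse=True,
    )


def classify(text_vec, labels):
    """Return the label whose embedding is most similar to the text's embedding."""
    return max(labels, key=lambda label: cosine_similarity(text_vec, labels[label]))


# Hypothetical toy embeddings (real ones would be fetched from the API).
documents = {
    "diffusion-notes": [0.9, 0.1, 0.1],
    "podcast-episode": [0.1, 0.9, 0.2],
    "sketch-variations": [0.6, 0.2, 0.7],
}
labels = {
    "machine learning": [1.0, 0.0, 0.2],
    "audio": [0.0, 1.0, 0.0],
}
query = [0.8, 0.0, 0.3]

print(search(query, documents))
print(classify(documents["podcast-episode"], labels))
```

Recommendations and clustering follow the same pattern: recommend the nearest items to one you liked, or group items whose vectors sit close together.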

I look forward to testing this API on my writing to see how well it recommends, classifies, and clusters my mini-essays.

SEPTEMBER 22, 2022

OpenAI has open-sourced Whisper, an automatic speech recognition system that transcribes spoken audio into text.

"We’ve trained and are open-sourcing a neural net called Whisper that approaches human level robustness and accuracy on English speech recognition."

JULY 23, 2022

Two months ago, Hugging Face open-sourced Diffusers, a library of "state-of-the-art diffusion models for image and audio generation in PyTorch."

"Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models."

Here's a text-to-image example from the repository's README.

# !pip install diffusers transformers
from diffusers import DiffusionPipeline

model_id = "CompVis/ldm-text2im-large-256"

# load model and scheduler
ldm = DiffusionPipeline.from_pretrained(model_id)

# run pipeline in inference (sample random noise and denoise)
prompt = "A painting of a squirrel eating a burger"
# recent diffusers versions return an output object; generated PIL images are in .images
images = ldm([prompt], num_inference_steps=50, eta=0.3, guidance_scale=6).images

# save images
for idx, image in enumerate(images):
    image.save(f"squirrel-{idx}.png")
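The guidance_scale argument controls classifier-free guidance. As I understand the Diffusers latent diffusion pipeline, at each denoising step the model predicts noise both with and without the prompt and blends the two as uncond + scale × (text − uncond). Here's a simplified sketch of that blend on plain Python lists; the function name and the numbers are hypothetical, and this isn't the library's code.

```python
def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions.

    A guidance_scale of 1.0 returns the text-conditioned prediction unchanged;
    larger values push the result further in the direction the prompt suggests.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Hypothetical noise predictions for a tiny three-value latent at one step.
uncond = [0.10, -0.20, 0.05]
text = [0.30, -0.10, 0.05]

print(classifier_free_guidance(uncond, text, guidance_scale=6))
```

Higher scales tend to follow the prompt more literally at the cost of variety, which is why values like 6 are a common middle ground.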

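Diffusion models are the architecture behind Google's Imagen and OpenAI's DALL·E 2, which use them both to generate images from text and to increase the resolution of output images. The model in the example above is a latent diffusion model, which runs the diffusion process in a compressed latent space rather than directly on pixels.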
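To get a feel for what "sample random noise and denoise" means, here's a sketch of the forward (noising) process from the DDPM formulation, which a diffusion model learns to reverse: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β) over a noise schedule. I'm assuming a linear β schedule from 1e-4 to 0.02 over 1,000 steps, the values used in the DDPM paper; a latent diffusion model applies the same idea to compressed latents instead of pixels. The helper names are my own.

```python
import math
import random


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_t increasing linearly across the schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = alpha_bar[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


rng = random.Random(0)
abar = alpha_bars(linear_beta_schedule())
x0 = [1.0, -0.5, 0.25]  # hypothetical clean data, e.g. a few pixel or latent values

x_early, _ = add_noise(x0, 10, abar, rng)  # still close to x0
x_late, _ = add_noise(x0, 999, abar, rng)  # almost pure Gaussian noise
```

Training teaches a network to predict the noise that was added at a random step; generation runs the process backwards, starting from pure noise and removing a little of it at each of the num_inference_steps.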

JULY 11, 2022

Here's a video in which I test if OpenAI's DALL-E can generate usable texture maps from an uploaded image.

This texture comes with one of Apple's project examples and the idea of generating textures with DALL-E came from Adam Watters on Discord.

JULY 4, 2022

OpenAI's DALL-E 2 creates variations of my hand sketches.

JULY 3, 2022

I continue to play with DALL-E 2 from time to time. I've posted a few videos and live streams on the topic and plan to share more clips with tiny bits from my experiments and some of my favorite results so far. Tomorrow, a video sharing how DALL-E can copy my hand drawings will come out on YouTube.

JUNE 25, 2022

Here are my impressions of OpenAI's latest iteration of DALL·E, an AI system that generates images from text. I've generated images in different styles and variations of my drawings, experimented with public pages, mask edits, uploads, and more.
