AI15: Using LLMs with actions, not only text
Instead of describing to LLMs what we want, what if they could also infer it from our actions?
Most arguments against a chat interface for AI compare AI with the invention of the Graphical User Interface (GUI). We used to enter commands in the command line until the GUI was invented. Instead of having to remember and type specific commands, we can select text, move files, and open documents visually. Consequently, we shouldn’t be typing specific prompts. There should be a GUI for AI.
But what could that look like?
The first wave of LLM GUIs consists mostly of sliders, drop-downs, and buttons, with prompts behind them. For example, a slider for “sarcasm” is essentially a spectrum from “Do not be sarcastic” to “Be as sarcastic as possible”. A drop-down for tone is essentially a choice between “Be formal”, “Be funny”, “Be casual”, and so on. Developers find the best prompts for each option so that users can simply select what they want, instead of figuring out how to write a good prompt.
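Here is a rough sketch of that pattern: the app translates the slider and drop-down values into prompt instructions before calling the model. The function names, thresholds, and prompt wordings are purely illustrative, not any particular product’s implementation.

```python
def sarcasm_instruction(level: int) -> str:
    """Map a 0-10 slider value onto a spectrum of prompt instructions."""
    if level == 0:
        return "Do not be sarcastic at all."
    if level <= 4:
        return "Use a light touch of sarcasm."
    if level <= 8:
        return "Be noticeably sarcastic."
    return "Be as sarcastic as possible."

# The drop-down equivalent: each option is just a pre-written prompt.
TONE_PROMPTS = {
    "formal": "Write in a formal tone.",
    "funny": "Write in a funny tone.",
    "casual": "Write in a casual tone.",
}

def build_system_prompt(sarcasm: int, tone: str) -> str:
    """Combine the selected controls into one system prompt for the model."""
    return f"{TONE_PROMPTS[tone]} {sarcasm_instruction(sarcasm)}"

print(build_system_prompt(sarcasm=10, tone="casual"))
# -> "Write in a casual tone. Be as sarcastic as possible."
```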
Can we go further than that?
Since LLMs take in text, what can we “say” to LLMs with our actions, beyond sliders, drop-downs, and buttons?
We can take inspiration from AI image apps, such as Pebblely, Krea, and Playground. Instead of being limited to describing what we want in text, we can move objects on a canvas, use reference images, and draw sketches.
Here are some ideas:
Dragging a text document onto another text document merges them into a single text document or creates a new document that combines their ideas, like combining items in a video game. (But what if I drag a spreadsheet onto a slide deck? Or an image onto another image?)
Sharing a folder with an AI chat provides the files in the folder as context. The folder and file names can be additional context for the LLM; for example, it should treat documents in “Jokes” and documents in “Business” differently. (A rough sketch of this idea follows after this list.)
Adding to and removing files from a shared folder automatically updates the context for the LLM.
Putting reports, images, and spreadsheets in a folder creates a slide deck, just like how my iPhone automatically creates video montages with my photos and videos.
Enlarging a textbox generates more text, while shrinking it summarizes the text.
Rearranging two paragraphs regenerates the text between them so that they still connect.
Color-coding text tells the LLM to edit different sections differently. E.g. red = fix errors, green = improve phrasing, blue = summarize, yellow = expand.
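To illustrate the shared-folder idea above, here is a rough sketch of how a folder could be turned into LLM context, with folder and file names included as extra signal. It assumes the OpenAI Python client purely as an example (any chat-completion API would work), and the model name and prompts are placeholders.

```python
from pathlib import Path
from openai import OpenAI  # illustrative choice; any chat-completion client would do

client = OpenAI()

def folder_as_context(folder: str) -> str:
    """Concatenate text files, labelling each with its folder and file name."""
    parts = []
    for path in sorted(Path(folder).rglob("*.txt")):
        parts.append(
            f"[folder: {path.parent.name} | file: {path.name}]\n"
            + path.read_text(encoding="utf-8")
        )
    return "\n\n".join(parts)

def ask_about_folder(folder: str, question: str) -> str:
    """Send the folder's contents plus a question to the model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Treat each document according to its folder name, "
                           "e.g. documents under 'Jokes' differently from 'Business'.",
            },
            {
                "role": "user",
                "content": f"{folder_as_context(folder)}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content

# Re-running this after adding or removing files picks up the new folder
# contents, which is the simplest way to keep the context in sync.
print(ask_about_folder("Shared", "Summarise the business documents only."))
```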
Perhaps we can go even further and create proactive assistance, like those we see in sci-fi movies:
When I’m doing X, the AI detects that and offers to complete it for me, where X is organizing my files, renaming my files, clearing my emails, and so on.
I have a folder with a spreadsheet for my personal expenses. Whenever I add an invoice to the folder, the information is automatically extracted and added to the spreadsheet. (A rough sketch of this follows after this list.)
Whenever I download an invoice (e.g. Spotify), the AI will check if I have been using the app and, if I haven’t been, suggest I cancel the subscription.
At the end of the work day, the AI summarizes my digital activities for the day so that I can review the day and plan for the next day.
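As a sketch of the invoice idea above: a small script could poll the folder for new invoices, ask an LLM to extract the fields, and append them to the spreadsheet. Everything here is an assumption for illustration; `extract_invoice_fields` stands in for whatever extraction call (an LLM with a structured-output prompt, or OCR plus an LLM) you would actually use.

```python
import csv
import time
from pathlib import Path

INVOICE_FOLDER = Path("Expenses")           # hypothetical folder being watched
SPREADSHEET = INVOICE_FOLDER / "expenses.csv"

def extract_invoice_fields(pdf_path: Path) -> dict:
    """Placeholder: in practice this would call an LLM (or OCR + LLM) to pull
    the vendor, date, and amount out of the invoice."""
    return {"vendor": "Unknown", "date": "", "amount": "", "file": pdf_path.name}

def append_row(row: dict) -> None:
    """Append one extracted invoice to the expenses spreadsheet (CSV)."""
    is_new = not SPREADSHEET.exists()
    with SPREADSHEET.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["vendor", "date", "amount", "file"])
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def watch() -> None:
    """Poll the folder and process any invoice we haven't seen before."""
    seen = set()
    while True:
        for pdf in INVOICE_FOLDER.glob("*.pdf"):
            if pdf.name not in seen:
                append_row(extract_invoice_fields(pdf))
                seen.add(pdf.name)
        time.sleep(10)  # simple polling; a file-watcher library would also work

if __name__ == "__main__":
    watch()
```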
There are a few key questions to consider:
How do we design the experience such that users know what can be done and what to expect?
Should the results be deterministic like existing software so that users know what to expect? Or should they be non-deterministic like ChatGPT so that users can get creative outputs?
How do we handle privacy and security, especially when the AI would need to constantly monitor our computers to offer help proactively?
Do LLMs actually enable something that was previously impossible, or do they perform these tasks better than conventional code?
Jargon explained
Tauri is a framework for building desktop apps with HTML and CSS (or frameworks like React or Vue) for the frontend, which makes it easier for web developers to build desktop apps. Another popular option is Electron, but Tauri seems to be gaining popularity because Tauri apps are smaller, load faster, and are more secure. We recently started building a desktop app with Tauri, and I’ll share more once I have more experience with it.
Interesting links
AI-native UX: A thread of examples.
How to Build an Agent (or: The Emperor Has No Clothes): In this article, Thorsten Ball explains and shows that AI agents are simply “an LLM, a loop, and enough tokens” and, if I may add, tools. His example is in Go. If you prefer Python, I wrote about How to build a productivity AI agent in 24 lines of (Python) code last week.
Smithery: A repository of MCP (Model Context Protocol) servers for your AI agents. It even has an MCP for all the MCPs on its platform!
A machine learning engineer at Abnormal Security shares what could go wrong with MCP.