Alfred Intelligence: Issue 1
Things about AI that I didn't understand and that you might also want to know
I have wanted to (emphasis on “wanted to”) learn artificial intelligence since 2020, when my cofounder Swee Kiat came to my home every Saturday to go through one of the classic AI papers (CNN, RNN, etc.). We have since built and launched multiple AI products, with Pebblely being the most successful one so far. I have also learned to use tools like A1111 and ComfyUI to do the bare minimum. But I would be lying if I said I’m comfortable with AI now. I feel overwhelmed just trying to keep up with AI news, let alone understand the new developments.
I had a similar experience when I was learning to code. I wanted to learn to code in 2013 and jumped from HTML and CSS to Ruby on Rails to React. But nothing stuck. It was only after I completed a Udacity nanodegree on frontend web development in 2018 that I finally stopped being afraid of coding. I realized a big part was being ok with not knowing and googling to find answers. While I’m comfortable with some code, I still wouldn’t say I’m a hardcore engineer. But I know enough to not be afraid of trying to understand code (with the help of ChatGPT these days).
Can I reach the same level of comfort and proficiency in AI? Most AI content is either too shallow (e.g. influencers posting about the latest AI tools without even trying them) or too technical (e.g. research papers, paper clubs, AI/ML courses). I need something in between, something for people interested in AI to get comfortable with AI first, build a foundation, and then gradually work from there. I believe we don’t have to understand the mathematical formulas and notation (yet), that technical jargon should be explained to assist learning, and that the best way to understand is to write about it (or even build something with it).
“What I cannot create, I do not understand”—Richard Feynman
Hence, this new series! My goal is to share my notes every week to reinforce what I have just learned and help those in a similar situation as me. This will include things I’m working on, new concepts I’m figuring out, and relatively easier-to-understand articles and videos.
To be frank, I’m not sure how well I can keep this up. But taking my own advice, I’m leveraging the New Year Momentum to attempt to create this new habit, even if I might fail. I hope you will join me in this!
Computer use
This week, I looked into computer use, mostly inspired by Claude’s computer use and because Swee Kiat and I are exploring what we could enable everyday users to do with computer use.
What is computer use? It is essentially letting AI use our computer to do things. Instead of simply generating text or images, the AI will navigate the web, click on buttons, and type, just like us, to complete tasks.
There are a few components to it:
A large language model (LLM) that is intelligent enough to break down the task and come up with a plan.
This AI has several tools it can use to complete the tasks.
A tool lets it interact with the computer, such as moving the mouse, clicking on buttons, and taking screenshots. This could be xdotool (used by Claude) or PyAutoGUI (see the sketch after this list).
Another tool helps it understand our screen via screenshots, such as what the different sections on the screen are, where the buttons are, and where it can type (or technically, “image parsing”). This could be Omniparser, UGround, or even a grid system to create coordinates on the screenshot, combined with a vision model.
Yet another tool lets it do things to the computer with text commands, such as opening a program (via something called Bash).
The same LLM or another one evaluates the steps taken and makes adjustments if necessary.
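To make the tool idea more concrete, here is a minimal sketch of an interaction tool using PyAutoGUI. The wrapper names (take_screenshot, click_at, type_text) are just illustrative names I made up, not a standard API:

```python
# A minimal sketch of an "interaction tool", assuming PyAutoGUI is installed
# (pip install pyautogui). The wrapper names are illustrative, not a standard API.
import pyautogui

def take_screenshot(path="screen.png"):
    # Capture the current screen so the AI (via a vision model or parser) can "see" it.
    image = pyautogui.screenshot()
    image.save(path)
    return path

def click_at(x, y):
    # Move the mouse to pixel coordinates (x, y) and click.
    pyautogui.moveTo(x, y, duration=0.2)
    pyautogui.click()

def type_text(text):
    # Type text into whatever currently has keyboard focus.
    pyautogui.write(text, interval=0.05)

# Example: click a (hypothetical) button at (640, 360), then type into it.
# click_at(640, 360)
# type_text("Hello from the agent")
```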
If you don’t understand things like Omniparser, UGround, etc., don’t worry. I don’t fully understand them yet either. I just know roughly what they do and why they are needed. Be it Omniparser, UGround, or the grid system idea, we can consider them as different tools to help the AI understand our screen. And there are many, many options. Developers experiment with each to see if it works well for their use case. For example, we have been playing with Omniparser to see how fast it is.
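As a toy illustration of the grid system idea, the sketch below overlays labeled gridlines on a screenshot with Pillow so a vision model can answer with rough coordinates. The spacing and styling are arbitrary choices of mine, and this is not how Omniparser or UGround work internally:

```python
# A toy version of the "grid system" idea: draw labeled gridlines on a
# screenshot so a vision model can refer to rough coordinates.
# Assumes Pillow is installed (pip install pillow).
from PIL import Image, ImageDraw

def overlay_grid(screenshot_path, output_path, step=100):
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    width, height = image.size

    # Vertical lines with x-coordinate labels.
    for x in range(0, width, step):
        draw.line([(x, 0), (x, height)], fill="red", width=1)
        draw.text((x + 2, 2), str(x), fill="red")

    # Horizontal lines with y-coordinate labels.
    for y in range(0, height, step):
        draw.line([(0, y), (width, y)], fill="red", width=1)
        draw.text((2, y + 2), str(y), fill="red")

    image.save(output_path)
    return output_path

# overlay_grid("screen.png", "screen_with_grid.png")
```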
The AI would make a plan, take a screenshot to understand the current state of the computer, figure out the next step, verify the step was done correctly or adjust the plan, and repeat until the task is completed. Because it can plan and complete such complex, multi-step tasks itself, it is often known as an “AI agent” or simply “agent”. (But from what I read, people are still debating what an agent really means.)
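Put together, the loop might look roughly like this. call_llm, parse_screenshot, and execute_action are hypothetical placeholders for whichever model, screen parser, and tool calls you actually use:

```python
# A rough sketch of the plan -> observe -> act -> verify loop.
# call_llm(), parse_screenshot(), and execute_action() are hypothetical
# placeholders, not real library functions.

def call_llm(prompt):
    raise NotImplementedError("Call your LLM provider here")

def parse_screenshot(path):
    raise NotImplementedError("Run your screen parser or vision model here")

def execute_action(decision):
    raise NotImplementedError("Translate the decision into click_at()/type_text() calls")

def run_agent(task, max_steps=20):
    plan = call_llm(f"Break this task into steps: {task}")
    for _ in range(max_steps):
        screenshot = take_screenshot()           # from the interaction tool sketch above
        elements = parse_screenshot(screenshot)  # what is on the screen right now?
        decision = call_llm(
            f"Task: {task}\nPlan: {plan}\nScreen elements: {elements}\n"
            "What is the next action? Reply DONE if the task is complete."
        )
        if "DONE" in decision:
            return "Task completed"
        execute_action(decision)
        # The next screenshot doubles as verification of this step.
    return "Gave up after too many steps"
```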
There are frameworks that combine the different components together so that you can build an agent more easily. If you are familiar with web development frameworks, they are kind of like React and Vue but for building AI agents. The most popular one is LangChain; I also came across ShowUI and LaVague this week.
Computer use became very popular recently (October 2024) because Claude 3.5 Sonnet scored 22.0% on OSWorld (see below), a huge leap from the previous best of 7.8%. Humans, on the other hand, score 72.36% (we are only human!).
I’m looking into user interfaces for AI agents next. Let me know if you come across anything interesting or novel!
Jargon explained
OSWorld: I came across this while reading Anthropic’s announcement for computer use. OSWorld is a simulated computer environment to evaluate AI agents on 369 handpicked real-world computer tasks. This allows us to compare (or “benchmark”, a common word used in the space) different AI agents. There are many benchmarks used to judge how well the different AI models perform across various domains, such as solving coding challenges, answering medical school questions, and reasoning.
Grounding: This came up several times when I was reading about image parsing, and I had no idea what it meant. It has nothing to do with lightning rods. It means linking abstract concepts (e.g. “Click on ‘Next’”) to visual items (e.g. the ‘Next’ button) or actions (e.g. clicking the button) so that the AI understands which button on the screen I’m referring to and clicks it.
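In code, grounding could be as simple as matching the instruction against whatever the screen parser found. The element format below is made up for illustration; real parsers return much richer output:

```python
# A toy illustration of "grounding": map an instruction like "Click on 'Next'"
# to the coordinates of a parsed screen element. The element dictionaries are
# a made-up format, not any real parser's output.

elements = [
    {"label": "Back", "x": 100, "y": 700},
    {"label": "Next", "x": 540, "y": 700},
]

def ground(instruction, elements):
    # Return the coordinates of the element whose label appears in the instruction.
    for element in elements:
        if element["label"].lower() in instruction.lower():
            return element["x"], element["y"]
    return None

print(ground("Click on 'Next'", elements))  # (540, 700) -> pass to a click tool
```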
Sectioning: I learned about this from Anthropic’s article on building effective agents. It means breaking a task into independent subtasks to be executed in parallel for speed or better results. According to Anthropic, “For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.” An example is “implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.”
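Anthropic’s guardrail example could be sketched like this, with the screening call and the main response running as two parallel LLM calls (call_llm is again a hypothetical placeholder for your actual model client):

```python
# A sketch of "sectioning": run the guardrail check and the main response as
# two separate, parallel LLM calls. call_llm() is a hypothetical placeholder.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt):
    raise NotImplementedError("Call your LLM provider here")

def answer_with_guardrail(user_query):
    with ThreadPoolExecutor() as pool:
        screen = pool.submit(
            call_llm, f"Is this query inappropriate? Answer YES or NO: {user_query}"
        )
        answer = pool.submit(call_llm, user_query)
        if screen.result().strip().upper().startswith("YES"):
            return "Sorry, I can't help with that."
        return answer.result()
```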
Interesting articles and videos
The Present Future: AI's Impact Long Before Superintelligence: This shared some interesting real-world use cases of AI.
Building effective agents: This is a great and simple overview of agentic systems by the team at Anthropic.
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use: This is a research paper, but it is not technical and is relatively easy to read. If you are curious about what you can do with computer use, it has many examples.
If you notice I misunderstood something, please let me know (politely). And feel free to share interesting articles and videos with me. It’ll be great to learn together!