Letting AI Use Your Computer

My experiments with Claude's computer use and some thoughts

Jan 16, 2025

Imagine an AI that can not only generate text, images, and videos but also do things on your computer for you: filing your taxes, creating financial reports from the invoices on your computer, tracking product prices on Amazon and buying when they are discounted, and so on.

Well, this is already possible, even though its performance is still not very good. Currently, the best AI model for such tasks (Claude 3.5 Sonnet) can complete 22.0% of 369 real-world computer tasks, a huge improvement over the previous best (7.8%) but a long way from human performance (72.36%). Such tasks include installing apps on the computer, organizing emails, and editing a spreadsheet.

How does the AI do things on a computer? The way Claude does it is this:

It thinks through the task and plans the steps required to complete the task.
It takes a screenshot of the computer screen to understand what’s on the screen, like how we would look at our computer.
It takes an action, such as browsing the web, clicking buttons, or typing.
It takes another screenshot to figure out the next step.
It repeats the steps until it thinks the task is completed.
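The loop above can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions, not how Claude or Anthropic's tooling is actually implemented: `ToyComputer`, `toy_model`, and `run_agent` are hypothetical names, the "screen" is just a string, and the "model" is a lookup table standing in for a real AI call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyComputer:
    """Hypothetical stand-in for the simulated computer; the 'screen' is a string."""
    screen: str = "desktop"
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real system would capture an image here.
        return self.screen

    def act(self, action: str) -> None:
        # Record the action and move the toy screen to its next state.
        self.log.append(action)
        transitions = {"open browser": "browser", "type url": "blog home"}
        self.screen = transitions.get(action, self.screen)


def toy_model(task: str, screen: str) -> Optional[str]:
    """Hypothetical decision function; a real system would call an AI model here."""
    plan = {"desktop": "open browser", "browser": "type url"}
    return plan.get(screen)  # None means "I think the task is done"


def run_agent(
    task: str,
    computer: ToyComputer,
    decide: Callable[[str, str], Optional[str]],
    max_steps: int = 10,
) -> List[str]:
    """Screenshot, decide, act, and repeat until the model stops or steps run out."""
    for _ in range(max_steps):
        screen = computer.screenshot()  # look at the screen
        action = decide(task, screen)   # work out the next step
        if action is None:              # the model believes the task is complete
            break
        computer.act(action)            # click, type, browse, ...
    return computer.log
```

Calling `run_agent("find the latest blog post", ToyComputer(), toy_model)` walks the toy screen from the desktop to the blog's home page. The `max_steps` cap is my own addition, so that a confused model cannot loop forever.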

I asked it to do four tasks with varying levels of success. But before we go through the results, I want to remind you (and myself) how bad we are at understanding the impact of new technologies. In 1995, Bill Gates described the internet on the Late Show with David Letterman and was laughed at (starting at 3:09):

BG: Well, it’s become a place where people are publishing information. So everybody can have their own homepage, companies are there, the latest information. It’s wild what’s going on. You can send electronic mail to people. It’s the big new thing.
DL: It’s easy to criticize something you don’t fully understand, which is my position here. But I can remember a couple of months ago, there was like a big breakthrough announcement that on the internet or on some computer deal, they were going to broadcast a baseball game. You could listen to a baseball game on your computer. And I just thought to myself, does radio ring a bell?
(Audience laughs)

Here I was, watching the video from 1995 on the internet. And now you, reading my essay on the internet.

Looking at the examples below, it’s easy to dismiss Claude (or AI in general) as useless, costly, and slow. “We can do it much faster and better ourselves!”, we think. But let us keep an open mind. Given how quickly AI has been developing, it will likely become useful, cheap, and fast enough for our daily tasks in the next few years (and we should also try to accelerate this).

How good is AI at using my computer?

In the screenshots below, you can see Claude’s thought process on the left panel and the simulated computer on the right panel. I could let the AI use my actual computer but Anthropic (the company behind Claude) made it easier to test via a simulated computer, which is also safer because we don’t know what the AI might do.

1. Go to alfredlua.com and find the person’s latest blog post

Claude succeeded at this simple task easily. It took a minute and cost $0.07. I initially thought it had read the entire blog post (a 10-minute read), but it seems it only read the first two lines visible in the screenshot, since it didn’t screenshot (or “see”) anything below the fold. Technically, I only asked it to find the blog post, which it did. When I tried a slightly different prompt (“Go to alfredlua.com and summarize the latest blog post”), Claude scrolled through the blog post.

I published a blog post between the tests, so Claude went to a different blog post for the second test. It thought out loud, “Let me scroll down to read the content:”

Because it is a long blog post and I didn’t want to wait for Claude to slowly finish reading, I tried interrupting it. An interesting discovery is that I can give new instructions while Claude is working. As it was scrolling slowly, I sent “Stop. Just summarize the introduction.” Claude paused for a while and then returned a summary based on what it had seen.

A challenge of letting AI use a simulated computer is retrieving files. I wanted the summary in a text file on my actual computer but Claude couldn’t export the file directly. It can probably upload the file onto a cloud folder for me to download, though.

This second test of summarizing part of a blog post and exporting the summary cost $0.29.

2. Create a spreadsheet evaluating different blogging platforms

It failed to complete the spreadsheet but, judging from the attempt, I think it can probably succeed with enough tries. What went wrong? It failed to type three column headers (Platform, Pricing, and Ease of Use) and didn’t correct the mistake. The rest of the data assumed the columns were correct, which resulted in a table that didn’t make sense. I read that Claude would correct its mistakes but perhaps Claude was planning to review its output only after completing the data entry. I couldn’t tell because I hit the rate limit before it completed the task. This raises an interesting question of when it should review its output. Reviewing after every step could be too much; reviewing only at the very end could be inefficient, because one early mistake carries through everything that follows.

As for the quality of the result, I was hoping it would search the web for the information but it seems like it used the knowledge in the model, which I admit is probably more cost-effective than browsing the web. However, information in the model could be outdated. I’ll be curious to see how much better the information would get if the AI browsed the web, but also how much slower and more costly it would be. For what it’s worth, I think it will be a lot faster if the AI can “read” a webpage by extracting the HTML rather than looking at screenshots of the page like a human would. The downside is that this will miss the nuances of how the page actually looks, which could contain additional information (e.g. dynamic content, iframes, map embeds). When should it act like a human and when should it work like a computer?
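To make the “read the HTML” idea concrete, here is a minimal sketch using only Python’s standard-library `html.parser`. It is not what Claude does; `TextExtractor`, `page_text`, and `SAMPLE_HTML` are names and data I made up for illustration. It also shows the downside: whatever the script or the iframe would have rendered never appears in the extracted text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from a page, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script and style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def page_text(html: str) -> str:
    """Return the text of an HTML document, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up sample page: the iframe and the script contribute nothing to the text.
SAMPLE_HTML = """<html><head><style>p {color: red}</style></head>
<body><h1>Latest post</h1><p>First paragraph.</p>
<iframe src="https://example.com/map"></iframe>
<script>renderPrices()</script></body></html>"""

print(page_text(SAMPLE_HTML))
```

On this sample the extractor returns just the heading and the paragraph. That is quick and cheap, but the map and any prices drawn by the script are exactly the kind of information a screenshot would have caught.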

This took about three minutes and cost $0.35. It would take me, or an intern, much much longer and arguably this spreadsheet can be considered a good starting point for further research. [1]

3. Search and compile kid-friendly farmstays into a spreadsheet

The full prompt was: Search online for the top 3 kid-friendly farmstays in Perth, Australia and why they are recommended. Compile the information into a spreadsheet for comparison.

I tried this task twice. It (kind of) successfully created and saved a spreadsheet on the first try but failed to save on the second attempt.

On the first go, it mistyped the search term in the middle of the existing search term but the search results were still fine (kudos to Google). It experienced several issues with bash (a program for interacting with the computer via text commands) and couldn’t open LibreOffice Calc (the free spreadsheet program on the simulated computer). Eventually, it gave me the information as text and said I could copy it into Excel, Google Sheets, or any spreadsheet program of my choice. Only after I asked again did it put the information into a spreadsheet. This took about six minutes and cost $0.39.

For the first attempt, it inserted the search term in the middle of the existing one, instead of replacing it. Not replacing the correct word(s) seems to be a common issue with Claude's computer use.

For the second try, I restarted the simulated computer in case it was causing the bash issues. This time, Claude could open LibreOffice Calc and create the spreadsheet. But when it wanted to click “Save as”, it clicked “Print” and got stuck. Another issue was that it considered the sponsored result as the top recommended farmstay because it was at the top of the search results (which is something I have seen many human friends do). On a positive note, Claude tried different search terms, which could be helpful if it were much faster. This also took about six minutes and cost $0.32.

For both tests, it didn’t browse each farmstay listing. It created the key features listed in the spreadsheet using the few points on the Google search results, which include the rating and number of reviews, a truncated review (which the model seemed to have expanded on), and two labels (e.g. “Child friendly” and “Free parking”). Maybe I should have been more specific and asked it to browse further but (1) if I need to specify every single step, I’m not leveraging the model’s supposed reasoning intelligence and (2) for the current tech, it will take a much longer time and cost a lot more.

4. QA our insurance app

The full prompt was: Go to https://health-insurance-sg.vercel.app/ and find the premiums for Income Enhanced IncomeShield Preferred for a 32 yo.

It successfully navigated to the app, selected the right company (Income), and asked the correct question. But there were two issues: (1) unlike humans who can constantly watch the screen for updates, Claude only takes screenshots at intervals, so it missed part of the answer that went out of view, and (2) Claude can only scroll up with the Page Up key, which scrolls the entire webpage, so it couldn’t scroll specifically the chat section to see the full answer.

Also, reflecting now, a better prompt for quality assurance would have been:

Go to https://health-insurance-sg.vercel.app/ and find the premiums for Income Enhanced IncomeShield Preferred for a 32 yo. It should be $397.29 for MediShield and $360 for the private insurance ($300 payable with MediSave and $60 cash payment required). If it is incorrect, figure out why and report.
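The expected figures in that prompt could also be checked mechanically once the AI reports what it found. Below is a minimal sketch, assuming the reported premiums arrive as a simple dictionary. The key names, `EXPECTED`, and `check_premiums` are my own inventions for illustration; only the dollar amounts come from the prompt above.

```python
from decimal import Decimal

# Expected premiums for a 32-year-old, taken from the prompt above.
# The key names are hypothetical.
EXPECTED = {
    "medishield": Decimal("397.29"),
    "private_total": Decimal("360"),
    "private_medisave": Decimal("300"),
    "private_cash": Decimal("60"),
}


def check_premiums(reported: dict) -> list:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for key, expected in EXPECTED.items():
        actual = reported.get(key)
        if actual is None:
            problems.append(f"{key}: missing")
        elif Decimal(str(actual)) != expected:
            problems.append(f"{key}: expected {expected}, got {actual}")
    return problems


# Sanity check on the expected values themselves:
# the MediSave portion plus the cash portion should equal the private premium.
assert EXPECTED["private_medisave"] + EXPECTED["private_cash"] == EXPECTED["private_total"]
```

I used `Decimal` rather than floats so that amounts like $397.29 compare exactly. A missing key is reported separately from a wrong value, which would help tell “the answer scrolled out of view” apart from “the app gave the wrong number”.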

What shall we let AI use our computer for?

Here are some of my thoughts after the tests:

On speed

Some stuff feels annoyingly slow because I know I can do it way faster. For example, Claude took about a minute to close the “Tip of the day” popup after opening LibreOffice Calc. Saving a spreadsheet also took about a minute. I suspect this is because we know where on the screen to look but the AI doesn’t, so it has to analyze the entire screenshot. Given this limitation, we could start by using it for things that we are comfortable letting it do in the background, where speed doesn’t matter. But humans, being humans, would probably want to supervise the entire process until we are comfortable with the AI doing it autonomously.

The slowness also makes it much harder to experiment with different tasks and learn what it is capable of. If it were much faster, we could figure out much more quickly what other amazing things we can do with it.

On accuracy

It will click on the wrong things (such as “Print” instead of “Save as”), which is very human. But we would correct the mistake instantly. The AI should ideally be able to correct itself but it didn’t in my experiments.

Data entry and form filling seem to be the most common use cases so far because AI can navigate well and type quickly. However, it would be less useful if it consistently selects the wrong cell or field and does not correct itself.

One study also found that when Claude couldn’t find a specific option on the page, it incorrectly switched its focus to another section of the website, instead of scrolling to find the option.

On working in parallel

Since I tested on a simulated computer, it couldn’t access and work with the files on my computer. If I ran this on my computer, I wouldn’t be able to use it while the AI is using it because the AI would be taking screenshots and using the mouse. Can there be a digital twin of my computer for the AI so that we can work in parallel? Or is there an easy way to share files with each other, such as a shared folder or cloud storage?

On figuring out use cases

Letting AI use our computers to complete tasks will likely become better, cheaper, and faster in the next few years. I believe the main challenge will be identifying what it is best suited to do, like how we are still figuring out what to use ChatGPT for.

Here are some things people have tried asking Claude to do on their computer:

Interestingly, Claude’s system prompt forbids it from doing several actions. This is likely to prevent spam and scams but greatly limits the possible applications.

If AI could do anything on your computer for you, what would you ask it to do?

[1] For now, Perplexity Pro might be the best tool for such research. Check out this answer, which was generated in about 30 seconds.
