Programming with AI — image-based book recommender — Part 1
How I built a book recommender app without writing a line of code
You can try the cool image-based book recommender here.
I’m sure that many of the readers of this post have a “drawer” full of uncompleted cool side projects. Since AI helps my team to code on a daily basis, I’ve decided to take it for a ride and see how far it can take me on one of my favorite side projects, preferably with minimal code writing.
Since I was truly surprised at how far it went, I’ve decided to:
- Write about my experience
- Examine the limit of how far you can get with “no code AI programming” and what the actual limitations are
This post (actually, this series) will describe this journey.
Intro
I’ve had this idea brewing for years — an app to recommend books by just taking a photo. As said, I’m usually busy with client projects. But with help from the latest AI tools, I decided to give it a go. It wasn’t always smooth sailing — computer vision and text recognition threw me some curveballs. But to my surprise, I managed to get a good prototype working much faster than I expected. It’s amazing how AI can help bring ideas to life, even in areas where you’re not an expert. This journey has been both humbling and exciting, and I’m eager to see where it leads.
Here’s the pitch: “Imagine you enter a bookstore, and you really want to buy a book. But which one? Additionally, there might be one of those discounted-books tables, with many books scattered around, just waiting for you to choose one or more, but again, which one? Instead of helplessly poking around, I wish I had an app to solve this. Enter the Book Shazam: take a photo of this setting, and get personalized ratings for the books.”
So why now? In our company, Shibumi AI, we’ve been using LLMs for quite a while now, roughly since the API launch in 2021. We also utilize Copilot for various programming tasks. And of course, we use the web UIs themselves to write texts, summarize texts, and so on. Occasionally, we experimented with writing code in the web app (whether it’s ChatGPT or Claude).
Until recently, the results weren’t particularly impressive. But lately, specifically since the release of Claude Sonnet 3.5, we’ve gotten the feeling that it’s possible to create real software projects, albeit simple ones, using this tool. We’ve tried building browser extensions, simple applications, and the like here and there (and also seen others do it). Therefore, I decided it was time to step up and build a slightly more complex product. This series of posts documents the process.
Additionally, there is an ongoing discussion lately about whether and when non-developers will be able to create full-fledged apps or products using AI. During this project, I kept this discussion in mind and tried to:
- Not touch the code, or at least touch it as little as possible
- Pay special attention to tasks that might be easy for me as a technical person, but slightly challenging to very challenging for a non-developer.
Planning the project
In general, this is not a super hard project, but it has its challenges. Additionally, as a data scientist, I’m stronger in Python and ML models, and less so in JavaScript and UX. But the LLM, and specifically Claude, makes me feel that the gap has narrowed. So let’s begin.
First, let’s break down the product a bit — We need to build:
- User interface
- Computer vision (detection and OCR) system
- Recommendation engine
Clearly, it’s tempting to implement advanced and cool solutions such as cutting-edge recommenders and OCRs, but we’ll go with a product approach of simplicity.
Any novice product manager will draw you something like this:
This sketch generally means that from an early stage of the development, you should have a working solution. In other words, you shouldn’t work on your final 1.0 version until it’s ready and only then release it, since you won’t get any user feedback throughout that period. What you should do is start with an ugly, simple, working solution — an MVP (minimal viable product) — and gradually improve it, with user feedback at every stage. We’ll adopt this approach.
So this is how we’re going to address our task:
Computer vision system: with a simple UI, this will be the backbone of the app. We want the functionality of uploading an image, detecting the books, recognizing their names (OCR — optical character recognition), and clicking on them. We’ll let Claude handle everything.
Recommendation engine: The simplest recommendation engine right now is just to “ask the LLM”. We need to deal with the cold start problem (no data for new users), so we can simply ask the user to enter a few books they’ve recently read and liked (basic, I know). We’ll handle this in the next post.
UX: After having a “working” system, we’ll optimize the UX:
- Make the design look nicer and more modern
- Make the process more friendly and fluent.
More features:
- To make this app more functional, we’ll add a login system which will allow users to use the app more than once.
- Eventually, we’ll deploy the app to the cloud to allow users to access it.
- Out of scope for this series, we can add more features to the app such as rating books that are not in the image and more.
The process — book detection
Let’s first discuss the general approach to development with LLM: as said earlier, Claude’s abilities brought us to something like version 0.2 of LLM development. The 0.1 version was GPT4 (and 4o) which would mostly return code that sometimes worked, and required a few iterations for every task.
In Claude, things are better: code versions are saved as an artifact, and you can manage a kind of dialogue (which sometimes includes editing messages when the model takes an incorrect turn).
Claude also includes a “preview” feature that allows running simple scripts in the editor itself. This seems like a small feature, but in my opinion it is crucial for non-developers. We will not use it in this walkthrough.
When you ask Claude to write code, you should:
- Be very specific.
- Write all required features clearly.
- Don’t make the prompt too long.
So I asked for the following:
Note that this demonstrates good practices for performing programming tasks with an LLM:
- If the model fails badly on one of the sections, you can edit the request with a “reinforcement” for the relevant section (as you see in the “make sure” — from my experience in computer vision tasks, the model often gets lazy and chooses a model that can’t really handle the task).
- Also, note that I gave the model the freedom to choose the technology — it could choose JavaScript or one of its flavors (such as React)… In practice, it chose Flask — which greatly affects the entire architecture — instead of a whole module running in the browser, we’ll have a server/client system here. There are pros and cons to this choice. However, since we are planning to use an LLM in the app, it will be easier to keep the key on the server for now. Additionally, I feel more comfortable with Python, so I won’t complain for now. But there’s a chance we’ll want a client-side-only application later. A rough sketch of what this server/client split implies is shown right below.
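To make the server/client split concrete, here is a minimal sketch of what a Flask skeleton for this kind of app could look like. It is illustrative only: the route names and the run_book_detection helper are my own placeholders, not the code Claude actually generated.

```python
# Hypothetical minimal Flask skeleton for the app (illustrative sketch,
# not the generated code).
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)

@app.route("/")
def index():
    # Serves the single HTML page that handles image upload and clicks
    return render_template("index.html")

@app.route("/detect", methods=["POST"])
def detect():
    # The browser posts the photo here; the server runs detection (and later
    # OCR) and returns bounding boxes so the client can draw clickable regions.
    image = request.files["image"]
    boxes = run_book_detection(image)  # placeholder helper, covered below
    return jsonify({"books": boxes})

if __name__ == "__main__":
    app.run(debug=True)
```

Keeping detection (and any API keys) behind a server endpoint like this is exactly the trade-off mentioned above: heavier setup than a pure-browser app, but secrets never leave the backend.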
For testing purposes, we’ll equip ourselves with a test image:
Claude has a nice UI with text on the left and code (an “artifact”) on the right. It returned the output in Python, HTML, and JavaScript, but stored the code in one file, noting in the comments that I should separate it manually into .py and .html files.
Additionally, Claude adds clear instructions and explanations to the code, which might be very helpful both for using the code and for understanding it better. This is a very useful tool if you want to learn while building.
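After doing the manual separation, the project layout ended up looking roughly like this (the names are illustrative; yours may differ depending on what Claude generates):

```text
book-recommender/
├── app.py               # Flask server: upload endpoint, detection, OCR
└── templates/
    └── index.html       # the page Flask renders (with the JavaScript inline)
```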
After I put the files in place, I ran python app.py according to Claude’s instructions, and this is the resulting app:
Pretty ugly, but right now we’re just trying to make it work; the design will come later.
File upload and book detection work — all books are detected except for two. But there is a bug: clicking a book doesn’t do anything.
Since we are in conversation with Claude, I can briefly note the malfunction, and Claude will try to fix the bug:
Claude rewrites the HTML code, and now it works, see the “Get Info” pop-up below.
Now it remains to fill the popup with content — the name of the book now, and the rating at a later stage. I’ll ask Claude to insert an OCR component:
Claude chooses to use Tesseract — the most generic OCR tool, with a Python package. Not the best tool, but it’s up to date and improving over time, so we’ll give it a go:
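For reference, this is roughly how the OCR step might be wired up with the pytesseract package, as a simplified sketch under my own naming rather than the exact code Claude produced:

```python
# Hedged sketch: run Tesseract (via pytesseract) on each detected book region.
from PIL import Image
import pytesseract

def ocr_book_region(image_path, box):
    # box = (left, top, right, bottom) in pixel coordinates from the detector
    image = Image.open(image_path)
    crop = image.crop(box)
    # Tesseract works best on clean, horizontal text; book covers and spines
    # in a casual photo are rarely that.
    return pytesseract.image_to_string(crop).strip()
```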
As you can see above, Tesseract didn’t work. For most books it doesn’t find anything, for some it detects a bit of gibberish, and only for one does it detect text (not the clearest text), which makes me suspect some kind of malfunction.
So let’s give it another chance or two.
So Claude made some improvements in the code, but the results are still not good.
I think we could improve Tesseract further, and there are better packages for what’s called “text in the wild.” However, we want to move fast here, and it’s sometimes good to point Claude in the right direction: let’s go with the Google Vision API, which is very good.
Claude obeys, replaces the Tesseract code with Google Vision, and all that’s left for me is to get a service account key from Google’s hairy interface. This task can be a bit challenging for the non-developer, but Claude will help here with instructions. After putting everything in place, lo and behold, the OCR works:
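The replacement call looks roughly like the sketch below (my own simplified version, assuming the service account key is exposed through the GOOGLE_APPLICATION_CREDENTIALS environment variable):

```python
# Hedged sketch of the Google Cloud Vision OCR call that replaced Tesseract.
from google.cloud import vision

client = vision.ImageAnnotatorClient()  # picks up the service account key

def ocr_with_google_vision(image_bytes):
    image = vision.Image(content=image_bytes)
    response = client.text_detection(image=image)
    if response.text_annotations:
        # The first annotation contains the full text found in the image;
        # the rest are per-word results with bounding boxes.
        return response.text_annotations[0].description
    return ""
```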
That wasn’t very hard, and it got us a significant part of the project. Let’s celebrate!
Analysis
But wait, let’s think for a second. We are eager to move forward and complete the MVP. However, since such a major part of the app is ready, let’s (along with our celebrations) think about what is missing to actually productize it (apart from the recommender and UX we discussed earlier):
- Computer vision optimization
- Architecture and deployment
Computer vision
We had two tasks here:
- Detection
- OCR
We pushed forward in a hacky manner with one image and witnessed the following results (see detection results image above):
- 17 books detected (true positives)
- 2 books undetected (false negative)
- 2 false positives
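From these single-image counts, a quick back-of-the-envelope calculation puts both precision and recall at roughly 0.89:

```python
# Quick precision/recall arithmetic from the counts above (one test image only)
tp, fp, fn = 17, 2, 2
precision = tp / (tp + fp)  # 17 / 19 ≈ 0.89
recall = tp / (tp + fn)     # 17 / 19 ≈ 0.89
```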
The undetected books should be handled. Let’s look at what Claude has done.
Looking at the code, Claude chose for us the good old YOLOv5 model, pretrained on the COCO dataset, and “highlighted” the book class, which fortunately is included in the dataset.
However, it also chose the “S” model, which means small.
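In code, that detection step looks roughly like the sketch below, which is my own simplified reconstruction (loading the small checkpoint from torch.hub and keeping only the “book” detections), not the exact generated code:

```python
# Hedged sketch of the YOLOv5-based book detection step.
import torch

# "yolov5s" is the small checkpoint; swapping it for "yolov5m" (medium) is the
# one-line change behind the "no-brainer" option discussed below.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def detect_books(image_path):
    results = model(image_path)
    detections = results.pandas().xyxy[0]             # one row per detection
    books = detections[detections["name"] == "book"]  # keep only the COCO "book" class
    return books[["xmin", "ymin", "xmax", "ymax", "confidence"]].values.tolist()
```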
We can:
- Replace the model with a bigger one (no-brainer)
- Replace the model with Yolo v8 (a newer model with better accuracy)
- Replace the model with a model that was specifically trained on books (we’d need to dig into some gory result tables)
- Fine-tune the model by ourselves (will require us to roll up our sleeves)
From a quick experiment, changing YOLOv5-small to YOLOv5-medium does improve results.
To get really exhaustive results, we would need to collect a more significant number of images as a test set. Let’s say ~100, in different settings and lighting conditions — but this is out of scope here.
This kind of computer vision optimization is one of our areas of expertise at Shibumi.
Architecture and deployment
This app currently runs on my machine only. To make it available for users we’ll need to do some groundwork:
- Wrapping the app with better serving capabilities (e.g., Gunicorn, Docker)
- Optimizing serving time and process: currently, it seems that the code instantiates the model on every API call, which is suboptimal. We need to optimize this process to allow us to serve multiple users; a minimal sketch of that fix follows below.
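As an illustration of the second point, the usual pattern is to load the heavy model once when the worker process starts, instead of inside the request handler, and then put a proper WSGI server in front of Flask. This is a hedged sketch under my own naming, not the current app code:

```python
# Sketch of the "load once, serve many" fix (illustrative, not the app's code).
import torch
from PIL import Image
from flask import Flask, request, jsonify

app = Flask(__name__)

# Loaded once per worker process, not on every /detect call
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

@app.route("/detect", methods=["POST"])
def detect():
    image = Image.open(request.files["image"].stream)
    results = model(image)  # reuse the preloaded model
    return jsonify(results.pandas().xyxy[0].to_dict(orient="records"))

# Served with something like: gunicorn -w 2 -b 0.0.0.0:8000 app:app
```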
These items and others are essential for bringing such apps from the MVP stage to a fully working app, and this is a large part of what we do at Shibumi. We’ll handle it at a later stage.
Some thoughts
As said, it seems that we are at stage 0.2 of “programming with LLMs”. We are exploring the capabilities and limitations. One of the major limitations is the small and annoying things in programming — inserting some key into the code (as we’ve seen with the service account key for the Google API), setting up a server, fixing a small bug. As developers, solving these issues is part of our job. But for non-developers, these might be serious blockers. How many of these things will disappear in version 0.3 or later? How many will be solved by external tools like Replit or Cursor, for example? It’s hard to know. But we are focusing on what we have right now.
What’s Next
That’s it for now. In the next part we’ll add the recommender system.
This is the full series:
- Computer vision APP <- we are here
- Recommendation engine <- next part
- UX alignment (user login, modern look and feel)
You can find the code for this part here