Indexing iCloud Photos with AI Using LLaVA and pgvector

I’ve been fascinated by the rise of AI. However, for the most part, it feels like magic, and I like to get to the bottom of things as much as possible. My favorite courses during my CS undergrad were Digital Design, Computer Architecture, Operating Systems and Networking, and I was in the top 1% in them (except Networking, damn Electronics students). I liked them very much. Though my father taught me Visual Basic and databases when I was around 10, taking these courses was when it finally clicked for me. It felt like when Neo finally saw the Matrix as just green code in a vertical <marquee>, but mine was a much less cool version of enlightenment: “so that’s what you mean by 64-bit CPU”, “oh okay, hyper-threading doesn’t mean 2x CPU all the time”, “I know why the order of nested for loops matters for optimal cache performance on some architectures”, “oh, the internet is 1500-byte messages being exchanged; it’s a miracle it works at this scale”, etc.

But AI changed everything for me. To be frank, I have almost no idea what’s going on with the latest developments. And this is coming from a CS PhD dropout (partly due to military obligation postponement and partly due to a self-discovery process) who had no choice but to take several classical (in other words, boring) AI and Machine Learning courses because of the hype, and because almost no serious professor wanted to work on operating systems, distributed systems or networking (which were my favorites), only on shiny things like Bioinformatics with 50 people on one paper and old-school machine learning.

The way modern AIs work, and the fact that you can literally download one as a single file and run it in less than 8GB of RAM on an M1 machine, is fucking amazing. But just as most of us do not need to deal with CPU architectures all day, we don’t really have to understand how matrix multiplication ends up with sand having artificial intelligence.

For a while, I’ve wanted to play with LLMs and the concept of RAG for an internal use case at Resmo, but I did not have strong use cases. Let me put this out there for once: adding an AI/LLM-powered chatbot to your website for customer support, summarizing documentation or converting natural language to SQL is not useful at all, but yeah, it probably checks a checkbox somewhere so that you can say you have “AI” in your product.

So, to avoid blabbering any further: in short, my hobby project is leveraging a multi-modal LLM that can understand images to improve the semantic search on my photo archive in iCloud. Apple Photos can already recognize the things in an image and provide full-text search over photos. However, it can only detect objects and colors; I found Google Photos search much better.

Describing iCloud Photos using an LLM

As I said, during my MSc and PhD I had to take several ML/AI courses, so I know very well that there are state-of-the-art labeling and segmentation algorithms for images that perform very well and work much more efficiently than an LLM. But I wanted to try this idea.

How about we ask an LLM what it sees in an image, embed the response as a vector using a popular embedding model, and let users search on that? Nothing state of the art, but it’s an interesting case. Obviously the performance will depend on how well the LLM recognizes things, but can an open-source model be good enough to search my photos? If it’s good enough, can it evolve into something else? Let’s dive in.

For starters, if you are using iCloud Photos, all of your photo thumbnails are searchable in a local directory, even if not all of the photos are on your Mac due to storage concerns. There is also an SQLite database of your photos; I wrote a descriptive blog post on it in 2019, but the schemas seem to have changed, so you’ll have better luck with Simon’s blog.

To keep things simple, I wrote code to recursively list the JPEG files in the iCloud thumbnails folder, which can be found under ~/Photos/iCloud.... They are smaller versions of the original photos, but they are more than enough for our use case. As the LLM, I’ve chosen LLaVA with Q4 quantization and used llamafile, a single self-contained executable, which makes deployment extremely easy. It exposes an API that I can call, and all I need to do is encode the images.
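Encoding an image is just base64ing the thumbnail and sending it along with the prompt. Here’s a rough sketch of that call, assuming llamafile’s llama.cpp-style /completion endpoint on the default port 8080; the helper name llm and parameters like n_predict are my own placeholders:

import base64
import requests

def llm(prompt, image_path, temperature=0.1):
    # Read the thumbnail and base64-encode it; id 10 matches the [img-10] tag in the prompt
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = requests.post(
        "http://localhost:8080/completion",  # llamafile's default server address (assumption)
        json={
            "prompt": prompt,
            "image_data": [{"data": image_b64, "id": 10}],
            "temperature": temperature,
            "n_predict": 256,
        },
    )
    response.raise_for_status()
    return response.json()  # the generated description is in the 'content' field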

Prompts

I don’t like the term prompt engineering. In my opinion, it’s not a deep enough subject with several branches to be called actual engineering; it feels like an insult to actual engineering. If we called every trial-and-error process without really understanding the root cause “engineering”, it would cheapen what real engineering is all about: years of tough study, understanding the nuts and bolts of stuff, not just poking around and seeing what sticks.

I tried several prompts to understand what would give a better description for the images. Considering each generation takes around 10 seconds on my M1 Max 64GB machine with the LLaVA 7B Q4 model (I used the REST API; I’m sure it could be faster), I did not have much choice. Since I’m not a PhD student anymore, I will not be publishing anything measured against benchmarks, but I’ll share a few examples of what LLaVA generates with different prompts at different temperatures, and compare the 7B and 13B parameter models.

Of course, GPT-4V generates perfect descriptions for my images in great detail. But at what cost? All the LLM and RAG examples around blogs and YouTube default to OpenAI, and it makes me sad. Don’t get me wrong, I’m grateful that they exist, but the ability to run an LLM with vision on your own computer is amazing, and we should not be dependent on a single company no matter what they do. I’m an old Linux user who could not afford a MacBook back then, and without access to Linux, a free operating system and a package manager, my programming skills would not have progressed as much as they did. So don’t be a simp for a company with a $100 billion valuation; go ahead and support open-source LLMs instead of defaulting to proprietary ones. I’m grateful for the work OpenAI does to advance the field, but there are even specialized LLMs that can work on edge devices.

Prompt 1:

"A chat between a user and an artificial intelligence assistant. The assistant gives detailed answers to the human's questions.
USER:[img-10]Describe this image in detail
ASSISTANT:"

Prompt 2:

"Concise image summary request.
USER:[img-10] Provide a brief, concise description of this image, highlighting only the most essential elements in a few words.
ASSISTANT:"

Prompt 3:

"Detailed image analysis dialogue.
USER:[img-10] I need a thorough analysis of this image, including all elements, colors, and any noticeable features.
ASSISTANT:"

Prompt 4:

"Interactive session for image analysis and description.
USER:[img-10] Please provide a comprehensive description of this image, focusing on all visible details.
ASSISTANT:"

ChatGPT prompt:

"Describe this image in detail"

Some test results:

Image 1: A flyover over the Sphere (2024-01-02-llava-pic1.jpeg)

Image 2: The Temple of Saint Sava, Belgrade (2024-01-02-llava-pic2.jpeg)

Image 3: Two suitcases in a hotel room (2024-01-02-llava-pic3.jpeg)

Image 4: My daughter playing in the sand with her toys (2024-01-02-llava-pic4.jpeg)

Embeddings and pgvector

I remember the emergence of word2vec during my university years; it felt like magic back then and still does. As I said, I don’t need to understand all the details of how it works underneath (though it’d be really helpful), just the fundamentals: given a string, an embedding model converts it into a vector, and you can run a simple distance calculation to find the entries related to, or similar to, your query. Even though I don’t care how the models are trained or how the embeddings work internally, by abstracting that part away I can move on with my life and do multi-dimensional vector similarity. I stored the state of my application in my favorite database, Postgres, and with the pgvector extension, querying for similarity was extremely easy.
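To give an idea of the setup, here’s a minimal sketch of the schema, assuming the 384-dimensional vectors of the all-MiniLM-L6-v2 model used below and a table named results to match the insert code:

import psycopg2

conn = psycopg2.connect(user="mustafa", password="", database="postgres")
cur = conn.cursor()
# Enable pgvector and create a table whose vector size matches the embedding model
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS results (
        id bigserial PRIMARY KEY,
        filename text,
        prompt text,
        model text,
        description text,
        embedding vector(384)  -- all-MiniLM-L6-v2 produces 384-dimensional embeddings
    )
""")
conn.commit()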

Python Code for generating descriptions, embeddings and querying them

This just iterates over the files in my iCloud Photo library, prompts the local LLaVA server over its HTTP API, and stores each description along with its embedding.

import os
import psycopg2

from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

conn = psycopg2.connect(user="mustafa", password="", database="postgres")
register_vector(conn)
cur = conn.cursor()

prompt = "Detailed image analysis dialogue.\nUSER:[img-10] I need a thorough analysis of this image, including all elements, colors, and any noticeable features.\nASSISTANT:"

llm_model = "LLaVA 7B Q4"
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


for root, dirs, files in os.walk("/Users/mustafa/Pictures/Photos Library.photoslibrary/resources/derivatives/"):
    for name in files:
        x = os.path.join(root, name)
        filename, ext = os.path.splitext(x)
        if ext == ".jpeg":
            # llm() calls the local LLaVA server (see the sketch above) and returns its JSON response
            response = llm(prompt, x)
            description = response['content']
            embeddings = model.encode(description)
            cur.execute("INSERT INTO results (filename, prompt, model, description, embedding) VALUES (%s, %s, %s, %s, %s)",
                        (x, prompt, llm_model, description, embeddings))
            conn.commit()
For querying the data, we use the same embedding model and just ask pgvector to bring back the closest vectors. I did not use any indexing because the dataset is small, but pgvector supports indexes like IVFFlat and HNSW for faster retrieval at a small cost in recall. Since these are just image descriptions, I’m sure it would not matter much.
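If the table ever grows, adding an approximate index is a one-liner. A sketch, assuming pgvector 0.5+ for HNSW and the L2 operator class that matches the <-> queries below:

# Either index speeds up nearest-neighbor queries at a small cost in recall
cur.execute("CREATE INDEX ON results USING hnsw (embedding vector_l2_ops)")
# or: CREATE INDEX ON results USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)
conn.commit()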

import ipyplot

from PIL import Image

query = "inside of a shopping mall"
# Embed the query with the same model used for the image descriptions
embeddings = model.encode(query)
# <-> is pgvector's L2 distance operator; a smaller score means a closer match
cur.execute("SELECT filename, description, embedding <-> %s as score FROM results ORDER BY embedding <-> %s LIMIT 5", (embeddings, embeddings))
rows = cur.fetchall()
images = []
labels = []
for row in rows:
    img = Image.open(row[0])
    images.append(img)
    desc = row[1]
    score = row[2]
    labels.append(f'{score:.4f}\n{desc}')

ipyplot.plot_images(images, labels, max_images=30, img_width=350, zoom_scale=1)

In the end, I’ve used the ipyplot library to show a grid of images with their labels and distance to my query. I’ve exhausted my Python knowledge and can’t wait to go back to writing some Kotlin.

Results

Well, the results are surprisingly good on my dataset of ~4000 images. I tried different quantizations ranging from LLaVA 7B-Q4 to LLaVA 13B-FP16, and it did not matter much for my use case. Of course, if this were an official benchmark with an actual dataset to compare against, I’m sure the larger models would score slightly higher, but they are substantially slower, and that was not worth it for fooling around with my dataset.

What’s next?

I like to blog. I like to share my insights and experiences, so that they somehow inspire other people or give them an idea. If I see an actual product being built upon any of the ideas here, I would be more than happy; that is my goal. I have no time or relevant experience to pursue this idea further. Otherwise, instead of this blog post, it’d be on a pitch deck.

There are still possible improvements upon what I’ve shared so far. A user of such a product will likely search for very specific things, like “blue t-shirt baby on Christmas”, not paragraphs of text, yet the LLM outputs are much longer than a typical user query. I’m not really sure it’s a good idea to compare vector embeddings of strings with such different lengths.

Additionally, there could be a multi-layered approach that incorporates the labels in the document, the image metadata, face recognition, a combination of multiple prompts for long and short descriptions, and some sort of multi-level scoring system to bring back more relevant results, similar to how advanced RAG pipelines work.

Alternatively, you could use OpenAI’s GPT-4V and embeddings, but where is the fun in that?
