Theseus devlog
I was able to make it control the YouTube UI (tested with the "hide Shorts" button).
Language is the barrier. It has a pretty hard time working out a video's language from its title. Some titles are even auto-translated, which makes it even harder to guess. On top of that, I found that open-source LLMs are just not capable of understanding the context. They cannot even distinguish Korean from Japanese.
Seems like my expectations were too high. I think it's nearly impossible for an LLM to simply guess a video's language from its title.
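One thing that might be worth trying before the LLM sees the title: a plain Unicode-range check. This is only a sketch of an alternative, not something Theseus does today, and `guess_script` plus its return labels are names I made up for illustration. It works because Hangul and kana live in separate Unicode blocks, so Korean vs Japanese can be told apart without any model. It does nothing for auto-translated titles (those reflect the viewer's language, not the video's) and it cannot separate languages that share a script.

```python
def guess_script(title: str) -> str:
    """Rough script-based guess for a title. Hypothetical helper, not part of Theseus."""
    counts = {"hangul": 0, "kana": 0, "han": 0, "latin": 0}
    for ch in title:
        cp = ord(ch)
        # Hangul syllables, Hangul Jamo, Hangul Compatibility Jamo
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
            counts["hangul"] += 1
        # Hiragana and Katakana blocks
        elif 0x3040 <= cp <= 0x30FF:
            counts["kana"] += 1
        # CJK Unified Ideographs (shared by Chinese and Japanese)
        elif 0x4E00 <= cp <= 0x9FFF:
            counts["han"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1

    if counts["hangul"] or counts["kana"]:
        # Mixed titles are rare; pick whichever script dominates.
        return "ko" if counts["hangul"] >= counts["kana"] else "ja"
    if counts["han"]:
        return "zh-or-ja"  # Han characters alone are ambiguous
    if counts["latin"]:
        return "latin"     # could be English, Spanish, etc.
    return "unknown"


if __name__ == "__main__":
    for t in ["오늘의 뉴스", "今日のニュース", "Hello world"]:
        print(t, "->", guess_script(t))
```

The result could be passed to the LLM as a hint instead of asking it to guess from raw text.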
The YouTube tab snapshot is pretty big. I think I need some preprocessor that can summarize the current screen (and only the current screen, excluding elements that aren't shown).
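A minimal sketch of what that preprocessor could look like, assuming the snapshot can be flattened into a list of elements with page-coordinate bounding boxes. The `Element` shape, `visible_elements`, and `summarize` are all hypothetical; the real snapshot format will likely differ. The idea is just to drop anything with zero size or outside the viewport, then emit one short line per remaining element.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    # Hypothetical flattened snapshot entry; coordinates are in page pixels.
    role: str
    text: str
    x: float
    y: float
    width: float
    height: float


def visible_elements(elements: List[Element], viewport_width: float,
                     viewport_height: float, scroll_y: float = 0.0) -> List[Element]:
    """Keep only elements whose box overlaps the currently visible viewport."""
    top, bottom = scroll_y, scroll_y + viewport_height
    visible = []
    for el in elements:
        if el.width <= 0 or el.height <= 0:
            continue  # collapsed or hidden element
        if el.x + el.width <= 0 or el.x >= viewport_width:
            continue  # entirely left or right of the viewport
        if el.y + el.height <= top or el.y >= bottom:
            continue  # entirely above or below the viewport
        visible.append(el)
    return visible


def summarize(elements: List[Element]) -> str:
    """One compact line per element that has text, suitable for an LLM prompt."""
    return "\n".join(f"{el.role}: {el.text}" for el in elements if el.text.strip())


if __name__ == "__main__":
    snapshot = [
        Element("button", "Hide Shorts", 20, 100, 120, 40),
        Element("link", "Video far below the fold", 20, 5000, 300, 200),
        Element("div", "collapsed menu", 0, 50, 0, 0),
    ]
    print(summarize(visible_elements(snapshot, 1280, 720)))
```

This does not catch elements that are on screen but covered by an overlay or hidden via CSS; that would need extra information from the browser.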