NormCap
NormCap is a desktop application written in Python that quickly applies Optical Character Recognition (OCR) to any part of the screen(s). It was born out of the need to extract text from screenshots that I received via e-mail, but it is also applicable to many other use cases, such as copying text from non-selectable user interface elements or even photographs.
It works like this:
- You start NormCap (a red border appears around the screen)
- Select a region on the screen
- Wait a moment while the text is being extracted from the image and copied to the clipboard
- Paste (Ctrl+v) the extracted text wherever you like.
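The steps above can be sketched as a small pipeline. This is only an illustration, not NormCap's actual code: `Region`, `capture_text` and the three callables are hypothetical names, and the screenshot, OCR and clipboard steps are passed in as functions so the flow can be tried without a GUI or an installed Tesseract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Region:
    """Selected screen area in pixels (hypothetical helper type)."""

    left: int
    top: int
    width: int
    height: int


def capture_text(
    region: Region,
    grab: Callable[[Region], object],
    recognize: Callable[[object], str],
    copy: Callable[[str], None],
) -> str:
    """Run one capture: screenshot the region, OCR it, copy the result."""
    image = grab(region)              # take the pixels of the selected region
    text = recognize(image).strip()   # run OCR (e.g. a wrapper around Tesseract)
    copy(text)                        # put the recognized text on the clipboard
    return text
```

With stand-in functions, `capture_text(Region(0, 0, 200, 50), grab=lambda r: r, recognize=lambda img: "Hello\n", copy=print)` prints and returns `Hello`. In a real application the callables would wrap the UI toolkit's screenshot and clipboard facilities and the OCR engine.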
Tech
- PySide6 (Qt) - Multi Platform UI Framework.
- briefcase - Tool for packaging Python applications for various platforms.
- Tesseract - The open source Optical Character Recognition engine.
Design goals
- Quick startup: The application is targeted towards occasional use; probably only a few power users will leave it running in the background (system tray). A quick start therefore improves the user experience considerably.
- Simple to use: It should offer minimal customizability and focus on the core features. As a utility, it should mainly “stay out of your way”. A minimalist user interface also improves accessibility for non-English speakers and reduces localization efforts.
- Multi platform: The app should work on all major operating systems to make it available to a broader range of users and especially attractive to users who work on different systems (I myself work on various operating systems as well).
- Multi screens: The desktop setup of many IT professionals or enthusiasts can get quite complex, including multiple displays, different resolutions and HiDPI displays, potentially combined with non-HiDPI displays.
- “Intelligent”: Often, a user doesn’t want to retrieve the text 1:1 as it is shown on the screen. For example, in a longer paragraph of text you usually don’t want to preserve the original line breaks. Or if you are selecting a group of email addresses, you usually do not want to include conjunctions like “and”, but rather a plain list of the email addresses. The application should be able to handle common scenarios and adapt accordingly, but it should also be possible to opt out of such “intelligent” parsing.
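To make the “intelligent” goal more concrete, here is a minimal sketch of such post-processing. It is not NormCap's implementation: the function names, the simplified email pattern and the "at least half of the words are email addresses" heuristic are all assumptions chosen for illustration.

```python
import re

# Deliberately simple email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def join_paragraph(text: str) -> str:
    """Remove line breaks inside paragraphs; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def extract_emails(text: str) -> list[str]:
    """Return all email-like strings in the order they appear."""
    return EMAIL_RE.findall(text)


def postprocess(text: str, smart: bool = True) -> str:
    """Apply the heuristics, or return the raw OCR text if the user opted out."""
    if not smart:
        return text
    emails = extract_emails(text)
    words = text.split()
    # Hypothetical heuristic: treat the selection as an email list
    # when at least half of its words are email addresses.
    if emails and len(emails) >= len(words) / 2:
        return ", ".join(emails)
    return join_paragraph(text)
```

For instance, `postprocess("alice@example.com and bob@example.org")` drops the conjunction and yields a comma-separated list, while `postprocess(text, smart=False)` returns the text exactly as recognized.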
What I have learned
- Consider targeting a single platform instead of providing multi-platform support.
Integrating an application with different operating systems is quite time-consuming to maintain and comes with trade-offs regarding user experience. Expect to spend a lot of time on platform-related quirks.
- Shipping a Python application as a user-friendly “binary” is an unsolved problem.
Tools like briefcase, Nuitka, PyInstaller and others do a fabulous job, but it is still quite difficult to robustly package a Python application with them. Major pain points are external (non-Python) dependencies and installers (with a proper update mechanism).
- Qt is a powerful UI framework, but its Python bindings (PySide/PyQt) come with challenges.
One major pitfall can be the lifetime management of Qt objects, as the C++ objects can get destroyed without destroying the Python object and vice versa. Unit testing and UI testing come with their own challenges, even with the support of pytest-qt. Last but not least, I find it quite challenging to structure the interaction flow via Qt’s signals and slots. It’s difficult to keep an overview of the different possible flows, and IDEs provide little support here.
- Taking screenshots on Linux is quite challenging.
The various desktop environments (DEs) differ considerably and also change over time (I’m talking to you, Wayland). I can’t count the distros I had to install in a VM to reproduce bugs.
- Showing full-screen windows consistently across all platforms is also super difficult.
This is especially true on macOS and, again, on the different Linux DEs. Supporting multi-monitor setups with different scaling factors is hard as well.
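One reason mixed scaling is tricky is that a selection made in the desktop's logical coordinates has to be mapped to pixel coordinates of the screenshot of the right screen. The sketch below shows that arithmetic under simplifying assumptions: each screen reports its logical position plus a single scale factor, and its screenshot is taken in physical pixels. `Screen`, `Rect` and `to_screenshot_pixels` are hypothetical names, and real platforms do not always report coordinates this consistently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Screen:
    left: int     # logical x of the screen's top-left corner on the desktop
    top: int      # logical y of the screen's top-left corner on the desktop
    scale: float  # device pixel ratio, e.g. 2.0 for a HiDPI display


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    width: int
    height: int


def to_screenshot_pixels(selection: Rect, screen: Screen) -> Rect:
    """Map a logical desktop selection to pixels in that screen's screenshot."""
    # Make the selection relative to the screen it lies on ...
    local_left = selection.left - screen.left
    local_top = selection.top - screen.top
    # ... then scale logical units to physical pixels.
    return Rect(
        left=round(local_left * screen.scale),
        top=round(local_top * screen.scale),
        width=round(selection.width * screen.scale),
        height=round(selection.height * screen.scale),
    )
```

For a HiDPI screen placed to the right of a 1920-wide primary display, `to_screenshot_pixels(Rect(2020, 50, 300, 100), Screen(1920, 0, 2.0))` gives `Rect(left=200, top=100, width=600, height=200)`.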