Optical Character Recognition (OCR) is the conversion of scanned images of handwritten, typewritten or printed text into searchable, editable documents.
How to use OCR from the command line in Linux?
You can put all of the commands in a bash script so that they run for you in one go.
The first page looks quite challenging, though. It has different text styles and sizes, and decoration. However, the output is close to the original. The formatting was lost, as expected, but the text is correct. The vertical watermark was transcribed as a line of gibberish at the bottom of the page. Its text was too small for tesseract to read accurately, but the stray line would be easy enough to find and delete.
The worst result would have been stray characters at the end of each line. Curiously, the single letters at the start of the list of questions and answers on page two have been ignored.
The section from the PDF is shown below. As you can see in our result below, the characters were read, but the format of the diagram was lost. Again, tesseract struggled with the small size of the subscripts, and they were rendered incorrectly. In fairness, though, it was still a good result.
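To repeat this kind of test over several scanned pages, the earlier idea of putting the commands in a script could look roughly like the sketch below. It assumes the pages are PNG files in one directory and that tesseract is installed; `ocr_dir` is a hypothetical helper name, not something tesseract provides.

```shell
#!/bin/sh
# Sketch: OCR every PNG in a directory with tesseract.
# ocr_dir is a hypothetical helper name used only for illustration.
ocr_dir() (
    dir=${1:-.}                      # directory to scan, default: current
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue    # skip if no PNG files matched
        # tesseract adds ".txt" to the output base name by itself,
        # so page1.png produces page1.txt next to it.
        tesseract "$img" "${img%.png}" || echo "OCR failed: $img" >&2
    done
)
```

Calling `ocr_dir ~/scans` would then leave one `.txt` file beside each image, which makes it easy to inspect pages individually and delete stray lines such as a misread watermark.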
In other words, the PDF file contains text-based, selectable data rather than graphical, unselectable information. Sometimes you may receive a PDF file which contains only images of text, even though the PDF format supports actual text inside pages.
This can be frustrating, as copy and paste will not be available. You can OCR these pages too, with a small workaround: first convert the PDF file to images, one image per page, and then OCR the individual pages into text.
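The workaround can be sketched as a small shell function. This is only an illustration under some assumptions: `pdftoppm` (from poppler-utils) is used for the PDF-to-image step, which the text does not name, and `pdf_to_text` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: OCR an image-only PDF by rendering one PNG per page first.
# Assumes pdftoppm (poppler-utils) and tesseract are installed.
# pdf_to_text is a hypothetical helper name used only for illustration.
pdf_to_text() (
    pdf=$1
    out=${2:-${pdf%.pdf}.txt}        # default output: same name, .txt
    work=$(mktemp -d) || exit 1      # scratch directory for page images
    # Render each page at 300 DPI as page-1.png, page-2.png, ...
    pdftoppm -r 300 -png "$pdf" "$work/page" || exit 1
    : > "$out"                       # start with an empty output file
    for img in "$work"/page-*.png; do
        [ -e "$img" ] || continue
        # "stdout" as the output base makes tesseract print the text
        tesseract "$img" stdout >> "$out" || exit 1
    done
    rm -rf "$work"                   # not reached if a step fails above
)
```

For example, `pdf_to_text scan.pdf` would write the recognised text of all pages to `scan.txt`. Rendering at 300 DPI is a common starting point for OCR, but the right value depends on the scan.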
A little more work, but still a great time saver over re-typing text manually. We saw how we could easily convert images to text using a simple command.
We also looked at converting images to text-based PDF files, and referred to an article where you can find information on how to pre-convert image-based PDF files to images so they can subsequently be converted to text using the OCR method shown here.
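As a reminder of the image-to-PDF step, here is a minimal sketch. It assumes a tesseract build with PDF output support, which is selected by passing `pdf` after the output base name; `image_to_pdf` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch: turn a scanned image into a searchable, text-based PDF.
# image_to_pdf is a hypothetical helper name used only for illustration.
image_to_pdf() (
    img=$1
    base=${2:-${img%.*}}             # output base name without extension
    # The trailing "pdf" asks tesseract to write "$base.pdf"
    # instead of the default plain-text "$base.txt".
    tesseract "$img" "$base" pdf
)
```

Running `image_to_pdf page.png` would be expected to produce `page.pdf` with a selectable text layer over the original image.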