Thursday 21 July 2011

How to extract Text from PDF, DOC, HTML, CHM, and RTF files


Do you have a PDF document that you would like to convert to a text document? Or perhaps an HTML or CHM (Windows Help) file that you need as plain text? Why might this be useful? Most PDF documents are not editable, and selecting the text manually can be a tedious process.

You can use Text Mining Tool to automatically extract the text from a PDF file so that you can use it freely in any program. If you cannot open a PDF file because you do not have a PDF viewer installed, you can also use this tool to extract the text and read the document.

Text Mining Tool (download it from here) is completely free and does not require installation: simply unzip it and run the program.




Click the Open button and choose the file that you want to convert to text. Click OK, and the large window below the buttons will fill with the text extracted from the document.



Click Save to save the extracted text to your computer. You can also click Clipboard to copy the mined text to the Windows clipboard.

For convenience, the following hotkeys can be used to perform the operations:


  • Open – F3 or O.
  • Save – F2 or S.
  • Clipboard – F5 or C.
  • Exit – F10 or Escape.

You can also use the minetext console tool to create a batch script for extracting text from multiple files. This can be useful if you have a directory with a large number of files that need to have text extracted.

The included console tool minetext has the following syntax:

minetext <input file>

minetext <input file> <output file>

where:

<input file>  - any file with one of the following extensions:
pdf, doc, rtf, chm, htm, html
<output file> - the file to which the text mined from the input file is written
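
As a rough illustration of the batch idea, here is a minimal Python sketch that builds one "minetext <input file> <output file>" command for every supported file in a folder and then runs them. The function names build_commands and run_all are made up for this example, the output naming (appending .txt to the original file name) is just one possible choice, and the script assumes that minetext can be found on your PATH.

```python
import subprocess
from pathlib import Path

# Extensions taken from the syntax description above.
SUPPORTED = {".pdf", ".doc", ".rtf", ".chm", ".htm", ".html"}


def build_commands(folder, exe="minetext"):
    """Return one [exe, input file, output file] command per supported file."""
    commands = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            # Assumption: write the text next to the source file,
            # e.g. report.pdf -> report.pdf.txt (avoids name clashes
            # between report.pdf and report.doc).
            output = path.with_name(path.name + ".txt")
            commands.append([exe, str(path), str(output)])
    return commands


def run_all(folder, exe="minetext"):
    """Run the tool once per file; assumes the executable is on PATH."""
    for command in build_commands(folder, exe):
        # check=False: keep going even if one file fails to convert.
        subprocess.run(command, check=False)
```

Calling run_all with the path of your documents folder would then process every matching file in it. Keeping the command building separate from the running makes it easy to print the commands first and check them before anything is executed.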

If you’re a web designer, this program can be very useful to grab the text from a Word document without getting all of the extra Microsoft Office styling code included with the text.

This is a simple, easy-to-use program. It has one basic purpose, and it does a good job! Enjoy!
