Windows has a plug-in interface called IFilter that can be used to extract text from most text-based files, Office documents, some PDFs, images, etc.
IFilters are either installed with the OS or come with other software. For example, MS Office Pro 2003 installs an IFilter that runs OCR on TIFF images and returns the text.
As for PDFs, a PDF can be composed of images, text, or both. PDFs created directly from productivity software like MS Office, Open Office, etc. are usually composed of text and can readily be parsed using an IFilter. However, some PDFs are composed strictly of images, scanned documents for example. These can't be easily parsed, because there is no text in the file to extract; the pages would have to go through OCR first.
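If it helps, here is a rough sketch in Python (not part of IFilter itself) of one way to guess which kind of PDF you have before trying to parse it. The function name classify_pdf is made up for illustration, and the check is only a heuristic: it looks for font and image markers in the raw bytes, so it can miss them when the PDF stores its objects in compressed streams, and a scanned PDF that already had OCR applied will show up as text.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Guess whether a PDF carries text or only images (heuristic only).

    Hypothetical helper: scans the raw bytes for font resources and image
    XObjects. Anything inside compressed object streams is not seen.
    """
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")

    # Pages that draw text need a font resource somewhere in the file.
    has_font = b"/Font" in data
    # Embedded pictures are stored as XObjects with /Subtype /Image.
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_font:
        return "text"
    if has_image:
        return "image-only"
    return "unknown"
```

To use it on a file, read the file in binary mode and pass the bytes in, e.g. classify_pdf(open(path, "rb").read()). An "image-only" result suggests the file needs OCR rather than a plain text filter.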
There may be other ways, but maybe this will help.