Hey,
From what I could gather from your description, we'll be extracting information from emails (which we'll be configuring as we build this utility). I'm not sure whether these emails carry the information in a PDF attachment or as plain text in the email body.
Three scenarios that I can think of:
1. Information is available as plain text in the emails (fastest turnaround time)
2. Information is available as a PDF that contains textual data (second-fastest turnaround time)
3. Information is available as a PDF that contains an image (this one will need an OCR component as well, hence the longest turnaround time)
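To make the three cases concrete, here is a rough sketch of how I'd sort an incoming email into one of them, using Python's standard `email` module. The names `classify_email` and `pdf_has_text` are my own placeholders, not anything from your setup: `pdf_has_text` stands for a check (bytes in, True/False out) that would be backed by whichever PDF library we pick, and I'm assuming a PDF with no text layer is a scanned image that needs OCR.

```python
from email.message import EmailMessage


def classify_email(msg, pdf_has_text):
    """Sort an email into one of the three scenarios.

    msg          -- a parsed email (email.message.Message or EmailMessage)
    pdf_has_text -- hypothetical callable: PDF bytes -> bool, True when the
                    PDF has an extractable text layer (assumption: it would
                    be implemented with a PDF library chosen later)
    """
    # Collect the decoded bytes of every PDF attachment in the message.
    pdfs = [
        part.get_payload(decode=True)
        for part in msg.walk()
        if part.get_content_type() == "application/pdf"
    ]

    if not pdfs:
        return "plain_text"   # scenario 1: read the email body directly
    if all(pdf_has_text(data) for data in pdfs):
        return "pdf_text"     # scenario 2: extract text from the PDF
    return "pdf_image"        # scenario 3: at least one PDF needs OCR


# Small demonstration with a hand-built message and a stub text check.
demo = EmailMessage()
demo.set_content("Invoice total: 100")
body_only = classify_email(demo, pdf_has_text=lambda data: True)

demo.add_attachment(b"%PDF-1.4 placeholder", maintype="application",
                    subtype="pdf", filename="invoice.pdf")
with_text_pdf = classify_email(demo, pdf_has_text=lambda data: True)
with_scanned_pdf = classify_email(demo, pdf_has_text=lambda data: False)
```

Which branch the real emails fall into is exactly what decides the turnaround time, so a few sample emails would let me run this kind of check up front.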
Some more information would really help me plan this project out. I'm interested in working on this.