How to remove malicious code from PDF files

WARNING! Do not open an eBook without making sure that the PDF file is clean.
This article focuses on the danger of free PDF files that float around the internet. I’ll describe in detail how I check for malicious content that might (or might not) come with the zip archive. Before I go any further, I’d like to state that the information presented here is based on my “notes” and is for informational purposes only. I cannot guarantee that the methods I describe here will work for you as they do for me.

How can I know if a PDF eBook contains malicious JavaScript or a virus?
The short answer is that I have to check the file before I open it. I do this with the help of PDF Tools. Please note that this article assumes that you use a computer that runs the powerful and robust Linux operating system. Any distribution can do what I describe here.

Let’s get the pdfid.py Python script

A quick search points me in the right direction, which is the code author’s website.
Once I open that web page, I press Ctrl + F to access the web browser’s search feature and enter this into the search field without the quotes: “pdfid_v0_2_5.zip”

Next, I extract the zip file in my Downloads directory. I then change into the extracted directory by entering this into the Linux terminal: cd /home/youruserid/Downloads/pdfid-master (press enter).
Then I type “ls -l” to list the files inside the pdfid-master directory. This will reveal the following content:

pdfid-master]$ ls -l
total 36
drwxr-xr-x 2 me me 4096 May 27 2016 img
drwxr-xr-x 2 me me 4096 Dec 1 15:21 pdfid
-rw-r--r-- 1 me me 3487 May 27 2016 README.md
-rw-r--r-- 1 me me 311 May 27 2016 setup.py

Inside the pdfid directory are a few more files and the one that’s needed is called pdfid.py which is the actual script.

Before I go any further I need to specify where my PDF eBooks are because the path to the PDF files needs to be entered correctly. To help me find the correct path, I open a file browser like Dolphin or Thunar and navigate to the ebooks directory. The browser will display the correct path which I then use during the next step.

OK, back to the Python script. To execute the pdfid.py script, I change into the pdfid directory and type this command into the terminal: python pdfid.py /home/myid/Downloads/ebookdir/ebooktitle.pdf
Then I press enter and wait until the terminal displays the result shown below.

PDF Header: %PDF-1.4
obj 12246
endobj 12246
stream 4725
endstream 4666
xref 1
trailer 1
startxref 1
/Page 281
/Encrypt 0
/ObjStm 0
/JS 3
/JavaScript 0
/AA 1
/OpenAction 0
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Launch 0
/EmbeddedFile 0
/XFA 0
/Colors > 2^24 0

The noteworthy entries in the output are /Encrypt, /JS, /JavaScript and, depending on the PDF file, the action entries /AA and /OpenAction. We are looking for the number 0 (zero) next to each of them, which means that the file is most likely clean. In the sample output above, /JS shows 3 and /AA shows 1, so that file deserves a closer look. For the record, about 20 percent of the eBooks I got in the past were infected with something. Thanks to YouTube, educational eBooks have less and less significance, but some are handy to have on the go or to mark up with my own hints and tweaks.
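Reading the counts by eye gets tedious with many files, so the check can be scripted. What follows is only a minimal sketch: it assumes pdfid.py sits in the current directory and prints a report in the format shown above. The names parse_report, flagged and scan, as well as the list of watched keywords, are my own picks for illustration and are not part of pdfid.

```python
import re
import subprocess
import sys

# Keywords treated as warning signs in this sketch (the exact list is an assumption).
WATCHED = ("/JS", "/JavaScript", "/AA", "/OpenAction", "/Launch", "/EmbeddedFile")

# Matches report lines such as "/JS 3"; an optional "(n)" suffix after the count is tolerated.
LINE = re.compile(r"^\s*(\S.*?)\s+(\d+)(?:\(\d+\))?\s*$")


def parse_report(text):
    """Turn a pdfid text report into a {keyword: count} dictionary."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def flagged(counts, watched=WATCHED):
    """Return only the watched keywords whose count is above zero."""
    return {key: counts[key] for key in watched if counts.get(key, 0) > 0}


def scan(pdf_path, script="pdfid.py"):
    """Run pdfid.py on one file and return its flagged keywords."""
    result = subprocess.run(
        [sys.executable, script, pdf_path],
        capture_output=True, text=True, check=True,
    )
    return flagged(parse_report(result.stdout))


if __name__ == "__main__":
    # Usage (hypothetical file name): python check_pdf.py /path/to/ebooktitle.pdf
    hits = scan(sys.argv[1])
    print(hits if hits else "no watched keywords found")
```

An empty result only means that none of the watched keywords showed up. It is not a guarantee that the file is harmless.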

What to do with a PDF file that looks suspicious

Again, assuming we use Linux or have at least live-booted into a portable Linux distribution via USB, the terminal is our friend. To get rid of hidden malware or embedded JavaScript, I type this command: (Note: I don’t have to be root to do this step)

pdf2ps NameTitleOfBook.pdf - | ps2pdf - NewNameTitleOfBook.pdf

I then press enter and wait a bit. Depending on how fast a computer crunches numbers, this process can take a few minutes. Once the conversion is finished, the new file’s icon in the file browser changes from a blank white placeholder to the actual eBook cover. I then compare the file sizes to see if the conversion removed a few megabytes. If yes, then we probably dodged a bullet. Now it’s time to open the new file and make sure that everything looks right. If it does, I delete the original PDF eBook file.
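The conversion and the size comparison can be wrapped in a small script as well. This is a rough sketch under the assumption that the Ghostscript helpers pdf2ps and ps2pdf are installed and on the PATH. The function names rebuild and bytes_removed are made up for this example.

```python
import os
import subprocess
import sys


def rebuild(src, dst):
    """Run the equivalent of: pdf2ps src - | ps2pdf - dst"""
    to_ps = subprocess.Popen(["pdf2ps", src, "-"], stdout=subprocess.PIPE)
    to_pdf = subprocess.Popen(["ps2pdf", "-", dst], stdin=to_ps.stdout)
    to_ps.stdout.close()  # let pdf2ps notice if ps2pdf exits early
    to_pdf.wait()
    to_ps.wait()
    if to_ps.returncode != 0 or to_pdf.returncode != 0:
        raise RuntimeError("conversion failed for " + src)


def bytes_removed(src, dst):
    """Size of the original minus the size of the rebuilt file, in bytes."""
    return os.path.getsize(src) - os.path.getsize(dst)


if __name__ == "__main__":
    # Usage (hypothetical file names): python rebuild_pdf.py Old.pdf New.pdf
    original, rebuilt = sys.argv[1], sys.argv[2]
    rebuild(original, rebuilt)
    print("bytes removed:", bytes_removed(original, rebuilt))
```

The sketch never touches the original file, so deleting it stays a manual step after the new file has been checked. A smaller file is a hint, not proof, because the conversion also recompresses pages and images.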

I do this to all of my eBooks and delete the originals. For extra security, I quickly reboot my computer and read the virus-free eBooks.

A word of warning

There is an old saying that the best things in life are free. Don’t be fooled. On the net, nothing is free except open source software and material published under a license such as the GPL or MIT. It’s easy to find some warez site and download eBooks. We have all done it, although we don’t know who made the files. Another malicious file could have been placed somewhere on your system during the unzip or unrar process, and the PDF files themselves can contain JavaScript that opens links on your computer. I personally don’t trust any file that floats around on the net and is up for grabs just like that.
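One way to reduce the unzip risk is to look inside an archive before extracting it. The sketch below uses Python’s standard zipfile module to list entries that are not PDF files or that try to land outside the extraction directory. It only covers zip archives, not rar, and the function name unexpected_members and the “only .pdf is expected” rule are assumptions for illustration.

```python
import sys
import zipfile


def unexpected_members(archive_path, allowed_suffixes=(".pdf",)):
    """List archive entries that are not PDFs or that point outside the target directory."""
    odd = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # plain directory entry
            # An absolute path or a ".." component would escape the extraction directory.
            escapes = name.startswith("/") or ".." in name.split("/")
            if escapes or not name.lower().endswith(allowed_suffixes):
                odd.append(name)
    return odd


if __name__ == "__main__":
    # Usage (hypothetical archive name): python peek_zip.py ebooks.zip
    for entry in unexpected_members(sys.argv[1]):
        print("unexpected:", entry)
```

Nothing is extracted here. The archive is only read, so this can run before anything from the download reaches the disk as a separate file.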

Live booting from a USB stick goes a long way when reading PDF files. If you have VirtualBox installed, a throw-away Linux install that serves the purpose of “looking/testing something” will do as well. You have been warned.

Last but not least, remember that privacy is dead and has been for many years. Got questions or comments? Fire away.
