As a system administrator, part of your responsibility is to help users manage their data. One of the vital aspects of doing that is to ensure your organization has a good backup plan, and that your users either make their backups regularly, or else don’t have to because you’ve automated the process.
However, sometimes the worst happens. A file gets deleted by mistake, a filesystem becomes corrupt, or a partition gets lost, and for whatever reason, the backups don’t contain what you need.
As we discussed in How to prevent and recover from accidental file deletion in Linux, before trying to recover lost data, you must find out why the data is missing in the first place. It’s possible that a user has simply misplaced the file, or that there is a backup that the user isn’t aware of. But if a user has indeed removed a file with no backups, then you know you need to recover a deleted file. If a partition table has become scrambled, though, then the files aren’t really lost at all, and you might want to consider using TestDisk to recover the partition table, or the partition itself.
What happens if your file or partition recovery isn’t successful, or is only partially successful? Then it’s time for Scalpel. Scalpel performs file carving operations based on patterns describing unique file types. It looks for these patterns based on binary strings and regular expressions, and then extracts the file accordingly.
This tool isn’t currently being maintained, but it’s ever-reliable, compiling and running exactly as expected. If you’re running Red Hat Enterprise Linux (RHEL) 7, RHEL 8, or Fedora, you can download Scalpel’s RPM installers, along with its dependency, libtre, from klaatu.fedorapeople.org.
Starting with Scalpel
Scalpel comes bundled with a comprehensive list of file types and their most unique identifying features. Sometimes, a file can be identified by predictable text at its head and tail:
htm n 50000 <html </html>
While at other times, cryptic-looking hex codes are necessary:
jpg y 200000000 \xff\xd8\xff\xe0\x00\x10 \xff\xd9
Scalpel expects you to duplicate /etc/scalpel.conf and edit your copy to include the file types you hope to recover, and to exclude the file types you know you don’t need. For instance, if you know you don’t have or care about .fws files, then comment that line out of the file. Doing this can speed up the recovery process and reduce false positives.
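As a rough sketch of that step, the following shell function comments out one file type in your copy of the config. The function name disable_type and the file name my-scalpel.conf are illustrative assumptions, and the function assumes that each definition starts with its extension as the first whitespace-separated field, as in the examples above.

```shell
# Hypothetical helper: comment out every definition for one extension
# in a copy of the Scalpel config. Assumes the extension is the first field.
disable_type() {  # usage: disable_type CONFIG EXTENSION
    conf=$1
    ext=$2
    tmp=$(mktemp) || return 1
    # Prefix matching lines with "#"; pass every other line through unchanged.
    awk -v ext="$ext" '$1 == ext { print "#" $0; next } { print }' "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example usage (paths are assumptions):
#   cp /etc/scalpel.conf ~/my-scalpel.conf
#   disable_type ~/my-scalpel.conf fws
```

Always work on a copy, so that the original /etc/scalpel.conf remains available as a reference.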
In the configuration file, the format of a file definition is, from left to right:
- The file’s extension.
- Whether the header and footer are case sensitive (y or n).
- The minimum and maximum file size you want Scalpel to find.
- A standard header that identifies the beginning of the file.
- A standard footer that identifies the end of the file.
The footer field is optional. If no footer is provided, then Scalpel extracts the number of bytes you set as the file type’s maximum value.
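To make the field order concrete, here is a minimal sketch that splits one definition line into the fields listed above. The function name describe_rule is an assumption made for illustration, and the sketch assumes the fields are separated by whitespace, as in the two sample lines shown earlier.

```shell
# Hypothetical helper: label the fields of one Scalpel definition line.
describe_rule() {  # usage: describe_rule 'LINE'
    # printf '%s' passes backslashes through untouched, so \x.. stays literal.
    printf '%s\n' "$1" | awk '{
        printf "extension=%s\n", $1
        printf "case_sensitive=%s\n", $2
        printf "size=%s\n", $3
        printf "header=%s\n", $4
        # The footer is optional, so report its absence explicitly.
        printf "footer=%s\n", (NF >= 5 ? $5 : "(none)")
    }'
}

# Example usage:
#   describe_rule 'htm n 50000 <html </html>'
```

Running a few lines from your own config through a helper like this is a quick way to catch a definition with a missing or misplaced field.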
You might find that a recovery effort rescues only part of a file, such as a JPG that is mostly, but not entirely, recovered.
This result means that you probably need to increase the maximum size in that file type’s definition, and then re-scan, so that the end of the file can be recovered, too.
Defining new file types
First, make a copy of the Scalpel configuration file. If all your users generate similar data, then you may only need one config file for your entire organization. Or, you might find it better to have one config file per department.
To add your own file types to a Scalpel config, start with some investigative forensics.
For text files, you ideally have some predictable structure you can anticipate. For instance, an XML file probably starts with <?xml and ends with the closing tag of its root element. Binary files are similarly predictable. Using the hexdump command, you can view a typical header from the file type you want to define. Here are the results for an XCF, the default layered graphic file from GIMP:
$ head --bytes 8 example.xcf | hexdump --canonical
00000000  67 69 6d 70 20 78 63 66    |gimp xcf|
00000008
This output is from a Red Hat Enterprise Linux 8 system. On older systems, the short -C option may be necessary in place of --canonical:
$ head --bytes 8 example.xcf | hexdump -C
00000000  67 69 6d 70 20 78 63 66    |gimp xcf|
00000008
The canonical output of hexdump displays the address in the far left column, and the decoded values on the far right. The center column shows the first 8 bytes of the XCF file as hexadecimal values.
Most binary files in /etc/scalpel.conf look pretty similar to that output, except that these values are prefaced with the \x escape sequence to denote that the numbers are actually hexadecimal digits. For instance, a JPG file looks like this in the configuration file:
jpg y 200000000 \xff\xd8\xff\xe0\x00\x10 \xff\xd9
Compare that value with a test hexdump of the first 6 bytes (because that’s how many bytes scalpel.conf contains in its JPG definition) of any JPG file on your system:
$ head --bytes 6 example.jpg | hexdump --canonical
00000000  ff d8 ff e0 00 10    |......|
00000006
Compare the footer with the last 2 bytes to match what the config file shows:
$ tail --bytes -2 example.jpg | hexdump --canonical
00000000  ff d9    |..|
00000002
These values match up, so you can be confident that valid JPG files probably all start and end in a predictable sequence.
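The same comparison can be scripted. The sketch below checks a file against the exact header and footer bytes from the JPG line shown above. The function names are hypothetical, and it uses od rather than hexdump only because od’s output is easier to strip down to bare hex digits.

```shell
# Hypothetical helpers: print the first or last N bytes as bare hex digits.
first_hex() { head -c "$2" "$1" | od -An -v -tx1 | tr -d ' \n'; }
last_hex()  { tail -c "$2" "$1" | od -An -v -tx1 | tr -d ' \n'; }

# Succeeds only if the file matches the header and footer from the
# sample config line: \xff\xd8\xff\xe0\x00\x10 ... \xff\xd9
looks_like_jpg() {  # usage: looks_like_jpg FILE
    [ "$(first_hex "$1" 6)" = ffd8ffe00010 ] &&
        [ "$(last_hex "$1" 2)" = ffd9 ]
}

# Example usage:
#   looks_like_jpg example.jpg && echo "matches the config pattern"
```

A file that fails this check is not necessarily invalid. JPG files written by some cameras and applications begin with a different byte sequence after ff d8 ff, so a failure may mean that the definition needs a variant rather than that the file is damaged.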
Note: The Ogg entry in the scalpel.conf file is misleading, as it lacks the \x escape sequence. If you need to recover an Ogg file, fix this, or replace its definition.
Getting to work
Now you need to obtain the same level of confidence for every file type you need to recover (such as XCF, in the previous example). To reiterate, this is your workflow for defining the binary file types common to the victim drive:
- Get the hexadecimal values of the first few bytes of a file type using the head --bytes n command.
- Get the last few bytes using the tail --bytes -n command.
- Repeat this process on several different files of the same type to confirm consistency of this pattern, adjusting the length of your header and footer patterns as required.
- Enter the header and footer values into your custom Scalpel config, using the \x notation to identify each byte as a hexadecimal character.
Follow this sequence for each important binary file type you need to recover.
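The workflow above can be sketched as a few shell functions that print bytes already formatted in \x notation, ready to paste into a config. The function names hex_header, hex_footer, and survey_headers are assumptions made for this example, and od stands in for hexdump because its output is simpler to reformat.

```shell
# Hypothetical helper: turn a stream of bytes into \x.. notation.
to_x_notation() {
    hex=$(od -An -v -tx1 | tr -d ' \n')
    printf '%s\n' "$hex" | sed 's/../\\x&/g'
}

# Print the first N bytes of FILE in \x notation.
hex_header() { head -c "$2" "$1" | to_x_notation; }  # usage: hex_header FILE N

# Print the last N bytes of FILE in \x notation.
hex_footer() { tail -c "$2" "$1" | to_x_notation; }  # usage: hex_footer FILE N

# Print the N-byte header of several files, one per line, for comparison.
survey_headers() {  # usage: survey_headers N FILE...
    n=$1
    shift
    for f in "$@"; do
        printf '%s  %s\n' "$(hex_header "$f" "$n")" "$f"
    done
}

# Example usage:
#   survey_headers 8 ~/Pictures/*.xcf
```

If every line of the survey starts with the same sequence, that sequence is a reasonable header candidate. If the lines diverge after the first few bytes, shorten the pattern to the part they have in common.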
If a file is plaintext, provide a common header and footer, such as #!/bin/sh for shell scripts, # (the space after the # is important) for markdown files with an h1 level title, <?xml for XML files, and so on.
When you’re ready to run Scalpel, create a directory where it can place your rescued files:
$ mkdir /run/media/seth/rescuer/scalped
Note: Do not create this directory on the same volume that contains the lost data.
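One way to guard against that mistake is to compare the mount points that df reports for the two locations. This is only a sketch: the function name same_volume is hypothetical, both paths must already exist, and mount points that contain spaces are not handled.

```shell
# Hypothetical helper: succeed if both paths live on the same mounted filesystem.
same_volume() {  # usage: same_volume PATH1 PATH2
    # df -P prints one header line, then one line whose last field is the mount point.
    a=$(df -P "$1" | awk 'NR == 2 { print $NF }')
    b=$(df -P "$2" | awk 'NR == 2 { print $NF }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example usage with the paths from this article:
#   if same_volume /run/media/seth/rescuer/scalped /run/media/seth/victim; then
#       echo "Choose a rescue directory on a different drive." >&2
#   fi
```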
If the victim drive is not yet mounted, mount it, and then run Scalpel:
$ scalpel -c my-scalpel.conf \
-o /run/media/seth/rescuer/scalped \
/run/media/seth/victim
You can also run Scalpel on a disk image:
$ scalpel -c my-scalpel.conf \
-o ~/scalped ~/victim.img
When Scalpel is done, review the files in your designated rescue directory.
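For a large recovery, a quick tally by file extension can help you decide where to start reviewing. The following sketch makes no assumptions about how Scalpel arranges its output beyond files having extensions. The function name summarize_rescued is hypothetical.

```shell
# Hypothetical helper: count the files under a directory by extension.
summarize_rescued() {  # usage: summarize_rescued DIRECTORY
    find "$1" -type f -name '*.*' |
        sed 's/.*\.//' |    # keep only the text after the last dot
        sort | uniq -c | sort -rn
}

# Example usage:
#   summarize_rescued /run/media/seth/rescuer/scalped
```

A count that is far higher than you expect for one type can be a sign of false positives, which may mean that type’s header pattern is too short.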
All in all, it’s best to make backups so you can avoid doing file recovery at all. But, should the worst happen, try Scalpel and carve carefully.