Channel: Fast way to get checksum for all files within a huge nested directory - Unix & Linux Stack Exchange

Fast way to get checksum for all files within a huge nested directory


We have a requirement to screen user-uploaded content. However, I've noticed that most of our user-uploaded content actually originated from our own system: for example, someone downloads a PDF from our document library, renames it to suit their needs, and re-uploads it into their "custom content" section, which can be shared with other users.

I'd like to mark these files as trusted without someone having to actually look at them, and I thought I could do this using file size and some kind of checksum, e.g.:

  • for a given new file
    • find all files in our resource library folder with the same file extension and the same file size
    • for all the ones with the same extension and size, do some kind of checksum comparison
    • if we find a match, declare the new file as trusted
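The steps above could be sketched roughly as follows. This is a minimal demo run against throwaway temp directories so it is self-contained; all paths and file contents are made up, and it assumes GNU `stat`/`find` (with a BSD `stat` fallback noted in a comment). For a one-off check of same-size files, a byte-for-byte `cmp` is as cheap as a checksum:

```shell
#!/bin/sh
# Demo of the match-by-extension-size-then-compare idea, using
# throwaway temp dirs (all paths here are made up for the demo).
LIB=$(mktemp -d)    # stands in for the resource library
UP=$(mktemp -d)     # stands in for the upload area
printf 'hello pdf bytes' > "$LIB/report.pdf"
printf 'hello pdf bytes' > "$UP/upload.pdf"   # re-uploaded copy, renamed

NEW="$UP/upload.pdf"
size=$(stat -c %s "$NEW" 2>/dev/null || stat -f %z "$NEW")  # GNU stat, BSD fallback
ext="${NEW##*.}"

# Candidates: same extension and exactly the same byte size; then a
# byte-for-byte compare (cmp) -- find prints the path only if cmp matched.
match=$(find "$LIB" -type f -name "*.$ext" -size "${size}c" \
             -exec cmp -s "$NEW" {} \; -print)
[ -n "$match" ] && echo "trusted: matches $match"
rm -rf "$LIB" "$UP"
```

The `-size "${size}c"` filter (`c` = bytes) prunes most of the tree before any file content is read, so `cmp` only runs on exact-size candidates.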

Now, our resource library directory is 132 GB - quite large. So any solution that involves reading every file in there (even every file with the same extension) is going to be quite slow.

It seems like the sensible thing to do is keep some kind of database (not necessarily a literal DBMS) of file checksums, either updated automatically when the library changes or simply rebuilt by a scheduled job once a day. Then, for any given new file, I can compute its checksum and look it up in the database.
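A minimal sketch of that index, assuming GNU coreutils (`sha256sum`) and again using throwaway temp paths invented for the demo. A cron job could regenerate the index file once a day; lookup is then one checksum plus one search, with no scan of the 132 GB tree:

```shell
#!/bin/sh
# Build a flat-file checksum index of the library, then look up a new
# upload in it. Paths and the index file are made up for the demo.
LIB=$(mktemp -d)    # stands in for the resource library
printf 'alpha' > "$LIB/a.pdf"
printf 'beta'  > "$LIB/b.pdf"

INDEX=$(mktemp)
# One pass over the library: "<sha256>  <path>" per line, sorted so a
# later lookup could use `look`/binary search instead of a full grep.
find "$LIB" -type f -exec sha256sum {} + | sort > "$INDEX"

# Lookup for a new upload: checksum it and search the index.
UP=$(mktemp -d)
printf 'alpha' > "$UP/upload.pdf"     # same bytes as a.pdf, different name
sum=$(sha256sum "$UP/upload.pdf" | cut -d' ' -f1)
if grep -q "^$sum " "$INDEX"; then verdict=trusted; else verdict="needs review"; fi
echo "$verdict"
rm -rf "$LIB" "$UP" "$INDEX"
```

Storing size alongside the hash (e.g. from `stat -c %s`) would let you skip hashing new files whose size appears nowhere in the index at all.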

This feels like it must be a solved problem. Does anyone have any ideas?

thanks, Max

