
Re: file locking...



On Sun, Mar 01, 2009 at 09:54:59AM -0800, bruce wrote:
> 
> hi bruno.
> 
> for my situation. i have a bunch of files being created by an upfront
> process, and on the backend, i have a number of client/child processes that
> get created, which have to operate/process the files. no file is processed
> by more than a single client app.
....
> i can easily setup a file read/write lock process where a client app
> gets/locks a file, and then copies/moves the required files from the initial
> dir to a tmp dir. after the move/copy, the lock is released, and the client
> can go ahead and do whatever with the files in the tmp dir.. the process
> allows multiple clients to operate in a pseudo parallel manner...
> 
> i'm trying to figure out if there's a much better/faster approach that might
> be available.. which is where the academic/research issue was raised..
> 
> the issue that i'm looking at is analogous to a FIFO, where i have lots of
> files being shoved in a dir from different processes.. on the other end, i
> want to allow multiple client processes to access unique groups of these
> files as fast as possible.. access being fetch/gather/process/delete the
> files. each file is only handled by a single client process.

You should benchmark some strategies.

The standard flock(), lockf(), stat(), and fcntl() calls and friends
should do the trick.  You do not want to do a copy.
Locking over NFS is 'interesting', so look for a 'lock free'
strategy if you expect this to run on NFS file systems now or
in the future.
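As a sketch of the flock() approach (Python for brevity; the helper name
is made up), each client tries a non-blocking exclusive lock and simply
skips files some other client already holds:

```python
import fcntl
import os
import tempfile

def try_claim(path):
    """Open a file and try a non-blocking exclusive flock().
    Returns the open fd on success, or None if the lock is held
    elsewhere.  The lock is released automatically when the fd
    is closed (e.g. if the client crashes)."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # we own this file now
    except BlockingIOError:
        os.close(fd)
        return None  # another claimant got there first

# Demo: two claims race for one file; only the first wins.  (Each
# os.open() is an independent open file description, so on Linux the
# second flock() fails even within a single process.)
path = os.path.join(tempfile.mkdtemp(), "job.dat")
with open(path, "w") as f:
    f.write("payload")
first = try_claim(path)
second = try_claim(path)
print(first is not None, second is None)  # → True True
os.close(first)
```

Note that flock() is advisory: it only works if every client uses it.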

My simple-minded solution is to have the creation process or a dispatcher
process move files from the front-end input directory into a modest set of
directories for the back-end processes to pick from.  The number of
back-end processes and dirs can be tuned to match the number of processors
and the IO subsystem's performance.
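A minimal sketch of that dispatcher (directory names invented for the
example), spreading incoming files round-robin across a fixed set of
back-end directories with rename():

```python
import os
import tempfile

def dispatch(incoming, worker_dirs):
    """Move every file in `incoming` into the worker dirs, round-robin.
    os.rename() on the same device is a cheap metadata-only move."""
    for i, name in enumerate(sorted(os.listdir(incoming))):
        dest = worker_dirs[i % len(worker_dirs)]
        os.rename(os.path.join(incoming, name), os.path.join(dest, name))

# Demo with 4 files and 2 worker directories.
root = tempfile.mkdtemp()
incoming = os.path.join(root, "incoming")
workers = [os.path.join(root, "worker%d" % n) for n in range(2)]
os.mkdir(incoming)
for d in workers:
    os.mkdir(d)
for n in range(4):
    open(os.path.join(incoming, "f%d" % n), "w").close()
dispatch(incoming, workers)
counts = [len(os.listdir(d)) for d in workers]
print(counts)  # → [2, 2]
```

Each back-end process then owns one directory outright, so no locking
is needed on the consuming side at all.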

Renaming a file on the same device is a "quick" metadata transaction
that does not risk data loss.  The renaming can also be done by the
input process at the point that the file creation is finished.  It may
be necessary to do this step to ensure that a back-end process does not
open a file before it is ready for processing.
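One way to sketch that publish-when-finished step (the ".part" suffix
is just a convention invented here, not anything standard): the writer
creates the file under a name consumers ignore, then rename()s it into
place only once the data is complete:

```python
import os
import tempfile

def publish(directory, name, data):
    """Write under a '.part' name, then atomically rename to the final
    name, so a reader can never open a half-written file."""
    part = os.path.join(directory, name + ".part")
    final = os.path.join(directory, name)
    with open(part, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data on disk before it becomes visible
    os.rename(part, final)    # atomic within one file system

d = tempfile.mkdtemp()
publish(d, "batch1", "all the records")
visible = [n for n in os.listdir(d) if not n.endswith(".part")]
print(visible)  # → ['batch1']
```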

Depending on the complexity of the activity you may need to resort to
some of the tricks used by sendmail or postfix.  For example, how do
you know that a data file has been processed: not at all, completely, or
incompletely?  And if it matters, should it get processed twice?
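One common trick in that family (the directory layout below is invented
for illustration, not what sendmail or postfix actually do): encode the
processing state in which directory a file sits in, and let rename()
itself be the claim, since exactly one renamer can win:

```python
import os
import tempfile

def claim(incoming, work, name):
    """Try to move a file from incoming/ to work/.  rename() succeeds
    for exactly one caller; later callers get FileNotFoundError, so no
    file can ever be claimed by two workers."""
    try:
        os.rename(os.path.join(incoming, name), os.path.join(work, name))
        return True
    except FileNotFoundError:
        return False

root = tempfile.mkdtemp()
incoming, work = os.path.join(root, "incoming"), os.path.join(root, "work")
os.mkdir(incoming)
os.mkdir(work)
open(os.path.join(incoming, "msg1"), "w").close()
wins = [claim(incoming, work, "msg1") for _ in range(2)]
print(wins)  # → [True, False]
```

A crashed worker leaves its file sitting in work/, which a sweeper can
detect and re-queue; whether re-processing is safe is exactly the
"processed twice" question above.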

Do consider the use of mmap(), as it can help you limit the pollution of
the page cache by a one-time read activity.  With mmap(), revisit the
NFS topic.
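A minimal mmap() read sketch (whether this actually reduces page-cache
pressure depends on the kernel and on madvise() hints, so treat it as
one of the strategies to benchmark, per the advice above):

```python
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"header:" + b"x" * 1024)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Hint that we will read sequentially, once (madvise() on mmap
        # objects needs Python 3.8+ and a platform that supports it).
        if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            m.madvise(mmap.MADV_SEQUENTIAL)
        header = m[:7]
print(header)  # → b'header:'
```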

The number of files in any one directory can be important.   Establish
some design limits.  Too many files in any one dir can bog down the
file system.  Also give consideration to the names of the files.
Stuff like sorting file names can be important,
i.e. 2009.28.07 vs. 2009.07.28 vs. 09.7.28 vs. 09.11.11.
Also, dots and dashes (".", "-") are shell or regexp metacharacters,
so this might be a better list to start from,
i.e. 2009_28_07 vs. 2009_07_28 vs. 09_7_28 vs. 09_11_11.
Letters and hex may play well... but design for backup, easy
system admin and general simplicity including documentation
(some characters need care in documentation, e.g. ">" must be
written as "&gt;" in HTML documents; see also TeX, LaTeX, SGML, XML).
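The sorting point is easy to demonstrate with the examples above:
zero-padded, year-major names sort correctly as plain strings, while an
unpadded month field puts November before July:

```python
# Zero-padded, year-major names: lexical order == chronological order.
good = ["2009_07_28", "2009_11_11", "2010_01_05"]
assert sorted(good) == good

# Unpadded month field breaks it: as strings, "09_11_11" (Nov 11)
# sorts before "09_7_28" (Jul 28) because '1' < '7'.
bad = ["09_7_28", "09_11_11"]
print(sorted(bad))  # → ['09_11_11', '09_7_28']
```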

Predictable file names have been central to some security risks, so
depending on the data value and risk profile, some attention may need
to be given to the creation steps.

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?
