[dm-devel] RFC: dm-switch target [v2]

Pasi Kärkkäinen pasik at iki.fi
Thu Sep 1 21:09:30 UTC 2011


On Wed, Aug 31, 2011 at 04:19:24PM -0400, Jim Ramsay wrote:
> Note: This is a repost with cleaned-up code, which was originally posted
> by Jason Shamberger.
>   http://www.redhat.com/archives/dm-devel/2011-March/msg00131.html
> 
> - The license in the headers has been cleared up - This is and always
>   has been GPL code.
> - Code formatting and style more closely match the Linux Kernel.
> 
> ---------------------------
> 
> We propose a new DM target, dm-switch, which can be used to efficiently
> implement a mapping of IOs to underlying block devices in scenarios
> where there are: (1) a large number of address regions, (2) a fixed size
> of these address regions, (3) no pattern that allows for a compact
> description with something like the dm-stripe target.
> 

Great, I've been waiting for this module :)

Do you guys have some userland tool/script to populate the page table
in dm-switch, or is there some other way to test this (with eql storage)?

Thanks!

-- Pasi


> Motivation:
> 
> Dell EqualLogic and some other iSCSI storage arrays use a distributed
> frameless architecture.  In this architecture, the storage group
> consists of a number of distinct storage arrays ("members"), each having
> independent controllers, disk storage and network adapters.  When a LUN
> is created it is spread across multiple members.  The details of the
> spreading are hidden from initiators connected to this storage system.
> The storage group exposes a single target discovery portal, no matter
> how many members are being used.  When iSCSI sessions are created, each
> session is connected to an eth port on a single member.  Data to a LUN
> can be sent on any iSCSI session, and if the blocks being accessed are
> stored on another member the IO will be forwarded as required.  This
> forwarding is invisible to the initiator.  The storage layout is also
> dynamic, and the blocks stored on disk may be moved from member to
> member as needed to balance the load.
> 
> This architecture simplifies the management and configuration of both
> the storage group and initiators.  In a multipathing configuration, it
> is possible to set up multiple iSCSI sessions to use multiple network
> interfaces on both the host and target to take advantage of the
> increased network bandwidth.  An initiator can use a simple round robin
> algorithm to send IO on all paths and let the storage array members
> forward it as necessary.  However, there is a performance advantage to
> sending data directly to the correct member.  The Device Mapper table
> architecture supports designating different address regions with
> different targets.  However, in our architecture the LUN is spread with
> a chunk size on the order of 10s of MBs, which means the resulting DM
> table could have more than a million entries, which consumes too much
> memory.
> 
> Solution:
> 
> Based on earlier discussion with the dm-devel contributors, we have
> solved this problem by using Device Mapper to build a two-layer device
> hierarchy:
> 
>     Upper Tier: Determine which array member the IO should be sent to.
>     Lower Tier: Load balance amongst paths to a particular member.
> 
> The lower tier consists of a single multipath device for each member.
> Each of these multipath devices contains the set of paths directly to
> the array member in one priority group, and leverages existing path
> selectors to load balance amongst these paths.  We also build a
> non-preferred priority group containing paths to other array members for
> failover reasons.
> 
> The upper tier consists of a single switch device, using the new DM
> target module proposed here.  This device uses a bitmap to look up the
> location of the IO and choose the appropriate lower tier device to route
> the IO.  By using a bitmap we are able to use 4 bits for each address
> range in a 16 member group (which is very large for us).  This is a much
> denser representation than the DM table B-tree can achieve.
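> 
> (For a rough sense of scale: a 10 TiB LUN split into 10 MiB regions has
> roughly a million regions; at 4 bits per entry the whole lookup table is
> about 512 KiB, whereas a conventional DM table with one line per region
> would need over a million entries.)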
> 
> Though we have developed this target for a specific storage device, we
> have made an effort to keep it as general-purpose as possible in hopes
> that others may benefit.  We welcome any feedback on the design or
> implementation.
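> 
> For illustration, the table line for a hypothetical three-member group
> (made-up device names, 40 MiB pages = 81920 sectors, 100 GiB volume)
> follows the constructor arguments <dev_count> <page_size> <dev_path>
> <offset> ...:
> 
>   dmsetup create eql-vol --table \
>     "0 209715200 switch 3 81920 /dev/mapper/mpatha 0 /dev/mapper/mpathb 0 /dev/mapper/mpathc 0"
> 
> The page table itself is then uploaded separately over the Generic
> Netlink socket defined in dm-switch.h below.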
> 
> --- dm-switch.h ---
> 
> /*
>  * Copyright (c) 2010-2011 by Dell, Inc.  All rights reserved.
>  *
>  * This file is released under the GPL.
>  *
>  * Description:
>  *
>  *     file:    dm-switch.h
>  *     authors: Kevin_OKelley at dell.com
>  *              Jim_Ramsay at dell.com
>  *              Narendran_Ganapathy at dell.com
>  *
>  * This file contains the netlink message definitions for the "switch" target.
>  *
>  * The only defined message at this time is for uploading the mapping page
>  * table.
>  */
> 
> #ifndef __DM_SWITCH_H
> #define __DM_SWITCH_H
> 
> #define MAX_IPC_MSG_LEN 65480	/* dictated by netlink socket */
> #define MAX_ERR_STR_LEN 255	/* maximum length of the error string */
> 
> enum Opcode {
> 	OPCODE_PAGE_TABLE_UPLOAD = 1,
> };
> 
> /*
>  * IPC Page Table message
>  */
> struct IpcPgTable {
> 	uint32_t total_len;	/* Total length of this IPC message */
> 	enum Opcode opcode;
> 	uint32_t userland[2];	/* Userland optional data (dmsetup status) */
> 	uint32_t dev_major;	/* DM device major */
> 	uint32_t dev_minor;	/* DM device minor */
> 	uint32_t page_total;	/* Total pages in the volume */
> 	uint32_t page_offset;	/* Starting page offset for this IPC */
> 	uint32_t page_count;	/* Number of page table entries in this IPC */
> 	uint32_t page_size;	/* Page size in 512B sectors */
> 	uint16_t dev_count;	/* Number of devices */
> 	uint8_t pte_bits;	/* Page Table Entry field size in bits */
> 	uint8_t reserved;	/* Integer alignment  */
> 	uint8_t ptbl_buff[1];	/* Page table entries (variable length) */
> };
> 
> /*
>  * IPC Response message
>  */
> struct IpcResponse {
> 	uint32_t total_len;	/* total length of the IPC */
> 	enum Opcode opcode;
> 	uint32_t userland[2];	/* Userland optional data */
> 	uint32_t dev_major;	/* DM device major */
> 	uint32_t dev_minor;	/* DM device minor */
> 	uint32_t status;	/* 0 on success; errno on failure */
> 	char err_str[MAX_ERR_STR_LEN + 1];
> 	/* If status != 0, contains an informative error message */
> };
> 
> /* Generic Netlink family attributes: used to define the family */
> enum {
> 	NETLINK_ATTR_UNSPEC,
> 	NETLINK_ATTR_MSG,
> 	NETLINK_ATTR__MAX,
> };
> #define NETLINK_ATTR_MAX (NETLINK_ATTR__MAX - 1)
> 
> /* Netlink commands (operations) */
> enum {
> 	NETLINK_CMD_UNSPEC,
> 	NETLINK_CMD_GET_PAGE_TBL,
> 	NETLINK_CMD__MAX,
> };
> #define NETLINK_CMD_MAX (NETLINK_CMD__MAX - 1)
> 
> #endif /* __DM_SWITCH_H */
> 
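> For a sense of how these definitions are meant to be used: a userland
> uploader resolves the "DM_SWITCH" Generic Netlink family, packs an
> IpcPgTable into a single NETLINK_ATTR_MSG attribute and sends it with
> command NETLINK_CMD_GET_PAGE_TBL.  Below is a minimal sketch using
> libnl-3 (illustrative only, not a supported tool; the function name,
> page size, device count and 4-bit entry width are made-up example
> values, the table is assumed to fit in one message, and error handling
> and the response read-back are omitted):
> 
> #include <stdint.h>
> #include <stdlib.h>
> #include <string.h>
> #include <netlink/netlink.h>
> #include <netlink/genl/genl.h>
> #include <netlink/genl/ctrl.h>
> #include "dm-switch.h"
> 
> /* Pack and send one page table upload; 'ptes' holds the packed entries. */
> static int send_page_table(uint32_t major, uint32_t minor,
> 			   const void *ptes, uint32_t page_count)
> {
> 	struct nl_sock *sock = nl_socket_alloc();
> 	struct nl_msg *msg = nlmsg_alloc();
> 	struct IpcPgTable *pgt;
> 	/* 4-bit entries: 8 per uint32_t, rounded up (matches the kernel's size check) */
> 	size_t tbl_bytes = ((page_count + 7) / 8) * sizeof(uint32_t);
> 	size_t len = sizeof(*pgt) - 1 + tbl_bytes;
> 	int family, rc;
> 
> 	genl_connect(sock);
> 	family = genl_ctrl_resolve(sock, "DM_SWITCH");
> 
> 	pgt = calloc(1, len);
> 	pgt->total_len = len;
> 	pgt->opcode = OPCODE_PAGE_TABLE_UPLOAD;
> 	pgt->dev_major = major;		/* the dm switch device itself */
> 	pgt->dev_minor = minor;
> 	pgt->page_total = page_count;
> 	pgt->page_offset = 0;		/* full table rebuild */
> 	pgt->page_count = page_count;
> 	pgt->page_size = 81920;		/* 40 MiB pages, in 512B sectors */
> 	pgt->dev_count = 3;
> 	pgt->pte_bits = 4;
> 	memcpy(pgt->ptbl_buff, ptes, tbl_bytes);
> 
> 	/* The whole IpcPgTable rides in one binary NETLINK_ATTR_MSG attribute */
> 	genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
> 		    NETLINK_CMD_GET_PAGE_TBL, 1);
> 	nla_put(msg, NETLINK_ATTR_MSG, len, pgt);
> 	rc = nl_send_auto(sock, msg);
> 
> 	free(pgt);
> 	nlmsg_free(msg);
> 	nl_socket_free(sock);
> 	return rc;
> }
> 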
> --- dm-switch.c ---
> 
> /*
>  * Copyright (c) 2010-2011 by Dell, Inc.  All rights reserved.
>  *
>  * This file is released under the GPL.
>  *
>  * Description:
>  *
>  *     file:    dm-switch.c
>  *     authors: Kevin_OKelley at dell.com
>  *              Jim_Ramsay at dell.com
>  *              Narendran_Ganapathy at dell.com
>  *
>  * This file implements a "switch" target which efficiently implements a
>  * mapping of IOs to underlying block devices in scenarios where there are:
>  *   (1) a large number of address regions
>  *   (2) a fixed size equal across all address regions
>  *   (3) no pattern that allows for a compact description with something like
>  *       the dm-stripe target.
>  */
> 
> #include <linux/module.h>
> #include <linux/init.h>
> #include <linux/blkdev.h>
> #include <linux/bio.h>
> #include <linux/slab.h>
> #include <linux/device.h>
> #include <linux/version.h>
> #include <linux/dm-ioctl.h>
> #include <linux/device-mapper.h>
> #include <net/genetlink.h>
> #include <asm/div64.h>
> 
> #include "dm-switch.h"
> #define DM_MSG_PREFIX "switch"
> MODULE_DESCRIPTION(DM_NAME
> 		   " fixed-size address-region-mapping throughput-oriented path selector");
> MODULE_AUTHOR("Kevin D. O'Kelley <Kevin_OKelley at dell.com>");
> MODULE_LICENSE("GPL");
> 
> #if defined(DEBUG) || defined(_DEBUG)
> #define DBGPRINT(...)  printk(KERN_DEBUG __VA_ARGS__)
> #define DBGPRINTV(...)
> /* #define DEBUG_HEXDUMP 1 */
> #else
> #define DBGPRINT(...)
> #define DBGPRINTV(...)
> #endif
> 
> /*
>  * Switch device context block: A new one is created for each dm device.
>  * Contains an array of devices from which we have taken references.
>  */
> struct switch_dev {
> 	struct dm_dev *dmdev;
> 	sector_t start;
> 	atomic_t error_count;
> };
> 
> /* Switch page table block */
> struct switch_ptbl {
> 	uint32_t pte_bits;	/* Page Table Entry field size in bits */
> 	uint32_t pte_mask;	/* Page Table Entry field mask */
> 	uint32_t pte_fields;	/* Number of Page Table Entries per uint32_t */
> 	uint32_t ptbl_bytes;	/* Page table size in bytes */
> 	uint32_t ptbl_num;	/* Page table size in entries */
> 	uint32_t ptbl_max;	/* Page table maximum size in entries; */
> 	uint32_t ptbl_buff[0];	/* Address of page table */
> };
> 
> /* Switch context header */
> struct switch_ctx {
> 	struct list_head list;
> 	dev_t dev_this;		/* Device serviced by this target */
> 	uint32_t dev_count;	/* Number of devices */
> 	uint32_t page_size;	/* Page size in 512B sectors */
> 	uint32_t userland[2];	/* Userland optional data (dmsetup status) */
> 	uint64_t ios_remapped;	/* I/Os remapped */
> 	uint64_t ios_unmapped;	/* I/Os not remapped */
> 	spinlock_t spinlock;	/* Control access to counters */
> 
> 	struct switch_ptbl *ptbl;	/* Page table (if loaded) */
> 	struct switch_dev dev_list[0];
> 	/* Array of dm devices to switch between */
> };
> 
> /*
>  * Global variables
>  */
> LIST_HEAD(__g_context_list);	/* Linked list of context blocks */
> static spinlock_t __g_spinlock;	/* Control access to list of context blocks */
> 
> /* Limit check for the switch constructor */
> static int switch_ctr_limits(struct dm_target *ti, struct dm_dev *dm)
> {
> 	struct block_device *sd = dm->bdev;
> 	struct hd_struct *hd = sd->bd_part;
> 	if (hd != NULL) {
> 		DBGPRINT("%s sd=0x%p (%d:%d), hd=0x%p, start=%llu, "
> 			 "size=%llu\n", __func__, sd, MAJOR(sd->bd_dev),
> 			 MINOR(sd->bd_dev), hd,
> 			 (unsigned long long)hd->start_sect,
> 			 (unsigned long long)hd->nr_sects);
> 		if (ti->len <= hd->nr_sects)
> 			return true;
> 		ti->error = "Device too small for target";
> 		return false;
> 	}
> 	ti->error = "Missing device limits";
> 	printk(KERN_WARNING "%s %s\n", __func__, ti->error);
> 	return true;
> }
> 
> /*
>  * Constructor: Called each time a dmsetup command creates a dm device.  The
>  * target parameter will already have the table, type, begin and len fields
>  * filled in.  Arguments are in pairs: <dev_path> <offset>.  Since a
>  * constructor call is made for every dm device that uses this target, we
>  * keep a list of switch_ctx blocks so that page table uploads can be
>  * matched to the correct device.
>  */
> static int switch_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> {
> 	int n;
> 	uint32_t dev_count;
> 	unsigned long flags, major, minor;
> 	unsigned long long start;
> 	struct switch_ctx *pctx;
> 	struct mapped_device *md = NULL;
> 	struct dm_dev *dm;
> 	const char *dm_devname;
> 
> 	DBGPRINTV("%s\n", __func__);
> 	if (argc < 4) {
> 		ti->error = "Insufficient arguments";
> 		return -EINVAL;
> 	}
> 	if (kstrtou32(argv[0], 10, &dev_count) != 0) {
> 		ti->error = "Invalid device count";
> 		return -EINVAL;
> 	}
> 	if (dev_count != (argc - 2) / 2) {
> 		ti->error = "Invalid argument count";
> 		return -EINVAL;
> 	}
> 	pctx = kmalloc(sizeof(*pctx) + (dev_count * sizeof(struct switch_dev)),
> 		       GFP_KERNEL);
> 	if (pctx == NULL) {
> 		ti->error = "Cannot allocate redirect context";
> 		return -ENOMEM;
> 	}
> 	pctx->dev_count = dev_count;
> 	if ((kstrtou32(argv[1], 10, &pctx->page_size) != 0) ||
> 	    (pctx->page_size == 0)) {
> 		ti->error = "Invalid page size";
> 		goto failed_kfree;
> 	}
> 	pctx->ptbl = NULL;
> 	pctx->userland[0] = pctx->userland[1] = 0;
> 	pctx->ios_remapped = pctx->ios_unmapped = 0;
> 	spin_lock_init(&pctx->spinlock);
> 
> 	/*
> 	 * Find the device major and minor for the device that is being served
> 	 * by this target.
> 	 */
> 	md = dm_table_get_md(ti->table);
> 	if (md == NULL) {
> 		ti->error = "Cannot locate dm device";
> 		goto failed_kfree;
> 	}
> 	dm_devname = dm_device_name(md);
> 	if (dm_devname == NULL) {
> 		ti->error = "Cannot acquire dm device name";
> 		goto failed_kfree;
> 	}
> 	if (sscanf(dm_devname, "%lu:%lu", &major, &minor) != 2) {
> 		ti->error = "Invalid dm device name";
> 		goto failed_kfree;
> 	}
> 	pctx->dev_this = MKDEV(major, minor);
> 	DBGPRINT("%s ctx=0x%p (%d:%d), type=\"%s\", count=%d, "
> 		 "start=%llu, size=%llu\n",
> 		 __func__, pctx, MAJOR(pctx->dev_this),
> 		 MINOR(pctx->dev_this), ti->type->name, pctx->dev_count,
> 		 (unsigned long long)ti->begin, (unsigned long long)ti->len);
> 
> 	/*
> 	 * Check each device beneath the target to ensure that the limits are
> 	 * consistent.
> 	 */
> 	for (n = 0, argc = 2; n < pctx->dev_count; n++, argc += 2) {
> 		DBGPRINTV("%s #%d 0x%p, %s, %s\n", __func__, n,
> 			  &pctx->dev_list[n], argv[argc], argv[argc + 1]);
> 		if (sscanf(argv[argc + 1], "%llu", &start) != 1) {
> 			ti->error = "Invalid device starting offset";
> 			goto failed_dev_list_prev;
> 		}
> 		if (dm_get_device
> 		    (ti, argv[argc], dm_table_get_mode(ti->table), &dm)) {
> 			ti->error = "Device lookup failed";
> 			goto failed_dev_list_prev;
> 		}
> 		pctx->dev_list[n].dmdev = dm;
> 		pctx->dev_list[n].start = start;
> 		atomic_set(&(pctx->dev_list[n].error_count), 0);
> 		if (!switch_ctr_limits(ti, dm))
> 			goto failed_dev_list_all;
> 	}
> 
> 	spin_lock_irqsave(&__g_spinlock, flags);
> 	list_add_tail(&pctx->list, &__g_context_list);
> 	spin_unlock_irqrestore(&__g_spinlock, flags);
> 	ti->private = pctx;
> 	return 0;
> 
> failed_dev_list_prev:		/* De-reference previous devices */
> 	n--;			/*   (i.e. don't include this one) */
> 
> failed_dev_list_all:		/* De-reference all devices  */
> 	printk(KERN_WARNING "%s device=%s, start=%s\n", __func__,
> 	       argv[argc], argv[argc + 1]);
> 	for (; n >= 0; n--)
> 		dm_put_device(ti, pctx->dev_list[n].dmdev);
> 
> failed_kfree:
> 	printk(KERN_WARNING "%s %s\n", __func__, ti->error);
> 	kfree(pctx);
> 	return -EINVAL;
> }
> 
> /*
>  * Destructor: Don't free the dm_target, just the ti->private data (if any).
>  */
> static void switch_dtr(struct dm_target *ti)
> {
> 	int n;
> 	unsigned long flags;
> 	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
> 	void *ptbl;
> 
> 	DBGPRINT("%s ctx=0x%p (%d:%d)\n", __func__, pctx,
> 		 MAJOR(pctx->dev_this), MINOR(pctx->dev_this));
> 	spin_lock_irqsave(&__g_spinlock, flags);
> 	ptbl = pctx->ptbl;
> 	rcu_assign_pointer(pctx->ptbl, NULL);
> 	list_del(&pctx->list);
> 	spin_unlock_irqrestore(&__g_spinlock, flags);
> 	for (n = 0; n < pctx->dev_count; n++) {
> 		DBGPRINTV("%s dm_put_device(%s)\n", __func__,
> 			  pctx->dev_list[n].dmdev->name);
> 		dm_put_device(ti, pctx->dev_list[n].dmdev);
> 	}
> 	synchronize_rcu();
> 	kfree(ptbl);
> 	kfree(pctx);
> }
> 
> /*
>  * NOTE: If CONFIG_LBD is disabled, sector_t types are uint32_t.  Therefore, in
>  * this routine, we convert the offset into a uint64_t instead of a sector_t so
>  * that all of the remaining arithmetic is correct, including the do_div()
>  * calls.
>  */
> static int switch_map(struct dm_target *ti, struct bio *bio,
> 		      union map_info *map_context)
> {
> 	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
> 	struct switch_ptbl *ptbl;
> 	unsigned long flags;
> 	uint64_t itbl, offset = bio->bi_sector - ti->begin;
> 	uint32_t idev = 0, irem;
> 	uint64_t *pinc = &pctx->ios_unmapped;
> 
> 	rcu_read_lock();
> 	ptbl = rcu_dereference(pctx->ptbl);
> 	if (ptbl != NULL) {
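> 		/*
> 		 * Page table lookup: itbl becomes the page index for this
> 		 * sector, then the index of the uint32_t word holding that
> 		 * entry; irem is the entry's position within the word, so
> 		 * the target device index falls out of a shift and mask.
> 		 */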
> 		itbl = offset;
> 		do_div(itbl, pctx->page_size);
> 		if (itbl < ptbl->ptbl_num) {
> 			irem = do_div(itbl, ptbl->pte_fields);
> 			idev = (ptbl->ptbl_buff[itbl] >>
> 				(irem * ptbl->pte_bits)) & ptbl->pte_mask;
> 			if (idev < pctx->dev_count) {
> 				pinc = &pctx->ios_remapped;
> 			} else {
> 				printk(KERN_WARNING "%s WARNING: dev=%d, "
> 				       "offset=%lld\n", __func__, idev, offset);
> 				idev = 0;
> 			}
> 		} else {
> 			printk(KERN_WARNING "%s WARNING: Page Table Entry "
> 			       "%lld >= %d\n", __func__, itbl, ptbl->ptbl_num);
> 		}
> 	}
> 	rcu_read_unlock();
> 	spin_lock_irqsave(&pctx->spinlock, flags);
> 	(*pinc)++;
> 	spin_unlock_irqrestore(&pctx->spinlock, flags);
> 	bio->bi_bdev = pctx->dev_list[idev].dmdev->bdev;
> 	bio->bi_sector = pctx->dev_list[idev].start + offset;
> 	return DM_MAPIO_REMAPPED;
> }
> 
> /*
>  * Switch status:
>  *
>  * INFO: #dev_count device [device] 5 'A'['A' ...] userland[0] userland[1]
>  *       #remapped #unmapped
>  *
>  * where:
>  *   "'A'['A']" is a single word with an 'A' (active) or 'D' for each device
>  *   The userland values are set by the last userland message to load the page
>  *   table
>  *   "#remapped" is the number of remapped I/Os
>  *   "#unmapped" is the number of I/Os that could not be remapped
>  *
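>  * A hypothetical INFO line for a two-device switch with a table loaded:
>  *   2 253:3 253:4 5 AA 00000000 00000000 1021 7
>  *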
>  * TABLE: #dev_count #page_size device start [device start ...]
>  */
> static int switch_status(struct dm_target *ti, status_type_t type, char
> 			 *result, unsigned int maxlen)
> {
> 	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
> 	char buffer[pctx->dev_count + 1];
> 	unsigned int sz = 0;
> 	int n;
> 	uint64_t remapped, unmapped;
> 	unsigned long flags;
> 
> 	result[0] = '\0';
> 	switch (type) {
> 	case STATUSTYPE_INFO:
> 		DMEMIT("%d", pctx->dev_count);
> 		for (n = 0; n < pctx->dev_count; n++) {
> 			DMEMIT(" %s", pctx->dev_list[n].dmdev->name);
> 			buffer[n] = 'A';
> 		}
> 		buffer[n] = '\0';
> 		spin_lock_irqsave(&pctx->spinlock, flags);
> 		remapped = pctx->ios_remapped;
> 		unmapped = pctx->ios_unmapped;
> 		spin_unlock_irqrestore(&pctx->spinlock, flags);
> 		DMEMIT(" 5 %s %08x %08x %lld %lld", buffer, pctx->userland[0],
> 		       pctx->userland[1], remapped, unmapped);
> 		break;
> 
> 	case STATUSTYPE_TABLE:
> 		DMEMIT("%d %d", pctx->dev_count, pctx->page_size);
> 		for (n = 0; n < pctx->dev_count; n++) {
> 			DMEMIT(" %s %llu", pctx->dev_list[n].dmdev->name,
> 			       (unsigned long long)pctx->dev_list[n].start);
> 		}
> 		break;
> 
> 	default:
> 		return 0;
> 	}
> 	return 0;
> }
> 
> /*
>  * Switch ioctl:
>  *
>  * Passthrough all ioctls to the first path.
>  */
> static int switch_ioctl(struct dm_target *ti, unsigned int cmd,
> 			unsigned long arg)
> {
> 	struct switch_ctx *pctx = (struct switch_ctx *)ti->private;
> 	struct block_device *bdev;
> 	fmode_t mode = 0;
> 
> 	/* Sanity check */
> 	if (unlikely(!pctx || !pctx->dev_list[0].dmdev ||
> 		     !pctx->dev_list[0].dmdev->bdev))
> 		return -EIO;
> 
> 	bdev = pctx->dev_list[0].dmdev->bdev;
> 	mode = pctx->dev_list[0].dmdev->mode;
> 	return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
> }
> 
> static struct target_type __g_switch_target = {
> 	.name = "switch",
> 	.version = {1, 0, 0},
> 	.module = THIS_MODULE,
> 	.ctr = switch_ctr,
> 	.dtr = switch_dtr,
> 	.map = switch_map,
> 	.status = switch_status,
> 	.ioctl = switch_ioctl,
> };
> 
> /* Generic Netlink attribute policy (single attribute, NETLINK_ATTR_MSG) */
> static struct nla_policy __g_attr_policy[NETLINK_ATTR_MAX + 1] = {
> 	[NETLINK_ATTR_MSG] = { .type = NLA_BINARY, .len = MAX_IPC_MSG_LEN },
> };
> 
> /* Define the Generic Netlink family */
> static struct genl_family __g_family = {
> 	.id = GENL_ID_GENERATE,	/* Assign channel when family is registered */
> 	.hdrsize = 0,
> 	.name = "DM_SWITCH",
> 	.version = 1,
> 	.maxattr = NETLINK_ATTR_MAX,
> };
> 
> #ifdef DEBUG_HEXDUMP
> #define DEBUG_HEXDUMP_WORDS 8
> #define DEBUG_HEXDUMP_BYTES (DEBUG_HEXDUMP_WORDS * sizeof(uint32_t))
> 
> static inline void debug_hexdump_line(void *ibuff, size_t offset, size_t isize,
> 				      const char *func)
> {
> 	static const char *hex = "0123456789abcdef";
> 	unsigned char *iptr = &((unsigned char *)ibuff)[offset];
> 	char *optr, obuff[DEBUG_HEXDUMP_BYTES * 3];
> 	int osize;
> 
> 	while (isize > 0) {
> 		optr = obuff;
> 		for (osize = 0; osize < DEBUG_HEXDUMP_BYTES; osize++) {
> 			if (((osize & 3) == 0) && (osize != 0))
> 				*optr++ = ' ';
> 			*optr++ = hex[(*iptr) >> 4];
> 			*optr++ = hex[(*iptr++) & 15];
> 			if (--isize <= 0)
> 				break;
> 		}
> 		*optr = '\0';
> 		DBGPRINT("%s %04x %s\n", func, (unsigned int)offset, obuff);
> 		offset += DEBUG_HEXDUMP_BYTES;
> 	}
> }
> 
> static inline void debug_hexdump(void *ibuff, size_t isize, const char *func)
> {
> 	size_t iline = isize / DEBUG_HEXDUMP_BYTES;
> 	size_t irem = isize % DEBUG_HEXDUMP_BYTES;
> 	size_t offset = isize;
> 
> 	if (iline < 6) {
> 		debug_hexdump_line(ibuff, 0, isize, func);
> 		return;
> 	}
> 	debug_hexdump_line(ibuff, 0, (3 * DEBUG_HEXDUMP_BYTES), func);
> 	isize = (irem == 0) ? (3 * DEBUG_HEXDUMP_BYTES)
> 	    : ((2 * DEBUG_HEXDUMP_BYTES) + irem);
> 	offset -= isize;
> 	debug_hexdump_line(ibuff, offset, isize, func);
> }
> #else
> static inline void debug_hexdump(void *ibuff, size_t isize, const char *func)
> {
> }
> #endif
> 
> /*
>  * Generic Netlink socket read function that handles communication from the
>  * userland for downloading the page table.
>  */
> static int get_page_tbl(struct sk_buff *skb_2, struct genl_info *info)
> {
> 	uint32_t rc, pte_mask, pte_fields, ptbl_bytes, offset, size;
> 	uint32_t status = 0;
> 	unsigned long flags;
> 	char *mydata;
> 	void *msg_head;
> 	struct nlattr *na;
> 	struct sk_buff *skb;
> 	struct switch_ctx *pctx, *next;
> 	struct switch_ptbl *ptbl, *pnew;
> 	struct IpcPgTable *pgp;
> 	struct IpcResponse resp;
> 	dev_t dev;
> 	static const char *invmsg = "Invalid Page Table message";
> 
> 	/*
> 	 * For each attribute there is an index in info->attrs that points to
> 	 * an nlattr structure; the attribute's data is carried inside that
> 	 * structure.
> 	 */
> 	if (info == NULL) {
> 		printk(KERN_WARNING "%s missing genl_info parameter\n",
> 		       __func__);
> 		return 0;
> 	}
> 	na = info->attrs[NETLINK_ATTR_MSG];
> 	if (na == NULL) {
> 		printk(KERN_WARNING "%s no info->attrs %i\n", __func__,
> 		       NETLINK_ATTR_MSG);
> 		return 0;
> 	}
> 	mydata = (char *)nla_data(na);
> 	if (mydata == NULL) {
> 		printk(KERN_WARNING "%s error while receiving data\n",
> 		       __func__);
> 		return 0;
> 	}
> 	DBGPRINTV("%s seq=%d, pid=%d, type=%d, flags=0x%x, data=0x%p "
> 		  "(0x%x, %d)\n",
> 		  __func__, info->snd_seq, info->snd_pid,
> 		  info->nlhdr->nlmsg_type, info->nlhdr->nlmsg_flags,
> 		  mydata, na->nla_len, na->nla_len);
> 	debug_hexdump(mydata,
> 		      ((offsetof(struct IpcPgTable, ptbl_buff)<na->nla_len)
> 		       ? offsetof(struct IpcPgTable, ptbl_buff)
> 		       : na->nla_len), __func__);
> 	/*
> 	 * Format the reply message.  Return positive error codes to userland.
> 	 */
> 	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
> 	if (skb == NULL) {
> 		printk(KERN_WARNING "%s cannot allocate reply message\n",
> 		       __func__);
> 		return 0;
> 	}
> 	msg_head = genlmsg_put(skb, 0, info->snd_seq, &__g_family, 0,
> 			       NETLINK_CMD_GET_PAGE_TBL);
> 	if (msg_head == NULL) {
> 		printk(KERN_WARNING "%s cannot format reply message header\n",
> 		       __func__);
> 		return 0;
> 	}
> 	pgp = (struct IpcPgTable *)mydata;
> 	if (na->nla_len < sizeof(struct IpcPgTable)) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "%s: too short (%d)", invmsg, na->nla_len);
> 		status = EINVAL;
> 		goto failed_respond;
> 	}
> 	if ((pgp->page_offset + pgp->page_count) > pgp->page_total) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "%s: too many page table entries (%d > %d)",
> 			 invmsg, (pgp->page_offset + pgp->page_count),
> 			 pgp->page_total);
> 		status = EINVAL;
> 		goto failed_respond;
> 	}
> 	pte_mask = (1 << pgp->pte_bits) - 1;
> 	if (((pgp->dev_count - 1) & (~pte_mask)) != 0) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "%s: invalid mask 0x%x for %d devices",
> 			 invmsg, pte_mask, pgp->dev_count);
> 		status = EINVAL;
> 		goto failed_respond;
> 	}
> 	pte_fields = 32 / pgp->pte_bits;
> 	size = ((pgp->page_count + pte_fields - 1) / pte_fields) *
> 	    sizeof(uint32_t);
> 	if ((sizeof(*pgp) - 1 + size) > na->nla_len) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "Invalid Page Table message: incomplete message");
> 		status = EINVAL;
> 		goto failed_respond;
> 	}
> 	debug_hexdump(&pgp->ptbl_buff, size, __func__);
> 
> 	/*
> 	 * Look for the corresponding switch context block to create or update
> 	 * the page table.
> 	 */
> 	rc = 0;
> 	dev = MKDEV(pgp->dev_major, pgp->dev_minor);
> 	spin_lock_irqsave(&__g_spinlock, flags);
> 	list_for_each_entry_safe(pctx, next, &__g_context_list, list) {
> 		if (dev == pctx->dev_this) {
> 			rc = 1;
> 			break;
> 		}
> 	}
> 	if (rc == 0) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "%s: invalid target device %d:%d",
> 			 invmsg, pgp->dev_major, pgp->dev_minor);
> 		status = EINVAL;
> 		goto failed_unlock;
> 	}
> 	DBGPRINTV("%s ctx=0x%p (%d:%d)\n", __func__, pctx, pgp->dev_major,
> 		  pgp->dev_minor);
> 
> 	ptbl = pctx->ptbl;
> 	if (((ptbl != NULL) && (pgp->page_offset > (ptbl->ptbl_num + 1))) ||
> 	    ((ptbl == NULL) && (pgp->page_offset != 0))) {
> 		snprintf(resp.err_str, sizeof(resp.err_str),
> 			 "%s: missing entries", invmsg);
> 		status = EINVAL;
> 		goto failed_unlock;
> 	}
> 	/*
> 	 * Don't allow userland to change context parameters unless the page
> 	 * table is being rebuilt.
> 	 */
> 	if (pgp->page_offset != 0) {
> 		if ((pgp->dev_count) != pctx->dev_count) {
> 			snprintf(resp.err_str, sizeof(resp.err_str),
> 				 "%s: invalid device count %d",
> 				 invmsg, pgp->dev_count);
> 			status = EINVAL;
> 			goto failed_respond;
> 		}
> 		if (ptbl != NULL) {
> 			if (pgp->pte_bits != ptbl->pte_bits) {
> 				snprintf(resp.err_str, sizeof(resp.err_str),
> 					 "%s: number of bits changed", invmsg);
> 				status = EINVAL;
> 				goto failed_unlock;
> 			}
> 			if (pgp->page_total != ptbl->ptbl_max) {
> 				snprintf(resp.err_str, sizeof(resp.err_str),
> 					 "%s: total number of entries changed",
> 					 invmsg);
> 				status = EINVAL;
> 				goto failed_unlock;
> 			}
> 		}
> 	}
> 
> 	/*
> 	 * Create a Page Table if needed.  Most of the time, the size of the
> 	 * table doesn't change.  In that case, re-use the existing table.
> 	 */
> 	ptbl_bytes = ((pgp->page_total + pte_fields - 1) / pte_fields) *
> 	    sizeof(uint32_t);
> 	if ((ptbl != NULL) && (ptbl_bytes == ptbl->ptbl_bytes)) {
> 		pnew = ptbl;
> 	} else {
> 		pnew = kmalloc((sizeof(*pnew) + ptbl_bytes), GFP_KERNEL);
> 		if (pnew == NULL) {
> 			snprintf(resp.err_str, sizeof(resp.err_str),
> 				 "Cannot allocate Page Table");
> 			status = EINVAL;
> 			goto failed_unlock;
> 		}
> 		pnew->ptbl_bytes = ptbl_bytes;
> 		DBGPRINT("%s ctx=0x%p (%d:%d) pnew=0x%p, buff=0x%p (%d), OK\n",
> 			 __func__, pctx, MAJOR(pctx->dev_this),
> 			 MINOR(pctx->dev_this), pnew, pnew->ptbl_buff,
> 			 ptbl_bytes);
> 	}
> 	pnew->pte_bits = pgp->pte_bits;
> 	pnew->pte_mask = pte_mask;
> 	pnew->pte_fields = pte_fields;
> 	pnew->ptbl_max = pgp->page_total;
> 	pnew->ptbl_num = pgp->page_offset + pgp->page_count;
> 	DBGPRINT("%s ctx=0x%p (%d:%d): bits=%d, mask=0x%x, num=%d, max=%d\n",
> 		 __func__, pctx, MAJOR(pctx->dev_this),
> 		 MINOR(pctx->dev_this), pnew->pte_bits, pnew->pte_mask,
> 		 pnew->ptbl_num, pnew->ptbl_max);
> 	offset = (pgp->page_offset + pte_fields - 1) / pte_fields;
> 	memcpy(&pnew->ptbl_buff[offset], pgp->ptbl_buff, size);
> 	pctx->userland[0] = pgp->userland[0];
> 	pctx->userland[1] = pgp->userland[1];
> 
> 	if (pnew != ptbl) {
> 		rcu_assign_pointer(pctx->ptbl, pnew);
> 		kfree(ptbl);
> 	}
> 
> failed_unlock:
> 	spin_unlock_irqrestore(&__g_spinlock, flags);
> 
> failed_respond:
> 	if (status)
> 		printk(KERN_WARNING "%s WARNING: %s\n", __func__, resp.err_str);
> 	else
> 		resp.err_str[0] = '\0';
> 
> 	/* Format the response message */
> 	resp.total_len = sizeof(struct IpcResponse);
> 	resp.opcode = OPCODE_PAGE_TABLE_UPLOAD;
> 	resp.userland[0] = pgp->userland[0];
> 	resp.userland[1] = pgp->userland[1];
> 	resp.dev_major = pgp->dev_major;
> 	resp.dev_minor = pgp->dev_minor;
> 	resp.status = status;
> 	rc = nla_put(skb, NETLINK_ATTR_MSG, sizeof(struct IpcResponse),
> 		     &resp);
> 	if (rc != 0) {
> 		printk(KERN_WARNING
> 		       "%s WARNING: Cannot format reply message\n", __func__);
> 		return 0;
> 	}
> 	genlmsg_end(skb, msg_head);
> 	rc = genlmsg_unicast(&init_net, skb, info->snd_pid);
> 	if (rc != 0)
> 		printk(KERN_WARNING "%s WARNING: Cannot send reply message\n",
> 		       __func__);
> 	return 0;
> }
> 
> /* Operation for getting the page table */
> static struct genl_ops __g_op_get_page_tbl = {
> 	.cmd = NETLINK_CMD_GET_PAGE_TBL,
> 	.flags = 0,
> 	.policy = __g_attr_policy,
> 	.doit = get_page_tbl,
> 	.dumpit = NULL,
> };
> 
> /*
>  * Use the sysfs interface to inform the userland process of the family id to
>  * be used by the Generic Netlink socket.
>  */
> static ssize_t sysfs_familyid_show(struct kobject *kobj,
> 				   struct attribute *attr, char *buff)
> {
> 	return snprintf(buff, PAGE_SIZE, "%d", __g_family.id);
> }
> 
> static ssize_t sysfs_familyid_store(struct kobject *kobj,
> 				    struct attribute *attr, const char *buff,
> 				    size_t size)
> {
> 	return size;
> }
> 
> struct _sysfs_attr_ops {
> 	const struct attribute attr;
> 	const struct sysfs_ops ops;
> };
> static const struct _sysfs_attr_ops __g_sysfs_familyid = {
> 	.attr = {"familyid", 0644},
> 	.ops = {&sysfs_familyid_show, &sysfs_familyid_store}
> };
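> /*
>  * Userland can then pick up the id with something like
>  * "cat /sys/module/dm_switch/familyid" (assuming the module ends up named
>  * dm_switch), or simply resolve the "DM_SWITCH" family by name instead.
>  */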
> 
> int __init dm_switch_init(void)
> {
> 	int r;
> 
> 	DBGPRINTV("%s\n", __func__);
> 	spin_lock_init(&__g_spinlock);
> 	r = dm_register_target(&__g_switch_target);
> 	if (r) {
> 		DMERR("dm_register_target() failed %d", r);
> 		return r;
> 	}
> 
> 	/* Initialize Generic Netlink communications */
> 	r = genl_register_family(&__g_family);
> 	if (r) {
> 		DMERR("genl_register_family() failed");
> 		goto failed;
> 	}
> 	r = genl_register_ops(&__g_family, &__g_op_get_page_tbl);
> 	if (r) {
> 		DMERR("genl_register_ops(get_page_tbl) failed %d", r);
> 		goto failed;
> 	}
> 	DBGPRINTV("%s Registered Generic Netlink group %d\n", __func__,
> 		  __g_family.id);
> 	r = sysfs_create_file(&__g_switch_target.module->mkobj.kobj,
> 			      &__g_sysfs_familyid.attr);
> 	if (r) {
> 		DMERR("/sys/module/familyid create failed %d", r);
> 		goto failed;
> 	}
> 	return 0;
> 
> failed:
> 	dm_unregister_target(&__g_switch_target);
> 	return r;
> }
> 
> void dm_switch_exit(void)
> {
> 	int r;
> 
> 	DBGPRINTV("%s\n", __func__);
> 	dm_unregister_target(&__g_switch_target);
> 	r = genl_unregister_family(&__g_family);
> 	if (r)
> 		DMWARN("genl_unregister_family() failed %d", r);
> 	return;
> }
> 
> module_init(dm_switch_init);
> module_exit(dm_switch_exit);
> 
> --
> dm-devel mailing list
> dm-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel



