Sage Weil, co-creator of Ceph and Red Hat's chief architect for the project (among many other credentials), recently held an "ask me anything" session on Reddit. Though you can read the whole thing for yourself here, we've collected the top questions and answers for your edification. Read on!
Q. from /u/weetabeex: Being really bloody complex notwithstanding, are there any plans for proper geo-replication in the near future? I am assuming this has been discussed over and over again, so I wonder: do you think the consistency semantics will need to be relaxed to make this work?
How cool would it be to start a Jewel blueprint out of your reddit (hopefully super detailed) reply?
A: There are currently two geo-replication development projects underway: a v2 of the radosgw multisite federation, and RBD journaling for geo-replication. The former will be eventually consistent (across zones), while RBD obviously needs to be point-in-time consistent at the replica.
We have also done some preliminary work to do async replication at the rados pool level. Last year we worked with a set of students at HMC to build a model for clock synchronization, verifying that we can get periodic ordering consistency points across all OSDs that could be replicated to another cluster. The results were encouraging and we have an overall architecture in mind... but we still need to put it all together.
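To picture the RBD side of this, here is a minimal, hypothetical sketch of journal-based replication: the primary appends every write to an ordered journal, and a replica that replays any prefix of that journal, strictly in order, lands on a state the primary actually passed through (a point-in-time-consistent image). All names are illustrative; this is not the actual RBD journaling code.

```python
from dataclasses import dataclass, field

@dataclass
class JournalEntry:
    seq: int      # monotonically increasing sequence number
    offset: int   # byte offset within the image
    data: bytes   # payload of the write

@dataclass
class ReplicaImage:
    blocks: dict = field(default_factory=dict)
    applied_seq: int = 0  # last journal entry applied

    def apply(self, entry: JournalEntry) -> None:
        # Entries must be applied strictly in order; any prefix of the
        # journal then corresponds to a real past state of the primary.
        assert entry.seq == self.applied_seq + 1
        self.blocks[entry.offset] = entry.data
        self.applied_seq = entry.seq

# Primary side: every write is journaled, in order, before it is acked.
journal = [JournalEntry(1, 0, b"hello"), JournalEntry(2, 4096, b"world")]

# Replica side: replay whatever has arrived; stopping after seq 1 or
# seq 2 both yield point-in-time-consistent images.
replica = ReplicaImage()
for entry in journal:
    replica.apply(entry)
```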
Q. from /u/emkt11: How do you think Ceph will benefit from btrfs and zfs? Also, can we use the journal from a journaling file system such as ext4, rather than Ceph keeping its own journal? Does newstore enable Ceph to go from 2x the number of replicas down to just the number of replicas in writes for every single write? And what is the timeline aimed for Jewel?
A: btrfs and zfs both give you two big things: checksumming of all data (yay!) and copy-on-write that we can use to efficiently clone objects (e.g., for RBD snapshots). The cost is fragmentation for small-I/O workloads... which costs a lot on spinning disks. I'm eager to see how this changes with widely deployed SSDs.
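To make the copy-on-write piece concrete, here is a small sketch (paths hypothetical) using the FICLONE ioctl that btrfs exposes: the clone shares extents with the source until either file is written, so "copying" an object for a snapshot is nearly instant regardless of its size.

```python
import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>

def clone_object(src_path: str, dst_path: str) -> None:
    """Clone src into dst without copying any data: both files share the
    same extents until one of them is modified (copy-on-write)."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        fcntl.ioctl(dst, FICLONE, src)  # instant, regardless of object size
    finally:
        os.close(src)
        os.close(dst)

# Example (requires a reflink-capable filesystem such as btrfs):
# clone_object("/mnt/btrfs/objects/head", "/mnt/btrfs/objects/snap.1")
```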
We can't make much use of existing fs journals because they're tightly bound to the POSIX semantics and data model the file system provides... which is not what Ceph wants. We work in terms of larger transactions over lots of objects, and after several years of pounding my head against it I've decided that trying to cram that down a file system's throat is a losing battle.
Instead, newstore manages its own metadata in a key/value database (rocksdb currently) and uses only a bare minimum from the underlying fs (object/file fragments). It does avoid the 2x write for new objects, but we do still double-write for small I/Os (where it is less painful).
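As a rough sketch of that write path (a dict stands in for rocksdb, and all names, paths, and thresholds are made up rather than taken from the NewStore code): a new object's data hits its file exactly once and the key/value commit makes it visible, while a small overwrite is staged in the key/value store's log first and applied to the file later.

```python
import json
import os

OBJECT_DIR = "/tmp/newstore"   # stand-in for the OSD's data directory
SMALL_IO = 64 * 1024           # illustrative threshold, not NewStore's value

kv = {}  # stand-in for rocksdb: holds metadata and a write-ahead log

def write(obj: str, offset: int, data: bytes) -> None:
    if "meta/" + obj not in kv:
        # New object: the data is written to its file exactly once, and
        # the kv commit makes it visible. No second copy of the data.
        os.makedirs(OBJECT_DIR, exist_ok=True)
        path = os.path.join(OBJECT_DIR, obj)
        with open(path, "wb") as f:
            f.write(data)
            os.fsync(f.fileno())
        kv["meta/" + obj] = json.dumps({"file": path, "len": len(data)})
    else:
        # Overwrite of an existing object: stage the bytes in the kv log
        # so the update commits atomically; apply them to the file later.
        # A double write, but only of a small amount of data.
        kv[f"wal/{obj}/{offset}"] = data

write("rbd_obj.1", 0, b"x" * (1 << 20))  # new object: single data write
write("rbd_obj.1", 4096, b"y" * 512)     # small overwrite: logged first
```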
Newstore will be in Jewel but still marked experimental: we likely won't have confidence by then that it won't eat your data.
Q. from /u/ivotron: Nowadays startups are doing great work that, in some cases, competes against research projects from universities (without the burden of having to write papers!). Would you advise people to go to grad school when they have a specific project/idea in mind that they want to develop? In your opinion, what are the pros and cons of the academic vs. startup route?
A: I'll start by saying I have a huge bias toward free software. If the choice is between research that will result in open publications (and hopefully open sourced code, or else IMO you're doing it wrong) and a startup writing proprietary code, there's no contest. If the startup is developing open source code, it's a trickier question.
I do get frustrated that a lot of research work is poorly applied: students build a prototype that works just well enough to generate the graphs but is a long way from being something that is useful or usable by the real world. The most common end result is that the student finishes their degree, the code is thrown away, and some proprietary software shop takes any useful ideas and incorporates them into their product line (and tries to hire the student). Working for a startup forces you to create something that is viable and useful to real customers, and if it's open source delivers real value to the industry.
This is probably a good time to plug CROSS, the new Center for Research in Open Source Systems at UCSC (https://cross.soe.ucsc.edu). One of the key ideas here is to bridge the gap between what students do for their graduate research and what is needed for an open source project to survive in the wild, via an incubation/fellowship program. It's a unique approach to bringing the fruits of investment in research into the open source community, and I'm really excited that the program is now officially off the ground!
Q. from /u/optimusC: What is the largest Ceph cluster you've seen in production so far?
A: The largest I've worked with was ~1300 OSDs. The largest I've heard of was CERN's ~7000-OSD test a few months back.
Right now our scaling issues are around OSD count. You can build much larger clusters (by an order of magnitude) by putting OSDs on top of RAID groups instead of individual disks, but we mostly haven't needed to do this yet.
Q. from /u/nigwil: What needs to be added to Ceph to allow it to replace Lustre for HPC workloads?
A: Possibly RDMA? XioMessenger is coming along so maybe that will kickstart HPC interest.
The largest source of friction we've seen in the HPC space is that all of the hardware people own was bought with Lustre's architecture in mind: it's all big disk arrays with hardware RAID, and very expensive. That's needed for Lustre because it is scale-out but not replicated: each array is fronted by a failover pair of OSSes.
Ceph is designed to use more commodity hardware and do its own replication.
Putting a 'production ready' stamp on CephFS will help, though for HPC that's a bit silly: the thing preventing us from doing it is the lack of an fsck tool, and Lustre has never had one either.
Q. from /u/bstillwell: What new storage technologies (NVMe, SMR, kinetic drives, ethernet drives, etc.) excite you most? Why?
A: NVMe will be big, but it's a bit scary because it's not yet obvious what we will need to change and rearchitect to use it most effectively.
SMR is annoying because we've been hearing about it for years but there's still nothing very good for dealing with it. The best idea I've heard so far would push the allocator partly into the drive, so that you'd say "write these blocks somewhere" and the ack would tell you where they landed. There are some libsmr-type projects out there that are promising, and I'd love to see them linked into a Ceph backend (like NewStore, where they'd fit pretty easily!).
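Here is a hypothetical sketch of what that "allocator in the drive" interface might look like: the host hands over the blocks, the drive appends them at the write pointer of whichever shingled zone has room, and the completion reports where they landed. None of these names correspond to a real API.

```python
class SMRZone:
    """A shingled zone only allows sequential appends at its write pointer."""
    def __init__(self, start_lba: int, size: int):
        self.start_lba = start_lba
        self.size = size
        self.write_ptr = 0

class SMRDrive:
    def __init__(self, zones: int = 4, zone_size: int = 256 * 1024):
        self.zones = [SMRZone(i * zone_size, zone_size) for i in range(zones)]

    def write_anywhere(self, data: bytes) -> int:
        """Append the blocks to whichever zone has room; return the LBA
        where they actually landed (the ack tells you where)."""
        for zone in self.zones:
            if zone.write_ptr + len(data) <= zone.size:
                lba = zone.start_lba + zone.write_ptr
                zone.write_ptr += len(data)
                return lba
        raise IOError("no zone has room; the drive would need to reclaim")

drive = SMRDrive()
lba = drive.write_anywhere(b"\0" * 4096)
# The host records `lba` in its own metadata, e.g. NewStore's kv database.
```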
Ethernet drives are really exciting, as they are exactly what we had in mind when we designed and architected Ceph. There is a big gap, though, between the prototype devices (which we've played with, and they work!) and being able to buy them in quantity, and that gap still makes my brain hurt. There are a few things we can and should do in Ceph to make this story more compelling (aarch64 builds coming soon!), but mostly it seems to be a waiting game.
Kinetic drives are cool in the same sense that ethernet drives are, except that they've settled on an interface that Ceph must consume... which means we still need servers sitting in front of them. We have some prototype support in Ceph, but the performance isn't great because the places where we use key/value APIs assume lower latency. I think we'll be able to plug them into NewStore more effectively, though. We'll see!
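The design point can be sketched as a narrow key/value interface (all names here are hypothetical; the real backend interface is C++ inside the OSD): if a NewStore-like backend only issues put/get calls against an abstract store, a Kinetic-style drive can slot in behind the same interface that rocksdb implements locally, with latency as the main difference callers must tolerate.

```python
from abc import ABC, abstractmethod

class KeyValueDB(ABC):
    """The narrow interface a NewStore-like backend could code against."""
    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...
    @abstractmethod
    def get(self, key: bytes) -> bytes: ...

class LocalRocksDB(KeyValueDB):
    """Local, low-latency store: what the current code paths assume."""
    def __init__(self) -> None:
        self._db = {}
    def put(self, key: bytes, value: bytes) -> None:
        self._db[key] = value
    def get(self, key: bytes) -> bytes:
        return self._db[key]

class KineticDrive(KeyValueDB):
    """Network-attached kv drive: same interface, but every call is a
    network round trip, so each operation costs far more latency."""
    def __init__(self, address: str) -> None:
        self.address = address
        self._db = {}  # pretend this lives on the drive itself
    def put(self, key: bytes, value: bytes) -> None:
        self._db[key] = value
    def get(self, key: bytes) -> bytes:
        return self._db[key]

def commit_metadata(db: KeyValueDB, obj: bytes, meta: bytes) -> None:
    # The backend neither knows nor cares which store is behind the call.
    db.put(b"meta/" + obj, meta)
```

Swapping the implementation behind such an interface is cheap; making the calling code tolerate the extra latency (batching, fewer round trips) is the harder part, which is why prototype performance has lagged.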