Sage Weil, Red Hat’s chief architect and co-creator of Ceph – among many other credentials – recently held an “ask me anything” session on Reddit. Though you can read the whole thing for yourself here, we’ve collected the top questions and answers for your edification. Read on!
Q. from /u/weetabeex: Being really bloody complex notwithstanding, are there any plans for proper geo-replication in the near future? I am assuming this has been discussed over and over again, so I wonder: do you think the consistency semantics will need to be relaxed to make this work?
How cool would it be to start a Jewel blueprint out of your reddit (hopefully super detailed) reply?
A: There are currently two geo-replication development projects underway: a v2 of the radosgw multisite federation, and RBD journaling for geo-replication. The former will be eventually consistent (across zones), while RBD obviously needs to be point-in-time consistent at the replica.
We have also done some preliminary work to do async replication at the rados pool level. Last year we worked with a set of students at HMC to build a model for clock synchronization, verifying that we can get periodic ordering consistency points across all OSDs that could be replicated to another cluster. The results were encouraging and we have an overall architecture in mind... but we still need to put it all together.
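To make the clock-synchronization idea a little more concrete, here is a minimal Python sketch of how a bounded clock skew can yield periodic, cluster-wide ordering points that an async replication process could ship. The interval, skew bound, and names below are illustrative assumptions, not anything from the Ceph codebase or the HMC work.

```python
# Hypothetical sketch: periodic consistency points under bounded clock skew.
import time

CHECKPOINT_INTERVAL = 30.0   # seconds between consistency points (assumed value)
MAX_CLOCK_SKEW = 0.050       # assumed upper bound on clock skew between OSDs

def current_epoch(now: float) -> int:
    # Epoch an OSD would stamp on an incoming write, based on its local clock.
    return int(now // CHECKPOINT_INTERVAL)

def epoch_is_sealed(epoch: int, now: float) -> bool:
    # An epoch is safe to ship to the remote cluster once every OSD's clock is
    # guaranteed to be past its boundary, i.e. local time has passed the
    # boundary by more than the worst-case skew.
    return now > (epoch + 1) * CHECKPOINT_INTERVAL + MAX_CLOCK_SKEW

now = time.time()
e = current_epoch(now)
print(f"writes are currently tagged with epoch {e}")
print(f"epoch {e - 1} sealed for replication: {epoch_is_sealed(e - 1, now)}")
```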
Q. from /u/emkt11: How do you think Ceph will benefit from btrfs and ZFS? Also, can we use journals from a journaled file system (e.g., ext4) rather than having Ceph keep its own journal? Does NewStore enable Ceph to reduce every write from 2x the number of replicas down to just the number of replicas? And what is the timeline aimed for Jewel?
A: btrfs and zfs both give you two big things: checksumming of all data (yay!) and copy-on-write that we can use to efficiently clone objects (e.g., for rbd snapshots). The cost is fragmentation for small I/O workloads... which costs a lot on spinning disks. I'm eager to see how this changes with widely deployed SSDs.
We can't make much use of existing fs journals because they're tightly bound to the POSIX semantics and data model the file system provides... which is not what Ceph wants. We work in terms of larger transactions over lots of objects, and after several years of pounding my head against it I've decided that trying to cram that down a file system's throat is a losing battle.
Instead, newstore manages its own metadata in a key/value database (rocksdb currently) and uses a bare minimum from the underlying fs (object/file fragments). It does avoid the 2x write for new objects, but we do still double-write for small IOs (where it is less painful).
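A minimal sketch of those two write paths, assuming a plain dict as a stand-in for rocksdb and ordinary files as object fragments; the names and on-disk layout here are invented for illustration and are not the actual NewStore code:

```python
# Illustrative only: new objects get a single data write, small overwrites
# are journaled in the key/value store first (i.e. written twice).
import os

kv = {}                              # stand-in for the rocksdb metadata/WAL store
DATA_DIR = "/tmp/newstore-demo"
os.makedirs(DATA_DIR, exist_ok=True)

def write_new_object(name: str, data: bytes) -> None:
    # New object: data goes straight to a fresh file fragment, then the
    # metadata pointing at it is committed to the key/value store.
    frag = os.path.join(DATA_DIR, f"{name}.0")
    with open(frag, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    kv[f"meta/{name}"] = {"fragments": [frag], "size": len(data)}

def overwrite_small(name: str, offset: int, data: bytes) -> None:
    # Small overwrite: the bytes are first logged in the KV store (WAL),
    # then applied in place to the existing fragment.
    kv[f"wal/{name}/{offset}"] = data              # first write (journal)
    frag = kv[f"meta/{name}"]["fragments"][0]
    with open(frag, "r+b") as f:                   # second write (in place)
        f.seek(offset)
        f.write(data)
    del kv[f"wal/{name}/{offset}"]                 # WAL entry retired

write_new_object("obj1", b"hello world")
overwrite_small("obj1", 0, b"HELLO")
```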
Newstore will be in Jewel but still marked experimental--we likely won't have enough confidence by then that it won't eat your data.
Q. from /u/ivotron: Nowadays startups are doing great work that, in some cases, competes against research projects from universities (without the burden of having to write papers!). Would you advise people to go to grad school when they have a specific project/idea in mind that they want to develop? In your opinion, what are the pros and cons of the academic vs. startup route?
A: I'll start by saying I have a huge bias toward free software. If the choice is between research that will result in open publications (and hopefully open sourced code, or else IMO you're doing it wrong) and a startup writing proprietary code, there's no contest. If the startup is developing open source code, it's a trickier question.
I do get frustrated that a lot of research work is poorly applied: students build a prototype that works just well enough to generate the graphs but is a long way from being something that is useful or usable in the real world. The most common end result is that the student finishes their degree, the code is thrown away, and some proprietary software shop takes any useful ideas and incorporates them into their product line (and tries to hire the student). Working for a startup forces you to create something that is viable and useful to real customers, and if it's open source it delivers real value to the industry.
This is probably a good time to plug CROSS, the new Center for Research in Open Source Systems at UCSC (https://cross.soe.ucsc.edu). One of the key ideas here is to bridge the gap between what students do for their graduate research and what is needed for an open source project to survive in the wild with an incubation / fellowship. It's a unique approach to bringing the fruits of investment in research into the open source community and I'm really excited that the program is now officially off the ground!
Q. from /u/optimusC: What is the largest size of Ceph cluster you've seen so far in production today?
A: The largest I've worked with was ~1300 OSDs. The largest I've heard of was CERN's ~7000 OSD test they did a few months back.
Right now our scaling issues are around OSD count. You can build much larger clusters (by an order of magnitude) by putting OSDs on top of RAID groups instead of individual disks, but we mostly haven't needed to do this yet.
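As a back-of-the-envelope illustration of that order-of-magnitude point (the OSD count and RAID group size below are assumptions for illustration, not Ceph limits or recommendations):

```python
# Rough arithmetic only: if practical scaling is bounded by OSD count,
# fronting each OSD with a RAID group multiplies the disks per cluster.
osd_limit = 7000                 # roughly the largest OSD count mentioned above
disks_per_raid_group = 10        # assumed size of a RAID group behind one OSD

print("disks with one OSD per disk:      ", osd_limit)
print("disks with one OSD per RAID group:", osd_limit * disks_per_raid_group)
```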
Q. from /u/nigwil: What needs to be added to Ceph to allow it to replace Lustre for HPC workloads?
A: Possibly RDMA? XioMessenger is coming along so maybe that will kickstart HPC interest.
The largest friction we've seen in the HPC space is that all of the hardware people own is bought with Lustre's architecture in mind: it's all big disk arrays with hardware RAID, and very expensive. That's needed for Lustre because it is scale-out but not replicated--each array is fronted by a failover pair of OSTs.
Ceph is designed to use more commodity hardware and do its own replication.
Putting a 'production ready' stamp on CephFS will help, but for HPC that is silly--the thing preventing us from doing that is an fsck tool, which Lustre has never had.
Q. from /u/bstillwell: What new storage technologies (NVMe, SMR, kinetic drives, ethernet drives, etc.) excite you most? Why?
A: NVMe will be big, but it's a bit scary because it's not obvious what we will need to change and rearchitect to use it most effectively.
SMR is annoying because we've been hearing about it for years but there's still nothing very good for dealing with it. The best idea I've heard so far would push the allocator partly into the drive, so that you'd say "write these blocks somewhere" and the ack would tell you where they landed. There are some libsmr-type projects out there that are promising, and I'd love to see these linked into a Ceph backend (like NewStore, where they'd fit pretty easily!).
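A toy sketch of what that "write these blocks somewhere, tell me where they landed" interface might look like; the class and method names are hypothetical and don't correspond to any existing SMR library or Ceph code:

```python
# Hypothetical drive-side allocator for zoned (SMR) media.
class SMRZoneDevice:
    """Toy model of an SMR drive that appends within zones and reports back
    the logical block address it chose for each write."""

    def __init__(self, zone_size_blocks: int = 65536, zones: int = 4):
        self.zone_size = zone_size_blocks
        self.write_pointers = [z * zone_size_blocks for z in range(zones)]
        self.active_zone = 0

    def write_somewhere(self, num_blocks: int) -> int:
        # Drive-side allocation: append at the active zone's write pointer
        # and return the LBA where the data actually landed.
        wp = self.write_pointers[self.active_zone]
        zone_end = (self.active_zone + 1) * self.zone_size
        if wp + num_blocks > zone_end:
            self.active_zone += 1                  # naive: spill to the next zone
            wp = self.write_pointers[self.active_zone]
        self.write_pointers[self.active_zone] = wp + num_blocks
        return wp                                  # the backend records this LBA in its metadata

dev = SMRZoneDevice()
lba = dev.write_somewhere(8)
print(f"data landed at LBA {lba}; the object store keeps that in its metadata")
```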
Ethernet drives are really exciting, as they are exactly what we had in mind when we designed and architected Ceph. There is a big gap, though, between the prototype devices (which we've played with, and they work!) and being able to buy them in quantity, and that still makes my brain hurt. There are a few things we can/should do in Ceph to make this story more compelling (aarch64 builds coming soon!) but mostly it's a waiting game, it seems.
Kinetic drives are cool in the same sense that ethernet drives are, except that they've fixed on an interface that Ceph must consume...which means we still need servers sitting in front of them. We have some prototype support in Ceph but the performance isn't great because the places we use key/value APIs assume lower latency...but I think we'll be able to plug them into NewStore more effectively. We'll see!