Introduction to Ceph and Its Architecture and Principles
1. Introduction to Ceph Architecture and Usage Scenarios
1.1 Introduction to Ceph
Ceph is a unified distributed storage system designed from the outset to provide good performance, reliability, and scalability.
The Ceph project first originated from work done while Sage was a PhD student (the earliest results were published in 2004) and was subsequently contributed to the open source community. After several years of development, it is now supported and widely used by many cloud computing vendors. Both RedHat and OpenStack can be integrated with Ceph to support back-end storage for VM images.
1.2 Ceph features
- High performance
a. Abandons the traditional centralized metadata addressing scheme and uses the CRUSH algorithm, giving balanced data distribution and high parallelism.
b. Takes the isolation of disaster-tolerant domains into account and can implement replica placement rules for various types of loads, such as cross-room replication and rack awareness.
c. Supports scales of thousands of storage nodes and data volumes from terabytes to petabytes.
- High Availability
a. The number of copies can be flexibly controlled.
b. Support for fault domain separation and strong data consistency.
c. Automatic repair and self-healing in a variety of failure scenarios.
d. No single point of failure and automatic management.
- High scalability
a. Decentralization.
b. Flexibility of extension.
c. Performance grows linearly as the number of nodes increases.
- Feature rich
a. Three storage interfaces are supported: block storage, file storage, and object storage.
b. Support custom interfaces and multiple language drivers.
1.3 Ceph Architecture
Three interfaces are supported:
- Object: has a native API and is also compatible with Swift and S3 APIs.
- Block: supports thin provisioning, snapshots, and cloning.
- File: Posix interface with snapshot support.
1.4 Introduction to Ceph Core Components and Concepts
- Monitor
A Ceph cluster requires a small cluster composed of multiple Monitors, which synchronize data via Paxos and store the metadata of the OSDs.
- OSD
OSD stands for Object Storage Device. It is the process responsible for returning the specific data in response to client requests; a Ceph cluster typically has many OSDs.
- MDS
MDS stands for Ceph Metadata Server; it is the metadata service that the CephFS service depends on.
- Object
The lowest-level storage unit in Ceph is the Object; each Object contains metadata and raw data.
- PG
PG stands for Placement Group. It is a logical concept; a PG contains multiple OSDs. The PG layer was introduced to distribute and locate data more effectively.
- RADOS
RADOS stands for Reliable Autonomic Distributed Object Store. It is the core of a Ceph cluster, implementing data distribution, failover, and other cluster operations.
- Librados
Librados is the library for accessing RADOS. Because the RADOS protocol is difficult to access directly, the upper layers RBD, RGW, and CephFS all access RADOS through librados; it currently provides support for PHP, Ruby, Java, Python, C, and C++.
- CRUSH
CRUSH is the data distribution algorithm used by Ceph. Similar to consistent hashing, it places data in the expected locations.
- RBD
RBD, known as RADOS block device, is a block device service provided by Ceph to the public.
- RGW
RGW, known as RADOS gateway, is an object storage service provided externally by Ceph, with interfaces compatible with S3 and Swift.
- CephFS
CephFS, known as Ceph File System, is a file system service provided by Ceph to the public.
1.5 Three Types of Storage - Block Storage
Typical devices: disk arrays, hard drives
Block storage mainly maps raw disk space to hosts for their use.
Pros:
- Data protection is provided by means of RAID, LVM, and similar mechanisms.
- Multiple inexpensive hard drives are combined to increase capacity.
- Multiple disks are combined into logical disks, improving read and write efficiency.
Disadvantages:
- When using a SAN architecture, Fibre Channel switches are required, which makes it costly.
- Data cannot be shared between hosts.
Usage scenarios:
- Disk storage allocation for Docker containers and virtual machines.
- Log storage.
- File Storage.
- …
1.6 Three Types of Storage - File Storage
Typical devices: FTP and NFS servers
File storage exists to overcome the problem that files on block storage cannot be shared.
Setting up FTP or NFS services on a server is a form of file storage.
Pros:
- Low cost; any ordinary machine will do.
- Easy file sharing.
Disadvantages:
- Low read and write rates.
- Slow transfer rate.
Usage scenarios:
- Log storage.
- File storage with a directory structure.
- …
1.7 Three Types of Storage - Object Storage
Typical devices: distributed servers with built-in high-capacity hard disks (Swift, S3)
Multiple servers with built-in high-capacity hard drives are installed with object storage management software to provide read and write access to the outside world.
Pros:
- High read/write speed, like block storage.
- Sharing and other features, like file storage.
Usage scenarios (suitable for data that is rarely updated):
- Image Storage.
- Video Storage.
- …
2. Ceph IO Process and Data Distribution
2.1 Normal IO flow chart
Steps:
1. The client creates a cluster handler.
2. The client reads the configuration file.
3. The client connects to the monitor and obtains the cluster map information.
4. For reads and writes, the client sends the I/O request to the corresponding primary OSD data node according to the CRUSH map.
5. The primary OSD data node writes the data to the other two replica nodes at the same time.
6. Wait for the primary node and the other two replica nodes to finish writing the data.
7. Once the primary and replica nodes all report a successful write, the result is returned to the client and the I/O write completes. (A minimal client-side sketch of this flow follows.)
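As an illustration of steps 1-3 and the write in steps 4-7, here is a minimal sketch using the Python rados binding. The configuration path and pool name are assumptions and must match an existing cluster; the replica handling itself happens inside librados/RADOS, not in this code.

```python
import rados

# 1. Create the cluster handle and 2. read the configuration file.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')

# 3. Connect to the monitors and fetch the cluster maps.
cluster.connect()

# 4. Open an I/O context on a pool; librados and CRUSH locate the primary OSD.
ioctx = cluster.open_ioctx('rbd')

# 5-7. Write an object; the call returns once the primary OSD and its
#      replicas have all acknowledged the write.
ioctx.write_full('hello-object', b'hello ceph')

ioctx.close()
cluster.shutdown()
```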
2.2 New Primary OSD IO Flow Chart
Description:
If a newly added OSD1 replaces the existing OSD4 as the primary OSD, then since no PG has been created on OSD1 and it holds no data, I/O on that PG cannot be performed. How does this work?
Steps:
1. The client connects to the monitor to get cluster map information.
2. Because it lacks the PG data, the new primary osd1 actively reports to the monitor and asks osd2 to temporarily take over as primary.
3. Temporary master osd2 will synchronize the full amount of data to the new master osd1.
4. Client I/O reads and writes connect directly to the temporary primary osd2.
5. osd2 receives read/write io and writes to the other two replica nodes at the same time.
6. Wait for osd2 and the other two copies to be written successfully.
7. Once all three copies are written successfully, osd2 returns the result to the client; at this point the client I/O is complete.
8. If osd1 data synchronization is complete, temporary master osd2 will surrender the master role.
9. osd1 becomes the primary node, and osd2 becomes a replica.
2.3 Ceph IO algorithm flow
1. File: the file the user needs to read or write. File -> Object mapping:
a. ino (File's metadata, File's unique id).
b. ono (the serial number of an object produced by splitting the File; the default split block size is 4M).
c. oid (object id: ino + ono).
2. Object: the object that RADOS works with. Ceph uses a static hash function to compute a value from the oid, mapping the oid to an approximately uniformly distributed pseudo-random value, which is then ANDed bitwise with a mask to obtain the pgid. Object -> PG mapping:
a. hash(oid) & mask -> pgid.
b. mask = total number of PGs m (m is an integer power of 2) - 1.
3. PG (Placement Group): used to organize objects and map their storage locations (similar to the concept of slots in Redis Cluster); a PG contains many objects. The CRUSH algorithm takes the pgid as input and produces a set of OSDs. PG -> OSD mapping:
a. CRUSH(pgid) -> (osd1, osd2, osd3).
2.4 Ceph IO Pseudocode Flow
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
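A runnable toy version of the pseudocode above is sketched below. The object name, PG count, and OSD list are invented values, the hash uses Python's hashlib rather than Ceph's rjenkins hash, and the crush() stand-in simply ranks OSDs deterministically instead of walking a real cluster map.

```python
import hashlib

NUM_PG = 128                                 # must be a power of two for the mask trick
OSDS = ["osd.%d" % i for i in range(9)]      # invented OSD names

def stable_hash(name):
    """Deterministic 32-bit hash (stand-in for Ceph's rjenkins hash)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "little")

def object_to_pg(object_name):
    # hash(oid) & mask -> pgid, with mask = NUM_PG - 1
    return stable_hash(object_name) & (NUM_PG - 1)

def crush(pgid, replicas=3):
    """Stand-in for CRUSH: deterministically rank OSDs and take the first few."""
    ranked = sorted(OSDS, key=lambda osd: stable_hash("%d:%s" % (pgid, osd)))
    return ranked[:replicas]

pg = object_to_pg("rbd_data.1.0000000000000001")
osds_for_pg = crush(pg)
primary, replicas = osds_for_pg[0], osds_for_pg[1:]
print(pg, primary, replicas)
```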
2.5 Ceph RBD IO Flow
Steps:
1. The client creates a pool and needs to specify the number of pg's for this pool.
2. Create pool/image rbd device for mounting.
3. The data written by the user is split into chunks; each chunk is 4M by default and is given a name of the form object + serial number.
4. Assign each object to a copy location via pg.
5. The pg finds 3 OSDs according to the CRUSH algorithm and saves the object on each of these three OSDs.
6. The OSDs actually handle the underlying disks, which are normally formatted as xfs file systems by the deployment tools.
7. Storing an object then becomes storing a file such as rbd0.object1.file. (A minimal sketch of steps 1-3 from the client side follows.)
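For illustration, here is a minimal sketch of creating an image and writing to it with the Python rados/rbd bindings. The pool name, image name, and size are assumptions; creating the pool and choosing its pg_num is normally done beforehand with the ceph CLI.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# 1. The pool (with its pg_num) is normally created beforehand, for example:
#    ceph osd pool create mypool 128
ioctx = cluster.open_ioctx('mypool')

# 2. Create a pool/image RBD device that can later be mapped and mounted.
rbd.RBD().create(ioctx, 'myimage', 4 * 1024 ** 3)   # 4 GiB image

# 3. Writes are chunked into 4M objects named <prefix>.<serial number>.
image = rbd.Image(ioctx, 'myimage')
image.write(b'some data', 0)                         # data, offset
image.close()

ioctx.close()
cluster.shutdown()
```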
2.6 Ceph RBD IO Framework Diagram
Process of the client writing data to the OSDs:
1. The librbd form is used: a block device is created with librbd, and data is written to this block device.
2. Locally on the client, the librados interface is called, and the request is mapped layer by layer through pool, rbd, object, and pg. At the PG layer, it is known which 3 OSDs the data is stored on; these 3 OSDs have a primary/replica relationship.
3. The client establishes SOCKET communication with the PRIMARY OSD and passes the data to be written to the PRIMARY OSD, which in turn sends the data to other REPLICA OSD data nodes.
2.7 Ceph Pool and PG Distribution
Description:
- A pool is a logical partition Ceph uses when storing data; it acts as a namespace.
- Each pool contains a certain (configurable) number of PGs.
- The objects in a PG are mapped to different OSDs.
- pool is distributed to the entire cluster.
- The pool can do fault isolation domains, which vary according to different user scenarios.
2.8 PG Distribution During Ceph Capacity Expansion
**Scenario data migration process:**
- Current state: 3 OSDs, 4 PGs
- After expansion: 4 OSDs, 4 PGs
Current state.
After expansion.
Notes
Many PGs are distributed across each OSD, and each PG is automatically scattered over different OSDs. If capacity is expanded, some PGs are migrated to the new OSD to keep the number of PGs balanced, as the toy demonstration below shows.
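The following is a toy demonstration of this remapping behavior, using a rendezvous-hash stand-in for CRUSH. The exact PGs that move will differ from a real cluster; the point is that only the PGs whose placement changes to the new OSD migrate.

```python
import hashlib

def pg_to_osd(pgid, osds):
    """Map a PG to one OSD by picking the highest per-(pg, osd) hash score."""
    return max(osds, key=lambda osd: hashlib.md5(("%d:%s" % (pgid, osd)).encode()).hexdigest())

before = ["osd.0", "osd.1", "osd.2"]
after = before + ["osd.3"]              # capacity expansion: one new OSD

for pg in range(4):
    old, new = pg_to_osd(pg, before), pg_to_osd(pg, after)
    note = "  -> migrates to the new OSD" if old != new else "  (stays put)"
    print("pg %d: %s -> %s%s" % (pg, old, new, note))
```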
3. Ceph Heartbeat Mechanism
3.1 Introduction to Heartbeat
Heartbeats are used between nodes to detect if each other is faulty so that the faulty node can be detected in time to enter the appropriate fault handling process.
Problems:
- There is a trade-off between fault detection time and the load imposed by heartbeat messages.
- Too many heartbeats can affect system performance if the heartbeat frequency is too high.
- A low heartbeat frequency prolongs the time to discover a failed node, which affects the availability of the system.
The fault detection strategy should achieve the following:
- Timeliness: when a node fails, for example by crashing or losing network connectivity, the cluster can detect it within an acceptable time.
- Appropriate stress: both on the nodes, and on the network.
- Tolerates network jitter: occasional network delays.
- Diffusion mechanism: changes in meta-information due to changes in the survival status of nodes need to be diffused to the whole cluster by some mechanism.
3.2 Ceph Heartbeat Detection
An OSD node listens on the public, cluster, front, and back ports:
- public port: listens for connections from Monitor and Client.
- cluster port: listens for connections from OSD Peer.
- front port: the NIC used by clients to connect to the cluster; here it is borrowed for heartbeats within the cluster.
- back port: the NIC used internally by the cluster; heartbeats within the cluster run over it.
- hbclient: the messenger that sends ping heartbeats.
3.3 Mutual heartbeat detection between Ceph OSDs
Steps:
- The OSDs within the same PG heartbeat each other and they send PING/PONG messages to each other.
- Detect every 6s (a random time will actually be added to this to avoid spikes).
- If no heartbeat reply is received within 20s, the peer is added to the failure queue (a toy model of this check follows).
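The sketch below models this peer check. The 6 s interval and 20 s grace period mirror the values mentioned above; the peer names, clock source, and data structures are invented for illustration and do not reflect the actual OSD implementation.

```python
import random
import time

HEARTBEAT_INTERVAL = 6      # seconds between pings (plus random jitter to avoid spikes)
HEARTBEAT_GRACE = 20        # no reply for this long -> peer goes to the failure queue

last_reply = {"osd.1": time.time(), "osd.2": time.time()}
failure_queue = set()

def on_pong(peer):
    """Record a heartbeat reply from a partner OSD."""
    last_reply[peer] = time.time()
    failure_queue.discard(peer)

def heartbeat_tick():
    """Runs roughly every HEARTBEAT_INTERVAL seconds."""
    now = time.time()
    for peer, seen in last_reply.items():
        if now - seen > HEARTBEAT_GRACE:
            failure_queue.add(peer)   # will be reported to the Monitor later

def next_tick_delay():
    return HEARTBEAT_INTERVAL + random.uniform(0, 1)
```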
3.4 Ceph OSD with Mon heartbeat detection
The OSD reports to the Monitor:
- When an event occurs on the OSD (e.g., a failure or a PG change).
- Within 5 seconds of its own startup.
- The OSD reports to the Monitor periodically.
- OSD checks for partner OSD failure messages in the failure_queue.
- Sends a failure report to Monitor and adds the failure message to the failure_pending queue, then removes it from the failure_queue.
- When a heartbeat is received from an OSD in failure_queue or failure_pending, it is removed from both queues and the Monitor is informed to cancel the previous failure report.
- When a reconnection to the Monitor network occurs, the error report from the failure_pending is added back to the failure_queue and sent to Monitor again.
- The Monitor determines which OSDs are down
- The Monitor collects partner failure reports from OSDs.
- When the failure reports against an OSD exceed a certain threshold and enough OSDs report its failure, the Monitor takes that OSD offline.
3.5 Ceph Heartbeat Detection Summary
Ceph determines that an OSD node has failed in two ways: partner OSDs reporting failed nodes, and the Monitor counting heartbeats from OSDs.
Timeliness: a partner OSD can detect a node failure within seconds and report it to the Monitor, and within a few minutes the Monitor takes the failed OSD offline.
Appropriate pressure: thanks to the partner OSD reporting mechanism, heartbeats between the Monitor and OSDs are more of an insurance measure, so an OSD can send heartbeats to the Monitor at intervals of up to 600 seconds and the Monitor's detection threshold can be as long as 900 seconds. Ceph effectively spreads the failure-detection pressure of the central node across all OSDs, improving the reliability of the central Monitor and thus the scalability of the entire cluster.
Tolerance of network jitter: instead of taking the target OSD offline immediately after receiving a failure report about it from a partner OSD, the Monitor waits until several conditions are met:
- The failure time of the target OSD is greater than a threshold dynamically determined by a fixed amount of osd_heartbeat_grace and historical network conditions.
- Reports from different hosts reach mon_osd_min_down_reporters.
- The failure report has not been cancelled by the source OSD before the first two conditions are met.
Diffusion: as the central node, the Monitor does not try to broadcast a notification to all OSDs and Clients after updating the OSDMap; instead it lazily waits for OSDs and Clients to fetch it, which reduces the pressure on the Monitor and simplifies the interaction logic.
4. Ceph Communication Framework
4.1 Types of Ceph Communication Frameworks
Three different implementations of the network communication framework:
- Simple threaded mode
Features: for each network link, two threads are created, one for receiving and one for sending.
Disadvantages: a large number of connections produces a large number of threads, which consumes CPU resources and affects performance.
- I/O multiplexing mode for Async events
Features: this is a widely used approach in network communication today; the K release already uses Async by default.
- XIO mode, implemented with the open-source network communication library Accelio
Features: this approach depends on the stability of the third-party library Accelio and is currently in the experimental phase.
4.2 Ceph Communication Framework Design Patterns
Design pattern (Subscribe/Publish):
The Subscribe/Publish pattern, also known as the Observer pattern, is intended to "define a one-to-many dependency between objects, so that when the state of one object changes, all objects that depend on it are notified and updated automatically."
4.3 Ceph Communication Framework Flowchart
Steps:
- accepter listens for requests from peer, calls SimpleMessenger::add_accept_pipe() to create a new Pipe to SimpleMessenger::pipes to handle the request.
- Pipe is used to read and send messages. This class has two main components, Pipe::Reader, and Pipe::Writer to handle message reading and sending.
- The Messenger acts as the publisher of messages, and the various Dispatcher subclasses act as subscribers. After the Messenger receives a message, it reads it through the Pipe and then hands it to a Dispatcher for processing (see the sketch after this list).
- Dispatcher is the base class for subscribers; concrete subscription back ends inherit from it and are registered into Messenger::dispatchers through Messenger::add_dispatcher_tail/head during initialization. When a message is received, the subscriber is notified and processes it.
- DispatchQueue This class is used to cache incoming messages, then wake up the DispatchQueue::dispatch_thread thread to find the back-end Dispatch to process the message.
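The following is a stripped-down model of this Messenger/Dispatcher subscribe-publish relationship. The class and method names only loosely follow the real C++ classes (SimpleMessenger, Dispatcher, DispatchQueue), and the dispatch here runs synchronously rather than in a dedicated dispatch thread.

```python
from collections import deque

class Dispatcher:
    """Base class for subscribers; concrete back ends override ms_dispatch."""
    def ms_dispatch(self, message):
        raise NotImplementedError

class OSDDispatcher(Dispatcher):
    def ms_dispatch(self, message):
        print("OSD back end handles:", message)
        return True                      # message consumed

class Messenger:
    """Publisher: receives messages and hands them to registered subscribers."""
    def __init__(self):
        self.dispatchers = []
        self.dispatch_queue = deque()

    def add_dispatcher_tail(self, dispatcher):
        self.dispatchers.append(dispatcher)

    def deliver(self, message):
        # In the real code a Pipe::Reader enqueues into a DispatchQueue and a
        # dispatch thread drains it; here it is done synchronously.
        self.dispatch_queue.append(message)
        while self.dispatch_queue:
            msg = self.dispatch_queue.popleft()
            for dispatcher in self.dispatchers:
                if dispatcher.ms_dispatch(msg):
                    break

messenger = Messenger()
messenger.add_dispatcher_tail(OSDDispatcher())
messenger.deliver("osd_op(write object-1)")
```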
4.4 Ceph Communication Framework Class Diagram
4.5 Ceph communication data format
The communication protocol format requires both parties to agree on the data format.
The content of a message is divided into three main parts:
- header // message header, the "envelope" of the message
- user data // the actual data to be sent
- payload // metadata saved by the operation
- middle // reserved field
- data // Read and write data
- footer // end marker of the message
class Message : public RefCountedObject {
protected:
ceph_msg_header header; // Message header
ceph_msg_footer footer;// end of message
bufferlist payload; // "front" unaligned blob
bufferlist middle; // "middle" unaligned blob
bufferlist data; // data payload (page-alignment will be preserved where possible)
/* recv_stamp is set when the Messenger starts reading the
* Message off the wire */
utime_t recv_stamp;// timestamp of when to start receiving data
/* dispatch_stamp is set when the Messenger starts calling dispatch() on
* its endpoints */
utime_t dispatch_stamp;//Dispatch's timestamp
/* throttle_stamp is the point at which we got throttle */
utime_t throttle_stamp;//Get the timestamp of throttle's slot
/* time at which message was fully read */
utime_t recv_complete_stamp;// timestamp of completion of reception
ConnectionRef connection;//network connection
uint32_t magic = 0;//the magic word of the message
bi::list_member_hook dispatch_q;//boost::intrusive member field
};
struct ceph_msg_header {
__le64 seq; // Unique serial number of the message in the current session
__le64 tid; // A globally unique id for the message
__le16 type; // Message type
__le16 priority; // Priority
__le16 version; // Version number
__le32 front_len; // length of the payload
__le32 middle_len;// length of middle
__le32 data_len; // length of data
__le16 data_off; // Data offset of the object
struct ceph_entity_name src; //Source
/* oldest code we think can decode this. unknown if zero. */
__le16 compat_version;
__le16 reserved;
__le32 crc; /* header crc32c */
} __attribute__ ((packed));
struct ceph_msg_footer {
__le32 front_crc, middle_crc, data_crc; // crc checksums for the front, middle and data sections
__le64 sig; // 64-bit signature of the message
__u8 flags; // end-of-message flags
} __attribute__ ((packed));
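As a sketch of how the packed ceph_msg_header above maps onto bytes on the wire, the snippet below decodes it with Python's struct module. The layout of ceph_entity_name is not shown in the text; it is assumed here to be a packed (__u8 type, __le64 num) pair, and real parsing code should of course validate lengths and the crc.

```python
import struct

# __le64 seq, tid | __le16 type, priority, version | __le32 front/middle/data_len
# | __le16 data_off | ceph_entity_name (u8 + le64, assumed) | __le16 compat_version,
# reserved | __le32 crc  -- all little-endian, no padding ("<").
HEADER_FMT = "<QQHHHIIIHBQHHI"

def parse_msg_header(buf):
    fields = struct.unpack(HEADER_FMT, buf[:struct.calcsize(HEADER_FMT)])
    names = ("seq", "tid", "type", "priority", "version",
             "front_len", "middle_len", "data_len", "data_off",
             "src_type", "src_num", "compat_version", "reserved", "crc")
    return dict(zip(names, fields))
```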
5. Ceph CRUSH Algorithm
5.1 Data distribution algorithm challenges
- Data distribution and load balancing:
a. Balanced data distribution, so that data is evenly distributed across the nodes.
b. Load balancing, so that the load of data access read and write operations is balanced across nodes and disks.
- Flexible handling of cluster scaling:
a. The system can easily add or remove node devices and handle node failures.
b. After node devices are added or removed, data is rebalanced automatically with as little data migration as possible.
- Support for large-scale clusters
a. requires that the data distribution algorithm maintains relatively small metadata and is not too computationally intensive. As the cluster size increases, the data distribution algorithm overhead is relatively small.
5.2 Description of the Ceph CRUSH algorithm
- The full name of the CRUSH algorithm is Controlled, Scalable, Decentralized Placement of Replicated Data.
- The algorithm for the process of mapping pg to OSD is called the CRUSH algorithm. (Three copies of an Object need to be saved, i.e. they need to be saved on three osd's).
- The CRUSH algorithm is a pseudo-random process: it can randomly select a set of OSDs from all the OSDs, but each selection made for the same PG yields the same result, i.e., the mapped set of OSDs is fixed.
5.3 Principles of the Ceph CRUSH Algorithm
CRUSH algorithm factors:
- Hierarchical Cluster Map
Reflects the hierarchical physical topology of the storage system and defines the static topology of the OSD cluster with its hierarchical relationships. The OSD hierarchy makes the CRUSH algorithm rack-aware when selecting OSDs, i.e., rules can be defined so that replicas are distributed across different racks and rooms, improving data safety.
- Placement Rules
The rules that determine how the object replicas of a PG are selected; through these, users can define their own rules and customize how replicas are distributed across the cluster.
5.3.1 Hierarchical Cluster Map
The CRUSH Map is a tree structure, while the OSDMap mainly records attributes such as epoch, fsid, pool information, and the IP addresses of the OSDs.
The leaf nodes are devices (i.e., OSDs); the other nodes are called bucket nodes. These buckets are virtual nodes that can be abstracted according to the physical structure. The tree has a single root node, called the root node; the intermediate virtual bucket nodes can be abstractions of data centers, server rooms, racks, hosts, and so on.
5.3.2 Data Distribution Strategy: Placement Rules
**The data distribution strategy Placement Rules has the following main features:**
a. From which node in the CRUSH Map to start the lookup
b. Which node type to use as the fault isolation domain
c. Search model for locating copies (breadth-first or depth-first)
rule replicated_ruleset # rule set naming, you can specify the rule set when creating the pool
{
ruleset 0 #ruleset number, just number them sequentially
type replicated #define the pool type as replicated (there is also an erasure mode)
min_size 1 #The minimum number of specified copies in the pool cannot be less than 1
max_size 10 #The maximum number of specified copies in the pool cannot be greater than 10
step take default #Find the bucket entry point, usually a bucket of type root
step chooseleaf firstn 0 type host #choose a host, and recursively choose a leaf node osd
step emit #end
}
5.3.3 Bucket Randomization Algorithm Types
- General (uniform) buckets: suitable when all child nodes have the same weight and items are rarely added or removed.
- List buckets: suitable for cluster expansion. Adding an item produces optimal data movement; finding an item has time complexity O(n).
- Tree buckets: lookup complexity is O(log n); the node_id of other nodes remains unchanged when leaf nodes are added or removed.
- Straw buckets: allow all items to "compete" fairly against the other items in a lottery-like manner. When locating a replica, each item in the bucket corresponds to a straw of random length, and the item with the longest straw wins (is selected). Adding or removing items only triggers recomputation, and data movement between subtrees is optimal. (A toy sketch of the straw idea follows.)
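The toy sketch below illustrates the straw idea: every item draws a pseudo-random "straw" whose expected length grows with its weight, and the longest straw wins. This is a simplification (closer in spirit to straw2) and not Ceph's exact bucket code; the OSD names and weights are invented.

```python
import hashlib
import math

def draw(pgid, item, weight):
    """Pseudo-random straw length for (pgid, item); heavier items draw longer straws."""
    h = int.from_bytes(hashlib.md5(("%d:%s" % (pgid, item)).encode()).digest()[:8], "big")
    u = (h + 1) / 2.0 ** 64                     # uniform value in (0, 1]
    return math.log(u) / weight                 # scaled by weight (straw2-style)

def straw_select(pgid, items):
    """Longest straw (largest draw) wins; the same pgid always picks the same item."""
    return max(items, key=lambda item: draw(pgid, item, items[item]))

osds = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}   # weights, e.g. relative disk sizes
print(straw_select(42, osds))
```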
5.4 CRUSH Algorithm Example
Description:
The cluster contains both SAS and SSD disks, and one business line has higher performance and availability requirements than the others. Can all the data of this high-priority business line be stored on SSD disks?
Regular users.
High-performance users.
Configuration rules.
6. Customized Ceph RBD QOS
6.1 Introduction to QOS
QoS (Quality of Service) originated in networking technology, where it is used to solve problems such as network latency and congestion and to provide better service for specified network traffic.
Problem:
The total I/O capability of a Ceph cluster is limited, in terms of bandwidth and IOPS. How do we prevent users from fighting over resources, guarantee the availability of all users' resources in the cluster, and also guarantee the resources of high-priority users? We need to allocate the limited I/O capability sensibly.
6.2 Ceph IO Operation Types
- ClientOp: read/write I/O requests from clients.
- SubOp: I/O requests between osd. It mainly includes inter-replica data read/write requests generated by client I/O, and I/O requests caused by data synchronization, data scanning, load balancing, etc.
- SnapTrim: snapshot data deletion. Once the snapshot delete command is sent from the client, the relevant metadata is deleted and the call returns directly; the real snapshot data is then deleted by a background thread. The deletion rate is controlled indirectly by controlling the rate of snaptrim.
- Scrub: used to find silent data errors in objects; Scrub scans object metadata, while deep Scrub scans the object as a whole.
- Recovery: data recovery and migration, triggered by processes such as cluster expansion/shrinking and OSD failure/rejoining.
6.3 Ceph Official QOS Principles
mClock is a timestamp-based I/O scheduling algorithm first proposed by VMware for centrally managed storage systems. (The official QoS module is currently half-finished.)
Basic idea:
- reservation: the minimum I/O resources guaranteed to the client.
- weight: the client's share of the shared I/O resources.
- limit: the upper bound, i.e., the maximum I/O resources available to the client (a simplified tagging sketch follows).
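The highly simplified sketch below only illustrates the mClock-style tagging of the three parameters; it is not Ceph's dmclock implementation and omits the scheduler's phase selection between reservation-based and weight-based dispatch. The client name and rates are invented.

```python
import time

class MclockClient:
    def __init__(self, reservation, weight, limit):
        self.r, self.w, self.l = float(reservation), float(weight), float(limit)
        self.r_tag = self.p_tag = self.l_tag = 0.0   # last assigned tags

    def tag_request(self, now):
        # Each tag advances by the inverse of its rate, but never lags behind "now".
        self.r_tag = max(self.r_tag + 1.0 / self.r, now)   # reservation: guaranteed floor
        self.p_tag = max(self.p_tag + 1.0 / self.w, now)   # weight: proportional share
        self.l_tag = max(self.l_tag + 1.0 / self.l, now)   # limit: hard ceiling
        return self.r_tag, self.p_tag, self.l_tag

# A "gold" client guaranteed 200 IOPS, capped at 1000 IOPS, with weight 2.
gold = MclockClient(reservation=200, weight=2, limit=1000)
print(gold.tag_request(time.time()))
```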
6.4 Customized QOS Principle
6.4.1 Introduction to the token bucket algorithm
A simple and effective QoS feature is implemented based on the token bucket (TokenBucket) algorithm, meeting the core needs of cloud platform users.
Basic idea:
- Tokens are put into the token bucket at a specific rate.
- Messages are first classified according to the preset matching rules, and messages that do not match the matching rules are sent directly without going through the token bucket.
- Messages that match the rules must be processed by the token bucket. When there are enough tokens in the bucket, the message can continue to be sent, and the number of tokens in the bucket is reduced according to the message length.
- When there are not enough tokens in the bucket, the message cannot be sent; it can only be sent once new tokens are generated in the bucket. This limits the message traffic to at most the rate at which tokens are generated, achieving the goal of rate limiting (a minimal sketch follows).
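A minimal token-bucket sketch matching the description above; the fill rate and capacity numbers are arbitrary, and this is an illustration rather than the librbd implementation.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, amount):
        """Return True if the request may be sent now, False if it must wait."""
        self._refill()
        if self.tokens >= amount:
            self.tokens -= amount
            return True
        return False

# e.g. cap an image at roughly 100 IOPS, allowing bursts of up to 100 requests
bucket = TokenBucket(rate=100, capacity=100)
if bucket.consume(1):
    pass  # dispatch the I/O to ImageRequest
```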
6.4.2 RBD token bucket algorithm flow
Steps:
- The user initiates a request, and an asynchronous I/O arrives at the Image.
- The request arrives in the ImageRequestWQ queue.
- The token bucket algorithm (TokenBucket) is applied when requests are dequeued from ImageRequestWQ.
- The request is rate-limited by the token bucket algorithm and then handed to ImageRequest for processing.
6.4.3 Framework Diagram of the RBD Token Bucket Algorithm
Existing framework diagram.
Framework diagram with the token bucket algorithm.
Author Information
Author: Li Hang
Bio: many years of experience in low-level development, with extensive experience in high-performance Nginx development and distributed Redis cluster caching; has been working on Ceph for about two years.
He has previously worked at 58 Tongcheng, Autohome, and the Youku Tudou Group.
Currently a Technical Specialist in the Dropshipping Infrastructure Operations department, mainly responsible for the distributed Ceph system.
Personal focus on technical areas: high-performance Nginx development, distributed caching, distributed storage.