What is the Gossip Protocol?
Gossip is a peer-to-peer communication protocol that nodes use to discover and share location and state information about the other nodes in a Cassandra cluster.
Understanding Cassandra Gossip in Depth
The gossiper is responsible for making sure every node in the system eventually knows important information about every other node's state, including nodes that are unreachable or not yet in the cluster when a given state change occurs. The gossip timer task runs every second. This means that in every one-second round a node initiates a gossip exchange with one to three other nodes (or zero if it is alone in the cluster). Information to gossip is wrapped in an ApplicationState object as a key/value pair. The information exchanged in a GossipMessage is made up of the structures described below.
HeartBeatState
Each node in Cassandra has one HeartBeatState associated with it, which consists of a generation and a version number.
The generation grows every time the node is started.
The version number is shared with the application states and guarantees ordering.
ApplicationState
Consists of a state and a version number (the same version number that is present in HeartBeatState) and represents the state of a single "component" or "element" within Cassandra.
For instance, the application state for "load information" could be (5.2, 45), which means that the node load is 5.2 at version 45.
EndPointState
Includes all ApplicationStates and the HeartBeatState for a certain endpoint (node). An EndPointState can include only one ApplicationState of each type, so if the EndPointState already includes, say, load information, new load information will overwrite the old one. The ApplicationState version number guarantees that an old value will not overwrite a newer one.
EndPointStateMap
An internal structure in the Gossiper that holds the EndPointState of every node it has heard about, including itself.
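The structures above can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual Java classes: the field names, the `put` and `max_version` helpers, the `"LOAD"` key, the address and the generation value are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class HeartBeatState:
    generation: int   # grows each time the node is started
    version: int = 0  # shared counter that orders state changes


@dataclass
class EndPointState:
    heartbeat: HeartBeatState
    # One entry per ApplicationState type: key -> (value, version).
    app_states: Dict[str, Tuple[object, int]] = field(default_factory=dict)

    def put(self, key: str, value: object, version: int) -> bool:
        """Store the value unless an equal or newer version is already held."""
        current = self.app_states.get(key)
        if current is not None and current[1] >= version:
            return False  # stale update: an old value must not overwrite a new one
        self.app_states[key] = (value, version)
        return True

    def max_version(self) -> int:
        """Largest version across the heartbeat and all application states."""
        versions = [self.heartbeat.version]
        versions += [ver for _, ver in self.app_states.values()]
        return max(versions)


# EndPointStateMap: endpoint address -> EndPointState (hypothetical address).
endpoint_state_map: Dict[str, EndPointState] = {}

state = EndPointState(HeartBeatState(generation=1700000000, version=44))
endpoint_state_map["10.0.0.1"] = state

accepted_first = state.put("LOAD", 5.2, 45)  # load is 5.2 at version 45
accepted_stale = state.put("LOAD", 4.0, 40)  # rejected: version 40 is older
accepted_newer = state.put("LOAD", 6.1, 50)  # accepted: replaces the LOAD entry
```

Keeping a single entry per key and comparing versions on every write is what gives the "new overwrites old, never the reverse" behaviour described above.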
The Gossip messaging takes place in a three-way handshake format.
GossipDigestSynMessage:
HostA sends out a GossipDigestSynMessage every second. It contains, for each endpoint HostA knows about, the largest version in its local state.
When HostB receives this message, it sorts the versions and compares them with its local ones. It then knows which of its local states are out of date and which are newer than the remote state on HostA. It then builds the following message to tell HostA:
GossipDigestAckMessage:
Basically, digest_list contains the states that HostB needs from HostA, and state_map contains the state information that HostA needs. In this example, HostB doesn't have the latest info about HostA, HostA doesn't have the latest info about HostB, and the HostC state is up to date on both hosts.
When HostA gets the message, it applies the state_map to its local states and then builds the following message from the entries in digest_list.
GossipDigestAck2Message
HostB receives the message and simply updates its local state with the information from state_map.
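The three messages can be simulated with a small sketch. It uses a deliberately simplified model that is an assumption of this example, not Cassandra's wire format: each endpoint maps to a (generation, version, payload) tuple, whole entries are sent instead of per-key deltas, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Simplified model: endpoint -> (generation, version, payload).
State = Dict[str, Tuple[int, int, str]]
Digest = Tuple[str, int, int]  # (endpoint, generation, max version)


def make_syn(local: State) -> List[Digest]:
    """SYN: digests of everything the sender knows about."""
    return [(ep, gen, ver) for ep, (gen, ver, _) in local.items()]


def handle_syn(local: State, syn: List[Digest]) -> Tuple[List[Digest], State]:
    """ACK: digest_list is what the receiver needs, state_map what the sender needs."""
    digest_list: List[Digest] = []
    state_map: State = {}
    for ep, gen, ver in syn:
        mine = local.get(ep)
        if mine is None:
            digest_list.append((ep, gen, 0))            # unknown endpoint: ask for it
        elif (gen, ver) > mine[:2]:
            digest_list.append((ep, mine[0], mine[1]))  # receiver is out of date
        elif (gen, ver) < mine[:2]:
            state_map[ep] = mine                        # sender is out of date
    return digest_list, state_map


def apply_states(local: State, incoming: State) -> None:
    """Keep an incoming entry only if it is newer than the local one."""
    for ep, entry in incoming.items():
        if ep not in local or entry[:2] > local[ep][:2]:
            local[ep] = entry


def handle_ack(local: State, digest_list: List[Digest], state_map: State) -> State:
    """Apply state_map locally, then build the ACK2 state_map from digest_list."""
    apply_states(local, state_map)
    return {ep: local[ep] for ep, _, _ in digest_list if ep in local}


def gossip_round(a: State, b: State) -> None:
    syn = make_syn(a)                             # HostA -> HostB
    digest_list, state_map = handle_syn(b, syn)   # HostB -> HostA
    ack2 = handle_ack(a, digest_list, state_map)  # HostA -> HostB
    apply_states(b, ack2)                         # HostB updates its local state


# HostA is stale about B, HostB is stale about A, C is up to date on both.
host_a: State = {"A": (1, 10, "a-new"), "B": (1, 3, "b-old"), "C": (1, 7, "c")}
host_b: State = {"A": (1, 6, "a-old"), "B": (1, 9, "b-new"), "C": (1, 7, "c")}
gossip_round(host_a, host_b)
```

After one round both hosts hold the same, newest entry for every endpoint, and nothing about C crosses the wire beyond its digest.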
So far so good.
So how does a new node know whom to start gossiping with? Cassandra has several seed provider implementations that provide a list of seed addresses to the new node, which starts gossiping with one of them right away. After its first round of gossip, it possesses cluster membership information about all the other nodes in the cluster and can then gossip with the rest of them.
Cassandra Failure detection
The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster. Configuring the phi_convict_threshold property adjusts the sensitivity of the failure detector. Lower values increase the likelihood that an unresponsive node will be marked as down.
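A minimal sketch of an accrual detector along these lines is shown below. It assumes exponentially distributed inter-arrival times, so that the suspicion level is the elapsed time divided by mean * ln(10); the class name, method names and the window size of 1000 samples are assumptions of this example rather than Cassandra's actual code.

```python
import math
from collections import deque
from typing import Optional


class ArrivalWindow:
    """Sliding window of gossip inter-arrival times for one remote node."""

    def __init__(self, size: int = 1000) -> None:
        self.intervals: deque = deque(maxlen=size)  # oldest samples fall out
        self.last_arrival: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival of a gossip message at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: grows the longer the node stays silent."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        return (now - self.last_arrival) / (mean * math.log(10))


def is_down(window: ArrivalWindow, now: float,
            phi_convict_threshold: float = 8.0) -> bool:
    """Convict the node once phi exceeds the configured threshold."""
    return window.phi(now) > phi_convict_threshold


window = ArrivalWindow()
for t in range(0, 11):  # heartbeats at t = 0, 1, ..., 10 seconds
    window.heartbeat(float(t))
```

With a lower `phi_convict_threshold` the same silence produces a conviction sooner, which is the sensitivity trade-off described above.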
As shown in the figure above, if HostA does not respond within a time interval of mean() * ln(10) * phi_convict_threshold, the host is marked as dead.
The default phi_convict_threshold is 8, and ln(10) x 8 = 2.3 x 8 = 18.4. With a mean inter-arrival time of 1 second (gossip runs every second), this means that even in the best case the failure detection delay is about 18 seconds; typically it is about 20 seconds (18.4 x 1.1 = 20.2).
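The arithmetic can be checked with a few lines of code. The helper name `detection_delay` is hypothetical, and the thresholds 5 and 12 are only illustrative values chosen to show how the delay scales.

```python
import math


def detection_delay(mean_interval_s: float,
                    phi_convict_threshold: float = 8.0) -> float:
    """Silence (in seconds) after which a node is marked dead."""
    return mean_interval_s * math.log(10) * phi_convict_threshold


# Delay for a 1-second mean inter-arrival time at several thresholds.
delays = {t: round(detection_delay(1.0, t), 1) for t in (5, 8, 12)}

# Delay at the default threshold when the observed mean is 1.1 seconds.
typical = round(detection_delay(1.1), 1)
```

Here `delays` comes out as 11.5, 18.4 and 27.6 seconds for thresholds 5, 8 and 12, and `typical` is 20.3 seconds because the exact value of ln(10) is slightly above 2.3.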