Circular Replication Cluster Topology in ClickHouse

Data distribution refers to splitting the very large dataset into multiple shards (a smaller portion of the dataset) which are stored on different servers. Each shard holds and processes a part of the data, the query results from multiple shards are then combined together to give the final result.

Data replication refers to keeping a copy of the data on the other server nodes for ensuring availability in case of server node failure. This can also improve performance by allowing multiple servers to process the data queries in parallel.

Target:

1. Creating a Cluster in 3 Nodes with 3 Shards and 2 Replicas for each Shard.

2. Replica of each Shard will be created in Another Node in a Circular Manner. i.e Node1 Shard1 Replica1 will create another copy in Node2 as Shard1 Replica2

Procedure to achieve the above requirement

Cluster Architecture and Implementation

In this Architecture need to create “3 Databases for each Database” in Every Node. One Database used to create the tables for the shard data, second one is used to create the tables for replica and Third database is useful to create all distributed tables to access the data from all shards.

Clustername: sample_cluster

Databases Names: sample_shard / sample_replica / sample_dist

Nodes in Cluster: chnode1(192.168.1.11) / chnode2(192.168.1.12) / chnode3(192.168.13)

Procedure:

Step 1: Install Clickhouse Software in All the Nodes

https://clickhouse.com/docs/en/getting-started/install/

Step 2: Install Zookeeper Software in All the Nodes and Configure Zookeeper Cluster. Zookeeper software is required to process data to replicas.

Version: https://archive.apache.org/dist/zookeeper/zookeeper-3.5.6/apache-zookeeper-3.5.6-bin.tar.gz

https://phoenixnap.com/kb/install-apache-zookeeper

Step 3: Configure the Following Files in Every Node

1. /etc/clickhouse-server/config.d/cluster.xml

2. /etc/clickhouse-server/config.d/zookeeper.xml

3. /etc/clickhouse-server/config.d/macro.xml

Note: All the Configuration samples are given at the bottom

Step 4: Check the Configuration

SELECT * FROM system.macros m ;

SELECT * FROM system.clusters c WHERE cluster = 'sample_cluster';

Step 5: Create the Databases in All the Nodes in the Cluster

#chnode1 (192.168.1.11) chnode2(192.168.1.12) chnode3(192.168.1.13)

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS sample_shard"

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS sample_replica”

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS sample_dist"

Step 6: Create Tables in All Nodes with the following Options

#chnode1

clickhouse-client --query “create table if not exists sample_shard.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s1/x','r1') ORDER BY (n)"

clickhouse-client --query “create table if not exists sample_replica.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s3/x','r2') ORDER BY (n)"

clickhouse-client –query "create table if not exists sample_dist.x(n UInt8)ENGINE=Distributed('sample_cluster','','x',rand())"

#chnode2

clickhouse-client --query “create table if not exists sample_shard.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s2/x','r1') ORDER BY (n)"

clickhouse-client --query “create table if not exists sample_replica.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s1/x','r2') ORDER BY (n)"

clickhouse-client –query "create table if not exists sample_dist.x(n UInt8)ENGINE=Distributed('sample_cluster','','x',rand())"

#chnode3

clickhouse-client --query “create table if not exists sample_shard.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s3/x','r1') ORDER BY (n)"

clickhouse-client --query “create table if not exists sample_replica.x(n UInt8)ENGINE=ReplicatedMergeTree('/zookeeper/sample_cluster/sample_s2/x','r2') ORDER BY (n)"

clickhouse-client –query "create table if not exists sample_dist.x(n UInt8)ENGINE=Distributed('sample_cluster','','x',rand())"

The Zookeeper Files Configured in the Tables given below for easy understand and comparison

chnode1zookeeper

/zookeeper/sample_cluster/sample_s1/x','r1'

/zookeeper/sample_cluster/sample_s3/x','r2'

chnode2zookeeper

/zookeeper/sample_cluster/sample_s2/x','r1'

/zookeeper/sample_cluster/sample_s1/x','r2'

chnode3 zookeeper

/zookeeper/sample_cluster/sample_s3/x','r1'

/zookeeper/sample_cluster/sample_s2/x','r2'

/etc/clickhouse-server/config.d/cluster.xml

<?xml version="1.0"?>

<listen_host>::</listen_host>

<remote_servers>

<sample_cluster>

<shard>

<internal_replication>true</internal_replication>

<default_database>sample_shard</default_database>

<host>chnode1</host>

</replica>

<default_database>sample_replica</default_database>

<host>chnode2</host>

</replica>

</shard>

<shard>

<default_database>sample_shard</default_database>

<host>chnode2</host>

</replica>

<default_database>sample_replica</default_database>

<host>chnode3</host>

</replica>

</shard>

<shard>

<default_database>sample_shard</default_database>

<host>chnode3</host>

</replica>

<default_database>sample_replica</default_database>

<host>chnode1</host>

</replica>

</shard>

</sample_cluster>

</remote_servers>

</yandex>

/etc/clickhouse-server/config.d/zookeeper.xml

<?xml version="1.0"?>

<host>chnode1</host>

</node>

<host>chnode2</host>

</node>

<host>chnode3</host>

</node>

</zookeeper>

</yandex>

/etc/clickhouse-server/config.d/macro.xml

Note: The macro is usually server-specific, it is better to store the configuration in separate config files

chnode1

<?xml version="1.0"?>

<cluster01>sample_cluster</cluster01>

<replica01>chnode1_s1_r1</replica01>

<replica02>chnode1_s3_r2</replica02>

</macros>

</yandex>

chnode2

<?xml version="1.0"?>

<cluster01>sample_cluster</cluster01>

<replica01>chnode2_s2_r1</replica01>

<replica02>chnode2_s1_r2</replica02>

</macros>

</yandex>

chnode3

<?xml version="1.0"?>

<cluster01>sample_cluster</cluster01>

<replica01>chnode3_s3_r1</replica01>

<replica02>chnode3_s2_r2</replica02>

</macros>

</yandex>