MySQL Cluster uses the new
NDB Cluster storage engine to enable running
several MySQL servers in a cluster.
NDB Cluster storage engine is available in the BitKeeper from MySQL
release 4.1.2, and in binary releases from MySQL-Max 4.1.3.
Currently, supported operating systems are Linux, Mac OS X, and Solaris.
We are working to make
NDB Cluster run on all of
the operating systems that MySQL itself also runs on, including Windows.
This chapter represents work in progress. Other documents describing MySQL Cluster can be found at http://www.mysql.com/cluster/ and http://dev.mysql.com/doc/#cluster.
You may also wish to subscribe to the MySQL Cluster mailing list. See http://lists.mysql.com/.
MySQL Cluster is a new technology to enable clustering of in-memory
databases in a shared-nothing system. The shared-nothing architecture allows
the system to work with very inexpensive hardware, without any specific
requirement on hardware or software. It also does not have any single
point of failure because each component has its own memory and disk.
MySQL Cluster is an integration of the standard MySQL server with an
in-memory clustered storage engine, called
NDB. In our
documentation, the term
NDB refers to the storage engine specific
part of the setup, whereas
MySQL Cluster refers to the combination
of MySQL and the new storage engine.
A MySQL Cluster consists of computers with a set of processes executing several MySQL servers, storage nodes for NDB Cluster, management servers and possibly also specialized data access programs. All these programs work together to form MySQL Cluster. When data is stored in the NDB Cluster storage engine, the tables are stored in the storage nodes for NDB Cluster. Those tables are directly accessible also from all other MySQL servers in the cluster. Thus, if one application updates the salary of an employee, all other MySQL servers that query this data can see it immediately.
The data stored in the storage nodes for MySQL Cluster can be mirrored and can handle failures of storage nodes with no impact other than that a number of transactions are aborted due to losing the transaction state. This should cause no problems bcause transactional applications should be written to handle transaction failure.
By bringing MySQL Cluster to the open source world, MySQL makes clustered data management with high availability, high performance, and scalability available to all who need it.
NDB is an in-memory storage engine offering high-availability
and data-persistence features.
NDB can (although this requires extensive knowledge) be used as an
independent database system, supporting the traditional relational
data model with full ACID transactions.
NDB storage engine can be configured with a range of fail-over and
load-balancing options, but it is easiest to start with the storage
engine at the cluster level. The
NDB storage engine of MySQL Cluster
contains a complete set of data, dependent only on other data within
the cluster itself.
A MySQL Cluster may also replicate clustered data to other MySQL Clusters,
but this is a complex configuration. Here, we will focus on how to set up
a single MySQL Cluster consisting of an
NDB storage engine and some
The cluster part of MySQL Cluster is currently configured independently
from the MySQL servers. In a MySQL Cluster, each part of the cluster
is considered to be a
Note: A node is in many contexts a computer, but for MySQL Cluster it is a process. There can be any number of nodes on a single computer.
Each node has a type, and there can be multiple nodes in the MySQL Cluster of each type. In a minimal MySQL Cluster configuration, there will be at least three nodes:
MGM) node. The role of this type of node is to manage the other nodes within the MySQL Cluster, such as providing configuration data, starting and stopping nodes, running backup, and so forth. Because this node type manages the configuration of the other nodes, a node of this type must always be started first, before any other node. With a running cluster, the MGM node does necessarily have to be running all the time.
DB) node. This is the type of node that manages and stores the database itself. There are as many DB nodes as you have replicas times the number of fragments. That is, with two fragments, each with two replicas, you need four DB nodes. It is not necessary to have more than one replica, so a minimal MySQL Cluster may contain just one DB node.
API) node. This is the client node that will access the cluster, and in the case of MySQL Cluster, these are traditional MySQL servers with a new storage engine
NDB Clusterwhich enables access to clustered tables. Basically, the MySQL daemon is a client of the NDB cluster. If you have applications that use the NDB API directly, then these are considered API nodes too.
We refer to these cluster processes as nodes in the cluster. Setting up the configuration of the cluster involves configuring each individual node in the cluster and setting up each individual communication link between the nodes in the cluster. MySQL Cluster currently is designed with the intention that storage nodes are homogenous in terms of processor power, memory space, and communication bandwidth. Also, to enable one point of configuration, it was decided to move the entire cluster configuration to one configuration file.
The management server manages the cluster configuration file and the cluster log. All nodes in the cluster contact the management server to retrieve their part of the configuration, so they need a way to determine where the management server resides. When interesting events occur in the storage nodes, they transfer the information of these events to the management server, which then writes the information to the cluster log.
In addition, there are any number of clients to the cluster. These are of two types.
A MySQL server that is part of MySQL Cluster differs in only one aspect
from what we are used to. It has an additional storage engine
NDBCLUSTER), which is initially disabled.
Except for this, the MySQL server is not much different than what we are
used to from previous MySQL releases, except any other new 4.1 features,
of course. By default, the server is configured with the
engine disabled (so as not to allocate resources needlessly). To enable
NDB, you need to modify `my.cnf'.
Also, because the MySQL server is an API client to the
engine, the minimal configuration information needed to access the MGM node
from the MySQL server must be set. When this is done, then all MGM nodes
and DB nodes must be up and running before starting the MySQL server.
One MGM node is sufficient to start.
NDB Cluster is available in binary distributions from MySQL-Max 4.1.3
If you choose to build from a source tarball or the MySQL 4.1 BitKeeper tree,
make sure you use the
--with-ndbcluster option when running
You could also simply use the
BUILD/compile-pentium-max build script.
This script also includes OpenSSL, so you either have to get OpenSSL
of modify the build script to exclude it.
Apart from these things, you can just follow the standard instructions to build your own binaries, run the tests and perform the installation procedure. See section 2.3.3 Installing from the Development Source Tree.
You need to have all the MGM and DB nodes up and running first, and this will probably be the most time-consuming part of the configuration (because we will assume that you are already familiar with MySQL to a certain extent). As for the MySQL configuration and the `my.cnf' file, this is very straightforward, and this section only covers the differences from configuring MySQL without clustering.
This section describes how to quickly configure and start the simplest possible MySQL Cluster setup. This will make you familiar with the basic concepts. Then, read the other sections and design your desired setup.
One directory must be created. The example uses `/var/lib/mysql-cluster'.
Execute the following command as
shell> mkdir /var/lib/mysql-cluster
In this directory, create a file `config.ini' with the
following contents. Substitute
DataDir values that
are appropriate for your system.
# file "config.ini" - showing minimal setup consisting of 1 DB node, # 1 management server, and 3 MySQL servers. # The empty default sections are not needed, and are shown only for clarity. # Storage nodes are required to provide a host name but MySQL Servers # are not. Thus the configuration can be dynamic as to setting up the # MySQL Servers. # If you don't know the host name of your machine, use localhost. # The DataDir parameter also has a default value, but it is recommended to # set it explicitly. [NDBD DEFAULT] NoOfReplicas: 1 [MYSQLD DEFAULT] [NDB_MGMD DEFAULT] [TCP DEFAULT] [NDB_MGMD] HostName = myhost.example.com [NDBD] HostName = myhost.example.com DataDir: /var/lib/mysql-cluster/ [MYSQLD] [MYSQLD] [MYSQLD]
You can now start the management server. Do this as follows:
shell> cd /var/lib/mysql-cluster shell> ndb_mgmd
Then start a single DB node in this simple setup by running the
program. If this is the very first time you are starting
shell> ndbd --initial
ndbd starts, do not use that option:
ndbd will look for the management server at
localhost at port 2200.
Please note that if you installed from the binary tarball, you will need
to explicitly specify the path of the
servers. They would be in the `/usr/local/mysql/bin' directory.
Finally, go to the MySQL data directory (this might be a location such as
`/usr/local/mysql/data'). Ensure that the `my.cnf' file contains
the option necessary to enable the
NDB Cluster storage engine:
You can now start the MySQL server as usual:
shell> mysqld_safe --user=mysql &
Wait a moment to make sure the MySQL server is running properly. If you see a notice ``mysql ended'', check the server's `.err' file to find out what went wrong.
If all went well so far, you now can start using the cluster:
shell> mysql Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 1 to server version: 4.1.3-beta-max-log Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> SHOW STORAGE ENGINES; +------------+---------+-------------------------------------------... | Engine | Support | Comment ... +------------+---------+-------------------------------------------... ... | NDBCLUSTER | YES | Clustered, fault tolerant memory based tables | NDB | YES | Alias for NDBCLUSTER ... ... mysql> USE test; Database changed mysql> CREATE TABLE ctest (i INT) ENGINE=NDBCLUSTER; Query OK, 0 rows affected (0.59 sec) mysql> SHOW CREATE TABLE ctest \G *************************** 1. row *************************** Table: ctest Create Table: CREATE TABLE `ctest` ( `i` int(11) default NULL ) ENGINE=ndbcluster DEFAULT CHARSET=latin1 1 row in set (0.00 sec)
If you like to check that your nodes was set-up properly you can then start the management client.
Try using the
SHOW command to get an overview of the cluster.
NDB> show Cluster Configuration --------------------- 1 [ndbd] node(s) [ndbd] id=2 (Version 3.5.0, Nodegroup: 0, Master) 1 [ndb_mgmd] node(s) [ndb_mgmd] id=1 (Version 3.5.0) 3 [mysqld] node(s) [mysqld] id=3 (Version 3.5.0) [mysqld] id=4 (not connected, accepting connect from any host) [mysqld] id=5 (not connected, accepting connect from any host)
Clustering is yours. Enjoy.
The configuration of MySQL Cluster is contained in a configuration file read by the management server and distributed to all processes involved in the cluster. This file contains a description of all involved nodes in the cluster, configuration parameters for the storage nodes and configuration parameters for all connections between the nodes in the cluster.
Currently the configuration file is in INI format, and is named
`config.ini' by default. It is placed in the start directory of
ndb_mgmd (the management server).
Preset defaults are available for most parameters, and defaults can be
specified in the `config.ini' file itself. To create a default value
section, add the word
DEFAULT to the section name. For example, DB
nodes are coconfigured using
[DB] sections. If all DB nodes use the
same data memory size, and this size is not the same as the preset default
size, create a
[DB DEFAULT] section that contains a
line to specify the default data memory size for all DB nodes.
The INI format consists of sections preceded by section headings (surrounded by square brackets), followed by settings and values. One deviation from the standard format being that the setting name and value can be separated by a colon (`:') in addition to the equal sign (`='), and another is that the sections aren't unique. Instead, unique entries (such as two nodes of the same type) are identified by a unique ID.
A minimalistic configuration file needs to define the computers involved in the cluster and which nodes are involved in the cluster and what computers these nodes are placed on.
An example of a still fairly minimalistic configuration file for a cluster with one management server, two storage nodes and two MySQL servers is shown below:
# file "config.ini" - 2 DB nodes and 2 mysqld # This file is placed in the start directory of ndb_mgmd, # the management server. # The first MySQL Server can be started from any host and the second # can only be started at the host mysqld_5.mysql.com [NDBD DEFAULT] NoOfReplicas: 2 [MGM] Hostname: ndb_mgmd.mysql.com [NDBD] HostName: ndbd_2.mysql.com DataDir: /var/lib/mysql-cluster/ [NDBD] HostName: ndbd_3.mysql.com DataDir: /var/lib/mysql-cluster/ [MYSQLD] [MYSQLD] HostName: mysqld_5.mysql.com
There are six different sections in the config file.
defines the computers in the cluster.
[DB] defines the storage nodes
in the cluster.
[API] defines the MySQL server nodes in the cluster.
[MGM] defines the management server node in the cluster.
defines TCP/IP connections between nodes in the cluster, TCP/IP is the
default connection mechanism between two nodes.
shared-memory connections between nodes. This is only available in binaries
that have been built with the
For each section one can define DEFAULTs. Parameters are case insensitive as of MySQL 4.1.5.
[COMPUTER] section has no real significance other than serving as
a way to avoid the need of defining host names for each node in the system.
All parameters mentioned here are mandatory.
[MGM] section (with its alias
[NDB_MGMD]) is used to
configure the behavior of the management server in various aspects. The
mandatory parameter is either
All other parameters can be left out and will in that case receive the
FILE. One assigns all of these three in one string with each part separated by a ';'.
CONSOLEmeans putting it to stdout, no more parameters are needed.
SYSLOGmeans sending it to a syslog facility. It is necessary to specify the facility for this parameter. The poossible facilities are
local7. Note that every facility is not necessarily supported by every operating system.
FILEmeans sending the cluster log to a regular file on the machine. It is necessary to specify the name of this file, the maximum size of the file until a new file is opened and the old is renamed with filename extended by `.x' where
xis the next number not used yet on the file. It is also necessary to specify maximum number of rolled files.
FILE:filename=cluster.log,maxsize=1000000,maxfiles=6Multiple log destinations can be specified as in the following example.
CONSOLE;SYSLOG:facility=local0;FILE:filename=/var/log/mgmdThe default value of this parameter is
[DB] section (with its alias
[NDBD]) is used to configure
the behavior of the storage nodes. There are many parameters specified that
controls the buffer sizes, pool sizes, timeout parameters and so forth. The
only mandatory parameter is either
HostName and the parameter
NoOfReplicas which need to be
defined in the
[DB DEFAULT] section. Most parameters should be set in
[DB DEFAULT] section. Only parameters explicitly stated as
possible to have local values are allowed to be changed in the
ExecuteOnComputer needs to be
defined in the local
Id value, or the node identity of the storage node can now be
allocated at start of the node. It is still possible to assign a node ID in
the config file.
For each parameter it is possible to use k, M, or G as suffixes to indicate units of 1024, (1024*1024), or (1024* 1024*1024). For example, 100k means 102400. Parameters and values are currently case sensitive.
[DB DEFAULT]section because it is a global parameter. It defines the number of replicas for each table stored in the cluster. This parameter also specifies the size of node groups. A node group is a set of nodes that all store the same information. Node groups are formed implicitly. The first node group is formed by the storage nodes with the lowest node identities. And the next by the next lowest node identities. As an example presume we have 4 storage nodes and
NoOfReplicasis set to 2. The four storage nodes have node IDs 2, 3, 4 and 5. Then the first node group will be formed by node 2 and node 3. The second node group will be formed by node 4 and node 5. It is important to configure the cluster in such a manner such that nodes in the same node groups are not placed on the same computer. This would cause a single HW failure to cause a cluster crash. If no node identities are provided then the order of the storage nodes will be the determining factor for the node group. The actual node group assigned will be printed by the
SHOWcommand in the management client. There is no default value and the maximum number is 4.
DataDir. The directory must be created before starting the
ndbdprocess. If you use the recommended directory hierarchy, you will use a directory `/var/lib/mysql-cluster'. Under this directory a directory `ndb_2_fs' will be created (if node ID was 2) which will be the file system for that node.
FileSystemPath/`BACKUP' will be choosen.
IndexMemory are the parameters that specify
the size of memory segments used to store the actual records and their
indexes. It is important to understand how
IndexMemory are used to understand how to set these parameters.
For most uses, they need to be updated to reflect the usage of the
DataMemorywill be allocated in memory so it is important that the machine contains enough memory to handle the
DataMemoryis used to store two things. It stores the actual records. Each record is currently of fixed size. So
VARCHARcolumns are stored as fixed size columns. There is an overhead on each record of 16 bytes normally. Additionally each record is stored in a 32KB page with 128 byte page overhead. There will also be a small amount of waste for each page because records are only stored in one page. The maximum record size for the columns currently is 8052 bytes. The
DataMemoryis also used to store ordered indexes. Ordered indexes uses about 10 bytes per record. Each record in the table is always represented in the ordered index. The
DataMemoryconsists of 32KB pages. These pages are allocated to partitions of the tables. Each table is normally partitioned with the same number of partitions as there are storage nodes in the cluster. Thus for each node there are the same number of partitions (=fragments) as the
NoOfReplicasis set to. Once a page has been allocated to a partition it is currently not possible to bring it back to the pool of free pages. The method to restore pages to the pool is by deleting the table. Performing a node recovery also will compress the partition because all records are inserted into an empty partition from another live node. Another important aspect is that the
DataMemoryalso contains UNDO information for records. For each update of a record a copy record is allocated in the
DataMemory. Also each copy record will also have an instance in the ordered indexes of the table. Unique hash indexes are updated only when the unique index columns are updated and in that case a new entry in the index table is inserted and at commit the old entry is deleted. Thus it is necessary also to allocate memory to be able to handle the largest transactions which are performed in the cluster. Performing large transactions has no advantage in MySQL Cluster other than the consistency of using transactions which is the whole idea of transactions. It is not faster and consumes large amounts of memory. The default
DataMemorysize is 80MB. The minimum size is 1MB. There is no maximum size, but in reality the maximum size has to be adapted so that the process doesn't start swapping when using the maximum size of the memory.
IndexMemoryis the parameter that controls the amount of storage used for hash indexes in MySQL Cluster. Hash indexes are always used for primary key indexes, unique indexes, and unique constraints. Actually when defining a primary key and a unique index there will be two indexes created in MySQL Cluster. One index is a hash index which is used for all tuple accesses and also for lock handling. It is also used to ensure unique constraints. The size of the hash index is 25 bytes plus the size of the primary key. For primary keys larger than 32 bytes another 8 bytes is added for some internal references. Thus for a table defined as
CREATE TABLE example ( a INT NOT NULL, b INT NOT NULL, c INT NOT NULL, PRIMARY KEY(a), UNIQUE(b) ) ENGINE=NDBCLUSTER;We will have 12 bytes overhead (having no nullable columns saves 4 bytes of overhead) plus 12 bytes of data per record. In addition we will have two ordered indexes on a and b consuming about 10 bytes each per record. We will also have a primary key hash index in the base table with roughly 29 bytes per record. The unique constraint is implemented by a separate table with b as primary key and a as a column. This table will consume another 29 bytes of index memory per record in the table and also 12 bytes of overhead plus 8 bytes of data in the record part. Thus for one million records, we will need 58MB of index memory to handle the hash indexes for the primary key and the unique constraint. For the
DataMemorypart we will need 64MB of memory to handle the records of the base table and the unique index table plus the two ordered index tables. The conclusion is that hash indexes takes up a fair amount of memory space but in return they provide very fast access to the data. They are also used in MySQL Cluster to handle uniqueness constraints. Currently the only partitioning algorithm is hashing and the ordered indexes are local to each node and can thus not be used to handle uniqueness constraints in the general case. An important point for both
DataMemoryis that the total database size is the the sum of all
IndexMemoryin each node group. Each node group is used to store replicated information, so if there are four nodes with 2 replicas there will be two node groups and thus the total
DataMemoryavailable is 2*
DataMemoryin each of the nodes. Another important point is about changes of
IndexMemory. First of all, it is highly recommended to have the same amount of
IndexMemoryin all nodes. Since data is distributed evenly over all nodes in the cluster the size available is no better than the smallest sized node in the cluster times the number of node groups.
IndexMemorycan be changed, but it is dangerous to decrease them because that can easily lead to a node that will not be able to restart or even a cluster not being able to restart since there is not enough memory space for the tables needed to restore into the starting node. Increasing them should be quite okay, but it is recommended that such upgrades are performed in the same manner as a software upgrade where first the configuration file is updated, then the management server is restarted and then one storage node at a time is restarted by command. More
IndexMemoryis not used due to updates but inserts are inserted immediately and deletes are not deleted until the transaction is committed. The default
IndexMemorysize is 18MB. The minimum size is 1MB.
The next three parameters are important because they affect the number of
parallel transactions and the sizes of transactions that can be handled by
MaxNoOfConcurrentTransactions sets the number of
parallel transactions possible in a node and
MaxNoOfConcurrentOperations sets the number of records that can be
in update phase or locked simultaneously.
Both of these parameters and particularly
MaxNoOfConcurrentOperations are likely targets for users setting
specific values and not using the default value. The default value is set
for systems using small transactions and to ensure not using too much
memory in the default case.
MaxNoOfConcurrentOperationswill always be used to calculate the number of operation records in the transaction coordinator part of the node. It is also important to have an idea of the memory requirements for those operation records. In MySQL 4.1.5, operation records consume about 1KB per record. This figure will shrink in future 5.x versions.
MaxNoOfConcurrentOperationswhich fits systems with many simultaneous, not very large transactions. If the configuration needs to handle one very large transaction at a time and there are many nodes then it is a good idea to configure this separately.
The next set of parameters are used for temporary storage in the midst of executing a part of a query in the cluster. All of these records will have been released when the query part is completed and is waiting for the commit or rollback.
Most of the defaults for these parameters will be okay for most users. Some high-end users might want to increase those to enable more parallelism in the system and some low-end users might want to decrease them to save memory.
MaxNoOfConcurrentOperations. The default value of this parameter is 8192. Only in rare cases of extremely high parallelism using unique hash indexes should this parameter be necessary to increase. To decrease could be performed for memory savings if the DBA is certain that such high parallelism is not occurring in the cluster.
MaxNoOfFiredTriggersis 4000. Normally this value should be sufficient for most systems. In some cases it could be decreased if the DBA feels certain the parallelism in the cluster is not so high. This record is used when an operation is performed that affects a unique hash index. Updating a column that is part of a unique hash index or inserting/deleting a record in a table with unique hash indexes will fire an insert or delete in the index table. This record is used to represent this index table operation while its waiting for the original operation that fired it to complete. Thus it is short lived but can still need a fair amount of records in its pool for temporary situations with many parallel write operations on a base table containing a set of unique hash indexes.
ZATTRBUF_FILESIZEin Dbtc.hpp. A similar buffer for key info exists which contains 4000*16 bytes, 62.5KB of buffer space. The parameter in this case is
Dbtcis the module for handling the transaction coordination. Similar parameters exist in the
Dblqhmodule taking care of the reads and updates where the data is located. In `Dblqh.hpp' with
ZATTRINBUF_FILESIZEset to 10000*128 bytes (1250KB) and
ZDATABUF_FILE_SIZE, set to 10000*16 bytes (roughly 156KB) of buffer space. No known instances of that any of those compile time limits haven't been big enough has been reported so far or discovered by any of our extensive test suites. The default size of the
MaxNoOfConcurrentScansis 256. The maximum value is 500. This parameter will always specify the number of scans possible in the transaction coordinator. If the number of local scan records is not provided it is calculated as the product of
MaxNoOfConcurrentScansand the number of storage nodes in the system.
ScanBatchSizedefined in the API nodes.
Out of log file space temporarily. This condition will prevail until a checkpoint has completed and the log tail can be moved forward.
The next set of parameters defines the pool sizes for metadata objects. It is necessary to define the maximum number of attributes, tables, indexes, and trigger objects used by indexes, events and replication between clusters.
BLOBdata type an extra table is used to store most of the
BLOBdata. These tables also must be taken into account when defining the number of tables. The default value of this parameter is 128. The minimum is 8 and there is no maximum. There are internal limitations though that limits the maximum number of tables to 1600. Each table object consumes around 20KB in each node.
USING HASHoption in the unique index definition. The default value is 64. Each index will consume around 15KB per node.
MaxNoOfUniqueHashIndexesinstead. This parameter is only used by unique hash indexes. There needs to be one record in this pool for each unique hash index defined in the cluster. The default value of this parameter is 128.
There is a set of boolean parameters affecting the behavior of storage nodes. Boolean parameters can be specified to true by setting it to Y or 1 and to false by setting it to N or 0.
Diskless, in this case even the tables doesn't exist anymore after a crash. Enabling this feature can be done by either setting it to Y or 1. When this feature is enabled, backups will be performed but will not be stored because there is no "disk". In future releases it is likely to make the backup diskless a separate configurable parameter. The default is that this feature is not enabled.
There are quite a few parameters specifying timeouts and time intervals between various actions in the storage nodes. Most of the timeouts are specified in milliseconds with a few exceptions which will be mentioned below.
StartPartialTimeoutbut is still in a possibly partitioned state one waits until also this timeout has passed. The default timeout is 60000 milliseconds (60 seconds).
NoOfDiskPagesToDiskAfterRestartACC. This parameter handles the limitation of writes from the
DataMemory. So this parameter specifies how quickly local checkpoints will be executed. This parameter is important in connection with
IndexMemory. The default value is 40 (3.2MB of data pages per second).
NoOfDiskPagesToDiskAfterRestartTUPbut limits the speed of writing index pages from
IndexMemory. The default value of this parameter is 20 (1.6MB per second).
NoOfDiskPagesToDiskAfterRestartACC, only it does it for local checkpoints executed in the node as part of a local checkpoint when the node is restarting. As part of all node restarts a local checkpoint is always performed. Since during a node restart it is possible to use a higher speed of writing to disk because fewer activities are performed in the node due to the restart phase. This parameter handles the
DataMemorypart. The default value is 40 (3.2MB per second).
IndexMemorypart of local checkpoint. The default value is 20 (1.6MB per second).
A number of new configuration parameters were introduced in MySQL 4.1.5. These correspond to values that previously were compile time parameters. The main reason for this is to enable the advanced user to have more control of the size of the process and adjust various buffer sizes according to his needs.
All of these buffers are used as front-ends to the file system when
writing log records of various kinds to disk. If the node runs with
Diskless then these parameters can most definitely be set to their
minimum values because all disk writes are faked as okay by the file system
abstraction layer in the
NDB storage engine.
NDBstorage engine uses a recovery scheme based on a consistent checkpoint together with an operational REDO log. In order to produce a consistent checkpoint without blocking the entire system for writes, UNDO logging is done while performing the local checkpoint. The UNDO logging is only activated on one fragment of one table at a time. This optimization is possible because tables are entirely stored in main memory. This buffer is used for the updates on the primary key hash index. Inserts and deletes rearrange the hash index and the
NDBstorage engine writes UNDO log records that map all physical changes to an index page such that they can be undone at a system restart. It also logs all active insert operations at the start of a local checkpoint for the fragment. Reads and updates only set lock bits and update a header in the hash index entry. These changes are handled by the page write algorithm to ensure that these operations need no UNDO logging. This buffer is 2MB by default. The minimum value is 1MB. For most applications this is good enough. Applications doing extremely heavy inserts and deletes together with large transactions using large primary keys might need to extend this buffer. If this buffer is too small, the
NDBstorage engine issues the internal error code 677 which will be translated into "Index UNDO buffers overloaded".
UndoIndexBufferbut is used for the data part. This buffer is used during local checkpoint of a fragment and inserts, deletes, and updates use the buffer. Since these UNDO log entries tend to be bigger and more things are logged, the buffer is also bigger by default. It is set to 16MB by default. For some applications this might be too conservative and they might want to decrease this size, the minimum size is 1MB. It should be rare that applications need to increase this buffer size. If there is a need for this it is a good idea to check if the disks can actually handle the load that the update activity in the database causes. If they cannot then no size of this buffer will be big enough. If this buffer is too small and gets congested, the
NDBstorage engine issues the internal error code 891 which will be translated to "Data UNDO buffers overloaded".
NDBstorage engine issues the internal error code 1221 which will be translated into "REDO log buffers overloaded".
For cluster management, it is important to be able to control the amount of log messages sent to stdout for various event types. The possible events will be listed in this manual soon. There are 16 levels possible from level 0 to level 15. Setting event reporting to level 15 means receiving all event reports of that category and setting it to 0 means getting no event reports in that category.
The reason why most defaults are set to 0 and thus not causing any output to stdout is that the same message is sent to the cluster log in the management server. Only the startup message is by default generated to stdout.
A similar set of levels can be set in management client to define what levels to record in the cluster log.
There is a set of parameters defining memory buffers that are set aside for online backup execution.
BackupWriteSizeparameter. When sending data to the disk, the backup can continue filling this buffer until it runs out of buffer space. When running out of buffer space, it will simply stop the scan and wait until some disk writes return and thus free up memory buffers to use for further scanning. The default value is 2MB.
BackupDataBufferSizeexcept that when this part runs out of buffer space, it causes the backup to fail due to lack of backup buffers. Thus the size of this buffer must be big enough to handle the load caused by write activities during the backup execution. The default parameter should be big enough. Actually it is more likely that a backup failure is caused by a disk not able to write as quickly as it should. If the disk subsystem is not dimensioned for the write load caused by the applications this will create a cluster which will have great difficulties to perform the desired actions. It is important to dimension the nodes in such a manner that the processors becomes the bottleneck rather than the disks or the network connections. The default value is 2MB.
BackupLogBufferSize. The default value is 4MB.
[API] section (with its alias
[MYSQLD]) defines the
behavior of the MySQL server. No parameter is mandatory. If no computer or
host name is provided, then any host can use this API node.
TCP/IP is the default transport mechanism for establishing connections in MySQL Cluster. It is actually not necessary to define any connection because there will be a one connection setup between each of the storage nodes, between each storage node, and all MySQL server nodes and between each storage node and the management server.
It is only necessary to define a connection if it is necessary to change
the default values of the connection. In that case it is necessary to
define at least
NodeId2 and the parameters to
It is also possible to change the default values by setting the parameters
[TCP DEFAULT] section.
[TCP DEFAULT]section normally. This parameter should no longer be used. Use the parameter ServerPort on storage nodes instead.
Shared memory segments are currently supported only for special builds of
MySQL Cluster using the
implementation will most likely change. When defining shared memory as the
connection method it is necessary to define at least
ShmKey. All other parameters have default values
that will work out fine in most cases.
There are four processes that are important to know about when using MySQL Cluster. We will cover how to work with those processes, which options to use when starting and so forth.
mysqld is the traditional MySQL server process. To be used with
MySQL Cluster it needs to be built with support for the NDB Cluster storage
engine. If the
mysqld binary has been built in such a manner, the
NDB Cluster storage engine is still disabled by default.
To enable the NDB Cluster storage engine there are two ways. Either use
--ndbcluster as a startup option when starting
insert a line with
ndbcluster in the
[mysqld] section of your
An easy way to verify that your server runs with support for the
NDB Cluster storage engine is to issue the command
SHOW ENGINGS from a
You should see
YES for the row listing
If you see
NO, you are not running a
mysqld that is compiled
NDB Cluster support enabled. If you see
DISABLED, then you
need to enable it in the
my.cnf configuration file.
The MySQL server needs to know how to get the configuration of the cluster. To access this configuration, it needs to know three things:
The node ID can be skipped from MySQL 4.1.5 on, because a node ID can be dynamically allocated.
There are currently three possible ways to provide this information to the
mysqld process. The first option is to include this information in a
file called `Ndb.cfg'. This file should reside in the data directory
of the MySQL Server. The second option is to set an environment variable
NDB_CONNECTSTRING. The third option is to set the connect
string as a
connect-string either when
mysqld config file. The string is the same in both cases. If no
information is supplied then it will default to "host=localhost:2200".
shell> mysqld --ndb-connectstring="host=ndb_mgmd.mysql.com:2200"
ndb_mgmd.mysql.com is the host where the management server resides,
and it is listening to port 2200.
With this setup the MySQL server will be a full citizen of MySQL Cluster and will be fully aware of all storage nodes in the cluster and their status. It will setup connection to all storage nodes and will be able to use all storage nodes as transaction coordinator and to access their data for reading and updating.
ndbd, the Storage Engine Node Process
ndbd is the process that is used to handle all the data in the
tables using the NDB Cluster storage engine. This is the process that
contains all the logic of distributed transaction handling, node recovery,
checkpointing to disk, online backup, and lots of other functionality.
In a cluster there is a set of
ndbd processes cooperating in
handling the data. These processes can execute on the same computer or on
different computers, in a completely configurable manner.
Before MySQL 4.1.5,
ndbd process should start from a separate
directory. The reason for this was that
ndbd generates a set of log
files in its starting directory.
In MySQL 4.1.5, this was changed such that the files are placed in the
directory specified by
DataDir in the configuration file. Thus
ndbd can be started from anywhere.
These log files are (the 2 is the node ID):
ndbdprocess has encountered and a smaller error string and reference to a trace file for this crash. An entry could like this:
Date/Time: Saturday 31 January 2004 - 00:20:01 Type of error: error Message: Internal program error (failed ndbrequire) Fault ID: 2341 Problem data: DbtupFixAlloc.cpp Object of reference: DBTUP (Line: 173) ProgramName: NDB Kernel ProcessID: 14909 TraceFile: ndb_2_trace.log.2 ***EOM***
MySQL Cluster Troubleshooting. There can be a configurable number of those trace files in the directory before old files are overwritten. 1 in this context is simply the number of the trace file.
ndbdprocess when executing as a daemon process. 2 in this context is the node ID. This file only exists when starting
ndbdas a daemon because then stdout and stderr are redirected to this file.
ndbdas a daaemon process.
ndbdwhere it is possible to trace all incoming, outgoing and internal messages with their data in the
It is recommended to not use a directory mounted through NFS because in some environments that can cause problems with the lock on the pid-file remaining even after the process has stopped.
Also when starting the
ndbd process it is necessary to specify which
node ID the process is to use, the hostname of the management server and the
port it is listening to. Again there are three ways of specifying this
information. Either in a string in the file `Ndb.cfg', this file
should be stored in the starting directory of the
ndbd process. The
second option is to set the environment variable
before starting the process. The third option is to use a command parameter
ndbd starts it will actually start two processes. The starting
process is called the "angel" and its only job is to discover when the
execution process has completed, and then restart the
ndbd process if
configured to do so. Thus if one attempts to kill
ndbd through the
kill command in Unix, it is necessary to kill both processes. A
more proper way to handle the stopping of
ndbd processes is to use
the management client and stop the process from there.
The execution process uses one thread for all activities in reading,
writinga, and scanning data and all other activities. This thread is designed
with asynchronous programming so it can easily handle thousands of
concurrent activites. In addition there is a watch-dog thread supervising
the execution thread to ensure it doesn't stop in an eternal loop or other
problem. There is a pool of threads handling file I/O. Each thread can
handle one open file. In addition threads can be used for connection
activities of the transporters in the
ndbd process. Thus in a system
that performs a large number of activities including update activities the
ndbd process will consume up to about 2 CPUs if allowed to. Thus in
a large SMP box with many CPUs it is recommended to use several
ndbd processes which are configured to be part of different node
ndb_mgmd, the Management Server Process
The management server is the process that reads the configuration file of the cluster and distributes this information to all nodes in the cluster requesting it. It also maintains the log of cluster activities. Management clients can connect to the management server and use commands to check status of the cluster in various aspects.
As of MySQL 4.1.5, it is no longer to specify a connect string when starting the management server.
The following files are created or used by
ndb_mgmd in its starting
ndb_mgmd. From MySQL 4.1.5, the log and PID files will
be placed in the
DataDir specified in the configuration file:
MySQL Cluster Configuration.
ndb_mgm, the Management Client Process
The final important process to know about is the management client. This process is not needed to run the cluster. Its value lies in its ability to check status of the cluster, start backups, and perform other management activities. It does so by providing access to a set of commands.
Actually the management client is using a C API that is used to access the management server so for advanced users it is also possible to program dedicated management processes which can do similar things as the management client can do.
When starting the management client, it is necessary to state the hostname and port of the management server as in the example below. The default is localhost as host and port number 2200.
shell> ndb_mgm localhost 2200
NDB Clusterstorage engine the default disabling of support for the
NDBstorage engine can be overruled by using this option. Using the
NDB Clusterstorage engine is necessary for using MySQL Cluster.
NDB Clusterstorage engine. This is disabled by default for binaries where it is included. So this option only applies if the server was configured to use the
NDB Clusterstorage engine.
NDBstorage engine, it is possible to point out the management server that distributes the cluster configuration by setting the connect string option.
-c connect_string, --connect-string connect_string
ndbdit is also possible to set the connect string to the management server as a command option.
shell> ndbd --connect-string="nodeid=2;host=ndb_mgmd.mysql.com:2200"
ndbdto execute as a daemon process. From MySQL 4.1.5 on, this is the default behavior.
ndbdnot to start as a daemon process. Useful when
ndbdis debugged and one wants printouts on the screen.
ndbdto perform an initial start. An initial start will erase any files created by earlier
ndbdinstances for recovery. It will also recreate recovery log files which on some operating systems can take a substantial amount of time. An initial start is only to be used at the very first start of the
ndbdprocess. It removes all files from the file system and creates all REDO log files. When performing a software upgrade which has changed the file contents on any files it is also necessary to use this option when restarting the node with a new software version of
ndbd. Finally it could also be used as a final resort if for some reason the node restart or system restart doesn't work. In this case be aware that destroying the contents of the file system means that this node can no longer be used to restore data. This option does not affect any backup files created. The previous possibility to use
-ifor this option was removed to ensure that this option is not used by mistake.
ndbdnot to automatically start.
ndbdwill connect to the management server and get the configuration and initialise communication objects. It will not start the execution engine until requested to do so by the management server. The management server can request by command issued by the management client.
ndbdprocess. The version number is the MySQL Cluster version number. It is important because at startup the MySQL Cluster processes verifies that the versions of the processes in the cluster can co-exist in the cluster. It is also important for online software upgrade of MySQL Cluster (see section
Software Upgrade of MySQL Cluster).
-c filename --config-file=filename
ndb_mgmdto start as a daemon process. This is the default behavior.
localhostand the default port is 2200.
Managing a MySQL Cluster involves a number of activities. The first activity is to configure and startup MySQL Cluster. This is covered by the sections section 17.3 MySQL Cluster Configuration and section 17.4 Process Management in MySQL Cluster. This section covers how to manage a running MySQL Cluster.
There are essentially two ways of actively managing a running MySQL Cluster.
The first is by commands entered into the management client where status of
cluster can be checked, log levels changed, backups started and stopped and
nodes can be stopped and started. The second method is to study the output
in the cluster log. The cluster log is directed towards the
`ndb_2_cluster.log' in the
DataDir directory of the management
server. The cluster log contains event reports generated from the
ndbd processes in the cluster. It is also possible to send the
cluster log entries to a Unix system log.
In addition to the central configuration file, the management servers are also controlled through a command line interface. The command line interface is available in the same terminal window as the started management server or through a separate management client process. This is the main administrative interface to a running cluster.
The management server has the following basic commands.
<id> denotes either a database node id (e.g. 21) or the
ALL that indicates that the command should be applied
to all database nodes in the cluster.
<id>or all database nodes.
<id>or all database nodes.
<id> RESTART [-N] [-I]
<id>or all database nodes.
ENTER SINGLE USER MODE <id>
<id>is allowed to access the database system.
EXIT SINGLE USER MODE
Commands for the event logs are given in the next section and commands for backup and restore are given in a separate section on these topics.
MySQL Cluster has two event logs, the cluster log and the node log.
Note: The cluster log is the recommended log. The node log is only intended to be used during application development or for debugging application code.
Each reportable event has the following properties:
The two logs (the cluster log and the node log) can be filtered on these properties.
The following management commands are related to the cluster log:
<id> CLUSTERLOG <category>=<threshold>
CLUSTERLOG FILTER <severity>
The following table describes the default setting (for all database nodes) of the cluster log category threshold. If an event has a priority with a value lower than or equal to the priority threshold, then it is reported in the cluster log.
Note that the events are reported per database node and that the thresholds can be set differently on different nodes.
|Category||Default threshold (All database nodes)|
The threshold is used to filter events within each category.
For example: a
STARTUP event with a priority of 3 is never sent
unless the threshold for
STARTUP is changed to 3 or lower.
Only events with priority 3 or lower are sent if the threshold is 3.
The event severities are (corresponds to UNIX syslog levels):
|1||ALERT||A condition that should be corrected immediately, such as a corrupted system database|
|2||CRITICAL||Critical conditions, such as device errors or out of resources|
|3||ERROR||Conditions that should be corrected, such as configuration errors|
|4||WARNING||Conditions that are not error conditions but might require handling|
|6||DEBUG||Messages used during development of NDB Cluster|
LOG_NOTICE are not used/mapped.
The event severities can be turned on or off. If the severity is on then all events with priority less than or equal to the category thresholds are logged. If the severity is off then no events belonging to the severity are logged.
The following commands are related to the node log:
<id> LOGLEVEL <levelnumber>Set logging level for database process with
idto the value of
All reportable events are listed below.
|DB nodes connected||CONNECTION||8||INFO|
|DB nodes disconnected||CONNECTION||8||INFO|
|Communication closed||CONNECTION||8||INFO||API & DB nodes connection closed|
|Communication opened||CONNECTION||8||INFO||API & DB nodes connection opened|
|Global checkpoint started||CHECKPOINT||9||INFO||Start of a GCP, i.e., REDO log is written to disk|
|Global checkpoint completed||CHECKPOINT||10||INFO||GCP finished|
|Local checkpoint started||CHECKPOINT||7||INFO||Start of local check pointing, i.e., data is written to disk. LCP Id and GCI Ids (keep and oldest restorable)|
|Local checkpoint completed||CHECKPOINT||8||INFO||LCP finished|
|LCP stopped in calc keep GCI||CHECKPOINT||0||ALERT||LCP stopped!|
|Local checkpoint fragment completed||CHECKPOINT||11||INFO||A LCP on a fragment has been completed|
|Report undo log blocked||CHECKPOINT||7||INFO||Reports undo logging blocked due buffer near to overflow|
|DB node start phases initiated||STARTUP||1||INFO||NDB Cluster starting|
|DB node all start phases completed||STARTUP||1||INFO||NDB Cluster started|
|Internal start signal received STTORRY||STARTUP||15||INFO||Internal start signal to blocks received after restart finished|
|DB node start phase X completed||STARTUP||4||INFO||A start phase has completed|
|Node has been successfully included into the cluster||STARTUP||3||INFO||President node, own node and dynamic id is shown|
|Node has been refused to be included into the cluster||STARTUP||8||INFO|
|DB node neighbours||STARTUP||8||INFO||Show left and right DB nodes neighbours|
|DB node shutdown initiated||STARTUP||1||INFO|
|DB node shutdown aborted||STARTUP||1||INFO|
|New REDO log started||STARTUP||10||INFO||GCI keep X, newest restorable GCI Y|
|New log started||STARTUP||10||INFO||Log part X, start MB Y, stop MB Z|
|Undo records executed||STARTUP||15||INFO|
|Completed copying of dictionary information||NODERESTART||8||INFO|
|Completed copying distribution information||NODERESTART||8||INFO|
|Starting to copy fragments||NODERESTART||8||INFO|
|Completed copying a fragment||NODERESTART||10||INFO|
|Completed copying all fragments||NODERESTART||8||INFO|
|Node failure phase completed||NODERESTART||8||ALERT||Reports node failure phases|
|Node has failed, node state was X||NODERESTART||8||ALERT||Reports that a node has failed|
|Report whether an arbitrator is found or not||NODERESTART||6||INFO||7 different cases|
|- President restarts arbitration thread [state=X]|
|- Prepare arbitrator node X [ticket=Y]|
|- Receive arbitrator node X [ticket=Y]|
|- Started arbitrator node X [ticket=Y]|
|- Lost arbitrator node X - process failure [state=Y]|
|- Lost arbitrator node X - process exit [state=Y]|
|- Lost arbitrator node X <error msg>[state=Y]|
|Report arbitrator results||NODERESTART||2||ALERT||8 different results|
|- Arbitration check lost - less than 1/2 nodes left|
|- Arbitration check won - node group majority|
|- Arbitration check lost - missing node group|
|- Network partitioning - arbitration required|
|- Arbitration won - positive reply from node X|
|- Arbitration lost - negative reply from node X|
|- Network partitioning - no arbitrator available|
|- Network partitioning - no arbitrator configured|
|GCP take over started||NODERESTART||7||INFO|
|GCP take over completed||NODERESTART||7||INFO|
|LCP take over started||NODERESTART||7||INFO|
|LCP take completed (state = X)||NODERESTART||7||INFO|
|Report transaction statistics||STATISTICS||8||INFO||# of: transactions, commits, reads, simple reads, writes, concurrent operations, attribute info, aborts|
|Report operations||STATISTICS||8||INFO||# of operations|
|Report table create||STATISTICS||7||INFO|
|Report job scheduling statistics||STATISTICS||9||INFO||Mean Internal job scheduling statistics|
|Sent # of bytes||STATISTICS||9||INFO||Mean # of bytes sent to node X|
|Received # of bytes||STATISTICS||9||INFO||Mean # of bytes received from node X|
|Memory usage||STATISTICS||5||INFO||Data and Index memory usage (80%, 90% and 100%)|
|Missed heartbeats||ERROR||8||WARNING||Node X missed heartbeat # Y|
|Dead due to missed heartbeat||ERROR||8||ALERT||Node X declared dead due to missed heartbeat|
|General warning events||ERROR||2||WARNING|
|Sent heartbeat||INFO||12||INFO||Heartbeat sent to node X|
|Create log bytes||INFO||11||INFO||Log part, log file, MB|
|General info events||INFO||2||INFO|
An event report has the following format in the logs:
<date & time in GMT> [<any string>] <event severity> -- <log message> 09:19:30 2003-04-24 [NDB] INFO -- Node 4 Start phase 4 completed
Single user mode allows the database administrator to restrict access to the database system to only one application (API node). When entering single user mode all connections to all APIs will be gracefully closed and no transactions are allowed to be started. All running transactions are aborted.
When the cluster has entered single user mode (use the all status command to see when the state has entered the single user mode), only the allowed API node is granted access to the database.
ENTER SINGLE USER MODE 5
After executing this command and after cluster has entered the single user mode, the API node with node id 5 becomes the single user of the cluster.
The node specified in the command above must be a MySQL Server node. Any attempt to specify any other type of node will be rejected.
Note: if the node with id 5 is running when executing
ENTER SINGLE USER MODE 5, all transactions running on node 5
will be aborted, the connection is closed, and the server must be
EXIT SINGLE USER MODE alters the state of the
cluster DB nodes from ``single user mode'' to ``started''.
MySQL Servers waiting for a connection, i.e. for the cluster to become
ready, are now allowed to connect. The MySQL Server denoted as the
single user continues to run, if it is connected, during and after the
EXIT SINGLE USER MODE
Best practice in case of node failures when running in single user mode is to:
Or restart database nodes prior to entering single user mode.
This section describes how to create a backup and later restore the backup to a database.
A backup is a snapshot of the database at a given time. The backup contains three main parts:
Each of these parts is saved on all nodes participating in a backup.
During backup each node saves these three parts to disk into three files:
Above <BackupId> is an identifier for the backup and <NodeId> is the node id of the node creating the file.
Before starting make sure that the cluster is properly configured for backups.
Using the management server to abort a backup:
ABORT BACKUP <BACKUPID>. The number <BackupId> is the identifier of the backup that is included in the response of the management server when the backup is started, i.e. in the message ``Backup <BackupId> started''. The identifier is also saved in the cluster log (cluster.log).
Note that if there is not any backup with id <BackupId> running when it is aborted, the management server will not reply anything. However there will be a line in the cluster.log mentioning that an ``invalid'' abort command has been filed.
The restore program is implemented as seperate command line utility. It reads the files created from the backup and inserts the stored information into the database. The restore program has to be executed once for each set of backup files, i.e. as many times as there were database nodes running when the backup we created.
The first time you run the restore program you also need to restore the meta data, i.e. create tables. The restore program needs to be started in a directory containing an `Ndb.cfg' file. The backup files must be present in the same directory. The backup can be restored to a database with a different configuration than it was created from. For example, consider if a backup (with id 12) created in a cluster with two database nodes (with node id 2 and node id 3) that should be restored to a cluster with four nodes. The restore program then has to be executed two times (one for each database node in the cluster where the backup was taken) as described in the box below.
Note: the cluster should have an empty database when starting to restore a backup.
There are four configuration parameters for backup:
If an error code is returned when issuing a backup request, then check that there is enough memory allocated for the backup (i.e. the configuration parameters). Also check that there is enough space on the hard drive partition of the backup target.
Go to the first, previous, next, last section, table of contents.