Understanding the data behind Bitcoin Core

Overview

In this tutorial, we take a closer look at the data directory and files behind the Bitcoin Core reference client. A better understanding of how this data is managed lets us go beyond probing Bitcoin's remote procedure call (RPC) and REST interfaces for insight into the data maintained by the client.

Prerequisites

You will need access to a Bitcoin node. We suggest running against a node configured in regtest mode so that we are free to play with various scenarios without risking real money. You can, however, run these steps against either the testnet or mainnet configurations.

Note:
If you don't currently have access to a Bitcoin development environment, don't worry, we have your back! We've set up a web-based mechanism which provisions your very own private session that includes these tools and comes preconfigured with a Bitcoin node in regtest mode. https://bitcoindev.network/bitcoin-cli-sandbox/
Alternatively, we also provide a simple Docker container configured in regtest mode that you can install for testing purposes.

gr0kchain:~ $ docker volume create --name=bitcoind-data
gr0kchain:~ $ docker run -v bitcoind-data:/bitcoin --name=bitcoind-node -d \
     -p 18444:18444 \
     -p 127.0.0.1:18332:18332 \
     bitcoindevelopernetwork/bitcoind-regtest

Getting started

Before we get started, let's have a look at the data directory of a running Bitcoin Core node.

gr0kchain@bitcoindev $ tree ~/.bitcoin/
/home/gr0kchain/.bitcoin/
├── banlist.dat
├── bitcoin.conf
├── blocks
│   ├── blk00000.dat
│   ├── index
│   │   ├── 000003.log
│   │   ├── 000004.log
│   │   ├── 000005.ldb
│   │   ├── CURRENT
│   │   ├── LOCK
│   │   └── MANIFEST-000002
│   └── rev00000.dat
├── chainstate
│   ├── 000003.log
│   ├── CURRENT
│   ├── LOCK
│   └── MANIFEST-000002
├── db.log
├── debug.log
├── fee_estimates.dat
├── mempool.dat
├── peers.dat
└── wallet.dat

10 directories, 49 files

Note
By default, bitcoind will manage files in the following locations.

Windows %APPDATA%\Bitcoin
Linux ~/.bitcoin/
Mac OS X ~/Library/Application\ Support/Bitcoin/

This default location can be overridden using the -datadir configuration parameter or by adding a datadir parameter to the bitcoin.conf file.

When testnet or regtest is configured, a similar set of files is created in a subdirectory of the data directory so that it does not conflict with the mainnet files.
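
As a rough sketch of how these locations combine, the following Node.js function resolves the directory a node would use by default. The function name and its parameters are hypothetical, and the testnet3 subdirectory name is an assumption based on Bitcoin Core's usual layout rather than something stated above (the regtest subdirectory appears in the debug.log path later in this tutorial).

```javascript
const path = require('path')

// Hypothetical helper: resolve the default Bitcoin Core data directory.
// `platform` follows Node's process.platform values ('win32', 'darwin', 'linux').
// `env` is an environment-like object providing HOME or APPDATA.
function defaultDataDir (platform, env, network) {
  const join = platform === 'win32' ? path.win32.join : path.posix.join
  let base
  if (platform === 'win32') {
    base = join(env.APPDATA, 'Bitcoin')
  } else if (platform === 'darwin') {
    base = join(env.HOME, 'Library', 'Application Support', 'Bitcoin')
  } else {
    base = join(env.HOME, '.bitcoin')
  }
  // Assumed subdirectory names; mainnet uses the base directory itself.
  const subdirs = { mainnet: '', testnet: 'testnet3', regtest: 'regtest' }
  return join(base, subdirs[network] || '')
}

console.log(defaultDataDir('linux', { HOME: '/home/gr0kchain' }, 'regtest'))
```

A -datadir override would simply replace the base directory in this sketch.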

Filename: description

  • banlist.dat: stores the IPs/subnets of banned nodes
  • bitcoin.conf: contains configuration settings for bitcoind or bitcoin-qt
  • bitcoind.pid: stores the process id of bitcoind while running
  • blocks/blk000??.dat: block data (custom, 128 MiB per file); since 0.8.0
  • blocks/rev000??.dat: block undo data (custom); since 0.8.0 (format changed since pre-0.8)
  • blocks/index/*: block index (LevelDB); since 0.8.0
  • chainstate/*: blockchain state database (LevelDB); since 0.8.0
  • database/*: BDB database environment; only used for wallet since 0.8.0; moved to wallets/ directory on new installs since 0.16.0
  • db.log: wallet database log file; moved to wallets/ directory on new installs since 0.16.0
  • debug.log: contains debug information and general logging generated by bitcoind or bitcoin-qt
  • fee_estimates.dat: stores statistics used to estimate minimum transaction fees and priorities required for confirmation; since 0.10.0
  • indexes/txindex/*: optional transaction index database (LevelDB); since 0.17.0
  • mempool.dat: dump of the mempool's transactions; since 0.14.0
  • peers.dat: peer IP address database (custom format); since 0.7.0
  • wallet.dat: personal wallet (BDB) with keys and transactions; moved to wallets/ directory on new installs since 0.16.0
  • wallets/database/*: BDB database environment; used for wallets since 0.16.0
  • wallets/db.log: wallet database log file; since 0.16.0
  • wallets/wallet.dat: personal wallet (BDB) with keys and transactions; since 0.16.0
  • .cookie: session RPC authentication cookie (written at start when cookie authentication is used, deleted on shutdown); since 0.12.0
  • onion_private_key: cached Tor hidden service private key for -listenonion; since 0.12.0
  • guisettings.ini.bak: backup of former GUI settings after -resetguisettings is used

Only used in pre-0.8.0

  • blktree/: block chain index (LevelDB); since pre-0.8, replaced by blocks/index/ in 0.8.0
  • coins/: unspent transaction output database (LevelDB); since pre-0.8, replaced by chainstate/ in 0.8.0

Only used before 0.8.0

  • blkindex.dat: block chain index database (BDB); replaced by {chainstate/,blocks/index/,blocks/rev000??.dat} in 0.8.0
  • blk000?.dat: block data (custom, 2 GiB per file); replaced by blocks/blk000??.dat in 0.8.0

Only used before 0.7.0

  • addr.dat: peer IP address database (BDB); replaced by peers.dat in 0.7.0

Files

As we can see, there are various files and directories which organise data behind our node, so let's take a closer look at each of these.

Some background on the key-value store

For the purpose of this tutorial, we'll be having a closer look at the blocks and chainstate directories and files.

Bitcoin Core stores this data in LevelDB, a lightweight, single-purpose library for persistence with bindings to many platforms, so that is what we will be using here.

By default, LevelDB stores entries lexicographically sorted by keys. The sorting is one of the main distinguishing features of LevelDB amongst similar embedded data storage libraries and comes in very useful for querying as we’ll see later.
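
To make the ordering concrete, here is a minimal sketch that mimics it in plain Node.js without opening a database. It assumes LevelDB's default comparator is a bytewise comparison, which Buffer.compare mirrors; the key names and the range helper are illustrative only.

```javascript
// Keys as raw bytes; the names are illustrative only.
const keys = ['name', 'bitcoin', 'B', 'c2', 'c10', 'block'].map((k) => Buffer.from(k))

// Sort bytewise, the way a LevelDB iterator would return the keys.
const sorted = keys.slice().sort(Buffer.compare)
console.log(sorted.map(String))

// A range query then amounts to selecting a contiguous run of the sorted keys.
function range (sortedKeys, gte, lt) {
  return sortedKeys.filter((k) => Buffer.compare(k, gte) >= 0 && Buffer.compare(k, lt) < 0)
}

// Everything starting with 'c': from 'c' (inclusive) up to 'd' (exclusive).
const cKeys = range(sorted, Buffer.from('c'), Buffer.from('d'))
console.log(cKeys.map(String))
```

Note that uppercase B sorts before every lowercase key and that c10 sorts before c2, because the comparison is done byte by byte rather than numerically. This is why a one-byte prefix is enough to group related records together.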

A primer on leveldb

Before we look at these in more detail, let's first familiarise ourselves with LevelDB using Node.js.

  1. Create a directory for hosting our code.
gr0kchain@bitcoindev $ mkdir code && cd code
  2. Install the level package.
gr0kchain@bitcoindev $ npm install level
  3. Create a file called index.js that contains the following code.
var level = require('level')

// 1) Create our database, supply location and options.
//    This will create or open the underlying store.
var db = level('my-db')

// 2) Put a key & value
db.put('name', 'Satoshi Nakamoto', function (err) {
  if (err) return console.log('Ooops!', err) // some kind of I/O error

  // 3) Fetch by key
  db.get('name', function (err, value) {
    if (err) return console.log('Ooops!', err) // likely the key was not found

    // Ta da!
    console.log('name=' + value)
  })
})
  4. Run the script.
gr0kchain@bitcoindev $ node ./index.js
name=Satoshi Nakamoto

Great, you've just created your first LevelDB database!

A closer look at the data behind LevelDB

It is worth inspecting the data directory created by our code.

gr0kchain@bitcoindev $ tree ./my-db/
./my-db/
├── 000003.log
├── CURRENT
├── LOCK
├── LOG
└── MANIFEST-000002

0 directories, 5 files

Note
For more information on these files consult the LevelDB Documentation.

Here you should notice a structure similar to the one we saw previously in our chainstate and blocks/index directories.

The level package is great for developing applications. To explore our data interactively, however, let's use a LevelDB read-eval-print loop (REPL) utility called lev.

  1. Install lev.
gr0kchain@bitcoindev $ npm install -g lev
  2. Open the my-db database we created.
gr0kchain@bitcoindev $ lev ./my-db/
/>
  3. Obtain a list of the keys currently stored in our database.
/>ls

name

/>
  4. Obtain the value for the key name.
/>get name
'Satoshi Nakamoto'
  5. Add another key-value pair to the database.
/>put bitcoin "rocks"
'OK'
/>ls
bitcoin  name
/>get bitcoin
'rocks'
  6. Exit the interactive REPL.
/>.exit

Nice! Some additional commands we can use with lev include:

  • GET - Get a key from the database.
  • PUT - Put a value into the database. If you have keyEncoding or valueEncoding set to json, these values will be parsed from strings into json.
  • DEL - Delete a key from the database.
  • LS - Get all the keys in the current range.
  • START - Defines the start of the current range. You can also use GT or GTE.
  • END - Defines the end of the current range. You can also use LT or LTE.
  • LIMIT - Limit the number of records in the current range (defaults to 5000).
  • REVERSE - Reverse the records in the current range.

Looking at the data behind bitcoin core

Now that we've looked at how LevelDB works, let's take a closer look at our blocks and chainstate directories.

Warning
It is recommended that you make a backup of your chain data to avoid any accidental corruption.

Bitcoin Core developer Pieter Wuille gives a good explanation of these two databases:

Bitcoind since version 0.8 maintains two databases, the block index (in $DATADIR/blocks/index) and the chainstate (in $DATADIR/chainstate). The block index maintains information for every block, and where it is stored on disk. The chain state maintains information about the resulting state of validation as a result of the currently best known chain.

Inside the block index, the used key/value pairs are:

  • 'b' + 32-byte block hash -> block index record. Each record stores:
    • The block header.
    • The height.
    • The number of transactions.
    • To what extent this block is validated.
    • In which file, and where in that file, the block data is stored.
    • In which file, and where in that file, the undo data is stored.
  • 'f' + 4-byte file number -> file information record. Each record stores:
    • The number of blocks stored in the block file with that number.
    • The size of the block file with that number ($DATADIR/blocks/blkNNNNN.dat).
    • The size of the undo file with that number ($DATADIR/blocks/revNNNNN.dat).
    • The lowest and highest height of blocks stored in the block file with that number.
    • The lowest and highest timestamp of blocks stored in the block file with that number.
  • 'l' -> 4-byte file number: the last block file number used.
  • 'R' -> 1-byte boolean ('1' if true): whether we're in the process of reindexing.
  • 'F' + 1-byte flag name length + flag name string -> 1 byte boolean ('1' if true, '0' if false): various flags that can be on or off. Currently defined flags include:
    • 'txindex': Whether the transaction index is enabled.
  • 't' + 32-byte transaction hash -> transaction index record. These are optional and only exist if 'txindex' is enabled (see above). Each record stores:
    • Which block file number the transaction is stored in.
    • Which offset into that file the block the transaction is part of is stored at.
    • The offset from the start of that block to the position where that transaction itself is stored.

Inside the chain state database, the following key/value pairs are stored:

  • 'c' + 32-byte transaction hash -> unspent transaction output record for that transaction. These records are only present for transactions that have at least one unspent output left. Each record stores:
    • The version of the transaction.
    • Whether the transaction was a coinbase or not.
    • Which height block contains the transaction.
    • Which outputs of that transaction are unspent.
    • The scriptPubKey and amount for those unspent outputs.
  • 'B' -> 32-byte block hash: the block hash up to which the database represents the unspent transaction outputs.
    Note: recent versions of bitcoind obfuscate the value in each key/value pair of the chain state database, so you need to XOR a stored value with the obfuscation key to get the real value.
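
The block index key layouts listed above can be sketched as follows. This is an illustration under assumptions, not Bitcoin Core's own code: the function names are hypothetical, the file number is assumed to be serialized as a little-endian 32-bit integer, and hashes are assumed to be stored byte-reversed relative to how bitcoin-cli prints them.

```javascript
// Convert a hash as displayed by bitcoin-cli into the assumed on-disk byte order.
function internalHash (hex) {
  return Buffer.from(hex, 'hex').reverse()
}

// 'b' + 32-byte block hash
function blockIndexKey (blockHashHex) {
  return Buffer.concat([Buffer.from('b'), internalHash(blockHashHex)])
}

// 'f' + 4-byte file number (assumed little-endian)
function fileInfoKey (fileNumber) {
  const n = Buffer.alloc(4)
  n.writeUInt32LE(fileNumber)
  return Buffer.concat([Buffer.from('f'), n])
}

// 'F' + 1-byte flag name length + flag name string
function flagKey (name) {
  const s = Buffer.from(name)
  return Buffer.concat([Buffer.from('F'), Buffer.from([s.length]), s])
}

console.log(fileInfoKey(0).toString('hex'))
console.log(flagKey('txindex').toString('hex'))
```

Keys built this way could be passed to db.get on a copy of blocks/index opened with a hex or binary key encoding.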

Understanding the chainstate leveldb

Let's start by looking at the chainstate folder. The chainstate directory contains the state as of the latest block. In simplified terms, it stores every spendable coin, who owns it, and how much it's worth.

Note
Opening the live database with these tools appears to corrupt it, after which bitcoind has to be restarted with -reindex or -reindex-chainstate. It is suggested that you run the following steps against a backup of your Bitcoin data directory.

  1. LevelDB doesn't support concurrent access from multiple applications, so we'll first need to stop bitcoind.
gr0kchain@bitcoindev $ bitcoin-cli stop
  2. Make a backup of your chain data.
gr0kchain@bitcoindev $ rsync -va ~/.bitcoin/chainstate/ ~/.bitcoin/chainstate_bk/
  3. Open the chainstate using the lev repl command.
gr0kchain@bitcoindev $ lev ~/.bitcoin/chainstate_bk/
/>
  4. Run the ls command.
/> ls

obfuscate_key  B

Interesting, here we see a key called obfuscate_key and another called B. The obfuscation key was introduced by a pull request to Bitcoin Core to stop anti-virus software from flagging Bitcoin data as hostile, something that could otherwise be provoked by intentionally adding virus signatures to the time chain. It is a 64-bit value, stored under the key made up of the bytes 0e00 followed by obfuscate_key, that should be XORed with each data value from the database.

Note
When the bitcoind debug option is set to leveldb or 1, the obfuscation key is written to our debug.log file.

gr0kchain@bitcoindev $ grep obfuscate ~/.bitcoin/regtest/debug.log
2019-03-14 12:06:16 Wrote new obfuscate key for /home/gr0kchain/.bitcoin/regtest/chainstate: eac3d71013881b79

Writing a script for reading from the chainstate leveldb

In my experience, opening the database with the level library can corrupt it, so I'd suggest making a backup of the data before executing any of these commands. I'll also be using a fresh copy of regtest, where we'll need to generate some blocks to get us going.

  1. First, let's create a backup of our database.
gr0kchain@bitcoindev $ bitcoind
gr0kchain@bitcoindev $ bitcoin-cli generate 1
gr0kchain@bitcoindev $ bitcoin-cli stop
gr0kchain@bitcoindev $ rsync -va ~/.bitcoin/regtest/ ~/.bitcoin/backup/
  2. Next, create a JavaScript file called chainstate.js that reads every record from the backup.

    var level = require('level')

    // Open the backup copy, reading keys and values as hex strings.
    var db = level('/home/gr0kchain/.bitcoin/backup/chainstate/', { keyEncoding: 'hex', valueEncoding: 'hex' })

    // Iterate over every record. With hex key encoding, a range is given as
    // hex strings, e.g. { gte: '63', lt: '64' } to read only the 'c' records.
    db.createReadStream()
    .on('data', function (data) {
      // 0e00 followed by 'obfuscate_key' in hex
      if (data.key === '0e006f62667573636174655f6b6579') {
        console.log("obfuscate_key", data)
      } else {
        console.log("record", data)
      }
    })
    .on('error', function (err) {
      console.log('Oh my!', err)
    })
    .on('close', function () {
      console.log('Stream closed')
    })
    .on('end', function () {
      console.log('Stream ended')
    })
    
  3. We can then run this against our backup database.

    gr0kchain@bitcoindev $ node ./chainstate.js
    obfuscate_key { key: '0e006f62667573636174655f6b6579',
      value: '08eac3d71013881b79' }
    record { key: '42',
      value: '335c93f941bd7479dce90d3f50e423a3125b2bfdf641f7d78823a9df39f1c673' }
    record { key: '638db7b33143173127aff1473ac15501cfc75ebce965546b391de114034d33c237',
      value: 'ebc0e513374284367c3a5730b652bfc7400329de72f2dbf6d79cb8f5ceaeea183bf1855112' }
    Stream ended
    Stream closed
    

    Note
    The value for our obfuscate_key should match the one we saw earlier in our debug.log. In my local instance, the stored value is 08eac3d71013881b79. The leading 08 is a length prefix indicating that 8 bytes follow; it is not part of the key itself and therefore does not appear in the log output.

  4. Start our bitcoind server, and check one of our previous blocks.

    gr0kchain@bitcoindev $ bitcoind
    Bitcoin server starting
    gr0kchain@bitcoindev $ bitcoin-cli getblockchaininfo | grep hash
    "bestblockhash": "0add792acf7ee062aeecc9e5edfc98f8da386c432fda2a36006f3552e9449fd9",
    gr0kchain@bitcoindev $ bitcoin-cli getblock 0add792acf7ee062aeecc9e5edfc98f8da386c432fda2a36006f3552e9449fd9
    {
    "hash": "0add792acf7ee062aeecc9e5edfc98f8da386c432fda2a36006f3552e9449fd9",
    "confirmations": 1,
    "size": 179,
    "height": 1,
    "version": 536870912,
    "merkleroot": "37c2334d0314e11d396b5465e9bc5ec7cf0155c13a47f1af2731174331b3b78d",
    "tx": [
      "37c2334d0314e11d396b5465e9bc5ec7cf0155c13a47f1af2731174331b3b78d"
    ],
    "time": 1552565182,
    "mediantime": 1552565182,
    "nonce": 3,
    "bits": "207fffff",
    "difficulty": 4.656542373906925e-10,
    "chainwork": "0000000000000000000000000000000000000000000000000000000000000004",
    "previousblockhash": "0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206"
    }
    
    Compare the tx entry above with the record we read from the chainstate backup:

    { key: '638db7b33143173127aff1473ac15501cfc75ebce965546b391de114034d33c237',
      value: 'ebc0e513374284367c3a5730b652bfc7400329de72f2dbf6d79cb8f5ceaeea183bf1855112' }
    

In the above example, the key of the UTXO record consists of the prefix c (63 in hex) followed by the txid 37c2334d0314e11d396b5465e9bc5ec7cf0155c13a47f1af2731174331b3b78d in little-endian byte order, that is, byte-reversed relative to how bitcoin-cli displays it.
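
The relationship between the txid and the database key can be checked with a few lines of Node.js. The chainstateKey helper is a hypothetical name used for illustration; the txid is the one from the getblock output above.

```javascript
const txid = '37c2334d0314e11d396b5465e9bc5ec7cf0155c13a47f1af2731174331b3b78d'

// Build the chainstate key for a txid: the 'c' prefix followed by the
// txid bytes in reverse order.
function chainstateKey (txidHex) {
  const prefix = Buffer.from('c') // 0x63
  const hash = Buffer.from(txidHex, 'hex').reverse()
  return Buffer.concat([prefix, hash]).toString('hex')
}

console.log(chainstateKey(txid))
// prints 638db7b33143173127aff1473ac15501cfc75ebce965546b391de114034d33c237
```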

The value is still obfuscated at this point. To read it, it must be XORed with the obfuscation key stored under 0e006f62667573636174655f6b6579, whose stored value is 08eac3d71013881b79.
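
As a minimal sketch of that XOR step, the code below de-obfuscates the B record from the output above using only Node's Buffer. The helper names are hypothetical, and it assumes the first byte of the stored obfuscate_key value is a length prefix. Reversing the decoded bytes should give the best block hash in the form bitcoin-cli displays it.

```javascript
// Strip the one-byte length prefix (08) from the stored obfuscate_key value.
function parseObfuscateKey (storedHex) {
  const raw = Buffer.from(storedHex, 'hex')
  return raw.subarray(1, 1 + raw[0])
}

// XOR every byte of the value with the key, repeating the key as needed.
function deobfuscate (valueHex, key) {
  const value = Buffer.from(valueHex, 'hex')
  const out = Buffer.alloc(value.length)
  for (let i = 0; i < value.length; i++) {
    out[i] = value[i] ^ key[i % key.length]
  }
  return out
}

const key = parseObfuscateKey('08eac3d71013881b79')
const best = deobfuscate('335c93f941bd7479dce90d3f50e423a3125b2bfdf641f7d78823a9df39f1c673', key)

// Reverse a copy of the bytes to match the RPC display order.
console.log(Buffer.from(best).reverse().toString('hex'))
// prints 0add792acf7ee062aeecc9e5edfc98f8da386c432fda2a36006f3552e9449fd9
```

The printed hash matches the bestblockhash reported by getblockchaininfo, which is a handy sanity check that the key and the XOR are applied correctly.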

Keep in mind that the on-disk storage files are designed to be compact and are not really intended to be easily usable by other applications (LevelDB doesn't support concurrent access from multiple applications anyway). There are several RPC methods for querying data from the databases (getblock, gettxoutsetinfo, gettxout) without needing direct access.

Note also that the block index database stores only block headers and metadata. The actual blocks and transactions are stored in the block files, which are not databases, but just raw append-only files that contain the blocks in network format.
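
To illustrate what a raw append-only file of blocks looks like, here is a sketch that walks the contents of a block file held in a Buffer. The framing is an assumption not spelled out above: each block is taken to be preceded by a 4-byte network magic (fabfb5da for regtest) and a 4-byte little-endian length. The listBlocks and frame helpers are hypothetical, and the demo data is synthetic rather than real blocks.

```javascript
const crypto = require('crypto')

// Double SHA-256, used for block header hashes.
function sha256d (buf) {
  const once = crypto.createHash('sha256').update(buf).digest()
  return crypto.createHash('sha256').update(once).digest()
}

// Walk a buffer holding the contents of a blk*.dat file.
// Assumed framing: 4-byte magic, 4-byte little-endian size, then the block.
function listBlocks (data, magicHex) {
  const magic = Buffer.from(magicHex, 'hex')
  const blocks = []
  let pos = 0
  while (pos + 8 <= data.length && data.subarray(pos, pos + 4).equals(magic)) {
    const size = data.readUInt32LE(pos + 4)
    const header = data.subarray(pos + 8, pos + 8 + 80) // 80-byte block header
    blocks.push({
      offset: pos + 8,
      size: size,
      hash: Buffer.from(sha256d(header)).reverse().toString('hex')
    })
    pos += 8 + size
  }
  return blocks
}

// Synthetic demo: two fake 81-byte "blocks" framed with the regtest magic.
function frame (payload) {
  const head = Buffer.alloc(8)
  Buffer.from('fabfb5da', 'hex').copy(head)
  head.writeUInt32LE(payload.length, 4)
  return Buffer.concat([head, payload])
}
const demo = Buffer.concat([frame(Buffer.alloc(81, 1)), frame(Buffer.alloc(81, 2))])
const found = listBlocks(demo, 'fabfb5da')
console.log(found)
```

Against a real file you would pass the result of fs.readFileSync on a copy of blk00000.dat instead of the synthetic buffer. The loop stops at the first position that does not start with the magic bytes, which also covers any unused space at the end of the file.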

Decoding the values

To decode these values, we XOR them with the obfuscation key.

  1. Install bigi to work with large numbers in JavaScript.
gr0kchain@bitcoindev $ npm install bigi
  2. Start node in interactive mode.
gr0kchain@bitcoindev $ node
>
  3. Use the big integer package to hold our previous value and the obfuscate_key value. You need to drop the leading 08 length byte from the key and repeat the remaining 8 bytes to cover the length of the value being decoded.
> var bigi = require("bigi")
undefined
> var k = bigi.fromHex('eac3d71013881b79eac3d71013881b79eac3d71013881b79eac3d71013881b79eac3d71013')
undefined
> var v = bigi.fromHex('ebc0e513374284367c3a5730b652bfc7400329de72f2dbf6d79cb8f5ceaeea183bf1855112')
undefined
> var decode = v.xor(k)
undefined
> decode.toHex()
'0103320324ca9f4f96f98020a5daa4beaac0fece617ac08f3d5f6fe5dd26f161d132524101'
>

We now have the de-obfuscated version of our UTXO record, which can be parsed according to the serialization format documented in the following comment from the Bitcoin Core source code.

/** pruned version of CTransaction: only retains metadata and unspent transaction outputs
 *
 * Serialized format:
 * - VARINT(nVersion)
 * - VARINT(nCode)
 * - unspentness bitvector, for vout[2] and further; least significant byte first
 * - the non-spent CTxOuts (via CTxOutCompressor)
 * - VARINT(nHeight)
 *
 * The nCode value consists of:
 * - bit 1: IsCoinBase()
 * - bit 2: vout[0] is not spent
 * - bit 4: vout[1] is not spent
 * - The higher bits encode N, the number of non-zero bytes in the following bitvector.
 *   - In case both bit 2 and bit 4 are unset, they encode N-1, as there must be at
 *     least one non-spent output).
 *
 * Example: 0104835800816115944e077fe7c803cfa57f29b36bf87c1d358bb85e
 *          <><><--------------------------------------------><---->
 *          |  \                  |                             /
 *    version   code             vout[1]                  height
 *
 *    - version = 1
 *    - code = 4 (vout[1] is not spent, and 0 non-zero bytes of bitvector follow)
 *    - unspentness bitvector: as 0 non-zero bytes follow, it has length 0
 *    - vout[1]: 835800816115944e077fe7c803cfa57f29b36bf87c1d35
 *               * 8358: compact amount representation for 60000000000 (600 BTC)
 *               * 00: special txout type pay-to-pubkey-hash
 *               * 816115944e077fe7c803cfa57f29b36bf87c1d35: address uint160
 *    - height = 203998
 *
 *
 * Example: 0109044086ef97d5790061b01caab50f1b8e9c50a5057eb43c2d9563a4eebbd123008c988f1a4a4de2161e0f50aac7f17e7f9555caa486af3b
 *          <><><--><--------------------------------------------------><----------------------------------------------><---->
 *         /  \   \                     |                                                           |                     /
 *  version  code  unspentness       vout[4]                                                     vout[16]           height
 *
 *  - version = 1
 *  - code = 9 (coinbase, neither vout[0] or vout[1] are unspent,
 *                2 (1, +1 because both bit 2 and bit 4 are unset) non-zero bitvector bytes follow)
 *  - unspentness bitvector: bits 2 (0x04) and 14 (0x4000) are set, so vout[2+2] and vout[14+2] are unspent
 *  - vout[4]: 86ef97d5790061b01caab50f1b8e9c50a5057eb43c2d9563a4ee
 *             * 86ef97d579: compact amount representation for 234925952 (2.35 BTC)
 *             * 00: special txout type pay-to-pubkey-hash
 *             * 61b01caab50f1b8e9c50a5057eb43c2d9563a4ee: address uint160
 *  - vout[16]: bbd123008c988f1a4a4de2161e0f50aac7f17e7f9555caa4
 *              * bbd123: compact amount representation for 110397 (0.001 BTC)
 *              * 00: special txout type pay-to-pubkey-hash
 *              * 8c988f1a4a4de2161e0f50aac7f17e7f9555caa4: address uint160
 *  - height = 120891
 */
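
Applying this format to our own record can be sketched as follows. This is a simplified decoder written for this tutorial, not Bitcoin Core's implementation: the function names are hypothetical, it only handles records with an empty unspentness bitvector (so at most vout[0] and vout[1]), and the VARINT and compact amount routines are my reading of the format described in the comment above.

```javascript
// Bitcoin Core style VARINT: base-128, most significant group first,
// with 1 added for every continuation byte.
function readVarInt (buf, pos) {
  let n = 0
  for (;;) {
    const ch = buf[pos++]
    n = n * 128 + (ch & 0x7f)
    if (ch & 0x80) n += 1
    else return { value: n, pos: pos }
  }
}

// Expand the compact amount representation into satoshis.
function decompressAmount (x) {
  if (x === 0) return 0
  x -= 1
  let e = x % 10
  x = Math.floor(x / 10)
  let n
  if (e < 9) {
    const d = (x % 9) + 1
    x = Math.floor(x / 9)
    n = x * 10 + d
  } else {
    n = x + 1
  }
  while (e-- > 0) n *= 10
  return n
}

function decodeCoins (hex) {
  const buf = Buffer.from(hex, 'hex')
  let r = readVarInt(buf, 0)
  const version = r.value
  r = readVarInt(buf, r.pos)
  const code = r.value
  // Records that carry a bitvector are out of scope for this sketch.
  if ((code >> 3) !== 0 || (code & 6) === 0) {
    throw new Error('records with an unspentness bitvector are not handled')
  }
  const outputs = []
  let pos = r.pos
  for (const [bit, index] of [[2, 0], [4, 1]]) {
    if (!(code & bit)) continue
    const amount = readVarInt(buf, pos)
    const type = readVarInt(buf, amount.pos)
    // Special types 0-1 carry 20 bytes, 2-5 carry 32; otherwise a raw script.
    const len = type.value < 2 ? 20 : type.value < 6 ? 32 : type.value - 6
    outputs.push({
      index: index,
      satoshis: decompressAmount(amount.value),
      scriptType: type.value,
      data: buf.subarray(type.pos, type.pos + len).toString('hex')
    })
    pos = type.pos + len
  }
  const height = readVarInt(buf, pos).value
  return { version: version, coinbase: (code & 1) === 1, outputs: outputs, height: height }
}

const ours = decodeCoins('0103320324ca9f4f96f98020a5daa4beaac0fece617ac08f3d5f6fe5dd26f161d132524101')
console.log(ours)
```

For our record this yields version 1, a coinbase transaction with vout[0] unspent and worth 5000000000 satoshis (50 BTC), at height 1, which agrees with the block we generated on regtest.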

Personally identifiable data [v0.8 and above]

This section may be of use to you if you wish to send a friend the blockchain, sparing them a hefty download.

  • wallet.dat
    • Contains addresses and transactions linked to them. Please be sure to make backups of this file. It contains the keys necessary for spending your bitcoins. You should not transfer this file to any third party or they may be able to access your bitcoins.
  • db.log
    • May contain information pertaining to your wallet. It may be safely deleted.
  • debug.log
    • May contain IP addresses and transaction IDs. It may be safely deleted.
  • database/ folder
    • This should only exist when bitcoin-qt is currently running. It contains information (BDB state) relating to your wallet.
  • peers.dat
    • Unknown whether this contains personally identifiable data. It may be safely deleted.

Other files and folders (blocks, blocks/index, chainstate) may be safely transferred/archived as they contain information pertaining only to the public blockchain.

Transferability

The database files in the "blocks" and "chainstate" directories are cross-platform, and can be copied between different installations. These files, known collectively as a node's "block database", represent all of the information downloaded by a node during the syncing process. In other words, if you copy installation A's block database into installation B, installation B will then have the same syncing percentage as installation A. This is usually far faster than doing the normal initial sync over again. However, when you copy someone's database in this way, you are trusting them absolutely. Bitcoin Core treats its block database files as 100% accurate and trustworthy, whereas during the normal initial sync it treats each block offered by a peer as invalid until proven otherwise. If an attacker is able to modify your block database files, then they can do all sorts of evil things which could cause you to lose bitcoins. Therefore, you should only copy block databases from Bitcoin installations under your personal control, and only over a secure connection.

Each node has a unique block database, and all of the files are highly connected. So if you copy just a few files from one installation's "blocks" or "chainstate" directories into another installation, this will almost certainly cause the second node to crash or get stuck at some random point in the future. If you want to copy a block database from one installation to another, you have to delete the old database and copy all of the files at once. Both nodes have to be shut down while copying.

Only the file with the highest number in the "blocks" directory is ever written to. The earlier files will never change. Also, when these blk*.dat files are accessed, they are usually accessed in a highly sequential manner. Therefore, it's possible to symlink the "blocks" directory or some subset of the blk*.dat files individually onto a magnetic storage drive without much loss in performance (see Splitting the data directory), and if two installations start out with identical block databases (due to the copying described previously), subsequent runs of rsync will be very efficient.

Conclusion

In this tutorial, we had a look at the files and directories the Bitcoin Core reference client uses to manage its own data.

References