Mockingbird Documentation

Complete guide to generating realistic mock data with perfect referential integrity

Installation

Install via pip

Mockingbird is available on PyPI and can be installed using pip

# Install Mockingbird

pip install mockingbird-cli

# Verify installation

mockingbird version

Requirements:

• Python 3.8 or higher
• pip package manager

Quick Start

1. Create a sample Blueprint.yaml

Create a sample blueprint file

mockingbird init

2. Generate Data

Create your mock dataset

mockingbird generate Blueprint.yaml --output ./output_data

Blueprint Syntax

Basic Structure

Blueprints define entities and their fields using YAML format

# Blueprint.yaml
MyEntity:
  count: 10      # Number of records to generate
  fields:
    MyField 1:
      generator: generator_name
      config:
        # ... generator-specific options

Data Generators

Sequence Generator

sequence

Produces auto-incrementing integer sequences, perfect for IDs

Configuration:

start_at (optional, default: 1) - Starting integer value

increment (optional, default: 1) - The value to increment by


users:
  count: 5
  fields:
    user_id:
      generator: sequence
      config:
        start_at: 1
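
As a rough mental model (not Mockingbird's actual implementation), the sequence generator behaves like a simple counter. The Python sketch below assumes only the two documented options, start_at and increment; the function name sequence_values is hypothetical.

```python
from itertools import count, islice


def sequence_values(n, start_at=1, increment=1):
    """Return the first n values of an auto-incrementing sequence."""
    # itertools.count yields start_at, start_at + increment, ...
    return list(islice(count(start_at, increment), n))


# Five IDs with the default options, matching count: 5 in the blueprint
print(sequence_values(5))
# The same count with a custom start and step
print(sequence_values(5, start_at=100, increment=10))
```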

Faker Generator

faker

Leverages the Faker library to generate a wide variety of realistic-looking data.

Configuration:

generator (required) - Faker provider (name, email, address, etc.)

locale (optional, default: en_US) - Language/region (de_DE, fr_FR, ja_JP)

... - Additional provider-specific arguments

Basic Usage:


users:
  count: 5
  fields:
    full_name:
      generator: faker
      config:
        generator: "name"

With Locale:


users:
  count: 5
  fields:
    full_name:
      generator: faker
      config:
        generator: "name" 
        locale: "de_DE"

Common Faker Providers:

name, first_name, last_name, email, address, phone_number, company, job, text, sentence, url, ipv4, date_of_birth, pyint, pyfloat, pydecimal, uuid4, color_name, file_name

Choice Generator

choice

Randomly selects a value from a predefined list of options with optional weights

Configuration:

choices (required) - List of values to choose from

weights (optional) - Probability weights for each choice

Basic Usage:


MyEntity:
  count: 5
  fields:
    status:
      generator: choice
      config:
        choices: ["pending", "active", "completed"]

With Weights:


MyEntity:
  count: 5
  fields:
    status:
      generator: choice
      config:
        choices: ["pending", "active", "completed"]
        weights: [0.3, 0.5, 0.2]
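
To see what the weights mean in practice, the sketch below models a weighted pick with Python's standard random.choices. It is an illustration of the idea, not Mockingbird's code; the helper name weighted_choice is hypothetical, and whether Mockingbird requires the weights to sum to 1 is not stated here.

```python
import random


def weighted_choice(rng, choices, weights=None):
    """Pick one value using Python's random.choices (weights are relative)."""
    # random.choices returns a list of k picks; take the single pick
    return rng.choices(choices, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
statuses = ["pending", "active", "completed"]
sample = [weighted_choice(rng, statuses, [0.3, 0.5, 0.2]) for _ in range(1000)]

# With these weights "active" should make up roughly half of the sample
print({status: sample.count(status) for status in statuses})
```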

Timestamp Generator

timestamp

Generates random timestamps within a specified date range with custom formatting

Configuration:

start_date (required, String) - Start of date range

end_date (required, String) - End of date range. Must be after start_date

format (optional, String) - Output format using strftime directives. If not provided, the timestamp will be in ISO 8601 format

ISO Format:


MyEntity:
  count: 5
  fields:
    created_at:
      generator: timestamp
      config:
        start_date: "2023-01-01"
        end_date: "2023-12-31"

Custom Format:


MyEntity:
  count: 5
  fields:
    created_at:
      generator: timestamp
      config:
        start_date: "2023-01-01 00:00:00"
        end_date: "2024-01-01 23:59:59"
        format: "%Y-%m-%d %H:%M"
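
The behaviour described above can be approximated in plain Python. This is a sketch under assumptions, not the generator's real implementation: the function name random_timestamp is hypothetical, dates are parsed with datetime.fromisoformat, and values are picked at whole-second resolution.

```python
import random
from datetime import datetime, timedelta


def random_timestamp(rng, start_date, end_date, fmt=None):
    """Return a random timestamp in [start_date, end_date] as a string."""
    start = datetime.fromisoformat(start_date)
    end = datetime.fromisoformat(end_date)
    if end <= start:
        raise ValueError("end_date must be after start_date")
    # Pick a whole-second offset inside the range
    offset = rng.randint(0, int((end - start).total_seconds()))
    value = start + timedelta(seconds=offset)
    # Fall back to ISO 8601 when no strftime format is given
    return value.strftime(fmt) if fmt else value.isoformat()


rng = random.Random(7)
print(random_timestamp(rng, "2023-01-01", "2023-12-31"))
print(random_timestamp(rng, "2023-01-01 00:00:00", "2024-01-01 23:59:59", "%Y-%m-%d %H:%M"))
```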

Reference Generator

ref

The ref generator is the primary tool for creating one-to-one or one-to-many relationships, similar to foreign keys in a database. It works by "looking up" a value from a field in another entity that has already been generated.

There are two distinct modes of usage: Primary Reference and Secondary Reference.

Primary Reference:

This is the most common use case. You use it to pick a random value from a column in another entity. For example, you can assign a `user_id` to an order, or a `product_id` to a review.

ref: (Required, String) - Primary reference in the form EntityName.FieldName (e.g., users.id)

Example:


users:
  count: 2
  fields:
    id:
      generator: sequence
      config:
        start_at: 1
    name:
      generator: faker
      config:
        generator: "name"

orders:
  count: 5
  fields:
    order_id:
      generator: sequence
      config:
        start_at: 101
    customer_id:            # Primary reference
      generator: ref
      config:
        ref: "users.id"     # customer_id is filled with a random user id

Secondary Reference:

This mode is used to pull additional, related data from a record you have already referenced using primary reference. This is crucial for ensuring data consistency.

use_record_from: (Required, String) - Name of the field in the current entity that holds the primary reference. Additional values are taken from the record that field points to

field_to_get: (Required, String) - Field value to extract

Example:


users:
  count: 2
  fields:
    id:
      generator: sequence
      config:
        start_at: 1
    name:
      generator: faker
      config:
        generator: "name"

orders:
  count: 5
  fields:
    order_id:
      generator: sequence
      config:
        start_at: 101
    customer_id:              # Primary reference
      generator: ref
      config:
        ref: "users.id"       # customer_id is filled with a random user id
    customer_name:            # Secondary reference
      generator: ref
      config:
        use_record_from: customer_id   # Reuses the users record already picked for customer_id
        field_to_get: name

Another Example:


Products:
  count: 20
  fields:
    product_id:
      generator: sequence
      config:
        start_at: 201
    name:
      generator: faker
      config:
        generator: catch_phrase
    price:
      generator: faker
      config:
        generator: pydecimal
        left_digits: 2
        right_digits: 2
        positive: true
OrderItems:
  count: 100
  fields:
    item_id:
      generator: sequence
      config:
        start_at: 7001
    order_id:
      generator: ref
      config:
        ref: Orders.order_id    # Assumes an Orders entity with an order_id field is defined elsewhere in the blueprint
    product_id:
      generator: ref
      config:
        ref: Products.product_id
    unit_price:
      generator: ref
      config:
        use_record_from: product_id
        field_to_get: price
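
The two reference modes in this section can be pictured with plain dictionaries. The Python sketch below is only a conceptual model, not how Mockingbird is implemented; it uses a users/orders pair like the earlier examples, and the names "Alice" and "Bob" are placeholder data.

```python
import random

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Parent entity, generated before anything that references it
users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

orders = []
for order_id in range(101, 106):
    # Primary reference: pick one whole user record at random
    user = rng.choice(users)
    orders.append({
        "order_id": order_id,
        "customer_id": user["id"],      # ref: users.id
        # Secondary reference: reuse the record picked for customer_id
        "customer_name": user["name"],  # use_record_from + field_to_get
    })

for order in orders:
    print(order)
```

Because customer_name comes from the same record as customer_id, the two fields can never disagree, which is the consistency guarantee a secondary reference is meant to provide.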

Expression Generator

expr

Evaluates a Python-like expression to generate a value. This is a powerful generator for calculations, conditional logic, and complex data manipulation.

Configuration:

expression (required) - Python expression to evaluate

A special variable `current` refers to the record being processed. It is a dictionary of the values generated for this record, so an expression can reference other fields in the same record.

Available Context:

current, random, math, datetime, uuid4, sum, len, min, max, str, int, float

Calculation:


OrderItems:
  count: 5
  fields:
    price:
      generator: faker
      config:
        generator: pydecimal
        left_digits: 2
        right_digits: 2
        positive: true
    quantity:
      generator: expr
      config:
        expression: random.randint(1, 25)                     # Random quantity
    total_price:
      generator: expr
      config:
        expression: current['price'] * current['quantity']    # Calculate price based on generated quantity
    category:
      generator: expr
      config:
        expression: "'bulk' if current['quantity'] > 5 else 'regular'"   # Conditional logic
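
The way the three expressions above build on each other can be mimicked with Python's eval and an explicit context. This is a hedged sketch, not Mockingbird's evaluator: the evaluate helper is hypothetical, the price is a fixed placeholder, and it assumes fields are evaluated top to bottom so later fields can read earlier ones.

```python
import random
from decimal import Decimal


def evaluate(expression, current, rng):
    """Evaluate an expression with a small, explicit context."""
    # Only these names are visible to the expression; builtins are emptied.
    # Note: this is NOT a secure sandbox, just an illustration.
    context = {"current": current, "random": rng, "int": int, "float": float}
    return eval(expression, {"__builtins__": {}}, context)


rng = random.Random(3)
record = {"price": Decimal("19.99")}  # placeholder for the faker-generated price

# Each result is stored in the record before the next expression runs
record["quantity"] = evaluate("random.randint(1, 25)", record, rng)
record["total_price"] = evaluate("current['price'] * current['quantity']", record, rng)
record["category"] = evaluate("'bulk' if current['quantity'] > 5 else 'regular'", record, rng)
print(record)
```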

Enum Generator

enum

Cycles through a predefined list of values in a fixed, repeating order.

Configuration:

values (required) - List of values to cycle through


Items:
  count: 5
  fields:
    id:
      generator: sequence
      config:
        start_at: 100
    status:
      generator: enum
      config:
        values: ["active", "pending", "expired"]

Output Pattern:

id: 100, status: "active"
id: 101, status: "pending"
id: 102, status: "expired"
id: 103, status: "active" (cycle repeats)
id: 104, status: "pending"
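
The same pattern can be reproduced with itertools.cycle, which is a convenient way to reason about how enum pairs with sequence. This is an illustrative sketch only, not the generator's source.

```python
from itertools import count, cycle, islice

values = ["active", "pending", "expired"]

# zip pairs each sequential id with the next value from the repeating cycle
rows = list(islice(zip(count(100), cycle(values)), 5))

for item_id, status in rows:
    print(f'id: {item_id}, status: "{status}"')
```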

CLI Commands

mockingbird generate

Generate mock data from Blueprint.yaml

# Basic usage

mockingbird generate [OPTIONS] PATH_TO_BLUEPRINT_FILE

PATH_TO_BLUEPRINT_FILE: This is the path to your .yaml file that defines the data you want to generate.

# Example

mockingbird generate Blueprint.yaml --format parquet --seed 42 --output ./data

Options:

--format - Output format. Allowed values are csv (default), json, parquet
--seed - Random seed for reproducible data
--output - Output directory

mockingbird init

Create a sample blueprint file

mockingbird init

Creates a sample Blueprint.yaml file to get you started.

Options:

--output - The name of the blueprint file to generate. Default is 'Blueprint.yaml'.

Reproducibility

Seed

When developing and testing applications, consistency is key. You need to be able to reliably reproduce bugs, validate fixes, and ensure that your tests run against the same data every time. Mockingbird achieves this through the use of a "seed."

What is a seed:

In computing, a "random" number generator isn't truly random; it produces a sequence of numbers that appears random but is actually deterministic. The sequence is determined by an initial value called a seed. If you provide the same seed to a random number generator, it will produce the exact same sequence of "random" numbers every single time.

This is the principle behind Mockingbird's --seed option. When you provide a seed, Mockingbird ensures that all of its internal random processes start from the same point, resulting in identical data output for the same blueprint.

Using seed option:

You can provide a seed to the generate command using the --seed option.
Using the same seed value with the same Blueprint file produces identical data output every time you run the generate command.

# Seed usage

mockingbird generate Blueprint.yaml --format parquet --seed 42 --output ./data
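
The principle behind seeding can be checked directly with Python's standard random module. The sketch below shows only the general mechanism; how Mockingbird wires the seed into its generators is not covered here, and the function name generate_ids is hypothetical.

```python
import random


def generate_ids(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(1, 1000) for _ in range(n)]


first = generate_ids(42)
second = generate_ids(42)
other = generate_ids(43)

print(first == second)  # same seed, same sequence
print(first == other)   # a different seed almost certainly gives a different sequence
```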
