Mockingbird Documentation

Complete guide to generating realistic mock data with perfect referential integrity

Installation

Install via pip
Mockingbird is available on PyPI and can be installed using pip
# Install Mockingbird
pip install mockingbird-cli
# Verify installation
mockingbird version

Requirements:

  • Python 3.8 or higher
  • pip package manager

Quick Start

1. Create a Blueprint
Create a sample blueprint file
mockingbird init
2. Generate Data
Create your mock dataset
mockingbird generate Blueprint.yaml --output ./output_data

Blueprint Syntax

Basic Structure
Blueprints define entities and their fields using YAML format
# Blueprint.yaml
MyEntity:
  count: 10      # Number of records to generate
  fields:
    MyField 1:
      generator: generator_name
      config:
        # ... generator-specific options

Data Generators

Sequence Generator
sequence
Produces auto-incrementing integer sequences, perfect for IDs

Configuration:

start_at (optional, default: 1) - Starting integer value
increment (optional, default: 1) - The value to increment by

users:
  count: 5
  fields:
    user_id:
      generator: sequence
      config:
        start_at: 1
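To make the effect of start_at and increment concrete, here is a minimal Python sketch of the same counting behavior. It is an illustration only, not Mockingbird's implementation; the helper name sequence_values is hypothetical.

```python
from itertools import count, islice


def sequence_values(n, start_at=1, increment=1):
    """Return the first n values of an auto-incrementing integer sequence."""
    return list(islice(count(start_at, increment), n))


# Mirrors the blueprint above: 5 records, start_at: 1
print(sequence_values(5, start_at=1))                  # [1, 2, 3, 4, 5]
# A non-default increment steps by that amount
print(sequence_values(3, start_at=101, increment=10))  # [101, 111, 121]
```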
Faker Generator
faker
Leverages the Faker library to generate a wide variety of realistic-looking data.

Configuration:

generator (required) - Faker provider (name, email, address, etc.)
locale (optional, default: en_US) - Language/region (de_DE, fr_FR, ja_JP)
... - Additional provider-specific arguments
Basic Usage:

users:
  count: 5
  fields:
    full_name:
      generator: faker
      config:
        generator: "name"
With Locale:

users:
  count: 5
  fields:
    full_name:
      generator: faker
      config:
        generator: "name" 
        locale: "de_DE"
Common Faker Providers:
name, first_name, last_name, email, address, phone_number, company, job, text, sentence, url, ipv4, date_of_birth, pyint, pyfloat, pydecimal, uuid4, color_name, file_name
Choice Generator
choice
Randomly selects a value from a predefined list of options with optional weights

Configuration:

choices (required) - List of values to choose from
weights (optional) - Probability weights for each choice
Basic Usage:

MyEntity:
  count: 5
  fields:
    status:
      generator: choice
      config:
        choices: ["pending", "active", "completed"]
With Weights:

MyEntity:
  count: 5
  fields:
    status:
      generator: choice
      config:
        choices: ["pending", "active", "completed"]
        weights: [0.3, 0.5, 0.2]
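The weights shape how often each value appears across many records. The sketch below shows the idea with Python's standard random.choices. Whether Mockingbird uses this function internally is an assumption, and the sample size and seed are arbitrary.

```python
import random
from collections import Counter

choices = ["pending", "active", "completed"]
weights = [0.3, 0.5, 0.2]

# A seeded generator keeps this illustration repeatable
rng = random.Random(0)
sample = rng.choices(choices, weights=weights, k=10_000)

# Observed frequencies should land close to the configured weights
freq = Counter(sample)
for value in choices:
    print(value, round(freq[value] / len(sample), 3))
```

With only 5 records, as in the blueprint above, the observed split can differ noticeably from the weights; the proportions only settle for larger counts.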
Timestamp Generator
timestamp
Generates random timestamps within a specified date range with custom formatting

Configuration:

start_date (required, String) - Start of date range
end_date (required, String) - End of date range. Must be after start_date
format (optional, String) - Output format using strftime directives. If not provided, the timestamp will be in ISO 8601 format
ISO Format:

MyEntity:
  count: 5
  fields:
    created_at:
      generator: timestamp
      config:
        start_date: "2023-01-01"
        end_date: "2023-12-31"
Custom Format:

MyEntity:
  count: 5
  fields:
    created_at:
      generator: timestamp
      config:
        start_date: "2023-01-01 00:00:00"
        end_date: "2024-01-01 23:59:59"
        format: "%Y-%m-%d %H:%M"
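As a rough illustration of what the three options do, the following Python sketch picks a random moment between two dates and formats it with strftime, falling back to ISO 8601 when no format is given. The function name random_timestamp and the one-second resolution are assumptions made for this example, not documented Mockingbird behavior.

```python
import random
from datetime import datetime, timedelta


def random_timestamp(start_date, end_date, fmt=None, rng=random):
    """Pick a random moment in [start_date, end_date] and format it."""
    start = datetime.fromisoformat(start_date)
    end = datetime.fromisoformat(end_date)
    if end <= start:
        raise ValueError("end_date must be after start_date")
    span_seconds = int((end - start).total_seconds())
    moment = start + timedelta(seconds=rng.randint(0, span_seconds))
    # No format given: fall back to ISO 8601
    return moment.strftime(fmt) if fmt else moment.isoformat()


rng = random.Random(1)
print(random_timestamp("2023-01-01", "2023-12-31", rng=rng))
print(random_timestamp("2023-01-01 00:00:00", "2024-01-01 23:59:59",
                       fmt="%Y-%m-%d %H:%M", rng=rng))
```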
Reference Generator
ref
The ref generator is the primary tool for creating one-to-one or one-to-many relationships, similar to foreign keys in a database. It works by "looking up" a value from a field in another entity that has already been generated.
There are two distinct modes of usage: Primary Reference and Secondary Reference.
Primary Reference:
This is the most common use case. You use it to pick a random value from a column in another entity. For example, you can assign a `user_id` to an order, or a `product_id` to a review.
ref: (Required, String) - Primary reference value in the form EntityName.FieldName (e.g., Users.id)
Example:

users:
  count: 2
  fields:
    id:
      generator: sequence
      config:
        start_at: 1
    name:
      generator: faker
      config:
        generator: "name"

orders:
  count: 5
  fields:
    order_id:
      generator: sequence
      config:
        start_at: 101
    customer_id:            # Primary reference
      generator: ref
      config:
        ref: "users.id"   # customer_id is filled with a random user id
          
Secondary Reference:
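This mode pulls additional, related data from the record that a primary reference has already selected. This is crucial for data consistency: the extra values always come from the same parent record as the referenced ID.
use_record_from: (Required, String) - Name of the field in the current entity that holds the primary reference (e.g., customer_id). The record selected for that field is reused.
field_to_get: (Required, String) - Field in the referenced record whose value should be extracted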
Example:

users:
  count: 2
  fields:
    id:
      generator: sequence
      config:
        start_at: 1
    name:
      generator: faker
      config:
        generator: "name"

orders:
  count: 5
  fields:
    order_id:
      generator: sequence
      config:
        start_at: 101
    customer_id:            # Primary reference
      generator: ref
      config:
        ref: users.id   # customer_id is filled with a random user id
    customer_name:            # Secondary reference
      generator: ref
      config:
        use_record_from: customer_id   # Reuse the users record picked for customer_id
        field_to_get: name
          
          
Another Example:

Products:
  count: 20
  fields:
    product_id:
      generator: sequence
      config:
        start_at: 201
    name:
      generator: faker
      config:
        generator: catch_phrase
    price:
      generator: faker
      config:
        generator: pydecimal
        left_digits: 2
        right_digits: 2
        positive: true
OrderItems:
  count: 100
  fields:
    item_id:
      generator: sequence
      config:
        start_at: 7001
    order_id:
      generator: ref
      config:
        ref: Orders.order_id   # Assumes an Orders entity is defined elsewhere in the blueprint
    product_id:
      generator: ref
      config:
        ref: Products.product_id
    unit_price:
      generator: ref
      config:
        use_record_from: product_id
        field_to_get: price          
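The relationship between the two modes can be sketched in plain Python: the primary reference picks one parent record at random, and the secondary reference reads another field from that same record. This is a simplified model for illustration, with hand-written product data, and is not Mockingbird's internal code.

```python
import random

rng = random.Random(7)  # arbitrary seed for a repeatable illustration

# Parent entity, generated first (values are made up for this sketch)
products = [
    {"product_id": 201, "price": 19.99},
    {"product_id": 202, "price": 5.49},
    {"product_id": 203, "price": 42.00},
]

order_items = []
for item_id in range(7001, 7006):
    picked = rng.choice(products)             # primary reference: pick a parent record
    order_items.append({
        "item_id": item_id,
        "product_id": picked["product_id"],   # value copied from the picked record
        "unit_price": picked["price"],        # secondary reference: same record, other field
    })

# Every unit_price matches the price of the product it points to
price_by_id = {p["product_id"]: p["price"] for p in products}
print(all(row["unit_price"] == price_by_id[row["product_id"]] for row in order_items))
```

Picking product_id and unit_price independently would break this consistency, which is why the secondary reference reuses the record chosen by the primary one.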
          
Expression Generator
expr

Evaluates a Python-like expression to generate a value. This is a powerful generator for calculations, conditional logic, and complex data manipulation.

Configuration:

expression (required) - Python expression to evaluate

A special variable `current` refers to the record being processed. It is a dictionary of the values generated so far for this record, so an expression can reference other fields of the same record, for example current['price'].
Available Context:
current, random, math, datetime, uuid4, sum, len, min, max, str, int, float
Calculation:

OrderItems:
  count: 5
  fields:
    price:
      generator: faker
      config:
        generator: pydecimal
        left_digits: 2
        right_digits: 2
        positive: true
    quantity:
      generator: expr
      config:
        expression: random.randint(1, 25)                     # Random quantity
    total_price:
      generator: expr
      config:
        expression: current['price'] * current['quantity']    # Calculate price based on generated quantity
    category:
      generator: expr
      config:
        expression: "'bulk' if current['quantity'] > 5 else 'regular'"   # Conditional logic
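One way to picture how such expressions are resolved is to evaluate each string against a small context that exposes `current` alongside a few helpers. The sketch below uses Python's eval for brevity. How Mockingbird actually evaluates expressions is not stated here, so treat the evaluate helper and its context as assumptions.

```python
import math
import random
from decimal import Decimal


def evaluate(expression, current, rng):
    """Evaluate an expression string with `current` and helpers in scope."""
    context = {"__builtins__": {}, "current": current, "random": rng, "math": math}
    # eval is used only for illustration; it is not safe for untrusted input
    return eval(expression, context)


rng = random.Random(3)
record = {"price": Decimal("12.50")}  # stands in for the pydecimal value

# Fields are filled in order, so later expressions can read earlier fields
record["quantity"] = evaluate("random.randint(1, 25)", record, rng)
record["total_price"] = evaluate("current['price'] * current['quantity']", record, rng)
record["category"] = evaluate(
    "'bulk' if current['quantity'] > 5 else 'regular'", record, rng
)
print(record)
```

The example implies that a field can only read values generated before it in the same record, as total_price does with price and quantity.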
    
Enum Generator
enum

Cycles through a predefined list of values in a fixed, repeating order.

Configuration:

values (required) - List of values to cycle through

Items:
  count: 5
  fields:
    id:
      generator: sequence
      config:
        start_at: 100
    status:
      generator: enum
      config:
        values: ["active", "pending", "expired"]

          
Output Pattern:
id: 100, status: "active"
id: 101, status: "pending"
id: 102, status: "expired"
id: 103, status: "active" (cycle repeats)
id: 104, status: "pending"
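The cycling behavior shown in the output pattern amounts to indexing the list with the record number modulo the list length. A minimal Python sketch of that idea, not the actual implementation:

```python
values = ["active", "pending", "expired"]

# Record i takes values[i % len(values)], so the list repeats in order
rows = [{"id": 100 + i, "status": values[i % len(values)]} for i in range(5)]

for row in rows:
    print(row)
# statuses in order: active, pending, expired, active, pending
```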

CLI Commands

mockingbird generate
Generate mock data from Blueprint.yaml
# Basic usage
mockingbird generate [OPTIONS] PATH_TO_BLUEPRINT_FILE
PATH_TO_BLUEPRINT_FILE: This is the path to your .yaml file that defines the data you want to generate.
# Example
mockingbird generate Blueprint.yaml --format parquet --seed 42 --output ./data

Options:

  • --format - Output format. Allowed values are csv (default), json, parquet
  • --seed - Random seed for reproducible data
  • --output - Output directory
mockingbird init
Create a sample blueprint file
mockingbird init

Creates a sample Blueprint.yaml file to get you started.

Options:

  • --output - The name of the blueprint file to generate. Default is 'Blueprint.yaml'.

Reproducibility

Seed

When developing and testing applications, consistency is key. You need to be able to reliably reproduce bugs, validate fixes, and ensure that your tests run against the same data every time. Mockingbird achieves this through the use of a "seed."

What is a seed:


In computing, a "random" number generator isn't truly random; it produces a sequence of numbers that appears random but is actually deterministic. The sequence is determined by an initial value called a seed. If you provide the same seed to a random number generator, it will produce the exact same sequence of "random" numbers every single time.

This is the principle behind Mockingbird's --seed option. When you provide a seed, Mockingbird ensures that all of its internal random processes start from the same point, resulting in identical data output for the same blueprint.
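The determinism described above is easy to observe with Python's standard random module. This sketch only demonstrates the general principle; the function generate_quantities is hypothetical and does not reflect how Mockingbird wires the seed into its generators.

```python
import random


def generate_quantities(seed, n=5):
    """Draw n pseudo-random integers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.randint(1, 25) for _ in range(n)]


first = generate_quantities(42)
second = generate_quantities(42)

print(first)
print(second)
print(first == second)  # True: same seed, same sequence
```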

Using seed option:


You can provide a seed to the generate command using the --seed option.
Using the same seed value with the same Blueprint file produces identical data output every time you run the generate command.
# Seed usage
mockingbird generate Blueprint.yaml --format parquet --seed 42 --output ./data