Platform Architecture & Reliability

Understanding how DataMagik stays fast, secure, and always available

Built for Reliability

DataMagik is built on enterprise-grade infrastructure designed to keep your documents flowing 24/7. Our platform automatically handles failures, scales with demand, and protects your data across multiple data centers worldwide.

This page explains our architecture in plain terms so you can understand exactly how we keep your business running smoothly.

System Architecture

[Architecture diagram. Components shown:]

Request path: Your Browser (Chrome, Safari, etc.), Internet, Fly.io Edge Network (20+ global locations), Auto Load Balancer.

Web Application (auto-scales 1-6 instances): App #1 Virginia, App #2 Virginia, plus more as needed. Health checked every 10s.

Document Workers (auto-scales 2-20 instances): Worker #1, Worker #2, plus more as queue depth grows. Jobs arrive through the Redis task queue.

PDF Generator (auto-scales 0-4 instances): Chrome Headless, starts on demand.

CockroachDB Cluster (distributed SQL database): DB Node 1 Virginia (primary), DB Node 2 Oregon (replica), DB Node 3 Frankfurt (replica). Real-time replication keeps data synced across all nodes.

Object Storage (Tigris / AWS S3): your documents, replicated globally.

Redis Cache & Queue: fast session and task management.

24/7 Monitoring: health checks, auto-recovery, performance metrics.

Automated Backups: hourly incremental, daily full backups, 30-day retention.

Legend: handles web requests; processes documents; stores your data; file storage; fast cache; routes traffic.

What you're looking at: This diagram shows the parts of DataMagik and how they work together. Each box represents a different service, and the arrows show how information flows through the system.

How It Works

1. You Make a Request

When you generate a document or access the platform, your request travels through the internet to Fly.io's global edge network. This network has servers in over 20 locations worldwide, so your request automatically goes to the closest one for the fastest response.

2. Smart Load Balancing

Fly.io's automatic load balancer checks which of our web application servers is healthiest and has the most capacity. It then routes your request there. If one server is overwhelmed or experiencing problems, your request automatically goes to a different one.
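As an illustration only, the routing decision described in this step could look something like the sketch below. It is not Fly.io's actual algorithm; the `Server` type, its fields, and `pick_server` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    healthy: bool          # result of the most recent health check
    active_requests: int   # requests currently in flight
    capacity: int          # assumed limit on concurrent requests


def pick_server(servers: Sequence[Server]) -> Optional[Server]:
    """Return the healthy server with the most spare capacity, or None."""
    candidates = [
        s for s in servers
        if s.healthy and s.active_requests < s.capacity
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.capacity - s.active_requests)
```

A server that fails its health check or is already full is never a candidate, which is why a struggling server simply stops receiving new requests.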

3. Document Generation

When you create a document, our web app adds the job to a queue managed by Redis (a fast in-memory data store). Worker processes continuously monitor this queue and pick up jobs. If the queue grows, Fly.io automatically starts more workers (up to 20). When it's quiet, the worker pool shrinks back to its minimum of 2 to save resources.
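A minimal sketch of queue-depth scaling, assuming the 2-20 worker range stated on this page. The `JOBS_PER_WORKER` target and the `desired_workers` function are assumptions made for illustration, not actual platform settings.

```python
import math

MIN_WORKERS = 2      # lower bound of the worker range on this page
MAX_WORKERS = 20     # upper bound of the worker range on this page
JOBS_PER_WORKER = 5  # assumed backlog each worker should handle (hypothetical)


def desired_workers(queue_depth: int) -> int:
    """Workers needed for the current backlog, clamped to the allowed range."""
    needed = math.ceil(queue_depth / JOBS_PER_WORKER)
    return max(MIN_WORKERS, min(MAX_WORKERS, needed))
```

With these assumed numbers, an empty queue keeps 2 workers running, a backlog of 37 jobs asks for 8, and a very large backlog is capped at 20.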

4. PDF Creation

Workers send document templates to our PDF service, which uses Chrome's rendering engine to create high-quality PDFs. The PDF service only runs when needed: it starts automatically when there's work and stops when idle to minimize costs.

5. Secure Storage

Your generated documents are stored in object storage (Tigris or AWS S3) that automatically replicates files across multiple data centers. Your business data and settings are stored in CockroachDB, which keeps three copies of every piece of data across different geographic regions.

Fault Tolerance: What Happens When Things Go Wrong

❌ If a Web Server Crashes

What happens: Fly.io's health checks detect the problem within 10 seconds. The load balancer immediately stops sending traffic to that server. Within seconds, a replacement server automatically starts up.

✓ Impact: Zero. Your requests automatically go to healthy servers.

❌ If a Database Node Goes Down

What happens: CockroachDB's distributed architecture means every piece of data is stored on at least 3 different nodes in different locations. If one node fails, the other two continue serving your data without interruption. A replacement node automatically joins the cluster and syncs up.

✓ Impact: Zero. The database continues operating normally.
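Why two surviving nodes are enough: CockroachDB replicates data with a consensus protocol (Raft) that needs a majority of replicas to agree. The arithmetic can be sketched as follows; the function names are ours, chosen for the example.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains."""
    return replicas - majority(replicas)
```

With 3 replicas the majority is 2, so one node can fail without interrupting reads or writes.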

❌ If a Document Worker Fails

What happens: Jobs in the Redis queue have a "lock" that expires if a worker doesn't complete them. If a worker crashes mid-job, the lock expires and another worker automatically picks up the job. The system scales up additional workers to maintain processing capacity.

✓ Impact: Minimal. Your document might take a few extra seconds, but it will be generated.
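The expiring-lock idea can be sketched with a small in-memory stand-in for the queue. This is not our Redis implementation; `LeaseQueue`, its methods, and the 30-second lock used in the usage note are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Optional


class LeaseQueue:
    """In-memory stand-in for a job queue whose locks expire."""

    def __init__(self, lock_seconds: float) -> None:
        self.lock_seconds = lock_seconds
        # job id -> time at which its lock ends (0.0 means unlocked)
        self._lock_expiry: Dict[str, float] = {}

    def add(self, job_id: str) -> None:
        self._lock_expiry[job_id] = 0.0

    def claim(self, now: float) -> Optional[str]:
        """Lock and return the first job that is unlocked or whose lock expired."""
        for job_id, expires_at in self._lock_expiry.items():
            if expires_at <= now:
                self._lock_expiry[job_id] = now + self.lock_seconds
                return job_id
        return None

    def complete(self, job_id: str) -> None:
        """Remove a finished job so it is never handed out again."""
        self._lock_expiry.pop(job_id, None)
```

If worker A claims a job at time 0 with a 30-second lock and then crashes, a claim at time 10 returns nothing, while a claim at time 31 hands the same job to worker B. Only `complete` removes the job for good.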

❌ If an Entire Data Center Goes Offline

What happens: Fly.io automatically routes all traffic to healthy data centers. CockroachDB continues operating with nodes in other regions. Your documents in object storage are replicated globally, so they remain accessible from other locations.

✓ Impact: Minimal. Requests may be slightly slower due to increased distance, but the system remains fully operational.

❌ If Network Connections Fail

What happens: All services have automatic retry logic with exponential backoff. If a connection fails, they wait a short time and try again. Critical paths have multiple redundant routes. Database queries automatically fail over to replica nodes.

✓ Impact: Temporary delays (typically 1-5 seconds) while connections are re-established.
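Retry with exponential backoff can be sketched as below. The attempt count, delays, and the `retry_with_backoff` name are assumptions for illustration, not our production values; the random jitter is a common addition that keeps many clients from retrying at the same moment.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random time up to the current cap (jitter).
            sleep(random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

The wait cap grows as 0.1 s, 0.2 s, 0.4 s and so on up to `max_delay`, so a brief network blip costs little while a longer outage does not trigger a flood of retries.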

How We Protect Your Data

📊

Triple Replication

Every piece of your data exists in at least 3 locations simultaneously. CockroachDB automatically keeps these copies in sync in real-time, so if one copy is lost, two others remain available.

💾

Automated Backups

Full database backups run daily, and incremental backups run hourly. We retain 30 days of backups, so we can restore your data to any hourly backup point in that window if needed. Backup files are encrypted and stored in separate systems.
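To show how daily full and hourly incremental backups combine during a restore, here is a rough sketch. It assumes full backups run at midnight and incrementals on every following hour; that schedule and the `restore_chain` function are assumptions for the example, not a description of our actual backup tooling.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def restore_chain(target: datetime) -> Tuple[datetime, List[datetime]]:
    """Return the full backup to start from and the incrementals to replay.

    Assumes a full backup at midnight and an incremental on each later hour.
    """
    full = target.replace(hour=0, minute=0, second=0, microsecond=0)
    last_hour = target.replace(minute=0, second=0, microsecond=0)
    hours = int((last_hour - full) / timedelta(hours=1))
    incrementals = [full + timedelta(hours=h) for h in range(1, hours + 1)]
    return full, incrementals
```

Under these assumptions, restoring to 14:25 on a given day starts from that day's midnight full backup and replays the 14 incrementals taken from 01:00 through 14:00.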

🔒

Encryption Everywhere

All data is encrypted in transit (using TLS/SSL) and at rest (using AES-256). Your S3 storage credentials are encrypted before being stored in the database, so even if someone gained access to the database, they couldn't read your storage keys.

👥

Strict Access Control

Database connections require both username/password and SSL certificates. Internal services authenticate with API keys. All administrative access is logged and monitored. Role-based permissions ensure users only see their company's data.

Performance & Auto-Scaling

Fast Response Times

Typical API response time: under 100 ms. Document generation: 2-5 seconds depending on complexity. Redis caching keeps frequently accessed data in memory so it loads without a database round trip.

📈

Dynamic Scaling

Web servers: Scale 1-6 instances based on traffic. Workers: Scale 2-20 based on queue depth. PDF service: Scales 0-4 (starts on-demand, stops when idle). Scaling decisions happen automatically in seconds.

🌍

Global Distribution

Fly.io deploys applications close to users automatically. Requests are routed to the nearest available server. Static assets are cached at the edge for instant delivery.

💰

Cost-Efficient

Services automatically scale down when not in use, and the PDF generation servers scale all the way to zero, running only when needed. This "serverless" approach means you get enterprise-grade infrastructure at a fraction of the cost of running dedicated servers 24/7.

24/7 Monitoring & Recovery

🔍

Health Checks

Every service is checked every 10 seconds. Unhealthy services are automatically restarted. Failed health checks trigger immediate failover.

🚨

Automatic Alerts

Critical issues trigger immediate notifications. Error rates, response times, and resource usage are continuously monitored. Anomalies are detected and investigated.

🔄

Self-Healing

Failed processes automatically restart. Stuck jobs are retried. Database connections reconnect automatically. Most issues resolve without human intervention.

Our Reliability Commitment

99.9%

Target Uptime

We target 99.9% uptime, which allows at most about 8.8 hours of downtime per year (0.1% of 8,760 hours). In practice, our distributed architecture typically achieves higher availability because individual component failures don't affect the overall system.
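The downtime figure follows directly from the uptime percentage. A small sketch of the arithmetic, with a function name of our own choosing:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def max_downtime_hours(
    uptime_percent: float, period_hours: float = HOURS_PER_YEAR
) -> float:
    """Downtime allowed in a period for a given uptime percentage."""
    return period_hours * (1 - uptime_percent / 100)
```

For 99.9% this gives about 8.76 hours per year, or roughly 43 minutes in a 30-day month (`period_hours=720`).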

<5 min

Recovery Time

In the rare event of a component failure, our automatic recovery systems typically restore service within 5 minutes. Database failovers happen in seconds. Application restarts take under a minute.

Note: These are operational targets, not contractual SLAs. Actual availability depends on factors including internet connectivity, client configuration, and external service dependencies. We continuously work to exceed these targets.

Built on Proven Technology

Fly.io

Global application platform with edge computing, automatic scaling, and built-in load balancing across 20+ regions worldwide.

CockroachDB

Distributed SQL database designed for cloud applications. Provides PostgreSQL compatibility with automatic sharding and replication.

Tigris / AWS S3

Enterprise object storage with global replication, 99.999999999% durability, and automatic data redundancy across availability zones.

Questions about our architecture or reliability? Contact our team

Last updated: November 2025