Platform Architecture & Reliability
Understanding how DataMagik stays fast, secure, and always available
Built for Reliability
DataMagik is built on enterprise-grade infrastructure designed to keep your documents flowing 24/7. Our platform automatically handles failures, scales with demand, and protects your data across multiple data centers worldwide.
This page explains our architecture in plain terms so you can understand exactly how we keep your business running smoothly.
System Architecture
What you're looking at: This diagram shows all the different parts of DataMagik and how they work together. Each colored box represents a different service, and the arrows show how information flows through the system.
How It Works
You Make a Request
When you generate a document or access the platform, your request travels through the internet to Fly.io's global edge network. This network has servers in over 20 locations worldwide, so your request automatically goes to the closest one for the fastest response.
Smart Load Balancing
Fly.io's automatic load balancer checks which of our web application servers is healthiest and has the most capacity. It then routes your request there. If one server is overwhelmed or experiencing problems, your request automatically goes to a different one.
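The routing decision described above can be sketched roughly as follows. This is an illustration only: the `Server` type, the `pick_server` function, and the capacity numbers are hypothetical, and Fly.io's actual load-balancing algorithm may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Server:
    # Hypothetical view of one web application server.
    name: str
    healthy: bool
    active_requests: int
    max_requests: int

    @property
    def free_capacity(self) -> int:
        return max(self.max_requests - self.active_requests, 0)


def pick_server(servers: Sequence[Server]) -> Optional[Server]:
    """Return the healthy server with the most free capacity, or None."""
    candidates = [s for s in servers if s.healthy and s.free_capacity > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.free_capacity)


servers = [
    Server("web-1", healthy=True, active_requests=90, max_requests=100),
    Server("web-2", healthy=False, active_requests=0, max_requests=100),
    Server("web-3", healthy=True, active_requests=40, max_requests=100),
]
chosen = pick_server(servers)  # web-2 is skipped because it is unhealthy
```

In this sketch an unhealthy server is never selected, no matter how idle it is, which mirrors the behavior described above for overwhelmed or failing servers.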
Document Generation
When you create a document, our web app adds it to a queue managed by Redis (a fast in-memory database). Worker processes constantly monitor this queue and pick up jobs. If the queue gets long, Fly.io automatically starts more workers (up to 20). When it's quiet, the extra workers shut down to save resources.
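The handoff between the web app and the workers can be sketched like this. An in-memory `deque` stands in for the Redis queue so the example runs on its own. The function names and job IDs are hypothetical, and first-in, first-out ordering is an assumption.

```python
from collections import deque
from typing import Deque, Optional

# Stand-in for the Redis-managed queue; a real deployment would use Redis itself.
job_queue: Deque[str] = deque()


def enqueue_document(job_id: str) -> None:
    """Web app side: add a document job to the back of the queue."""
    job_queue.append(job_id)


def next_job() -> Optional[str]:
    """Worker side: take the oldest waiting job, or None if the queue is empty."""
    return job_queue.popleft() if job_queue else None


enqueue_document("invoice-1001")
enqueue_document("invoice-1002")
picked = [next_job(), next_job(), next_job()]  # third call finds the queue empty
```

With Redis lists, this pattern is commonly built on the LPUSH and BRPOP commands, although this page does not say which commands DataMagik uses.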
PDF Creation
Workers send document templates to our PDF service, which uses Chrome's rendering engine to create high-quality PDFs. The PDF service runs only when needed: it starts automatically when there's work and stops when idle to minimize costs.
Secure Storage
Your generated documents are stored in object storage (Tigris or AWS S3) that automatically replicates files across multiple data centers. Your business data and settings are stored in CockroachDB, which keeps three copies of every piece of data across different geographic regions.
Fault Tolerance: What Happens When Things Go Wrong
❌ If a Web Server Crashes
What happens: Fly.io's health checks detect the problem within 10 seconds. The load balancer immediately stops sending traffic to that server. Within seconds, a replacement server automatically starts up.
✓ Impact: Zero. Your requests automatically go to healthy servers.
❌ If a Database Node Goes Down
What happens: CockroachDB's distributed architecture means every piece of data is stored on at least 3 different nodes in different locations. If one node fails, the other two continue serving your data without interruption. A replacement node automatically joins the cluster and syncs up.
✓ Impact: Zero. The database continues operating normally.
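Why losing one of three copies does not interrupt service: CockroachDB replicates data with a consensus protocol (Raft), so data stays available as long as a majority of its replicas can be reached. The arithmetic is sketched below; the function names are illustrative and not part of any CockroachDB API.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains."""
    return replicas - quorum_size(replicas)


def is_available(replicas: int, failed: int) -> bool:
    """True if the surviving replicas still form a majority."""
    return replicas - failed >= quorum_size(replicas)


three_copies_quorum = quorum_size(3)          # 2 of 3 replicas must agree
three_copies_spare = tolerated_failures(3)    # 1 replica can fail
```

With three replicas, one node can fail and the remaining two still form a majority. If two fail at once, the affected data becomes unavailable until a replica is restored.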
❌ If a Document Worker Fails
What happens: Jobs in the Redis queue have a "lock" that expires if a worker doesn't complete them. If a worker crashes mid-job, the lock expires and another worker automatically picks up the job. The system scales up additional workers to maintain processing capacity.
✓ Impact: Minimal. Your document might take a few extra seconds, but it will be generated.
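The expiring lock described above might work along these lines. This is an in-memory sketch, not the production code: the `LeasedQueue` class, the 30-second lock duration, and the job ID are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, Optional


class LeasedQueue:
    """In-memory stand-in for a Redis-backed queue with expiring job locks."""

    def __init__(self, lock_ttl: float) -> None:
        self.lock_ttl = lock_ttl
        # job_id -> time at which the current lock expires.
        self._lock_expiry: Dict[str, float] = {}

    def add(self, job_id: str) -> None:
        # A new job starts unlocked, so any worker may claim it.
        self._lock_expiry.setdefault(job_id, float("-inf"))

    def claim(self, now: float) -> Optional[str]:
        """Lock and return the first job that is unlocked or whose lock expired."""
        for job_id, expiry in self._lock_expiry.items():
            if expiry <= now:
                self._lock_expiry[job_id] = now + self.lock_ttl
                return job_id
        return None

    def complete(self, job_id: str) -> None:
        """Remove a finished job so it is never handed out again."""
        self._lock_expiry.pop(job_id, None)


queue = LeasedQueue(lock_ttl=30.0)
queue.add("invoice-42")
first = queue.claim(now=0.0)     # worker A takes the job, then crashes
blocked = queue.claim(now=10.0)  # lock still held, nothing to claim
retry = queue.claim(now=31.0)    # lock expired, worker B takes over
queue.complete("invoice-42")
```

One consequence of this design is that a slow but healthy worker could have its job picked up a second time, so job handlers are usually written to be safe to run twice.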
❌ If an Entire Data Center Goes Offline
What happens: Fly.io automatically routes all traffic to healthy data centers. CockroachDB continues operating with nodes in other regions. Your documents in object storage are replicated globally, so they remain accessible from other locations.
✓ Impact: Minimal. Requests may be slightly slower due to increased distance, but the system remains fully operational.
❌ If Network Connections Fail
What happens: All services have automatic retry logic with exponential backoff. If a connection fails, they wait a short time and try again. Critical paths have multiple redundant routes. Database queries automatically fail over to replica nodes.
✓ Impact: Temporary delays (typically 1-5 seconds) while connections are re-established.
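Retry with exponential backoff means the wait doubles after each failed attempt, up to a cap. A minimal sketch follows. The function name, the attempt limit, and the delay values are assumptions chosen for illustration, not DataMagik's actual settings.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Delay grows as base_delay * 2^attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_connect() -> str:
    # Simulated connection that fails twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "connected"


waits: List[float] = []
result = retry_with_backoff(flaky_connect, sleep=waits.append)
```

Here the `sleep` parameter is replaced with `waits.append` so the example records the delays instead of actually pausing. Production retry loops often add random jitter to each delay so that many clients do not retry at the same moment.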
How We Protect Your Data
Triple Replication
Every piece of your data exists in at least 3 locations simultaneously. CockroachDB automatically keeps these copies in sync in real-time, so if one copy is lost, two others remain available.
Automated Backups
Full database backups run daily, and incremental backups run hourly. We retain 30 days of backups, so we can restore your data to any hourly backup point within that window if needed. Backup files are encrypted and stored in separate systems.
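To illustrate how daily full backups and hourly incremental backups combine during a restore, here is a rough sketch. It assumes the full backup runs at midnight and the incrementals run on the hour. Neither schedule detail is stated on this page, and `restore_chain` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

RETENTION = timedelta(days=30)  # retention window stated above


def restore_chain(target: datetime, now: datetime) -> List[Tuple[str, datetime]]:
    """List the backups needed to restore to the last hourly point at or before target."""
    if target > now or now - target > RETENTION:
        raise ValueError("target is outside the 30-day retention window")
    # Assumption: the daily full backup runs at midnight.
    full_time = target.replace(hour=0, minute=0, second=0, microsecond=0)
    chain = [("full", full_time)]
    # Assumption: incrementals run on the hour after the full backup.
    for hour in range(1, target.hour + 1):
        chain.append(("incremental", full_time + timedelta(hours=hour)))
    return chain


now = datetime(2025, 11, 12, 9, 0)
chain = restore_chain(datetime(2025, 11, 10, 3, 30), now)
# chain holds the midnight full backup plus the 01:00, 02:00, and 03:00 incrementals
```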
Encryption Everywhere
All data is encrypted in transit (using TLS/SSL) and at rest (using AES-256). Your S3 storage credentials are encrypted before being stored in the database, so even if someone gained access to the database, they couldn't read your storage keys.
Strict Access Control
Database connections require both username/password and SSL certificates. Internal services authenticate with API keys. All administrative access is logged and monitored. Role-based permissions ensure users only see their company's data.
Performance & Auto-Scaling
Fast Response Times
Typical API response time is under 100 ms. Document generation takes 2-5 seconds depending on complexity. Redis caching keeps frequently accessed data in memory so it loads quickly.
Dynamic Scaling
Web servers scale between 1 and 6 instances based on traffic. Workers scale between 2 and 20 instances based on queue depth. The PDF service scales between 0 and 4 instances (it starts on demand and stops when idle). Scaling decisions happen automatically within seconds.
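A scaling decision of this kind can be sketched as "work out how many instances the current load needs, then keep the answer inside the service's limits". The instance limits below come from this page. The `load_per_instance` figures and the function name are assumptions, and the real scaling logic on Fly.io may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingBounds:
    min_instances: int
    max_instances: int


# Instance limits as stated above.
BOUNDS = {
    "web": ScalingBounds(1, 6),
    "worker": ScalingBounds(2, 20),
    "pdf": ScalingBounds(0, 4),
}


def desired_instances(service: str, load: int, load_per_instance: int) -> int:
    """Instances needed for the load, clamped to the service's limits."""
    bounds = BOUNDS[service]
    needed = -(-load // load_per_instance)  # ceiling division
    return max(bounds.min_instances, min(needed, bounds.max_instances))


idle_workers = desired_instances("worker", load=0, load_per_instance=10)
busy_workers = desired_instances("worker", load=1000, load_per_instance=10)
idle_pdf = desired_instances("pdf", load=0, load_per_instance=5)
```

In this sketch an empty queue still leaves two workers running, a very long queue is capped at twenty, and only the PDF service drops to zero instances.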
Global Distribution
Fly.io deploys applications close to users automatically. Requests are routed to the nearest available server. Static assets are cached at the edge for instant delivery.
Cost-Efficient
Services automatically scale down when demand drops, and the PDF service scales to zero when not in use, so PDF generation servers only run when needed. This "serverless" approach means you get enterprise-grade infrastructure at a fraction of the cost of running dedicated servers 24/7.
24/7 Monitoring & Recovery
Health Checks
Every service is checked every 10 seconds. Unhealthy services are automatically restarted. Failed health checks trigger immediate failover.
Automatic Alerts
Critical issues trigger immediate notifications. Error rates, response times, and resource usage are continuously monitored. Anomalies are detected and investigated.
Self-Healing
Failed processes automatically restart. Stuck jobs are retried. Database connections reconnect automatically. Most issues resolve without human intervention.
Our Reliability Commitment
Target Uptime
We target 99.9% uptime, which means less than 9 hours of downtime per year. In practice, our distributed architecture typically achieves higher availability because individual component failures don't affect the overall system.
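The downtime figure follows directly from the uptime percentage. A short calculation, assuming a 365-day year and a 30-day month:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def allowed_downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at a given uptime target."""
    return period_hours * (1 - uptime_percent / 100)


per_year = allowed_downtime_hours(99.9)                          # about 8.76 hours
per_month = allowed_downtime_hours(99.9, period_hours=30 * 24)   # about 0.72 hours
```

At 99.9%, that works out to roughly 8.76 hours per year, or about 43 minutes in a 30-day month.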
Recovery Time
In the rare event of a component failure, our automatic recovery systems typically restore service within 5 minutes. Database failovers happen in seconds. Application restarts take under a minute.
Note: These are operational targets, not contractual SLAs. Actual availability depends on factors including internet connectivity, client configuration, and external service dependencies. We continuously work to exceed these targets.
Built on Proven Technology
Fly.io
Global application platform with edge computing, automatic scaling, and built-in load balancing across 20+ regions worldwide.
CockroachDB
Distributed SQL database designed for cloud applications. Provides PostgreSQL compatibility with automatic sharding and replication.
Tigris / AWS S3
Enterprise object storage with global replication, 99.999999999% durability, and automatic data redundancy across availability zones.
Questions about our architecture or reliability? Contact our team
Last updated: November 2025