Lessons in Building a Digital Asset Exchange
A digital asset exchange has two main modules: (1) the core order-matching module and (2) the on-chain transaction-processing module.
Order matching
We applied the “Reactor pattern”: the exchange (order-matching engine) is built similarly to how Redis works, except that the commands we support represent order operations.
- Async event-driven I/O multiplexing with `epoll`
- Our server reacts to new connections and to requests on existing connections.
- Any I/O is prohibited in the actual request handler, so each RPC call finishes within a millisecond (a minimal reactor sketch follows).
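A minimal sketch of that reactor loop, written in Python for brevity (the real engine is C++). The stdlib `selectors.DefaultSelector` uses epoll on Linux; the port and `handle_command` are illustrative, not our actual API:

```python
# Single-threaded reactor: multiplex client sockets with epoll (via
# selectors) and dispatch each request to a pure in-memory handler.
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux

def handle_command(line: bytes) -> bytes:
    # Pure CPU work only: parse the order command, mutate in-memory
    # state, return the response. No disk or network I/O here.
    return b"OK " + line

def accept(server: socket.socket) -> None:
    conn, _addr = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, serve)

def serve(conn: socket.socket) -> None:
    data = conn.recv(4096)
    if not data:
        sel.unregister(conn)
        conn.close()
        return
    conn.send(handle_command(data))  # sketch: assumes reply fits the send buffer

server = socket.socket()
server.bind(("0.0.0.0", 9000))  # illustrative port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _mask in sel.select():
        key.data(key.fileobj)  # dispatch to accept() or serve()
```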
- Correctness
- Check against integer overflow
- Order list: indexed by integer price and queried via a C++ `map` (a sorted BST)
- Each element is a FIFO queue of orders at the same price
- Cache a summary of the orders in each queue at the value node for faster analytical computation (sketch below)
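A sketch of this structure, again in Python for brevity; the names (`PriceLevel`, `OrderBook`) and the sorted price list standing in for the C++ `map` are assumptions:

```python
# Order book: sorted integer prices, a FIFO queue per price level, and a
# cached per-level summary so analytical queries don't walk the queue.
import bisect
from collections import deque
from dataclasses import dataclass, field

@dataclass
class PriceLevel:
    orders: deque = field(default_factory=deque)  # FIFO within one price
    total_qty: int = 0                            # cached summary

class OrderBook:
    def __init__(self) -> None:
        self.prices: list[int] = []               # sorted, like std::map keys
        self.levels: dict[int, PriceLevel] = {}

    def add(self, price: int, qty: int, order_id: int) -> None:
        if price not in self.levels:
            bisect.insort(self.prices, price)     # keep prices sorted
            self.levels[price] = PriceLevel()
        level = self.levels[price]
        level.orders.append((order_id, qty))
        level.total_qty += qty                    # keep summary in sync

    def depth_at(self, price: int) -> int:
        # O(1) via the cached summary instead of summing the queue.
        level = self.levels.get(price)
        return level.total_qty if level else 0

    def best_price(self) -> int | None:
        return self.prices[0] if self.prices else None
```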
- End-to-end testing: build a simpler but much slower “reference implementation” and randomly test that the optimized engine matches it
- Also, randomly inject lots of extreme and illegal values, even into internal unit-level APIs (differential-test sketch below)
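A sketch of such a differential test, assuming both engines expose hypothetical `apply()` and `snapshot()` methods:

```python
# Differential testing: feed the same random stream of order operations
# (including extreme/illegal values) to the fast engine and to the
# slow-but-obviously-correct reference, then compare observable state.
import random

def random_ops(n: int, seed: int = 0):
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(n):
        price = rng.choice([1, 10**18, -5, rng.randint(1, 1000)])
        qty = rng.choice([0, -1, rng.randint(1, 100)])
        yield ("add", price, qty)

def test_fast_matches_reference(fast, reference):
    for op in random_ops(100_000):
        fast.apply(op)
        reference.apply(op)
    assert fast.snapshot() == reference.snapshot()
```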
- End-to-end benchmarking: throughput was very high; I think at least million-level average QPS, with microsecond-level latency for the pure-CPU computation
- State: pure memory (which can take a lot of space, definitely more than a typical instance; we actually set up a big-memory instance with at least 100 GB). Before sending back a response, make sure that the request has been appended to the reqlog (request log) on disk.
- Replay during restart, which takes a few minutes to load a reqlog of multiple GBs. The entire reqlog is asynchronously shipped to cold backup daily. Optimization: snapshots (sketch below).
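A sketch of the append-before-reply discipline and restart replay; the file layout and `engine.apply` are assumptions:

```python
# Write-ahead reqlog: append each request and fsync *before* replying,
# so a restart can rebuild in-memory state by replaying the log.
import json
import os

LOG_PATH = "reqlog.jsonl"  # illustrative; one JSON request per line

def append_request(log, request: dict) -> None:
    log.write(json.dumps(request).encode() + b"\n")
    log.flush()
    os.fsync(log.fileno())  # durable before the response goes out

def replay(engine) -> None:
    # Handlers must be deterministic for replay to reconstruct the exact
    # pre-crash state; a snapshot bounds how much log must be replayed.
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH, "rb") as f:
            for line in f:
                engine.apply(json.loads(line))
```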
- Advantages: speed, simplicity, consistency, performance, easy rollback, and 100% control of its behavior
- Disadvantages: single point of failure (SPoF); resource hungry; lack of analytics (candlestick queries were painful and later cached) and other tooling compared to an existing database (in-memory or not)
On-chain transaction processing
- Why Python + MySQL
- The Python bindings to the Bitcoin/ETH APIs were pretty mature.
- The latency requirement is loose (minutes to hours), bounded by the Bitcoin network anyway
- Again, we found it easier to connect to MySQL from Python, given the expertise of our team
- Again, a LOT of I/O happens during the processing, so a scripting language like Python makes it much easier to debug what is going on
- Gotchas
- Needs careful state-machine-style modeling of each transaction to ensure consistency among MySQL, the blockchain, and the order state. For example, a deposit must go from pending, to validated (with a unique blockchain transaction id), to fulfilled (with an engine receipt id). It can scan the same blockchain transaction twice or send the fulfillment request twice, but it doesn’t matter: the transitions are replay-safe (sketch below).
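A sketch of such a replay-safe state machine; the states come from the example above, but the class and field names are illustrative, not our actual schema:

```python
# Deposit state machine: transitions are guarded by the current state,
# so re-scanning the same blockchain tx or re-sending a fulfillment
# request is a harmless no-op (idempotent).
PENDING, VALIDATED, FULFILLED = "pending", "validated", "fulfilled"

class Deposit:
    def __init__(self, deposit_id: str) -> None:
        self.deposit_id = deposit_id
        self.state = PENDING
        self.chain_txid = None   # set once, on validation
        self.receipt_id = None   # set once, on fulfillment

    def validate(self, chain_txid: str) -> None:
        # Idempotent: seeing the same on-chain tx again changes nothing.
        if self.state == PENDING:
            self.chain_txid = chain_txid
            self.state = VALIDATED

    def fulfill(self, receipt_id: str) -> None:
        # Idempotent: a duplicate fulfillment request is ignored.
        if self.state == VALIDATED:
            self.receipt_id = receipt_id
            self.state = FULFILLED
```

Guarding each transition on the current state, together with the unique blockchain transaction id recorded in MySQL, is what makes the double scan and double send harmless.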
- Testing: instantiate a fresh private-net blockchain each time, with some automatic randomness in latency, etc.
MySQL
- InnoDB: mostly transactional data that grows like crazy, except for a table that logs all web API calls for security auditing (rarely read and never updated)
- Indexes: we don’t have much analytics going on; we built some simple indexes to speed up searches of recent transactions (ordered by date and indexed by user), but nothing critical (sketch below)
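For illustration, a hypothetical version of that index and query; the table/column names and the choice of the PyMySQL driver are assumptions:

```python
# Composite index: filter by user, then scan newest-first by date.
import pymysql

conn = pymysql.connect(host="db", user="app", password="...", database="exchange")
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX idx_tx_user_date ON transactions (user_id, created_at)"
    )
    # Recent transactions for one user, served from the index.
    cur.execute(
        "SELECT * FROM transactions "
        "WHERE user_id = %s ORDER BY created_at DESC LIMIT 50",
        (42,),
    )
    recent = cur.fetchall()
```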
- Optimization: to mitigate the SPoF problem, we launched a backup instance that takes over traffic if the primary dies for some reason. Not distributed (we don’t have a lot of data, and even if we did, it would not be difficult to shard). At some point its CPU was saturated with reads (it turned out our internal transaction processor queried too much without rate limiting in place).
DevOps
- Environment: testing + staging + production
- Necessary with a larger team that commits code
- Less strict code review
- A larger system that requires more integration of implicitly interdependent modules
- Fixing staging takes a lot of time but also prevents a lot of trouble
- Most developers also have their own set of isolated testing environments.
- Pressure testing: distributed Locust, starting at hundred-ish QPS. Blocked on the number of web servers (`nginx` + `php-fpm`) → LB + replication (tens-of-thousands-ish QPS)
- Not really for users; mostly for automated requests (Locust sketch below)
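A minimal Locust sketch of the kind of load script we ran; the endpoints and task weights are illustrative. Locust runs distributed with one `locust --master` process and many `locust --worker` processes:

```python
# Simulated API client: Locust spawns many of these concurrently and
# reports throughput/latency percentiles.
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(0.1, 0.5)  # think time between requests

    @task(3)  # weighted 3:1 toward reads
    def list_orders(self):
        self.client.get("/api/orders")

    @task(1)
    def place_order(self):
        self.client.post("/api/orders", json={"price": 100, "qty": 1})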
- Automation
- Automatic deployment with easy triggering
- Ansible-based rolling deployment
- Centralized logging
- Goal: avoid the need for developers to log in to prod machines at all (and even then, avoid root)
- GCP
- From manual creation to infrastructure (almost) as code
- Important for instance replication
- Capacity planning: monitor when things run out!
- Front-end + LB: automatic DDoS prevention, simplified deployment (no need for HTTPS on the intranet)
- What matters → many proxies in between may rewrite certain routing information, which must be kept consistent
- SRE
- Playbook: “Protecting the Crime Scene” → restore service (have safe ways to roll back) → reproduce → debug → add test → release → summarize and share
- Monitoring and alerting
- Prometheus + Grafana (sketch below)
- Sometimes a lot of false positives due to network issues
- Try to avoid paging SREs in the middle of the night…
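For illustration, a minimal Python service exposing metrics for Prometheus to scrape, using the `prometheus_client` library; the metric names and port are assumptions:

```python
# Expose a request counter and latency histogram on /metrics.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("exchange_requests_total", "Requests handled", ["op"])
LATENCY = Histogram("exchange_request_seconds", "Request latency")

def handle(op: str) -> None:
    REQUESTS.labels(op=op).inc()
    with LATENCY.time():          # records elapsed time into the histogram
        time.sleep(0.001)         # stand-in for real work

start_http_server(8001)           # Prometheus scrapes http://host:8001/metrics
while True:
    handle("place_order")
```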
- Security
- Hardware isolation, hot/cold wallet, human operator, disk encryption
- Run important services on-premise (some later moved to the cloud…)
- Audit: log all operations
- Avoid MITM by using an encrypted (TLS) connection to MySQL when not within a secure intranet environment (sketch below)
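A sketch of such an encrypted connection with PyMySQL; the host and certificate path are illustrative:

```python
# TLS-encrypted MySQL connection: verifying the server certificate
# against a pinned CA is what actually defeats MITM.
import pymysql

conn = pymysql.connect(
    host="db.internal.example.com",
    user="app",
    password="...",
    database="exchange",
    ssl={"ca": "/etc/ssl/mysql-ca.pem"},  # verify server against this CA
)
```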