About Tetsuo
Tetsuo SOL MK-VIII is a digital quant that uses machine learning to analyze the stock market, currently processing about 3,500 to 5,500 stock symbols per day.
More specifically, this is a time-series forecasting system that performs whole-market analysis across the NYSE and NASDAQ exchanges. It pulls the active common-stock universe each day (around 3,500 to 5,500 symbols), trains a per-symbol directional model on minute-grade history, ranks the high-confidence calls, and executes a daily buy at 11:00 ET against the resulting list of companies whose stock it thinks will go up.
MK-VII, this version's predecessor, was the generation that transitioned the project to a true distributed system; that move earned it the SOL label, which carries forward. MK-VIII is the transition from MK-VII's pure-batch architecture to an API-driven web platform: every component is now a long-running service exposing an HTTP API, coordinated by ORK and observable through this dashboard.
Generations I through VIII represent about two years of solitary work by its creator, Chris Punches.
How It Works
The daily pipeline is a one-way fan-in: each stage's output is the next stage's input. The components in trading-day order:
- UIP (Universe Inventory Processor) builds the master list of stock symbols for the current trading day from the configured exchange feeds, drops anything blacklisted, and publishes the inventory. It will keep a symbol blacklisted for a configured number of days to allow missing data to self-heal.
- TDM (Tetsuo Data Manager) reads UIP's universe, fetches the configured retention window of one-minute OHLCV bars per symbol from the configured market-data provider, and computes a normalised daily-wide derivative dataset that downstream forecasters train on.
- SAP (Sentiment Analysis Processor) runs a parallel ETL from the configured news sentiment feed into per-symbol-per-day sentiment scores. The scores, which a data broker AI-translates from financial journals, are later used to avoid symbols with recent poor investor sentiment.
- SIG (Sign Forecaster) trains a per-symbol classifier on TDM's derivative data and emits an UP / DOWN call per symbol, optimised for the window from 11:00 ET on T+1 (the next trading day) to 11:00 ET on T+2 (the trading day after that).
- MAG (Magnitude Forecaster) takes SIG's UP calls (gated by a minimum SIG accuracy threshold) and forecasts the magnitude of next-day movement for each.
- WIN (Winners) ranks MAG's magnitude calls, drops candidates whose SAP sentiment falls in a configured veto band, and emits the day's flat ordered buy list.
- BUY sells the current holdings at 11:00 ET each trading day, takes the resulting purchasing power, divides by two, and buys equal-share positions across the day's WIN list.
- DSH (Dashboard) is the operator-facing financial dashboard, surfacing current balance, balance history, current holdings, and per-holding percent change since purchase.
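The WIN step in the list above (rank the magnitude calls, then drop anything inside the sentiment veto band) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Tetsuo's actual code: the function name `select_winners`, the dictionary shapes, the placeholder symbols, and the reading of the veto band as an inclusive (low, high) score range are all hypothetical.

```python
from typing import Dict, List, Tuple


def select_winners(
    magnitudes: Dict[str, float],
    sentiment: Dict[str, float],
    veto_band: Tuple[float, float],
) -> List[str]:
    """Rank magnitude calls and drop sentiment-vetoed symbols.

    Every name and data shape here is hypothetical.
    """
    low, high = veto_band
    kept = []
    for symbol, magnitude in magnitudes.items():
        score = sentiment.get(symbol)
        # Assumption: a symbol with no sentiment score is not vetoed.
        if score is not None and low <= score <= high:
            continue
        kept.append((symbol, magnitude))
    # Largest forecast move first; ties broken alphabetically for stability.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [symbol for symbol, _ in kept]


if __name__ == "__main__":
    calls = {"AAA": 0.021, "BBB": 0.034, "CCC": 0.015}
    scores = {"AAA": 0.4, "BBB": -0.7, "CCC": 0.1}
    # BBB has the largest call but sits inside the veto band.
    print(select_winners(calls, scores, veto_band=(-1.0, -0.5)))  # ['AAA', 'CCC']
```

The output is a flat ordered list, which matches the shape the text says BUY consumes.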
ORK: The Orchestrator
ORK is the centre of the system. It does not do data work itself; it exposes a proprietary job-scheduling and execution subsystem that performs coordination between the components of Tetsuo. It owns the job schedules, fires HTTP triggers against peer endpoints when those schedules come due, polls each peer's status to keep a live picture of the distributed system, records every job run, and serves this dashboard.
Other components run as persistent services that wait for trigger events from ORK, do the work it tells them to do, and report back. Cross-component dependencies (e.g. SIG needing TDM's derivative data, TDM needing UIP's inventory) are enforced peer-side: the dependent component queries ORK's recent-jobs endpoint to check its upstream's state before doing work.
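A peer-side dependency check of the kind described above might look roughly like the following. It is a sketch only: the job-record keys (`peer`, `state`, `ended_at`), the `"succeeded"` state value, and the function name are assumptions, since the schema of ORK's recent-jobs endpoint is not published here. The feed is passed in as plain data so the logic can be read without any HTTP client.

```python
from datetime import datetime
from typing import Iterable, Mapping


def upstream_ready(
    recent_jobs: Iterable[Mapping[str, object]],
    upstream: str,
    last_market_close: datetime,
) -> bool:
    """Return True if `upstream` finished successfully after the last close.

    The record keys and the "succeeded" state value are hypothetical.
    """
    for job in recent_jobs:
        if job.get("peer") != upstream or job.get("state") != "succeeded":
            continue
        ended_at = job.get("ended_at")
        if isinstance(ended_at, datetime) and ended_at > last_market_close:
            return True
    return False


if __name__ == "__main__":
    close = datetime(2024, 1, 2, 16, 0)
    feed = [
        {"peer": "UIP", "state": "succeeded", "ended_at": datetime(2024, 1, 2, 18, 5)},
        {"peer": "TDM", "state": "failed", "ended_at": datetime(2024, 1, 2, 19, 0)},
    ]
    print(upstream_ready(feed, "UIP", close))  # True: UIP ran after the close
    print(upstream_ready(feed, "TDM", close))  # False: TDM's only run failed
```

A dependent component would run a check like this before starting work and decline the trigger when it returns False.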
Distribution
Each component is its own service, runs on its own port, has its own database, owns its own filesystem, and is independently deployable. The components are coupled only by HTTP. There is no shared database, no shared module, no direct filesystem access across services. A component can run on a different host from any other.
The orchestrator ORK is the single integration hub: peers never know each other's URLs, only ORK's. When TDM needs UIP's inventory, it asks ORK for it; ORK proxies the call to UIP. The same pattern applies in every direction.
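The hub pattern can be illustrated with two small helpers: one for the peer, which only ever builds ORK URLs, and one for ORK, which maps a proxied request onto the real peer address. The `/peers/<name>/` route prefix, the registry dictionary, the function names, and the example hosts are all assumptions made for this sketch; the real routes are not documented here.

```python
from typing import Dict
from urllib.parse import quote


def via_ork(ork_base: str, peer: str, path: str) -> str:
    """Peer side: build the ORK URL that stands in for a call to `peer`.

    The "/peers/<name>/" prefix is a hypothetical route layout.
    """
    return f"{ork_base.rstrip('/')}/peers/{quote(peer)}/{path.lstrip('/')}"


def resolve(registry: Dict[str, str], peer: str, path: str) -> str:
    """ORK side: translate a proxied request into the peer's real URL."""
    base = registry[peer]  # an unknown peer raises KeyError on purpose
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


if __name__ == "__main__":
    # TDM knows only ORK's address, never UIP's.
    print(via_ork("http://ork.example:8000", "UIP", "inventory"))
    # ORK alone holds the peer registry.
    print(resolve({"UIP": "http://10.0.0.5:8101"}, "UIP", "inventory"))
```

Keeping the registry on ORK alone is what lets a peer move to a different host without any other peer's configuration changing.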
Tetsuo's Technology Use Highlights
Tetsuo draws on a wide range of systems-design and engineering disciplines, but what really sets it apart is compute efficiency: it processes 3,000 to 5,000 stock symbols per day on a single consumer-hardware server, which means it would handle substantially more on commercial-grade infrastructure. The architecture follows the Unix design philosophy (small, single-purpose services composed over a network), so it is both modular and distributed, yet still highly compute-efficient and scalable.
- Python 3: primary implementation language across every component.
- Flask + flask-smorest: HTTP service framework for each component; flask-smorest provides marshmallow-validated request/response schemas and auto-generates the OpenAPI specification rendered through Swagger UI on every component, so the API documentation is the code rather than a separately maintained artifact.
- SQLAlchemy + SQLite: per-component ORM-backed persistence; each service owns its own isolated database file. No shared schema across services.
- pandas + NumPy: core time-series and numerical processing throughout TDM, SIG, MAG, and WIN.
- pandas-market-calendars: NYSE trading-day awareness for every dated operation (training windows, T+1 forecast targets, blackout handling).
- XGBoost: gradient-boosted directional classifier in SIG (UP / DOWN per symbol, optimised for the 11:00 ET to 11:00 ET window).
- LightGBM + scikit-learn: magnitude regression in MAG and downstream ranking utilities in WIN.
- HTMX: operator dashboard interactivity without an SPA framework; server-rendered Jinja2 partials swapped in place.
- Custom job orchestration subsystem: proprietary in-house scheduler, trigger dispatcher, concurrency gates, and recent-jobs feed that coordinate the per-trading-day pipeline across the distributed peers; lives in ORK.
- API Gateway pattern: ORK fronts every peer component, exposing a single uniform per-peer URL surface to the dashboard and operators; peers never call each other directly, all cross-service traffic transits the gateway.
- Distributed system design: independently-deployable HTTP services, single integration hub (ORK), no shared database / module / filesystem; one-way fan-in daily pipeline.
- MVC architecture: every component is internally split into views (HTTP / template layer), controllers (domain logic), and models (persistence), keeping the request surface, business rules, and data layer cleanly separated.
- Workflow orchestration: ORK schedules and fires HTTP triggers against peer endpoints when due, polls each peer's status, records every run, and gates each stage on its declared upstream's completion since the last market close.
- Data pipeline orchestration: the daily one-way fan-in pipeline (UIP → TDM / SAP → SIG → MAG → WIN → BUY) is defined as scheduled jobs against the peers, with each stage's output forming the next stage's input.
- ETL design and orchestration: UIP extracts the active common-stock universe, TDM extracts and transforms one-minute OHLCV bars into a normalised daily derivative dataset, SAP loads vendor-translated sentiment scores per-symbol-per-day.
- ML pipelines: SIG trains a per-symbol XGBoost directional classifier on TDM's derivative dataset; MAG trains a LightGBM magnitude regressor on SIG's UP calls; WIN consumes the resulting magnitude rankings under SAP sentiment gating.
- MLOps: model lifecycle is fully operationalised through ORK: per-symbol training, accuracy gating, periodic passive-configuration refresh, scheduled re-forecasting, and downstream consumption are all observable scheduled jobs with run history and currency tracking.
- Algorithm design & time-series forecasting: per-symbol training pipelines, derivative-feature engineering, sentiment veto bands, magnitude-ranked selection, and a daily execution routine that converts forecasts into orders.
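The T+1 to T+2 forecast window that the list above ties to trading-day awareness can be sketched with the standard library alone. This is deliberately not how Tetsuo does it (the text says it relies on pandas-market-calendars): the version below only skips weekends plus a caller-supplied holiday set, ignores early closes and timezones, and uses hypothetical function names.

```python
from datetime import date, datetime, time, timedelta
from typing import FrozenSet, Tuple

ANCHOR = time(11, 0)  # 11:00 ET per the text; timezone handling is omitted


def next_trading_day(day: date, holidays: FrozenSet[date] = frozenset()) -> date:
    """Return the first weekday after `day` that is not in `holidays`."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


def target_window(
    today: date, holidays: FrozenSet[date] = frozenset()
) -> Tuple[datetime, datetime]:
    """Return (T+1 11:00, T+2 11:00) as naive datetimes in exchange time."""
    t1 = next_trading_day(today, holidays)
    t2 = next_trading_day(t1, holidays)
    return datetime.combine(t1, ANCHOR), datetime.combine(t2, ANCHOR)


if __name__ == "__main__":
    # 2024-03-08 is a Friday, so T+1 is Monday and T+2 is Tuesday.
    start, end = target_window(date(2024, 3, 8))
    print(start, end)  # 2024-03-11 11:00:00 2024-03-12 11:00:00
```

A real exchange calendar matters here because a missed holiday would shift both ends of the window by a day.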
Site Reliability Engineering Aspects
The product also demonstrates a number of Site Reliability Engineering practices:
- Observability is built in, not bolted on: every component writes through a single structured logger (mask, level, timestamp in the system's reference timezone), rotates to dated files, and exposes a per-component log-tail API that the dashboard consumes. ORK aggregates that into a live status grid plus per-peer log views so an operator can see what every service is doing right now from a single page, without SSH-ing anywhere.
- Health checking and liveness probing: ORK's status poller hits a lightweight health endpoint on each peer on a configurable cadence, records the last-successful-check timestamp, tracks unreachable-tolerance windows before flagging a peer as down, and surfaces the result on the dashboard's status grid. The same pattern services the job poller, which polls every in-flight job to keep ORK's view of distributed work up to date.
- Self-healing patterns: UIP keeps a symbol on its blacklist for a configured number of days and then auto-clears it so transient data-quality issues recover without operator intervention. TDM's update pipeline tries a narrow incremental fetch first and automatically escalates to a full pull on any failure, so missing-data conditions self-correct on the next scheduled fire.
- Clean shutdown hygiene per service: every component registers a process-exit shutdown hook that writes a closing log line on its way out, so the post-mortem timeline always shows whether a service exited cleanly or was killed mid-flight. Pairing that with a normal systemd (or equivalent) supervisor at deploy time gives the operator a clear "did this come down on purpose?" signal without any custom supervisor in the codebase.
- Fault isolation and small blast radius: every component owns its own database file, its own filesystem, its own port, its own process. There is no shared schema, no shared module, no direct filesystem access across services. A failure or restart of one component cannot corrupt or stall another; the worst case is a stale last-checked timestamp on the status grid until the next poll.
- Auditability: every job fire (manual or scheduled) is persisted with a stable identity, parameters, start and end timestamps, terminal state, and the peer it targeted. The recent-jobs feed is what downstream components consult to know whether their upstream has run since the last market close, so the audit log is also a load-bearing piece of the dependency model, not a write-only artifact.
- Idempotent, safe-to-re-run operations: daily kickoffs (TDM update, SAP daily, SIG configurate / forecast, MAG ditto, WIN selection, BUY execute) are written to be safely re-fireable. Re-running a stage against the same trading day overwrites the report for that date rather than appending, and the dependency gates (peer-side checks against ORK's recent-jobs feed) prevent the pipeline from running stages out of order.
- Configuration management: every component reads its deployment-time values from a per-component INI file (URLs, ports, schedules, IPA realm, sentiment vendor, retention windows, etc.) loaded once at startup. Runtime-tunable values (poll cadences, log-tail line counts, training thresholds) live in a single Configuration row per component, edited via the dashboard, with cadence changes hot-applied, requiring no service restart.
- Concurrency control under load: ORK's kickoff path dispatches through a per-job-name lookup with concurrency gates so two operators (or an operator plus a scheduled fire) can never accidentally double-trigger the same routine. The job poller's per-job state machine ensures terminal transitions are recorded exactly once even under retry.
- Compute and capacity awareness: the whole pipeline (universe build, minute-grade fetch for 3,000 to 5,000 symbols, per-symbol training across two model families, ranking, and execution) completes inside the daily trading window on a single consumer-hardware server. Concurrency uses bounded worker pools sized to the host rather than unbounded fan-out, and the daily fetch window is tunable so an operator can trade history depth for runtime when the universe grows.
- Operational tooling for the operator: the dashboard exposes per-component status, per-component log tail with level / mask filtering, downloadable per-date log files, per-component currency badges (last-successful-run timestamp), and a single page from which any scheduled job can be manually kicked off, paused, or rescheduled. The same surface an operator uses for day-to-day driving is what an on-call would use to triage an incident at 3am.
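The per-job-name concurrency gate mentioned in the list above can be approximated with a lock-protected set of running job names. This is a simplified in-process sketch, not ORK's implementation: the class name `JobGate` and the job names are hypothetical, and the work runs synchronously here, whereas ORK's real jobs complete asynchronously on a remote peer.

```python
import threading
from typing import Callable, Set


class JobGate:
    """Allow at most one in-flight run per job name (hypothetical helper)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._running: Set[str] = set()

    def try_fire(self, job_name: str, work: Callable[[], object]) -> bool:
        """Run `work` unless `job_name` is already running.

        Returns True if the work ran, False if the trigger was refused.
        """
        with self._guard:
            if job_name in self._running:
                return False  # a duplicate trigger is refused, not queued
            self._running.add(job_name)
        try:
            work()
            return True
        finally:
            # Always release the gate, even if the work raised.
            with self._guard:
                self._running.discard(job_name)


if __name__ == "__main__":
    gate = JobGate()
    attempts = []
    # A second trigger arriving while the first is still running is refused.
    gate.try_fire("tdm-update", lambda: attempts.append(gate.try_fire("tdm-update", lambda: None)))
    print(attempts)  # [False]
```

The lock is held only while the running set is checked and updated, never during the work itself, so a slow job does not block triggers for unrelated job names.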
The Math
While I'm not a math person, or trained as one, or capable of being one, and don't enjoy doing math, the feature-engineering layer that sits between the normalised daily-wide bars and the model fit is where most of the actual contribution lives. The classifiers and regressors at the bottom of the stack are off-the-shelf gradient-boosted trees; everything above them is bespoke: there are roughly fifty independent per-symbol feature transforms with an explicit dependency graph between them.
These features are not applied universally to every symbol. The configuration process that runs in SIG and MAG performs a greedy feature selection: it tests candidate features one after another and keeps those that improve accuracy under walk-forward testing. For the regression, the walk-forward test reports its confidence metric as SMAPE, expressed as a normalized percentage.
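As a rough illustration of that selection loop, the sketch below pairs a SMAPE function with a greedy forward pass over candidate features. Several things are assumptions: the function names, the `score` callable standing in for a full walk-forward evaluation (higher is better), and the SMAPE variant used, which divides by the mean of the absolute values and so ranges from 0 to 200. The document does not say which normalisation Tetsuo uses.

```python
from typing import Callable, List, Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE as a percentage in [0, 200]; lower is better."""
    if not actual:
        raise ValueError("smape needs at least one observation")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Convention: a point where both values are zero contributes no error.
        total += 0.0 if denom == 0.0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


def greedy_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Add features one at a time, keeping each only if the score improves.

    `score` is a stand-in for a walk-forward evaluation of a feature set.
    """
    chosen: List[str] = []
    best = score(chosen)
    for feature in candidates:
        trial = score(chosen + [feature])
        if trial > best:
            chosen.append(feature)
            best = trial
    return chosen


if __name__ == "__main__":
    # Toy scorer: each feature adds or subtracts a fixed amount of accuracy.
    gains = {"rsi": 0.03, "vwap": -0.01, "gap": 0.02}
    toy_score = lambda features: 0.5 + sum(gains[f] for f in features)
    print(greedy_select(["rsi", "vwap", "gap"], toy_score))  # ['rsi', 'gap']
```

A greedy pass like this evaluates each candidate once, which keeps the cost linear in the number of features. That matters when the search is repeated per symbol.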
- Intraday-anchored time bucketing. All math is anchored to the 11:00 ET intraday close because that is the prediction-execution anchor (T+1 11:00 to T+2 11:00 is the model's target window). Minute bars are collapsed into a single daily-wide row keyed by trading date and exposing price and volume at every half-hour (9:30, 10:00, 10:30, …, 3:30 PM) plus full per-bucket OHLC. Every downstream feature is computed against that fixed time grid rather than against raw minute-grade ticks.
- Returns at the prediction anchor. Daily-return and multi-lag carry features are computed on the 11:00 close at lags 1, 3, and 5 so the training distribution aligns with the 11 to 11 target. Log-returns are used wherever distributional properties matter (intraday realised volatility) and arithmetic percent change wherever interpretation matters (gap features, momentum). Overnight-gap returns are computed against the prior-day 11:00 anchor rather than the prior-day close because that is the carry the model actually has to forecast.
- EMA, SMA, and MACD-style spreads. Short-span exponential moving averages on the 11:00 close (spans 3 and 5) and on daily returns, with an explicit warm-up guard to suppress half-converged values, are converted into deviation ratios price ÷ EMA − 1 and sign-extracted into discrete cross flags sign(EMA_3 − EMA_5). The MACD-style spread is computed on price ÷ VWAP and on the z-scored VWAP deviation rather than on raw price so the signal reads as regime rather than absolute level.
- Wilder RSI with tail clipping. RSI is computed on a short rolling window of length 3 of gain and loss with a small ε guard on the denominator, clipped to [5, 95] to suppress saturated tails, and emitted as both the continuous lagged value and an RSI > 50 sign flag. A second RSI runs on the overnight-gap return series for the same reason.
- VWAP. A full-day VWAP is built from per-bucket Σ (price × volume) ÷ Σ volume. The output is reused as a denominator: a price ÷ VWAP ratio, a 5-period rolling z-score of the deviation, an intraday-drift slope (VWAP_15:00 − VWAP_11:00) ÷ VWAP_11:00, and an expanding-quantile clip with a dynamic warm-up window for cross-symbol robustness.
- Order-flow imbalance. sign(price change) × volume is summed across buckets (separately for the full day and the morning-only window) and normalised by total window volume, then expanding-quantile clipped. The anchor for the first bucket's signed price change is the prior-day 11:00 close, again so the carry aligns with the prediction window.
- Realised volatility. Per-row morning realised volatility is the population standard deviation of log-returns across the morning buckets (9:30 to 10:00 to 10:30 to 11:00). It is normalised by its own 20-day rolling median (a MAD-style scale-free transform), capped to a fixed band, and emitted both as the current value and a one-day regime-change ratio vol_t ÷ vol_{t-1}.
- Cumulative delta. A bid-ask-volume proxy: each bucket gets a +1 / −1 sign from sign(price_t − price_{t-1}) (with the prior-day 11:00 as the seed), multiplied by the bucket's volume. Per-bucket normalised deltas, cumulative delta through 11:00, first-difference delta acceleration, and shape skew acceleration ÷ |cumulative| are all emitted as lagged features.
- Range position, rejection, and expansion. Where the 11:00 close sits inside the morning's true OHLC range is computed as (close − low) ÷ (high − low), clamped to [0, 1], with quartile one-hot dummies emitted so the tree models can split on regime explicitly. High- and low-rejection scores measure how far the close pulled back from the morning extreme, with binary rejection > 0.66 flags and a signed score low_rejection − high_rejection. Range expansion is the ratio today_range_size ÷ yesterday_range_size.
- Volume profile, participation, and surge. Morning share of full-day volume, 11:00-bucket concentration, day-over-day morning-share change, per-bucket volume share, and a volume-acceleration ratio volume_{11:00} ÷ volume_{9:30} are all included. A separate volume z-score (10-day rolling) multiplied by sign(price momentum) gives a directional volume-surge feature that is also clipped to its upper expanding quantile.
- Tick distribution. The count of positive bucket returns across the morning is converted into an uptick ratio, a magnitude-weighted variant Σ positive |r| ÷ Σ |r|, a day-over-day delta, a zero-centred form uptick − 0.5, and a momentum form uptick_t − uptick_{t-1}.
- Momentum acceleration and exhaustion. Acceleration is approximated as a discrete second-derivative of returns Δr_1 − 0.5 · Δr_2, clipped to a sane band. Consistency is the rolling count of same-sign morning-momentum days. Exhaustion is the product (consecutive_up + consecutive_down) × |price_extension| against SMA-5 and against VWAP, z-scored on a 20-day rolling window and clipped: the standard "stretched and tired" reversal heuristic written out explicitly.
- Regime classification via closed-form OLS. A 5-day linear-regression slope of the 11:00 close is computed by hand using the closed-form coefficient Σ(x − x̄)(y − ȳ) ÷ Σ(x − x̄)² rather than calling out to a regression library; this is both faster on a per-symbol loop and free of additional dependency surface. The slope is paired with a true-range volatility ratio today_range ÷ 10-day mean range and discretised into four regime states: range-bound, expanding-uptrend, expanding-downtrend, and mixed / chop.
- True range and ATR. True range is Wilder's max(high − low, |high − prev_close|, |low − prev_close|). ATR is the EWM of true range with the explicit Wilder alpha α = 1 ÷ n rather than the default span / com mapping (a span of n would instead give α = 2 ÷ (n + 1)). This is important so the smoothing matches the standard published definition.
- Gap mechanics. Overnight gap is computed as signed, absolute, and amplitude variants against the prior-day 3:30 PM close. Gap-fill state is computed by walking each bucket's high and low across the day and recording the first bucket whose extreme touches the prior close in the gap-fill direction, then binning that index into a fill-speed categorical (immediate / morning / afternoon / late / unfilled).
- Multi-window VWAP and afternoon-vs-morning volatility mix. Three VWAPs are built per row (through 11:00, through 3:00 PM, and full-day), with close ÷ VWAP_fullday − 1 and (VWAP_15:00 − VWAP_11:00) ÷ VWAP_11:00 as separate features. An afternoon-vs-morning volatility ratio captures whether the day's variance was front- or back-loaded.
- Lagged shelf of every numeric column. A single transform appends a t-1 copy of every numeric column the rest of the pipeline produces, so the model sees a stable lag-1 echo of the whole feature set without having to special-case shifts for individual indicators. Combined with a per-feature dependency graph that the configurator resolves before dispatch, this makes anti-leakage hygiene the default rather than a per-feature obligation.
- Per-symbol feature-set search. The configurator runs an incremental feature-selection loop per symbol against a held-out validation window, scoring each candidate set, and persists the winning subset per symbol. The forecast path then loads each symbol's optimal subset of the ~50-feature universe rather than running the same global pipeline against every symbol.
- Anti-leakage at the shape level. The feature pipeline accepts a history-trim parameter that drops the most recent N rows before any feature math runs, so the test-quality path can simulate "going back in time" by training only on data that would have been available on the date in question. Combined with the lag-1 shelf and the dependency resolver, this is the structural guarantee that test-quality numbers reflect what the model could actually have produced on that date.
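Several of the transforms in the list above reduce to short closed-form expressions. The sketch below writes four of them out in plain Python: VWAP, range position, the closed-form OLS slope, and the two uptick ratios. It is an illustration under assumptions rather than the production code: the function names are hypothetical, the real pipeline works on pandas columns rather than Python lists, and the zero-range and zero-movement fallbacks are choices made here.

```python
from typing import Sequence, Tuple


def vwap(prices: Sequence[float], volumes: Sequence[float]) -> float:
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    total_volume = sum(volumes)
    if total_volume == 0:
        raise ValueError("VWAP is undefined when total volume is zero")
    return sum(p * v for p, v in zip(prices, volumes)) / total_volume


def range_position(close: float, low: float, high: float) -> float:
    """(close - low) / (high - low), clamped to [0, 1]."""
    if high == low:
        return 0.5  # assumption: treat a zero-width range as mid-range
    return min(1.0, max(0.0, (close - low) / (high - low)))


def ols_slope(ys: Sequence[float]) -> float:
    """Closed-form OLS slope of ys against x = 0, 1, ..., n - 1."""
    n = len(ys)
    if n < 2:
        raise ValueError("a slope needs at least two points")
    x_mean = (n - 1) / 2.0
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def uptick_ratios(returns: Sequence[float]) -> Tuple[float, float]:
    """Return the (count-based, magnitude-weighted) uptick ratios."""
    if not returns:
        raise ValueError("uptick ratios need at least one return")
    count_ratio = sum(1 for r in returns if r > 0) / len(returns)
    total_abs = sum(abs(r) for r in returns)
    # Assumption: a window with no movement at all is scored as neutral.
    weighted = 0.5 if total_abs == 0 else sum(r for r in returns if r > 0) / total_abs
    return count_ratio, weighted


if __name__ == "__main__":
    print(vwap([10.0, 20.0], [1.0, 3.0]))        # 17.5
    print(range_position(11.0, 10.0, 12.0))      # 0.5
    print(ols_slope([1.0, 3.0, 5.0, 7.0, 9.0]))  # 2.0
```

Because the x values are always 0 to n − 1, the denominator of the slope depends only on the window length. A per-symbol loop could compute it once and reuse it.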
The Source Code
As it is unclear to me what the impact on the stock market would be if a more competent derivative of this were released open source, I have decided to keep it closed source and proprietary. I am receptive to reviewing the source under an NDA.
I Need A Job
I'm currently seeking 1099 or W2 employment. If someone like me would be useful to you, please email me at chris.punches@silogroup.org