# general

Lennard Berger

12/20/2022, 12:19 PM
Does anyone have a take on stream processing of documents? Our current architecture comprises Kafka + “pipelines”. We stream from one topic to the next, running our ML models in between. While this is fairly reliable, the “from-scratch” approach is quite expensive. It’s on our bucket list next year to reevaluate this. Maybe it’s possible to use an iterative process? Then again, our current setup has very few limitations in terms of reliability and scalability.
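(For context, a minimal sketch of the topic-to-topic pattern described above. This is not the actual system: in-memory lists stand in for Kafka topics, and the “models” are invented placeholders; in the real setup each stage is a microservice running an ML model between topics.)

```python
from collections import defaultdict

# In-memory stand-in for Kafka topics: each "topic" is an append-only list,
# and a message's offset is simply its index in that list.
topics = defaultdict(list)

def produce(topic, message):
    """Append a message to a topic and return its offset."""
    topics[topic].append(message)
    return len(topics[topic]) - 1

def run_stage(in_topic, out_topic, model):
    """Consume every message from in_topic, apply a model, produce to out_topic."""
    for message in topics[in_topic]:
        produce(out_topic, model(message))

# Hypothetical stages: each one consumes the previous stage's full output.
produce("raw-text", "Alice visited Berlin.")
run_stage("raw-text", "tokens", lambda text: text.rstrip(".").split())
run_stage("tokens", "entities", lambda toks: [t for t in toks if t[0].isupper()])

print(topics["entities"][0])  # → ['Alice', 'Berlin']
```

Note the cost Lennard mentions: every stage re-emits a full new message per document, so adding or re-running a stage means reprocessing the whole stream downstream of it.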

Stefan Avram

12/20/2022, 6:33 PM
Can you explain a bit more of the architecture?

Lennard Berger

12/20/2022, 6:35 PM
Sure, we have texts that we run from “raw-text” through various NLP steps to get numeric output. So every stage helps resolve another stage (e.g. first spaCy, then entity recognition / linking, sentiment, etc.). At the moment every microservice just consumes the previous stage’s output. In theory this could be done in an append fashion as well, adding keys to the document.
A nice thing about the current setup is that, using topic + offset, you get very good debuggability.
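(A minimal sketch of the append-fashion alternative mentioned above, assuming dict-shaped documents; the stage names and outputs are invented for illustration. Each stage adds a new key to the same document rather than emitting a fresh message, so every later stage, and any debugging, can see all prior results at once.)

```python
# Hypothetical stages that only append keys, never rewrite earlier ones.
def tokenize(doc):
    doc["tokens"] = doc["raw_text"].rstrip(".").split()

def tag_entities(doc):
    doc["entities"] = [t for t in doc["tokens"] if t[0].isupper()]

def score_sentiment(doc):
    # Toy placeholder standing in for a real sentiment model.
    doc["sentiment"] = "neutral"

STAGES = [tokenize, tag_entities, score_sentiment]

def run_pipeline(doc):
    """Run every stage in order, accumulating keys on one document."""
    for stage in STAGES:
        stage(doc)
    return doc

doc = run_pipeline({"raw_text": "Alice visited Berlin."})
print(sorted(doc))  # → ['entities', 'raw_text', 'sentiment', 'tokens']
```

The trade-off versus one-topic-per-stage: the accumulated document keeps full lineage in one place, but you lose the per-stage topic + offset handle that makes replaying and debugging a single stage so convenient.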