Re: Talksearch.io - Advanced Bitcointalk Search Engine
by
NotATether
on 23/05/2025, 11:33:51 UTC
⭐ Merited by babo (1)
Offtopic, I hope you appreciate getting more questions instead of more answers. I do believe asking the right questions is more helpful to start your research. I can't vouch for the quality of AI answers, just that it looked interesting. I'm not a programmer, but it does offer to write your code as well.

I appreciate it greatly.

I have done some looking around over the past few days, and I found a machine learning model called BERT that was made by Google in 2018 for search engines.

Can you believe that? An AI model from before AI models were a thing. :)

I do have sort of a background in machine learning models, so I can summarize it briefly here: instead of vectorizing individual words, and thus relying on keyword matching for search, it vectorizes entire phrases, i.e. words that appear next to each other in a sentence. This makes natural language search possible (example: "block size wars" returning debates about segwit and bcash instead of only posts with "block size" in them).
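The idea above can be sketched with a toy example. The vectors below are made up for illustration (real BERT-style embeddings have hundreds of dimensions); the point is just that ranking by cosine similarity between embeddings lets a semantically related post win even with zero keyword overlap:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (invented values, not real model output).
query      = [0.9, 0.1, 0.2]   # "block size wars"
segwit_doc = [0.8, 0.2, 0.3]   # a segwit debate with no literal keyword overlap
other_doc  = [0.1, 0.9, 0.1]   # an unrelated post

# The semantically related post ranks higher than the unrelated one.
assert cosine_similarity(query, segwit_doc) > cosine_similarity(query, other_doc)
```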

There are many improved versions of BERT nowadays, large ones and small ones. However, the models require dedicated hardware to run.

The good news is, Elasticsearch makes it remarkably easy to deploy a model: you literally just press the "Run" button next to it, and search queries will then use the model automatically. The bad news is, the hardware doesn't come cheap. There is one ML node in my cluster, which I get at no additional cost, but it only has 1GB of RAM and can't hold any model, so it's pretty useless.
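Once a model is deployed on an ML node, queries against it look something like the sketch below, which builds the body of an Elasticsearch 8.x kNN search using `query_vector_builder` so the cluster embeds the query text itself. The index name, field name, and model ID here are hypothetical placeholders, not Talksearch's actual configuration:

```python
import json

# Hypothetical names; a real deployment supplies its own values.
INDEX = "talksearch-posts"
MODEL_ID = "sentence-transformers__all-minilm-l6-v2"

def knn_semantic_query(text, k=10):
    """Build an Elasticsearch kNN search body that embeds the query
    text with a deployed text-embedding model (8.x style)."""
    return {
        "knn": {
            "field": "content_embedding",   # assumed dense_vector field
            "k": k,
            "num_candidates": 10 * k,
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": MODEL_ID,
                    "model_text": text,
                }
            },
        },
        "_source": ["topic_id", "content"],
    }

body = knn_semantic_query("block size wars")
print(json.dumps(body, indent=2))
```

This body would be POSTed to `/{index}/_search`; no keywords from the query need to appear in the matched posts.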

Upgrading to the next hardware tier, which has 2GB, would bump the total monthly bill to around $300, and I am already hounded enough by Google with biweekly invoices. Therefore I want to wait until all the new post content is uploaded before I delete the old, incomplete post content, which will let me slash the storage size by about half. After that, even with a larger ML node added, Talksearch's running costs should be somewhat lower than they are right now. It will be a wise investment, though: GPUs on dedicated servers are not plentiful, and are much more expensive than this.

Unfortunately, despite thousands and thousands of post chunks being uploaded every day, I am only about 10% of the way there. I can't experiment with BERT search until it's done. I also have to hope my server doesn't run out of memory mid-upload; my disk is the primary bottleneck. But if I moved to an SSD or something faster, the Elasticsearch server would get overwhelmed with requests and run out of memory itself.
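The usual way to keep a bulk upload from overwhelming the receiving cluster is to back off when it signals overload. This is a generic sketch, not Talksearch's actual uploader: `send_batch` stands in for any function that ships one batch and returns an HTTP-style status code (Elasticsearch answers 429 when it is saturated):

```python
import time
import random

def upload_with_backoff(send_batch, batches, max_retries=5, base_delay=1.0):
    """Send batches sequentially, backing off exponentially with jitter
    whenever the server signals overload (status 429)."""
    for batch in batches:
        delay = base_delay
        for _ in range(max_retries):
            if send_batch(batch) != 429:
                break                 # accepted (or failed non-retryably)
            # Jittered exponential backoff avoids hammering a busy server.
            time.sleep(delay * (1 + random.random()))
            delay *= 2
        else:
            raise RuntimeError("server still overloaded after retries")
```

Pacing the client this way trades upload speed for stability, which matches the disk-vs-memory tradeoff described above.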

I imagine this whole process becomes much faster with even larger hardware, but that is not an amount I'm willing to spend, especially on a beta product.