$ ele - The Imaginary Open Source Web Search - A Discussion Proposal
Dear HN, this page is for you. Please treat it as a discussion proposal.
Imagine if there were an open source search engine that you could just run anywhere, anytime, however you want. And now imagine that it’s spam-free, ad-free, and easy to share with your friends. And it would have a working quote ("") operator!!!! That would be f-ing great, wouldn’t it?
Of course, I’m not deluded. There is no way you can run a full web-scale search engine for free. Crawling the billions of pages on the internet is a lot of work and eats up pricey bandwidth and CPU time. But:
- we don’t want the entire internet and
- after the index file is created, querying it is cheap and easy.
I believe the financials might work, so read on:
1. Kick out the SEO spam
I don’t know anyone who is happy to see those SEO-spam copies of StackOverflow in their search results. And when I need a recipe, I usually go to the same page each time. When I need an opinion, I search on Reddit.
I believe that search results should be opt-in by the user. Similar to how AdBlock has filter lists maintained by volunteers, we could have “useful website lists”. And then our search engine only needs to search domains on those lists, meaning it becomes a lot cheaper to operate simply because we search a lot fewer domains.
The key for making this useful is to make it easy to publish your own list and to subscribe to someone else’s list. As an AI researcher, I can subscribe to an AI search index which will include arXiv, but my friends probably don’t want arXiv in their results.
In short: a domain only ever shows up in your search results if you either a) added that domain to your search index yourself, or b) subscribed to someone else’s search index which includes the domain. That should get rid of all the SEO spam.
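To make that concrete: a shareable “useful website list” might look something like the sketch below. The format and every field name are made up for illustration, including the ability to pull in someone else’s list:

```json
{
  "name": "AI research websites",
  "maintainer": "some-volunteer@example.org",
  "includes": ["https://example.org/lists/popular-coding-sites.json"],
  "domains": ["arxiv.org", "stackoverflow.com", "paperswithcode.com"]
}
```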
2. Pay only for data creation
Most of the costs of operating a search engine occur while crawling the web to generate the search indices. That also includes NLP processing such as word vectors, stemming, and vocabulary coding. Once the index file has been produced, querying it is cheap. Of course, it would be great to always have the latest search results, but in practice, for searching Stack Overflow, it’s probably good enough if your search index gets updated once per week.
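To make the split concrete, here is a toy sketch of the creation-time half: tokenize documents, assign each word a numeric ID in a growing vocabulary, and record which documents contain it. Real stemming and word vectors are elided, and every name here is made up for illustration:

```cpp
// Toy index builder: all the costs live here, at creation time.
#include <cctype>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Indexer {
    std::unordered_map<std::string, uint32_t> vocab;     // word -> word ID (the .elv part)
    std::map<uint32_t, std::vector<uint32_t>> postings;  // word ID -> doc IDs (the .eli part)

    void add_document(uint32_t doc_id, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            // Stand-in for real NLP: just lowercase. Stemming would go here.
            for (char& c : word) c = (char)std::tolower((unsigned char)c);
            // Assign the next free word ID the first time a word is seen.
            uint32_t id = vocab.try_emplace(word, uint32_t(vocab.size() + 1)).first->second;
            auto& docs = postings[id];
            if (docs.empty() || docs.back() != doc_id) docs.push_back(doc_id);
        }
    }
};

int main() {
    Indexer ix;
    ix.add_document(1, "how to undo the last git commit");
    ix.add_document(2, "git rebase explained");
    // Serializing ix.vocab (to something like english-vocabulary.elv) and
    // ix.postings (to an mmap-able stackoverflow.eli) happens exactly once;
    // afterwards, queries never touch the crawler or this code again.
    return 0;
}
```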
How about this: everyone can set up a search index configuration (see #1), pay only for the data creation, and then freely share the resulting search index with whomever they want? We could have a “Coder’s Guild” Patreon where everyone donates $1 monthly to index the most popular programming domains. Then everyone can download those files and have free, private, local search.
3. Care for your friends
Not everyone is technical enough to run their own local search engine. That’s OK. With an open source file format and open source search client, though, you (the tech person) could easily set up your own search engine to be used by friends and family. They won’t get to choose which search indices to subscribe to, but chances are they will be just fine with re-using your choices.
Also, nobody will feel bad if you charge for it or run ads on your server. Anyone who dislikes that is free to rent their own $5 VPS and install their own copy.
And imagine all the automation and integration opportunities if you had a web search command-line tool… Stack Overflow as an Eclipse plugin ;)
How can this work technically?
- The user uses a website or the command-line tool to call `libEle` with a text string, for example `$ ele "git undo last commit"`.
- `libEle` will use NLP processing and a vocabulary file `english-vocabulary.elv` to convert the text into word IDs (numbers).
- It’ll then read your subscription config `coders-guild.els`. That’s a JSON config file which contains a list of index filenames, but it can also refer to other subscription configs to be included (see the sketch after this list). Let’s say this one only lists `stackoverflow.eli`.
- `libEle` will mmap `stackoverflow.eli` into memory, `static_cast` it into a bloom hash table, and then look up all word IDs to get the matching document IDs (also sketched below).
- The document IDs that were found for all words are ranked and their URLs are returned.
- The command-line tool or the website formats the resulting URL list and displays it.
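For illustration, a minimal `coders-guild.els` might look like this; the field names are assumptions, since no such format exists yet:

```json
{
  "name": "Coder's Guild",
  "vocabulary": "https://indices.example.org/english-vocabulary.elv",
  "includes": ["https://example.org/subscriptions/sysadmin-sites.els"],
  "indices": ["https://indices.example.org/stackoverflow.eli"]
}
```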
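And a rough sketch of the lookup path itself. To be clear, the on-disk `.eli` layout below (a header, an open-addressing hash table of word-ID slots, and posting lists of document IDs) is invented for illustration; a real format would also need versioning, ranking data, and a document-ID-to-URL table:

```cpp
// Hypothetical query path, end to end. The .eli layout and the tiny
// in-memory vocabulary (standing in for english-vocabulary.elv) are
// illustrations only.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct IndexHeader { uint64_t magic; uint64_t slot_count; };
struct Slot { uint32_t word_id; uint32_t doc_count; uint64_t postings_offset; };

// Probe the open-addressing table for one word ID; return its doc-ID list.
static std::vector<uint32_t> lookup(const uint8_t* base, uint32_t word_id) {
    auto* hdr = reinterpret_cast<const IndexHeader*>(base);
    auto* slots = reinterpret_cast<const Slot*>(base + sizeof(IndexHeader));
    uint64_t i = word_id % hdr->slot_count;
    for (uint64_t n = 0; n < hdr->slot_count; ++n, i = (i + 1) % hdr->slot_count) {
        if (slots[i].doc_count == 0) break;  // empty slot: word not indexed
        if (slots[i].word_id == word_id) {
            auto* docs = reinterpret_cast<const uint32_t*>(base + slots[i].postings_offset);
            return std::vector<uint32_t>(docs, docs + slots[i].doc_count);
        }
    }
    return {};
}

int main() {
    // 1. Text -> word IDs. A real client would load english-vocabulary.elv.
    std::unordered_map<std::string, uint32_t> vocab =
        {{"git", 1}, {"undo", 2}, {"commit", 3}};
    std::vector<uint32_t> word_ids;
    for (const char* w : {"git", "undo", "commit"})
        if (auto it = vocab.find(w); it != vocab.end()) word_ids.push_back(it->second);
    if (word_ids.empty()) return 0;

    // 2. mmap the index file read-only; the kernel pages it in on demand.
    int fd = open("stackoverflow.eli", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);
    void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    auto* base = static_cast<const uint8_t*>(mem);

    // 3. Intersect posting lists: keep only doc IDs that matched every word.
    std::vector<uint32_t> result = lookup(base, word_ids[0]);
    std::sort(result.begin(), result.end());
    for (size_t k = 1; k < word_ids.size() && !result.empty(); ++k) {
        std::vector<uint32_t> next = lookup(base, word_ids[k]);
        std::sort(next.begin(), next.end());
        std::vector<uint32_t> both;
        std::set_intersection(result.begin(), result.end(),
                              next.begin(), next.end(), std::back_inserter(both));
        result.swap(both);
    }

    // 4. Ranking and the doc-ID -> URL mapping would happen here.
    for (uint32_t doc : result) printf("matched document %u\n", doc);
    munmap(mem, st.st_size);
    close(fd);
    return 0;
}
```

The mmap is the key trick: the OS only pages in the parts of the index file that a query actually touches, so even a multi-gigabyte `.eli` stays cheap to query on a small machine.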
In practice, files will probably have somewhat more elaborate names to make updating easier, like `stackoverflow-2022.eli`, `stackoverflow-2022-02.eli`, and `stackoverflow-2022-02-14.eli` for an index that is updated daily. That way, the large 2022 base index only needs to be updated once per year, and the daily updates can be small deltas against the monthly index update.
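Inside an `.els`, that layering could be expressed by simply listing all three files, oldest first, with the client applying newer deltas on top (format still hypothetical):

```json
{
  "indices": [
    "https://indices.example.org/stackoverflow-2022.eli",
    "https://indices.example.org/stackoverflow-2022-02.eli",
    "https://indices.example.org/stackoverflow-2022-02-14.eli"
  ]
}
```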
If we assume that index files are considered free data which the user is allowed to get their hands on - in short, we’re not a VC megacorp that needs to extract maximum revenue through privacy invasion and ads - then updating things is actually quite simple.
A free open source updater tool could check your `.els` subscription files once per day and issue HTTP `HEAD` requests against the distribution URLs of all referenced files. Files that were updated are downloaded over HTTP(S), and you’re good to go.
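A minimal sketch of that freshness check, using libcurl; the URL and filename are placeholders, and a real updater would loop over everything referenced by your `.els` files:

```cpp
// Check whether a remote index file is newer than the local copy,
// using an HTTP HEAD request (no body is transferred).
#include <curl/curl.h>
#include <sys/stat.h>
#include <cstdio>

// Returns the remote Last-Modified time (seconds since epoch), or -1 if
// the server didn't report one; a real tool might fall back to ETag.
static long remote_mtime(const char* url) {
    CURL* h = curl_easy_init();
    if (!h) return -1;
    curl_easy_setopt(h, CURLOPT_URL, url);
    curl_easy_setopt(h, CURLOPT_NOBODY, 1L);    // HEAD request: headers only
    curl_easy_setopt(h, CURLOPT_FILETIME, 1L);  // parse the Last-Modified header
    long t = -1;
    if (curl_easy_perform(h) == CURLE_OK)
        curl_easy_getinfo(h, CURLINFO_FILETIME, &t);
    curl_easy_cleanup(h);
    return t;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    const char* url   = "https://indices.example.org/stackoverflow-2022-02-14.eli";
    const char* local = "stackoverflow-2022-02-14.eli";

    struct stat st;
    long local_t  = (stat(local, &st) == 0) ? (long)st.st_mtime : -1;
    long remote_t = remote_mtime(url);

    if (remote_t > local_t)
        printf("%s is stale, fetch %s\n", local, url);  // real tool: GET + replace
    else
        printf("%s is up to date\n", local);
    curl_global_cleanup();
    return 0;
}
```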
Now this could be quite a lot of data to download if you want a large search index. But the beauty of an open source client is that you have the choice. Do I want large downloads to my laptop? Do I want to update my indices less often? Or do I want to offload things to a cheap VPS?
How can this work financially?
That’s easy: people pay for the creation of the index files.
If you want the most popular coding websites in your search results, then most likely someone else will already have created a matching `.els` config, and you can just link to their config to incorporate their search indices into your configuration.
If not, then you and your friends can create the “Coder’s Guild”, front the $1000 per month to pay for creation of the search index, and then set up a Patreon to ask others to join in. That approach seems to work well for AdBlock lists.
If the file format is open, you (the user) are protected because once the current providers become too expensive, someone else can just create a new company and offer better pricing. An open file format is mandatory for free market competition. And we urgently need competition in web search.
That said, I expect this to be vastly cheaper than running Google/Bing, because you only need to index the domains that you actually want to see in your search results. It’s opt-in, after all.
That means a US-based coder probably needs around 200 English-language domains indexed. Stack Overflow has 31 million answers to 21 million questions; at a few kilobytes per post, that probably fits into 300 GB. I would be surprised if all of my usual work-related search queries wouldn’t fit onto one 8 TB drive. But that’s because I (the user) deliberately exclude most of the internet from my results.
What’s with the name?
Elephants are famous for their good memory, and “ele” would be a good name for a command-line tool: easy to remember, easy to spell, easy to type. And it doesn’t conflict with any existing Linux tool. Also, I happen to own the domain “elefound.com”, so I have proof of first use should there be any trademark disputes.
How can this become reality?
I happen to work on a large-scale private search engine, so I have experience with indexing, NLP, word embeddings, and file format design. And my napkin math suggests that I can pay the expenses of creating a Stack Overflow index and running a test search server.
But this is a federated approach to web search. So for things to work out truly well, others will need to run their own search servers, package the library and tools for all the common OS distributions, and maintain their own `.els` configs of which domains they consider useful and trustworthy.
For the creation of the web indices, a company will be needed. I would be happy to operate such a service but I’m unsure if enough people would be willing to pay for data creation so that it can survive as a business.
So that’s the question I would like to see discussed:
Would you be willing to pay for the creation of search index data files?
Those files would then “belong” to you. They would be in an open file format, to be used with open source tools. And you could share them with whoever you want, or even resell them. Or create your own search engine, curated to only list websites that you like. You could be the hero of your non-techie friends by giving them web search that actually works. (I heard that for medical research papers, Google has become basically useless, thanks to spam.)