Incoherent rant.
I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.
So I’ve decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.
Behold, Anubis.
“Weighs the soul of incoming HTTP requests to stop AI crawlers”
From how I understand it, it works like a reverse proxy per each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out all bot activity instantly stopped. Not a single one got through yet.
My setup is basically just a home server -> tailscale tunnel (not funnel) -> VPS -> caddy reverse proxy, now with anubis integrated.
I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.
I am planning to try it out, but for caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.
It is a custom caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.
Now here’s the fun part, the defender plugin can produce garbage as response so when a matching AI crawler fits it will poison their training dataset.
Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)
git.blob42.xyz { @bot <<CEL header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))') CEL abort @bot defender garbage { ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8 } rate_limit { zone dynamic_botstop { match { method GET # to use with defender #header X-RateLimit-Apply true #not header LetMeThrough 1 } key {remote_ip} events 1500 window 30s #events 10 #window 1m } } reverse_proxy upstream.server:4242 handle_errors 429 { respond "429: Rate limit exceeded." } }
If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud