Anubis is awesome! Stopping (AI)crawlbots

zoey@lemmy.librebun.com · 2 days ago

Anubis is awesome! Stopping (AI)crawlbots

Charlxmagne@lemmy.world · 22 hours ago

Been seeing this on people’s Invidious instances

fossilesque@mander.xyz · 1 day ago

I’ve been planning on setting this up for ages. Love the creator’s vibe. Thanks for this.

reddeadhead@awful.systems · edit-2 1 day ago

Anubis just released the no-JS challenge in an update. Page loads for me with JS disabled. https://anubis.techaro.lol/blog/release/v1.20.0/

blob42@lemmy.ml · edit-2 1 day ago

I am planning to try it out, but for Caddy users I came up with a solution that works after being bombarded by AI crawlers for weeks.

It is a custom Caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here’s the fun part: the defender plugin can produce garbage as a response, so when an AI crawler matches, it will poison their training dataset.

Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

git.blob42.xyz {
    # Flag requests with a zh-CN Accept-Language header or a bot-like User-Agent.
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    # Drop flagged connections without sending a response.
    abort @bot

    # caddy-defender: serve garbage to requests from these ranges.
    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    # caddy-ratelimit: 1500 GET requests per 30s per client IP.
    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}

If I am not mistaken, the 47.0.0.0/8 IP block is for Alibaba Cloud
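
To get a feel for how broad the `@bot` matcher in the config above is, here is a rough Python approximation of it. This is only a sketch: `is_flagged` is a hypothetical helper, the sample User-Agent strings are made up for illustration, and the exact-match handling of `Accept-Language` is an assumption about how the header matcher behaves, not something checked against Caddy itself.

```python
import re

# Same pattern as the header_regexp matcher in the Caddyfile above.
BOT_UA = re.compile(
    r"(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))"
)


def is_flagged(user_agent: str, accept_language: str = "") -> bool:
    """Approximate the @bot matcher (hypothetical helper, not Caddy code).

    Assumption: the Accept-Language check is an exact match on 'zh-CN'.
    """
    return accept_language == "zh-CN" or BOT_UA.search(user_agent) is not None


# Illustrative User-Agent strings, not taken from real logs.
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
GPTBOT = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

if __name__ == "__main__":
    for ua in (FIREFOX, GPTBOT, GOOGLEBOT):
        print(is_flagged(ua), ua)
```

Note that the pattern is case-insensitive and unanchored, so it flags anything with "bot" or "google" in the User-Agent, which includes ordinary search engine crawlers as well as AI scrapers.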

dan@upvote.au · 2 days ago

The Anubis site thinks my phone is a bot :/

tbh I would have just configured a reasonable rate limit in Nginx and left it at that.

Won’t the bots just hammer the API instead now?
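
For anyone wondering what a per-client rate limit boils down to, here is a minimal sliding-window sketch in Python. It is not how Nginx or caddy-ratelimit implement it; `SlidingWindowLimiter` and its parameters are hypothetical names, chosen to mirror the `events` and `window` settings in the Caddy config earlier in the thread.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `events` requests per `window` seconds for each key."""

    def __init__(self, events: int, window: float) -> None:
        self.events = events
        self.window = window
        # Per-key timestamps of recently accepted requests, oldest first.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if the request is accepted, False if it should get a 429."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.events:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(events=3, window=30.0)
    for t in (0.0, 1.0, 2.0, 3.0, 30.0):
        print(t, limiter.allow("203.0.113.7", now=t))
```

The key would normally be the client IP. A crawler that spreads its requests over many IPs stays under any per-IP limit, which is one reason a rate limit alone may not be enough.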

NotSteve_@piefed.ca · 2 days ago

I love Anubis just because the dev is from my city that’s never talked about (Ottawa)

The Hobbyist@lemmy.zip · edit-2 2 days ago

@demigodrick@lemmy.zip

Perhaps of interest? I don’t know how many bots you’re facing.

sic_semper_tyrannis@lemmy.today · 2 days ago

FUTO gave them a micro-grant this month

lambalicious@lemmy.sdf.org · 2 days ago

Positives: nice uwu art.

Negatives: requires javascript, intrinsically ableist.

ohshit604@sh.itjust.works · 2 days ago

How is the art a positive?

lambalicious@lemmy.sdf.org · 22 hours ago

What do you mean, how?

Cute anime catgirl, a staple of the internet, without having to be showy or anything. And there are hooks to change it.

(Was actually half-surprised they didn’t go with “anime!stereotypical egyptian priestess” given the context of the software, but I feel that would have ended up too thematically overloaded in the end)

AmbitiousProcess@piefed.social · 2 days ago

Could you elaborate on how it’s ableist?

As far as I’m aware, not only are they making a version that doesn’t even require JS, but the JS is only needed for the challenge itself, and the browser can then view the page(s) afterwards entirely without JS being necessary to parse the content in any way. Things like screen readers should still do perfectly fine at parsing content after the browser solves the challenge.

phase@lemmy.8th.world · 2 days ago

There’s another challenge available, without javascript.

Daniel Quinn@lemmy.ca · 2 days ago

I’ve been thinking about setting up Anubis to protect my blog from AI scrapers, but I’m not clear on whether this would also block search engines. It would, wouldn’t it?

zoey@lemmy.librebun.com · 2 days ago

I’m not entirely sure, but if you look here https://github.com/TecharoHQ/anubis/tree/main/data/bots
They have separate configs for each bot. https://github.com/TecharoHQ/anubis/blob/main/data/botPolicies.json

Possibly linux@lemmy.zip · 2 days ago

It doesn’t stop bots

All it does is make clients do as much or more work than the server, which makes it less tempting to hammer the web.
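
The "make the client do the work" idea is a proof-of-work challenge. Below is a generic hash-based sketch in Python, not Anubis's actual code: the `solve` and `verify` names, the challenge string, and the "leading zero hex digits" difficulty rule are all assumptions made for illustration.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge string followed by the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce whose digest starts
    with `difficulty` zero hex digits. Cost grows about 16x per digit."""
    prefix = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(prefix):
            return nonce
    raise AssertionError("unreachable: count() never ends")


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = "example-challenge"  # hypothetical value issued by the server
    nonce = solve(challenge, difficulty=3)
    print(nonce, verify(challenge, nonce, difficulty=3))
```

The asymmetry is the point: the client hashes thousands of times, the server hashes once. That is negligible for one person loading a page, but it adds up for a crawler fetching millions of pages.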

zoey@lemmy.librebun.com · 2 days ago

Yeah, from what I understand it’s nothing crazy for any regular client, but really messes with the bots.
I don’t know, I’m just so glad and happy it works, it doesn’t mess with federation and it’s barely visible when accessing the sites.

Possibly linux@lemmy.zip · 2 days ago

Personally my only real complaint is the lack of WASM. Outside of that it works fairly well.

Incoherent rant.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”