Blocking Bad Bots - .htaccess (Apache2)


Block bad users based on their User-Agent string

Sometimes your website is hit from many different IP addresses, making it impractical to block every offender by address. If the requests share a fixed User-Agent string, you can block them with the following rule:

Block multiple bad User-Agents:

<VirtualHost XXX.140.234.34:80>

    ServerName www.example.us

    LogLevel debug
    ErrorLog /var/log/apache2/example.log
    CustomLog /var/log/apache2/example.log combined
    AddDefaultCharset utf-8

    RewriteEngine On

    # Block requests from Bytespider and ClaudeBot: send a 403 Forbidden response ([F]) and stop processing further rewrite rules ([L]).
    RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
    RewriteRule ^ - [F,L]

    # Redirect requests that arrive by bare IP address to the canonical hostname.
    RewriteCond %{SERVER_NAME} XXX.140.234.34
    RewriteRule /(.*) http://www.example.us/$1 [R=301,L]

    LimitRequestBody 5120000

    WSGIDaemonProcess www.example.us processes=10 threads=10 maximum-requests=1000 display-name=%{GROUP}
    WSGIProcessGroup www.example.us
    WSGIScriptAlias / /home/admin/deployment/example/wsgi.py

    ....

</VirtualHost>
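
If you cannot edit the VirtualHost configuration, the same block also works from an .htaccess file in the site's document root; a minimal sketch, assuming mod_rewrite is enabled and AllowOverride permits FileInfo directives:

    RewriteEngine On

    # Return 403 Forbidden for any request whose User-Agent contains "Bytespider" or "ClaudeBot" (case-insensitive).
    RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
    RewriteRule ^ - [F,L]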

Blocking unwanted user agents can help prevent certain bots and crawlers from accessing your site, which can save resources and improve security. Here’s an expanded explanation and additional considerations for blocking user agents with Apache:

How it Works

  1. RewriteCond:

    RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
    • This condition checks the User-Agent header of incoming HTTP requests.
    • The Bytespider|ClaudeBot part is a regular expression that matches any user agent containing "Bytespider" or "ClaudeBot".
    • The [NC] flag makes the match case-insensitive (NC stands for No Case).
  2. RewriteRule:

    RewriteRule ^ - [F,L]
    • The ^ is a regular expression anchor that matches the beginning of any request URI, so the rule applies to every request that passed the preceding condition.
    • The - means no substitution should be performed.
    • The [F,L] flags mean:
      • F: Return a 403 Forbidden status code.
      • L: Stop processing further rewrite rules.
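
If you prefer not to rely on mod_rewrite for this, Apache 2.4 can achieve the same effect with mod_setenvif and mod_authz_core; a sketch, assuming both modules are loaded, which can be placed inside the VirtualHost:

    # Tag requests whose User-Agent matches (case-insensitive) with the bad_bot variable.
    SetEnvIfNoCase User-Agent "Bytespider|ClaudeBot" bad_bot

    # Refuse tagged requests with 403 while allowing everyone else.
    <Location "/">
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Location>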

Example with More User Agents

You might want to block multiple known bots. Here’s an example with a more comprehensive list:

RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|AhrefsBot|MJ12bot|SemrushBot|Baiduspider) [NC]
RewriteRule ^ - [F,L]
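
If the list keeps growing, it can be easier to maintain when split across several conditions joined with the [OR] flag; a sketch using the same example bots:

RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|AhrefsBot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (MJ12bot|SemrushBot|Baiduspider) [NC]
RewriteRule ^ - [F,L]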

Testing Your Configuration
  1. Restart Apache: After updating the configuration, restart Apache to apply changes:

    sudo systemctl restart apache2
  2. Check Logs: Monitor your log files to ensure the rules are working as expected:

    tail -f /var/log/apache2/example.log
    

  3. Use cURL: Test the blocking rule using cURL with different user agents:

    curl -I -A "Bytespider" http://www.example.us
    curl -I -A "Mozilla/5.0" http://www.example.us

    The first command should return a 403 Forbidden, while the second should return 200 OK or the appropriate status for your site.
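
To see which requests are actually being blocked, you can filter the access log for 403 responses. A quick sketch, assuming the combined log format configured above, where the status code is the ninth whitespace-separated field:

    awk '$9 == 403' /var/log/apache2/example.log | tail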

Advanced Considerations

  1. Robots.txt: While the robots.txt file politely asks well-behaved bots to avoid certain areas of your site, it does not enforce anything. Blocking at the Apache level stops even bots that ignore it (see the robots.txt sketch after this list).

  2. Dynamic Blocking: For a more dynamic approach, consider using mod_security with a rule set designed to block bad bots.

  3. Whitelist Approach: Instead of blocking specific user agents, you could adopt a whitelist approach, allowing only known good bots (like Googlebot, Bingbot) and blocking everything else.

  4. Rate Limiting: In addition to blocking, you can also implement rate limiting to prevent abuse from bots that make too many requests.
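
For reference, a minimal robots.txt sketch (placed in the site's document root) that asks specific crawlers to stay away; it is purely advisory and only works for bots that choose to honor it:

    User-agent: Bytespider
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /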
