Blocking Bad Bots - .htaccess (Apache2)
Block bad users based on their User-Agent string
Sometimes your website is hit from many different IP addresses, making it impractical to block every offender by IP. If the requests carry a fixed User-Agent string, however, you can block that access with the following rule:
Block multiple bad User-Agents:
<VirtualHost XXX.140.234.34:80>
ServerName www.example.us
LogLevel debug
ErrorLog /var/log/apache2/example.log
CustomLog /var/log/apache2/example.log combined
AddDefaultCharset utf-8
RewriteEngine On
# Block requests from Bytespider and ClaudeBot - this rule sends a 403 Forbidden response ([F]) and stops processing further rewrite rules ([L]).
RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
RewriteRule ^ - [F,L]
RewriteCond %{SERVER_NAME} XXX.140.234.34
RewriteRule /(.*) http://www.example.us/$1 [R=301,L]
LimitRequestBody 5120000
WSGIDaemonProcess www.example.us processes=10 threads=10 maximum-requests=1000 display-name=%{GROUP}
WSGIProcessGroup www.example.us
WSGIScriptAlias / /home/admin/deployment/example/wsgi.py
....
</VirtualHost>
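The same rule also works from a per-directory .htaccess file, which is convenient on shared hosting where you cannot edit the virtual host. This is a minimal sketch; it assumes mod_rewrite is enabled and that the directory allows overrides (AllowOverride FileInfo or All):
# .htaccess in the site's document root
RewriteEngine On
# Refuse any request whose User-Agent contains one of these strings
RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
RewriteRule ^ - [F,L]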
Blocking unwanted user agents can help prevent certain bots and crawlers from accessing your site, which can save resources and improve security. Here’s an expanded explanation and additional considerations for blocking user agents with Apache:
How it Works
RewriteCond:
RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
- This condition checks the User-Agent header of incoming HTTP requests.
- The Bytespider|ClaudeBot part is a regular expression that matches any user agent containing "Bytespider" or "ClaudeBot".
- The [NC] flag makes the match case-insensitive (NC stands for No Case).
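If you prefer to keep one pattern per bot, mod_rewrite's [OR] flag chains several conditions together; the sketch below should behave the same as the single combined pattern above:
# Conditions are ORed, so the rule fires if either pattern matches
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC]
RewriteRule ^ - [F,L]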
RewriteRule:
RewriteRule ^ - [F,L]
- The ^ is a regular expression that matches the start of the request URI.
- The - means no substitution should be performed.
- The [F,L] flags mean:
  - F: Return a 403 Forbidden status code.
  - L: Stop processing further rewrite rules.
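As a small variant, the [G] flag answers with 410 Gone instead of 403 Forbidden, which some crawlers reportedly treat as a stronger hint to stop retrying:
# Same match, but respond with 410 Gone rather than 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} Bytespider|ClaudeBot [NC]
RewriteRule ^ - [G,L]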
Example with More User Agents
You might want to block multiple known bots. Here’s an example with a more comprehensive list:
RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|AhrefsBot|MJ12bot|SemrushBot|Baiduspider) [NC]
RewriteRule ^ - [F,L]
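As the list grows, it can be easier to maintain the pattern in one place. Below is a hedged sketch using Apache's Define directive (available in server or virtual-host context, not in .htaccess); BAD_BOTS is just an illustrative variable name:
# Keep the pattern in one variable so the rule itself never changes
Define BAD_BOTS "Bytespider|ClaudeBot|AhrefsBot|MJ12bot|SemrushBot|Baiduspider"
RewriteCond %{HTTP_USER_AGENT} (${BAD_BOTS}) [NC]
RewriteRule ^ - [F,L]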
Testing Your Configuration
Restart Apache: After updating the configuration, restart Apache to apply changes:
sudo systemctl restart apache2
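It is worth validating the syntax first so a typo does not take the site down; the control binary is apache2ctl on Debian/Ubuntu and apachectl on most other distributions:
# Check the configuration for syntax errors before restarting
sudo apache2ctl configtest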
Check Logs: Monitor your log files to ensure the rules are working as expected:
tail -f /var/log/apache2/example.log
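To see only the blocked crawlers rather than the whole stream, you can grep the log for the user agents you are targeting (adjust the path to your own log file):
# Show the most recent hits from the blocked bots
grep -iE "Bytespider|ClaudeBot" /var/log/apache2/example.log | tail -n 20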
Use cURL: Test the blocking rule using cURL with different user agents:
curl -I -A "Bytespider" http://www.example.us
curl -I -A "Mozilla/5.0" http://www.example.us
The first command should return a 403 Forbidden, while the second should return 200 OK or the appropriate status for your site.
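A small shell loop makes it easy to check several user agents in one go; it prints only the HTTP status code returned for each one (the URL is the example host used above):
for ua in "Bytespider" "ClaudeBot" "Mozilla/5.0"; do
    printf '%-12s ' "$ua"
    curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" http://www.example.us/
done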
Advanced Considerations
Robots.txt: While the robots.txt file politely asks well-behaved bots to avoid certain areas of your site, it does not enforce compliance. Blocking via Apache enforces the restriction at the server level.
Dynamic Blocking: For a more dynamic approach, consider using mod_security with a rule set designed to block bad bots.
Whitelist Approach: Instead of blocking specific user agents, you could adopt a whitelist approach, allowing only known good bots (like Googlebot, Bingbot) and blocking everything else (see the sketch after this list).
Rate Limiting: In addition to blocking, you can also implement rate limiting to prevent abuse from bots that make too many requests.
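Here is a minimal sketch of the whitelist idea, assuming you only want to let Googlebot and Bingbot through while refusing anything else that identifies itself as a bot or crawler; the patterns are illustrative and would need tuning for your own traffic:
# Allow known good bots, block anything else that looks like a crawler
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Bingbot) [NC]
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|scrape) [NC]
RewriteRule ^ - [F,L]
Keep in mind that the User-Agent header can be spoofed, so a strict whitelist is usually combined with reverse-DNS verification of the major search engines.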