Saturday, March 7, 2009

EMail Filtering

Gmail provides a bunch of great ways to filter your e-mail. One of the most accurate is to append a + and a string to the end of your user name (e.g. myaccount+slashdot@gmail.com) when signing up for a service. Gmail ignores the + and everything after it when determining where to deliver a message, but the full address still appears in the "Delivered-To" message header. You can then run a search or create a filter using "deliveredto:myaccount+slashdot@gmail.com". This lets you set up auto-labeling or other behaviors based on the origin of the e-mail, without having to know where the message might be coming "From:".
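To make the mechanics concrete, here is a minimal Python sketch of how a tagged address breaks down into the mailbox that receives the message and the tag you can filter on. The function name split_plus_tag is made up for illustration, and this only mirrors the behavior described above; it is not Gmail's actual code.

```python
def split_plus_tag(address):
    """Split 'user+tag@domain' into (base_address, tag); tag is '' if absent."""
    # Everything before the last @ is the local part.
    local, _, domain = address.rpartition("@")
    # The mailbox name is whatever comes before the first +.
    user, _, tag = local.partition("+")
    return f"{user}@{domain}", tag


base, tag = split_plus_tag("myaccount+slashdot@gmail.com")
print(base)  # the mailbox the message is delivered to
print(tag)   # the part you would filter on
```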

This is also useful for setting up forwarding between Gmail accounts. Using this guarantees you can label forwarded e-mail, because Delivered-To will always be the address you specified in the forwarding setup, even if the original e-mail was addressed To: a mailing list or you were originally BCCed.

There's one big problem with this setup. There are a lot of... misguided developers out there who set up their registration forms with bad e-mail validation. I'd say about 50% of the time I can't use this method because the form rejects + as an invalid character.

http://en.wikipedia.org/wiki/E-mail_address#RFC_specification

I wish people would bother to follow standards. Very annoying.

13 comments:

Brandon Leon said...

I agree. I'm big on making sure to follow standards; they make everyone's life so much easier when followed.

Anonymous said...

Gmail ignores periods in usernames but recognizes them in filters, so you can use '.' for sites which disallow '+'. E.g., instead of bogosblog@gmail.com, give out bogos.blog@gmail.com to certain sites. Then you can create a filter for bogos.blog in the To: field.

You can include multiple periods, and even several periods in a row if you'd like. E.g. bo.gos.bl.og@gmail.com or bogos...blog@gmail.com.

Of course, one loses the semantic advantages possible with the '+', but if there are some sites you really want or need to sign up with that wrongly forbid '+', the period is one way to do so while still keeping filtering possibilities.
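A rough Python sketch of the period trick described above: if periods in the local part are ignored, all the dotted spellings collapse to the same mailbox. The function name gmail_canonical is hypothetical, and the rule is only assumed to hold for Gmail addresses as this comment describes, not for mail providers in general.

```python
def gmail_canonical(address):
    """Drop periods from the local part, as Gmail is described as doing."""
    local, _, domain = address.rpartition("@")
    return local.replace(".", "") + "@" + domain


for given in ("bogos.blog@gmail.com", "bo.gos.bl.og@gmail.com", "bogos...blog@gmail.com"):
    # Each spelling maps back to the same underlying account.
    print(given, "->", gmail_canonical(given))
```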

Anonymous said...

Have you ever tried to comply with RFC 2822 when writing an email validation routine? It's pure insanity, so people take shortcuts.

The only real way to "validate" an email address is to establish an SMTP connection to the host.domain, and if the SMTP server says "ok, I'll accept a message to that address," then you've got an email address.

Think about it.
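For what it's worth, the SMTP check suggested above could be sketched with Python's standard smtplib roughly as follows. The names probe_address and rcpt_accepted are made up for illustration. The mail server host is assumed to be known already, since looking up MX records needs a DNS library that is not in the standard library. Keep in mind that some servers accept every recipient at this stage and bounce later, so a "yes" is a hint rather than proof.

```python
import smtplib


def rcpt_accepted(code):
    """True for the SMTP reply codes that mean the recipient was accepted."""
    return code in (250, 251)


def probe_address(address, mx_host, timeout=10):
    """Ask mx_host whether it would accept mail for address, without sending any.

    mx_host is an assumption: the caller must already know which server
    handles mail for the address's domain.
    """
    with smtplib.SMTP(mx_host, 25, timeout=timeout) as server:
        server.helo()
        server.mail("")  # null sender, i.e. MAIL FROM:<>
        code, _reply = server.rcpt(address)
    # Leaving the with-block sends QUIT; no message body is ever transmitted.
    return rcpt_accepted(code)
```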

Bogo said...

I use the . method when + fails, though my main email address is only six characters, so that will only last me 31 addresses :)
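The 31 comes from having five gaps in a six-character name, each of which either gets a period or doesn't. A quick Python sketch to check the arithmetic, using "abcdef" as a stand-in six-character name (not the real address) and allowing at most one period per gap:

```python
from itertools import product


def dotted_variants(local):
    """Yield every way to place at most one period in each gap between characters."""
    gaps = max(len(local) - 1, 0)
    for choice in product(("", "."), repeat=gaps):
        # Pair each character with the separator that follows it;
        # the last character gets no separator.
        yield "".join(ch + sep for ch, sep in zip(local, choice + ("",)))


variants = list(dotted_variants("abcdef"))
print(len(variants) - 1)  # dotted forms, not counting the undotted original
```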

Bogo said...

To be fair, I have never implemented this myself. A quick search turned up what appear to be some fairly decent regular expressions that accept the + sign though.

It's only going to get harder once there are no restrictions on TLDs, so maybe you're right and sites should be validating by trying to send an email.

Andrew Theken said...

Agreed, also, "." is often mishandled in the local part of email addresses.

Perhaps we can add regexes that meet the email criteria onto the wiki page in various languages so that people can just copy/paste them and we can be done with this pain.

Anonymous said...

The correct regex for an e-mail address is basically

.+@.+

Chris said...

If you haven't seen it before, an RFC822-compatible regex is rather complex and so it is amazing that people want to write their own. I think that this job is better done with a proper parser built using the grammar in the RFC. In other words, find a good module in your programming language of choice and use that.
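As one example of leaning on an existing module rather than a hand-rolled regex, Python's standard library ships email.utils.parseaddr. This is only a sketch: parseaddr extracts the address from a header-style string and returns a pair of empty strings when it cannot parse the input, so it is a lenient parser rather than a strict RFC validator. The display name "Bogo" here is just sample input.

```python
from email.utils import parseaddr

# Split a header-style string into its display name and address parts.
display_name, address = parseaddr("Bogo <myaccount+slashdot@gmail.com>")
print(display_name)
print(address)  # the + survives parsing untouched
```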

Anonymous said...

Thankfully, most sites are dropping validation and switching to the "enter it twice" method. Pretty much anything, including single characters, is a valid email address under the RFC so attempting to validate is just nuts.

Patrick said...

>The correct regex for an e-mail adress
> is basically
> .+@.+

Unfortunately, it's not that simple. That regex would match nonsense like "@@@Hello@@I AM SURELY NOT??? AN EMAIL ADRESS@@".
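This is easy to confirm with a few lines of Python (the variable names are just for illustration). The simple pattern accepts a real address and the nonsense string alike, and only rejects input with no @ in it at all:

```python
import re

NAIVE = re.compile(r".+@.+")

samples = [
    "myaccount+slashdot@gmail.com",
    "@@@Hello@@I AM SURELY NOT??? AN EMAIL ADRESS@@",
    "no-at-sign",
]
for s in samples:
    # fullmatch requires the whole string to fit the pattern.
    print(bool(NAIVE.fullmatch(s)), s)
```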

The last thing one should do is try to come up with some clever regular expression based on the most common form of email addresses. Here is what a parser should look like (alternatively, click on my name):

http://www.onyxbits.de/content/blog/patrick/parsing-email-addresses-java-without-having-javamail-api-available

And even that makes some simplifications (not allowing comments in addresses).

Anonymous said...

The problem is that MTAs don't necessarily all follow the RFC exactly. So, if you accept addresses that some popular MTA is going to bounce, then you can't contact that user and your error logs fill up with noise.

Case in point, that wikipedia page lists '!' as an acceptable character. However, at least some MTAs puke on addresses with '!', due to its significance for UUCP, apparently.

So you should at least disallow '!'. I ran into some exceptions due to other punctuation chars, but can't remember what they were now. However, we had users complain when we didn't allow '+', so we enabled that and have not had any problems.

Anonymous said...

Why do any heavy-duty checks at all? The only things I check for are at least a letter on either side of the @ and then a . somewhere after the @.

My point is, if they want to give a fake email, they can make it look real, or just give a real one. The checks are just semantics.

I used to use the + all the time, but so many sites block it that I gave up on it.
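The lightweight check described above (a letter on either side of the @, and a period somewhere after it) might look something like this in Python. The function name loose_check is hypothetical, and this is one reading of the rule, not a recommended validator. Note that it happily accepts the + sign.

```python
import re


def loose_check(address):
    """At least one letter on each side of the @, and a period after the @."""
    # Split on the first @; 'at' is empty if there is no @ at all.
    local, at, domain = address.partition("@")
    return (
        at == "@"
        and re.search(r"[A-Za-z]", local) is not None
        and re.search(r"[A-Za-z]", domain) is not None
        and "." in domain
    )


for candidate in ("myaccount+slashdot@gmail.com", "myaccount@gmail", "@gmail.com"):
    print(loose_check(candidate), candidate)
```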

Shadowhand said...

If anyone needs a PCRE regex to validate an address, Cal Henderson wrote one that you should use too!