Friday, November 11, 2005

Extended E-mail Address Verification and Correction


Problem/Question/Abstract:

Have you ever needed to verify that an e-mail address is correct, or have you had to work with a list of e-mail addresses and realized that some had simple problems that you could easily correct by hand?

Answer:

The functions I present here are designed to do just that. In this article I present two functions: one to check that an e-mail address is valid, and another to try to correct an invalid one.

Just what is a correct e-mail address?
The majority of articles I’ve seen on e-mail address verification use an over-simplified approach. The most common one is to check that an ‘@’ symbol is present, that the address meets a minimum length (e.g. 7 characters), or both. A better but less used method is to verify that the address contains only characters allowed by the SMTP standard.

The problem with these approaches is that they can only tell you, at the highest level, that an address is POSSIBLY correct. For example:

The address: ------@--------
would be considered a valid e-mail address, since it contains an @, is at least 7 characters long, and contains only valid characters.
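
To make the weakness concrete, here is a minimal sketch of the over-simplified test described above. NaiveIsValid is a hypothetical helper written only for this illustration; it is not part of the unit presented later.

```pascal
program NaiveCheckDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not part of the article's unit: the
// over-simplified test (an @ symbol plus a minimum length).
function NaiveIsValid(const Email: string): Boolean;
begin
  Result := (Pos('@', Email) > 0) and (Length(Email) >= 7);
end;

begin
  // Prints TRUE even though this is clearly not a usable address
  WriteLn(NaiveIsValid('------@--------'));
  // Prints FALSE: there is no @ symbol at all
  WriteLn(NaiveIsValid('someone.example.com'));
end.
```

The first call passes because nothing in the test looks at the structure of the username or the domain.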

To ensure an address is truly correct, you must verify that every portion of the e-mail address is valid. The function I present performs the following checks:
a) Ensure the address is not blank
b) Ensure an @ is present
c) Ensure that only valid characters are used
It then splits the address into its two sections, username (or mailbox) and domain, and validates each one.
Validation for the username:
a) Ensure it is not blank
b) Ensure the username is not longer than the standard allows (RFC 821)
c) Ensure that periods (.) are used properly: there cannot be consecutive periods (e.g. David..Lederman is not valid), nor can a period be the first or last character of the username
Validation for the domain name:
a) Ensure it is not blank
b) Ensure the domain name is not longer than the standard allows
c) Ensure that periods (.) are used properly: there cannot be consecutive periods (e.g. World..net is not valid), nor can a period be the first or last character of the domain
d) Check each domain segment (e.g. in someplace.somewhere.com, the segments are someplace, somewhere, and com) to ensure that it does not start or end with a hyphen (-) (e.g. somewhere.-someplace.com is not valid)
e) Ensure that at least two domain segments exist (e.g. someplace.com is valid, .com is not valid)
f) Ensure that there are no additional @ symbols in the domain portion
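
The domain segment rules are the least obvious of these checks, so here is a small standalone sketch of them. DomainSegmentsLookValid is a hypothetical helper written for this illustration only. It is not the code used in the unit below, and unlike that code it also applies the hyphen rule to the final segment.

```pascal
program DomainSegmentDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not part of the article's unit.
// Splits a domain on periods and applies the segment rules.
function DomainSegmentsLookValid(const Domain: string): Boolean;
var
  I, SegStart, SegCount: Integer;
  Segment: string;
begin
  Result := False;
  SegStart := 1;
  SegCount := 0;
  // Walk one position past the end so the final segment is handled too
  for I := 1 to Length(Domain) + 1 do
    if (I > Length(Domain)) or (Domain[I] = '.') then
    begin
      Segment := Copy(Domain, SegStart, I - SegStart);
      // An empty segment means a leading, trailing, or doubled period
      if Segment = '' then
        Exit;
      // A segment may not start or end with a hyphen
      if (Segment[1] = '-') or (Segment[Length(Segment)] = '-') then
        Exit;
      Inc(SegCount);
      SegStart := I + 1;
    end;
  // At least two segments are required, e.g. someplace.com
  Result := SegCount >= 2;
end;

begin
  WriteLn(DomainSegmentsLookValid('someplace.somewhere.com'));  // TRUE
  WriteLn(DomainSegmentsLookValid('somewhere.-someplace.com')); // FALSE
  WriteLn(DomainSegmentsLookValid('World..net'));               // FALSE
  WriteLn(DomainSegmentsLookValid('com'));                      // FALSE
end.
```

The sketch relies on Delphi's default short-circuit Boolean evaluation, so Domain[I] is never read once I has passed the end of the string.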

With the steps above, most e-mail addresses that look plausible but are not correctly formed can be detected and rejected.

The ValidateEmailAddress function:
This function takes three parameters:
Email – The e-mail address to check
FailCode – The error code reported by the function if it can’t validate the address
FailPosition – The position of the character (if available) where the validation failure occurred

The function returns True if the address is valid and False if it is invalid. If validation fails, FailCode can be used to determine the exact error that caused the problem:

  flUnknown – An unknown error occurred, and was trapped by the exception handler.
  flNoSeperator – No @ symbol was found.
  flToSmall – The email address was blank.
  flUserNameToLong – The user name was longer than the SMTP standard allows.
  flDomainNameToLong – The domain name was longer than the SMTP standard allows.
  flInvalidChar – An invalid character was found. (FailPosition returns the location of the character)
  flMissingUser – The username section is not present.
  flMissingDomain – The domain name section is not present.
  flMissingDomainSeperator – No period separating the domain segments was found.
  flMissingGeneralDomain – No top-level domain was found.
  flToManyAtSymbols – More than one @ symbol was found.

For simple validation you can ignore FailCode and FailPosition. To display an error, pass FailCode to ValidationErrorString, which returns a text description of the error.
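
As a rough usage sketch, a console program that validates one address and reports the failure might look like the following. It assumes the abSMTPRoutines unit listed at the end of this article is on the search path; the sample address is made up for the example.

```pascal
program ValidateDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, abSMTPRoutines;

var
  FailCode, FailPosition: Integer;
begin
  // Initialise both outputs: FailPosition is only set by the
  // function for errors that have a character position.
  FailCode := flUnknown;
  FailPosition := 0;
  if ValidateEmailAddress('user name@example.com', FailCode, FailPosition) then
    WriteLn('Address is valid')
  else
    WriteLn(Format('Invalid: %s (code %d, position %d)',
      [ValidationErrorString(FailCode), FailCode, FailPosition]));
end.
```

With the unit as listed, the space at position 5 is rejected by the allowed-character check, so this should print "Invalid: Invalid character! (code 5, position 5)".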

E-mail Address Correction
Since the e-mail validation routine returns detailed error information, an automated system to correct common e-mail address mistakes can easily be built on top of it. The following common mistakes can all be corrected automatically:

example2.aol.com – The most common error (at least in my experience): the user doesn’t hold Shift properly when typing the @ and enters a 2 instead.
example@.aol.com – An extra character entered by the user; example@aol.com was of course the intended e-mail address.
example8080 @ aol .com – Another common error: stray spaces.
A Cool Screen name@AOL.com – The user entered what they thought was their e-mail address, but while AOL allows screen names to contain spaces, Internet e-mail addresses do not.
myaddress@ispcom – The period between isp and com was not entered.

The CorrectEmailAddress function:
The function takes three parameters:
Email – The e-mail address to check and correct
Suggestion – A string, passed by reference, that receives the function’s result
MaxCorrections – The maximum number of corrections to attempt before stopping (defaults to 5)

This function simply loops up to MaxCorrections times, validating the e-mail address and then using the FailCode to decide what kind of correction to make. It repeats this until the address validates, it determines the address can’t be fixed, or it has looped MaxCorrections times.

The following corrections are performed, based on the FailCode (see description above):
flUnknown – Simply stops corrections, as there is no generic way to correct this problem.
flNoSeperator – When this error is encountered, the function performs a simple but powerful correction: it scans the e-mail address for the last 2 and converts it to an @ symbol. This corrects most genuine cases where a 2 was typed in place of the @. If it converts a 2 that was not really meant to be an @, chances are it has completely invalidated the e-mail address.
flToSmall - Simply stops corrections, as there is no generic way to correct this problem.
flUserNameToLong – Simply stops corrections, as there is no generic way to correct this problem.
flDomainNameToLong – Simply stops corrections, as there is no generic way to correct this problem.
flInvalidChar – In this case the offending character is simply deleted.
flMissingUser – Simply stops corrections, as there is no generic way to correct this problem.
flMissingDomain – Simply stops corrections, as there is no generic way to correct this problem.
flMissingDomainSeperator – Inserts a period before the last three characters of the address (e.g. myaddress@ispcom becomes myaddress@isp.com).
flMissingGeneralDomain – Simply stops corrections, as there is no generic way to correct this problem.
flToManyAtSymbols – Simply stops corrections, as there is no generic way to correct this problem.

While only a small portion of errors can be corrected, the function handles the most common errors encountered when working with lists of e-mail addresses, particularly when the data was entered by the account holders themselves.
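
A short usage sketch for the correction routine, run against three of the sample mistakes listed earlier. It assumes the abSMTPRoutines unit from this article is available; the Samples array is just test data for the example.

```pascal
program CorrectDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, abSMTPRoutines;

const
  // Sample mistakes taken from the list earlier in this article
  Samples: array[0..2] of string = ('example2.aol.com',
    'example@.aol.com', 'myaddress@ispcom');

var
  I: Integer;
  Suggestion: string;
begin
  for I := Low(Samples) to High(Samples) do
    // MaxCorrections is left at its default of 5
    if CorrectEmailAddress(Samples[I], Suggestion) then
      WriteLn(Samples[I], ' -> ', Suggestion)
    else
      WriteLn(Samples[I], ' could not be corrected');
end.
```

Tracing the unit by hand, the first sample takes two corrections (the 2 becomes an @, then the stray period is deleted) and ends up as example@aol.com. The second also becomes example@aol.com, and the third becomes myaddress@isp.com.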

The following is the source code for the functions described above. Feel free to use the code in your own programs, but please leave my name and address intact!

// ---------------------------ooo------------------------------ \\
// ©2000 David Lederman
// dlederman@internettoolscorp.com
// ---------------------------ooo------------------------------ \\
unit abSMTPRoutines;

interface

uses
  SysUtils, Classes;

// ---------------------------ooo------------------------------ \\
// These constants represent the various (known) validation
// errors that can occur.
// ---------------------------ooo------------------------------ \\
const
  flUnknown = 0;
  flNoSeperator = 1;
  flToSmall = 2;
  flUserNameToLong = 3;
  flDomainNameToLong = 4;
  flInvalidChar = 5;
  flMissingUser = 6;
  flMissingDomain = 7;
  flMissingDomainSeperator = 8;
  flMissingGeneralDomain = 9;
  flToManyAtSymbols = 10;

function ValidateEmailAddress(Email: string; var FailCode, FailPosition: Integer):
  Boolean;
function CorrectEmailAddress(Email: string; var Suggestion: string; MaxCorrections:
  Integer = 5): Boolean;
function ValidationErrorString(Code: Integer): string;

implementation
// ---------------------------ooo------------------------------ \\
// This is a list of error descriptions. It's kept in the
// implementation section as it's not needed directly
// from outside this unit, and can be accessed using the
// ValidationErrorString function, which does range checking.
// ---------------------------ooo------------------------------ \\
const
  ErrorDescriptions: array[0..10] of string = ('Unknown error occurred!',
    'Missing @ symbol!', 'Data too small!', 'User name too long!',
    'Domain name too long!', 'Invalid character!', 'Missing user name!',
    'Missing domain name!',
    'Missing domain portion (.com,.net,etc)', 'Invalid general domain!',
    'Too many @ symbols!');
  AllowedEmailChars: set of Char = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
  'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
    'l', 'm', 'n',
    'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3',
      '4', '5', '6', '7',
    '8', '9', '@', '-', '.', '_', '''', '+', '$', '/', '%'];
  MaxUsernamePortion = 64; // Per RFC 821
  MaxDomainPortion = 255; // Per RFC 2821 (RFC 821 itself allowed only 64)

function CorrectEmailAddress;
var
  CurITT, RevITT, ITT, FailCode, FailPosition, LastAt: Integer;
begin
  try
    // Reset the suggestion
    Suggestion := Email;
    CurITT := 1;
    // Now loop through to the max depth
    for ITT := CurITT to MaxCorrections do // Iterate
    begin
      // Now try to validate the address
      if ValidateEmailAddress(Suggestion, FailCode, FailPosition) then
      begin
        // The email worked so exit
        result := True;
        exit;
      end;
      // Otherwise, try to correct it
      case FailCode of //
        flUnknown:
          begin
            // This error can't be fixed
            Result := False;
            exit;
          end;
        flNoSeperator:
          begin
            // This error can possibly be fixed by finding
            // the last 2 (which was most likely transposed for an @)
            LastAt := 0;
            for RevITT := 1 to Length(Suggestion) do // Iterate
            begin
              // Look for the 2
              if Suggestion[RevITT] = '2' then
                LastAt := RevITT;
            end; // for
            // Now see if we found a 2
            if LastAt = 0 then
            begin
              // The situation can't get better so exit
              Result := False;
              exit;
            end;
            // Now convert the 2 to an @ and continue
            Suggestion[LastAt] := '@';
          end;
        flToSmall:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flUserNameToLong:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flDomainNameToLong:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flInvalidChar:
          begin
            // Simply delete the offending char
            Delete(Suggestion, FailPosition, 1);
          end;
        flMissingUser:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flMissingDomain:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flMissingDomainSeperator:
          begin
            // The best correction we can make here is to go back three spaces
            // and insert a .
            // Instead of checking the length of the string, we'll let an
            // exception shoot since at this point we can't make things any better
            // (suggestion wise)
            Insert('.', Suggestion, Length(Suggestion) - 2);
          end;
        flMissingGeneralDomain:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
        flToManyAtSymbols:
          begin
            // The situation can't get better so exit
            Result := False;
            exit;
          end;
      end; // case
    end; // for
    // If we got here fail
    Result := False;
  except
    // Just return false
    Result := false;
  end;
end;

// ---------------------------ooo------------------------------ \\
// This function will validate an address, much further than
// simply verifying the syntax as the RFC (821) requires
// ---------------------------ooo------------------------------ \\

function ValidateEmailAddress;
var
  DataLen, SepPos, Itt, DomainStrLen, UserStrLen, LastSep, SepCount, PrevSep: Integer;
  UserStr, DomainStr, SubDomain: string;
begin
  try
    // Get the data length
    DataLen := Length(Email);
    // Make sure that the string is not blank
    if DataLen = 0 then
    begin
      // Set the result and exit
      FailCode := flToSmall;
      Result := False;
      Exit;
    end;
    // First real validation, ensure the @ separator is present
    SepPos := Pos('@', Email);
    if SepPos = 0 then
    begin
      // Set the result and exit
      FailCode := flNoSeperator;
      Result := False;
      Exit;
    end;
    // Now verify that only the allowed characters are in the address
    for Itt := 1 to DataLen do // Iterate
    begin
      // Make sure the character is allowed
      if not (Email[Itt] in AllowedEmailChars) then
      begin
        // Report an invalid char error and the location
        FailCode := flInvalidChar;
        FailPosition := Itt;
        result := False;
        exit;
      end;
    end; // for
    // Now split the string into the two elements: user and domain
    UserStr := Copy(Email, 1, SepPos - 1);
    DomainStr := Copy(Email, SepPos + 1, DataLen);
    // If either the user or domain is missing then there's an error
    if (UserStr = '') then
    begin
      // Report a missing section and exit
      FailCode := flMissingUser;
      Result := False;
      exit;
    end;
    if (DomainStr = '') then
    begin
      // Report a missing section and exit
      FailCode := flMissingDomain;
      Result := False;
      exit;
    end;
    // Now get the lengths of the two portions
    DomainStrLen := Length(DomainStr);
    UserStrLen := Length(UserStr);
    // Ensure that neither side is too large (per the standard)
    if DomainStrLen > MaxDomainPortion then
    begin
      FailCode := flDomainNameToLong;
      Result := False;
      exit;
    end;
    if UserStrLen > MaxUserNamePortion then
    begin
      FailCode := flUserNameToLong;
      Result := False;
      exit;
    end;
    // Now verify the user portion of the email address
    // Ensure that the period is neither the first or last char (or the only char)
    // Check first char
    if (UserStr[1] = '.') then
    begin
      // Report an invalid character (leading period) and exit
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := 1;
      exit;
    end;
    // Check end char
    if (UserStr[UserStrLen] = '.') then
    begin
      // Report an invalid character (trailing period) and exit
      FailCode := flInvalidChar;
      Result := False;
      FailPosition := UserStrLen;
      exit;
    end;
    // No direct checking for a single char is needed since the previous two
    // checks would have detected it.
    // Ensure no consecutive periods
    for Itt := 1 to UserStrLen do // Iterate
    begin
      if UserStr[Itt] = '.' then
      begin
        // Check the next char, to make sure it's not a .
        if UserStr[Itt + 1] = '.' then
        begin
          // Report the error
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := Itt;
          exit;
        end;
      end;
    end; // for
    { At this point, we've validated the user name, and will now move into the domain.}
    // Ensure that the period is neither the first or last char (or the only char)
    // Check first char
    if (DomainStr[1] = '.') then
    begin
      // Report an invalid character (leading period) and exit
      FailCode := flInvalidChar;
      Result := False;
      // The position here needs to have the user name portion added to it
      // to get the right number, + 1 for the now missing @
      FailPosition := UserStrLen + 2;
      exit;
    end;
    // Check end char
    if (DomainStr[DomainStrLen] = '.') then
    begin
      // Report an invalid character (trailing period) and exit
      FailCode := flInvalidChar;
      Result := False;
      // The position here needs to have the user name portion added to it
      // to get the right number, + 1 for the now missing @
      FailPosition := UserStrLen + 1 + DomainStrLen;
      exit;
    end;
    // No direct checking for a single char is needed since the previous two
    // checks would have detected it.
    // Ensure no consecutive periods, and while in the loop count the periods and
    // record the last one. While checking items, also verify that the domain and
    // subdomains don't start or end with a -
    SepCount := 0;
    LastSep := 0;
    PrevSep := 1; // Start of string
    for Itt := 1 to DomainStrLen do // Iterate
    begin
      if DomainStr[Itt] = '.' then
      begin
        // Check the next char, to make sure it's not a .
        if DomainStr[Itt + 1] = '.' then
        begin
          // Report the error
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := UserStrLen + 1 + Itt;
          exit;
        end;
        // Up the count, record the last sep
        Inc(SepCount);
        LastSep := Itt;
        // Now verify this domain
        SubDomain := Copy(DomainStr, PrevSep, (LastSep) - PrevSep);
        // Make sure it doesn't start with a -
        if SubDomain[1] = '-' then
        begin
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := UserStrLen + 1 + (PrevSep);
          exit;
        end;
        // Make sure it doesn't end with a -
        if SubDomain[Length(SubDomain)] = '-' then
        begin
          FailCode := flInvalidChar;
          Result := False;
          FailPosition := (UserStrLen + 1) + LastSep - 1;
          exit;
        end;
        // Update the pointer
        PrevSep := LastSep + 1;
      end
      else
      begin
        if DomainStr[Itt] = '@' then
        begin
          // Report an error
          FailPosition := UserStrLen + 1 + Itt;
          FailCode := flToManyAtSymbols;
          result := False;
          exit;
        end;
      end;
    end; // for
    // Verify that there is at least one .
    if SepCount < 1 then
    begin
      FailCode := flMissingDomainSeperator;
      Result := False;
      exit;
    end;
    // Now do some extended work on the final, most general domain (.com)
    // Verify that the top-level segment is at least 2 chars. Start one past
    // the last period so the period itself is not counted.
    SubDomain := Copy(DomainStr, LastSep + 1, DomainStrLen);
    if Length(SubDomain) < 2 then
    begin
      FailCode := flMissingGeneralDomain;
      Result := False;
      exit;
    end;
    // Well after all that checking, we should now have a valid address
    Result := True;
  except
    Result := False;
    FailCode := flUnknown;
  end; // try/except
end;

// ---------------------------ooo------------------------------ \\
// This function returns the error string from the constant
// array, and makes sure that the error code is valid, if
// not it returns an invalid error code string.
// ---------------------------ooo------------------------------ \\

function ValidationErrorString(Code: Integer): string;
begin
  // Make sure a valid error code is passed
  if (Code < Low(ErrorDescriptions)) or (Code > High(ErrorDescriptions)) then
  begin
    Result := 'Invalid error code!';
    exit;
  end;
  // Get the error description from the constant array
  Result := ErrorDescriptions[Code];
end;
end.
