To VARCHAR or NVARCHAR? (and how it bit me using HASHBYTES)

The Difference

The key difference between the two data types is how they’re stored: VARCHAR is stored as regular 8-bit data, while NVARCHAR strings are stored in the database as UTF-16.
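
To see what that means on disk, here’s a minimal sketch (the variable names and the VARBINARY cast are just for illustration) that exposes the raw bytes of the same string stored both ways:

DECLARE @varchar varchar(10)
DECLARE @nvarchar nvarchar(10)

SET @varchar = 'FOO'
SET @nvarchar = N'FOO'

SELECT CONVERT(VARBINARY(8), @varchar)
--RESULT: 0x464F4F (one byte per character)
SELECT CONVERT(VARBINARY(8), @nvarchar)
--RESULT: 0x46004F004F00 (two bytes per character, little-endian UTF-16)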

UTF-8 (UCS Transformation Format—8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

UTF-8 has become the dominant character encoding. Over half of all internet pages are encoded this way, the Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and create mail using UTF-8, and UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

UTF-8 is [or continues to be] kind of a big deal… The others, not so much anymore.

Storage

NVARCHAR strings have the same length restriction as their VARCHAR brethren: 8,000 bytes. However, since NVARCHAR uses two bytes for each character, a given NVARCHAR can only hold 4,000 characters (not bytes) maximum. The amount of storage needed for NVARCHAR entities is going to be twice whatever you’d allocate for a plain old VARCHAR. Therefore, most don’t want to use NVARCHAR universally, and may want to fall back on VARCHAR, which takes up less space per row, whenever possible.
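
DATALENGTH makes the doubling easy to see; this is a minimal check (nothing here is specific to any table or schema), since DATALENGTH returns bytes rather than characters:

SELECT DATALENGTH('FOO')
--RESULT: 3
SELECT DATALENGTH(N'FOO')
--RESULT: 6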

How it bit me…

I have some data that’s MD5 hashed using a C# function much like this example from MSDN, which I needed to validate at my data layer (SQL Server 2008 R2). However, I was getting different values that should have matched, since I was providing the same input.

Here’s the function using the MD5 class:

static string GetMd5Hash(MD5 md5Hash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(input));

    // Create a new StringBuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}

The code is storing the returned string object in a column along with some salted details for a password. Somehow a few of the values were not syncing up with the input correctly. To test, I used the HASHBYTES function to hash the original inputs and compared the results against the stored hashed values to find the culprits. But what I found was that nothing matched! I went back to simple data to troubleshoot and found that I was getting different values for “FOO” from the C# method vs. the SQL method!

Here’s an example:

C#

MD5 md5Hash = MD5.Create();
string input = "FOO";
string results = GetMd5Hash(md5Hash, input);
// results = 901890a8e9c8cf6d5a1a542b229febff ("x2" formats lowercase hex)

SQL

DECLARE @nvarchar nvarchar(400)

SET @nvarchar = N'FOO'

SELECT CONVERT(VARCHAR(32), HASHBYTES( 'MD5', @nvarchar ), 2)
--RESULT: 4676C8342EF52153F898C2EAAD1C528D

The Problem

The HASHBYTES built-in function will take an input of varchar, nvarchar, or varbinary to be hashed. What is easy to overlook is that the encoding of the data type matters.

Even though the C# code clearly shows the input being encoded to UTF-8, I didn’t catch on, until after I pulled out a few remaining hairs, that the encoding of the variables in storage was causing my problem. SQL Server obscures the difference: it converts nvarchar and varchar on the fly and displays both as plain strings when you select on them, so the distinction is invisible when inspecting the result sets. At least that makes me “feel” like less of an idiot.
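
Here’s a minimal sketch of that invisibility (the column aliases are just illustrative): both variables render identically in the results even though their underlying bytes differ:

DECLARE @varchar varchar(10)
DECLARE @nvarchar nvarchar(10)

SET @varchar = 'FOO'
SET @nvarchar = N'FOO'

SELECT @varchar AS varchar_value, @nvarchar AS nvarchar_value
--RESULT: FOO   FOO (no visible difference)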

Solution

How your variable is stored within SQL Server dictates the output of your hash, regardless of algorithm, since the binary data presented to the function will be different per data type.

Therefore, if you’re wondering why your hashed values, under any algorithm, differ from those computed elsewhere from the same clear text, ensure the encoding of your data type is the same as the original source.
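
In my case, that meant hashing the same bytes the C# side hashed. A minimal sketch, assuming pure-ASCII input like ‘FOO’ (where the 8-bit varchar bytes and the UTF-8 bytes happen to be identical): convert the nvarchar to varchar before hashing and the C# value comes back (CONVERT style 2 prints uppercase hex while C#’s "x2" prints lowercase, so compare case-insensitively):

DECLARE @nvarchar nvarchar(400)

SET @nvarchar = N'FOO'

-- Hashing the nvarchar directly hashes its UTF-16 bytes
SELECT CONVERT(VARCHAR(32), HASHBYTES( 'MD5', @nvarchar ), 2)
--RESULT: 4676C8342EF52153F898C2EAAD1C528D

-- Converting to varchar first hashes the 8-bit bytes, matching the C# UTF-8 hash
SELECT CONVERT(VARCHAR(32), HASHBYTES( 'MD5', CONVERT(VARCHAR(400), @nvarchar) ), 2)
--RESULT: 901890A8E9C8CF6D5A1A542B229FEBFF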

For proof, play with the code below; modify the data types of both the conversion and the variables, and see how encryption and hashing are always done at the byte level.

DECLARE @varchar varchar(400)
DECLARE @nvarchar nvarchar(400)

SET @varchar = 'FOO'
SET @nvarchar = N'FOO'

SELECT CONVERT(VARCHAR(32), HASHBYTES( 'MD5', @varchar ), 2)
--RESULT: 901890A8E9C8CF6D5A1A542B229FEBFF
SELECT CONVERT(VARCHAR(32), HASHBYTES( 'MD5', @nvarchar ), 2)
--RESULT: 4676C8342EF52153F898C2EAAD1C528D
