воскресенье, 24 марта 2013 г.

C#: LINQ and IEqualityComparer

When writing programs in C# with the help of LINQ I have come across the IEqualityComparer<T> generic interface several times. The name conveys the purpose of the interface clearly, still, surprisingly, my first attempts to apply it were marked by strong confusion - the thing merely didn't work the way I wanted it to. Moreover, after going to Google and StackOverflow I have noticed that I'm not the only one to face difficulties with it. After inspecting multiple SO questions and answers related to the topic as well as some other articles I have both found the solution that fitted my needs for the moment and got some understanding of the way IEqualityComparer is actually used by LINQ operators. Here I will try to explain what I got from there.
 
We will deal with the Distinct(..) LINQ extension method, because it seems both the simplest and the most common consumer of IEqualityComparer<T> interface. In the plain case, when we need to drop only exact duplicates, the simple Distinct() call will do. The simplest possible example is getting unique integers from a collection of numbers:
 
var numbers = new int[] { 1, 2, 3, 2, 5, 5, 3 };
foreach (var n in numbers.Distinct())
{
     Console.WriteLine(n);
}

/*
The output is:
1
2
3
5
*/
However, quite frequently we, C# programmers, use classes and want to get rid of duplicated objects in a collection the same way. Suppose, we do something with actors of a movie:
class MovieActor
{
    public string LastName { get; set; }
    public string FirstName { get; set; }
    public string CharacterName { get; set; }

    public override string ToString()
    {
        return String.Format(
         "{0} \"{1}\" {2}", FirstName, CharacterName, LastName);
    }

    public static List<MovieActor> CreateSome()
    {
        return new List<MovieActor>()
        {
            new MovieActor() { 
             FirstName = "Brad", 
             LastName = "Pitt", 
             CharacterName = "Rusty"},
            new MovieActor() { 
             FirstName = "Andy", 
             LastName = "Garcia", 
             CharacterName = "Terry"},
            new MovieActor() { 
             FirstName = "George", 
             LastName = "Clooney", 
             CharacterName = "Dany"},
            new MovieActor() { 
             FirstName = "Jullia", 
             LastName = "Roberts", 
             CharacterName = "Tess"}
        };
    }
}
Good news is that this same Distinct() method works for collections of our custom objects too. So if George Clooney accidentally creeps into our collection twice that's not a problem:
var actors = MovieActor.CreateSome();
actors.Add(actors[2]);

Console.WriteLine(
 String.Format("{0} total actors.", actors.Count()));

var distinct = actors.Distinct();
Console.WriteLine(
 String.Format("\n{0} distinct actors.", distinct.Count()));

foreach (var actor in distinct)
{
    Console.WriteLine(actor);
}
Not surprisingly, the program prints out our list of actors and despite Mr. Clooney's excellency he is mentioned only once there:
/* Output:
5 total actors.

4 distinct actors.
Brad "Rusty" Pitt
Andy "Terry" Garcia
George "Dany" Clooney
Jullia "Tess" Roberts
*/
 
However, it gets a little more difficult in case there are two instances of George:
 
var actors = MovieActor.CreateSome();
actors.Add(new MovieActor() { 
 FirstName = "George", LastName = "Clooney", 
 CharacterName = "Dany"});
What Distinct() actually does here is comparing object references. So our second George Clooney is in fact an independent object (since it was created independently) and its reference differs from the reference of the first one - therefore it appears in the output twice:
/* Output:

5 total actors.

5 distinct actors.
Brad "Rusty" Pitt
Andy "Terry" Garcia
George "Dany" Clooney
Julia "Tess" Roberts
George "Dany" Clooney
*/

IEqualityComparer<T>

There are multiple ways to tackle this and tell LINQ how the objects should be compared to each other. I will cover the use of IEqualityComparer<T>. The interface contains two methods:
  • bool Equals(T x, T y) and
  • int GetHashCode(T obj).
The first time I dealt with it I have done a straightforward thing: keeping in mind the fact that I want to compare two objects to check whether they are equal or not, I have implemented Equals(..) part and also introduced very simple GetHashCode(..) version:
 
class ActorComparer : IEqualityComparer<MovieActor>
{
    public bool Equals(MovieActor x, MovieActor y)
    {
        return
            x.LastName == y.LastName &&
            x.FirstName == y.FirstName &&
            x.CharacterName == y.CharacterName;
    }

    public int GetHashCode(MovieActor obj)
    {
        return obj.GetHashCode();
    }
}
Imagine my surprise when the program spit something like this:
/* Output:

5 total actors.

5 distinct actors.
Brad "Rusty" Pitt
Andy "Terry" Garcia
George "Dany" Clooney
Julia "Tess" Roberts
George "Dany" Clooney
*/
 
For me it seemed that LINQ simply ignored my equality comparer and kept doing everything on its own. In fact, we could check this by adding a debug output to Equals method:
public bool Equals(MovieActor x, MovieActor y)
{
    Console.WriteLine(
     "Equals called on " + 
     x.ToString() + " " + 
     y.ToString());
    ...
}
And yes, after this the output of the program doesn't change at all. So, what the hell is LINQ doing and why doesn't it even try to use the equality comparer that we have provided? The situation was even bitter for me because I was trying to use Distinct() combined with custom equality comparer to prepare data for pushing to data base, so I ended up with SQL Server telling me that I am breaking a primary key constraint. Because of being focused on a broader problem I didn't take time to think over the stuff that I was trying to use. What I missed was the fact that LINQ, being a query engine, tries to perform as efficiently as possible. More importantly, I have stepped into the trap of thinking that objects, which I use in my code, can be only partially relevant to me, while in case we do OOP every object must be considered in its entirety (things like SOLID help us with that.) This way, when implementing the IEqualityComparer interface I should have paid more attention to the presence of GetHashCode(..) method. This would solve my problem at once, because it is this same method that LINQ uses when asked to extract distinct objects and perform other equality related operations (see Set operators in this article). To verify this we add a debug message to the method and observe several calls to it:
 
public int GetHashCode(MovieActor obj)
{
    Console.WriteLine(
     "Hash called on " + 
     obj.ToString() + 
     " (" + obj.GetHashCode() + ")");
    return obj.GetHashCode();
}
/* Output:

5 distinct actors.
Hash called on Brad "Rusty" Pitt (46104728)
Brad "Rusty" Pitt
Hash called on Andy "Terry" Garcia (12289376)
Andy "Terry" Garcia
Hash called on George "Dany" Clooney (43495525)
George "Dany" Clooney
Hash called on Julia "Tess" Roberts (55915408)
Julia "Tess" Roberts
Hash called on George "Dany" Clooney (33476626)
George "Dany" Clooney
*/
 
Because for reference types default GetHashCode() as well as default Equals(..) is based on references, we still observe both Georges in our printout. Although, now, when we know what exactly goes on under the hood, we can resolve the issue easily. For this purpose let's just use hash of concatenated FirstName, LastName and CharacterName:
public int GetHashCode(MovieActor obj)
{
    Console.WriteLine(
     "\tHash called on " + 
     obj.ToString() + 
     " (" + obj.GetHashCode() + ")");
    return (obj.FirstName+obj.LastName+obj.CharacterName).GetHashCode();
}

While doing this, one may ask why do we still need to implement the Equals(..) method of IEqualityComparer. The output of our refreshed program gives the answer:
5 actors total.
        Hash called on Brad "Rusty" Pitt (46104728)
Brad "Rusty" Pitt
        Hash called on Andy "Terry" Garcia (12289376)
Andy "Terry" Garcia
        Hash called on George "Dany" Clooney (43495525)
George "Dany" Clooney
        Hash called on Julia "Tess" Roberts (55915408)
Julia "Tess" Roberts
        Hash called on George "Dany" Clooney (33476626)
        Equals called on George "Dany" Clooney George "Dany" Clooney
So, first, we finally got it working - there is only one George Clooney. Second, notice the last printed line - it shows that our custom Equals method is actually called. With the output that we have it is quite easy to deduce the way Distinct() works:
  1. first, as with all deferred LINQ queries the query object is prepared (that's what the call to Distinct(..) method does)
  2. next, when we iterate over the query object it gets hash code for each item in the source collection and compares it with the hashes of preceding objects
  3. in case a particular hash value was not encountered before, the corresponding object makes its way into the result
  4. if the value was met before, LINQ performs more thorough check for equality calling the Equals(..) method of the comparer for the objects with equal hashes. If this returns true, the second object is forgotten. Otherwise, it gets into the resulting collection.
The last statement basically means that in case we rely on Equals(..) method we could calculate hash based only on the last name and the result will still be the same:
public int GetHashCode(MovieActor obj)
{
    return obj.LastName.GetHashCode();
}
Moreover, if we don't care about efficiency at all, we could make GetHashCode(..) return a constant, thus making Equals(..) entirely responsible for the results we get. Still, I hope that the output will persuade you to avoid this, because the performance impact is significant:
 
public int GetHashCode(MovieActor obj)
{
    return 0;
}
/* Output:

5 actors total.
        Hash called on Brad "Rusty" Pitt (46104728)
Brad "Rusty" Pitt
        Hash called on Andy "Terry" Garcia (12289376)
        Equals called on Brad "Rusty" Pitt vs Andy "Terry" Garcia
Andy "Terry" Garcia
        Hash called on George "Dany" Clooney (43495525)
        Equals called on Andy "Terry" Garcia vs George "Dany" Clooney
        Equals called on Brad "Rusty" Pitt vs George "Dany" Clooney
George "Dany" Clooney
        Hash called on Julia "Tess" Roberts (55915408)
        Equals called on George "Dany" Clooney vs Julia "Tess" Roberts
        Equals called on Andy "Terry" Garcia vs Julia "Tess" Roberts
        Equals called on Brad "Rusty" Pitt vs Julia "Tess" Roberts
Julia "Tess" Roberts
        Hash called on George "Dany" Clooney (33476626)
        Equals called on Julia "Tess" Roberts vs George "Dany" Clooney
        Equals called on George "Dany" Clooney vs George "Dany" Clooney
*/
So now we know how to use IEqualityComparer with LINQ and not make a mess of it. I hope this helps. However, I would like to push our ActorComparer a little further. As you might know, LINQ has an OrderBy(..) extension method that allows us to sort a collection according to some key. For this purpose a programmer must provide a function that gets the key from an object - usually this is done via lambda expression. What I would like to do is to provide a way to call the LINQ Distinct(..) method in the same manner. For this purpose we will change the ActorComparer.
 

Generalizing a little

As a result of the modification we must be able to do something like this:
var distinct = actors.Distinct(new ActorComparer(a => a.LastName));
and get a set of actors with unique last names. So we need a way to provide a key selector to our ActorComparer. This is done simply by creating a constructor that takes a function object as an argument and stores it for further use:
public ActorComparer(Func<MovieActor, object> keySelector)
{
    KeySelector = keySelector;
}
The Func<MovieActor, object> is a class standing for something that might be called with MovieActor argument and must yield a result of type object. Although generally I don't like dealing with pure object's in my code, this is a valid way to define a key selector not bounding it to some specific key type. Moreover, in this case this doesn't lead to any problems because we aren't going to use anything except Equals(..) and GetHashCode() methods on the key, so no casts are required. Now we only need to modify Equals(..) and GetHashCode(..) methods (don't confuse these with object.Equals(object o) and object.GetHashCode() ) of our comparer so that they use new KeySelector property:
public bool Equals(MovieActor x, MovieActor y)
{
    return KeySelector(x).Equals(KeySelector(y));
}

public int GetHashCode(MovieActor obj)
{
    return KeySelector(obj).GetHashCode();
}
Since our Equals(..) and GetHashCode(..) methods look very similar, a question may arise: why do we need them both? First of all, we already know that we cant get rid of GetHashCode(..) because it is what Distinct(..) uses for comparison in the first place. Okay, let's deal with Equals(..) then: do we still need to compare key values when we have already used hash codes for comparison? Absolutely yes! What makes it inevitable is the idea behind the hash codes.
 
Hash functions that are used to generate hash codes, actually do one thing: they project elements from some data set to a smaller data set (the set of hash codes). The former might be almost anything, while the latter is usually the set of integers. This transformation allows for faster comparison of elements during look-up, because the elements of the second set are easier to compare and because there are fewer of them. Still, due to this same reason any hash function might eventually produce equal codes for non equal objects - this is called hash collision. That's why when LINQ comes across two elements with equal hashes it calls Equals(..) function to check whether the elements are actually equal.
 
This said, let's return to our ActorComparer. You might suggest that to achieve the goal we need to perform some more complex modifications, but no - all we have to do is use the comparer the new way:
var distinct = actors.Distinct(new ActorComparer(a => a.LastName));
The result is the same as when using the first version of ActorComparer, although the new one is much more flexible in the sense that it may be used differently in different contexts and no further modifications are required to its code. Besides, it allows to use more than one property as a key, so the next call is absolutely valid and will preserve all actors with the same last name as long as their first names differ:
var distinct = actors.Distinct(
 new ActorComparer(a => 
 new { a.LastName, a.FirstName }));
The flexibility that this solution offers might be useful when one deals with the movie's sequel. The problem is that Julia Roberts plays two roles there: Tess Ocean and herself:
public static List<MovieActor> CreateSome()
{
    return new List<MovieActor>()
    {
        new MovieActor() { 
         FirstName = "Brad", LastName = "Pitt", 
         CharacterName = "Rusty"},
        new MovieActor() { 
         FirstName = "Andy", LastName = "Garcia", 
         CharacterName = "Terry"},
        new MovieActor() { 
         FirstName = "George", LastName = "Clooney", 
         CharacterName = "Dany"},
        new MovieActor() { 
         FirstName = "Julia", LastName = "Roberts", 
         CharacterName = "Tess"},
        new MovieActor() { 
         FirstName = "Julia", LastName = "Roberts", 
         CharacterName = "Julia Roberts"}
    };
}
Still, with the previous call we'll see her only once in the results. The simple modification of the call to Distinct(..) will solve this issue, while still showing only one copy of George Clooney:
var distinct = actors.Distinct(
 new ActorComparer(
  a => 
  new { a.LastName, a.FirstName, a.CharacterName }));
/* Output:

6 actors total.
Brad "Rusty" Pitt
Andy "Terry" Garcia
George "Dany" Clooney
Julia "Tess" Roberts
Julia "Julia Roberts" Roberts
*/

Conclusion

This is it. We have explored the interaction between LINQ extension methods and custom IEqualityComparers and even implemented one. The resulting class is both easy to use and highly customizable, because its operation is fully defined by the key selector function provided by user. Furthermore, it is very easy to make the class generic so that it can be used for collections of objects of other types - not only for MovieActors. The complete code for this example is available through github. (There is also a generic version of our comparer.)
 
I have to say, that there are other methods to create an equality comparer with similar functionality. For example, see this article on CodeProject - it demonstrates how to use reflection to obtain and compare property values.
 
Finally, if you just need to filter collection for distinct values based on some key and you want to do it quickly with as few additional actions as possible, there is a trick that doesn't require creating new types:
return actors.GroupBy(a => new { 
    a.LastName, a.FirstName, a.CharacterName }).
    Select(g => g.First());
Note that IEqualityComparer may (and should) be used to perform more complex comparisons, however its implementation won't get much more complex in most cases.
 

1 комментарий:

  1. Borgata Hotel Casino & Spa - JTR Hub
    Located https://jancasino.com/review/merit-casino/ in Atlantic City, Borgata Hotel Casino & 출장안마 Spa offers the finest in หารายได้เสริม amenities and entertainment. It ventureberg.com/ also apr casino provides a seasonal outdoor swimming

    ОтветитьУдалить