Sunday, April 21, 2013

C#: Side Effects and LINQ's Deferred Execution

It is difficult for me to imagine a program that doesn't deal with collections of some kind: our applications are full of arrays, Lists, HashSets, DataTables and dozens of others. When writing code in C#, the first tool I consider when faced with a collection is LINQ. Its capabilities, combined with the fact that LINQ queries can be applied to nearly everything, make it a very powerful tool that saves a lot of time for any .NET developer. Still, no power comes for free, and every piece of equipment must be used with some caution.
 
One of the paradigms widely used by LINQ is lazy evaluation. Many LINQ methods, although they seem to return results immediately, are actually deferred. This means that when you write something like var strings = myArray.Select(n => n.ToString()); the only thing that happens when that line executes is the creation of a query object. The real work is not started until someone asks for the results. This is a well-known fact about LINQ and generally does no harm. However, when the processing wrapped in a LINQ query is not pure in the functional sense of the word, strange and hard-to-track-down problems may occur. Below I will try to demonstrate such a case.
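
To make the deferral visible, here is a minimal sketch. The DeferredDemo class and its SelectorCalls counter are hypothetical names introduced only for this illustration; the counter is itself a side effect, used here purely to observe when the selector runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Hypothetical counter: how many times the selector has actually run.
    public static int SelectorCalls;

    public static IEnumerable<string> BuildQuery(int[] numbers)
    {
        return numbers.Select(n =>
        {
            SelectorCalls++;        // side effect, only to make execution visible
            return n.ToString();
        });
    }

    static void Main()
    {
        var query = BuildQuery(new[] { 1, 2, 3 });
        Console.WriteLine(SelectorCalls);   // 0: only the query object exists so far

        var strings = query.ToList();       // enumerating the query does the real work
        Console.WriteLine(SelectorCalls);   // 3: one selector call per element
    }
}
```

Building the query costs nothing; the selector runs only once something enumerates the result.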

First of all, let us set up a framework. We'll use a very simple class Person with only three string properties and a ToString() method:
 
class Person
{
    public string Firstname { get; set; }
    public string Lastname { get; set; }
    public string Middlename { get; set; }

    public override string ToString()
    {
        return String.IsNullOrEmpty(Middlename) ?
            Firstname + " " + Lastname :
            Firstname + " " + Middlename + " " + Lastname;
    }
}
 
Another thing that we will need is something to process collections. For this purpose we define a StatefulPersonParser class:
 
class StatefulPersonParser
{
    public List<Person> People { get; private set; }

    public Person Parse(string personString)
    {
        var parts = personString.Split();

        Person person = null;
        if (parts.Length == 2)
        {
            person = new Person() 
            { Firstname = parts[0], Middlename = "", Lastname = parts[1] };
        }
        else if (parts.Length == 3)
        {
            person = new Person() 
            { Firstname = parts[0], Middlename = parts[1], Lastname = parts[2] };
        }
        else
        {
            throw new ArgumentException("Bad person string.");
        }

        People.Add(person);
        return person;
    }

    public StatefulPersonParser()
    {
        People = new List<Person>();
    }
}
 
The sole purpose of the parser is to create Person objects from their string representations; the Parse(..) method is responsible for this. The statefulness of our parser manifests itself in the People property, a list that holds every Person created by that parser instance. With the class in place, we can use it to build some people from a collection of strings. Here are the code and the output of our simple program:
 
static void Main(string[] args)
{
    string somePeople = @"Douglas R. Hofstadter,Egbert B. Gebstadter,James Gleick";

    var parser = new StatefulPersonParser();
    var parsedPeople = somePeople.Split(',').Select(_ => parser.Parse(_));
    foreach (var p in parsedPeople)
    {
        Console.WriteLine(p.Lastname + ", " + p.Firstname + " " + p.Middlename);
    }

    Console.WriteLine(
     String.Format("A total of {0} people were parsed.", parser.People.Count));
    Console.ReadLine();
}
 
Hofstadter, Douglas R.
Gebstadter, Egbert B.
Gleick, James
A total of 3 people were parsed.
 
Everything is plain and simple. Let's break it. Suppose that for some strange reason we want the first names, last names and middle names of our virtuous people in separate collections. This is easy to achieve once we have a collection of Persons: we just apply three Select(..) calls to the source collection and get the desired IEnumerables. After that we are free to iterate over the results and do whatever we want with the separated names. For now, just writing them to the console will do:
 
var parsedPeople = somePeople.Split(',').Select(_ => parser.Parse(_));

var firstnames = parsedPeople.Select(p => p.Firstname);
var middlenames = parsedPeople.Select(p => p.Middlename);
var lastnames = parsedPeople.Select(p => p.Lastname);

foreach (var name in lastnames)
{
    Console.WriteLine(name);
}
Console.WriteLine();

foreach (var name in firstnames)
{
    Console.WriteLine(name);
}
Console.WriteLine();

foreach (var name in middlenames)
{
    Console.WriteLine(name);
}
Console.WriteLine();

Console.WriteLine(String.Format("A total of {0} people were parsed.", parser.People.Count));
 
In the output we expect to see three groups of lines. This modest expectation will be fulfilled, but there is a surprise waiting for us in the printout as well:
 
Hofstadter
Gebstadter
Gleick

Douglas
Egbert
James

R.
B.


A total of 9 people were parsed.
 
We have certainly parsed only 3 strings (the same ones as in the previous example, no people were added), yet the program says there were 9 people. So what happened? We created a deferred query object (with the first Select) and fed it to 3 more queries. The laziness of the object created by that Select means that it won't be processed until it is forced to, but this alone does not explain our results. Nor does the laziness of the three subsequent Selects explain why the query was executed three times. The issue is simpler: a deferred query does not cache its results, so each time we iterate over it, explicitly (e.g. with a foreach loop) or implicitly (for instance through other LINQ queries built on top of it), it is executed anew. Because the class used in the body of the query maintains state, that state changes on every execution: three enumerations of three strings make nine calls to Parse(..), and the output above reflects exactly that.
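
The same effect can be reproduced without the parser. The sketch below is a stripped-down illustration under my own assumptions: ReenumerationDemo and CountSelectorCalls are hypothetical names, and a local counter stands in for the People list.

```csharp
using System;
using System.Linq;

static class ReenumerationDemo
{
    // Hypothetical helper: returns how many times the selector ran after
    // two derived queries were each enumerated once.
    public static int CountSelectorCalls()
    {
        int calls = 0;
        var source = new[] { "a", "b", "c" };

        // Deferred and impure: the selector bumps a counter as a side effect.
        var query = source.Select(s => { calls++; return s.ToUpper(); });

        // Two more deferred queries built on top of the same base query.
        var lengths = query.Select(s => s.Length);
        var firstChars = query.Select(s => s[0]);

        // Each loop enumerates the base query again from scratch.
        foreach (var l in lengths) Console.WriteLine(l);
        foreach (var c in firstChars) Console.WriteLine(c);

        return calls;
    }

    static void Main()
    {
        // Three elements, two enumerations: the selector runs 6 times.
        Console.WriteLine("Selector calls: " + CountSelectorCalls());
    }
}
```

Nothing is shared between the two loops: every enumeration walks the source array and invokes the selector again.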
 
So what does this situation teach us? Definitely not to fear lazy evaluation or LINQ. In fact, the problem should be approached from the opposite side: to avoid making a mess, one should fully understand which parts of the program change the state of objects, and how and when this can affect other parts of the program. LINQ is at its heart a functional tool, and the functional programming paradigm teaches us to make side effects as rare, clear and easy to spot as possible. That said, the general lesson is to avoid side effects, and the more specific one is to avoid mixing them with deferred execution. Moreover, the very name "Language Integrated Query" suggests that executing LINQ methods should cause no side effects: that's not what queries do.
 
Returning to our code, the only thing we need to do to make it work correctly is to force the query to execute before its results are consumed elsewhere. For this we need to alter only one line of code, adding a call to one of the LINQ methods that force execution and return the results as a materialized collection. In our case ToList() will do:
 
var parsedPeople = somePeople.Split(',').Select(_ => parser.Parse(_)).ToList();
 
Hofstadter
Gebstadter
Gleick

Douglas
Egbert
James

R.
B.


A total of 3 people were parsed.
 
Now, observing the correct output, we can be sure that only the required amount of processing is done. Still, while this solution works here and will work in many other cases, the problem is deeper and has nothing to do with LINQ itself. A processor that not only processes items but also changes the state of the program should feel like a time bomb, because sooner or later its statefulness will manifest itself in the form of severe and unfathomable bugs. Since there is no way to warn programmers that they are dealing with a bomb and must handle it with caution, such things should be avoided whenever possible. The good news is that avoiding side effects is possible more often than not, and even when it is hard to achieve, the dangerous code can at least be isolated from the other parts of the program that could otherwise play the role of the detonator.
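
One possible way to defuse the bomb in our example is to make the parser stateless and let the caller own the list. The following is only a sketch under that assumption, not the code from the repository; PersonParser is a hypothetical name, and Person is trimmed to its three properties.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Person
{
    public string Firstname { get; set; }
    public string Lastname { get; set; }
    public string Middlename { get; set; }
}

static class PersonParser
{
    // Pure: the result depends only on the argument and nothing else is modified.
    public static Person Parse(string personString)
    {
        var parts = personString.Split();
        switch (parts.Length)
        {
            case 2:
                return new Person { Firstname = parts[0], Middlename = "", Lastname = parts[1] };
            case 3:
                return new Person { Firstname = parts[0], Middlename = parts[1], Lastname = parts[2] };
            default:
                throw new ArgumentException("Bad person string.");
        }
    }
}

static class Program
{
    static void Main()
    {
        string somePeople = "Douglas R. Hofstadter,Egbert B. Gebstadter,James Gleick";

        // The caller decides where the parsed people are kept.
        List<Person> people = somePeople.Split(',').Select(PersonParser.Parse).ToList();

        foreach (var name in people.Select(p => p.Lastname)) Console.WriteLine(name);
        foreach (var name in people.Select(p => p.Firstname)) Console.WriteLine(name);

        Console.WriteLine("A total of {0} people were parsed.", people.Count);   // 3
    }
}
```

With a pure Parse, enumerating a deferred query more than once would only waste some work; it could no longer corrupt any state, so ToList() here is an optimization rather than a correctness fix.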
 
The code for our simplistic example is available through GitHub.
