Friday, August 23, 2013

Syntax Modification using Microsoft Roslyn

In the previous two posts we only used the analysis capabilities of Microsoft Roslyn. In other words, we only got information about the code under inspection; we never changed it. In this post we are going to change the code through Roslyn.

Before we start, we need to highlight that the data structures that represent code under inspection in the Roslyn APIs are immutable. This means that they cannot be changed after they are created (so they can be safely shared between tools). This applies to syntax trees, compilations, symbols, semantic models, and every other data structure in the Roslyn API. Instead of modification, new objects are created based on specified differences from the old ones.
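As a small illustration of this immutability, here is a hedged sketch (it assumes the CTP-era Roslyn.Compilers.CSharp assembly used throughout these posts) showing that the "With" and "Replace" methods return new objects while the originals stay intact:

```csharp
using System;
using Roslyn.Compilers.CSharp;

class ImmutabilityDemo
{
    static void Main()
    {
        SyntaxTree tree = SyntaxTree.ParseText("using System;");
        var root = (CompilationUnitSyntax)tree.GetRoot();

        // Build a replacement directive: using System.IO;
        var oldUsing = root.Usings[0];
        var newName = Syntax.QualifiedName(Syntax.IdentifierName("System"),
                                           Syntax.IdentifierName("IO"));
        var newRoot = root.ReplaceNode(oldUsing, oldUsing.WithName(newName));

        // ReplaceNode produced a brand-new root; the original is unchanged.
        Console.WriteLine(root.ToString());
        Console.WriteLine(newRoot.ToString());
    }
}
```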

Creating and Modifying Trees

We used SyntaxTree before to create syntax trees from a code snippet. To create a syntax tree bit by bit, you compose syntax nodes hierarchically in a bottom-up fashion. To create SyntaxNodes you use the Syntax class factory methods. For each kind of node, token, or trivia there is a factory method that can be used to create an instance of that type.

In the following example we are going to replace a using directive with a new one that we create.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;
using Roslyn.Services;
using Roslyn.Services.CSharp;

namespace RoslynDemo4
{
    class Program
    {
        static void Main(string[] args)
        {
            SyntaxTree tree = SyntaxTree.ParseText(
@"using System;
using System.Collections;
using System.Linq;
using System.Text;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(""Hello, World!"");
        }
    }
}");

            var root = (CompilationUnitSyntax)tree.GetRoot();
            // Get the old using directive [using System.Collections;]
            var oldUsing = root.Usings[1];

            // Construct the new using directive
            NameSyntax name = Syntax.IdentifierName("System");
            name = Syntax.QualifiedName(name, Syntax.IdentifierName("Collections"));
            name = Syntax.QualifiedName(name, Syntax.IdentifierName("Generic"));
            var newUsing = oldUsing.WithName(name);

            // Replace the old using directive with the new one
            root = root.ReplaceNode(oldUsing, newUsing);

            SyntaxTree newTree = SyntaxTree.Create(root);
            Console.WriteLine(newTree.ToString());
            Console.ReadLine();
        }
    }
}

NameSyntax is the base class for the following names that can appear in C# code:


  • IdentifierNameSyntax which represents simple single identifier names like System and Roslyn
  • GenericNameSyntax which represents a generic type or method name such as List<int>
  • QualifiedNameSyntax which represents a qualified name of the form <left-name>.<right-identifier-or-generic-name> such as System.IO

In our example we constructed a NameSyntax representing System.Collections.Generic incrementally. Then we used the ReplaceNode method to replace the using System.Collections; directive with the using System.Collections.Generic; directive. We then constructed a new SyntaxTree from the modified root and printed it. You should see the modified HelloWorld program.


Modifying Syntax Trees using SyntaxRewriters


In a previous post we implemented our own SyntaxWalker. Now we are going to use the same technique and implement our own SyntaxRewriter. Both SyntaxWalker and SyntaxRewriter are subclasses of SyntaxVisitor, which recursively visits every node in the syntax tree starting from the passed node. The SyntaxRewriter class can be used to apply a transformation to a specific type of SyntaxNode, or a set of transformations to multiple types of SyntaxNode, wherever they appear in a syntax tree. It traverses the syntax tree depth-first.


In the following example we will implement our own SyntaxRewriter, which will replace all variable declaration types with the var type. Create a new Console application and add a new class myRewriter to it. Edit myRewriter.cs to look like:

using System;
using System.Linq;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

namespace RoslynDemo5
{
    class myRewriter : SyntaxRewriter
    {
        private readonly SemanticModel SemanticModel;

        public myRewriter(SemanticModel semanticModel)
        {
            this.SemanticModel = semanticModel;
        }

        public override SyntaxNode VisitLocalDeclarationStatement(LocalDeclarationStatementSyntax node)
        {
            // Skip declarations of multiple variables in a single statement, like:
            // Type variable1 = expression1,
            //      variable2 = expression2;
            if (node.Declaration.Variables.Count > 1)
            {
                return node;
            }

            // Skip declarations without an initializer, like:
            // Type variable;
            if (node.Declaration.Variables[0].Initializer == null)
            {
                return node;
            }

            // Get the first variable in the declaration
            VariableDeclaratorSyntax declarator = node.Declaration.Variables.First();
            // Get the type used in the declaration
            TypeSyntax variableTypeName = node.Declaration.Type;
            // Bind the type used in the declaration to a symbol
            TypeSymbol variableType = (TypeSymbol)SemanticModel.GetSymbolInfo(variableTypeName).Symbol;
            // Get the type of the whole initialization expression
            TypeInfo initializerInfo = SemanticModel.GetTypeInfo(declarator.Initializer.Value);
            if (variableType == initializerInfo.Type)
            {
                // Construct a new type name with the same leading/trailing trivia
                // as the old type, to preserve vertical whitespace and indentation
                TypeSyntax varTypeName = Syntax.IdentifierName("var")
                    .WithLeadingTrivia(variableTypeName.GetLeadingTrivia())
                    .WithTrailingTrivia(variableTypeName.GetTrailingTrivia());
                // Do the replacement and return the new node
                return node.ReplaceNode(variableTypeName, varTypeName);
            }
            else
            {
                return node;
            }
        }
    }
}


The myRewriter class simply inherits from the SyntaxRewriter class and overrides the VisitLocalDeclarationStatement() method. This method is called by the Visit method on SyntaxRewriter, which recursively visits a node and each of its children. The VisitLocalDeclarationStatement method is called on local declaration statements only. Visit takes a starting SyntaxNode; if you pass the tree root (as we will do), it visits all the local declaration statements that exist in the whole program.


Our rewriter takes the semantic model under inspection in the constructor.


In the VisitLocalDeclarationStatement method we check the declaration statement to filter out the cases that are not of interest to us. For these cases we return the node unchanged.


For the cases we are interested in, we retrieve the variable and the type used in its declaration. We then get a symbol for the type and the type information of the whole initialization expression.


Then we check that the type used in the declaration is the type used in the initialization to avoid the situations where the declaration may be casting the initializer expression to a base class or interface. Removing the explicit type in these cases would change the semantics of a program.


Then we construct a new TypeSyntax representing the var type, carrying over the leading and trailing trivia (whitespace and tabs) of the old type to preserve the indentation. Then we replace the node and return the new node.


Let’s go back to our Main method in the Program.cs file. In order to use the myRewriter class, edit the Main method to look like:

static void Main(string[] args)
{
    // Create a test Compilation of our current project
    Compilation test = CreateTestCompilation();

    // Loop over each SyntaxTree in the Compilation
    foreach (SyntaxTree sourceTree in test.SyntaxTrees)
    {
        // Get the tree's SemanticModel
        SemanticModel model = test.GetSemanticModel(sourceTree);
        // Initialize myRewriter using the tree's SemanticModel
        myRewriter rewriter = new myRewriter(model);
        // Put our rewriter into action
        SyntaxNode newSource = rewriter.Visit(sourceTree.GetRoot());

        if (newSource != sourceTree.GetRoot())
            Console.WriteLine(newSource.GetText().ToString());
        Console.ReadLine();
    }
}

Here we create a test Compilation using the CreateTestCompilation method, which we will add soon. We then loop over each SyntaxTree in this test Compilation. For each SyntaxTree we get the SemanticModel and pass it to our myRewriter constructor. Then we call the Visit() method on the SyntaxTree root. This causes our VisitLocalDeclarationStatement method to be called and do its syntax transformation task. If the new root is different from the old root, we made syntax modifications, and we print the new code.


Now add the following method to your code inside Program.cs.

private static Compilation CreateTestCompilation()
{
    SyntaxTree programTree = SyntaxTree.ParseFile(@"..\..\Program.cs");
    SyntaxTree rewriterTree = SyntaxTree.ParseFile(@"..\..\myRewriter.cs");

    List<SyntaxTree> sourceTrees = new List<SyntaxTree>();
    sourceTrees.Add(programTree);
    sourceTrees.Add(rewriterTree);

    MetadataReference mscorlib = MetadataReference.CreateAssemblyReference("mscorlib");
    MetadataReference roslynCompilers = MetadataReference.CreateAssemblyReference("Roslyn.Compilers");
    MetadataReference csCompiler = MetadataReference.CreateAssemblyReference("Roslyn.Compilers.CSharp");

    List<MetadataReference> references = new List<MetadataReference>();
    references.Add(mscorlib);
    references.Add(roslynCompilers);
    references.Add(csCompiler);

    return Compilation.Create("TransformationCS", null, sourceTrees, references);
}

This method creates two SyntaxTrees from our project's code files, then builds the list of required references for these source files. It then passes it all to the factory method Compilation.Create() and returns the resulting Compilation object.


When you run the project, you will see a modified version of the Program.cs file. Notice that our myRewriter class replaced the explicit type in all qualifying variable declarations with var.




Pressing Enter will bring up the modified version of the myRewriter.cs file, which has the same kind of changes.





We have used the Roslyn Compiler APIs to write our own SyntaxRewriter that searches all files in a C# project for a certain syntactic pattern, analyzes the semantics of source code that matches that pattern, and transforms it.


Now you can grab a paper and pencil, and start designing the next big thing in the refactoring-tools industry :)

Thursday, August 22, 2013

Semantic Analysis using Microsoft Roslyn

In a previous post we talked about using the Microsoft Roslyn Syntax API to deal with source text in terms of SyntaxTrees and SyntaxNodes. But as we all know, a single source file or code snippet can't make a useful program. 99% of the time we end up with many source files that depend on many externals: assembly references, namespace imports, or other code files. The meaning (semantics) of SyntaxNodes depends heavily on these externals and may change due to changes in them, even if the enclosing source file has not changed.

The Compilation class helps us deal with source text in the context of its dependencies and externals. A Compilation is analogous to a single project as seen by the compiler and represents everything needed to compile a Visual Basic or C# program, such as assembly references, compiler options, and the set of source files to be compiled. With this context you can reason about the meaning of code. Compilations allow you to find Symbols, entities such as types, namespaces, members, and variables that names and other expressions refer to. The process of associating names and expressions with Symbols is called Binding.

Creating a Compilation

In the following example we create a SyntaxTree for our traditional HelloWorld program. Then we create a Compilation from this SyntaxTree and add a reference to the MS Core Library (mscorlib).

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;
using Roslyn.Services;
using Roslyn.Services.CSharp;

namespace RoslynDemo3
{
    class Program
    {
        static void Main(string[] args)
        {
            SyntaxTree tree = SyntaxTree.ParseText(
@"using System;
using System.Collections.Generic;
using System.Text;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(""Hello, World!"");
        }
    }
}");

            var root = (CompilationUnitSyntax)tree.GetRoot();

            Compilation compilation = Compilation.Create("HelloWorld")
                .AddReferences(MetadataReference.CreateAssemblyReference("mscorlib"))
                .AddSyntaxTrees(tree);

            foreach (MetadataReference reference in compilation.ExternalReferences)
                Console.WriteLine(reference.Display);

            Console.ReadLine();
        }
    }
}

You can supply all the syntax trees, assembly references, and options in one call, or you can spread them out over multiple calls. To add the reference we just used the metadata name, which is the same name you find when you add a reference through Visual Studio's Reference Manager.


Now when you run your program, it should print the display name of each reference in the compilation (here, mscorlib).


The SemanticModel


Once we have a Compilation we can get a SemanticModel for any SyntaxTree in that Compilation. SemanticModels can be queried to answer questions like “What names are in scope at this location?” “What members are accessible from this method?” “What variables are used in this block of text?” and “What does this name/expression refer to?”


In the following example we are going to modify our example a little. We will get the semantic model of our HelloWorld program, then use this model to get the semantic symbol that the first using directive refers to (of type NamespaceSymbol). Then we use that symbol to list the namespaces nested inside the System namespace.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;
using Roslyn.Services;
using Roslyn.Services.CSharp;

namespace RoslynDemo3
{
    class Program
    {
        static void Main(string[] args)
        {
            SyntaxTree tree = SyntaxTree.ParseText(
@"using System;
using System.Collections.Generic;
using System.Text;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(""Hello, World!"");
        }
    }
}");

            var root = (CompilationUnitSyntax)tree.GetRoot();

            Compilation compilation = Compilation.Create("HelloWorld")
                .AddReferences(MetadataReference.CreateAssemblyReference("mscorlib"))
                .AddSyntaxTrees(tree);

            var model = compilation.GetSemanticModel(tree);
            var nameInfo = model.GetSymbolInfo(root.Usings[0].Name);
            var systemSymbol = (NamespaceSymbol)nameInfo.Symbol;

            foreach (var ns in systemSymbol.GetNamespaceMembers())
                Console.WriteLine(ns.Name);
            Console.ReadLine();
        }
    }
}

When you run the above code, it prints the names of the namespaces nested inside the System namespace.




You can also use the query methods shown in the previous post to retrieve a certain node and then use the semantic model to get more information about it. The code below gets the node that represents the "Hello, World!" string in our code snippet, retrieves information about its type (System.String), and even gets information about the assembly that contains this type.

var helloWorldString = root.DescendantNodes()
    .OfType<LiteralExpressionSyntax>()
    .First();
var literalInfo = model.GetTypeInfo(helloWorldString);
Console.WriteLine(literalInfo.Type.ContainingAssembly.BaseName);

Or enumerate the public methods of the System.String class that return a string:

var stringTypeSymbol = (NamedTypeSymbol)literalInfo.Type;

Console.Clear();
foreach (var name in (from method in stringTypeSymbol.GetMembers()
                          .OfType<MethodSymbol>()
                      where method.ReturnType == stringTypeSymbol &&
                            method.DeclaredAccessibility == Accessibility.Public
                      select method.Name).Distinct())
{
    Console.WriteLine(name);
}

We have only scratched the surface of semantic analysis in this post. Using both syntax and semantic analysis we can do more advanced and meaningful code-focused tasks (which we will do in the next post).

Tuesday, August 20, 2013

Syntax Analysis using Microsoft Roslyn

In a previous post we talked about Microsoft Roslyn. In this post we will get our hands dirty with Roslyn syntax analysis in order to develop our first code-focused program. We will see what the SyntaxTree of a HelloWorld program looks like and how we can traverse and query it.

The Syntax API exposes the syntax trees the compilers use to understand Visual Basic and C# programs. They are produced by the same parser that runs when a project is built or a developer hits F5. The four primary building blocks of syntax trees are:

  • The SyntaxTree class, an instance of which represents an entire parse tree.
  • The SyntaxNode class, instances of which represent syntactic constructs such as declarations, statements, clauses, and expressions.
  • The SyntaxToken structure, which represents an individual keyword, identifier, operator, or punctuation.
  • And lastly the SyntaxTrivia structure, which represents syntactically insignificant bits of information such as the whitespace between tokens, preprocessor directives, and comments.

Traversing Syntax Trees

First, download the Microsoft Roslyn CTP from here and install it. You will need Visual Studio 2012 and the Visual Studio 2012 SDK in order to install Roslyn.

Now open Visual Studio 2012 and create a new Console Application project, and add references to the Roslyn assemblies to your project.


Now edit your code to look like:

using System;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

namespace RoslynDemo1
{
    class Program
    {
        static void Main(string[] args)
        {
            SyntaxTree tree = SyntaxTree.ParseText(
@"using System;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(""Hello, World!"");
        }
    }
}");

            foreach (SyntaxNode node in tree.GetRoot().ChildNodes())
                PrintSyntaxTree(node);

            Console.ReadLine();
        }

        static void PrintSyntaxTree(SyntaxNode node)
        {
            if (node != null)
            {
                Console.WriteLine(node.Kind + " | " + node.ToString());
                foreach (SyntaxNode child in node.ChildNodes())
                    PrintSyntaxTree(child);
            }
        }
    }
}

Now, run the program. You should see the kind and text of each node in the tree printed, one line per node.




Now let's explain what we just did. First, we included the needed namespaces. The common Syntax APIs are found in the Roslyn.Compilers and Roslyn.Compilers.Common namespaces, while the language-specific Syntax APIs are found in Roslyn.Compilers.CSharp and Roslyn.Compilers.VisualBasic.


Second, we obtained a syntax tree corresponding to our HelloWorld program using the SyntaxTree.ParseText function, which produces a syntax tree by parsing the passed source code.


Third, in order to traverse the syntax tree we get its root and then its child nodes, and loop over these child nodes and their children recursively to print each node's kind and string representation. A C# node's kind is one of the values of the enum Roslyn.Compilers.CSharp.SyntaxKind.


Inside the PrintSyntaxTree method, by changing the order of the node printing and the recursive call, you can apply your own tree traversal logic. You can also inspect specific elements of the tree:

var root = (CompilationUnitSyntax)tree.GetRoot();
var helloWorldDeclaration = (NamespaceDeclarationSyntax)root.Members[0];
var programDeclaration = (TypeDeclarationSyntax)helloWorldDeclaration.Members[0];
var mainDeclaration = (MethodDeclarationSyntax)programDeclaration.Members[0];
var argsParameter = mainDeclaration.ParameterList.Parameters[0];
Console.WriteLine(argsParameter.ToString());

Query Methods


In addition to traversing syntax trees, you can also explore them using the query methods defined on SyntaxNode. These methods are similar to XPath, and you can combine them with LINQ to quickly find things in a tree. For example, the following statement uses a LINQ expression and the DescendantNodes method to locate the same parameter as in the previous example:

var firstParameters = from methodDeclaration in root.DescendantNodes()
                          .OfType<MethodDeclarationSyntax>()
                      where methodDeclaration.Identifier.ValueText == "Main"
                      select methodDeclaration.ParameterList.Parameters.First();

var argsParameter2 = firstParameters.Single();
Console.WriteLine(argsParameter2.ToString());

SyntaxWalkers


If you want to find all nodes of a specific type in a syntax tree (for example, every property declaration in a file), you can do that by extending the SyntaxWalker class. By overriding the VisitPropertyDeclaration method, you can process every property declaration in a syntax tree without knowing its structure beforehand.


Now let's implement, as an example, a SyntaxWalker that examines the entire syntax tree and collects any using directives it finds which aren't importing a System namespace.


Create a new C# Console application and edit it to look like:

using System;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

namespace RoslynDemo2
{
    class Program
    {
        static void Main(string[] args)
        {
            SyntaxTree tree = SyntaxTree.ParseText(
@"using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

namespace TopLevel
{
    using Microsoft;
    using System.ComponentModel;

    namespace Child1
    {
        using Microsoft.Win32;
        using System.Runtime.InteropServices;

        class Foo { }
    }

    namespace Child2
    {
        using System.CodeDom;
        using Microsoft.CSharp;

        class Bar { }
    }
}");

            var root = (CompilationUnitSyntax)tree.GetRoot();
        }
    }
}

As you noticed, this source text contains using directives scattered across four different locations: at the file level, in the top-level namespace, and in the two nested namespaces.


Now, add a new class to your project and edit it to look like the next code. Basically we inherit from the SyntaxWalker class and override the VisitUsingDirective method. This method is called by the Visit method, which recursively visits a node and each of its children. The VisitUsingDirective method is called for using directives only; inside it we collect the using directives whose names are neither "System" nor start with "System.".

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

namespace RoslynDemo2
{
    class myWalker : SyntaxWalker
    {
        public readonly List<UsingDirectiveSyntax> Usings = new List<UsingDirectiveSyntax>();

        public override void VisitUsingDirective(UsingDirectiveSyntax node)
        {
            // Collect using directives that do not import a System namespace
            if (node.Name.GetText().ToString() != "System" &&
                !node.Name.GetText().ToString().StartsWith("System."))
            {
                this.Usings.Add(node);
            }
        }
    }
}

Now let's go back to the Main method and use our walker by adding a short code snippet to the end of the Main method. Basically it passes the syntax tree root to the myWalker.Visit method, which internally calls the VisitUsingDirective method. Then we loop over the collected usings and print them.
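The original post does not show the snippet itself, so here is a minimal sketch of what it could look like (it assumes the myWalker class defined above and the root variable already obtained in Main):

```csharp
// Walk the whole tree and collect the non-System using directives
var walker = new myWalker();
walker.Visit(root);

// Print each collected using directive
foreach (var directive in walker.Usings)
    Console.WriteLine(directive.ToString());
```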


When you run your program, you should see the collected using directives printed.




In this post we gave a quick example of obtaining a SyntaxTree of a HelloWorld program and explored its nodes. We also built a SyntaxWalker that walks the syntax tree and collects only the nodes of interest. Later we will see how to fully manipulate SyntaxNodes and SyntaxTrees.

Monday, August 19, 2013

Overview of Microsoft Roslyn – The first compiler-as-a-service product.

Introduction

From a developer's perspective, compilers are black boxes: source code goes in one end, magic happens in the middle, and object files or assemblies come out the other end. In the course of doing their job, compilers build up a deep understanding of our code. For decades this valuable information was unreachable.

The Roslyn project aims to open up compilers and offer their functionality as services. Through Roslyn, compilers become services with APIs that can be used for code-related tasks in your tools and applications.

In order to understand the APIs that Roslyn offers, we need to understand the general compiler phases:

  1. The parse phase where source is tokenized and parsed into syntax that follows the language grammar.
  2. The declaration phase where declarations from source and imported metadata are analyzed to form named symbols.
  3. The binding phase where identifiers in the code are matched to symbols.
  4. The emitting phase where all the information built up by the compiler is emitted as an assembly.

For each of these phases, Roslyn provides a set of APIs to access its services and an object model that allows access to the information at that phase.

  1. The parsing phase is exposed as a syntax tree
  2. The declaration phase as a hierarchical symbol table
  3. The binding phase as a model that exposes the result of the compiler’s semantic analysis
  4. The emit phase as an API that produces IL bytecode.


Roslyn API Layers


  1. Compiler APIs contain a representation of a single invocation of a compiler (including assembly references, compiler options, and source code files) and provide the object models that correspond to the compiler phases.
    1. Scripting APIs represent a runtime execution context for a piece of C# or VB code. They contain a scripting engine that allows evaluation of expressions and statements as top-level constructs in programs.
  2. Services APIs represent the starting point for doing code analysis and refactoring over entire solutions. They organize all the information about the projects in a solution into a single object model, offering direct access to the compiler-layer object models without needing to parse files, configure options, or manage project-to-project dependencies. In addition, the services layer surfaces a set of commonly used APIs for implementing code analysis and refactoring tools that function within a host environment like the Visual Studio IDE.
  3. Editor Services APIs allow a user to easily extend Visual Studio IDE features (such as IntelliSense, refactorings, and code formatting). This layer has a dependency on the Visual Studio text editor and the Visual Studio software development kit (SDK).

Working with Syntax

The fundamental data structure exposed by the Compiler APIs is the syntax tree. It represents the lexical and syntactic structure of source code; no part of the source code is understood without first being identified in the tree. These trees enable tools to manipulate source code without editing it as text.

Syntax Trees Attributes

Syntax trees have three key attributes:

  1. Full fidelity which means that the syntax tree contains every piece of information found in the source text, every grammatical construct, every lexical token, and everything else in between including whitespace, comments, and preprocessor directives.
  2. Round-trippable, which means that the tree can be traced back to the text it was parsed from, and the text of the sub-tree rooted at any node can be recovered. This means that syntax trees can be used as a way to construct and edit source text. By creating a tree you have by implication created the equivalent source code, and by editing a syntax tree (making a new tree out of changes to an existing tree) you have effectively edited the source code.
  3. Immutable and thread-safe which means that after a tree is obtained, it is a snapshot of the current state of the code, and never changes. This allows multiple users to interact with the same syntax tree at the same time in different threads without locking or duplication.

Syntax Tree Elements

Each syntax tree is made up of nodes, tokens, and trivia.

  • Syntax Nodes: the primary elements of syntax trees. They represent syntactic constructs such as declarations, statements, clauses, and expressions. They are non-terminal nodes in the syntax tree, which means they always have other nodes and tokens as children. Each category of syntax nodes is represented by a separate class derived from SyntaxNode.
    • The Parent property returns the node's parent.
    • ChildNodes() returns a list of child nodes in sequential order based on their position in the source text. This list does not contain tokens.
    • DescendantNodes(), DescendantTokens(), and DescendantTrivia() return the descendant nodes, tokens, or trivia, respectively, in sequential order based on their position in the source text.
    • Each syntax node subclass exposes all the same children through strongly typed properties.
      • BinaryExpressionSyntax node class
        • Property Left/Right of type ExpressionSyntax
        • Property OperatorToken of type SyntaxToken
      • IfStatementSyntax node class
        • Optional property ElseClauseSyntax which returns null if the child is not present.
  • Syntax Tokens are the terminals of the language grammar, representing the smallest syntactic fragments of the code. They are never parents of other nodes or tokens. Syntax tokens consist of keywords, identifiers, literals, and punctuation.
  • Syntax Trivia represent the parts of the source text that are largely insignificant for normal understanding of the code, such as whitespace, comments, and preprocessor directives. Because trivia are not part of the normal language syntax and can appear anywhere between any two tokens, they are not included in the syntax tree as children of a node. You can access them through a token's LeadingTrivia or TrailingTrivia collections.
  • Spans So that each node knows its position within the source text, each node carries text spans. A TextSpan object is a beginning position and a count of characters, both represented as integers. Each node has two TextSpan properties:
    • The Span property is the text span from the start of the first token in the node’s sub-tree to the end of the last token. This span does not include any leading or trailing trivia.
    • The FullSpan property is the text span that includes the node’s normal span, plus the span of any leading or trailing trivia.
  • Kinds Each node, token, or trivia has a Kind property, of type SyntaxKind, that identifies the exact syntax element represented. The Kind property allows for easy disambiguation of syntax node types that share the same node class.
  • Errors When the parser encounters code that does not conform to the defined syntax of the language, it uses one of two techniques to create a syntax tree.
    • If the parser expects a particular kind of token, but does not find it, it may insert a missing token into the syntax tree in the location that the token was expected. A missing token represents the actual token that was expected, but it has an empty span, and its IsMissing property returns true.
    • The parser may skip tokens until it finds one where it can continue parsing. In this case, the skipped tokens are attached as a trivia node with the kind SkippedTokens.
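To make the Kind, Span, and FullSpan ideas concrete, here is a hedged sketch against the CTP-era API used in the earlier posts of this series:

```csharp
using System;
using Roslyn.Compilers.CSharp;

class SpanDemo
{
    static void Main()
    {
        // The leading whitespace and trailing comment become trivia on the class node
        SyntaxTree tree = SyntaxTree.ParseText("  class C { } // comment");
        var root = (CompilationUnitSyntax)tree.GetRoot();
        var classNode = root.Members[0];

        Console.WriteLine(classNode.Kind);     // the SyntaxKind of this node
        Console.WriteLine(classNode.Span);     // span without leading/trailing trivia
        Console.WriteLine(classNode.FullSpan); // span including the trivia
    }
}
```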

In this post we explored Microsoft Roslyn, an interesting product that offers the compiler as a service and opens a lot of doors for developers to build code-focused tools and applications. In later posts we will gradually show Roslyn in action.

Tuesday, August 13, 2013

How to debug your Pex parameterized unit tests

We talked before about Microsoft Pex and how it chooses the inputs it shows. We then talked more about the building blocks of parameterized unit tests and their patterns.

What we have missed so far is how to debug parameterized unit tests. To do so, follow these steps:

  • Go to Tools >> Options >> Pex >> Diagnostic.
  • Change the BreakOnStart option to True.


  • Now set a break point at any line of code inside your parameterized unit test.
  • Right click inside your parameterized unit test and select Run Pex Explorations.
  • You will see a dialog pop up asking you to attach a debugger.
  • Select your current Visual Studio instance.

(Screenshot: the Visual Studio Just-In-Time Debugger dialog)

  • Once attached, press F5.
  • Now the Visual Studio debugger will jump to your break point in the parameterized unit test and stop there. You can now step through your parameterized unit test code.

You will notice that the parameter values are those selected by Pex. The debugger will run the parameterized unit test for each input value chosen by Pex, so you can inspect the PUT behavior against a specific input value.

Thursday, August 1, 2013

Convert Microsoft Pex path conditions from Z3 native format to SMT-LIB

We talked before about getting the code path conditions from Microsoft Pex in Z3 native format (.z3 files).

Sometimes you may need to convert from Z3 native format to the SMT-LIB standard. You can do that using the Z3 C# APIs. The only trick here is that you have to use the Microsoft.Z3.dll that comes with your current Microsoft Pex installation (C:\Program Files\Microsoft Pex\bin\Microsoft.Z3.dll).

To show how it works, let's create a new C# console application. Add a reference to the Microsoft.Z3.dll that comes with your Pex installation.

Edit your code to look like:

using System;
using System.IO;
using Microsoft.Z3;

namespace Z3ToSMTLIB
{
class Program
{
static void Main(string[] args)
{
if (args.Length == 0)
{
Console.WriteLine("Please provide input file name.");
return;
}
if (!File.Exists(args[0]))
{
Console.WriteLine("The specified file doesn't exist.");
return;
}
using (Context ctx = new Context(new Config()))
{
// Parse the Z3 native file and write out its SMT-LIB representation
string SMTLIB = ctx.ParseZ3File(args[0]).ToString();
using (StreamWriter writer = new StreamWriter(Path.ChangeExtension(args[0], "SMTLIB")))
{
writer.Write(SMTLIB);
}
}
}
}
}

Right click your project in the solution explorer and select Add >> New Item >> select Application Configuration File >> click Add. Edit the newly created App.config file as shown below.

(Screenshot: the App.config contents in Visual Studio)

Now you can run your code, pass it a file in Z3 native format, and get a file in SMT-LIB format. This SMT-LIB format opens a lot of possibilities that we will explore later.

Tuesday, July 30, 2013

The Z3 Constraint Solver, a developer perspective

Z3 is a high-performance SMT solver developed at Microsoft Research. It has been integrated into many tools from Microsoft for program analysis, testing, and verification.

What does SAT mean?

SAT refers to Boolean satisfiability problem where we want to determine if there exists an interpretation that satisfies a given Boolean formula. In other words, it establishes if the variables of a given Boolean formula can be assigned in such a way as to make the formula evaluate to true.
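
To make this concrete, here is a plain JavaScript sketch (the isSatisfiable helper is just for illustration) that decides a tiny SAT instance by trying every assignment:

```javascript
// Brute-force SAT check: try every true/false assignment of the
// variables and see whether any of them makes the formula true.
function isSatisfiable(numVars, formula) {
  for (var mask = 0; mask < (1 << numVars); mask++) {
    var assignment = [];
    for (var i = 0; i < numVars; i++) {
      assignment.push((mask & (1 << i)) !== 0);
    }
    if (formula(assignment)) {
      return assignment; // a satisfying assignment (a "model")
    }
  }
  return null; // no assignment works: unsatisfiable
}

// (a OR b) AND (NOT a OR b) -- satisfiable whenever b is true.
var sat = isSatisfiable(2, function (v) {
  var a = v[0], b = v[1];
  return (a || b) && (!a || b);
});

// a AND NOT a -- never true, so unsatisfiable.
var unsat = isSatisfiable(1, function (v) {
  return v[0] && !v[0];
});
```

Real SAT solvers avoid this exponential enumeration with techniques such as conflict-driven clause learning, but the decision problem they answer is exactly this one.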

What does SMT mean?

SMT stands for Satisfiability Modulo Theories. An SMT instance is a formula in first-order logic, where some functions and predicates have additional interpretations. The SMT problem is the decision problem of determining whether such a formula is satisfiable.

An SMT instance is a generalization of a Boolean SAT instance in which various sets of variables are replaced by predicates over a suitable set of non-binary variables.

These predicates are classified according to the theory they belong to. For instance, linear inequalities over real variables are evaluated using the rules of the theory of linear real arithmetic.

What is an SMT solver?

The goal of an SMT solver is to determine whether an SMT instance can evaluate to true or not. The same goal applies, analogously, to SAT solvers.

SMT solvers

There are many SMT solvers available, but only one SMT solver with C# APIs: Z3. For a list of available SMT solvers, refer to this page.

Download Z3

You can download Z3 from http://z3.codeplex.com/ . I downloaded Z3 4.3.0 and extracted it to C:\z3.

C# Example

In the following example we are going to let Z3 solve the following equation system:

  • x > 0
  • y = x + 1 
  • y < 3

Solving the equations means finding values for x and y that make the whole formula evaluate to true.
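
As a sanity check before handing the system to Z3, a brute-force search in plain JavaScript (a throwaway sketch, unrelated to the Z3 API) shows that x = 1, y = 2 is the only integer solution in a small range:

```javascript
// Enumerate small integer candidates and keep the pairs that satisfy
// x > 0, y = x + 1 and y < 3 simultaneously.
var solutions = [];
for (var x = -5; x <= 5; x++) {
  for (var y = -5; y <= 5; y++) {
    if (x > 0 && y === x + 1 && y < 3) {
      solutions.push({ x: x, y: y });
    }
  }
}
// solutions holds the single model { x: 1, y: 2 }, which is what
// we expect Z3 to report below.
```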

  • Let’s create a new console application project in Visual Studio.
  • Add reference to Microsoft.Z3.dll which is located in the bin directory of the Z3 installation directory.
  • Copy the file libz3.dll from the bin directory of the Z3 installation directory to your project build directory.
  • Now edit your code to look like the following:
using System;
using Microsoft.Z3;

namespace Z3Demo1
{
class Program
{
static void Main(string[] args)
{
using (Context ctx = new Context())
{
Expr x = ctx.MkConst("x", ctx.MkIntSort());
Expr y = ctx.MkConst("y", ctx.MkIntSort());
Expr zero = ctx.MkNumeral(0, ctx.MkIntSort());
Expr one = ctx.MkNumeral(1, ctx.MkIntSort());
Expr three = ctx.MkNumeral(3, ctx.MkIntSort());

Solver s = ctx.MkSolver();
s.Assert(ctx.MkAnd(ctx.MkGt((ArithExpr)x, (ArithExpr)zero), ctx.MkEq((ArithExpr)y,
ctx.MkAdd((ArithExpr)x, (ArithExpr)one)), ctx.MkLt((ArithExpr)y, (ArithExpr)three)));
Console.WriteLine(s.Check());

Model m = s.Model;
foreach (FuncDecl d in m.Decls)
Console.WriteLine(d.Name + " -> " + m.ConstInterp(d));

Console.ReadLine();
}
}
}
}


Now let's run the code above and see the output. The solver says the equation system is satisfiable and then gives us the x and y values that satisfy it.



(Screenshot: console output showing the satisfiability result and the model values for x and y)
How it works?
To interact with Z3 through C#, you need a Context object. Variables and numerals in your equations are modeled as Expr objects. You get these objects using member functions of the Context object (MkConst(), MkNumeral(), …). You construct your operations using member functions of the Context object (MkGt(), MkAdd(), MkLt(), …). To solve all the equations together, you need to hook them up with the AND operator, which is implemented by Context.MkAnd(). After hooking everything into one AND, you pass the result to the solver through Solver.Assert(). And as you may have guessed, you obtain this Solver using Context.MkSolver().

Solver.Check() will tell you whether this equation system can be solved or not. To get the variable assignments the solver came up with, get a Model object from Solver.Model. Then use the Decls collection to get all symbols that have an interpretation in the model. Model.ConstInterp() returns the value assigned to a symbol.

In this post we briefly introduced SAT, SMT, and their solvers. Then we explored the only SMT solver with a C# API. Now you can play with as many equations as you want, check them for satisfiability, and even get the solution values.

Wednesday, July 24, 2013

Get path conditions from Microsoft Pex

We talked in a previous post about Microsoft Pex and how it chooses values to explore your code by symbolically executing it and covering all its paths. A code path is represented by all the conditions that have to be satisfied in order to make the code execution follow that path. Your code's conditional statements lead to different paths for different input values. Each test case generated by Pex represents the code behavior for specific input values that lead to a specific code path.

Within the parameterized unit test, you can use the method GetPathConditionString of the PexSymbolicValue class to obtain a textual representation of the current path condition: a predicate that characterizes an execution path (below, the column with title Condition).

(Screenshot: Pex exploration results, including the Condition column)

Methods ToRawString and GetRawPathConditionString of the PexSymbolicValue class return expressions representing symbolic values and the path condition, formatted as S-expressions.

As we know, Pex uses the Z3 SMT solver behind the scenes. If you want to get the path conditions before being passed to Z3, do the following:

  1. Create an environment variable named z3_logdir with value c:\z3_logdir (or any other directory you want).
  2. Create an environment variable named z3_loglevel with value constraintsandmodel.

Now, Pex will create *.z3 files in subdirectories in the z3_logdir you specified in step 1. These files are in Z3 native format and can be later passed to Z3. For more information about this format, refer to Z3 online documentation.

In future posts we will talk more about the Z3 native format and what we can do with it.

Tuesday, July 23, 2013

Bring “Sign in as Different User” back to SharePoint 2013

If you are new to SharePoint 2013 like me and got confused looking for the "Sign in as a Different User" menu command, don't worry: it is missing by design, but we can easily bring it back.

Do the following steps to get the link back:

  • Locate the file \15\TEMPLATE\CONTROLTEMPLATES\Welcome.ascx and open it for editing.
  • Add the following snippet before the existing element that has the id "ID_RequestAccess".

<SharePoint:MenuItemTemplate runat="server" ID="ID_LoginAsDifferentUser"
Text="<%$Resources:wss,personalactions_loginasdifferentuser%>"
Description="<%$Resources:wss,personalactions_loginasdifferentuserdescription%>"
MenuGroupId="100"
Sequence="100"
UseShortId="true"
/>

Save and close the file. Now you should get the link “Sign in as Different User” back.

Wednesday, July 17, 2013

Create a new Web Part Page with Quick Launch menu

This post is about a simple task we do daily in SharePoint 2010: creating a new Web Part page. Every time we need a new Web Part page, we go to All Site Content > Create > Web Part Page > Create. Now we have a Web Part page, but no quick launch menu on the left.

To show the quick launch menu on the newly created page:

  • Open the page for editing in SharePoint 2010.
  • Switch to the Code view.
  • Click on the Advanced Mode button on the ribbon.
  • Now look for the following code snippet and delete it.

<SharePoint:UIVersionedContent ID="WebPartPageHideQLStyles" UIVersion="4" runat="server">
<ContentTemplate>
<style type="text/css">
body #s4-leftpanel {
display: none;
}
.s4-ca {
margin-left: 0px;
}
</style>
</ContentTemplate>
</SharePoint:UIVersionedContent>

  • Now look for the following snippet and delete it.

<asp:Content ContentPlaceHolderId="PlaceHolderLeftNavBar" runat="server">
</asp:Content>

  • Save and Close

Now when you open the page, the quick launch navigation will be there.

Friday, July 12, 2013

Getting Started with MongoDB – Part 2

In the previous post we explored the basics of MongoDB. In this post we are going to dig deeper into MongoDB.

Indexing

Whenever a new collection is created, MongoDB automatically creates an index on the _id field. These indexes can be found in the system.indexes collection. You can show all indexes in the database using db.system.indexes.find(). Most queries will include more fields than just the _id, so we need to make indexes on those fields.

Before creating more indexes, let’s see what is the performance of a sample query without creating any indexes other than the automatically created one for _id. Create the following function to generate random phone numbers.

populatePhones = function(area, start, stop) {
for(var i = start; i < stop; i++) {
var country = 1 + ((Math.random() * 8) << 0);
var num = (country * 1e10) + (area * 1e7) + i;
db.phones.insert({
_id: num,
components: {
country: country,
area: area,
prefix: (i * 1e-4) << 0,
number: i
},
display: "+" + country + " " + area + "-" + i
});
}
}

Run the function with a three-digit area code (like 800) and a range of seven-digit numbers (5,550,000 to 5,650,000):

populatePhones( 800, 5550000, 5650000 )

Now we expect to see a new index created for our new collection.

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "newdb.towns", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "newdb.countries", "name" : "_id_" }
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "newdb.phones", "name" : "_id_" }

Now let's check the query without an index. The explain() method outputs details of a given operation and can help us here.

> db.phones.find( { display : "+1 800-5650001" } ).explain()
{
        "cursor" : "BasicCursor",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 100000,
        "nscanned" : 100000,
        "nscannedObjectsAllPlans" : 100000,
        "nscannedAllPlans" : 100000,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 134,
        "indexBounds" : {

        },
        "server" : "ESOLIMAN:27017"
}

Just to keep things simple, we will look at the millis field only, which gives the milliseconds needed to complete the query. Here it is 134.

Now we are going to create an index and see how it improves our query execution time. We create an index by calling ensureIndex(fields, options) on the collection. The fields parameter is an object containing the fields to be indexed against. The options parameter describes the type of index to make. On production environments, creating an index on a large collection can be slow and resource-intensive, so you should create indexes at off-peak times. In our case we are going to build a unique index on the display field, dropping duplicate entries.

> db.phones.ensureIndex(
... { display : 1 },
... { unique : true, dropDups : true }
... )

Let's try explain() on find() again and see the new value of the millis field. Query execution time improved from 134 down to 16.

> db.phones.find( { display : "+1 800-5650001" } ).explain()
{
        "cursor" : "BtreeCursor display_1",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 0,
        "nscanned" : 0,
        "nscannedObjectsAllPlans" : 0,
        "nscannedAllPlans" : 0,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 16,
        "indexBounds" : {
                "display" : [
                        [
                                "+1 800-5650001",
                                "+1 800-5650001"
                        ]
                ]
        },
        "server" : "ESOLIMAN:27017"
}

Notice the cursor changed from a BasicCursor to a BtreeCursor. MongoDB is no longer doing a full collection scan but instead walking the tree to retrieve the value.
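
The difference can be sketched in plain JavaScript: a full scan touches every document, while an index (approximated here by a Map; MongoDB actually uses a B-tree) jumps straight to the match. The helper names below are illustrative only.

```javascript
// Build an in-memory stand-in for the phones collection.
var phones = [];
for (var i = 0; i < 100000; i++) {
  phones.push({ _id: i, display: '+1 800-' + (5550000 + i) });
}

// "Full collection scan": compare against every document, O(n).
function scanFind(display) {
  for (var j = 0; j < phones.length; j++) {
    if (phones[j].display === display) return phones[j];
  }
  return null;
}

// "Index": a one-time build cost, then each lookup avoids the scan.
var index = new Map();
phones.forEach(function (doc) { index.set(doc.display, doc); });
function indexFind(display) {
  return index.get(display) || null;
}

// Both strategies find the same document; only the amount of work differs.
var a = scanFind('+1 800-5550042');
var b = indexFind('+1 800-5550042');
```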

Mongo can also build your index on nested values (the background option builds the index without blocking other operations): db.phones.ensureIndex({ "components.area": 1 }, { background : 1 })

Aggregations

count() counts the number of matching documents. It takes a query and returns a number.

> db.phones.count({'components.number': { $gt : 5599999 } })
100000

distinct() returns each matching value where one or more exists.

> db.phones.distinct('components.number', {'components.number': { $lt : 5550005 } })
[ 5550000, 5550001, 5550002, 5550003, 5550004 ]

group() groups documents in a collection by the specified keys and performs simple aggregation functions such as computing counts and sums, similar to GROUP BY in SQL. It accepts the following parameters:

  • key – Specifies one or more document fields to group by.
  • reduce – Specifies a function for the group operation to perform on the documents during the grouping operation, such as computing a sum or a count. The aggregation function takes two arguments: the current document and the aggregate result for the previous documents in the group.
  • initial – Initializes the aggregation result document.
  • keyf – Optional. Alternative to the key field. Specifies a function that creates a “key object” for use as the grouping key. Use keyf instead of key to group by calculated fields rather than existing document fields (similar to grouping by a computed expression in SQL).
  • cond – Optional. Specifies the selection criteria to determine which documents in the collection to process. If you omit the cond field, db.collection.group() processes all the documents in the collection for the group operation.
  • finalize – Optional. Specifies a function that runs each item in the result set before db.collection.group() returns the final value. This function can either modify the result document or replace the result document as a whole.

> db.phones.group({
... initial : { count : 0 },
... reduce : function(phone, output) { output.count++; },
... cond : { 'components.number' : { $gt : 5599999 } },
... key : { 'components.area' : true }
... })
[
        {
                "components.area" : 800,
                "count" : 50000
        },
        {
                "components.area" : 855,
                "count" : 50000
        }
]

The first thing we did here was set an initial object with a field named count set to 0—fields created here will appear in the output. Next we describe what to do with this field by declaring a reduce function that adds one for every document we encounter. Finally, we gave group a condition restricting which documents to reduce over.
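
To see what group() is doing, here is a rough client-side re-implementation in plain JavaScript (the groupBy helper is illustrative, not a MongoDB API):

```javascript
// Mimic the group() call above: filter by cond, bucket by key,
// then run the reduce function against each bucket.
function groupBy(docs, opts) {
  var buckets = {};
  docs.filter(opts.cond).forEach(function (doc) {
    var k = JSON.stringify(opts.key(doc));
    if (!buckets[k]) {
      // each bucket starts from a copy of the initial document
      buckets[k] = Object.assign({ key: opts.key(doc) }, opts.initial);
    }
    opts.reduce(doc, buckets[k]);
  });
  return Object.keys(buckets).map(function (k) { return buckets[k]; });
}

var docs = [
  { components: { area: 800, number: 5600000 } },
  { components: { area: 800, number: 5600001 } },
  { components: { area: 855, number: 5600000 } },
  { components: { area: 800, number: 5000000 } } // filtered out by cond
];

var result = groupBy(docs, {
  initial: { count: 0 },
  reduce: function (doc, out) { out.count++; },
  cond: function (doc) { return doc.components.number > 5599999; },
  key: function (doc) { return { area: doc.components.area }; }
});
// result: one bucket per area, each with its count of matching documents
```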

Server-Side Commands

All the queries and operations we ran so far were issued from the client side. The db object provides a command named eval(), which passes the given function to the server to be executed there. This can dramatically reduce the communication between client and server. It is similar to stored procedures in SQL.

There is also a set of prebuilt commands that can be executed on the server. Use db.listCommands() to get a list of these commands. To run a command on the server, use db.runCommand(), like db.runCommand({ "count" : "phones" })

Although it is not recommended, you can store a JavaScript function on the server for later reuse.

MapReduce

MapReduce is a framework for parallelizing problems. Generally speaking, the parallelization happens in two steps:

  • "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
  • "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
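
The two steps can be simulated in plain JavaScript: a map phase that emits key/value pairs, a shuffle that collects values by key, and a reduce phase that folds each group. This is a toy model of the idea, not MongoDB's implementation:

```javascript
// Minimal in-memory MapReduce: count words by length.
function mapReduce(items, map, reduce) {
  var groups = {}; // the "shuffle": collect emitted values by key
  function emit(key, value) {
    var k = JSON.stringify(key);
    (groups[k] = groups[k] || { key: key, values: [] }).values.push(value);
  }
  items.forEach(function (item) { map(item, emit); });
  return Object.keys(groups).map(function (k) {
    return { _id: groups[k].key, value: reduce(groups[k].key, groups[k].values) };
  });
}

var words = ['map', 'reduce', 'shuffle', 'sum', 'fold'];
var out = mapReduce(
  words,
  function (word, emit) { emit(word.length, { count: 1 }); },
  function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.count; });
    return { count: total };
  }
);
// out groups the emits by key, e.g. length 3 gets count 2 ('map', 'sum')
```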

To show the MapReduce framework in action, let’s build on the phones collection that we created previously. Let’s generate a report that counts, for each country, all phone numbers that contain the same set of distinct digits.

First we create a helper function that extracts an array of all distinct numbers (this step is not a MapReduce step).

> distinctDigits = function(phone) {
... var
... number = phone.components.number + '',
... seen = [],
... result = [],
... i = number.length;
... while(i--) {
...  seen[+number[i]] = 1;
...  }
... for (i=0; i<10; i++) {
...  if (seen[i]) {
...   result[result.length] = i;
...   }
...  }
... return result;
... }

> db.eval("distinctDigits(db.phones.findOne({ 'components.number' : 5551213 }))")
[ 1, 2, 3, 5 ]

Now let’s find the distinct numbers for each country. Since we need to query by country later, we will use the distinct digits array and country as a compound key. For each distinct digits array in each country, we will emit a count field that holds the value 1.

> map = function() {
... var digits = distinctDigits(this);
... emit( { digits : digits, country : this.components.country } , { count : 1 } );
... }

The reducer function will sum all these 1s that have been emitted from the map function.

>reduce = function(key, values) {
... var total = 0;
... for(var i=0; i<values.length; i++) {
...  total += values[i].count;
...  }
...  return { count : total };
... }

Now it is time to put all pieces together and start the whole thing (the input collection, map function, reduce function, output collection).

> results = db.runCommand({
... mapReduce : 'phones',
... map : map,
... reduce : reduce,
... out : 'phones.report'
... })
{
        "result" : "phones.report",
        "timeMillis" : 21084,
        "counts" : {
                "input" : 200000,
                "emit" : 200000,
                "reduce" : 48469,
                "output" : 3489
        },
        "ok" : 1
}

Now you can query the output collection like any other collection

> db.phones.report.find()
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 1 }, "value" : { "count" : 37 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 2 }, "value" : { "count" : 23 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 3 }, "value" : { "count" : 17 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 4 }, "value" : { "count" : 29 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 5 }, "value" : { "count" : 34 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 6 }, "value" : { "count" : 35 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 7 }, "value" : { "count" : 33 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 8 }, "value" : { "count" : 32 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 1 }, "value" : { "count" : 5 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 2 }, "value" : { "count" : 7 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 3 }, "value" : { "count" : 3 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 4 }, "value" : { "count" : 6 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 5 }, "value" : { "count" : 5 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 6 }, "value" : { "count" : 10 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 7 }, "value" : { "count" : 5 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 8 }, "value" : { "count" : 7 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6 ], "country" : 1 }, "value" : { "count" : 95 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6 ], "country" : 2 }, "value" : { "count" : 104 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6 ], "country" : 3 }, "value" : { "count" : 108 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6 ], "country" : 4 }, "value" : { "count" : 113 } }
Type "it" for more

or

> db.phones.report.find({'_id.country' : 8})
{ "_id" : { "digits" : [  0,  1,  2,  3,  4,  5,  6 ], "country" : 8 }, "value" : { "count" : 32 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5 ], "country" : 8 }, "value" : { "count" : 7 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6 ], "country" : 8 }, "value" : { "count" : 127 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6,  7 ], "country" : 8 }, "value" : { "count" : 28 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6,  8 ], "country" : 8 }, "value" : { "count" : 27 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  6,  9 ], "country" : 8 }, "value" : { "count" : 29 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  7 ], "country" : 8 }, "value" : { "count" : 10 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  8 ], "country" : 8 }, "value" : { "count" : 7 } }
{ "_id" : { "digits" : [  0,  1,  2,  3,  5,  9 ], "country" : 8 }, "value" : { "count" : 8 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5 ], "country" : 8 }, "value" : { "count" : 3 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  6 ], "country" : 8 }, "value" : { "count" : 121 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  6,  7 ], "country" : 8 }, "value" : { "count" : 25 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  6,  8 ], "country" : 8 }, "value" : { "count" : 27 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  6,  9 ], "country" : 8 }, "value" : { "count" : 17 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  7 ], "country" : 8 }, "value" : { "count" : 4 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  8 ], "country" : 8 }, "value" : { "count" : 4 } }
{ "_id" : { "digits" : [  0,  1,  2,  4,  5,  9 ], "country" : 8 }, "value" : { "count" : 7 } }
{ "_id" : { "digits" : [  0,  1,  2,  5 ], "country" : 8 }, "value" : { "count" : 14 } }
{ "_id" : { "digits" : [  0,  1,  2,  5,  6 ], "country" : 8 }, "value" : { "count" : 162 } }
{ "_id" : { "digits" : [  0,  1,  2,  5,  6,  7 ], "country" : 8 }, "value" : { "count" : 95 } }
Type "it" for more

The unique emitted keys are under the field _id, and all of the data returned from the reducers is under the field value. If you prefer that the mapReduce command just output the results, rather than writing them to a collection, you can set the out value to { inline : 1 }, but bear in mind there is a limit to the size of a result you can output (16 MB).

In some situations you may need to feed the reducer function’s output into another reducer function. In these situations we need to carefully handle both cases: either map’s output or another reduce’s output.
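
One common way to handle this is to make the reducer's output have the same shape as the mapper's emits, so the reducer can safely consume its own results. A plain JavaScript sketch:

```javascript
// A re-reducible reducer: both the mapper's emits and the reducer's own
// output have the shape { count: n }, so feeding reduce output back into
// reduce gives the same total as reducing everything at once.
function reduce(key, values) {
  var total = 0;
  values.forEach(function (v) { total += v.count; });
  return { count: total };
}

var emits = [{ count: 1 }, { count: 1 }, { count: 1 }, { count: 1 }];

// Reduce everything in one pass...
var direct = reduce('k', emits);

// ...or reduce two halves, then reduce the partial results.
var partial1 = reduce('k', emits.slice(0, 2));
var partial2 = reduce('k', emits.slice(2));
var rereduced = reduce('k', [partial1, partial2]);
// direct and rereduced agree, which is what re-reducibility requires
```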

MongoDB has so many features that we didn’t even mention here. In later posts we will continue working with them.

Getting Started with MongoDB – Part 1


MongoDB (from "humongous") is an open source document-oriented database system developed and supported by 10gen (founded by Dwight Merriman). It was first publicly released in 2009, and since then it has been a rising star in the NoSQL world. MongoDB stores structured data as JSON-like documents with dynamic schemas (technically, data is stored in a binary form of JSON known as BSON), making the integration of data in certain types of applications easier and faster.

Installation

  1. Download the latest mongoDB version from here.
  2. Extract the archive to your preferred location (in my case C:\mongodb). MongoDB is self-contained and does not have any other system dependencies, so you can run it from any folder you choose.
  3. MongoDB requires a data folder to store its files (the default is C:\data\db). You may specify a different path with the dbpath setting when launching mongod.exe.

Starting the Server

To start MongoDB, open a command prompt window and run mongod.exe from the bin directory (specify the data path if needed).

(Screenshot: mongod.exe started with --dbpath C:\mongodb\data)

The waiting for connections message in the console output indicates that the mongod.exe process is running successfully and waiting for connections on port 27017.

Connecting to the Server

To connect to the server, open another command prompt window and run mongo.exe from the bin directory.
(Screenshot: mongo.exe connected to the server)

Run MongoDB as a Windows Service

  1. Create a log file for MongoDB (in my case c:\mongodb\log\mongo.log ).
  2. Create a data directory for MongoDB (in my case c:\mongodb\data ).
  3. Open the command prompt window as an administrator.
  4. Run the following command: C:\mongodb\bin>mongod.exe --install --rest --master --logpath "c:\mongodb\log\mongo.log"
  5. Run regedit from start menu.
  6. Go to HKEY_LOCAL_MACHINE >> SYSTEM >> CurrentControlSet >> services
  7. Find the MongoDB directory & edit the ImagePath key.
  8. Set value as c:\mongodb\bin\mongod --service  --rest  --master  --logpath=C:\mongodb\log\mongo.log  --dbpath=C:\mongodb\data
  9. Save and exit registry editor
  10. Open ComponentServices from Start menu >> Run
  11. Locate the MongoDB service, and right-click >> Properties.
  12. Set the Startup Type to Automatic. Then start the service.
    1. To run the MongoDB service from command window, use net start MongoDB
  13. Check http://localhost:28017/ ; MongoDB should return stats.
In case you want to remove the MongoDB service, run C:\mongodb\bin\mongod.exe --remove

Data Model

Data in MongoDB has a flexible schema.
  • A database consists of a set of collections.
  • A collection consists of a set of documents.
    • Documents in the same collection do not need to have the same set of fields or structure.
    • Common fields in a collection’s documents can hold different types of data.
Based on what we mentioned, you could say that MongoDB is schema-less. But you may refer to this data modeling article before doing real work on MongoDB.

CRUD operations

  • When you start the mongo shell, it connects to the test database by default. To create a new database or switch to another database, use use newdb
  • To show a list of databases, use show dbs (databases are created when you insert the first values into them; a database that has been created but has no values inserted into it effectively doesn’t exist).
  • To confirm the current session database, use db
  • To create a collection, just insert an initial record to it. Since Mongo is schema-less, there is no need to define anything up front. The following code creates/inserts a towns collection:
db.towns.insert({
name: "New York",
population: 22200000,
last_census: ISODate("2009-07-31"),
famous_for: [ "statue of liberty", "food" ],
mayor : {
name : "Michael Bloomberg",
party : "I"
}
})

    • brackets like {...} denote an object with key-value pairs.
    • brackets like [...] denote an array.
    • You can nest these values to any depth.
  • To show a list of collections, use show collections.
  • To list the contents of a collection, use db.towns.find()
    • You will see a system-generated field _id (composed of a timestamp, client machine ID, client process ID, and a 3-byte incremented counter); you can override this system-generated value by supplying your own _id.
  • MongoDB commands are JavaScript functions.
    • db is a JavaScript object that contains information about the current database. Try typeof db
    • db.x is a JavaScript object that represent a collection named x within the current database. Try typeof db.towns
    • db.x.help() will list available functions related to the given object. Try db.towns.help()
    • If you want to inspect the source code of a function, call it without parameter or parentheses.
  • Functions You can create JavaScript functions and call them on the mongo shell like:
function insertCity( name, population, last_census, famous_for, mayor_info) {
db.towns.insert({
name:name,
population:population,
last_census: ISODate(last_census),
famous_for:famous_for,
mayor : mayor_info
});
}

insertCity("Punxsutawney", 6200, '2008-01-31', ["phil the groundhog"], { name : "Jim Wehrle" } )
insertCity("Portland", 582000, '2007-09-20', ["beer", "food"], { name : "Sam Adams", party : "D" } )

Now we have three towns in our collection:
(Screenshot: the three towns returned by db.towns.find())
  • To get a specific document, we only need to pass its _id to the find() function as an ObjectId (findOne() retrieves only one matching document). A string can be converted to an ObjectId using the ObjectId(str) function.
 db.towns.find({ "_id" : ObjectId("51def56c1cf66f4c40bb7f4a") })
    • The find() function also accepts an optional second parameter: a fields object we can use to filter which fields are retrieved. If we want only the town name (along with _id), pass in name with a value resolving to 1 (or true).
db.towns.find({ "_id" : ObjectId("51def56c1cf66f4c40bb7f4a") }, { name : 1})
    • To retrieve all fields except name, set name to 0 (or false or null).
db.towns.find({ "_id" : ObjectId("51def56c1cf66f4c40bb7f4a") }, { name : 0})
    • You can retrieve documents based on criteria other than _id. You can use regular expressions or any operator.
db.towns.find( { name : /^P/, population : { $lt : 10000 } }, { name : 1, population : 1 } )
We said before that the query language is JavaScript, which means we can construct operations as we would construct objects. In the following query, we build criteria where the population must be between 10,000 and 1 million. Ranges also work on dates.
> var population_range = {}
> population_range['$lt'] = 1000000
1000000
> population_range['$gt'] = 10000
10000
> db.towns.find( {name : /^P/, population : population_range }, {name: 1})
{ "_id" : ObjectId("51df08e72476b99608460870"), "name" : "Portland" }

    • You can also query based on values in nested arrays: matching exact values, partial values, all given values, or the lack of matching values.
> db.towns.find( { famous_for : 'food' }, { _id : 0, name : 1, famous_for : 1 } )
{ "name" : "New York", "famous_for" : [  "statue of liberty",  "food" ] }
{ "name" : "Portland", "famous_for" : [  "beer",  "food" ] }
> db.towns.find( { famous_for : /statue/ }, { _id : 0, name : 1, famous_for : 1 } )
{ "name" : "New York", "famous_for" : [  "statue of liberty",  "food" ] }
> db.towns.find( { famous_for : { $all : ['food', 'beer'] } }, { _id : 0, name:1, famous_for:1 } )
{ "name" : "Portland", "famous_for" : [  "beer",  "food" ] }
> db.towns.find( { famous_for : { $nin : ['food', 'beer'] } }, { _id : 0, name : 1, famous_for : 1 } )
{ "name" : "Punxsutawney", "famous_for" : [  "phil the groundhog" ] }
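To make the $all and $nin semantics concrete, here is a small plain-JavaScript sketch of how they evaluate against an array field (the helper names are mine, not MongoDB APIs):

```javascript
// Hypothetical helpers illustrating $all / $nin semantics on an array field
function matchesAll(arr, wanted) {
  // $all: every wanted value must appear in the array
  return wanted.every(function (v) { return arr.indexOf(v) !== -1; });
}

function matchesNin(arr, excluded) {
  // $nin: none of the excluded values may appear in the array
  return excluded.every(function (v) { return arr.indexOf(v) === -1; });
}

var portland = ['beer', 'food'];
var punxsutawney = ['phil the groundhog'];

console.log(matchesAll(portland, ['food', 'beer']));      // true
console.log(matchesNin(punxsutawney, ['food', 'beer']));  // true
```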

    • You can query a sub-document by giving the field name as a string, separating nested layers with a dot.
> db.towns.find( { 'mayor.party' : 'I' }, { _id : 0, name : 1, mayor : 1 } )
{ "name" : "New York", "mayor" : { "name" : "Michael Bloomberg", "party" : "I" } }

    • To query for the nonexistence of a field, use the $exists operator:
> db.towns.find( { 'mayor.party' : { $exists : false } }, { _id : 0, name : 1, mayor : 1 } )
{ "name" : "Punxsutawney", "mayor" : { "name" : "jim Wehrle" } }

  • $elemMatch

$elemMatch lets us require that a single nested document satisfies all of our criteria at once; if one element of the array matches every condition, the parent document counts as a match and is returned. We can use any advanced operator within these criteria. To see it in action, let's insert some data into a new collection, countries:
> db.countries.insert({ _id : "us",
... name : "United States",
... exports : {
...  foods : [
...   { name : "bacon", tasty : true },
...   { name : "burgers" }
...          ]
...            }
... })
> db.countries.insert({ _id : "ca",
... name : "Canada",
... exports : {
...  foods : [
...   { name : "bacon", tasty : false },
...   { name : "syrup", tasty : true }
...          ]
...             }
... })
> db.countries.insert({ _id : "mx",
... name : "Mexico",
... exports : {
...  foods : [
...   { name : "salsa", tasty : true, condiment : true }
...          ]
...           }
... })
> print( db.countries.count() )
3

Now if we need to select countries that export tasty bacon, we should use $elemMatch in a query like:
> db.countries.find(
... {
...   'exports.foods' : {
...    $elemMatch : {
...      name : "bacon",
...      tasty : true
...     }
...   }
... },
... { _id : 0, name : 1 }
... )
{ "name" : "United States" }

If we hadn't used $elemMatch and had written the query as follows, it would return countries that export bacon or something tasty, not necessarily tasty bacon:
> db.countries.find(
... { 'exports.foods.name' : 'bacon' , 'exports.foods.tasty' : true },
... { _id : 0, name : 1 }
... )
{ "name" : "United States" }
{ "name" : "Canada" }
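The difference can be sketched in plain JavaScript: the dotted-path form lets each condition be satisfied by a different array element, while $elemMatch requires one element to satisfy all of them. (This is an illustration of the matching semantics, not MongoDB code.)

```javascript
// Canada's exports.foods array from the inserts above
var canadaFoods = [
  { name: 'bacon', tasty: false },
  { name: 'syrup', tasty: true }
];

// Dotted-path style: each condition may match a *different* element
var dottedMatch =
  canadaFoods.some(function (f) { return f.name === 'bacon'; }) &&
  canadaFoods.some(function (f) { return f.tasty === true; });

// $elemMatch style: a single element must satisfy *all* conditions
var elemMatch = canadaFoods.some(function (f) {
  return f.name === 'bacon' && f.tasty === true;
});

console.log(dottedMatch); // true  -> Canada matches the dotted-path query
console.log(elemMatch);   // false -> but not the $elemMatch query
```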

  • Boolean Operators

$or is a prefix operator whose list of criteria returns documents that match either condition. There are many other operators you can use in the same style ($and, $nor, $not, and so on).
> db.countries.find({
... $or : [ { _id : "mx" } , { name : "United States" } ] },
... {_id : 1} )
{ "_id" : "us" }
{ "_id" : "mx" }

Update

The find() function took two parameters: criteria and a list of fields to return. The update() function works the same way. The first parameter is a criteria object (the same one you would use to retrieve the document through find()). The second parameter is either an object whose fields will replace the matched document(s) or a modifier operation ($set to set a field value, $unset to delete a field, $inc to increment a field value by a number).
The following query will set the field state with the string OR for the matching document.
db.towns.update( { _id : ObjectId("4d0ada87bb30773266f39fe5") }, { $set : { "state" : "OR" } } )
but the following query will replace the matching document with a new document { state : “OR”}
db.towns.update( { _id : ObjectId("4d0ada87bb30773266f39fe5") }, { state : "OR" } )
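A minimal plain-JavaScript sketch of the difference (these are illustrative object operations, not MongoDB APIs):

```javascript
var town = { _id: 1, name: 'Portland', population: 582000 };

// Equivalent of { $set : { state : "OR" } } -- existing fields are kept
var withSet = Object.assign({}, town, { state: 'OR' });

// Equivalent of passing { state : "OR" } directly -- the document body
// is replaced; MongoDB preserves only the _id
var replaced = { _id: town._id, state: 'OR' };

console.log(withSet);   // { _id: 1, name: 'Portland', population: 582000, state: 'OR' }
console.log(replaced);  // { _id: 1, state: 'OR' }
```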

References

Although MongoDB is schema-less, you can make one document reference another using a construct like { $ref : "collection_name", $id : "reference_id" }. In the following query we link the New York town with the country US. Notice the new country field displayed in the New York town document.
> db.towns.update(
... { _id : ObjectId("51def56c1cf66f4c40bb7f4a") },
... { $set : { country : { $ref : "countries", $id : "us" } } }
... )
> db.towns.find( { _id : ObjectId("51def56c1cf66f4c40bb7f4a") } )
{ "_id" : ObjectId("51def56c1cf66f4c40bb7f4a"), "country" : DBRef("countries", "us"), "famous_for" : [  "statue of liberty",  "food" ], "last_census" : ISODate("2009-07-31T00:00:00Z"), "mayor" : { "name" : "Michael Bloomberg", "party" : "I" }, "name" : "New York", "population" : 22200000 }
Now we can retrieve New York from the towns collection, then use its country reference to retrieve the country itself.
> var NY = db.towns.findOne( { _id : ObjectId("51def56c1cf66f4c40bb7f4a") } )
> db.countries.findOne( { _id : NY.country.$id })
{
        "_id" : "us",
        "name" : "United States",
        "exports" : {
                "foods" : [
                        {
                                "name" : "bacon",
                                "tasty" : true
                        },
                        {
                                "name" : "burgers"
                        }
                ]
        }
}

or in a different way:
> db[ NY.country.$ref ].findOne( { _id : NY.country.$id } )
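The manual dereference pattern generalizes to a tiny helper. A sketch against an in-memory stand-in for the shell's db object (the collection contents, findOne, and deref here are my own simplified stand-ins, not MongoDB APIs):

```javascript
// In-memory stand-in for two collections
var db = {
  towns: [{ _id: 'ny', name: 'New York',
            country: { $ref: 'countries', $id: 'us' } }],
  countries: [{ _id: 'us', name: 'United States' }]
};

// Simplified findOne: first document with a matching _id
function findOne(coll, id) {
  return coll.filter(function (d) { return d._id === id; })[0];
}

// Follow a { $ref, $id } reference to the target document
function deref(db, ref) {
  return findOne(db[ref.$ref], ref.$id);
}

var NY = findOne(db.towns, 'ny');
console.log(deref(db, NY.country).name); // United States
```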

Delete

Removing documents from a collection is simple: pass your criteria to the remove() function and all matching documents will be removed. It's recommended practice to first build your criteria as an object, run find() with it to verify that the matching documents are the ones you expect, then pass the same criteria to remove().
> var bad_bacon = { 'exports.foods' : {
... $elemMatch : { name : 'bacon', tasty : false }
... } }
> db.countries.find( bad_bacon )
{ "_id" : "ca", "name" : "Canada", "exports" : { "foods" : [    {       "name" : "bacon",       "tasty" : false },      {       "name" : "syrup",       "tasty" : true } ] } }
> db.countries.remove( bad_bacon )
> db.countries.count()
2

Reading with Code

You can ask MongoDB to run a decision function across documents.
db.towns.find(  function() {
return this.population > 6000 && this.population < 600000;
} )
or in a short-hand format:
db.towns.find("this.population > 6000 && this.population < 600000")
or combine custom code with other criteria using $where
db.towns.find( {
$where : "this.population > 6000 && this.population < 600000",
famous_for : /groundhog/
} )
Custom code should be your last resort due to the following:
  1. Custom code queries run slower than regular queries.
  2. Custom code queries can’t be indexed.
  3. MongoDB can't optimize custom code queries.
  4. If your custom code assumes the existence of a particular field and that field is missing in even a single document, the entire query will fail.
In this post we explored the basics of MongoDB, a rising star in the NoSQL family and the most common document database. We saw how to install and configure it, how to store nested structured data as JSON objects and query that data at any depth, and how to update and delete data. In the next post we're going to dig deeper into MongoDB.