HSG-MCS-HS21_Julia/JuliaTutorial-master/Tutorial_12_MissingValues.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missing Values\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Packages and Extra Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "printyellow (generic function with 1 method)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using Printf\n",
    "\n",
    "include(\"jlFiles/printmat.jl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NaN\n",
    "\n",
    "The `NaN` (Not-a-Number) can be used to indicate that a floating point number (for instance, 2.0) is missing or otherwise strange. For other types of data (for instance, 2), use a ```missing``` (see below) instead.\n",
    "\n",
    "Most computations involving NaNs give `NaN` as the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NaN\n"
     ]
    }
   ],
   "source": [
    "println(2.0 + NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "When your data (loaded from a csv file, say) has special values for missing data points (for instance, -999.99), then you can simply replace those values with `NaN`.  This works since `NaN` is a Float64 value, so you can change an existing array of `Float64`s to `NaN`.\n",
    "\n",
    "(See the tutorial on loading and saving data for more information.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "     1.000       NaN\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1.0 -999.99;\n",
    "        2.0 12.0;\n",
    "        3.0 13.0]\n",
    "\n",
    "z = replace(data,-999.99=>NaN)    #replace -999.99 by NaN\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NaNs in a Matrix\n",
    "\n",
    "If a matrix contains NaNs, then many calculations (eg. summing all elements) give NaN as the result. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some NaNs\n",
      "\n",
      "The sum of each column: \n",
      "     6.000       NaN\n",
      "\n"
     ]
    }
   ],
   "source": [
    "if any(isnan,z)                #check if any NaNs\n",
    "  println(\"z has some NaNs\")   #can also do any(isnan.(z))\n",
    "end\n",
    "\n",
    "println(\"\\nThe sum of each column: \")\n",
    "printmat(sum(z,dims=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Rid of NaNs\n",
    "\n",
    "It is a common procedure in statistics to throw out all cases with NaNs/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$  which contains one or more `NaN` values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.\n",
    "\n",
    "This is a reasonable approach if it can be argued that the fact that the data is missing is random - and not related to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as \"missing\" - and you want to study what fund characteristics that drive performance.\n",
    "\n",
    "The code below shows a simple way of how to through out all rows of a matrix with at least one `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z:\n",
      "     1.000       NaN\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n",
      "z2: a new matrix where all rows with any NaNs have been pruned:\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "println(\"z:\")\n",
    "printmat(z)\n",
    "\n",
    "vb = any(isnan,z,dims=2)      #indicates rows with NaNs\n",
    "vc = .!vec(vb)                #indicates rows without NaNs\n",
    "\n",
    "z2 = z[vc,:]           #keep only rows without NaNs\n",
    "println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missings \n",
    "\n",
    "can be used to indicate missing values for most types (not just floats).\n",
    "\n",
    "Similarly to `NaN`s, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n",
    "\n",
    "In contrast to `NaN`s, you cannot just change an element of an existing matrix (of Float64 or Int, say) to `missing.` The [Missings](https://github.com/JuliaData/Missings.jl) package has help routines to handle that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Missings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "     1       missing\n",
      "     2        12    \n",
      "     3        13    \n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1 -999;\n",
    "        2 12;\n",
    "        3 13]\n",
    "z = allowmissing(data)                #convert to an array that can include missing\n",
    "z = replace(data,-999=>missing)       #replace -999 by missing\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some missings\n"
     ]
    }
   ],
   "source": [
    "if any(ismissing,z)                      #check if any NaNs\n",
    "  println(\"z has some missings\")\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z2: a new matrix where all rows with any missings have been pruned:\n",
      "     2        12    \n",
      "     3        13    \n",
      "\n"
     ]
    }
   ],
   "source": [
    "vd = .!vec(any(ismissing,z,dims=2))\n",
    "\n",
    "z2 = z[vd,:]                #keep only rows without NaNs\n",
    "println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once `z2` does not have any `missing` (although it still allows you to) you can typically use it as any other array. However, if you for some reason need to work with a traditional array, then convert `z2` (see below) by using the `disallowmissing` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The type of z2 is Matrix{Union{Missing, Int64}}\n",
      "\n",
      "The type of z3 is Matrix{Int64}\n"
     ]
    }
   ],
   "source": [
    "println(\"The type of z2 is \", typeof(z2))\n",
    "\n",
    "z3 = disallowmissing(z2)       #convert to traditional array,\n",
    "                               #same as  same as convert.(Int,z2)\n",
    "\n",
    "println(\"\\nThe type of z3 is \", typeof(z3)) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 1.6.0",
   "language": "julia",
   "name": "julia-1.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}