HSG-MCS-HS21_Julia/JuliaTutorial-master/Tutorial_12_MissingValues.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missing Values\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Packages and Extra Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "printyellow (generic function with 1 method)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "using Printf\n",
    "\n",
    "include(\"jlFiles/printmat.jl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NaN\n",
    "\n",
    "The `NaN` (Not-a-Number) value can be used to indicate that a floating point number (for instance, 2.0) is missing or otherwise strange. For other types of data (for instance, the integer 2), use `missing` (see below) instead.\n",
    "\n",
    "Most computations involving NaNs give `NaN` as the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NaN\n"
     ]
    }
   ],
   "source": [
    "println(2.0 + NaN)"
   ]
  },
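  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `NaN` also behaves unusually in comparisons: following the IEEE 754 rules, `NaN == NaN` is `false`, so it is safer to test for NaNs with `isnan()` than with `==`. The cell below is a small illustration (the variable name `x` is used only for this example)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = NaN\n",
    "println(x == x)          #false: a NaN is not equal to anything, not even itself\n",
    "println(isnan(x))        #true: the way to detect a NaN\n",
    "println(isequal(x,x))    #true: isequal() treats NaNs as equal to each other"
   ]
  },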
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "When your data (loaded from a csv file, say) uses a special value to mark missing data points (for instance, -999.99), then you can simply replace those values with `NaN`. This works since `NaN` is a `Float64` value, so it can be stored in an existing array of `Float64`s.\n",
    "\n",
    "(See the tutorial on loading and saving data for more information.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "     1.000       NaN\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1.0 -999.99;\n",
    "        2.0 12.0;\n",
    "        3.0 13.0]\n",
    "\n",
    "z = replace(data,-999.99=>NaN)    #replace -999.99 by NaN\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NaNs in a Matrix\n",
    "\n",
    "If a matrix contains NaNs, then many calculations (e.g. summing all elements) give `NaN` as the result.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some NaNs\n",
      "\n",
      "The sum of each column: \n",
      "     6.000       NaN\n",
      "\n"
     ]
    }
   ],
   "source": [
    "if any(isnan,z)                #check if any NaNs\n",
    "  println(\"z has some NaNs\")   #can also do any(isnan.(z))\n",
    "end\n",
    "\n",
    "println(\"\\nThe sum of each column: \")\n",
    "printmat(sum(z,dims=1))"
   ]
  },
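  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you instead want column sums that ignore the NaNs, one possible approach is to sum only the non-NaN elements of each column. The cell below is a minimal sketch of that idea; it defines its own small matrix (`zA`, a name chosen just for this illustration) so that it can be run on its own. Notice that a column consisting only of NaNs would get the sum 0.0 with this approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zA = [1.0 NaN;\n",
    "      2.0 12.0;\n",
    "      3.0 13.0]\n",
    "\n",
    "colSums = [sum(filter(!isnan,col)) for col in eachcol(zA)]   #sum of non-NaN elements, column by column\n",
    "println(colSums)"
   ]
  },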
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Rid of NaNs\n",
    "\n",
    "It is a common procedure in statistics to throw out all cases with NaNs/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$  which contains one or more `NaN` values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.\n",
    "\n",
    "This is a reasonable approach if it can be argued that the data is missing at random, that is, for reasons unrelated to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as \"missing\" and you want to study which fund characteristics drive performance.\n",
    "\n",
    "The code below shows a simple way to throw out all rows of a matrix with at least one `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z:\n",
      "     1.000       NaN\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n",
      "z2: a new matrix where all rows with any NaNs have been pruned:\n",
      "     2.000    12.000\n",
      "     3.000    13.000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "println(\"z:\")\n",
    "printmat(z)\n",
    "\n",
    "vb = any(isnan,z,dims=2)      #indicates rows with NaNs\n",
    "vc = .!vec(vb)                #indicates rows without NaNs\n",
    "\n",
    "z2 = z[vc,:]           #keep only rows without NaNs\n",
    "println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Missings\n",
    "\n",
    "A `missing` can be used to indicate missing values for most types (not just floats).\n",
    "\n",
    "Similarly to `NaN`s, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n",
    "\n",
    "In contrast to `NaN`s, you cannot just change an element of an existing matrix (of `Float64` or `Int`, say) to `missing`, since the element type of such a matrix does not allow it. The [Missings](https://github.com/JuliaData/Missings.jl) package has helper functions to handle that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Missings"
   ]
  },
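  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why a conversion is needed, the sketch below first tries to assign `missing` to an element of an ordinary `Int` matrix (which fails), and then repeats the assignment on a copy created by `allowmissing()`. The names `A` and `B` are used only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Missings\n",
    "\n",
    "A = [1 2;\n",
    "     3 4]                      #Matrix{Int}, cannot hold missing\n",
    "try\n",
    "    A[1,1] = missing           #fails: missing cannot be converted to Int\n",
    "catch err\n",
    "    println(\"Assignment failed with a \", typeof(err))\n",
    "end\n",
    "\n",
    "B = allowmissing(A)            #element type Union{Missing,Int}\n",
    "B[1,1] = missing               #now works\n",
    "println(ismissing(B[1,1]))"
   ]
  },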
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z: \n",
      "     1       missing\n",
      "     2        12    \n",
      "     3        13    \n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [1 -999;\n",
    "        2 12;\n",
    "        3 13]\n",
    "z = allowmissing(data)                #copy of data whose element type also allows missing\n",
    "replace!(z,-999=>missing)             #replace -999 by missing, in place\n",
    "println(\"z: \")\n",
    "printmat(z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z has some missings\n"
     ]
    }
   ],
   "source": [
    "if any(ismissing,z)                      #check if any missings\n",
    "  println(\"z has some missings\")\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "z2: a new matrix where all rows with any missings have been pruned:\n",
      "     2        12    \n",
      "     3        13    \n",
      "\n"
     ]
    }
   ],
   "source": [
    "vd = .!vec(any(ismissing,z,dims=2))     #indicates rows without missings\n",
    "\n",
    "z2 = z[vd,:]                #keep only rows without missings\n",
    "println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n",
    "printmat(z2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `z2` no longer contains any `missing` (although its element type still allows them), you can typically use it like any other array. However, if you for some reason need to work with a traditional array, then convert `z2` with the `disallowmissing` function (see below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The type of z2 is Matrix{Union{Missing, Int64}}\n",
      "\n",
      "The type of z3 is Matrix{Int64}\n"
     ]
    }
   ],
   "source": [
    "println(\"The type of z2 is \", typeof(z2))\n",
    "\n",
    "z3 = disallowmissing(z2)       #convert to traditional array,\n",
    "                               #same as convert.(Int,z2)\n",
    "\n",
    "println(\"\\nThe type of z3 is \", typeof(z3)) "
   ]
  },
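  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative to pruning rows, Base Julia also provides `skipmissing()` (which iterates over the non-missing elements) and `coalesce()` (which substitutes a value for `missing`). The cell below is a minimal sketch on a small vector `v` (a name used only for this illustration). Whether replacing `missing` by 0 makes sense depends on the application."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "v = [1, missing, 3]\n",
    "\n",
    "println(sum(v))                  #missing: the missing propagates\n",
    "println(sum(skipmissing(v)))     #sum of the non-missing elements\n",
    "println(coalesce.(v,0))          #replace each missing by 0"
   ]
  },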
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 1.6.0",
   "language": "julia",
   "name": "julia-1.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
first commit 2021-11-15 20:14:51 +00:00