{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Missing Values\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Packages and Extra Functions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "printyellow (generic function with 1 method)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using Printf\n", "\n", "include(\"jlFiles/printmat.jl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# NaN\n", "\n", "The `NaN` (Not-a-Number) can be used to indicate that a floating point number (for instance, 2.0) is missing or otherwise strange. For other types of data (for instance, 2), use a ```missing``` (see below) instead.\n", "\n", "Most computations involving NaNs give `NaN` as the result." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NaN\n" ] } ], "source": [ "println(2.0 + NaN)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Data\n", "\n", "When your data (loaded from a csv file, say) has special values for missing data points (for instance, -999.99), then you can simply replace those values with `NaN`. This works since `NaN` is a Float64 value, so you can change an existing array of `Float64`s to `NaN`.\n", "\n", "(See the tutorial on loading and saving data for more information.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z: \n", " 1.000 NaN\n", " 2.000 12.000\n", " 3.000 13.000\n", "\n" ] } ], "source": [ "data = [1.0 -999.99;\n", " 2.0 12.0;\n", " 3.0 13.0]\n", "\n", "z = replace(data,-999.99=>NaN) #replace -999.99 by NaN\n", "println(\"z: \")\n", "printmat(z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NaNs in a Matrix\n", "\n", "If a matrix contains NaNs, then many calculations (eg. summing all elements) give NaN as the result. \n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z has some NaNs\n", "\n", "The sum of each column: \n", " 6.000 NaN\n", "\n" ] } ], "source": [ "if any(isnan,z) #check if any NaNs\n", " println(\"z has some NaNs\") #can also do any(isnan.(z))\n", "end\n", "\n", "println(\"\\nThe sum of each column: \")\n", "printmat(sum(z,dims=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting Rid of NaNs\n", "\n", "It is a common procedure in statistics to throw out all cases with NaNs/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$ which contains one or more `NaN` values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.\n", "\n", "This is a reasonable approach if it can be argued that the fact that the data is missing is random - and not related to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as \"missing\" - and you want to study what fund characteristics that drive performance.\n", "\n", "The code below shows a simple way of how to through out all rows of a matrix with at least one `NaN`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z:\n", " 1.000 NaN\n", " 2.000 12.000\n", " 3.000 13.000\n", "\n", "z2: a new matrix where all rows with any NaNs have been pruned:\n", " 2.000 12.000\n", " 3.000 13.000\n", "\n" ] } ], "source": [ "println(\"z:\")\n", "printmat(z)\n", "\n", "vb = any(isnan,z,dims=2) #indicates rows with NaNs\n", "vc = .!vec(vb) #indicates rows without NaNs\n", "\n", "z2 = z[vc,:] #keep only rows without NaNs\n", "println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n", "printmat(z2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Missings \n", "\n", "can be used to indicate missing values for most types (not just floats).\n", "\n", "Similarly to `NaN`s, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n", "\n", "In contrast to `NaN`s, you cannot just change an element of an existing matrix (of Float64 or Int, say) to `missing.` The [Missings](https://github.com/JuliaData/Missings.jl) package has help routines to handle that." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "using Missings" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z: \n", " 1 missing\n", " 2 12 \n", " 3 13 \n", "\n" ] } ], "source": [ "data = [1 -999;\n", " 2 12;\n", " 3 13]\n", "z = allowmissing(data) #convert to an array that can include missing\n", "z = replace(data,-999=>missing) #replace -999 by missing\n", "println(\"z: \")\n", "printmat(z)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z has some missings\n" ] } ], "source": [ "if any(ismissing,z) #check if any NaNs\n", " println(\"z has some missings\")\n", "end" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "z2: a new matrix where all rows with any missings have been pruned:\n", " 2 12 \n", " 3 13 \n", "\n" ] } ], "source": [ "vd = .!vec(any(ismissing,z,dims=2))\n", "\n", "z2 = z[vd,:] #keep only rows without NaNs\n", "println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n", "printmat(z2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once `z2` does not have any `missing` (although it still allows you to) you can typically use it as any other array. However, if you for some reason need to work with a traditional array, then convert `z2` (see below) by using the `disallowmissing` function." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The type of z2 is Matrix{Union{Missing, Int64}}\n", "\n", "The type of z3 is Matrix{Int64}\n" ] } ], "source": [ "println(\"The type of z2 is \", typeof(z2))\n", "\n", "z3 = disallowmissing(z2) #convert to traditional array,\n", " #same as same as convert.(Int,z2)\n", "\n", "println(\"\\nThe type of z3 is \", typeof(z3)) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "anaconda-cloud": {}, "kernelspec": { "display_name": "Julia 1.6.0", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }