HSG-MCS-HS21_Julia/JuliaTutorial-master/Tutorial_12_MissingValues.ipynb

2021-11-15 20:14:51 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Packages and Extra Functions"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"printyellow (generic function with 1 method)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using Printf\n",
"\n",
"include(\"jlFiles/printmat.jl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NaN\n",
"\n",
"`NaN` (Not-a-Number) can be used to indicate that a floating-point number (for instance, 2.0) is missing or otherwise invalid. For other types of data (for instance, the integer 2), use `missing` (see below) instead.\n",
"\n",
"Most computations involving NaNs give `NaN` as the result."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NaN\n"
]
}
],
"source": [
"println(2.0 + NaN)"
]
},
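{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `NaN` also behaves unusually in comparisons, which is why the cells further down test for it with `isnan` rather than with `==`. A minimal sketch using only Base Julia (the variable `x` is just for illustration):\n",
"\n",
"```julia\n",
"x = NaN\n",
"\n",
"println(x == x)          #false: NaN is not equal to anything, not even itself\n",
"println(isnan(x))        #true: the reliable way to test for NaN\n",
"println(isequal(x,x))    #true: isequal treats NaNs as equal\n",
"```"
]
},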
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Data\n",
"\n",
"When your data (loaded from a csv file, say) uses a special value to mark missing data points (for instance, -999.99), you can simply replace those values with `NaN`. This works since `NaN` is itself a `Float64` value, so it can be stored in an existing array of `Float64`s.\n",
"\n",
"(See the tutorial on loading and saving data for more information.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z: \n",
" 1.000 NaN\n",
" 2.000 12.000\n",
" 3.000 13.000\n",
"\n"
]
}
],
"source": [
"data = [1.0 -999.99;\n",
" 2.0 12.0;\n",
" 3.0 13.0]\n",
"\n",
"z = replace(data,-999.99=>NaN) #replace -999.99 by NaN\n",
"println(\"z: \")\n",
"printmat(z)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NaNs in a Matrix\n",
"\n",
"If a matrix contains NaNs, then many calculations (e.g. summing all elements) give `NaN` as the result.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z has some NaNs\n",
"\n",
"The sum of each column: \n",
" 6.000 NaN\n",
"\n"
]
}
],
"source": [
"if any(isnan,z) #check if any NaNs\n",
" println(\"z has some NaNs\") #can also do any(isnan.(z))\n",
"end\n",
"\n",
"println(\"\\nThe sum of each column: \")\n",
"printmat(sum(z,dims=1))"
]
},
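{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some calculations can instead skip the NaNs. A minimal sketch using only Base Julia, assuming that treating a `NaN` as 0 in a sum is acceptable for your application (the matrix `w` just repeats the values of `z` above, and the names `colsum` and `total` are hypothetical):\n",
"\n",
"```julia\n",
"w = [1.0 NaN;\n",
"     2.0 12.0;\n",
"     3.0 13.0]\n",
"\n",
"colsum = sum(x -> isnan(x) ? 0.0 : x, w, dims=1)   #column sums, NaNs count as 0\n",
"total  = sum(filter(!isnan,w))                     #sum of all non-NaN elements\n",
"\n",
"println(colsum)\n",
"println(total)\n",
"```\n",
"\n",
"Whether skipping is appropriate depends on why the values are missing, so this is an illustration rather than a recommendation."
]
},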
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Rid of NaNs\n",
"\n",
"It is a common procedure in statistics to throw out all cases with NaNs/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$ which contains one or more `NaN` values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.\n",
"\n",
"This is a reasonable approach if it can be argued that the data is missing at random, that is, for reasons unrelated to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as \"missing\" and you want to study which fund characteristics drive performance.\n",
"\n",
"The code below shows a simple way to throw out all rows of a matrix that contain at least one `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z:\n",
" 1.000 NaN\n",
" 2.000 12.000\n",
" 3.000 13.000\n",
"\n",
"z2: a new matrix where all rows with any NaNs have been pruned:\n",
" 2.000 12.000\n",
" 3.000 13.000\n",
"\n"
]
}
],
"source": [
"println(\"z:\")\n",
"printmat(z)\n",
"\n",
"vb = any(isnan,z,dims=2) #indicates rows with NaNs\n",
"vc = .!vec(vb) #indicates rows without NaNs\n",
"\n",
"z2 = z[vc,:] #keep only rows without NaNs\n",
"println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n",
"printmat(z2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missings\n",
"\n",
"A `missing` can be used to indicate missing values for most types (not just floats).\n",
"\n",
"Similarly to `NaN`s, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n",
"\n",
"In contrast to `NaN`s, you cannot just change an element of an existing matrix (of Float64 or Int, say) to `missing`, since the element type of such a matrix does not allow it. The [Missings](https://github.com/JuliaData/Missings.jl) package has helper functions to handle that."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"using Missings"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z: \n",
" 1 missing\n",
" 2 12 \n",
" 3 13 \n",
"\n"
]
}
],
"source": [
"data = [1 -999;\n",
" 2 12;\n",
" 3 13]\n",
"z = allowmissing(data) #convert to an array that can include missing\n",
"z = replace(z,-999=>missing) #replace -999 by missing\n",
"println(\"z: \")\n",
"printmat(z)"
]
},
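{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same conversion can be sketched without the package. This is a minimal illustration using only Base Julia, where the names `x` and `y` are hypothetical: an `Int` vector cannot store `missing`, but a copy whose element type is `Union{Missing,Int}` can.\n",
"\n",
"```julia\n",
"x = [1, 2, 3]\n",
"#x[1] = missing                            #would throw an error: x can only hold Int\n",
"\n",
"y = convert(Vector{Union{Missing,Int}},x)  #copy that also allows missing\n",
"y[1] = missing\n",
"\n",
"println(typeof(y))\n",
"println(y)\n",
"```"
]
},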
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z has some missings\n"
]
}
],
"source": [
"if any(ismissing,z) #check if any missings\n",
" println(\"z has some missings\")\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z2: a new matrix where all rows with any missings have been pruned:\n",
" 2 12 \n",
" 3 13 \n",
"\n"
]
}
],
"source": [
"vd = .!vec(any(ismissing,z,dims=2))\n",
"\n",
"z2 = z[vd,:] #keep only rows without missings\n",
"println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n",
"printmat(z2)"
]
},
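{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pruning rows is not the only option: Base Julia also provides `skipmissing` and `coalesce`. A minimal sketch (the matrix `w` repeats the values of `z` above, and replacing a `missing` by 0 is only an illustration, not a recommendation):\n",
"\n",
"```julia\n",
"w = [1 missing;\n",
"     2 12;\n",
"     3 13]\n",
"\n",
"println(sum(skipmissing(w)))    #sum of all non-missing elements\n",
"println(coalesce.(w,0))         #a new matrix where each missing is replaced by 0\n",
"```"
]
},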
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once `z2` no longer contains any `missing` (although its element type still allows them), you can typically use it like any other array. However, if you need a traditional array for some reason, convert `z2` with the `disallowmissing` function (see below)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The type of z2 is Matrix{Union{Missing, Int64}}\n",
"\n",
"The type of z3 is Matrix{Int64}\n"
]
}
],
"source": [
"println(\"The type of z2 is \", typeof(z2))\n",
"\n",
"z3 = disallowmissing(z2) #convert to traditional array,\n",
" #same as convert.(Int,z2)\n",
"\n",
"println(\"\\nThe type of z3 is \", typeof(z3)) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Julia 1.6.0",
"language": "julia",
"name": "julia-1.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}