{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Values\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Packages and Extra Functions"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"printyellow (generic function with 1 method)"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using Printf\n",
"\n",
"include(\"jlFiles/printmat.jl\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NaN\n",
"\n",
"`NaN` (Not-a-Number) can be used to indicate that a floating point number (for instance, 2.0) is missing or otherwise undefined (for instance, the result of `0/0`). For other types of data (for instance, the integer 2), use `missing` (see below) instead.\n",
"\n",
"Most computations involving NaNs give `NaN` as the result."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NaN\n"
]
}
],
"source": [
"println(2.0 + NaN)"
]
},
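{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small sketch (an added illustration, with the hypothetical variable name `a`): a `NaN` is not equal to anything, including itself, so test for it with `isnan()` rather than with `==`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a = NaN\n",
"\n",
"println(a == NaN)        # false: NaN is not equal to itself\n",
"println(isnan(a))        # true: the right way to test\n",
"println(0.0*a)           # NaN: NaN propagates also here"
]
},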
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Data\n",
"\n",
"When your data (loaded from a csv file, say) uses a special value to flag missing data points (for instance, -999.99), you can simply replace that value with `NaN`. This works since `NaN` is itself a `Float64` value, so it can be stored as an element of an existing `Float64` array.\n",
"\n",
"(See the tutorial on loading and saving data for more information.)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z: \n",
"     1.000       NaN\n",
"     2.000    12.000\n",
"     3.000    13.000\n",
"\n"
]
}
],
"source": [
"data = [1.0 -999.99;\n",
"        2.0 12.0;\n",
"        3.0 13.0]\n",
"\n",
"z = replace(data,-999.99=>NaN)              #replace -999.99 by NaN\n",
"println(\"z: \")\n",
"printmat(z)"
]
},
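{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not need to keep the original array, then the replacement can also be done in place. The cell below is a minimal sketch of two ways to do that: with the standard `replace!` function and with logical indexing. The names `data2` and `data3` are hypothetical and only used for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data2 = [1.0 -999.99;\n",
"         2.0 12.0;\n",
"         3.0 13.0]\n",
"replace!(data2,-999.99=>NaN)          #changes data2 itself\n",
"\n",
"data3 = [1.0 -999.99;\n",
"         2.0 12.0;\n",
"         3.0 13.0]\n",
"data3[data3 .== -999.99] .= NaN       #same result, using logical indexing\n",
"\n",
"display(data2)\n",
"display(data3)"
]
},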
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NaNs in a Matrix\n",
"\n",
"If a matrix contains NaNs, then many calculations (e.g. summing all elements) give `NaN` as the result."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z has some NaNs\n",
"\n",
"The sum of each column: \n",
"     6.000       NaN\n",
"\n"
]
}
],
"source": [
"if any(isnan,z)                        #check if any NaNs\n",
"    println(\"z has some NaNs\")        #can also do any(isnan.(z))\n",
"end\n",
"\n",
"println(\"\\nThe sum of each column: \")\n",
"printmat(sum(z,dims=1))"
]
},
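{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you instead want column sums that ignore the NaNs, then one possible approach (a sketch, certainly not the only way) is to filter out the NaNs column by column. The matrix `w` is a hypothetical example. Notice that a column with only NaNs would get the sum 0.0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"w = [1.0 NaN;\n",
"     2.0 12.0;\n",
"     3.0 13.0]\n",
"\n",
"colsums = [sum(filter(!isnan,col)) for col in eachcol(w)]   #skip the NaNs in each column\n",
"println(colsums)                                            # [6.0, 25.0]"
]
},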
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Rid of NaNs\n",
"\n",
"It is a common procedure in statistics to throw out all cases with NaNs/missing values. For instance, let `z` be a matrix and `z[t,:]` the data for period $t$, which contains one or more `NaN` values. It is then common (for instance, in linear regressions) to throw out that entire row of the matrix.\n",
"\n",
"This is a reasonable approach if it can be argued that the data is missing at random, that is, for reasons unrelated to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as \"missing\" and you want to study which fund characteristics drive performance.\n",
"\n",
"The code below shows a simple way to throw out all rows of a matrix with at least one `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z:\n",
"     1.000       NaN\n",
"     2.000    12.000\n",
"     3.000    13.000\n",
"\n",
"z2: a new matrix where all rows with any NaNs have been pruned:\n",
"     2.000    12.000\n",
"     3.000    13.000\n",
"\n"
]
}
],
"source": [
"println(\"z:\")\n",
"printmat(z)\n",
"\n",
"vb = any(isnan,z,dims=2)           #indicates rows with NaNs\n",
"vc = .!vec(vb)                     #indicates rows without NaNs\n",
"\n",
"z2 = z[vc,:]                       #keep only rows without NaNs\n",
"println(\"z2: a new matrix where all rows with any NaNs have been pruned:\")\n",
"printmat(z2)"
]
},
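{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a regression you typically need to throw out the same rows from both the dependent variable and the regressors. The cell below is a minimal sketch of how that could be done. The function name `droprows_nan` and the arrays `y` and `x` are hypothetical, made up for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"function droprows_nan(y,x)             #hypothetical helper function\n",
"    vb = isnan.(y) .| vec(any(isnan,x,dims=2))   #rows where y or x has a NaN\n",
"    vc = .!vb                                    #rows without any NaNs\n",
"    return y[vc], x[vc,:]\n",
"end\n",
"\n",
"y = [1.0,NaN,3.0,4.0]\n",
"x = [1.0 NaN;\n",
"     2.0 12.0;\n",
"     3.0 13.0;\n",
"     4.0 14.0]\n",
"\n",
"(y2,x2) = droprows_nan(y,x)            #rows 1 and 2 are dropped\n",
"println(y2)                            # [3.0, 4.0]\n",
"display(x2)"
]
},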
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missings\n",
"\n",
"A `missing` can be used to indicate missing values for most types (not just floats).\n",
"\n",
"Similarly to `NaN`s, computations involving `missing` (for instance, `1+missing`) result in `missing`.\n",
"\n",
"In contrast to `NaN`s, you cannot just change an element of an existing matrix (of `Float64` or `Int`, say) to `missing`: the array must first be converted to a type that allows `missing`. The [Missings](https://github.com/JuliaData/Missings.jl) package has helper routines for that."
]
},
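{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of how `missing` behaves in Base Julia (an added illustration, with the hypothetical variable name `v`): both arithmetic and comparisons return `missing`, so use `ismissing()` to test for it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"println(1 + missing)             # missing\n",
"println(missing == missing)      # missing (neither true nor false)\n",
"println(ismissing(missing))      # true\n",
"\n",
"v = [1,missing,3]                #an array created with a missing allows missing\n",
"println(typeof(v))"
]
},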
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"using Missings"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z: \n",
"         1   missing\n",
"         2        12   \n",
"         3        13   \n",
"\n"
]
}
],
"source": [
"data = [1 -999;\n",
"        2 12;\n",
"        3 13]\n",
"z = allowmissing(data)                    #convert to an array that can include missing\n",
"z = replace(z,-999=>missing)              #replace -999 by missing\n",
"println(\"z: \")\n",
"printmat(z)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z has some missings\n"
]
}
],
"source": [
"if any(ismissing,z)                        #check if any missings\n",
"    println(\"z has some missings\")\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"z2: a new matrix where all rows with any missings have been pruned:\n",
"         2        12   \n",
"         3        13   \n",
"\n"
]
}
],
"source": [
"vd = .!vec(any(ismissing,z,dims=2))    #indicates rows without missings\n",
"\n",
"z2 = z[vd,:]                       #keep only rows without missings\n",
"println(\"z2: a new matrix where all rows with any missings have been pruned:\")\n",
"printmat(z2)"
]
},
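{
"cell_type": "markdown",
"metadata": {},
"source": [
"Throwing out rows is not the only option. As a sketch using two functions from Base Julia: `skipmissing()` lets you do a calculation on just the non-missing elements, and `coalesce()` replaces a `missing` with a value of your choice. The matrix `wm` is a hypothetical example, and whether 0 is a sensible replacement value depends on the application."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wm = [1 missing;\n",
"      2 12;\n",
"      3 13]\n",
"\n",
"println(sum(wm[:,2]))                  # missing\n",
"println(sum(skipmissing(wm[:,2])))     # 25, the sum of the non-missing elements\n",
"\n",
"wm0 = coalesce.(wm,0)                  #replace each missing by 0\n",
"display(wm0)"
]
},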
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once `z2` no longer contains any `missing` (although its type still allows them), you can typically use it like any other array. However, if you for some reason need a traditional array, then convert `z2` by using the `disallowmissing` function (see below)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The type of z2 is Matrix{Union{Missing, Int64}}\n",
"\n",
"The type of z3 is Matrix{Int64}\n"
]
}
],
"source": [
"println(\"The type of z2 is \", typeof(z2))\n",
"\n",
"z3 = disallowmissing(z2)                  #convert to traditional array,\n",
"                                          #same as convert.(Int,z2)\n",
"\n",
"println(\"\\nThe type of z3 is \", typeof(z3))"
]
},
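{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a vector, a possible alternative that only uses Base Julia is to combine `skipmissing()` with `collect()`: this drops the missings and gives a traditional array in one step. This is a sketch with the hypothetical names `u` and `u2`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"u  = [1,missing,3]\n",
"u2 = collect(skipmissing(u))       #drop the missings, get a traditional vector\n",
"\n",
"println(u2)                        # [1, 3]\n",
"println(\"The type of u2 is \", typeof(u2))"
]
},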
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Julia 1.6.0",
"language": "julia",
"name": "julia-1.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
}