HSG-MCS-HS21_Julia/JuliaTutorial-master/Tutorial_09_LoadSaveData.ipynb
2021-11-15 21:14:51 +01:00

617 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading and Saving Data\n",
"\n",
"to/from csv, hdf5, jld2, mat and xlsx files."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Packages and Extra Functions\n",
"\n",
"The packages are loaded in the respective sections below. This allows you to run parts of this notebook without having to install all packages.\n",
"\n",
"The data files created by this notebook are written to and loaded from the subfolder \"Results\"."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using Printf, Dates\n",
"include(\"jlFiles/printmat.jl\")\n",
"\n",
"if !isdir(\"Results\")\n",
" error(\"create the subfolder Results before running this program\")\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading a csv File\n",
"\n",
"The csv (\"comma-separated values\") format provides a simple and robust method for moving data, and it can be read by most software.\n",
"\n",
"\n",
"For instance, for *reading* a data file delimited by comma (,) and where the first line of the file contains variable names, then use the following\n",
"```\n",
"(x,header) = readdlm(FileName,',',header=true)\n",
"```\n",
"Alternatively, use\n",
"```\n",
"x = readdlm(FileName,',',skipstart=1)\n",
"```\n",
"to disregard the first line.\n",
"\n",
"Extra arguments control the type of data (Float64, Int, etc), suppression of comment lines and more.\n",
"\n",
"If you need more powerful write/read routines, try the [CSV package](https://github.com/JuliaData/CSV.jl)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the raw contents of Data/CsvFile.csv:\n",
"\n",
"X,Y,Z\n",
"1.1,1.2,1.3\n",
"2.1,2.2,2.3\n",
"\n"
]
}
],
"source": [
"println(\"the raw contents of Data/CsvFile.csv:\\n\")\n",
"println(read(\"Data/CsvFile.csv\",String))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"header of csv file:\n",
" X Y Z\n",
"\n",
"x:\n",
" 1.100 1.200 1.300\n",
" 2.100 2.200 2.300\n",
"\n"
]
}
],
"source": [
"using DelimitedFiles\n",
"\n",
"(x,header) = readdlm(\"Data/CsvFile.csv\",',',header=true) #read csv file\n",
"\n",
"println(\"header of csv file:\")\n",
"printmat(header)\n",
"println(\"x:\")\n",
"printmat(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving a csv File\n",
"\n",
"To *write* csv data, the simplest approach is to create the matrix you want to save and then run\n",
"```\n",
"writedlm(FileName,matrix)\n",
"```\n",
"Alternatively, to write several matrices to the file (without having to first combine them), use\n",
"```\n",
"fh = open(Filename, \"w\")\n",
" writedlm(fh,matrix1,',')\n",
" writedlm(fh,matrix2,',')\n",
"close(fh)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34m\u001b[1mNewCsvFile.csv has been created in the subfolder Results. Check it out.\u001b[22m\u001b[39m\n"
]
}
],
"source": [
"x = [1.1 1.2 1.3;\n",
" 2.1 2.2 2.3]\n",
"header = [\"X\" \"Y\" \"Z\"]\n",
"\n",
"xx = [header; x] #to save \n",
"writedlm(\"Results/NewCsvFile.csv\",xx,',') #write csv file\n",
"\n",
"printblue(\"NewCsvFile.csv has been created in the subfolder Results. Check it out.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading csv with Dates and Missing Values (extra)\n",
"\n",
"The next cells show how to load a csv files with dates (for instance, 15/01/1979) and some missing values. \n",
"\n",
"The code does the following\n",
"\n",
"1. reads the csv file\n",
"2. converts `x2[:,1]` to Date,\n",
"3. finds all elements in `x = x2[:,2:end]` that are not numbers and converts them to `NaN` (use with `Float64`) or `missing` (use when data is not `Float64`)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"readdlmFix"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\"\"\n",
" readdlmFix(x,Typ=Float64,missVal=NaN)\n",
"\n",
"Change elements with missing data (' ') to either NaN or missing. \n",
"`x` is the input matrix, `Typ` is the type of the output (Float64, Int, etc) and \n",
"`missval` is your choice of either `NaN` or `missing`\n",
"\"\"\"\n",
"function readdlmFix(x,Typ=Float64,missVal=NaN)\n",
" y = replace(z->!isa(z,Number) ? missVal : z,x)\n",
" ismissing(missVal) && (Typ = Union{Missing,Typ}) #allow missing\n",
" y = convert.(Typ,y)\n",
" return y\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dates and data (first 4 obs):\n",
"1979-01-02 96.730 \n",
"1979-01-03 9.310\n",
"1979-01-04 98.580 9.310\n",
"1979-01-05 99.130 9.340\n",
"\n",
"after fix of missing data (first 4 obs):\n",
"1979-01-02 96.730 NaN\n",
"1979-01-03 NaN 9.310\n",
"1979-01-04 98.580 9.310\n",
"1979-01-05 99.130 9.340\n",
"\n"
]
}
],
"source": [
"x2 = readdlm(\"Data/CsvFileWithDates.csv\",',',skipstart=1)\n",
"dN = Date.(x2[:,1],\"d/m/y\") #to Date, \"d/m/y\" is the date format in the file\n",
"x = x2[:,2:end] #the data, but Any[] since missing data\n",
"\n",
"println(\"dates and data (first 4 obs):\")\n",
"printmat([dN[1:4] x[1:4,:]])\n",
"\n",
"x = readdlmFix(x)\n",
"\n",
"println(\"after fix of missing data (first 4 obs):\")\n",
"printmat([dN[1:4] x[1:4,:]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading and Saving hdf5\n",
"\n",
"hdf5 files are used in many computer languages. They can store different types of data: integers, floats, strings (but not Julia Dates).\n",
"\n",
"The basic syntax of the [HDF5 package](https://github.com/JuliaIO/HDF5.jl) is \n",
"```\n",
"fh = h5open(FileName,\"r\") #open for reading\n",
" (x,y) = read(fh,\"x\",\"y\")\n",
"close(fh)\n",
"\n",
"fh = h5open(FileName,\"w\") #open for writing\n",
" write(fh,\"x\",x,\"y\",y)\n",
"close(fh)\n",
"```\n",
"\n",
"To save dates, save either a matrix `[y m d]` (see eg. `Dates.month(date)`) or a date value (see eg. `Dates.value(date)`). \n",
"\n",
"The [HDFVIEW](https://support.hdfgroup.org/products/java/hdfview/) program allows you to look at the contents of a hdf5 file. (It is not needed here.)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Variables in h5 file: [\"B\", \"C\", \"x\", \"ymd\"]\n",
"\n",
"dates and x from h5 file is\n",
"2019-05-14 1.100 1.200 1.300\n",
"2019-05-15 2.100 2.200 2.300\n",
"\n"
]
}
],
"source": [
"using HDF5\n",
"\n",
"fh = h5open(\"Data/H5File.h5\",\"r\") #open for reading\n",
" println(\"\\nVariables in h5 file: \",keys(fh))\n",
" (x,B,ymd) = read(fh,\"x\",\"B\",\"ymd\") #load some of the data\n",
"close(fh)\n",
"\n",
"dN = Date.(ymd[:,1],ymd[:,2],ymd[:,3]) #reconstructing dates\n",
"\n",
"println(\"\\ndates and x from h5 file is\")\n",
"printmat([dN x])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NewH5File.h5 has been created in the subfolder Results\n"
]
}
],
"source": [
"x = [1.1 1.2 1.3;\n",
" 2.1 2.2 2.3]\n",
"ymd = [2019 5 14;\n",
" 2019 5 15]\n",
"B = 1\n",
"C = \"Nice cat\"\n",
"\n",
"fh = h5open(\"Results/NewH5File.h5\",\"w\") #open file for writing\n",
" write(fh,\"x\",x,\"ymd\",ymd,\"B\",B,\"C\",C)\n",
"close(fh) #close file\n",
"\n",
"println(\"NewH5File.h5 has been created in the subfolder Results\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading and Saving jld2\n",
"\n",
"jld2 files can store very different types of data: integers, floats, strings, dictionaries, etc. It is a dialect of hdf5, designed to save different Julia objects (including Dates). \n",
"\n",
"The basic syntax of the [JLD2 package](https://github.com/JuliaIO/JLD2.jl) is \n",
"```\n",
"(A,B) = load(FileName,\"A\",\"B\") #load some data\n",
"xx = load(FileName) #load all data into a Dict()\n",
"save(FileName,\"A\",A,\"B\",B) #save data\n",
"```\n",
"(It also possible to use the same syntax as for HDF5, except that we use ```jldopen``` instead of ```h5open```.)\n",
"\n",
"The [JLD package](https://github.com/JuliaIO/JLD.jl) is an an older alternative. It uses a similar (often the same) syntax. To use it, change the first line to `using JLD` and then also change `jld2` to `jld` everywhere in the cells below."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The variables are: [\"B\", \"C\", \"x\", \"d\"]\n",
"\n",
"dates and x from jld2 file is\n",
"2019-05-14 1.100 1.200 1.300\n",
"2019-05-15 2.100 2.200 2.300\n",
"\n"
]
}
],
"source": [
"using FileIO, JLD2\n",
"\n",
"xx = load(\"Data/JldFile.jld2\") #load entire file\n",
"println(\"The variables are: \",keys(xx)) #list contents of the file \n",
" \n",
"(x,dN) = load(\"Data/JldFile.jld2\",\"x\",\"d\") #read some of the data\n",
"\n",
"println(\"\\ndates and x from jld2 file is\")\n",
"printmat([dN x])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NewJldFile.jld2 has been created in the subfolder Results\n"
]
}
],
"source": [
"x = [1.1 1.2 1.3;\n",
" 2.1 2.2 2.3]\n",
"d = [Date(2019,5,14); #Julia dates\n",
" Date(2019,5,15)]\n",
"B = 1\n",
"C = \"Nice cat\"\n",
"\n",
"save(\"Results/NewJldFile.jld2\",\"x\",x,\"d\",d,\"B\",B,\"C\",C) #write jld2 file\n",
"\n",
"println(\"NewJldFile.jld2 has been created in the subfolder Results\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Saving and Loading Matlab mat files (extra)\n",
"\n",
"The [MAT package](https://github.com/JuliaIO/MAT.jl) allows you to load/save (Matlab) mat files (which is a dialect of HDF5)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Variables in mat file: [\"B\", \"C\", \"dM\", \"x\"]\n",
"\n",
"dates and x from mat file is\n",
"2019-05-14 1.100 1.200 1.300\n",
"2019-05-15 2.100 2.200 2.300\n",
"\n"
]
}
],
"source": [
"using MAT\n",
"\n",
"function DateMLtoDate(dNum) #Matlab datenum to Julia date\n",
" dNum = round.(Int,dNum) .- 366\n",
" dNTime = Dates.rata2datetime.(dNum)\n",
" dN = Date.(dNTime)\n",
" return dN\n",
"end\n",
"\n",
"\n",
"fh = matopen(\"Data/MatFile.mat\")\n",
" println(\"\\nVariables in mat file: \",names(fh))\n",
" (x,dM) = read(fh,\"x\",\"dM\")\n",
"close(fh) \n",
"\n",
"d = DateMLtoDate(dM) #Matlab datenum to Julia date\n",
"\n",
"println(\"\\ndates and x from mat file is\")\n",
"printmat([d x])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"NewMatFile.mat has been created in the subfolder Results\n"
]
}
],
"source": [
"x = [1.1 1.2 1.3;\n",
" 2.1 2.2 2.3]\n",
"d = [Date(2019,5,14); #Julia dates\n",
" Date(2019,5,15)]\n",
"B = 1\n",
"C = \"Nice cat\"\n",
"\n",
"dM = Dates.value.(d) .+ 366.0 #Julia Date to Matlab's datenum(), Float64\n",
"\n",
"fh = matopen(\"Results/NewMatFile.mat\",\"w\")\n",
" write(fh,\"x\",x) #write one variable at a time\n",
" write(fh,\"B\",B)\n",
" write(fh,\"dM\",dM)\n",
" write(fh,\"C\",C)\n",
"close(fh)\n",
"\n",
"println(\"\\nNewMatFile.mat has been created in the subfolder Results\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading an xls File\n",
"\n",
"The [XLSX package](https://github.com/felipenoris/XLSX.jl) allows you to read and write xls (and xlsx) files. \n",
"\n",
"As an alternative, you can use [ExcelReaders](https://github.com/davidanthoff/ExcelReaders.jl), which requires python and python's xlrd libarary. For instance, this would work\n",
"```\n",
"using ExcelReaders\n",
"data1 = readxl(\"Data/XlsFile.xlsx\",\"Data!B2:C11\")\n",
"x1 = convert.(Float64,data1) \n",
"printmat(x1)\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"part of the xlsx file:\n",
" 16.660 -999.990\n",
" 16.850 -999.990\n",
" 16.930 -999.990\n",
" 16.980 -999.990\n",
" 17.080 -999.990\n",
" 17.030 7.000\n",
" 17.090 8.000\n",
" 16.760 -999.990\n",
" 16.670 -999.990\n",
" 16.720 -999.990\n",
"\n"
]
}
],
"source": [
"using XLSX\n",
"\n",
"data1 = XLSX.readxlsx(\"Data/XlsFile.xlsx\") #reading the entire file\n",
"x1 = data1[\"Data!B2:C11\"] #extracting a part of the sheet \"Data\"\n",
"x1 = convert.(Float64,x1) #converting from Any to Float64\n",
"\n",
"println(\"part of the xlsx file:\")\n",
"printmat(x1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating Variables from Variable Names (extra)\n",
"\n",
"Suppose you have \n",
"\n",
"1. a matrix `x` with data\n",
"2. a list of the variable names for each column of `x`, for instance, from `header` in a CSV file or saved as a vector of strings in a hdf5/mat/jld2 file.\n",
"\n",
"If there are few variables, then you can manually create each of them from the loaded matrix. This becomes tedious when there are many variables.\n",
"\n",
"However, it is easy to create a Dict() or NamedTuple which can be used to easily refer to the variable names, for instance, `D[:X]` or `N.X` to get variable `X`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x[:,1], X, D[:X] and N.X\n",
"\n",
" x[:,1] X D[:X] N.X\n",
" 1.100 1.100 1.100 1.100\n",
" 2.100 2.100 2.100 2.100\n",
"\n"
]
}
],
"source": [
"using DelimitedFiles\n",
"(x,header) = readdlm(\"Data/CsvFile.csv\",',',header=true)\n",
"n = size(x,2)\n",
"\n",
"(X,Y,Z) = [x[:,i] for i=1:n] #manually creating X,Y,Z\n",
"\n",
"D = Dict([(Symbol(header[i]),x[:,i]) for i=1:n]) #Creating D with :X,:Y,:Z\n",
"\n",
"namesB = tuple(Symbol.(header)...) #a tuple (:X,:Y,:Z)\n",
"N = NamedTuple{namesB}([x[:,i] for i=1:n]) #NamedTuple with N.X, N.Y and N.Z\n",
"\n",
"println(\"x[:,1], X, D[:X] and N.X\\n\")\n",
"printmat([x[:,1] X D[:X] N.X],colNames=[\"x[:,1]\",\"X\",\"D[:X]\",\"N.X\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"@webio": {
"lastCommId": null,
"lastKernelId": null
},
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Julia 1.6.1",
"language": "julia",
"name": "julia-1.6"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}