Table of contents

- Installation of Julia
- Data Structures Concept in Julia Programming Language
- For Loop
- Basics Of Julia For Data Analysis
- Exploratory Data Analysis With Julia
- Using R and Python Libraries in Julia
- Using Pandas With Julia
- Introduction To DataFrames.jl
- Visualization in Julia Using Plots.jl
- Histogram Chart
- Data Munging In Julia
- Building a Predictive ML Model
- Logistic Regression
- Decision Tree
- Random Forest
- Using ggplot2 in Julia

Julia is a general-purpose programming language, like C and C++, that was developed mainly for numerical computation. Computational science keeps changing, and results are now expected from large-scale data in a fraction of a second. Yet with so many established languages that offer good performance and compatibility, such as C, C++, Java, and Python, an obvious question arises: why Julia?

Because Julia was designed for numerical computation, it helps eliminate performance bottlenecks and provides an environment well suited to developing applications that require high performance.


**Installation of Julia**

Here are the steps to download and install Julia on your system:

**Step-1:** To download Julia, go to https://julialang.org/downloads/ or search Google for “Download Julia”.

**Step-2:** Download the installer that matches your machine's configuration, i.e. 32-bit or 64-bit.

**Step-3:** After the download finishes, run the .exe file.

**Step-4:** Click the install button and follow the installer prompts.

**Step-5:** Tick the checkbox to run Julia and click **Finish**.

**Step-6:** You will now see a command-line prompt, also known as the REPL (Read-Eval-Print Loop).

Before moving on, let's set up a notebook environment for data analysis and data science projects.

Jupyter Notebook is popular in data science and ML because it gives quick results and is easy to work with. Julia also has its own IDE, Juno, but if you are already familiar with notebooks you can keep using Jupyter Notebook. Let's see how to set up the Julia notebook package (IJulia).

Open the Julia prompt and type the following commands:

julia> using Pkg

julia> Pkg.add("IJulia")

After you run the command, the necessary packages will be added or updated.

Once the IJulia package has been downloaded or updated, type the following to run it:

julia> using IJulia

julia> notebook()

By default, the notebook “dashboard” opens in your home directory or in the folder where Julia was installed.

If you want to open the dashboard in a different directory, use notebook(dir = "/some/path").

**Data Structures Concept in Julia Programming Language**

Like every other programming language, Julia has data structures. Let's learn about those that are commonly used for data analysis.

**Vector (Array) –** A vector is a one-dimensional array. As in other languages, its elements are written as a comma-separated list, here inside square brackets, for example [1, 2, 3].

Indexing in Julia starts at 1, whereas in Python it starts at 0.
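As a minimal sketch of the vector behaviour described here (the variable names and values are illustrative, not from the original), the following shows 1-based indexing and a few common operations:

```julia
# A one-dimensional array (Vector) of integers
v = [10, 20, 30, 40]

first_item = v[1]     # indexing starts at 1, so this is 10
last_item  = v[end]   # `end` refers to the last index, so this is 40
push!(v, 50)          # append an element in place
n = length(v)         # number of elements after the push

println(first_item, " ", last_item, " ", n)
```

Accessing `v[0]` would raise a `BoundsError`, which is the most common surprise for readers coming from Python.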

**Matrix Operations**

A matrix is another data structure that is widely used in linear algebra. A matrix is a two-dimensional array. Let's see some matrix operations in Julia:

**A = [1 2 3; 4 5 6; 7 8 9] # a semicolon starts a new row**

When printed, it looks like:

**1 2 3**

**4 5 6**

**7 8 9**

To access an element, give its row and column indices; for example, **A[1, 2]** returns 2.

The transpose of a real-valued matrix is written **A'**, and the result looks like:

**1 4 7**

**2 5 8**

**3 6 9**

**Dictionary**

Another data structure is the dictionary, an unordered collection of key-value pairs in which the keys are always unique.

Let's have a look at a dictionary implementation:

**D = Dict("string1" => "Hello", "length" => 5) # create dictionary**

This gives the result:

"string1" => "Hello"

"length" => 5

To access a value in the dictionary, index it with the corresponding key:

**D["length"]**

**o/p: 5**

To get the number of entries in a dictionary, use **length(D)**.

**Operations of Dictionary:**

- **Creation:** d = Dict("a" => 1, "b" => 2)
- **Addition:** d["c"] = 3
- **Removal:** delete!(d, "b")
- **Lookup:** get(d, "a", 1)
- **Update:** d["a"] = 10
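The dictionary operations listed here can be tried together in one short session. This is a sketch with illustrative keys and values; `get` takes a default that is returned when the key is absent:

```julia
d = Dict("a" => 1, "b" => 2)     # creation
d["c"] = 3                       # addition
delete!(d, "b")                  # removal
found   = get(d, "a", 0)         # lookup: key exists, returns 1
default = get(d, "z", 0)         # lookup: key missing, returns the default 0
d["a"] = 10                      # update

println(length(d))               # number of key-value pairs
println(haskey(d, "b"))          # false, because "b" was deleted
```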

**Strings**

The next data structure is the string. Strings are written within double quotes (" "). As in Python, a string in Julia is immutable: once created, it cannot be changed.

Let's have a look:

**Text = "Hello world"**

**print(Text[1]) # gives the first character of the string, H**

**print(length(Text)) # gives the length of the string, 11**
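A short sketch of string handling, using only functions from Julia's standard library; the variable names are illustrative. Because strings are immutable, functions such as `replace` return a new string rather than modifying the original:

```julia
text = "Hello world"

first_char = text[1]                            # a Char, 'H'
n          = length(text)                       # number of characters
replaced   = replace(text, "world" => "Julia")  # returns a new string

println(first_char, " ", n, " ", replaced)
println(text)                                   # the original is unchanged
```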

There are three key phases of data analysis in which these data structures are used:

**Data Exploration**

Finding out more about the data we have.

**Data Munging**

Cleaning the data so that it can be used to build better statistical models.

**Predictive Modelling**

The final step: run the algorithm and have fun.

**Loops, Conditions In Julia**

Like other programming languages, Julia has loops and conditional statements:

- For loop
- While loop
- If condition

These are the most commonly used loop and conditional statements in Julia, as in other programming languages.

**If and else**

In Julia we need not worry about spaces, indentation, semicolons, or brackets; each block is simply closed with the keyword end. Here is the syntax for if and else:

Syntax: if condition

statement

else

statement

end

**if, elseif and else**

This works the same way as the if-else block. Let's look at the syntax:

Syntax: if condition

statement

elseif condition

statement

else

statement

end

Let's take an example of what we discussed above:

if x > 0

"Positive"

elseif x < 0

"Negative"

else

"Zero"

end
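A minimal runnable sketch of the if/elseif/else chain, wrapped in a function so it can be called with different inputs. The function name `describe_sign` is hypothetical, and the final branch returns "Zero" because it is reached only when x equals 0:

```julia
# Classify a number by its sign
function describe_sign(x)
    if x > 0
        return "Positive"
    elseif x < 0
        return "Negative"
    else
        return "Zero"
    end
end

println(describe_sign(5))
println(describe_sign(-3))
println(describe_sign(0))
```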

Let's talk about loops in Julia.

**For Loop**

Julia's for loop iterates over a range or a collection. A range is written start:step:stop, so the loop has an explicit start and end value.

julia> for i in 0:10:100

print(i, " ")

end

This gives the result: 0 10 20 30 40 50 60 70 80 90 100

julia> for a in ["red", "green", "yellow"]

print(a, " ")

end

This gives the result: red green yellow

julia> for a in Dict("name" => "orange", "size" => 6)

println(a)

end

This prints each key-value pair (the order is not guaranteed):

"name" => "orange"

"size" => 6

Similarly, we can iterate through a 2D array:

A = reshape(1:9, (3, 3))

for i in A

print(i, " ")

end

The result will be: 1 2 3 4 5 6 7 8 9

Loops can also be used inside functions:

function name()

for item in collection

statement

end

return

end

A variable defined inside a function is local to it: it exists only while the function is running and is no longer accessible once the function returns.

function scale()

k = 2

for i in 1:10:50

k = k * i

end

return k

end

**To assign to a global variable from inside a function or a top-level loop, put the keyword “global” before the variable name.**

**continue and break** are control statements used inside loops: continue skips to the next iteration, and break exits the loop.

for i in 10:5:20

print(i, " ")

continue

end
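To make the effect of the two statements visible, here is a small sketch (the range and conditions are illustrative) in which `continue` skips even numbers and `break` stops the loop once a value above 7 is reached:

```julia
collected = Int[]

for i in 1:10
    if i % 2 == 0
        continue          # skip even numbers and go to the next iteration
    end
    if i > 7
        break             # leave the loop entirely
    end
    push!(collected, i)   # reached only for odd numbers up to 7
end

println(collected)
```

The loop collects 1, 3, 5 and 7; it exits at 9, so 10 is never visited.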

**comprehensions**

Similar to Python, Julia also supports comprehensions:

julia> s = Set([a for a in 1:8])

Set([6, 4, 5, 7, 1, 3, 2, 8])

julia> [(a, b) for a in 1:5, b in 1:2]

(1, 1) (1, 2)

(2, 1) (2, 2)

(3, 1) (3, 2)

(4, 1) (4, 2)

(5, 1) (5, 2)

**Generator Expressions**

Like comprehensions, generator expressions produce values from an iterable, but they do so on demand without first building an array.

Let's have a look at an example:

**julia> sum(x^2 for x in 1:10)**

**385**
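The difference between a comprehension and a generator can be sketched as follows; the variable names are illustrative. A comprehension builds the whole array, a generator feeds values to the consuming function one at a time, and both accept an `if` filter:

```julia
squares = [x^2 for x in 1:10]              # comprehension: allocates a Vector
total   = sum(x^2 for x in 1:10)           # generator: no intermediate array
evens   = [x for x in 1:10 if x % 2 == 0]  # comprehension with a filter
grid    = [(a, b) for a in 1:2, b in 1:3]  # two variables give a 2×3 Matrix

println(total)
println(evens)
println(size(grid))
```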

**Nested Loops**

Writing one loop inside another is known as a nested loop. In Julia we need not write a separate loop for each level; instead, several iteration variables can be declared in a single for statement, separated by commas. The @show macro can then print the variables.

Have a look at this piece of code for a better understanding:

for x in 1:10, y in 1:10

@show (x, y)

end

The result will be:

(x, y) = (1, 1)

(x, y) = (1, 2)

(x, y) = (1, 3)

(x, y) = (1, 4)

(x, y) = (1, 5)

(x, y) = (1, 6)

(x, y) = (1, 7)

(x, y) = (1, 8)

...

(x, y) = (10, 10)

**@show is a macro that prints an expression together with its value.**

**@time reports the elapsed time and memory allocations of an expression, which is useful for timing loops.**

julia> x = rand(1000);

julia> function total()

A = 0.0

for i in x

A += i

end

return A

end

julia> @time total()

0.017705 seconds (15.28 k allocations: 694.484 KiB)

496.84883432553846

**While Loop**

Like the for loop, the while loop repeats a block of statements, but it runs only as long as its condition is true. The syntax is:

while condition

statements

end

Let's have an example:

julia> x = 0

julia> while x < 3

print(x, " ")

global x += 1

end

Result: 0 1 2

Finally, exceptions: like other programming languages, Julia has try and catch blocks.

julia> s = "apple"

try

s[1] = 'a'

catch e

print("caught an error: $e")

end
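As a sketch of how the try/catch block behaves, the following captures the result as a value. Assigning to a character of a string fails because strings are immutable, so Julia raises a `MethodError` that the catch branch handles; the variable name `message` is illustrative:

```julia
s = "apple"

message = try
    s[1] = 'A'                        # strings are immutable, so this throws
    "no error"
catch e
    "caught an error: $(typeof(e))"   # report only the type of the exception
end

println(message)
```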

**Basics Of Julia For Data Analysis**

Many of us are already familiar with Python or R in the field of machine learning and data science. Both perform well and deliver results quickly. Julia, in turn, is designed to process large amounts of data and return results in a fraction of a second.

Its syntax is very similar to that of Python or R, so it takes little time to start using Julia for data analysis. Moreover, data scientists spend a lot of time transforming data into a usable format. For that purpose Julia provides extensive libraries for turning raw data into a structured format. The basic steps in data analysis are:

- Explore the given data sets or data tables and apply statistical methods to find patterns in the numbers.
- Plot the data for visualization.

As in other machine learning workflows, the data first has to be loaded into a data frame. Julia provides the DataFrames.jl package for this, which holds tabular data read from files such as .csv and .xlsx.

julia> Pkg.add("DataFrames")

Let's take an example to demonstrate data frames in Julia.

**using DataFrames, CSV**

**# read the dataset**

**df = CSV.read("demo.csv", DataFrame)**

We have loaded the dataset into the variable df, and now we can print it:

**df**

This prints the contents of the demo dataset.

**Data frame functions** such as finding the size, the column names, and the first n rows of the data frame:

**size(df)** = number of rows and columns (m × n)

Output: (3, 3)

**names(df)** = column names

Output: ["Anthony", "Ball", "Call"]

**first(df, 5)** = the first five rows (older versions of DataFrames.jl used head(df, 5))

Output: the first five rows

**Numerical data:** the describe() function gives basic summary statistics such as the mean, minimum, median, and maximum of each column.

**Categorical data:** countmap(), from the StatsBase package, maps each value to its number of occurrences in the dataset.

**Dealing with Missing Data**

This is a very important concept, because everything downstream depends on the data: when values are missing, the accuracy of the predicted results suffers. To maintain good accuracy, we should handle the missing data in the dataset.

**describe(df)** = reports the number of missing values in each column (older versions of DataFrames.jl provided showcols() for this)

We can also replace empty values with some suitable value, for example:

**replace!(df.Anthony, " " => "some data to replace")**

**Visualization** summarizes the entire dataset and the relationships within it.

For example, a chart of rainfall (in cm) against time shows rainfall increasing over the period.

**Point to remember**

A histogram always divides the data into bins; more bins show the distribution in finer detail.

Data analysis is not limited to data visualization; analysis is also done after modelling.

**Exploratory Data Analysis With Julia**

Exploratory data analysis is used to understand data in terms of its features, its variables, and the relationships among them. The first step is always to understand the data set properly.

**Methods to be followed on a given dataset (explore)**

- Statistical methods or functions
- Visual plot techniques

**Apply some statistics to the data table**

**Step 1:** Install the DataFrames package

To work with a data table or data set in Julia, a data structure called a data frame is used, because data frames handle tabular data with good speed, accuracy, and compatibility.

The DataFrames package has to be installed before it can be used. The following commands install it:

using Pkg

Pkg.add("DataFrames")

**Step 2:** Download the data set

**Step 3:** Install and load the necessary packages, such as CSV and DataFrames

using DataFrames

using CSV

a = CSV.read("sample.csv", DataFrame)

**Step 4:** Explore the data

Data exploration shows the shape of the data set, its column names, and how its variables relate to each other.

using DataFrames

using CSV

a = CSV.read("sample.csv", DataFrame);

size(a)

names(a)

first(a, 10)

**Describe Function**

The describe function gives basic summary statistics of the data set, such as the mean and median.

**Mean:** the average of the values in a column.

**Mode:** the most frequently observed value in a column.

**Median:** the middle value of a column when its values are sorted.

using DataFrames

using CSV

a = CSV.read("sample.csv", DataFrame);

describe(a)

describe(a, :all, cols = :SepalLength)
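The three statistics can also be computed directly on a plain vector. The sketch below uses `mean` and `median` from Julia's Statistics standard library; `most_frequent` is a hypothetical helper written here for the mode, and the sample values are made up for illustration:

```julia
using Statistics

# Most frequent value in a collection (ties are resolved arbitrarily)
function most_frequent(xs)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1   # tally each value
    end
    best, best_count = first(xs), 0
    for (x, c) in counts
        if c > best_count
            best, best_count = x, c
        end
    end
    return best
end

data = [2, 4, 4, 5, 10]
println("mean   = ", mean(data))
println("median = ", median(data))
println("mode   = ", most_frequent(data))
```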

**Apply visual plot techniques over the data set**

Visual plotting in Julia can be achieved using plotting libraries such as Plots, StatsPlots, and PyPlot.

**Plots:** a high-level plotting package that interfaces with other plotting packages, called “back-ends”, which act as the graphics engines that render the plots.

**StatsPlots:** a companion package to Plots (formerly named StatPlots) that adds statistical plots.

**PyPlot:** a package that wraps Matplotlib, the Python plotting library.

These libraries can be installed as follows:

Pkg.add("Plots")

Pkg.add("StatsPlots")

Pkg.add("PyPlot")

**Distribution Analysis**

Distribution analysis in Julia is performed using plots such as histograms, scatter plots, and box plots.

using DataFrames

using CSV

a = CSV.read("sample.csv", DataFrame);

using Plots

Plots.histogram(a.SepalLength, bins = 50, xlabel = "SepalLength", label = "length in cm")

Other plot types, such as scatter plots and box plots, can be produced in the same way.

**Using R and Python Libraries in Julia**

Julia is a powerful language with many libraries and packages of its own, and it can also access libraries from other languages.

You may wonder why Julia would need libraries from other languages, especially Python and R, when it has powerful libraries of its own. The reason is that some Julia libraries are still young, so Julia provides ways to call mature R and Python libraries.

PyCall is the package that makes it possible to call Python libraries from Julia code:

julia> Pkg.add("PyCall")

**PyCall** provides many useful functions for manipulating Python objects in Julia through the type **PyObject**.

The steps to call a Python package are:

Step 1: using Pkg

Step 2: Pkg.add("PyCall")

Step 3: using PyCall

Step 4: python_library_name = pyimport("python_library_name")

Let's see a basic program that imports Python's math package into Julia:

using Pkg

Pkg.add("PyCall")

using PyCall

math = pyimport("math")

print(math.cos(90))

A second example imports the NumPy package into Julia:

using Pkg

Pkg.add("PyCall")

using PyCall

numpy = pyimport("numpy")

A = numpy.array([2, 1, 4, 3, 5, 7, 6, 8])

print(A)

Output:

[2, 1, 4, 3, 5, 7, 6, 8]

**Using Pandas With Julia**

If you are familiar with the pandas library in Python, the Julia version works the same way. With Pandas we can filter and analyze data in many ways, in particular by loading the data into a DataFrame, a type provided by the pandas library.

A DataFrame presents the data as a two-dimensional table, i.e. in matrix format.

julia> Pkg.add("Pandas")

Let's see an example using Pandas with Julia:

using Pandas

df = read_csv("job.csv")

df = DataFrame(Dict(:company => ["Google", "Apple"], :job => ["sales executive", "computer manager"], :degree => ["bachelors", "masters"], :salary => [0, 1]))

typeof(df)

head(df) # gives the first five rows of data

describe(df)

A typical cleaning step is to rename a value, for example changing every “computer manager” entry in the job column to “manager”, and then to compute a summary such as the mean salary:

mean(df[:salary])

These are some of the basic pandas operations used to clean a data set.

Cleaning includes removing null values, replacing missing values, and correcting inappropriate data.

Pandas is a powerful library not only in Python but also in Julia.

**Introduction To DataFrames.jl**

As we know, Julia has a library for data transformation, DataFrames.jl, much like the ones in Python and R. The approach looks similar to Python or R, but the API calls differ. For more complex manipulation of data tables, the DataFramesMeta package is used.

Let's see how to install and import the library:

- To install the library, use the command **Pkg.add("DataFrames")**
- To load the library, use the command **using DataFrames**

After these steps, the next step is to load the data set. A data table can be read as follows:

using CSV

Datatable = CSV.read("sample.csv", DataFrame)

| Fruits | Sweet | Sour |
|---|---|---|
| Apple | 80% | 10% |
| Orange | 90% | 10% |
| Pineapple | 100% | 0% |

After loading the CSV file, check for missing values. Because column types are detected automatically, a column whose first rows are missing may be given the wrong type, which leads to errors. In that case the type has to be specified manually.

To declare that a column may contain missing values, define its type as follows:

types = Dict("Florida" => Union{Missing, Int64})

This dictionary can be passed to CSV.read through its types keyword argument.

If you want to edit the values of an imported data frame, do not forget to use copycols = true.

- Read the data from an HTTP stream:

using DataFrames, HTTP, CSV

resp = HTTP.request("GET", "https://somesite@domain.com?accesstyep=Download")

df = CSV.read(IOBuffer(String(resp.body)), DataFrame)

- Create a df from scratch:

df = DataFrame(

color = ["red", "yellow", "orange"],

shape = ["circle", "rhombus", "vertical"],

border = ["line", "dotted", "line"],

area = [1.1, 1.2, 1.3])

- There are many other possibilities with df, such as converting a matrix into a data frame column by column:

For example:

df = DataFrame([[mat[:, i]...] for i in 1:size(mat, 2)], Symbol.(headerstrs))

Using the DataFrames package we can do a lot more with a data set or data table. Converting the given dataset into a data frame makes it possible to analyze the data properly and to deal with null and missing values.

**Get Some Insights of Data**

- first(df, n)
- show(df, allrows = true, allcols = true)
- last(df, n)
- describe(df)
- unique(df.fieldName)
- names(df)
- size(df)
- to iterate over each column: for a in eachcol(df)
- to iterate over each row: for a in eachrow(df)

**Filter**

There are two ways to refer to columns of a data frame: referencing the stored values directly, or copying them into a new object.

- myObject = df[!, [:cFruits]] (references the stored values)
- newObject = df[:, [:cFruits]] (copies the values into a new object)

Data frames can also be queried, for example with the Query.jl package:

using Query

dfresult1 = @from i in df begin

@where i.col > 1

@select {aNewColName = i.col1, i.col3}

@collect DataFrame

end

dfresult2 = @from i in df begin

@where i.value != 1 && i.cat1 in ["red", "yellow"]

@select i

@collect DataFrame

end

**Replace Data**

We can replace the values of a column with other data, for example with values looked up in a dictionary:

df.col1 = map(key -> mydict[key], df.col1)

The values of two string columns can be concatenated element-wise with the dot (broadcast) operator: df.c = df.a .* df.b

Append a row: push!(df, [1, 2, 3])

Delete rows: deleteat!(df, rowIdx) (named deleterows! in older versions of DataFrames.jl)

**Change the structure of data or holding object**

A data frame lets us rename a column, change a column's data type, delete a column, or change the position of columns. Type casting converts a column from one data type to another.

From int to float: df.a = convert(Array{Float32, 1}, df.a)

**Sorting:** sort!(df, [:col2, :col1], rev = [false, false])

DataFrames is therefore a powerful package for data handling. It handles missing values, which are a common source of errors, and lets us split data sets, recombine them, and apply statistical operations such as aggregate functions.
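A compact sketch of several of these operations on a small table, assuming DataFrames.jl 1.x is installed. The column names and values are made up for illustration:

```julia
using DataFrames

# A small illustrative table (values are hypothetical)
df = DataFrame(color = ["red", "yellow", "orange"],
               area  = [3, 1, 2])

push!(df, ("white", 4))                        # append a row
df.area = convert(Vector{Float64}, df.area)    # change the column type Int -> Float64
sort!(df, :area)                               # sort in place by the area column
filtered = df[df.area .> 1.5, :]               # keep rows whose area exceeds 1.5

println(df)
println(nrow(filtered))
```

Functions whose names end in `!`, such as `push!` and `sort!`, modify the data frame in place, while the row filter returns a new data frame.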

**Visualization in Julia Using Plots.jl**

Another way to explore and analyze data is visualization, using various kinds of plots.

In Julia we can plot data using a library. Julia has no single built-in plotting library; instead, you can use the plotting libraries of your choice in Julia programs.

To get this functionality we need to install some packages:

julia> Pkg.add("Plots")

julia> Pkg.add("StatsPlots")

julia> Pkg.add("PyPlot")

Plots.jl acts as an interface to the underlying plotting libraries.

StatsPlots.jl (formerly StatPlots.jl) is a supporting package for Plots.jl.

PyPlot.jl provides access to Python's Matplotlib.

Now let's look at some data visualization plots using PyPlot.jl, which also tell us more about the data table.

using CSV, DataFrames

train = CSV.read("Venice.csv", DataFrame)

using Plots, StatsPlots

pyplot() # set the back-end to matplotlib.pyplot

Plots.histogram(collect(skipmissing(train.ApplicationTax)), bins = 50, xlabel = "ApplicationTax", labels = "Frequency") # plot histogram

If you look at the plot, the values are spread over a wide range, which is why we use 50 bins or a similar number.

Next, we can look at box plots to understand the distribution more clearly.

Let's see another way of visualizing the data:

boxplot(collect(skipmissing(train.ApplicationTax)), xlabel = "ApplicationTax")

The box plot reveals the presence of extreme values, which can be attributed to differences in tax across society. We can also split the data by group, here using the Education column:

boxplot(train.Education, train.ApplicationTax, label = "ApplicationTax")

Here there is no substantial difference in tax between the groups, whether the tax paid is high or low.

Let's look at other charts, such as the **line chart** and the pie chart, using rain data per year/month.

using CSV, DataFrames

a = CSV.read("sample.csv", DataFrame)

plot(a.month, a.max)

This graph shows the maximum rainfall for each month.

Next, we will see a **scatter chart** using the same rain data:

scatter(a.Rain, label = "y1")

This chart shows that the rainfall varies from year to year, increasing as the years go on.

Similarly, let's look at the **pie chart**, here with random sample data:

x = 1:5; y = rand(5); # plotting data

pie(x, y)

The pie chart shows the share of each category, from the areas with the most rainfall down to those with average and low rainfall per year or month.

**Histogram Chart**

histogram(a.Rain, label = "Rainfall")

The histogram makes it easy to see that rainfall is distributed unequally over the year.

Graphs and charts can be used to visualize trends.

This completes the basic charts available in Julia through the Plots library.

**Data Munging In Julia**

During data analysis we ran into problems such as missing and null values, all of which have to be dealt with before analysis. Data munging is the process of handling such problems in a data table or data set, i.e. converting the raw data into a format that can be used for data analysis. It is also known as **Data Wrangling**.

It is one of the most important components of data science.

The following packages are required:

**RDatasets:** this package loads the data sets commonly used in the R language. It can be installed as follows:

julia> Pkg.add("RDatasets")

Just as we use data frames in Python or R to hold a data set in tabular form, in Julia **DataFrames** and **DataFramesMeta** provide this functionality:

julia> Pkg.add("DataFrames")

julia> Pkg.add("DataFramesMeta")

Let's load the data set. It contains the columns:

- company
- job
- degree
- salary

The question to analyze with this data set is whether an employee with a bachelor's degree can be promoted or given a salary increase, which varies by company.

using RDatasets

sal = dataset("datasets", "sample")

first(sal, 6)

**This displays the first rows of the dataset.**

**Using groupby():**

The groupby function splits a data frame into groups of rows that share the same values in the given columns. Each group is a sub-data-frame to which a function can then be applied. The groups are indexed starting from 1.

The syntax is:

groupby(a, :col_names, sort = false, skipmissing = false)

The parameters are:

a: the data frame

:col_names: the column names on which the data set is split

sort: whether to return the groups in sorted order; false by default

skipmissing: whether to skip groups with missing values; false by default

using RDatasets

sal = dataset("datasets", "sample")

groupby(sal, :job)

**by() function**

The by() function performs the **split-apply** method: it splits the data frame by the given columns and then applies a function to each group. In DataFrames.jl 1.0 and later, by() has been removed in favor of combine(groupby(...)). The syntax is:

by(a, :col_names, function, sort = false)

The parameters are:

a: the data frame

col_names: the columns on which to split

function: the function applied to each group

sort: whether to return the data frame in sorted order; false by default

Let's split the data frame and show the mean and variance of salary for each group:

using RDatasets

using Statistics

sal = dataset("datasets", "sample")

by(sal, [:job, :degree]) do a

DataFrame(Mean_of_Salary = mean(a.salary), Variance_of_Salary = var(a.salary))

end
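In DataFrames.jl 1.x the same split-apply computation is usually written with `combine` over a `groupby` result. The sketch below assumes DataFrames.jl 1.x and uses a small hand-made table; the values and output column names are hypothetical:

```julia
using DataFrames, Statistics

# Hypothetical salary table with the columns used in this section
sal = DataFrame(job    = ["manager", "manager", "sales", "sales"],
                degree = ["masters", "masters", "bachelors", "bachelors"],
                salary = [10, 20, 30, 50])

# Split by job and degree, then apply mean and var to the salary column
result = combine(groupby(sal, [:job, :degree]),
                 :salary => mean => :mean_salary,
                 :salary => var  => :variance_salary)

println(result)
```

Each `source => function => target` triple names the input column, the function applied per group, and the output column.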

**aggregate() function**

The aggregate function also follows the split-apply method: the data frame is split by the given columns and the function is then applied to every remaining column. Like by(), it has been removed in DataFrames.jl 1.0 and later.

aggregate(a, :col_names, function)

The parameters are:

a: the data frame

col_names: the columns on which to split

function: the function applied to each remaining column

using RDatasets

using Statistics

sal = dataset("datasets", "sample")

aggregate(sal[:, [:job, :salary]], :job, mean)

**Missing**

In Julia, missing values are represented by the special value **missing**, which is the only instance of the type Missing.

julia> missing

missing

Let's check the type of missing:

julia> typeof(missing)

Missing

The Missing type allows users to create vectors and DataFrame columns that contain missing values.

Let's see an example:

julia> x = [0, 1, missing]

3-element Array{Union{Missing, Int64},1}:

0

1

missing

julia> eltype(x)

Union{Missing, Int64}

julia> Union{Missing, Int}

Union{Missing, Int64}

julia> eltype(x) == Union{Missing, Int}

true

While performing operations, missing values can be excluded using the function skipmissing:

julia> skipmissing(x)

Base.SkipMissing{Array{Union{Missing, Int64},1}}(Union{Missing, Int64}[0, 1, missing])

As an example, let's find the average of all non-missing values:

julia> using Statistics

julia> mean(skipmissing(x))

0.5

julia> collect(skipmissing(x))

2-element Array{Int64,1}:

0

1

**coalesce is the function used to replace missing values with some other value. It is broadcast over the array with the dot syntax:**

julia> coalesce.(x, 0)

3-element Array{Int64,1}:

0

1

0
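The missing-value functions from this section can be combined in one short script. This sketch uses only Base Julia and the Statistics standard library; the variable names are illustrative:

```julia
using Statistics

x = [0, 1, missing]

n_missing = count(ismissing, x)        # how many entries are missing
present   = collect(skipmissing(x))    # non-missing values as a plain Vector{Int}
average   = mean(skipmissing(x))       # mean of the non-missing values
filled    = coalesce.(x, 0)            # replace each missing entry with 0

println(n_missing, " ", present, " ", average, " ", filled)
```

Note the dot in `coalesce.(x, 0)`: without broadcasting, `coalesce(x, 0)` returns the array unchanged, because the array itself is not `missing`.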

Similarly, rows of a data frame may contain missing values. We can use dropmissing and dropmissing! to remove such rows.

julia> df = DataFrame(I = 1:5,

P = [missing, 3, missing, 2, 1],

Q = [missing, missing, "c", "d", "e"])

5×3 DataFrame

Row | I P Q

| Int64 Int64? String?

1 | 1 missing missing

2 | 2 3 missing

3 | 3 missing c

4 | 4 2 d

5 | 5 1 e

julia> dropmissing(df)

2×3 DataFrame

Row | I P Q

| Int64 Int64 String

1 | 4 2 d

2 | 5 1 e

The Missings.jl package provides a few more functions for working with missing values.

julia> using Missings

julia> Missings.replace(x, 1)

Missings.EachReplaceMissing{Array{Union{Missing, Int64},1},Int64}(Union{Missing, Int64}[0, 1, missing], 1)

These are some of the basic functions used to handle data during analysis, mainly to remove or replace null and missing values in the data set. This is what data munging is about.

**Building a Predictive ML Model**

So far we have seen how a data set should be handled, how to overcome problems such as missing or null values, and how to visualize the data using the Plots.jl and StatsPlots libraries.

Now we will see how to build a machine learning model using the Julia programming language.

In Python, scikit-learn is the library that provides all the necessary models; in Julia, the ScikitLearn package provides them.

julia> Pkg.add("ScikitLearn")

This package acts as an interface to Python's scikit-learn package,

*“since Julia can access packages of Python.”*

**Label Encoder**

In Python, LabelEncoder is a class found in **sklearn.preprocessing** that converts categorical data into numerical labels [0, 1, 2, ...].

In Julia we convert the data into numerical format in the same way. Those who are familiar with Python will understand why a label encoder is used: it becomes easier to work with a column once its values are numerical.

Let's encode some sample data:

using ScikitLearn

@sk_import preprocessing: LabelEncoder

encoder = LabelEncoder()

data = ["apple", "orange", "papaya"] # names of the categorical columns

for col in data

train[!, col] = fit_transform!(encoder, train[!, col])

end

Now we will define a generic classification function that takes a model as input and gives us the accuracy and cross-validation scores.

using ScikitLearn: fit!, predict, @sk_import, fit_transform!

using Statistics: mean

@sk_import preprocessing: LabelEncoder

@sk_import model_selection: cross_val_score

@sk_import metrics: accuracy_score

@sk_import linear_model: LogisticRegression

@sk_import ensemble: RandomForestClassifier

@sk_import tree: DecisionTreeClassifier

function classification_Model(model, predictors)

y = convert(Array, train[:, 13]) # target column

X = Matrix(train[:, predictors]) # training features

X_test = Matrix(test[:, predictors]) # test features

# fit the model

fit!(model, X, y)

# predictions on the training data set

predictions = predict(model, X)

# accuracy

accuracy = accuracy_score(predictions, y)

println("accuracy: ", accuracy)

# cross-validation

cross_score = cross_val_score(model, X, y, cv = 5)

# print cross score

println("cross score: ", mean(cross_score))

fit!(model, X, y)

out = predict(model, X_test)

return out

end

**Logistic Regression**

Using logistic regression we are going to calculate the accuracy and cross validation scores like what we have done in the above classification_Model function.

LogisticRegression in Julia is similar to Python. Logistic Regression in Machine Learning is an classification algorithm which is used to predict the probability of dependent categorial value. The dependent values will be either in 0 or 1.

Logistic Regression can be classifies into two classifications

- Binary Classification
- Multiclass Classification

Let's look at the logistic regression curve visually:

Mathematical equation for logistic regression: **1 / (1 + e^(-x))** (or) **1 / (1 + e^(-z))**
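
The logistic function can be written in one line of Julia. The sketch below also shows how it turns a linear score into a class label; the weights, bias and the 0.5 threshold are illustrative assumptions, not values fitted to any data set in this article.

```julia
# Logistic (sigmoid) function: squashes any real number into (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Hypothetical weights and bias for a two-feature model (illustrative values).
weights = [0.8, -0.4]
bias = 0.1

# Probability of class 1, and the resulting 0/1 label at a 0.5 threshold.
predict_proba(x) = sigmoid(sum(weights .* x) + bias)
predict_label(x) = predict_proba(x) >= 0.5 ? 1 : 0

println(sigmoid(0))                 # 0.5
println(predict_label([2.0, 1.0]))  # 1
```

An input of 0 sits exactly on the decision boundary, which is why sigmoid(0) returns 0.5.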

Let's use the model to determine the accuracy of predicting a person's obesity. Add this snippet as a continuation of the code above:

model = LogisticRegression()

predict_value = [:Obesity]

classification_Model(model, predict_value)

The result will be:

Accuracy: 80.9% Cross-Validation Score: 80%

The accuracy and cross-validation score are good, but if you need higher accuracy, change the columns (variables) and apply the model again.

predict_value = [:Obesity, :Age, :Weight]

classification_Model(model, predict_value)

The result will be:

Accuracy: 88% Cross-Validation-Score: 87.9%

This is how logistic regression classifies. In general, logistic regression is used for problems whose outcome is not a fixed numerical quantity but a category that changes from case to case.

**Decision Tree**

Decision Tree is another classification model. A decision tree works on a parent-child structure: the parent node at the top is the root node, which takes the first decision, and the child nodes at the bottom (the leaf nodes) hold the results. A decision tree works as follows:

- The decision tree selects the best attribute using an Attribute Selection Measure
- The selected attribute is taken as the root node
- The data is then divided again into sub-nodes until a leaf node is reached

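The mathematical formulae used in a decision tree are:

**Entropy I(p, n) = -(p/s) log2(p/s) - (n/s) log2(n/s)**, where p and n are the numbers of samples in the two classes and s = p + n

**Information Gain = I(p, n) - weighted average entropy of the subsets produced by the split**

**Gini Index = 1 - sum of (p_i)^2**, where p_i is the proportion of samples in class i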

**Information Gain:**

Information gain tells us how important an attribute is to the data set, that is, how much knowing the attribute reduces the uncertainty about the class. The attribute with the highest information gain is chosen for the parent node, and the process is repeated for the child nodes.

**Entropy**

Entropy is the quantity from which information gain is computed: information gain describes how much an attribute tells us about the data set, whereas entropy measures the impurity of the data set. The higher the entropy of a node, the more information there is to gain by splitting it.

Let's say there are two classes and we want to know how mixed a set of samples is. If all the samples belong to the same category x, the entropy is 0 and the set is pure. If the samples are split 50-50 between the two classes, the entropy is 1, the maximum impurity for two classes.
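
The two extreme cases can be checked with a short sketch. The functions entropy and information_gain below are hypothetical helpers written for this illustration, using the standard definitions (entropy in bits, gain as parent entropy minus the weighted entropy of the subsets); they are not taken from any package used in this article.

```julia
# Entropy (in bits) of a vector of class labels.
function entropy(labels)
    n = length(labels)
    h = 0.0
    for c in unique(labels)
        p = count(==(c), labels) / n
        h -= p * log2(p)
    end
    return h
end

# Information gain of splitting `labels` into the groups given by `attribute`.
function information_gain(labels, attribute)
    n = length(labels)
    remainder = 0.0
    for v in unique(attribute)
        subset = labels[attribute .== v]
        remainder += length(subset) / n * entropy(subset)
    end
    return entropy(labels) - remainder
end

println(entropy(["yes", "yes", "yes", "yes"]))   # 0.0
println(entropy(["yes", "no", "yes", "no"]))     # 1.0

# An attribute that separates the two classes perfectly gains the full 1 bit.
println(information_gain(["yes", "yes", "no", "no"], ["a", "a", "b", "b"]))  # 1.0
```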

**Gini Index**

The **Gini Index** is another measure of impurity: it gives the probability that a randomly selected sample would be classified incorrectly if it were labelled at random according to the class distribution of the node. If all the samples belong to the same class, the node is pure and the Gini index is 0.
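
As a small illustration, the Gini index can be computed with the standard definition, 1 minus the sum of the squared class proportions. The function gini below is a hypothetical helper for this sketch, not a library call.

```julia
# Gini impurity: 1 minus the sum of squared class proportions.
function gini(labels)
    n = length(labels)
    return 1 - sum((count(==(c), labels) / n)^2 for c in unique(labels))
end

println(gini(["a", "a", "a", "a"]))   # 0.0
println(gini(["a", "b", "a", "b"]))   # 0.5
```

For two classes the Gini index ranges from 0 (pure node) to 0.5 (an even split).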

A decision tree can give higher accuracy than logistic regression on some data sets, since it follows the parent-child structure and takes an explicit decision at each node.

Let's see the implementation of a decision tree by considering an example.

We are going to calculate the results, i.e. the accuracy and cross-validation score, for students using the decision tree classifier algorithm. The attributes for a student are Name and Age.

Consider that the Name and Age columns hold some 10 rows of random data and that we apply the decision tree classifier algorithm to obtain the accuracy and cross-validation score.

model = DecisionTreeClassifier()

predict_value = [:Student, :Name, :Age]

classification_Model(model, predict_value)

The result will be:

Accuracy: 81.95% Cross-Validation Score: 75.6%

We can increase the accuracy further by changing the input columns, so that the maximum accuracy can be obtained.

*“Always find maximum accuracy and score”*

predict_value = [:Student, :Name, :Class, :Age]

classification_Model(model, predict_value)

The result will be:

Accuracy: 85.78% Cross-Validation Score: 80.7%

**Random Forest**

Random Forest is another algorithm that is capable of performing both regression and classification tasks. It uses a technique called bagging, which is short for “Bootstrap” and “Aggregation”.

A random forest uses multiple decision trees as its learning models. Each tree is trained on a random sample of the rows (drawn with replacement) and a random sample of the features of the data set. This sampling step is called Bootstrap; combining the predictions of the trees is the Aggregation step.
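
The row sampling, feature sampling and vote aggregation can be sketched with Julia's standard library alone. The helper names below (bootstrap_indices, feature_subset, majority_vote) are hypothetical, and the sketch leaves out the tree training itself; it only shows the sampling and voting around it.

```julia
using Random

# Draw a bootstrap sample: n row indices chosen with replacement from 1:n.
bootstrap_indices(rng, n) = rand(rng, 1:n, n)

# Pick a random subset of k feature (column) indices without replacement.
feature_subset(rng, n_features, k) = sort(randperm(rng, n_features)[1:k])

# Aggregate the votes of several trees by taking the most frequent label.
function majority_vote(votes)
    labels = unique(votes)
    counts = [count(==(l), votes) for l in labels]
    return labels[argmax(counts)]
end

rng = MersenneTwister(42)
rows = bootstrap_indices(rng, 10)   # some rows repeat, some are left out
cols = feature_subset(rng, 4, 2)    # 2 of the 4 feature columns
println(rows)
println(cols)
println(majority_vote(["yes", "no", "yes"]))  # yes
```

Because every tree sees a different sample, the trees disagree on some inputs, and the majority vote smooths out their individual errors.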

Let's see the process involved in using the random forest algorithm:

- Design a question relevant to the given information or data set
- Make sure all the data is in an accessible format, or convert it into one
- Develop a machine learning model
- Split the data set into training data and test data
- Apply the model and find the accuracy or score on the test data
- Repeatedly change the parameter values until the accuracy is as high as possible
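
The step of splitting the data into training and test sets can be sketched as follows. The function train_test_split here is a hypothetical helper written in plain Julia, and the 20% test fraction is an assumed default, not a value prescribed by this article.

```julia
using Random

# Shuffle the row indices and split them into training and test sets.
function train_test_split(rng, n; test_fraction = 0.2)
    idx = shuffle(rng, 1:n)
    n_test = round(Int, test_fraction * n)
    return idx[n_test + 1:end], idx[1:n_test]
end

train_idx, test_idx = train_test_split(MersenneTwister(1), 10)
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
# prints: 8 training rows, 2 test rows
```

The returned index vectors can then be used to select rows from a table, so that the model is scored on rows it has not seen during training.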

Let's see the implementation of Random Forest.

We are going to calculate the results, i.e. the accuracy and cross-validation score, for bank customers, using the RandomForestClassifier algorithm to segregate customers based on loan status. The attributes for a customer are Name, Age, Sex and Loan.

Consider that the Name, Age, Sex and Loan columns hold n rows of random data and that we apply the RandomForestClassifier algorithm to obtain the accuracy and cross-validation score.

model = RandomForestClassifier(n_estimators = 100)

predictions = [:Name, :Age, :Sex, :Loan]

classification_Model(model, predictions)

Accuracy : 100% Cross-Validation Score : 80%

Here, we got 100% accuracy on the training data set. This is the problem of overfitting, and it can be resolved in two ways:

- Reducing the number of predictors
- Tuning the model parameters

model = RandomForestClassifier(n_estimators = 100, min_samples_split = 50, max_depth = 20, n_jobs = 1)

classification_Model(model, predictions)

The result will be:

Accuracy : 83% Cross-Validation Score : 80%

Here, even though the training accuracy is reduced, the cross-validation score holds at 80%, which means the model is no longer memorizing the training data and is doing well. Random Forest uses multiple decision trees, each of which can give a different prediction.

As far as possible, avoid using a complex modelling technique as a black box without understanding the concepts behind it.

**Using ggplot2 in Julia**

ggplot2 is a data visualization package for the statistical programming language R. ggplot2 breaks a plot down into semantic components such as scales and layers.

Since Julia can access Python and R libraries, ggplot2 can be loaded and used from Julia.

Let's see how to load an R package into Julia:

using RCall

@rlibrary ggplot2

A question may arise: when Julia is so powerful and has its own packages, why use R packages for data visualization?

Plots.jl is a powerful package, but even though its interface resembles R in places, a user who wants to build a plot has many commands to remember, which makes it difficult to use.

That is the reason why Julia users turn to R packages, and to Python libraries too, for data visualization.

Let's consider an example of this scenario:

**Using the Julia Gadfly.jl package**

using Gadfly

plot(plot_data_1, x = "a", y = "b", Geom.line,

layer(Geom.line, x = "a", y = "text", Theme(default_color = "red")),

layer(Geom.line, x = "a", y = "a_mc", Theme(default_color = "blue")),

layer(Geom.line, x = "a", y = "a_mf", Theme(default_color = "orange"))

)

**Using R ggplot package**

ggplot(plot_data_1, aes(x = :a, y = :b)) +

geom_line(color = "red") +

geom_line(aes(y = :a_mc), color = "green") +

geom_line(aes(y = :a_mf), color = "violet")

If you compare the two snippets, the ggplot2 version is much simpler than the Julia version. The user won't get frustrated using the R package, as it is simpler than the Julia package.

The Julia code above may also run into some issues, since Gadfly does not follow the grammar of graphics strictly in areas such as font size, data visualization pattern, and the colour pattern of the lines.

Considering all this, we can say that Julia's plotting packages are a bit more complex than the packages of R or Python. R packages offer good interoperability, and difficult problems can be solved easily with them.

The packages needed to use ggplot2 from Julia are installed as follows:

julia> using Pkg

julia> Pkg.add("RDatasets")

julia> Pkg.add("RCall")

Let's look at a plot drawn with the ggplot2 library:

using RCall, RDatasets

demo = dataset("datasets", "demo")

R"library(ggplot2)"

R"ggplot($demo, aes(x = ASD, y = `AOSI Total Score(Month 12)`)) + geom_point()"

**Concluding Thoughts**

Finally, Julia is a powerful language that gives access to Python and R packages through PyCall and RCall. Its design and syntax also compare well with Python, particularly when writing highly functional code.

We can say that Julia is a strong programming language, and the main reason is that it is well suited to numerical computation.

**“Technology Never Stops instead it flows like Water”**