An object’s name is flexible, but it must begin with a letter and typically follows snake_case, e.g., df_original, my_data, total_asset.
Use # for comments.
Use pacman::p_load() to load packages.
pacman::p_load(tidyverse, ggthemes, knitr)
Assigning values
Assign the value on the right-hand side to the object on the left-hand side with <- (or =); -> assigns in the opposite direction.
x <- 3
4 -> y
x
[1] 3
y
[1] 4
Example
pipe operator %>% or |>
Use pipes %>% or |> to chain functions together. The pipe operator assigns the left-hand side result to the first argument of the right-hand side function.
A |> function(B)
has the same meaning as function(A, B).
Example
Try the following code.
x <- 1:100   # assign 1 to 100 to x
# calculate the sum of the elements of x above 50
# nested
sum(subset(x, x > 50))
[1] 3775
# use standard pipe
x |> subset(x > 50) |> sum()
[1] 3775
# use magrittr pipe
x %>% subset(x > 50) %>% sum
[1] 3775
Use packages
We can use packages to add functionality beyond the functions that come with R. Install a package to use it.
install.packages("pacman")
This installs the pacman package.
Load packages
Just installing a package does not allow you to use its functions. To use a package, you need to load it with the library() function.
library(pacman)
This allows you to use the functions of the pacman package.
Load multiple packages
We can load multiple packages at once using the p_load() function of the pacman package.
p_load(tidyverse, psych, tableone)
This allows you to use the functions of the tidyverse, psych, and tableone packages. If any of these packages have not been installed, they will be installed automatically.
Specify the package
You can use :: to call a function from a package without loading the package with the library() function.
package_name::function()
pacman::p_load(tidyverse, psych, tableone)
Without running library(pacman), you can still use the p_load() function. Similarly, you can use the dplyr package’s mutate() function as dplyr::mutate().
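For example, a minimal sketch using the built-in mtcars data (the unit conversion is only illustrative):
# call a dplyr function without library(dplyr), via the :: operator
mtcars |>
  dplyr::mutate(kpl = mpg * 0.425) |>   # illustrative miles-per-gallon to km-per-litre conversion
  head()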
Check the data
Check the data with the head() function.
head()
If you want to see the structure of the data, use the str() function.
str()
num indicates numeric data
chr indicates character data
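For example, with the built-in iris data frame (any data frame works the same way):
head(iris)   # first six rows of the data
str(iris)    # structure: variable names, types (num, Factor, ...), and example values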
Chapter 2 Foundations of Audit Analytics
Business and Data Analytics
The modern science of data analytics evolved from early projects to add scientific rigor in two fields —
games of chance and
governance (Stigler 1986).
The latter application, in governance, focused on summarizing the demographics of nation-states and lent its name, statistics, to the field.
The field has steadily evolved to include exploratory analysis, which used computers to improve graphics and summarizations; and to include the computationally intensive methods termed machine learning.
The mathematical foundations of statistics evolved from the 17th to the 19th centuries, based on work by Thomas Bayes, Pierre-Simon Laplace, and Carl Gauss.
Statistics as a rigorous scientific discipline accelerated at the turn of the 20th century under Francis Galton, Karl Pearson, and R.A. Fisher, who introduced experimental design and maximum likelihood estimation (Stigler 1986).
Exploratory data analysis
Exploratory data analysis (EDA) arose from seminal work in data summarization presented in Cochran et al. (1954) and Tukey (1980).
EDA built on the emerging availability of computing power and software.
EDA is now an essential part of data analytics, and is important for determining how much, and what kind of information is contained in a dataset.
In contrast, the statistics of Pearson and Fisher tended to focus on testing of models and prediction.
EDA complements such testing by assessing whether data obtained for testing is actually appropriate for the questions posed; in the vernacular of machine learning, EDA helps us extract the features that are contained in the data.
In our current era of massive, information-rich datasets, where we often have no control over, and limited information about, how the data was obtained, EDA has become an essential precursor of model testing and prediction.
A typical first step in analytical review of a company might be to review the industry statistics to find out what part, if any, of our client’s financial accounts are outliers in the industry.
Unusual account values or relationships can indicate audit risks that would require a broader scope of auditing.
Exploratory statistics are easy to implement, yet are ideal for quickly highlighting unusual account values.
We will first load a dataset of industry statistics that have been downloaded from the Wharton Research Data Service repository, and conduct simple summaries of the data.
The R package plotluck automatically chooses graphs that are appropriate for the data, and is an ideal tool for quick visualization of data.
In general, R lets the dot symbol stand for all variables in the data set, and .~1 regresses the variables against the intercept, thus giving each variable’s distribution (essentially a smoothed histogram) (Fig. 1).
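A minimal sketch of this idea, using the built-in iris data in place of the industry dataset and assuming the plotluck package is installed:
pacman::p_load(plotluck)
plotluck(iris, . ~ 1)   # one distribution plot for every variable in the data set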
Installing and loading packages (memo)
Usually, install a package with install.packages("hoge") and load it with library(hoge).
Now use the pacman package to load packages easily.
install.packages("pacman") # first time only
Load the packages with pacman::p_load(hoge1, hoge2, hoge3).
You can load them all at once; any package that is not yet installed will be installed automatically.
Package descriptions
What each loaded package provides:
auditanalytics : the textbook author’s package, which provides the various datasets used here.
tidyverse : an outstanding collection of packages for data manipulation; extremely convenient.
plotluck : an EDA package that generates many plots at once.
broom : converts statistical model output into data frames.
Loading the data and summary statistics
Read the data using the read.csv() function and display the summary statistics using the summary() function. Load the dataset from the auditanalytics package.
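A sketch of this step; the file name industry_stats.csv is a hypothetical placeholder for whichever industry file was downloaded from WRDS:
pacman::p_load(auditanalytics, tidyverse)
industry <- read.csv("industry_stats.csv")   # hypothetical file name for the WRDS download
summary(industry)                            # summary statistics for every column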
New methods for analyzing massive datasets have emerged in the past decade, and are continually evolving, as a product of the machine learning revolution.
The three most common terms used to describe these tools are reserved for a nested set of three technologies:
Artificial Intelligence = “any attempt to mimic human learning/intelligence”
Machine Learning = “computational methods for learning from data”
Deep Learning = “machine learning methods that mimic human neural networks (perceptrons)”
The emphasis on “learning” instructs us on the differences between these technologies and more traditional statistical approaches.
“Learn” is a transitive verb (i.e., takes an object) and involves
A subject (who makes a decision)
learns (from reviewing data) about some
construct (parameter value, classification, etc.)
The essential components of learning are: the decision, the learning method, the construct about which we are learning, and the quality assessment (how well we learned).
AI
The field of artificial intelligence was motivated by Vannevar Bush’s (1945) article “As We May Think” and began with projects in rule-based and symbolic programs, such as early chess programs built on hardcoded rules crafted by programmers.
Random forests quickly became a favorite on the platform, but by 2014, gradient boosting machines were outperforming them, and by 2016, perceptron models, in particular building on the successes at Google, began dominating the field.
ML
Machine learning re-envisioned several of the fundamental methods of statistics.
Computationally intensive stochastic gradient methods replace the first-order conditions from calculus that statistics uses to search for solutions.
The simple squared-error loss function of parametric statistics has been expanded to a wider range of solution concepts.
Machine learning is not limited to fitting statistical depictions of probability distributions, which are typically defined with one to three parameters.
Classification
For classification problems, cross-entropy measures offered much better fit than, say, comparable logit or probit models.
In the 2011 ImageNet image-classification competition, for example, the top-five accuracy of the winning model, based on classical approaches to computer vision, was only 74.3%; by 2015, the winner reached an accuracy of 96.4%.
Business analytics’ embrace of machine learning is inspired by the numerous successes: near-human-level image classification, language translation, speech recognition, handwriting transcription, text-to-speech conversion, autonomous driving and better than human Go and Chess games.
Despite these successes, there are reasons for a balanced, holistic perspective on the tools of data analytics.
Machine learning outperforms twentieth century statistics for:
Large datasets: These tend to overfit simple statistical models and may be a basis for p-hacking
Large complex construct spaces: In models estimating many parameters, like in image processing, 1-3 parameter distributions lack richness.
Feature extraction: the major contribution of exploratory data analysis was feature extraction from data, which could be used to decide whether a dataset was appropriate for model testing.
But statistical models still outperform machine learning models (though this is steadily evolving) where there is a need for:
clear interpretation of results of analysis
consistency
replicability
formal definition of “information”
clear roles for data
clear demands on constructs
clear philosophies on the meaning of “learn”
formal logic and notation
Rest of Chapter 2
The rest of this chapter is devoted to classifying types of data, which in turn says something about the particular importance that we attach to an entity, and the way that we measure it.
In the process I will share some of the excellent graphic tools that the R language offers the auditor for understanding accounting data.
Types of Data
Accounting Data Types
McCarthy (1979, 1982) proposed a design theory of accounting systems that applies the Chen (1976) framework to accounting.
In it, accounting transactions are measurements of economic events involving an entity’s contractual activities with a related party.
These measurements result in the recording of numbers, time, classifications, and descriptive information.
Classifications are dictated by a firm’s Chart of Accounts which delineates the types of economic activities in the firm.
Measurements
Measurements are in monetary units, and thus require some method of valuing an economic event.
The ubiquity of information assets makes valuation one of the most difficult challenges facing modern accountants.
Descriptive information was originally entered in notes to a journal entry.
But social networks and news outlets provide auditors with a plethora of relevant information in textual form. Information technology, statistical methods, and software specify data types to differentiate the processing and storage of data.
The R language excels at managing data.
Indeed, this particular strength sets R, as an audit language, above any other software language.
This means, when using R, that the auditor never has to worry that some part of the client’s data will remain inaccessible.
Packages exist for every commercially important data structure and format; whether the data is a real-time stream, web-based, cloud-based, or on a client’s bespoke system, it can be analyzed with R code.
The next sections of this chapter discuss the most important data and file types, with examples of how these are represented in R.
The examples here use several databases, ranging from financial reports of industry firms over time to Sarbanes-Oxley reports and records of control breaches.
In the process, the management of various types of information is highlighted: e.g., ticker information is categorical, fee information is continuous, breach and SOX-audit decision data are binary, and so forth.
Numerical vs. Categorical
There are two basic types of structured data: numeric and categorical.
Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event.
Categorical data takes only a fixed set of values.
Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false.
Another useful type of categorical data is ordinal data in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
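In R, these four flavors can be represented as follows (a minimal sketch):
wind_speed  <- c(3.2, 7.5, 0.4)                              # numeric, continuous
event_count <- c(0L, 2L, 5L)                                 # numeric, discrete (counts)
passed      <- c(TRUE, FALSE, TRUE)                          # binary / logical
rating      <- factor(c(2, 5, 3), levels = 1:5, ordered = TRUE)  # ordinal (ordered categories)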
Classifying the type of data
Why do we bother with a taxonomy of data types?
It turns out that for the purposes of data analysis and predictive modeling, the data type is important to help determine the type of visual display, data analysis, or statistical model.
In fact, data science software, such as R and Python, uses these data types to improve computational performance.
More importantly, the data type for a variable determines how software will handle computations for that variable.
Binary (Dichotomous, Logical, Indicator, Boolean) Data
Binary data represents a special case of categorical data with just two categories.
Binary data is data’s way of providing answers to yes/no or true/false questions. For example, an audit opinion provides a yes/no answer concerning whether the financial statements (F/S) are fairly presented.
In the following figure, we are interested in whether credit card fraud is influenced by the fees paid to the auditor.
We analyze a binary variable by looking at the variation in other variables under a 0 or 1 value of the binary variable.
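A minimal sketch of this kind of comparison, using synthetic (made-up) data in place of the audit-fee and fraud variables shown in the figure:
set.seed(1)
fees <- data.frame(
  fraud     = rbinom(200, 1, 0.3),         # hypothetical 0/1 fraud indicator
  audit_fee = rlnorm(200, meanlog = 12)    # hypothetical audit fees
)
boxplot(audit_fee ~ fraud, data = fees,
        names = c("no fraud", "fraud"),
        main  = "Audit fees by fraud indicator")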
Ordinal data is categorical data with an explicit ordering.
Ordinal data provides an important control over documents of original entry in accounting systems.
When a journal entry of any sort is generated, it must be uniquely identifiable, and generally the sequence of identifying numbers is assigned in chronological sequence.
Modern systems assign these numbers internally, but auditors still consider the sequential numbering of input documents to be one of the more important internal controls in a system.
But graphs provide immediate access to the degree of the problem by showing the number of exceptions; exceptions are most visible if rendered in a contrasting color, such as the green used in the accompanying charts.
Often this is all that the auditor needs to render an opinion on internal controls, as it is the client’s responsibility to correct these problems.
%in% is a binary operator that returns a logical vector indicating whether each element of the vector on the left is found in the vector on the right.
So !raw %in% journal_ent_no$invoice returns a logical vector of the same length as raw that is TRUE where the corresponding element of raw is not found in journal_ent_no$invoice.
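A sketch of such a sequence check; the object names follow the text, but the invoice ranges are made up:
raw <- 1001:1010                                                  # invoice numbers that should exist
journal_ent_no <- data.frame(invoice = c(1001:1004, 1006:1010))   # recorded entries; 1005 is missing
raw[!raw %in% journal_ent_no$invoice]                             # the unrecorded (missing) numbers
# [1] 1005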
Data Storage and Retrieval
The amount of recorded data produced by human activity has probably been growing exponentially for more than a millennium.
Most data today is digitized, both for archival storage as well as display.
This is good for the environment (newsprint alone in the 1970s accounted for 17% of US waste) but it also means that this data is potentially available for computer processing.
The current amount of digitized data is around 20 trillion GB (= 20 million PB = 20,000 EB = 20 ZB).
Much of this increase has been fueled by new structures for storing and retrieving data - video, text, and vast streams of surveillance data - that have arisen since the commercialization of the internet in the 1990s.
In the nineteenth century, vectors, matrices, and determinants were central tools in engineering and mathematics.
As accounting developed professional standards in this period, it naturally gravitated to representations that were matrix-like: accounts × values for financial reporting, and transactions × values for journal entries and ledgers.
In the twentieth century, spreadsheet tools and rectangular tables of data brought matrices into the computer domain (Fig. 5).
Terminology
Terminology for matrix data can be confusing.
Statisticians and data scientists use different terms for the same thing.
For a statistician, predictor variables are used in a model to predict a response or dependent variable.
For a data scientist, features are used to predict a target.
One synonym is particularly confusing: computer scientists will use the term sample for a single row; a sample to a statistician means a collection of rows.
Within the past decade, twentieth-century concepts of data storage and retrieval have given way to richer modalities.
Storage of text has learned from the older discipline of library science; but entirely new approaches are demanded by music and video data.
One reason that machine learning is able to take on new tasks is its ability to analyze data that would be impossible to handle with the matrix-based methods of statistics.
Statistics evolved in the early twentieth century, building on the matrix algebra that dominated most of science then.
Big Data
The R language uses seven main storage formats for data.
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
Data frames are the most widely used format for data storage and retrieval in the R language.
A data.frame is like an Excel spreadsheet: each column is a variable and each row is an observation.
Create a data frame
ID <- c(1, 2, 3, 4)                       # numeric vector
Color <- c("red", "white", "red", NA)     # character vector
Passed <- c(TRUE, TRUE, TRUE, FALSE)      # logical vector
mydata <- data.frame(ID, Color, Passed)   # create data frame
# access elements of the data frame
mydata[1:2]   # columns 1 and 2 of data frame
ID Color
1 1 red
2 2 white
3 3 red
4 4 <NA>
mydata[c("ID","Color")] # columns ID and Color from data frame
ID Color
1 1 red
2 2 white
3 3 red
4 4 <NA>
Example: Data Frame
Tibbles
Tibbles are the tidyverse’s improvement on data frames, designed to surface problems in a data analysis at an earlier point.
Otherwise they are used exactly as data frames.
Tibbles are more user-friendly than data frames.
In R, the basic rectangular data structure is a data.frame().
A data.frame() also has an implicit integer index based on the row order.
While a custom key can be created through the row.names() attribute, the native R data.frame() does not support user-specified or multilevel indexes.
To overcome this deficiency, two new packages are gaining widespread use: data.table and dplyr.
Both support multilevel indexes and offer significant speedups in working with a data.frame.
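A brief sketch of both ideas; the ticker and fee columns are made-up examples:
pacman::p_load(tibble, data.table)
tb <- tibble(ticker = c("AAA", "BBB"), fee = c(100, 250))   # a tibble; prints compactly
tb
dt <- as.data.table(tb)
setkey(dt, ticker, fee)   # a multilevel (two-column) key
dt["AAA"]                 # fast keyed lookup on the first key column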
List
An ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name.
a <- b <- "seven"
z <- y <- 0
# example of a list with 4 components
w <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)
list1 <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)
list2 <- list(name = "Joe", mynumbers = b, mymatrix = z, age = 10.11)
v <- c(list1, list2)
# Identify elements of a list using the [[ ]] convention.
w[[1]]   # 1st component of the list
[1] "Fred"
v[["mynumbers"]] # component named mynumbers in list
[1] "seven"
Example: list
Factors
Tell R that a variable is nominal by making it a factor.
The factor stores the nominal values as a vector of integers in the range [1, ..., k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
Make factors
Nominal variables are those that represent discrete categories without any intrinsic order.
[1] medium large small
Levels: small < medium < large
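The output above can be produced, for example, by a call like the following (the values are illustrative):
size <- factor(c("medium", "large", "small"),
               levels  = c("small", "medium", "large"),
               ordered = TRUE)
size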
Factors and Ordered Factors
R treats factors as nominal (i.e., label or naming) variables and ordered factors as ordinal variables in statistical procedures and graphical analyses.
You can use options in the factor() and ordered() functions to control the mapping of integers to strings (overriding the alphabetical ordering).
You can also use factors to create value labels.
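For instance, a small sketch of relabeling coded values (the 1 = male, 2 = female coding is made up):
sex_code <- c(1, 2, 2, 1)
sex <- factor(sex_code, levels = c(1, 2), labels = c("male", "female"))
table(sex)   # counts are reported with the value labels, not the raw codes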
Useful Functions for Dataset Inspection and Manipulation
# assuming here that list1 <- c("a", "b", "c") (a character vector, unlike the list above)
cbind(list1, list1)   # combine objects as columns
     list1 list1
[1,] "a"   "a"
[2,] "b"   "b"
[3,] "c"   "c"
rbind(list1, list1)   # combine objects as rows
      [,1] [,2] [,3]
list1 "a"  "b"  "c"
list1 "a"  "b"  "c"
Other Data Types
R supports almost any conceivable type of data structure.
A few additional structures that are important to accounting are:
Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced in surveillance (see the sketch after this list).
Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.
Graph (or network) data structures are used to represent physical, social, and abstract relationships.
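As a small illustration of the first of these structures, a monthly series can be stored with ts() (the numbers are made up):
monthly_sales <- ts(c(120, 135, 128, 150, 160, 155),
                    start = c(2023, 1), frequency = 12)   # monthly series starting January 2023
plot(monthly_sales)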
Further Study
This chapter should be seen as a survey of what is possible for auditors interested in an analytical, algorithmic approach to auditing.
The R package writeups are excellent sources for empirical statistical methods tailored to data structures unique to accounting and auditing.
Searches of the Internet will also find an increasing number of “data camp” type resources that are parts of larger educational programs.
Whatever your challenge, you should be able to find assistance through one or more of these resources.
R Packages Required for This Chapter
This chapter’s code requires the following packages to be installed:
tidyverse : data manipulation and visualization,
kableExtra : custom tables,
plotluck : make many plots automatically,
broom : tidy up model output.
References
Bush, Vannevar. 1945. As We May Think. Resonance 5(11).
Chen, Peter Pin-Shan. 1976. The Entity-Relationship Model—Toward a Unified View of Data. ACM Transactions on Database Systems (TODS) 1(1): 9–36.
Cochran, William G., Frederick Mosteller, and John W. Tukey. 1954. Principles of Sampling. Journal of the American Statistical Association 49(265): 13–35.
McCarthy, William E. 1979. An Entity-Relationship View of Accounting Models. Accounting Review LIV (4): 667–686.
McCarthy, William E. 1982. The REA Accounting Model: A Generalized Framework for Accounting Systems in a Shared Data Environment. Accounting Review LVII (3): 554–578.
Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge: Harvard University Press.
Tukey, John W. 1980. We Need Both Exploratory and Confirmatory. The American Statistician 34 (1): 23–25.