Intro to Data Acquisition

Exploring and defining the methods of obtaining data

Introduction

The goal of nearly all data science endeavors is to answer a particular question, whether that’s how a business can attract more users to its site or what indicators of illness are present in a CT scan image. To answer these questions we need data, and more importantly, the right data. This was expressed well by Charles Babbage, who originated the concept of a digital programmable computer, when he stated:

“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

Data acquisition, or data mining, is the step of the Data Science Life Cycle where we identify and obtain the raw data that will later be cleaned for exploration and modeling.

Data Science Life Cycle

Data Science is viewed as such a powerful tool because data is often the most valuable resource we have when attempting to gain insight from our environment. In a business context, once a question has been defined and objectives set through the business understanding phase of the data science life cycle, the right data must be acquired to answer the business needs. In the diagram above, the arrows between the business understanding phase and the data acquisition phase indicate that there is often an iterative relationship between the two. You need to understand how the questions you’re asking relate to your goals before collecting data. However, the data you acquire may shift your understanding of the business, in which case you may revisit the business understanding phase. Some aspects to consider when acquiring data are:

  • What data is needed to achieve the business goal?
  • How much data is needed to produce valuable insight and modeling?
  • Where and how can this data be found?
  • What legal and privacy parameters should be considered?

There are several methods you can utilize to acquire data. Methods that we will cover in this article are:

  • Public & Private Data
  • Web Scraping
  • APIs
  • Manual Data Acquisition
  • BI Tools

Public & Private Data

Public Data

One of the most accessible methods for obtaining data is through publicly or privately stored datasets found online. There are several open source datasets hosted online that allow you to freely download data collected by others, offering solutions to a wide range of data science and machine learning applications. These public sources of data are often suitable for small- to medium-sized machine learning projects, concept validation, and research purposes. Some of the most commonly visited are open dataset hubs such as Kaggle and the UCI Machine Learning Repository.

It is relatively easy to access a dataset from one of these sites with Python. You can download it to your local machine or use the Pandas library to read it directly from a URL. Pandas is one of the most commonly used Python libraries because it is so efficient for tasks such as data cleaning and pre-processing:

# Import pandas with alias
import pandas as pd

# Assign the dataset url as a variable
url = "https://raw.githubusercontent.com/shrikant-temburwar/Iris-Dataset/master/Iris.csv"

# Define the column names of dataset as a list
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Use read_csv to read in data as a pandas dataframe
df = pd.read_csv(url, names=columns)

# Check head of dataframe
print(df.head())

Private Data

There are also a number of private datasets that businesses curate themselves and that remain under the company’s ownership. For instance, Netflix’s database of user preferences powers its immense recommendation system. There are also services that allow you to purchase datasets, such as data markets like Data & Sons, or crowd-sourcing marketplaces such as Amazon’s Mechanical Turk, where one can outsource data acquisition needs like data validation, research, and survey participation. Private data is most often used in large, production-scale settings.

Pros:

  • Time: Readily available datasets can quickly move a project to the next phase of the Data Science Life Cycle.

  • Cost: Public datasets can cut costs of collecting data down to zero.

Cons:

  • Messy: Data can often come in forms that require intensive cleaning and modification.
  • Cost: Private services can lead to high costs in acquiring data.

Web Scraping

Web scraping can be one of the most potent methods of data acquisition when used effectively. At its core, web scraping is the act of extracting or copying information directly from a website. This data can then be stored in a dataframe or spreadsheet and used in the same manner as any other dataset. There are two general methods for web scraping: one where you manually scrape the data from a web page, and one where you employ a web crawler or bot that automates the process of extracting data from a website. Python has useful libraries for implementing both methods of web scraping; some of the most commonly used are BeautifulSoup, Selenium, and Scrapy. Web scraping is best used when the data you need is not available as a public or private dataset but is present on a web page. We typically employ web scraping techniques to acquire data for small to medium sized projects, but rarely in production, as this can raise ownership and copyright issues.

A unique aspect of web scraping is that, as it involves referencing a website’s source code, one should have some familiarity with the language that the website is written in. For example, when viewing the code for Codecademy’s home page we can see it is written in HTML, and if we would like to scrape information from the site we would need to understand the architecture and various tags of the site.

Let’s try a simple implementation of web scraping in Python to demonstrate its efficiency for acquiring data. Note that we scraped Codecademy’s home page in September 2020. The homepage might look different now!

# Import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

# Assign URL to variable
url = "https://www.codecademy.com/"

# Send request to download the data from URL
response = requests.request("GET", url)

# Create BeautifulSoup object
# Use HTML parser to parse the page's text
data = BeautifulSoup(response.text, 'html.parser')

# Print the first header of the page
print(data.html.h1)

# Instantiate list to append some content
content = []

# Use BeautifulSoup's find_all method to find all paragraph tags
words = data.find_all('p')

# Iterate through all paragraph tags
# append text to list with for loop
for word in words:
    content.append(word.text)

# Check content
print(content)

# Create dataframe of content with pandas DataFrame method
df = pd.DataFrame(content, columns=['Text'])

# Check scraped dataframe
print(df)
Index Text
0 Black Lives Matter. Find Resources or Show Your Support
1 By signing up for Codecademy, you agree to Codecademy’s Terms of Service & Privacy Policy.
2 No need to worry, we’ll help you make sense of it all.
3 From building websites to analyzing data, the choice is yours. Not sure where to start? We’ll point you in the right direction.
4 No matter your experience level, you’ll be writing real, working code in minutes.
5 Your code is tested as soon as you submit it, so you always know if you’re on the right track.
6 Apply your learning with real-world projects and test your knowledge with tailor-made quizzes.
7 Coding skills have never been more in-demand. Learn everything you need to take your career to the next level.

It’s important to note that web scraping is still considered to be in a gray area when it comes to ethics and legalities. Companies typically prefer not to be scraped because scraping can lead to misuse of their data, revenue loss, slowed user experience, and additional web infrastructure costs. As such, you may find barriers to web scraping, like the common CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) tests that ask you to complete a task that should be too difficult for a web crawler. For these reasons, it is best to ensure that any web scraping is done in a benevolent and balanced way that abides by contemporary law and does not negatively impact the website being scraped; one simple courtesy, sketched below, is to check the site’s robots.txt file first.
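
As one example of such a courtesy, the sketch below uses Python’s built-in urllib.robotparser to read a site’s robots.txt file, which declares which paths the site permits automated agents to crawl. The Codecademy URL from the earlier example is reused here; whether a given path may be fetched depends entirely on the site’s current policy.

# Check a site's robots.txt file before scraping it
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file and read it
robots = RobotFileParser()
robots.set_url("https://www.codecademy.com/robots.txt")
robots.read()

# Ask whether a generic crawler ("*") is allowed to fetch the homepage
print(robots.can_fetch("*", "https://www.codecademy.com/"))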

Pros:

  • Versatile: Highly adaptable method for acquiring data from the internet.
  • Scalable: Distributed bots can be coordinated to retrieve large quantities of data.

Cons:

  • Language Barrier: Scraping involves multiple languages, such as HTML, and requires knowledge of languages not typically used for data science.
  • Legality: Excessive or improper web scraping can be illegal, disrupt a website’s functionality, and lead to your IP address being blacklisted from the site.

APIs

Application Programming Interfaces, most often referred to as APIs, are another method that can be used for data acquisition. We can imagine APIs as a more polite and ‘by the books’ way of web scraping: in the process of acquiring data from a site, we request permission to access some data and wait for a response from the site to fulfill the request. Unlike web scraping, APIs are a means of communication between two different software systems. Typically this communication takes the form of an HTTP Request/Response Cycle, where a client (you) sends a request to a website’s server for data through an API call. The server then searches its databases for the particular data requested and responds to the client either with the data or with an error stating that the request cannot be fulfilled.

API requests
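
To make that cycle concrete, here is a minimal sketch using Python’s requests library; the endpoint URL and query parameter are placeholders for illustration only, not a real API.

# Minimal sketch of an HTTP Request/Response Cycle using the requests library
import requests

# Placeholder endpoint and parameter; a real API would document its own
url = "https://api.example.com/data"
params = {"query": "example"}

# The client sends a GET request to the server through an API call
response = requests.get(url, params=params)

# The server responds either with the requested data...
if response.status_code == 200:
    data = response.json()
    print(data)
# ...or with an error stating the request cannot be fulfilled
else:
    print(f"Request not fulfilled: status code {response.status_code}")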

We can see that being able to request data from a site is useful in the data acquisition phase of the Data Science Life Cycle. Not only do APIs assist sites and their users in remaining secure from malicious actors, they also help developers access data that could otherwise be difficult to acquire. For identification purposes, many APIs require the user to sign up and obtain an API key that uniquely identifies them and is used to maintain a record of all of their requests, as some APIs require payment for their usage. Suppose we would like to access recent weather data for the US for analysis. Rather than needing our own satellites or weather balloons to harvest the data ourselves, there are several weather-related APIs, such as OpenWeatherMap, AccuWeather, and the National Weather Service API, from which we can acquire the data. If we used OpenWeatherMap, for example, we would create an account on their site to receive an API key and subscribe to their free tier, which allows up to 60 API calls per minute. With that, we would be able to use Python to access OpenWeatherMap’s forecast data and analyze it, as in the sketch below.
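
A sketch of such a call might look like the following. For simplicity it uses OpenWeatherMap’s current-weather endpoint; the endpoint and query parameters (q, units, appid) reflect their public documentation at the time of writing and may change, and the API key is a placeholder to replace with your own.

# Minimal sketch of a call to OpenWeatherMap's current weather endpoint
# (endpoint and parameters are based on their public docs and may change)
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; replace with the key from your account
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "New York,US", "units": "imperial", "appid": API_KEY}

response = requests.get(url, params=params)

if response.status_code == 200:
    weather = response.json()
    # The JSON response includes a "main" section with temperature data
    print(weather["main"]["temp"])
else:
    print(f"Request failed with status code {response.status_code}")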

Pros:

  • User & Site Friendly: APIs allow security and management of resources for the sites that data is being requested from.
  • Scalable: APIs can allow for various amounts of data to be requested, up to production-scale volumes.

Cons:

  • Limited: Some functions or data may not be accessible via an API.
  • Cost: Some API calls can be quite expensive, leading to limitations of certain functions and projects.

Manual Data Acquisition

What can we do when the data we need to analyze is not present on the internet or able to be outsourced? In these situations, it can be useful to know how to harvest data yourself, and there are many tools available to do so. Perhaps you would like to conduct hypothesis testing to determine public sentiment about a particular good or service, or maybe you would like to analyze data about animal populations of a local habitat. Being able to acquire the data to conduct these analyses yourself is an invaluable skill to have.

Google Forms is a simple and free way to create surveys that can be shared with others to acquire data related to a particular population. With Google Forms, you can create surveys that include video and image files, multiple choice, and short answer questions, and that can be posted on social media, sent out in email lists, or even printed to manually harvest data. The data acquired can then be downloaded as a CSV file and used within Python. Google also offers Google Surveys, a paid service that allows you to reach a wider range of respondents and gives you more control in determining your target audience.
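
For instance, once a form’s responses are exported, loading them for analysis takes only a few lines of pandas; the file name and question column below are hypothetical stand-ins for whatever your own survey produces.

# Load survey responses exported from Google Forms as a CSV file
import pandas as pd

# Hypothetical export file name and question column
responses = pd.read_csv("survey_responses.csv")

# Count how often each answer was chosen for one of the questions
print(responses["How did you hear about us?"].value_counts())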

Devices like NVIDIA’s Jetson Nano and the Arduino Uno board are great for acquiring data from your local environment. With developer kits like these, you can create sensor systems that harvest data, or run machine learning models that can acquire even more complex data about the environment. When working on novel projects, you will most often need to acquire the necessary data yourself, and these are useful tools to be familiar with.
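
As a rough illustration, the sketch below reads values that a board like an Arduino might stream over its USB serial connection using the third-party pyserial package; the port name, baud rate, and the assumption that the board prints one numeric reading per line are placeholders that depend on your setup.

# Collect sensor readings streamed over a serial port
# (requires the third-party pyserial package: pip install pyserial)
import serial

# Port name and baud rate depend on your board and operating system
ser = serial.Serial("/dev/ttyACM0", 9600, timeout=1)

readings = []
for _ in range(10):
    line = ser.readline().decode("utf-8").strip()
    if line:  # skip empty reads when the timeout expires
        readings.append(float(line))  # assumes one numeric value per line

ser.close()
print(readings)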

Pros

  • Bespoke: Data acquired manually can be made to address all of the business objective’s needs and may need little to no cleaning to be effective.
  • Community: Manually acquired data can be useful in advancing the fields of data science and machine learning, and be further explored by others.

Cons

  • Time: Manually acquired data can take more time than other methods, especially when dealing with large datasets.
  • Cost: Some services and devices used to manually acquire data can be expensive and can be limiting for certain projects.

Data Acquisition for Business

All of the methods we have covered so far can be used in a business context, as long as they bring some value to the organization. Often, businesses need to utilize data acquisition to uncover actionable insights that improve internal processes or the goods and services that they offer. Many have likened data to the oil of the 21st century, and it is true that creative data acquisition is the source of success for many of the largest corporations. We can see this expressed in FAANG (Facebook, Amazon, Apple, Netflix, and Google), all of which are companies that heavily incorporate data acquisition and understanding of users into their business models.

The data that these corporations maintain is so complex that it is referred to as Big Data. Data like this cannot be stored on a single machine and must often be stored in the cloud, hosted on servers in data centers. The term Big Data refers not only to the sheer volume of the data, which can easily grow to the petabyte and exabyte levels, but also to the variety and velocity of the data. We often refer to these characteristics as the 3 Vs of Big Data:

The 3 Vs of Big Data

Google’s Chief Economist Hal Varian has listed the four key components of data acquisition in business as:

  • The drive toward more and more data extraction and analysis.
  • The development of new contractual forms using computer-monitoring and automation.
  • The desire to personalize and customize the services offered to users of digital platforms.
  • The use of the technological infrastructure to carry out continual experiments on its users and consumers.

Data of this magnitude needs tools that can manage it, some of the most popular being Google Analytics, Tableau, and Looker. These Business Intelligence, or BI, tools are able to create visualizations, aggregations, and various other manipulations of such large datasets, helping businesses make data-driven decisions, which are often the most effective and unbiased decisions to act upon.

Structured Query Language, or SQL, is another essential tool for the scale of data that businesses collect. Businesses usually house their large datasets in databases, and SQL is an efficient language for sharing and organizing data in databases, as it is simple and has a clear and concise format. It is designed for searching through and managing relational databases, which are essentially tables of data that have some relation to one another. For example, a company may have a database storing data about its millions of customers and another about its hundreds of products. With SQL, writing queries that retrieve information from both, such as which customers purchased a particular product, is relatively easy compared to other languages.
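
The sketch below shows the idea using Python’s built-in sqlite3 module and a small in-memory database; the customers, products, and purchases tables are invented purely for illustration.

# Query a toy relational database with SQL via Python's sqlite3 module
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
cur = conn.cursor()

# Hypothetical tables: customers, products, and purchases relating the two
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (customer_id INTEGER, product_id INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Charles');
INSERT INTO products VALUES (1, 'Notebook'), (2, 'Lamp');
INSERT INTO purchases VALUES (1, 2), (2, 1);
""")

# Which customers purchased the product named 'Lamp'?
cur.execute("""
    SELECT customers.name
    FROM customers
    JOIN purchases ON purchases.customer_id = customers.id
    JOIN products ON products.id = purchases.product_id
    WHERE products.name = 'Lamp';
""")
print(cur.fetchall())  # [('Ada',)]

conn.close()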

Ethics

In the contemporary data science space there is much ambiguity around the ethics of data acquisition. Definitions of privacy rights and ownership of data are still quite fluid and are an active area of study, both in research and in the political sphere. Ethical considerations that still need concrete and universal answers include:

  • Who owns the data uploaded to a website by users?
  • When and how should users of services be notified that data about them is being acquired?
  • What kinds of data should be restricted from being acquired about users?
  • How can users protect their privacy and know when it has been breached?

The step of data acquisition should necessarily involve fairness, transparency, and respect for those whom the data involves. Shoshana Zuboff has conducted much research into the social implications of current data acquisition trends and considers us to live in the age of Surveillance Capitalism, defined by the harvesting of personal data for commercial profit. In her book The Age of Surveillance Capitalism, Zuboff sheds light on the ethical and moral breaches that can occur when data acquisition is used to commodify personal data rather than for social good.

Even when acquiring data for socially beneficial reasons, we must be wary of the long-term effects our data can have. Social biases often lie dormant in data and can lead to an echoing of unfair treatment of certain groups. Facial recognition systems that have been trained on datasets biased with regard to race, gender, or age can lead to technological discrimination against users. Social datasets regarding homelessness, imprisonment, and poverty could miss key features, leading to poor policy and legislation and effectively working against any benevolent intention.

Part of thorough data acquisition is to consider the ethics of your data acquisition plan, ensuring that it is aligned with the privacy rights of others and that it promotes social inclusion and equal treatment of its user base.

Having covered the many methods of data acquisition, let’s review some of their characteristics with a multiple choice assessment.

Multiple Choice Assessment