What Are We Building?
In this project series, we’re going to build a REST API that exposes resources related to stocks. In this post, we’re going to generate a set of data for the API, so that we have something to play with for the rest of the project. We’re generating data for two reasons: first, I don’t have real data that I can give you, and second, learning how to generate fake data can help you test ideas and build projects. If you’re interested in real data, you’ll need to research data providers and will likely need to pay for it.
You may find the code in the following repository:
The API
We’re going to build an API that exposes generated financial data about fake company stocks. It will provide historical stock prices and precomputed calculations like simple moving averages (SMA) and the relative strength index (RSI). The endpoints will allow a client to query historical data on companies, and we’ll also build a small front-end to chart that data.
The API will have the following endpoints:
- /api/v1/historical/{ticker}
- /api/v1/indicators/sma/{ticker}
We’ll add some query parameters to these endpoints later, but for now, this is the high-level view of the API.
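Just to give a feel for how a client might eventually call these endpoints, a request could look something like the sketch below. The base URL and the query parameters here are placeholders of my own; we’ll define the real ones when we actually build the API.

```python
import requests

# Hypothetical base URL and query parameters, for illustration only.
BASE_URL = "http://localhost:8000"

response = requests.get(
    f"{BASE_URL}/api/v1/historical/ABCD-fake",
    params={"start_date": "2020-01-01", "end_date": "2020-12-31"},
)
print(response.json())
```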
Why Use Fake Data?
One of the great things about software is that we can abstract ideas and concepts and run simulations. This extends to data as well: code lets us simulate processes and generate realistic data sets that we can experiment with and build projects on. Obviously, assumptions get baked in when we generate our own data, but if you’re learning to program, or can’t afford real data at the moment, then generating fake data is a great way to go.
Generating Fake Data
Since we’re building a simple stock API, our fake ticker data will have the following attributes:
- date
- symbol
- open
- high
- low
- close
- volume
To generate it, we’re going to create 4000 fake company tickers and then generate daily data from 2000-01-01 to 2024-01-01, for weekdays only. This will give us a data set with roughly 25 million rows that we can play with and build our API from.
- Symbols will be random, unique four-letter string identifiers (the generator also appends a "-fake" suffix so they can’t be mistaken for real tickers).
For our prices, we need to consider how they might fluctuate over time. To do this, for each stock we’ll generate a random starting price and then apply a random daily change that can move the price up or down. We’ll also give each stock a random average daily volume and adjust it each day.
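To make that concrete, here’s a minimal sketch of the random-walk idea on its own. The ±2% bound and the $100 starting price are just example values; the full script further down does this per ticker.

```python
import random

def random_walk_prices(starting_price, num_days, daily_fluctuation=0.02):
    """Each day's price moves by a random amount within ±daily_fluctuation of the previous close."""
    prices = [starting_price]
    for _ in range(num_days - 1):
        change = random.uniform(-daily_fluctuation, daily_fluctuation)
        prices.append(prices[-1] * (1 + change))
    return prices

# Example: simulate five trading days starting at $100.
print(random_walk_prices(100.0, 5))
```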
Since this is a project about building an API and a front-end to consume it, I don’t want to spend too much time building the most accurate data simulation possible. We just need data to play with and build a project on. But here are some key ideas about generating data:
- We can generate large amounts of data, which can help us prototype products or test ideas.
- Generating data can be cheaper than buying data.
- You’re limited by your computing power (and storage).
- Consider the assumptions made about your data:
  - What dates are realistic?
    - Monday to Friday (weekends are skipped).
    - Do we want to consider holidays across the years?
      - We’ll leave that as an exercise for you, given that you might be modelling one of the many exchanges around the world!
      - In general, you’d need a list of holiday dates for each given year and skip those dates when generating your data (see the sketch after this list).
  - Starting stock prices.
  - Daily fluctuation in stock prices and volume.
    - Is it realistic to apply a uniform daily price fluctuation?
- We can adjust our data by using different types of parameters and assumptions, giving us a flexible model.
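For the holiday point above, here’s a minimal sketch of how date generation could skip a known holiday list. The holiday set is purely illustrative; a real build would load a proper exchange calendar.

```python
from datetime import datetime, timedelta

# Purely illustrative holiday dates; a real build would load an exchange calendar.
HOLIDAYS = {"2023-12-25", "2024-01-01"}

def generate_trading_dates(start, end, date_format="%Y-%m-%d"):
    """Yield weekday dates between start and end, skipping any date in HOLIDAYS."""
    current = datetime.strptime(start, date_format)
    end_date = datetime.strptime(end, date_format)
    while current <= end_date:
        formatted = current.strftime(date_format)
        if current.weekday() < 5 and formatted not in HOLIDAYS:
            yield formatted
        current += timedelta(days=1)

print(list(generate_trading_dates("2023-12-22", "2024-01-02")))
```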
Python Scripting
Firstly, we need to create a class to help us generate our data. It’s good to keep in mind that we want a flexible way to change our data parameters, so that we can alter the data between different builds. This is the script:
import csv
import os
import random
import string
from datetime import datetime, timedelta
class StockDataGenerator:
    def __init__(
        self,
        daily_price_fluctuation=0.02,
        daily_volume_fluctuation=0.05,
        date_format="%Y-%m-%d",
        num_companies=4000,
        start_date="2000-01-01",
        end_date="2024-01-01",
        csv_output_file="./output/tickers.csv",
        chunk_size=500000,
        volume_adjustment_factor=5
    ):
        self.daily_price_fluctuation = daily_price_fluctuation
        self.daily_volume_fluctuation = daily_volume_fluctuation
        self.date_format = date_format
        self.num_companies = num_companies
        self.start_date = datetime.strptime(start_date, self.date_format)
        self.end_date = datetime.strptime(end_date, self.date_format)
        self.chunk_size = chunk_size
        self.csv_output_file = csv_output_file
        self.volume_adjustment_factor = volume_adjustment_factor
        # Make sure the output directory exists before we start appending to the CSV.
        os.makedirs(os.path.dirname(self.csv_output_file) or ".", exist_ok=True)
    def generate_raw_data(self):
        # The company count and date range come from the constructor arguments.
        tickers = self._generate_fake_tickers()
        dates = self._generate_dates()
        for ticker_count, ticker in enumerate(tickers, start=1):
            daily_ticker_data = self._generate_stock_data(ticker, dates)
            self._write_to_csv(daily_ticker_data)
            print(f"Generated data for: {ticker} ({ticker_count} of {len(tickers)})")
    def _generate_dates(self):
        dates = []
        current_date = self.start_date
        while current_date <= self.end_date:
            # Keep weekdays only (Monday-Friday); weekends are skipped.
            if current_date.weekday() < 5:
                dates.append(current_date.strftime(self.date_format))
            current_date += timedelta(days=1)
        return dates
    def _generate_fake_tickers(self):
        # Track already-used tickers in a set for fast uniqueness checks.
        companies = []
        seen = set()
        while len(companies) < self.num_companies:
            ticker = "".join(random.choices(string.ascii_uppercase, k=4))
            if ticker not in seen:
                seen.add(ticker)
                companies.append(ticker)
        return companies
    def _generate_stock_data(self, symbol, dates):
        num_days = len(dates)
        starting_price = random.uniform(5, 500)
        prices = [starting_price]
        volumes = []
        avg_daily_volume = random.randint(50000, 2000000)
        for i in range(1, num_days):
            # Simulate price using a random walk.
            negative_change = -self.daily_price_fluctuation
            positive_change = self.daily_price_fluctuation
            price = prices[i - 1] * (1 + random.uniform(negative_change, positive_change))
            prices.append(price)
            # Simulate volume with some daily variability.
            negative_change = -self.daily_volume_fluctuation
            positive_change = self.daily_volume_fluctuation
            daily_volume = avg_daily_volume * (1 + random.uniform(negative_change, positive_change))
            # Adjust volume based on price change magnitude.
            price_change = abs(prices[i] - prices[i - 1]) / prices[i - 1]
            volume_adjustment = 1 + price_change * self.volume_adjustment_factor
            daily_volume *= volume_adjustment
            volumes.append(int(daily_volume))
        stock_data = []
        for i in range(num_days):
            open_price = prices[i] * random.uniform(0.95, 1.05)
            high_price = max(open_price, prices[i] * random.uniform(1.00, 1.10))
            low_price = min(open_price, prices[i] * random.uniform(0.90, 0.99))
            close_price = prices[i]
            # volumes has one fewer entry than prices, so reuse the average on the first day.
            volume = volumes[i - 1] if i > 0 else avg_daily_volume
            stock_data.append({
                "date": dates[i],
                "symbol": f"{symbol}-fake",
                "open": f"{open_price:.3f}",
                "high": f"{high_price:.3f}",
                "low": f"{low_price:.3f}",
                "close": f"{close_price:.3f}",
                "volume": volume
            })
        return stock_data
    def _write_to_csv(self, data):
        # newline="" stops the csv module writing blank lines on Windows.
        with open(self.csv_output_file, "a", newline="") as file:
            fieldnames = ["date", "symbol", "open", "high", "low", "close", "volume"]
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            if file.tell() == 0:
                # New (empty) file, so write the header row first.
                writer.writeheader()
            writer.writerows(data)
if __name__ == "__main__":
    generator = StockDataGenerator()
    generator.generate_raw_data()
It’s good to review the generate_raw_data method. At a high level we:
- Generate the stock tickers.
- Generate the dates.
- Iterate over the tickers, generating a random starting price and the subsequent daily prices for each one.
- Write each ticker’s data to the CSV.
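Because the parameters live in the constructor, it’s easy to do a smaller trial run before committing to the full 24-year, 4000-company generation. For example:

```python
# A quick trial run: fewer companies over a shorter date range,
# written to a separate output file.
generator = StockDataGenerator(
    num_companies=50,
    start_date="2023-01-01",
    end_date="2023-06-30",
    csv_output_file="./output/tickers_sample.csv",
)
generator.generate_raw_data()
```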
Data Output
After using this data generator, the output CSV file has roughly 25 million rows and is about 1.5GB in size. This isn’t a very large data set, but a sample with over 20 million rows is a pretty good start to the project. It’s enough data that we can actually build a small API to test and play with.
wc -l tickers.csv
25044001 tickers.csv
I’ve also loaded the data into a SQLite database. SQLite support comes built into Python via the sqlite3 module in the standard library, but you can obviously use any database that you like!
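I won’t walk through the loading step in detail here, but as a rough sketch, one way to get the CSV into SQLite with the standard library looks like this. The database filename and table name are my own placeholders, not necessarily what the repository uses.

```python
import csv
import sqlite3

# Assumed file and table names, for illustration; the repository may differ.
CSV_FILE = "./output/tickers.csv"
DB_FILE = "./output/stocks.db"

def load_csv_into_sqlite(csv_file=CSV_FILE, db_file=DB_FILE):
    """Create a tickers table and bulk-insert rows from the generated CSV."""
    conn = sqlite3.connect(db_file)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tickers (
               date TEXT, symbol TEXT, open REAL, high REAL,
               low REAL, close REAL, volume INTEGER
           )"""
    )
    with open(csv_file, newline="") as f:
        reader = csv.DictReader(f)
        rows = (
            (r["date"], r["symbol"], float(r["open"]), float(r["high"]),
             float(r["low"]), float(r["close"]), int(r["volume"]))
            for r in reader
        )
        conn.executemany("INSERT INTO tickers VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_csv_into_sqlite()
```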
Next Steps
Please review the code sample provided. It contains the rest of the code that I didn’t cover in the post.
In our next post, we’ll look at generating the technical indicators from this base dataset.