Together C & C++ Ngrams

This is a project exploring trends of words and phrases (ngrams) over time on the Together C & C++ Discord server. It was inspired by the Google Books Ngrams project.

About

On the Together C & C++ Discord server we have built up a database of messages sent on the server since it was created in 2017. This is something we did out of moderation necessity - we've had to inspect edit and deletion logs countless times. Having this data, though, provides a cool opportunity to analyze trends in it.

This repository contains two parts: the aggregation code in src/ and the application code in ui/ and server/.

The messages are stored in a MongoDB database used by the server's Discord bot, Wheatley. The aggregator code reads documents from MongoDB, excluding private channels, bot ids, and deleted messages, and then tokenizes the messages. Words are tokenized as alphanumeric strings of characters, allowing for _ as well as ' and - as long as they are not at the edges of words.

Two passes are performed over the database. The first computes totals for each ngram sequence, and any ngram sequences that occur fewer than 40 times are excluded. This first pass uses a lot of memory (the hash maps it builds keep counts for hundreds of millions of unique ngram sequences) and could be optimized later if needed, but for now it's fine. The second pass performs the actual aggregation, computing a count for each retained ngram sequence for every month. Ngram frequency is computed simply as count / total_tokens_for_month.

While this simple aggregation of frequency data of short phrases from a fully public message set should not pose privacy concerns, as a safety measure a small amount of artificial noise is added, +/-1% uniformly. This is done with RNGs that are seeded based on a hash of the ngrams and a secret nonce, which is very over-engineered but whatever.

Monthly frequency numbers are written to a DuckDB database, which I found was far faster than SQLite for the types of queries the application ends up doing: glob queries that don't lend themselves to indexing. DuckDB ends up being really good at these, while SQLite is more optimized for taking advantage of indices.
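The two-pass aggregation described above can be sketched roughly as follows. This is a minimal illustration in Python, not the repository's actual code: the token pattern, the maximum ngram length (`MAX_N`), the `(month, text)` input shape, and all function names are assumptions made for the example. In particular, it assumes _ is allowed anywhere in a word while ' and - are only allowed between other word characters.

```python
import re
from collections import Counter, defaultdict

# Assumed token pattern: runs of alphanumerics and "_", optionally joined
# by a single "'" or "-" that is never at the edge of the word.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+(?:['-][A-Za-z0-9_]+)*")
MIN_COUNT = 40  # threshold stated in the text
MAX_N = 3       # hypothetical maximum ngram length


def tokenize(text):
    """Split a message into word tokens."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, max_n=MAX_N):
    """Yield every ngram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def aggregate(messages, min_count=MIN_COUNT):
    """messages: a list of (month, text) pairs, e.g. ("2021-03", "hi all")."""
    # Pass 1: total count of every ngram over the whole history.
    totals = Counter()
    for _, text in messages:
        totals.update(ngrams(tokenize(text)))
    kept = {g for g, c in totals.items() if c >= min_count}

    # Pass 2: per-month counts for the kept ngrams, plus tokens per month.
    monthly_counts = defaultdict(Counter)
    monthly_tokens = Counter()
    for month, text in messages:
        tokens = tokenize(text)
        monthly_tokens[month] += len(tokens)
        for g in ngrams(tokens):
            if g in kept:
                monthly_counts[month][g] += 1

    # Frequency = count / total_tokens_for_month.
    return {
        month: {g: c / monthly_tokens[month] for g, c in counts.items()}
        for month, counts in monthly_counts.items()
    }
```

Keeping only the set of retained ngrams between the passes means the large table of totals is needed only during the first pass, which is where the text says most of the memory goes.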

Privacy

The data are frequencies of short phrases from messages sent in public channels in the Together C & C++ Discord server. Messages in private channels, messages sent by bots, and deleted messages are excluded. The frequency data are monthly aggregates for the whole server and thus are not tied to any particular user. Additionally, the following steps are taken to further ensure privacy:

Discord snowflakes (17 to 19 digit sequences used by Discord to uniquely identify everything from users to channels to emojis) are filtered out. These most commonly appear in user mentions, which appear textually as <@331718482485837825>.
Noise is added to results (+/-1%, uniformly). This measure is inspired by differential privacy but does not use the same level of mathematical rigor, since differential privacy is more tailored to other types of data sets.
Any words or phrases used fewer than 40 times throughout the server's history are excluded.
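As a rough illustration of the first two steps, here is a minimal Python sketch. It is not the project's implementation: the function names, the use of SHA-256, and the choice to draw one multiplicative noise value per month from a single RNG per ngram are all assumptions for the example. The source only states that the RNGs are seeded from a hash of the ngram and a secret nonce, and that the noise is +/-1% and uniform.

```python
import hashlib
import random
import re

# A token made up entirely of 17 to 19 digits is treated as a snowflake.
SNOWFLAKE_RE = re.compile(r"\d{17,19}")


def is_snowflake(token):
    return SNOWFLAKE_RE.fullmatch(token) is not None


def filter_tokens(tokens):
    """Drop snowflake-like tokens, e.g. the id inside <@331718482485837825>."""
    return [t for t in tokens if not is_snowflake(t)]


def add_noise(ngram, monthly_values, nonce):
    """Scale each monthly frequency by a factor drawn uniformly from [0.99, 1.01].

    The RNG is seeded from a hash of the secret nonce and the ngram
    (an assumed construction), so reruns produce identical noise.
    """
    digest = hashlib.sha256(nonce + b"\x00" + ngram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return [v * (1.0 + rng.uniform(-0.01, 0.01)) for v in monthly_values]
```

Seeding from the ngram and a nonce makes the noise reproducible across aggregation runs, so repeated runs cannot be averaged to cancel it out, while the secret nonce keeps the noise values from being recomputed by someone who only knows the ngram.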

Neither the raw nor aggregate data have been made available for download and I have no plans to change that.

Future work

Future work on this, if I do more, will likely center around performance exploration.


jeremy-rifkin/tccpp-ngrams
