Project still in progress

Through the lens of Filipino workers

A data science project that delves into the issue of labor struggles in the Philippines through exploratory analysis and natural language processing (NLP) of data sourced from the subreddit r/AntiworkPH.

Everyone needs a source of income to live comfortably;
for most Filipinos, it's about more than just making ends meet

In December 2023, the Philippines logged an unemployment rate of 3.1%, a decrease of 617,000 unemployed individuals from the previous year[1]. This improvement is being celebrated as the lowest unemployment rate in almost two decades, but beyond these statistics, the country is continuously plagued by pressing labor issues.

As of 2024, the current minimum wage in Metro Manila is set at ₱573-₱610, the highest of any region. In comparison, the minimum required to feed a family of five is ₱1,188 per day according to the IBON Foundation[2].

With such low wages, some choose to be separated from their families and endure poor job conditions to pursue employment outside the country, as the number of Overseas Filipino Workers reached its highest level in 55 years[3]. That is not to say that local workers are spared their share of labor exploitation. Contractualization, a common employment practice that hires workers on short-term contracts, leaves affected Filipinos without necessary benefits and job security.

Under the 17 Sustainable Development Goals (SDG) established by the United Nations General Assembly, specifically SDG 8, there is a need to “protect labor rights and promote safe and secure working environments for all workers, including migrant workers, in particular women migrants, and those in precarious employment.” For the country to achieve this goal, the issues faced by the Philippine workforce must first be properly recognized before efforts can focus on resolving them.

There's so much data on the Internet, and here's what we can do

The subreddit r/AntiworkPH provides workers a platform to vent their frustrations and, consequently, shed light on the current situation of the Philippine labor market.

As such, we seek to unravel which topics have plagued the Philippine workforce since the subreddit started in 2022. Through this, our group aims to raise awareness and hopefully provide a more realistic view of the Philippine labor environment.

Then—what are the prevalent topics about labor struggles submitted on r/AntiworkPH?

Hypothesis

The prevalent topics among the subreddit users centered around unfair contracts and job offerings in the Philippines.

Null Hypothesis

The prevalent topics among the subreddit users did not center around unfair contracts and job offerings in the Philippines.

Which of these topics receive the most Reddit engagements?

Hypothesis

The topic with the most engagements based on upvotes and comments is the same as the most prevalent topic.

Null Hypothesis

The topic with the most engagements based on upvotes and comments is different from the most prevalent topic.

What now?

Collect various submissions on r/AntiworkPH using the Reddit API.

Extract relevant topics using natural language processing.

Analyze the relationships of these topics to various Reddit metadata.


PART I

Data Collection

DESCRIBING THE DATA

We want to gather the data we need using the Python Reddit API Wrapper (PRAW). According to its documentation, we can access any subreddit, its list of submissions, and useful metadata for each submission. For this project, we specifically gathered the following metadata:

String: Timestamp, Content Type (text, video, image), Title, Content (caption if media content), Flair, Permalink, Submission ID

Integer: Upvotes Count, Comments Count

Float: Upvote:downvote Ratio

We kept track of the submission content type so that we can manually transcribe media submissions later on.
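
For reference, here is a minimal sketch of how those fields map onto PRAW's Submission attributes. It is an approximation rather than our exact collection code; in particular, the content_type label is inferred from is_self / is_video, which is an assumption about how media posts are distinguished.

```python
# Rough sketch: map one PRAW Submission to the metadata fields listed above.
def to_row(submission):
    # Assumption: text posts are is_self, videos set is_video, everything else is treated as image.
    if submission.is_self:
        content_type = "text"
    elif getattr(submission, "is_video", False):
        content_type = "video"
    else:
        content_type = "image"

    return {
        "timestamp": submission.created_utc,      # Unix time
        "content_type": content_type,
        "title": submission.title,
        "content": submission.selftext,           # empty for media posts; captions are transcribed manually
        "flair": submission.link_flair_text,
        "permalink": submission.permalink,
        "submission_id": submission.id,
        "upvotes": submission.score,
        "comments": submission.num_comments,
        "upvote_ratio": submission.upvote_ratio,
    }
```

Each resulting row can then be appended to a Pandas DataFrame for the cleaning steps described below.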

SCRAPING THE DATA: LIMITATIONS

In using PRAW, there are a couple of limitations to note:

1. Mining Limit – PRAW only allows us to collect up to 1000 submissions per request. This means we don't have much control over the data we get, as each request returns roughly the same 1000 submissions, so we can't consistently scrape unique submissions per request.

2. No Time Customizations – We can't filter submissions by specific posting times or time intervals, which unfortunately makes it difficult to minimize time bias.
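
The first limitation is easy to see in practice. Below is a minimal sketch, with placeholder credentials, that counts how many submissions a single listing actually yields:

```python
import praw

# Placeholder credentials; substitute your own Reddit app details.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="antiworkph-research (by u/your_username)",
)

# Even with limit=None (i.e., "as many as possible"), Reddit listings cap out
# at roughly 1000 items, which is the mining limit described above.
count = sum(1 for _ in reddit.subreddit("AntiworkPH").new(limit=None))
print(f"Submissions returned by the New listing: {count}")
```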

SCRAPING THE DATA: THE GAMEPLAN

Now, here's the plan (a sketch of the full scraping loop follows the list):

1. To sample the subreddit, we want to scrape 1000 submissions per category: Hot, New, Rising, Controversial, and Top. Upon scraping, we found that Rising is a subset of Hot, which left us with Hot, Top, New, and Controversial. For Top, we picked the All Time filter so that we could also capture submissions older than the past year, i.e., before April 2023.

We opted not to use the search function since one of the objectives of our project is to discover those relevant keywords in the first place.

We also opted not to scrape the comments, as they are heavily dependent on the context of the main post; text with little to no standalone context may only cause the model to overfit.

Note that due to the mining limitation, we are not able to fully reach 1000 submissions per category as submissions are not unique to one category.

2. To avoid duplicates, we want to use Reddit's Save function to mark the submissions we've already scraped.

3. To save time, we want to skip non-text submissions with fewer than 10 upvotes, since transcribing them may be counterproductive.

4. Then, as much as possible, we need to minimize time bias. Hence, we won't scrape submissions that are 10 days old or younger, as they may have yet to peak in engagements.
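
Here is a minimal sketch of that scraping loop, again with placeholder credentials. It deviates from the plan in one way: instead of Reddit's Save function, it tracks already-seen submission IDs in a set so that it can run read-only. Rising is omitted since it turned out to be a subset of Hot.

```python
import time

import praw

# Placeholder credentials; substitute your own Reddit app details.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="antiworkph-research (by u/your_username)",
)

subreddit = reddit.subreddit("AntiworkPH")
TEN_DAYS = 10 * 24 * 60 * 60  # seconds

# Four listings sampled, up to 1000 submissions each (Rising is a subset of Hot).
listings = {
    "hot": subreddit.hot(limit=1000),
    "new": subreddit.new(limit=1000),
    "top": subreddit.top(time_filter="all", limit=1000),
    "controversial": subreddit.controversial(time_filter="all", limit=1000),
}

seen_ids = set()  # stand-in for marking submissions with Reddit's Save function
rows = []

for category, listing in listings.items():
    for submission in listing:
        if submission.id in seen_ids:
            continue  # already scraped under another category
        if time.time() - submission.created_utc <= TEN_DAYS:
            continue  # 10 days old or younger; engagement may not have peaked
        if not submission.is_self and submission.score < 10:
            continue  # low-upvote media post; not worth transcribing
        seen_ids.add(submission.id)
        # Full metadata extraction as sketched in the section above; trimmed here.
        rows.append({"category": category, "id": submission.id, "title": submission.title})

print(f"{len(rows)} unique submissions collected")
```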

Executing the gameplan gave us...

2541 scraped submissions!

But now comes the hard part—cleaning the data.

CLEANING THE DATA

1. Using Pandas, RegEx, and BeautifulSoup4 on Jupyter Notebook, we want to:

Remove URLs, duplicates, emojis, and submissions with empty cells in any column except Content (to be transcribed later on) and Flair.

Convert Content to plaintext as it is retrieved as markdown. This also includes replacing newlines and NaNs with a single whitespace.

Expand Filipino slang, text abbreviations, and corporate jargon, such as converting "WFH" to "Work from home" or "Charot" to "Just kidding". Note, however, that we can only cover a handful of these (the list is available on the GitHub repository). This also raises the possibility of incorrectly expanding an acronym, but doing so is likely more beneficial overall since it heavily contextualizes the text.

Concatenate Title and Content of every submission into one cell under a new column Title+Content.

2. We then want to manually transcribe the image and video content. To save time, we shortlist the 100 most engaged submissions from each category, since the effort-to-value ratio of transcribing everything may be low.

3. Translate Title+Content to English using the Google Translator API.

4. Lastly, lemmatize Title+Content and remove punctuation, numbers, and stop words using the Natural Language Toolkit (NLTK) and Pandas. A sketch of the whole cleaning pipeline is shown below.
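
The following is a condensed sketch of that pipeline, under a few assumptions: the slang dictionary is excerpted to two entries (the full list is in the repository), the markdown package plus BeautifulSoup handle the markdown-to-plaintext step, and deep-translator's GoogleTranslator stands in for the "Google Translator API" mentioned above. File and column names are hypothetical.

```python
import re

import markdown
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from deep_translator import GoogleTranslator
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

# Excerpt of the slang/jargon dictionary; the full list lives in the GitHub repo.
SLANG = {"wfh": "work from home", "charot": "just kidding"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # crude emoji/symbol stripper

def markdown_to_text(md: str) -> str:
    """Render markdown to HTML, then strip the tags with BeautifulSoup."""
    html = markdown.markdown(md or "")
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")

def basic_clean(text: str) -> str:
    """Drop URLs and emojis, collapse newlines and extra whitespace."""
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def expand_slang(text: str) -> str:
    """Replace known slang and abbreviations with their expanded forms."""
    for word, meaning in SLANG.items():
        text = re.sub(rf"\b{re.escape(word)}\b", meaning, text, flags=re.IGNORECASE)
    return text

# Step 1: load, deduplicate, and drop rows with empty cells (except Content and Flair).
df = pd.read_csv("submissions.csv")  # hypothetical file name
df = df.drop_duplicates(subset="submission_id")
required = [c for c in df.columns if c not in ("content", "flair")]
df = df.dropna(subset=required)
df["content"] = df["content"].fillna(" ").apply(markdown_to_text)
df["title+content"] = (df["title"] + " " + df["content"]).apply(basic_clean).apply(expand_slang)

# Step 3: translate to English (slow in practice; the service imposes rate limits).
translator = GoogleTranslator(source="auto", target="en")
df["title+content"] = df["title+content"].apply(translator.translate)

# Step 4: lemmatize and drop punctuation, numbers, and stop words.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def lemmatize(text: str) -> str:
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

df["title+content"] = df["title+content"].apply(lemmatize)
df.to_csv("preprocessed_submissions.csv", index=False)
```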

We then finally get...

2541 preprocessed submissions

We're still in the process of transcribing media content


PART II

Data Exploration