Using web-scraped data to measure sustainability in the fashion industry
Yet another use-case for web-scraped data
A few years ago, everyone in the asset management industry was talking about ESG (Environmental, Social, and Governance). Today, it seems to be less of a hot topic, but it certainly hasn't been forgotten. A major challenge with ESG is the lack of a universally accepted standard. It's highly subjective, and it's not uncommon for ESG providers to have completely different views on the same company.
Because of this, ESG can sometimes feel like just another box to check for asset managers. They'll follow it if required by an investment mandate, but beyond that, it only really grabs attention if it can generate alpha. Yet, ESG issues are still widely discussed, especially with big news like the potential IPO of Shein. Many people are questioning Shein's ESG credentials, which mirrors broader concerns about sustainability and social impacts that persist throughout the fashion industry.
One of the websites that we are collecting data from is the online marketplace End Clothing (endclothing.com). It predominantly sells products from luxury fashion companies like LVMH (EPA: MC), Kering (EPA: KER), Prada (HKG: 1913), Moncler (BIT: MONC), Canada Goose (TSE: GOOS), and many others, and generated turnover of £221.1m (~$282.6m) in 2023. This platform is where you would go to spend $500 on a t-shirt, so the market is very different from the fast fashion market. Still, ESG concerns are applicable and are commonly discussed in the press.
The main purpose of collecting this data is to track stock inventory levels and infer brand-level sales trends. This approach can provide unique insights into the relative performance dynamics of different luxury brands, insights that are typically hard to source from anywhere else. But I was always curious whether, in addition to looking at business performance, we could also look at the same data from the ESG angle. We have data on 44,000 products from 760 brands, and each product contains a description that looks like this:
The description includes material composition and sometimes other potentially useful information, such as whether an item is machine washable. This gave me an idea for a small experiment: assess the use of sustainable materials by brands using this product information. While materials represent just one aspect of a company's ESG story, they can be an important part. For instance, materials like nylon, which are non-biodegradable, present specific environmental challenges.
At this point I have to set expectations with the readers: this experiment is not meant to be viewed as rigorous scientific research. One significant limitation is the sample size, particularly for global brands like Adidas and Nike, where we’re examining only a small subset of their overall product range. My aim was simply to explore whether the data we've collected could be relevant in an ESG context.
With the data we collect, all product-related information is presented in an unstructured product description field, which consists of raw text with HTML tags. I needed to come up with a method to parse fabric types and their percentages from this field. Since all our data is stored and analysed in Snowflake, I decided to explore Snowflake's new Cortex AI capability. Specifically, I experimented with using the SNOWFLAKE.CORTEX.EXTRACT_ANSWER() function to extract material composition.
I tried several approaches and eventually succeeded in extracting the information using this method:
SELECT
SNOWFLAKE.CORTEX.EXTRACT_ANSWER('<p>The perfect companion to iconic Ultimate Leggings? These Chunky Cotton Rib Socks, of course! With their super-soft fabric blend and signature branding, this everyday upgrade is one you’ll reach from the weekday grind to the weekend workout. We won’t blame you for adding a couple of pairs to the basket…</p>
<ul>
<li>85% Cotton, 10% Polyester, 5% Spandex</li>
<li>Woven Branding</li>
<li><a href="https://www.endclothing.com/women/brands/socks">Shop All Socks</a></li>
</ul>',
'Extract the percentage amount of cotton mentioned in the description')
...
=>
[
{
"answer": "85 %",
"score": 0.60291725
}
]
However, I didn’t know in advance which fabrics a description might contain, and I needed a set of values (one for each type of fabric), so I decided to use a much simpler solution. While the description field is unstructured, it contains some common elements; for instance, materials are always enclosed within <li></li> tags. I was looking for a quick, one-off solution, and using ChatGPT-generated regex with some manual adjustments turned out to be the simplest and fastest way to get the data. This regex approach would not have worked if the data had been collected from multiple different websites. In such cases, we would have to come up with a more generic solution, potentially leveraging some of Snowflake’s new AI capabilities.
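As a rough illustration of the regex approach, here is a minimal Python sketch. It is not the regex used for the analysis: the patterns, the function name extract_composition, and the choice of Python rather than Snowflake SQL are all assumptions made for the example. It relies only on the observation above that materials sit inside <li></li> tags.

```python
import re
from html import unescape
from typing import Dict

# Illustrative patterns (assumptions, not the regex used in the article).
LI_RE = re.compile(r"<li>(.*?)</li>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
# Matches pairs such as "85% Cotton" or "10 % Polyester".
MATERIAL_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*([A-Za-z][A-Za-z \-]*[A-Za-z])")


def extract_composition(description_html: str) -> Dict[str, float]:
    """Return {material: percentage} found inside <li> items of a description."""
    composition: Dict[str, float] = {}
    for item in LI_RE.findall(description_html):
        # Drop nested tags such as <a> and decode HTML entities.
        text = unescape(TAG_RE.sub("", item))
        for pct, name in MATERIAL_RE.findall(text):
            composition[name.strip().lower()] = float(pct)
    return composition
```

List items without a percentage, such as "Woven Branding", simply produce no match. A pattern like this is tied to one site's markup, which is why it would not carry over to other websites unchanged.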
The other problem I had to solve was determining the sustainability of each material. This is a tricky problem, especially for someone who has zero domain knowledge and whose expertise is limited to reading one and a half Wikipedia articles on the subject.
So my very non-scientific solution was to use an open dataset that lists the material composition of 276 clothing products from Inditex, a large European retailer known for brands like Zara. Each product was tagged according to Inditex’s own sustainability program, called 'Join Life.' By calculating the average weight of each material across all products, assigning a positive weight to those in the 'Join Life' program and a negative weight to others, I developed a list of core materials weighted by their sustainability.
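One possible reading of this weighting step, sketched in Python: each material's share of a product counts as positive if the product carries the 'Join Life' tag and negative otherwise, and the signed shares are averaged over all products. The function name material_weights and the input layout are hypothetical, and the exact formula used for the analysis may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input layout: (composition as fractions of the garment,
# whether the product is tagged 'Join Life').
Product = Tuple[Dict[str, float], bool]


def material_weights(products: List[Product]) -> Dict[str, float]:
    """Average signed share of each material across all products."""
    totals: Dict[str, float] = defaultdict(float)
    for composition, join_life in products:
        # Positive contribution for 'Join Life' products, negative otherwise.
        sign = 1.0 if join_life else -1.0
        for material, share in composition.items():
            totals[material] += sign * share
    n = len(products)
    return {m: t / n for m, t in totals.items()} if n else {}
```

Under this reading, a material that appears mostly in 'Join Life' products ends up with a positive weight, and one that appears mostly in untagged products ends up with a negative weight.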
I then combined these weights with the material compositions extracted from the product descriptions to perform a comparative analysis. This allowed me to rank brands based on the sustainability of the materials used.
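The combination step could look roughly like the Python sketch below. It assumes a product's score is the weighted sum of its material percentages and a brand's score is the plain average over its products. The names product_score and rank_brands, and the rule that unknown materials contribute zero, are assumptions made for the example rather than details taken from the analysis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def product_score(composition: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of material percentages; unknown materials count as 0."""
    return sum(pct / 100.0 * weights.get(material, 0.0)
               for material, pct in composition.items())


def rank_brands(products: List[Tuple[str, Dict[str, float]]],
                weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Average product score per brand, highest first."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for brand, composition in products:
        scores[brand].append(product_score(composition, weights))
    averages = {brand: sum(v) / len(v) for brand, v in scores.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
```

Averaging per product treats every SKU equally, so a brand with only a handful of SKUs in the sample can move a long way on the strength of one or two items.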
Some comments on the results:
I decided it was best not to compare apples to oranges (literally), so I limited my analysis to some specific categories: 'T-shirts,' 'Hoodies/Sweats,' and 'Trousers.' To ensure the products were comparable, I excluded a long list of other categories. Additionally, I focused on the major players by only including the top 20 brands based on the number of distinct SKUs in these categories. Ultimately, my analysis was conducted on a relatively small dataset comprising 4,460 SKUs. These SKUs were only collected from one website, and represent only a fraction of the data that we are collecting. There is an opportunity to significantly expand the scope of the analysis.
The results mostly depend on the ratio of cotton to polyester and nylon, the three most commonly used materials. Brands that ranked highest predominantly used a higher percentage of cotton compared to synthetic materials:
I have explicitly excluded any recycled materials from the analysis. I would assume that recycled polyester does not have the same negative effects as “regular” polyester, but with the data that I have I couldn’t quantify the difference.
It would be interesting to investigate how changes in fabric usage over time affect the Cost of Goods Sold (COGS) for some of these companies. Fabric costs constitute a significant component of COGS for clothing brands. Any fluctuations in the prices of raw materials like cotton or polyester can directly impact COGS and could provide some insight into profit margins.