How to build a Streamlit app with BeautifulSoup, SpaCy, Pandas and yfinance

This article is mostly about the difficulties I had and some tips for deploying a Streamlit app.

The app: https://share.streamlit.io/felipelx/finance_rss/Rss_Market.py

The code: https://github.com/felipeLx/finance_rss.git

The libraries used: yfinance, spacy, pandas, requests, bs4, streamlit, seaborn

Where I deployed: streamlit.io

The external links used:

The first problem is that yfinance is not very friendly with Brazilian assets: information like 'Open', 'Close', and other important fields is missing, so I checked some examples first to understand which fields are available everywhere. I decided to download all the assets listed on the B3 (the Brazilian exchange) to a CSV and use that as my Pandas DataFrame, so I can compare the asset names against the companies extracted from the RSS headlines using the functions available in SpaCy.

import pandas as pd
import spacy
import yfinance as yf

# Portuguese model, installed from the URL listed in requirements.txt
nlp = spacy.load('pt_core_news_sm')

def stock_info_from_yfinance(headings):
    stocks_df = pd.read_csv('./data/stocks.csv')  # every asset listed on the B3
    for title in headings:
        doc = nlp(title.text)  # run SpaCy NER over the headline
        for token in doc.ents:
            try:
                # does the entity match a company name from the B3 list?
                if stocks_df['Empresas'].str.contains(token.text).sum():
                    symbol = stocks_df[stocks_df['Empresas'].str.contains(token.text)]['Ativos'].values[0]
                    business = stocks_df[stocks_df['Empresas'].str.contains(token.text)]['Empresas'].values[0]
                    stock_info = yf.Ticker(symbol).info
            except Exception:
                continue  # skip entities with regex characters or without a ticker
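
For context, "headings" comes from the RSS feed parsed with BeautifulSoup. A minimal sketch of that step (the feed URL below is only a placeholder, not the one used in the repository):

import requests
from bs4 import BeautifulSoup

# placeholder feed URL; the real one is defined in the repository
rss = requests.get('https://example.com/market/rss.xml', headers={'User-Agent': 'Custom'})
soup = BeautifulSoup(rss.content, 'xml')   # the 'xml' parser (from lxml) handles RSS tags
headings = soup.find_all('title')          # bs4 Tags, so each one has a .text attribute

stock_info_from_yfinance(headings)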

When there is a match, the function appends the information to a dictionary; with that I can build the data that is shown on the app as a table. The data is also saved to a CSV that is used to build the second page.
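
A minimal sketch of that flow (the column names, sample values and CSV path below are illustrative, not the exact ones from the repository):

import pandas as pd
import streamlit as st

# hypothetical structure: one list per column, filled inside the matching loop above
stock_data = {'Ativo': ['PETR4.SA'], 'Empresa': ['Petrobras'], 'Preço': [30.15]}

# render the table on the first page and persist it for the second page
result_df = pd.DataFrame(stock_data)
st.dataframe(result_df)
result_df.to_csv('./data/market_news.csv', index=False)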

The second problem was finding a Brazilian web page with historical data for the assets. I found some, but a few take a long time to render the data, and that is not good for BeautifulSoup: the page needs to return the data quickly so we don't get errors while loading it. It was also necessary to transform the asset code into the pattern used by the selected web page with historical data.
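
The exact pattern depends on the page you pick; as a hedged example, a B3 code such as PETR4 usually needs the ".SA" suffix on pages that follow Yahoo's convention (the function and URL below are placeholders):

def to_page_symbol(b3_code):
    # hypothetical transformation; adapt it to the pattern of the chosen website
    return b3_code.strip().upper() + '.SA'   # e.g. 'PETR4' -> 'PETR4.SA'

url = 'https://example.com/quote/' + to_page_symbol('PETR4') + '/history'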

The third problem was understanding that not every page accepts a plain "requests.get" with just the URL. This part was a combination of three libraries: requests, bs4 and pandas. I found some tricks:

  • Requests: passing a "headers" dictionary with a "User-Agent" works; many sites refuse the default python-requests agent.
  • BeautifulSoup: always use "lxml" when scraping web pages.
  • BeautifulSoup: ".find" accepts any kind of attribute name through the "attrs" argument.
  • Pandas: when working with Brazilian numbers we can easily convert them using "decimal" and "thousands".
r_client = requests.get(url, headers={'User-Agent': 'Custom'})  # custom User-Agent so the request is not refused
soup_client = BeautifulSoup(r_client.content, 'lxml')
table_client = soup_client.find('table', attrs={'data-test': 'historical-prices', 'class': 'W(100%) M(0)'})
df = pd.read_html(str(table_client), decimal=',', thousands='.')[0]  # Brazilian decimal comma and thousands dot

The fourth problem was the date format, the classic headache for developers. The format I found on the external website was "dd" de "mmm." de "yyyy", literally: 13 de mai. de 2022. I can't understand why someone nowadays uses this pattern for dates on a website, but OK, we must do what we need to do. It will also be a problem for keeping the app's code updated, because I tried to replace every month reference, even the ones that are not in the extracted data, but it didn't work.

# convert dates like '13 de mai. de 2022' into '13/05/2022' before parsing
df['Data'] = df['Data'].str.replace(' de ', '/')
df['Data'] = df['Data'].str.replace('jun.', '06')
df['Data'] = df['Data'].str.replace('mai.', '05')
df['Data'] = df['Data'].str.replace('abr.', '04')
df['Data'] = pd.to_datetime(df['Data'], format="%d/%m/%Y")
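
A more general sketch, assuming the page always uses the Portuguese three-letter abbreviations followed by a period, is to map all twelve months instead of replacing them one by one:

meses = {'jan.': '01', 'fev.': '02', 'mar.': '03', 'abr.': '04',
         'mai.': '05', 'jun.': '06', 'jul.': '07', 'ago.': '08',
         'set.': '09', 'out.': '10', 'nov.': '11', 'dez.': '12'}

df['Data'] = df['Data'].str.replace(' de ', '/')
for abrev, numero in meses.items():
    df['Data'] = df['Data'].str.replace(abrev, numero, regex=False)
df['Data'] = pd.to_datetime(df['Data'], format='%d/%m/%Y')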

The fifth problem, which took longer than I expected, was deploying on streamlit.io, and the issue was how to build the requirements.txt. That step is very important and, for me, not well explained on the web. To build the requirements.txt we don't need to pin the versions of the libraries. Another important point: when you are trying to deploy, start with your requirements.txt almost empty, let Streamlit complain, and add the libraries one by one after each failed deploy.

seaborn
https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.3.0/pt_core_news_sm-3.3.0.tar.gz#egg=pt_core_news_sm==3.3.0
spacy
yfinance
beautifulsoup4
streamlit

In the end it was an amazing experience to use Streamlit with multiple pages, and I recommend it to all Python developers.