
Transform Raw Data into Analyzable Information Using VS Code

Learn how to use regular expressions and the Select All Occurrences of Find Match command in VS Code to extract data from web pages and organize it into structured information for further analysis.

Introduction

In modern web design, most components are modular, resulting in repeated structures like task lists, product information, or navigation menus. These consistent frameworks simplify extracting data in a structured format.

Examples:

  1. CoolPC CPU Product Page: How can we quickly extract price details for products?
  2. Agile development Kanban tasks: How can we consolidate the tasks relevant to your annual objectives?
  3. Other use cases: Searching across multiple files for specific content and parsing it.

These scenarios can be addressed by efficiently extracting necessary data in VS Code, enabling further analysis.

This article uses practical examples to demonstrate how to leverage regular expressions in VS Code to extract meaningful information from raw data and transform it into analyzable formats.

Preparation

Install VS Code

Ensure VS Code is installed on your computer. If not, refer to the Visual Studio Code section in “Awesome Windows - Essential Productivity Software Installation and Guide”.

Common Regular Expressions

In VS Code, we use the search functionality to find strings that match specific patterns. By utilizing regular expressions, we can select the content we need. Below is a compilation of commonly used regular expressions that every developer should be familiar with:

| Description | Regex | Example | Explanation |
| --- | --- | --- | --- |
| Match any character | . | a.c matches abc or a3c | Any single character (excluding line breaks) |
| Match one or more times | + | a+ matches a or aaa | Matches at least 1 occurrence |
| Match zero or more times | * | a* matches aaa, a, or an empty string | Matches 0 or more occurrences |
| Match zero or one time | ? | a? matches a or an empty string | Matches at most 1 occurrence |
| Match start of a line | ^ | ^hello matches hello world | Matches start of a line |
| Match end of a line | $ | world$ matches hello world | Matches end of a line |
| Match digits | \d | \d+ matches 123 or 56 | Matches digits 0-9 |
| Match alphanumeric | \w | \w+ matches hello123 | Includes letters, digits, and underscores |
| Match specific counts | {n} | a{3} matches aaa | Matches exactly n occurrences |
| Match at least n times | {n,} | a{2,} matches aa or aaaa | Matches at least n occurrences |
| Match range of times | {n,m} | a{2,4} matches aa or aaa | Matches at least n but no more than m occurrences |
| Match from set | [abc] | [abc] matches a, b, or c | Matches characters from the set |
| Exclude set | [^abc] | [^abc] does not match a, b, or c | Matches characters not in the set |
| Match new line | \n | a\nb matches a + newline + b | Matches new line character |
| Lazy matching | *?, +?, ??, {n,m}? | a+? matches a (non-greedy) | Prioritizes minimum matches |
| Grouping | () | (ab)+ matches abab or ab | Groups content for reuse |
| Escape special chars | \ | \[ matches [ or \) matches ) | Escapes special regex characters |

There are also some more specialized regular expressions, such as lookaheads, word boundaries, and non-digits, which are less commonly used in searches and will not be covered here. For those interested, refer to Microsoft’s documentation.
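To get a feel for a few of the rules in the table, they can be tried outside the editor. The sketch below uses Python's re module, which is a different regex engine from the one behind VS Code's search box; the constructs shown here should behave the same way in both. The sample strings are made up for illustration.

```python
import re

# Greedy vs. lazy: .* grabs as much as possible, .*? as little as possible.
text = "<b>one</b><b>two</b>"
greedy = re.search(r"<b>.*</b>", text).group()
lazy = re.search(r"<b>.*?</b>", text).group()
print(greedy)  # <b>one</b><b>two</b>
print(lazy)    # <b>one</b>

# {n,m}: between 2 and 4 repetitions of "a"; a lone "a" does not match.
counted = re.findall(r"a{2,4}", "a aa aaaaa")
print(counted)  # ['aa', 'aaaa']

# [^abc]: runs of characters that are NOT a, b, or c.
excluded = re.findall(r"[^abc]+", "abcxyzcab")
print(excluded)  # ['xyz']

# \d+ for digits, and \[ \] to match literal square brackets.
digits = re.findall(r"\d+", "NT20950 and NT4990")
bracketed = re.search(r"\[\d+\]", "item [42]").group()
print(digits)     # ['20950', '4990']
print(bracketed)  # [42]
```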

Extracting Product Information

This example uses the CoolPC CPU page, which displays product names and pricing information. How can we extract product names and prices from the webpage and convert them into an Excel file for further analysis? Below are the detailed steps and workflow:

Retrieve Web Page Source Code

  1. Navigate to the CoolPC CPU Product Page.

  2. Right-click on any blank area of the webpage and select View Page Source, or use the shortcut Ctrl + U.

    [Figure: Right-click and select View Page Source]

  3. Use Ctrl + A to select all the page source content, then Ctrl + C to copy it.

  4. Paste the copied content into VS Code using Ctrl + V.

Search Target

In VS Code, open Find with Ctrl + F. A search bar appears at the top-right of the editor with a toggle for regular expressions (the .* button, or Alt + R). This toggle must be enabled for the search term to be treated as a regular expression.

[Figure: Performing a search in VS Code]

In the source, the product name is enclosed within <div class=t> and </div>. The price is on the next line, which starts with <div class=x> followed by a tax-included (含稅) or tax-excluded (未稅) label and then NT plus the price digits.

<div class=w>QU1EIFI5IDk5NTBYpU6yerKwuMuhaTE2rtYvMzK6/KFqNC4zRyih9DUuN0cpMTcwVy+o41JETkGkusXjt2aqTzEyLzI4ukmk7qFJ</div>
<span onclick='Show(this)'><img src='/eval/4/amd9000.jpg'><div class=t>AMD R9 9950X【16C/32T】4.3G(↑5.7G)170W/With RDNA iGPU搭板12/28截止!</div>
<div class=x>含稅:NT20950 &nbsp;&diams;<a href='https://www.amd.com/zh-tw/products/processors/desktops/ryzen/9000-series/amd-ryzen-9-9950x.html' target=_BLANK>開箱討論</a> &nbsp;<font class=buy onmouseover='url(100, this)' onclick="Buy('100')">Buy</font></div></span>

We can enter <div class=t>.*?</div>\n<.*?\d+ in the search bar to perform a matching search.

Matching Process

This regular expression can be interpreted as follows:

  • Matches a starting tag: the HTML tag <div class=t>.
  • Then matches any characters .*?, zero or more times, in a non-greedy manner.
  • Next, it matches the closing tag </div> and proceeds to a newline.
  • After the newline, it matches < at the start of the next tag. .*? then skips as few characters as possible until it reaches one or more digits \d+, which is the price.

If you’re not familiar with these matching rules, it’s recommended to try typing them yourself to get a better sense of how they work.

For those comfortable with regular expressions, the pattern can even be shortened to: <d.*=t>.*?.*\n<d.*=x>.*?\d+.
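The matching process above can be checked with a short script. This is only a sketch: the HTML snippet is an abridged version of the sample lines shown earlier (the promotional text is dropped and the link target is a placeholder), and Python's re module stands in for VS Code's search.

```python
import re

# Abridged sample: the product name line, then the price line.
html = (
    "<span onclick='Show(this)'><img src='/eval/4/amd9000.jpg'>"
    "<div class=t>AMD R9 9950X【16C/32T】4.3G(↑5.7G)170W/With RDNA iGPU</div>\n"
    "<div class=x>含稅:NT20950 &nbsp;&diams;<a href='#'>開箱討論</a></div></span>\n"
)

# The pattern from the article and its shortened variant.
full_pattern = re.compile(r"<div class=t>.*?</div>\n<.*?\d+")
short_pattern = re.compile(r"<d.*=t>.*?.*\n<d.*=x>.*?\d+")

full_matches = full_pattern.findall(html)
short_matches = short_pattern.findall(html)

# Each match starts at the name tag and stops right after the price digits.
for match in full_matches:
    print(match)
```

On this sample both patterns select the same text, from <div class=t> through NT20950. Everything after the digits is left out of the match.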

Select All Occurrences

In VS Code, you can quickly select all matching items using the Command Palette or a shortcut. Here are the steps:

  1. Open the Command Palette

    Press F1 or Ctrl + Shift + P, then type Select All Occurrences of Find Match in the search bar. Alternatively, you can use the shortcut Ctrl + Shift + L.

  2. Execute Selection

    After confirming that the highlighted matches are the ones you want, run the command or press the shortcut. VS Code selects every match at once. Close the search bar to view the selected results.

    [Figures: Select All Matches; Selected Results]

  3. Copy and Paste

    Press Ctrl + C to copy the selected content. Open a new blank document using Ctrl + N, then paste the copied content with Ctrl + V.

    [Figure: Paste into a Blank File]

  4. Format the Data

    Use regular expressions to format the data for easier analysis:

    • Search for the unrelated content (such as the surrounding HTML tags), use Ctrl + Shift + L again to select every occurrence, and delete it.
    • Perform a regex-based replace to turn the separator between name and price into \t (a tab character), so that each line has the form Product Name\tPrice and imports cleanly into Excel.
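The formatting step can also be sketched as a single regex replace with capture groups. The example below is an illustration in Python, not the exact replace used in the article: the selected text is a hand-made sample of what the pasted selection might look like, and the 含稅 prefix on the second entry is an assumption.

```python
import re

# Hypothetical pasted selection: each match spans a name line and a price line.
selected = (
    "<div class=t>AMD R9 9950X【16C/32T】4.3G(↑5.7G)170W/With RDNA iGPU</div>\n"
    "<div class=x>含稅:NT20950\n"
    "<div class=t>AMD R5 8400F【6C/12T】4.2G(↑4.7G)65W</div>\n"
    "<div class=x>含稅:NT5700\n"
)

# Group 1 captures the product name, group 2 the price digits after "NT".
pattern = r"<div class=t>(.*?)</div>\n<div class=x>.*?NT(\d+)"

# Replace each two-line block with "name<TAB>price".
formatted = re.sub(pattern, r"\1\t\2", selected)
print(formatted)
```

Each two-line block collapses into one line with the name and price separated by a tab, which is the layout Excel splits into two columns.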

Paste into Excel

Once the data is organized, paste it into Excel for further processing:

  1. Copy Selected Content
    Select all rows and press Ctrl + C to copy the data to the clipboard.

  2. Paste Special
    Select the target cell (e.g., A1) and press Ctrl + Shift + V to paste the content as plain text (in recent Excel versions this shortcut pastes values only). Excel splits each line at the tab characters, placing the data in separate columns.

    • Column A will contain product names.
    • Column B will contain product prices.

Once pasted, you can proceed with further data processing or analysis.
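For readers who prefer to check the tab-delimited text before pasting, the column split can be reproduced with Python's csv module. This is a minimal sketch with two sample rows; the variable names are illustrative.

```python
import csv
import io

# Two tab-delimited lines in the Product Name<TAB>Price layout.
tsv_text = (
    "AMD R9 9950X【16C/32T】4.3G(↑5.7G)170W/With RDNA iGPU\t20950\n"
    "AMD R5 8400F【6C/12T】4.2G(↑4.7G)65W\t5700\n"
)

# Split every line at the tab, the same way the paste fills columns A and B.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

column_a = [row[0] for row in rows]       # product names
column_b = [int(row[1]) for row in rows]  # prices as integers

print(column_a)
print(column_b)  # [20950, 5700]
```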

Output Results

Below are the extracted CPU prices in Taiwan as of December 2024.

| CPU | TWD | USD |
| --- | --- | --- |
| AMD 8500G + Any MB Bundle (With same invoice as motherboard) | 4990 | 153.02 |
| AMD R5 3400G【4C/8T】3.7G(↑4.2G)65W/12nm/3-Year Warranty/Includes iGPU | 2550 | 78.20 |
| AMD R5 5500GT【6C/12T】3.6G(↑4.4G)65W/Includes iGPU/7nm | 3900 | 119.60 |
| AMD R5 5600GT【6C/12T】3.6G(↑4.6G)65W/Includes iGPU/7nm | 4450 | 136.46 |
| AMD R5 7600X【6C/12T】4.7G(↑5.3G)105W/With RDNA iGPU | 7600 | 233.06 |
| AMD R5 8400F【6C/12T】4.2G(↑4.7G)65W | 5700 | 174.79 |
| AMD R5 8500G【6C/12T】3.5G(↑5.0G)65W/RDNA 3 iGPU/4nm Tech/Min 45W | 5150 | 157.93 |
| AMD R5 8600G【6C/12T】4.3G(↑5.0G)65W/RDNA 3 iGPU/Built-in NPU for AI | 6350 | 194.73 |
| AMD R5 9600X【6C/12T】3.9G(↑5.4G)65W/With RDNA iGPU | 8650 | 265.26 |
| AMD R7 5700X3D【8C/16T】3.0G(↑4.1G)105W/96M | 7550 | 231.52 |
| AMD R7 5700X3D【8C/16T】3.0G(↑4.1G)105W/96M (Any MB Bundle) | 6990 | 214.35 |
| AMD R7 7700 MPK(Includes Fan)【8C/16T】3.8G(↑5.3G)65W | 7390 | 226.62 |
| AMD R7 7700 MPK(Includes Fan)【8C/16T】3.8G(↑5.3G)65W (Any MB Bundle) | 6990 | 214.35 |
| AMD R7 7800X3D【8C/16T】4.2G(↑5.0G)96M/120W/With RDNA iGPU | 13950 | 427.78 |
| AMD R7 8700F【8C/16T】4.1G(↑5.0G)65W/Built-in NPU for AI | 9200 | 282.12 |
| AMD R7 8700G【8C/16T】4.2G(↑5.1G)65W/RDNA 3 iGPU/Built-in NPU for AI | 9450 | 289.79 |
| AMD R7 9700X【8C/16T】3.8G(↑5.5G)65W/With RDNA iGPU | 11550 | 354.19 |
| AMD R9 7900【12C/24T】3.7G(↑5.4G)65W/With RDNA iGPU | 13400 | 410.92 |
| AMD R9 7950X3D【16C/32T】4.2G(↑5.7G)128M/120W/With RDNA iGPU | 21450 | 657.77 |
| AMD R9 7950X【16C/32T】4.5G(↑5.7G)170W/With RDNA iGPU | 18900 | 579.58 |
| AMD R9 9900X【12C/24T】4.4G(↑5.6G)120W/With RDNA iGPU | 14850 | 455.38 |
| AMD R9 9950X【16C/32T】4.3G(↑5.7G)170W/With RDNA iGPU | 20950 | 642.44 |
| AMD Ryzen TR 7980X【64C/128T】3.2G(↑5.1G)350W/320M/7nm | 182700 | 5602.58 |
| AMD Ryzen TR PRO 7975WX【32C/64T】4.0G(↑5.3G)350W/144M/7nm | 137700 | 4222.63 |
| Intel Core Ultra 5 245K【14C/14T】4.2G(↑5.2G)/24M/Integrated Xe-core/Fanless | 10100 | 309.72 |
| Intel Core Ultra 5 245KF【14C/14T】4.2G(↑5.2G)/24M/No iGPU/Fanless | 9650 | 295.92 |
| Intel Core Ultra 7 265K【20C/20T】3.9G(↑5.5G)/30M/Integrated Xe-core/Fanless | 13600 | 417.05 |
| Intel Core Ultra 7 265KF【20C/20T】3.9G(↑5.5G)/30M/No iGPU/Fanless | 13000 | 398.65 |
| Intel Core Ultra 9 285K【24C/24T】3.7G(↑5.7G)/36M/Integrated Xe-core/Fanless | 19700 | 604.11 |
| Intel i3-12100【4C/8T】(With specified motherboard invoice, Save $150) | 3100 | 95.06 |
| Intel i3-12100【4C/8T】3.3G(↑4.3G)/12M/UHD730/60w Global 3-Year Warranty | 3250 | 99.66 |
| Intel i3-14100【4C/8T】(With specified motherboard invoice, Save $300) | 3500 | 107.33 |
| Intel i3-14100【4C/8T】3.5GHz(↑4.7GHz)/20M/UHD730/60W | 3800 | 116.53 |
| Intel i3-14100F【4C/8T】3.5GHz(↑4.7GHz)/20M/No iGPU/58W | 2880 | 88.32 |
| Intel i5-12400【6C/12T】(With specified motherboard invoice, Save $150) | 4250 | 130.33 |
| Intel i5-12400【6C/12T】2.5G(↑4.4G)/18M/UHD730/65w Global 3-Year Warranty | 4400 | 134.93 |
| Intel i5-12400F【6C/12T】2.5G(↑4.4G)/18M/No iGPU/65w Global 3-Year Warranty | 3500 | 107.33 |
| Intel i5-14400【10C/16T】(With specified motherboard invoice, Save $200) | 6100 | 187.06 |
| Intel i5-14400【10C/16T】2.5GHz(↑4.7G)/24M/UHD730/65W | 6300 | 193.19 |
| Intel i5-14400F【10C/16T】2.5GHz(↑4.7G)/24M/No iGPU/65W | 5400 | 165.59 |
| Intel i5-14500【14C/20T】(With specified motherboard invoice, Save $200) | 7300 | 223.86 |
| Intel i5-14500【14C/20T】2.6GHz(↑5G)/24M/UHD770/65W | 7500 | 229.99 |
| Intel i5-14600K【14C/20T】(With specified motherboard invoice, Save $400) | 7590 | 232.75 |
| Intel i5-14600K【14C/20T】3.5G(↑5.3G)/24M/UHD770/Fanless | 7990 | 245.02 |
| Intel i5-14600KF【14C/20T】(With specified motherboard invoice, Save $200) | 7200 | 220.79 |
| Intel i5-14600KF【14C/20T】3.5G(↑5.3G)/24M/No iGPU/Fanless | 7400 | 226.92 |
| Intel i7-14700【20C/28T】(With specified motherboard invoice, Save $450) | 9999 | 306.62 |
| Intel i7-14700【20C/28T】2.1GHz(↑5.4G)/33M/UHD770/65W | 10450 | 320.45 |
| Intel i7-14700F【20C/28T】(With specified motherboard invoice, Save $500) | 9200 | 282.12 |
| Intel i7-14700F【20C/28T】2.1GHz(↑5.4G)/33M/No iGPU/65W | 9700 | 297.45 |
| Intel i7-14700K【20C/28T】(With specified motherboard invoice, Save $600) | 11900 | 364.92 |
| Intel i7-14700K【20C/28T】3.4G(↑5.6G)/33M/UHD770/Fanless | 12500 | 383.32 |
| Intel i7-14700KF【20C/28T】(With specified motherboard invoice, Save $500) | 10900 | 334.25 |
| Intel i7-14700KF【20C/28T】3.4G(↑5.6G)/33M/No iGPU/Fanless | 11400 | 349.59 |
| Intel i9-14900F【24C/32T】(With specified motherboard invoice, Save $1200) | 13700 | 420.12 |
| Intel i9-14900F【24C/32T】2.0GHz(↑5.8G)/36M/No iGPU/65W | 14900 | 456.92 |
| Intel i9-14900K【24C/32T】(With specified motherboard invoice, Save $1000) | 15700 | 481.45 |
| Intel i9-14900K【24C/32T】3.2G(↑6.0G)/36M/UHD770/Fanless | 16700 | 512.11 |
| Intel i9-14900KF【24C/32T】(With specified motherboard invoice, Save $1200) | 14100 | 432.38 |
| Intel i9-14900KF【24C/32T】3.2G(↑6.0G)/36M/No iGPU/Fanless | 15300 | 469.18 |
| Intel Processor 300【2C/4T】3.9GHz/6M/UHD710/46W | 2680 | 82.18 |
| Intel Xeon W5-2455X【12C/24T】3.20GHz(↑4.6GHz)/30M/200W | 36700 | 1125.42 |
| Intel Xeon W5-2465X【16C/32T】3.10GHz(↑4.7GHz)/33.75M/200W | 47900 | 1468.87 |
| Intel Xeon W5-3435X【16C/32T】3.10GHz(↑4.7GHz)/45M/270W | 56500 | 1732.60 |
| Intel Xeon W7-2475X【20C/40T】2.60GHz(↑4.8GHz)/37.5M/225W | 61200 | 1876.72 |
| Intel Xeon W7-2495X【24C/48T】2.50GHz(↑4.8GHz)/45M/225W | 75500 | 2315.24 |
| Intel Xeon W7-3465X【28C/56T】2.50GHz(↑4.8GHz)/75M/300W | 100500 | 3081.88 |
| Intel Xeon W9-3475X【36C/72T】2.20GHz(↑4.8GHz)/82.5M/300W | 132500 | 4063.17 |
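The article does not state the exchange rate behind the USD column. Dividing the TWD figures by the USD figures suggests a rate of roughly 32.61 TWD per USD, so the conversion can be sketched as follows. The rate and the function name are assumptions made for illustration.

```python
# Assumed rate, inferred from the table (e.g. 4990 TWD is listed as 153.02 USD).
TWD_PER_USD = 32.61


def twd_to_usd(twd: int) -> float:
    """Convert a TWD price to USD, rounded to two decimal places."""
    return round(twd / TWD_PER_USD, 2)


# Spot-check a few rows against the table.
for twd in (4990, 20950, 182700):
    print(twd, twd_to_usd(twd))
```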

Conclusion

This article highlighted the effective use of the Select All Occurrences of Find Match feature to filter relevant content and organize it into structured data for further analysis. Notably, this feature can be used without opening the search bar by selecting specific strings and pressing the shortcut Ctrl + Shift + L to effortlessly find all matching items.

With VS Code’s regular expressions, we can precisely search for key information. During processes requiring extensive filtering, these actions can be performed entirely within VS Code, eliminating the need to rely on GPT for accurate data extraction.

The application of regular expressions extends beyond VS Code. Mastering them can yield long-term technical benefits in various programming languages and tools such as JavaScript, Python, and more.

References

  1. Regular Expression Language - Quick Reference - .NET | Microsoft Learn
  2. Regexper