Processing Unstructured Data in Python

In many programming projects, data falls into two main categories: structured and unstructured. Structured data is already stored in rows and columns, or can easily be converted to that form. It fits naturally into a database; examples include CSV files, delimited TXT files, and XLS files.
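
As a rough illustration of what "rows and columns" means in practice, the sketch below parses a small CSV string with the standard csv module. The sample data and the helper name read_rows are made up for this example; a real project would normally read from a file instead.

```python
import csv
import io

# Hypothetical sample: a tiny CSV held in memory instead of a real file.
SAMPLE_CSV = "language,first_release\nPython,1991\nJava,1995\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)
for row in rows:
    # Every row exposes the same named columns, which is what makes it structured.
    print(row["language"], row["first_release"])
```

Note that csv returns every value as a string, so numeric columns still need an explicit conversion.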

In the real world, however, a large share of information is stored as unstructured data. This data has no fixed format and may include raw text, HTML files, images, or even PDF documents. For example, the content of a news website or messages collected from Twitter are unstructured data. Processing unstructured data in Python lets you read such data, analyze it, and turn it into useful information.
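
To make the HTML case concrete, here is a minimal sketch that pulls the visible text out of an HTML string using html.parser from the standard library. The class name TextExtractor, the helper html_to_text, and the sample markup are assumptions made for this illustration; real pages are often messier and may call for a dedicated parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page, kept in memory so the sketch runs on its own.
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>News</h1><p>Python is an interpreted language.</p></body></html>"
)
print(html_to_text(SAMPLE_HTML))
```

Once the text is extracted, it can be handled like the contents of any plain text file.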

Reading Data

In this example, we open a text file and read its contents line by line. Afterwards, each line can be split into smaller pieces, such as words. The sample file contains a few paragraphs about the Python language.

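# Raw string, so the backslash in the path is not treated as an escape character
filename = r'path\input.txt'

with open(filename) as fn:
    # Read the first line
    ln = fn.readline()

    # Keep count of lines
    lncnt = 1
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        # Read the next line; an empty string marks the end of the file
        ln = fn.readline()
        lncnt += 1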

Running the code above produces the following output:

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
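
The next step mentioned earlier, breaking each line into words, could look roughly like the sketch below. It uses an in-memory io.StringIO object as a stand-in for the opened file so that it runs without any file on disk; the sample text and the helper name words_per_line are assumptions for this illustration.

```python
import io

# Hypothetical stand-in for the contents of the input file.
SAMPLE_TEXT = (
    "Python features a dynamic type system.\n"
    "\n"
    "It supports multiple programming paradigms.\n"
)


def words_per_line(lines):
    """Return (line_number, words) pairs, skipping lines that contain no words."""
    result = []
    for number, line in enumerate(lines, start=1):
        words = line.split()  # split on any run of whitespace
        if words:
            result.append((number, words))
    return result


with io.StringIO(SAMPLE_TEXT) as fn:
    for number, words in words_per_line(fn):
        print("Line {}: {} words".format(number, len(words)))
```

Iterating over the file object with enumerate gives the same line numbering as the manual counter, with less bookkeeping. Punctuation stays attached to the words, because split only separates on whitespace.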

Counting Word Frequency in a File

One of the important tasks in processing unstructured data in Python is analyzing text and finding how often each word occurs. The following example uses the Counter class from the collections module:

from collections import Counter

# Read the whole file, split it on whitespace, and count each resulting word
with open(r'pathinput2.txt') as f:
    p = Counter(f.read().split())
    print(p)

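The output looks like this and shows how many times each word occurs. Because the text is split only on whitespace, tokens such as 'programming.' and '1991,' keep their punctuation and are counted separately from the bare word: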

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
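
In the output above, 'programming' and 'programming.' are treated as different words, and counting is case-sensitive. One possible refinement, sketched below, lowercases the text and extracts words with a regular expression before counting. The sample text, the helper name count_words, and the exact pattern are assumptions for this illustration, not part of the original example.

```python
import re
from collections import Counter

# Hypothetical sample text, kept in memory so the sketch runs on its own.
SAMPLE_TEXT = "Python is readable. Python has a large library, and python is popular."


def count_words(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # Letters, digits, and underscores form a word; an inner hyphen or
    # apostrophe is allowed, so "high-level" stays a single token.
    tokens = re.findall(r"\w+(?:[-']\w+)*", text.lower())
    return Counter(tokens)


counts = count_words(SAMPLE_TEXT)
# most_common(n) returns the n most frequent (word, count) pairs.
print(counts.most_common(2))
```

Whether to lowercase or strip punctuation depends on the analysis; for some tasks, such as detecting proper nouns, the original casing carries information worth keeping.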

Published: 19 Mordad 1404
